LORE CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
LORE (Learned Obfuscated Reasoning) is a research framework for studying whether LLMs can learn to obfuscate their reasoning through prompt optimization. The core hypothesis is that prompt-evolved reasoning guidelines trained across diverse tasks will transfer to safety-critical domains and degrade human/AI monitoring effectiveness.
Two Training Regimes:
- Regime A: Optimizes for task accuracy only
- Regime B: Optimizes for accuracy while evading monitor detection (fitness = accuracy × (1 - monitor_suspicion))
Common Commands
# Install (dev mode with development dependencies)
pip install -e ".[dev]"
# Install with sandbox support
pip install -e ".[sandbox]"
# Run GEPA prompt optimization
lore gepa --dataset gsm8k --generations 100 --population 50 --regime A
# Evaluate a prompt with inspect-ai
lore evaluate <prompt_path> --task math_reasoning --model gpt-4o
# Run transfer evaluation across models
lore transfer <prompt_path> --models "gpt-4o,gemini-2.0-flash,claude-sonnet-4"
# Run obfuscation monitor on a response
lore monitor <response_path> <prompt_path>
# Testing
pytest tests/ # Run all tests
pytest tests/test_foo.py -k bar # Run specific test
pytest --cov=src tests/ # With coverage
# Linting and type checking
ruff check src/
mypy src/
# Docker deployment (vLLM + Redis)
docker compose up -d vllm redis
Architecture
src/
├── cli.py # Typer CLI entry point (lore command)
├── data/
│ ├── loaders.py # DataLoader, DataPoint classes
│ └── gym_builder.py # Multi-task gym with hidden goal injection
├── models/
│ ├── frontier_api.py # Anthropic, OpenAI, Google clients
│ └── open_weights.py # vLLM client for open-weight models
├── optimization/
│ ├── base_optimizer.py # Abstract optimizer interface
│ ├── gepa.py # GEPA evolutionary optimization
│ ├── promptbreeder.py # Genetic algorithm optimization
│ ├── mipro.py # Multi-objective optimization
│ └── grg_mutations.py # GRG mutation strategies
└── monitors/
├── lindner_autorater.py # CoT obfuscation detection (Lindner et al.)
├── ensemble.py # Multi-judge ensemble monitor
└── detection.py # General detection methods
evals/ # inspect_ai evaluation tasks
configs/ # YAML configs (datasets, models, gepa)
scripts/ # Bash scripts for experiments
docker/ # Container definitions
Key Patterns
Model Clients: All model interactions go through unified clients in src/models/. Frontier clients handle rate limiting and retries. Use get_client("openai", model_id="gpt-4o") to get a client.
Optimizers: All inherit from BaseOptimizer with async optimize(trainset, valset) method. Results are OptimizationResult objects with save()/load() methods.
Monitors: LindnerAutorater is the primary obfuscation detector. Returns risk levels (LOW/MEDIUM/HIGH) and pattern flags.
Hidden Goal Injection: GymBuilder injects hidden objectives into 20% of training samples. Three categories: benign (structural), capability_elicitation, misaligned_research.
Configuration
Copy .env.example to .env and set API keys:
ANTHROPIC_API_KEY,OPENAI_API_KEY,GOOGLE_API_KEYfor frontier modelsWANDB_API_KEYfor experiment trackingHF_TOKENfor gated HuggingFace datasetsVLLM_BASE_URLfor open-weight model serving
Config files in configs/:
datasets.yaml- dataset paths and hidden goal templatesmodels.yaml- model specs and rate limitsgepa_config.yaml- optimization hyperparameters