# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

This is a comprehensive LLM Cooperation System implementing intelligent multi-model routing and coordination using the vLLM inference engine. The system provides enterprise-grade AI services with automatic model selection, distributed inference, and cooperative processing capabilities.
## Architecture

### Core System Layers
1. **vLLM Inference Engine Layer** (`vllm_engine.py`)
   - Unified inference services with PagedAttention memory management
   - Distributed inference across GPU clusters with tensor/pipeline parallelism
   - Async HTTP API with comprehensive health monitoring

2. **Model Resource Management Layer** (`model_manager.py`)
   - Dynamic model lifecycle management and load balancing
   - Real-time performance metrics and resource monitoring
   - Intelligent model selection based on current system state

3. **Cooperation Scheduling Layer** (`cooperation_scheduler.py`)
   - Multi-model coordination with sequential/parallel/voting/pipeline modes
   - Task decomposition and result integration strategies
   - Advanced workflow orchestration with dependency management

4. **Intelligent Routing Layer** (`intelligent_router.py`)
   - Request analysis for task type and complexity detection
   - Performance-based routing decisions with caching
   - User preference integration and optimization strategies

5. **Application Service Layer** (`application_service.py`)
   - Enterprise services: document analysis, data insights, decision support
   - Structured API for industry applications
   - Comprehensive request/response tracking
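For orientation, here is a minimal sketch of how a request might flow through these layers. All class and method names below are hypothetical stand-ins for the real modules (`intelligent_router.py`, `model_manager.py`, `vllm_engine.py`), not the repository's actual API:

```python
import asyncio

# Hypothetical stand-ins for the real layers; names and signatures are
# illustrative only.

class IntelligentRouter:
    def analyze(self, query: str) -> str:
        # Detect task type from the request text (the real analysis is richer).
        return "reasoning" if "why" in query.lower() else "lightweight"

class ModelManager:
    def select_model(self, task_type: str) -> str:
        # Map task type to a configured model; the real manager also
        # consults live performance metrics and current load.
        return {"reasoning": "qwen3_32b_reasoning"}.get(task_type, "qwen2_5_7b")

class VLLMEngine:
    async def generate(self, model: str, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for an async inference call
        return f"[{model}] response to: {prompt}"

async def handle_query(query: str) -> str:
    router, manager, engine = IntelligentRouter(), ModelManager(), VLLMEngine()
    task_type = router.analyze(query)           # Intelligent Routing Layer
    model = manager.select_model(task_type)     # Model Resource Management Layer
    return await engine.generate(model, query)  # vLLM Inference Engine Layer

print(asyncio.run(handle_query("Why does PagedAttention reduce fragmentation?")))
```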
### Model Configuration
- Qwen3-32B Router: Fast routing decisions (2 GPU tensor parallel)
- Qwen3-32B Reasoning: Complex logic with thinking chains (2 GPU tensor parallel)
- Qwen2.5-VL-72B: Multimodal processing (4 GPU tensor parallel)
- Qwen2.5-7B: Lightweight tasks (1 GPU tensor parallel)
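To make the tensor-parallel sizing concrete, here is a hypothetical sketch of the fleet above as data; the identifiers are illustrative, and the authoritative definitions live in `config.py`:

```python
# Hypothetical sketch of the GPU fleet above; see config.py for real settings.
MODEL_FLEET = {
    "qwen3_32b_router":    {"tensor_parallel_size": 2, "role": "routing"},
    "qwen3_32b_reasoning": {"tensor_parallel_size": 2, "role": "reasoning"},
    "qwen2_5_vl_72b":      {"tensor_parallel_size": 4, "role": "multimodal"},
    "qwen2_5_7b":          {"tensor_parallel_size": 1, "role": "lightweight"},
}

# GPUs required if every model is resident at once: 2 + 2 + 4 + 1 = 9.
total_gpus = sum(m["tensor_parallel_size"] for m in MODEL_FLEET.values())
assert total_gpus == 9
```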
## Development Commands

### System Startup

```bash
# Development mode
python start_system.py --dev

# Production mode
python start_system.py

# Docker deployment
docker-compose up -d
```
### Testing

```bash
# Run all tests
python test_system.py

# Run examples
python example_client.py --example all

# Specific example categories
python example_client.py --example cooperation
python example_client.py --example services
```
### Monitoring

```bash
# System status
curl http://localhost:8080/status

# Model metrics
curl http://localhost:8080/models

# Performance stats
curl http://localhost:8080/routing/stats
```
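The same endpoints can be polled programmatically. A minimal sketch using only the standard library, assuming the endpoints return JSON (the exact response fields are not specified here):

```python
import json
from urllib.request import urlopen

BASE = "http://localhost:8080"

def fetch(path: str) -> dict:
    # GET a monitoring endpoint and decode its JSON body.
    with urlopen(f"{BASE}{path}", timeout=5) as resp:
        return json.load(resp)

for path in ("/status", "/models", "/routing/stats"):
    print(path, "->", fetch(path))
```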
## Configuration

### Environment Setup

Configure in `API_Key_DeepSeek.env`:

- `BASE_URL`: AI service endpoint (default: `https://api2.aigcbest.top/v1`)
- `API_KEY`: Authentication key
- `MODEL`: Default model identifier
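If you need these values in your own scripts, a minimal sketch assuming the `python-dotenv` package (the system itself may load the file differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Load key/value pairs from the env file into the process environment.
load_dotenv("API_Key_DeepSeek.env")

base_url = os.getenv("BASE_URL", "https://api2.aigcbest.top/v1")
api_key = os.environ["API_KEY"]  # fail loudly if the key is missing
model = os.getenv("MODEL")
```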
### OpenAI API Configuration

The system supports flexible OpenAI-compatible API configuration:

```bash
# Quick setup with preset (AIGC Best)
python api_config.py preset --name aigcbest --api-key YOUR_API_KEY

# Custom API configuration
python api_config.py global --base-url https://api2.aigcbest.top/v1 --api-key YOUR_KEY

# Add individual models
python api_config.py add-model --name custom_model --path Qwen/Qwen3-32B --tasks reasoning math

# Test connectivity
python api_config.py test

# List all models
python api_config.py list
```
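Because the endpoint is OpenAI-compatible, any standard OpenAI client can talk to it directly. A sketch using the official `openai` Python package; the base URL and model path mirror the examples above, and the prompt is illustrative:

```python
from openai import OpenAI  # pip install openai

# Point the standard client at the configured OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api2.aigcbest.top/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```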
### Supported Model Providers
- AIGC Best (aigcbest.top): Qwen3-32B, Qwen3-8B, Claude, GPT-4o, DeepSeek-v3
- OpenAI: GPT-4o, GPT-4-turbo, GPT-3.5-turbo
- Anthropic: Claude-3.5-Sonnet, Claude-3-Haiku
- Custom endpoints: Any OpenAI-compatible API
### System Configuration (`config.py`)
- OpenAI API endpoints and model mappings
- Routing strategies and performance thresholds
- Task type mappings and cooperation modes
- Model-specific parameters (temperature, max_tokens)
- Monitoring and logging configuration
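A hypothetical excerpt showing the shape these settings might take; every name below is illustrative, so consult `config.py` for the real structure:

```python
# Illustrative only; config.py holds the authoritative definitions.
ROUTING = {
    "strategy": "performance_based",
    "latency_threshold_ms": 2000,
    "cache_routing_decisions": True,
}

MODEL_PARAMS = {
    "qwen3_32b_reasoning": {"temperature": 0.6, "max_tokens": 4096},
    "qwen2_5_7b":          {"temperature": 0.7, "max_tokens": 1024},
}

TASK_TYPE_MAP = {
    "reasoning": "qwen3_32b_reasoning",
    "multimodal": "qwen2_5_vl_72b",
}
```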
## API Usage

### Basic Query

```http
POST /query
{
  "query": "Your question here",
  "preferences": {"strategy": "auto", "quality_priority": 0.8}
}
```
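A minimal Python client for this endpoint, assuming the `requests` package and a JSON response; `/service` and `/cooperation/task` are called the same way with their respective bodies:

```python
import requests  # pip install requests

payload = {
    "query": "Your question here",
    "preferences": {"strategy": "auto", "quality_priority": 0.8},
}

resp = requests.post("http://localhost:8080/query", json=payload, timeout=60)
resp.raise_for_status()  # surface HTTP errors instead of parsing them
print(resp.json())
```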
### Application Services

```http
POST /service
{
  "service_type": "document_analysis",
  "content": "Document text...",
  "parameters": {"analysis_type": "comprehensive"}
}
```
### Cooperation Tasks

```http
POST /cooperation/task
{
  "query": "Complex analysis request",
  "mode": "sequential",
  "models": ["qwen3_32b_reasoning", "qwen2_5_7b"]
}
```
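The `mode` field selects how the listed models are orchestrated. A toy sketch of the difference between `sequential` and `parallel` coordination; `call_model` is a hypothetical stand-in for a real inference call:

```python
import asyncio

async def call_model(name: str, prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for a real inference request
    return f"{name} -> {prompt[:24]}..."

async def run_sequential(models: list[str], query: str) -> str:
    # Sequential: each model refines the previous model's output.
    text = query
    for name in models:
        text = await call_model(name, text)
    return text

async def run_parallel(models: list[str], query: str) -> list[str]:
    # Parallel: models answer independently; results are integrated later.
    return list(await asyncio.gather(*(call_model(n, query) for n in models)))

models = ["qwen3_32b_reasoning", "qwen2_5_7b"]
print(asyncio.run(run_sequential(models, "Complex analysis request")))
print(asyncio.run(run_parallel(models, "Complex analysis request")))
```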
## Development Notes

- The system uses vLLM v0.6.0+ as its inference-engine foundation
- All components implement comprehensive async patterns (see the client sketch below)
- Extensive logging and metrics collection throughout
- GPU memory and compute resources are managed automatically
- Model configurations are hot-swappable without a system restart
- Built-in performance optimization and load balancing
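Since the API is served asynchronously, client code can issue concurrent requests as well. A sketch using `aiohttp` against the `/query` endpoint documented above (response shape assumed to be JSON):

```python
import asyncio

import aiohttp  # pip install aiohttp

async def ask(session: aiohttp.ClientSession, query: str) -> dict:
    async with session.post("http://localhost:8080/query",
                            json={"query": query}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Two independent queries issued concurrently.
        answers = await asyncio.gather(
            ask(session, "Summarize the architecture."),
            ask(session, "List the cooperation modes."),
        )
        for answer in answers:
            print(answer)

asyncio.run(main())
```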