OpenWhispr Technical Reference for AI Assistants
This document provides comprehensive technical details about the OpenWhispr project architecture for AI assistants working on the codebase.
Project Overview
OpenWhispr is an Electron-based desktop dictation application that uses OpenAI Whisper for speech-to-text transcription. It supports both local (privacy-focused) and cloud (OpenAI API) processing modes.
Architecture Overview
Core Technologies
- Frontend: React 19, TypeScript, Tailwind CSS v4, Vite
- Desktop Framework: Electron 36 with context isolation
- Database: better-sqlite3 for local transcription history
- UI Components: shadcn/ui with Radix primitives
- Speech Processing: OpenAI Whisper (local Python bridge + API)
- Audio Processing: FFmpeg (bundled via ffmpeg-static)
Key Architectural Decisions
- Dual Window Architecture:
  - Main Window: Minimal overlay for dictation (draggable, always on top)
  - Control Panel: Full settings interface (normal window)
  - Both use the same React codebase with URL-based routing
- Process Separation:
  - Main Process: Electron main, IPC handlers, database operations
  - Renderer Process: React app with context isolation
  - Preload Script: Secure bridge between processes
- Audio Pipeline:
  - MediaRecorder API → Blob → ArrayBuffer → Base64 → IPC → File → FFmpeg → Whisper
  - Automatic cleanup of temporary files after processing
File Structure and Responsibilities
Main Process Files
- main.js: Application entry point, initializes all managers
- preload.js: Exposes safe IPC methods to renderer via window.api
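A minimal sketch of that preload bridge (the method and channel names here are assumptions; the real surface is whatever preload.js exposes on window.api):

```typescript
import { contextBridge, ipcRenderer } from "electron";

// With context isolation on, the renderer sees only this whitelisted surface;
// it never gets direct access to Node or Electron APIs.
contextBridge.exposeInMainWorld("api", {
  transcribeAudio: (base64Audio: string) => ipcRenderer.invoke("transcribe-audio", base64Audio),
  getTranscriptions: () => ipcRenderer.invoke("db-get-transcriptions"),
});
```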
Helper Modules (src/helpers/)
- audioManager.js: Handles audio device management
- clipboard.js: Cross-platform clipboard operations with AppleScript fallback
- database.js: SQLite operations for transcription history
- debugLogger.js: Debug logging system with file output
- devServerManager.js: Vite dev server integration
- dragManager.js: Window dragging functionality
- environment.js: Environment variable and OpenAI API management
- hotkeyManager.js: Global hotkey registration and management
- ipcHandlers.js: Centralized IPC handler registration
- menuManager.js: Application menu management
- pythonInstaller.js: Automatic Python installation for all platforms
- tray.js: System tray icon and menu
- whisper.js: Local Whisper integration and Python bridge
- windowConfig.js: Centralized window configuration
- windowManager.js: Window creation and lifecycle management
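To make the dual-window split concrete, here is a hypothetical sketch of how windowManager.js and windowConfig.js could create the two windows described above. Sizes, routes, and the dev-server URL are illustrative, not the exact values in the codebase:

```typescript
import { BrowserWindow } from "electron";
import path from "node:path";

const preload = path.join(__dirname, "preload.js");

export function createDictationWindow(): BrowserWindow {
  const win = new BrowserWindow({
    width: 320,
    height: 120,
    frame: false,      // minimal overlay chrome
    alwaysOnTop: true, // floats above other apps while dictating
    webPreferences: { contextIsolation: true, preload },
  });
  win.loadURL("http://localhost:5173/#/"); // dev-server URL; production loads the built index.html
  return win;
}

export function createControlPanel(): BrowserWindow {
  const win = new BrowserWindow({
    width: 1000,
    height: 700, // normal window for settings and history
    webPreferences: { contextIsolation: true, preload },
  });
  win.loadURL("http://localhost:5173/#/control-panel"); // same bundle, different route
  return win;
}
```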
React Components (src/components/)
- App.jsx: Main dictation interface with recording states
- ControlPanel.tsx: Settings, history, model management UI
- OnboardingFlow.tsx: 8-step first-time setup wizard
- SettingsPage.tsx: Comprehensive settings interface
- WhisperModelPicker.tsx: Model selection and download UI
- ui/: Reusable UI components (buttons, cards, inputs, etc.)
React Hooks (src/hooks/)
- useAudioRecording.js: MediaRecorder API wrapper with error handling
- useClipboard.ts: Clipboard operations hook
- useDialogs.ts: Electron dialog integration
- useHotkey.js: Hotkey state management
- useLocalStorage.ts: Type-safe localStorage wrapper
- usePermissions.ts: System permission checks
- usePython.ts: Python installation state
- useSettings.ts: Application settings management
- useWhisper.ts: Whisper model management
Services
- ReasoningService.ts: AI processing for agent-addressed commands
- Detects when user addresses their named agent
- Routes to appropriate AI provider (OpenAI/Anthropic/Gemini)
- Removes agent name from final output
- Supports GPT-5, Claude Opus 4.1, and Gemini 2.5 models
Python Bridge
- whisper_bridge.py: Standalone Python script for local Whisper
- Accepts audio file path and model selection
- Returns JSON with transcription result
- Handles FFmpeg path resolution for bundled executable
- 30-second timeout protection
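A sketch of how whisper.js might invoke the bridge, matching the behavior listed above (the CLI flag names and the python3 lookup are assumptions; see whisper.js for the real call):

```typescript
import { execFile } from "node:child_process";

function runLocalWhisper(audioPath: string, model: string): Promise<{ text: string }> {
  return new Promise((resolve, reject) => {
    execFile(
      "python3",
      ["whisper_bridge.py", audioPath, "--model", model], // flag names are illustrative
      { timeout: 30_000 }, // mirrors the 30-second timeout protection
      (err, stdout) => {
        if (err) return reject(err); // covers non-zero exits and timeout kills
        try {
          resolve(JSON.parse(stdout)); // the bridge prints a JSON result on stdout
        } catch (parseErr) {
          reject(parseErr);
        }
      }
    );
  });
}
```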
Key Implementation Details
1. FFmpeg Integration
FFmpeg is bundled with the app and doesn't require system installation:
// FFmpeg is unpacked from ASAR to app.asar.unpacked/node_modules/ffmpeg-static/
// Python bridge receives FFmpeg path via environment variables:
// FFMPEG_PATH, FFMPEG_EXECUTABLE, FFMPEG_BINARY
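A sketch of that hand-off, assuming ffmpeg-static's default export and a child-process env object (the unpacked-path rewrite is the standard ASAR workaround, not a quote from the codebase):

```typescript
import ffmpegStatic from "ffmpeg-static";

// Binaries cannot execute from inside app.asar, so point at the unpacked copy
// that electron-builder leaves in app.asar.unpacked.
const ffmpegPath = (ffmpegStatic as string).replace("app.asar", "app.asar.unpacked");

// The Python bridge reads whichever of these it finds first.
const env = {
  ...process.env,
  FFMPEG_PATH: ffmpegPath,
  FFMPEG_EXECUTABLE: ffmpegPath,
  FFMPEG_BINARY: ffmpegPath,
};
```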
2. Audio Recording Flow
- User presses hotkey → MediaRecorder starts
- Audio chunks collected in array
- User presses hotkey again → Recording stops
- Blob created from chunks → Converted to ArrayBuffer
- Sent via IPC as a Base64 string (to stay within IPC payload size limits)
- Main process writes to temporary file
- Whisper processes file → Result sent back
- Temporary file deleted
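A renderer-side sketch of the Blob → ArrayBuffer → Base64 → IPC steps of this flow (window.api.transcribeAudio is an assumed preload method name, matching the bridge sketch earlier):

```typescript
declare global {
  interface Window {
    api: { transcribeAudio(base64: string): Promise<string> };
  }
}

async function finishRecording(chunks: Blob[]): Promise<string> {
  const blob = new Blob(chunks, { type: "audio/webm" });
  const buffer = await blob.arrayBuffer();
  // Base64 keeps the payload serializable across the IPC boundary;
  // chunk the conversion for very large recordings to avoid call-stack limits.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(buffer)));
  return window.api.transcribeAudio(base64); // main process writes the temp file and runs Whisper
}

export {};
```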
3. Local Whisper Models
Models stored in ~/.cache/whisper/:
- tiny: 39MB (fastest, lowest quality)
- base: 74MB (recommended balance)
- small: 244MB (better quality)
- medium: 769MB (high quality)
- large: 1.5GB (best quality)
- turbo: 809MB (fast with good quality)
4. Database Schema
CREATE TABLE transcriptions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
original_text TEXT NOT NULL,
processed_text TEXT,
is_processed BOOLEAN DEFAULT 0,
processing_method TEXT DEFAULT 'none',
agent_name TEXT,
error TEXT
);
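A sketch of the matching better-sqlite3 access (the database filename and helper name are assumptions; the real code is in src/helpers/database.js):

```typescript
import Database from "better-sqlite3";

const db = new Database("transcriptions.db");

const insert = db.prepare(
  `INSERT INTO transcriptions
     (original_text, processed_text, is_processed, processing_method, agent_name)
   VALUES (?, ?, ?, ?, ?)`
);

export function saveTranscription(original: string, processed?: string, agent?: string) {
  // is_processed and processing_method record whether a reasoning pass ran
  insert.run(
    original,
    processed ?? null,
    processed ? 1 : 0,
    processed ? "reasoning" : "none",
    agent ?? null
  );
}
```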
5. Settings Storage
Settings stored in localStorage with these keys:
- whisperModel: Selected Whisper model
- useLocalWhisper: Boolean for local vs cloud processing
- openaiApiKey: Encrypted API key
- anthropicApiKey: Encrypted API key
- geminiApiKey: Encrypted API key
- language: Selected language code
- agentName: User's custom agent name
- reasoningModel: Selected AI model for processing (defaults to gpt-4o-mini)
- reasoningProvider: AI provider (openai/anthropic/gemini/local)
- hotkey: Custom hotkey configuration
- hasCompletedOnboarding: Onboarding completion flag
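A sketch of a type-safe read over these keys, in the spirit of useLocalStorage.ts (the generic helper itself is illustrative):

```typescript
function getSetting<T>(key: string, fallback: T): T {
  const raw = localStorage.getItem(key);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T; // values stored as JSON strings
  } catch {
    return raw as unknown as T;  // tolerate plain-string legacy values
  }
}

const model = getSetting<string>("whisperModel", "base");
const useLocal = getSetting<boolean>("useLocalWhisper", true);
```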
6. Language Support
58 languages supported (see src/utils/languages.ts):
- Each language has a two-letter code and label
- "auto" for automatic detection
- Passed to Whisper via --language parameter
7. Agent Naming System
- User names their agent during onboarding (step 6/8)
- Name stored in localStorage and database
- ReasoningService detects "Hey [AgentName]" patterns
- AI processes command and removes agent reference from output
- Supports multiple AI providers:
  - OpenAI (now using the Responses API as of September 2025):
    - GPT-5 Series (Nano/Mini/Full) - Latest models with fastest performance
    - GPT-4.1 Series (Nano/Mini/Full) with 1M context window
    - o-series reasoning models (o3/o3-pro/o4-mini) for deep reasoning tasks
    - GPT-4o multimodal series (4o/4o-mini) - default model
    - Legacy support for GPT-4 Turbo, GPT-4 classic, GPT-3.5 Turbo
  - Anthropic (via IPC bridge to avoid CORS):
    - Claude Opus 4.1 (claude-opus-4-1-20250805) - Frontier intelligence
    - Claude Sonnet 4 (claude-sonnet-4-20250514) - Latest balanced model
    - Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) - Balanced performance
    - Claude 3.5 Haiku (claude-3-5-haiku-20241022) - Fast and efficient
  - Google Gemini (direct API integration):
    - Gemini 2.5 Pro (gemini-2.5-pro) - Most intelligent, with thinking capability
    - Gemini 2.5 Flash (gemini-2.5-flash) - High performance with thinking
    - Gemini 2.5 Flash Lite (gemini-2.5-flash-lite) - Fast and low-cost
    - Gemini 2.0 Flash (gemini-2.0-flash) - 1M token context
  - Local: Community models via LocalReasoningService (Qwen, LLaMA, Mistral)
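A sketch of the detection-and-removal step (the real matching lives in ReasoningService.ts; this regex is an approximation and assumes agentName contains no regex metacharacters):

```typescript
function extractAgentCommand(text: string, agentName: string): string | null {
  // Matches leading "Hey <name>," / "OK <name>:" style prefixes, case-insensitively.
  const pattern = new RegExp(`^\\s*(?:hey|ok|okay)?[,\\s]*${agentName}[,:\\s]+`, "i");
  if (!pattern.test(text)) return null;    // not addressed to the agent
  return text.replace(pattern, "").trim(); // agent reference removed from the output
}

extractAgentCommand("Hey Aria, summarize this paragraph", "Aria");
// → "summarize this paragraph"
```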
8. API Integrations and Updates
OpenAI Responses API (September 2025):
- Migrated from Chat Completions to new Responses API
- Endpoint: https://api.openai.com/v1/responses
- Simplified request format with an input array instead of messages
- New response format with an output array containing typed items
- Automatic handling of GPT-5 and o-series model requirements
- No temperature parameter for newer models (GPT-5, o-series)
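A minimal sketch of such a call, matching the shape described above (the model name is an example; consult the OpenAI documentation for the full schema):

```typescript
async function askOpenAI(apiKey: string, prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/responses", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gpt-5-mini",
      input: [{ role: "user", content: prompt }],
      // Note: no temperature field; GPT-5 and o-series models reject it.
    }),
  });
  if (!res.ok) throw new Error(`OpenAI API error: ${res.status}`);
  const data = await res.json();
  // output is an array of typed items; collect the text from message items.
  return data.output
    .filter((item: any) => item.type === "message")
    .flatMap((m: any) => m.content)
    .filter((c: any) => c.type === "output_text")
    .map((c: any) => c.text)
    .join("");
}
```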
Anthropic Integration:
- Routes through IPC handler to avoid CORS issues in renderer process
- Uses main process for API calls with proper error handling
- Model names use hyphens (e.g., claude-3-5-sonnet, not claude-3.5-sonnet)
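A sketch of that IPC route (the channel name anthropic-chat is an assumption; running fetch in the main process sidesteps the renderer's CORS restrictions):

```typescript
import { ipcMain } from "electron";

ipcMain.handle("anthropic-chat", async (_event, apiKey: string, model: string, prompt: string) => {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "claude-3-5-sonnet-20241022" (hyphens, per the note above)
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`); // surfaced to the renderer
  const data = await res.json();
  return data.content[0].text; // first content block holds the reply text
});
```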
Gemini Integration:
- Direct API calls from renderer process
- Increased token limits for Gemini 2.5 Pro (2000 minimum)
- Proper handling of thinking process in responses
- Error handling for MAX_TOKENS finish reason
API Key Persistence:
- All API keys now persist properly to the .env file
- Keys stored in environment variables and reloaded on app start
- Centralized saveAllKeysToEnvFile() method ensures consistency
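A sketch of what saveAllKeysToEnvFile() plausibly does (the file location and example key names are assumptions; the real method lives in src/helpers/environment.js):

```typescript
import { writeFileSync } from "node:fs";
import { join } from "node:path";
import { app } from "electron";

function saveAllKeysToEnvFile(keys: Record<string, string>): void {
  const lines = Object.entries(keys)
    .filter(([, value]) => value) // skip keys the user has not set
    .map(([name, value]) => `${name}=${value}`);
  writeFileSync(join(app.getPath("userData"), ".env"), lines.join("\n") + "\n");
  Object.assign(process.env, keys); // make the keys live without an app restart
}

saveAllKeysToEnvFile({
  OPENAI_API_KEY: "sk-...",
  ANTHROPIC_API_KEY: "sk-ant-...",
});
```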
9. Debug Mode
Enable with --debug flag or OPENWHISPR_DEBUG=true:
- Logs saved to platform-specific app data directory
- Comprehensive logging of audio pipeline
- FFmpeg path resolution details
- Audio level analysis
- Complete reasoning pipeline debugging with stage-by-stage logging
Development Guidelines
Adding New Features
- New IPC Channel: Add to both ipcHandlers.js and preload.js (see the sketch after this list)
- New Setting: Update useSettings.ts and SettingsPage.tsx
- New UI Component: Follow shadcn/ui patterns in src/components/ui
- New Manager: Create in src/helpers/, initialize in main.js
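As an example of the IPC recipe from the first item, here is the minimal end-to-end wiring. Channel and method names are illustrative; each excerpt belongs in the file named in its comment, with the imports shown together only for brevity:

```typescript
// ipcHandlers.js (main process): register the handler.
import { app, ipcMain } from "electron";
ipcMain.handle("get-app-version", () => app.getVersion());

// preload.js: expose the channel on window.api.
import { contextBridge, ipcRenderer } from "electron";
contextBridge.exposeInMainWorld("api", {
  getAppVersion: (): Promise<string> => ipcRenderer.invoke("get-app-version"),
});

// Renderer (e.g. inside a React component): call it through the bridge.
const version = await window.api.getAppVersion();
```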
Testing Checklist
- [ ] Test both local and cloud processing modes
- [ ] Verify hotkey works globally
- [ ] Check clipboard pasting on all platforms
- [ ] Test with different audio input devices
- [ ] Verify Python auto-installation
- [ ] Test all Whisper models
- [ ] Check agent naming functionality
Common Issues and Solutions
- No Audio Detected:
  - Check FFmpeg path resolution
  - Verify microphone permissions
  - Check audio levels in debug logs
- Transcription Fails:
  - Ensure Python and Whisper are installed
  - Check temporary file creation
  - Verify FFmpeg is executable
- Clipboard Not Working:
  - macOS: check accessibility permissions
  - Use the AppleScript fallback on macOS
- Build Issues:
  - Use npm run pack for unsigned builds (CSC_IDENTITY_AUTO_DISCOVERY=false)
  - Signing requires an Apple Developer account
  - ASAR unpacking needed for FFmpeg/Python bridge
  - afterSign.js automatically skips signing when CSC_IDENTITY_AUTO_DISCOVERY=false
Platform-Specific Notes
macOS:
- Requires accessibility permissions for clipboard
- Uses AppleScript for reliable pasting
- Notarization needed for distribution
- Shows in dock with indicator dot when running (LSUIElement: false)
Windows:
- Python installer handles PATH automatically
- No special permissions needed
- NSIS installer for distribution
Linux:
- Multiple package manager support
- Standard XDG directories
- AppImage for distribution
Code Style and Conventions
- Use TypeScript for new React components
- Follow existing patterns in helpers/
- Descriptive error messages for users
- Comprehensive debug logging
- Clean up resources (files, listeners)
- Handle edge cases gracefully
Performance Considerations
- Whisper model size vs speed tradeoff
- Audio blob size limits for IPC (10MB)
- Temporary file cleanup
- Memory usage with large models
- Process timeout protection (30s)
Security Considerations
- API keys stored in system keychain when possible
- Context isolation enabled
- No remote code execution
- Sanitized file paths
- Limited IPC surface area
Future Enhancements to Consider
- Streaming transcription support
- Custom wake word detection
- Multi-language UI
- Cloud model selection
- Batch transcription
- Export formats beyond clipboard