OpenWhispr Technical Reference for AI Assistants

This document provides comprehensive technical details about the OpenWhispr project architecture for AI assistants working on the codebase.

Project Overview

OpenWhispr is an Electron-based desktop dictation application that uses OpenAI Whisper for speech-to-text transcription. It supports both local (privacy-focused) and cloud (OpenAI API) processing modes.

Architecture Overview

Core Technologies

Frontend: React 19, TypeScript, Tailwind CSS v4, Vite
Desktop Framework: Electron 36 with context isolation
Database: better-sqlite3 for local transcription history
UI Components: shadcn/ui with Radix primitives
Speech Processing: OpenAI Whisper (local Python bridge + API)
Audio Processing: FFmpeg (bundled via ffmpeg-static)

Key Architectural Decisions

Dual Window Architecture:
- Main Window: Minimal overlay for dictation (draggable, always on top)
- Control Panel: Full settings interface (normal window)
- Both use same React codebase with URL-based routing
Process Separation:
- Main Process: Electron main, IPC handlers, database operations
- Renderer Process: React app with context isolation
- Preload Script: Secure bridge between processes
Audio Pipeline:
- MediaRecorder API → Blob → ArrayBuffer → Base64 → IPC → File → FFmpeg → Whisper
- Automatic cleanup of temporary files after processing

File Structure and Responsibilities

Main Process Files

main.js: Application entry point, initializes all managers
preload.js: Exposes safe IPC methods to renderer via window.api

Helper Modules (src/helpers/)

audioManager.js: Handles audio device management
clipboard.js: Cross-platform clipboard operations with AppleScript fallback
database.js: SQLite operations for transcription history
debugLogger.js: Debug logging system with file output
devServerManager.js: Vite dev server integration
dragManager.js: Window dragging functionality
environment.js: Environment variable and OpenAI API management
hotkeyManager.js: Global hotkey registration and management
ipcHandlers.js: Centralized IPC handler registration
menuManager.js: Application menu management
pythonInstaller.js: Automatic Python installation for all platforms
tray.js: System tray icon and menu
whisper.js: Local Whisper integration and Python bridge
windowConfig.js: Centralized window configuration
windowManager.js: Window creation and lifecycle management

React Components (src/components/)

App.jsx: Main dictation interface with recording states
ControlPanel.tsx: Settings, history, model management UI
OnboardingFlow.tsx: 8-step first-time setup wizard
SettingsPage.tsx: Comprehensive settings interface
WhisperModelPicker.tsx: Model selection and download UI
ui/: Reusable UI components (buttons, cards, inputs, etc.)

React Hooks (src/hooks/)

useAudioRecording.js: MediaRecorder API wrapper with error handling
useClipboard.ts: Clipboard operations hook
useDialogs.ts: Electron dialog integration
useHotkey.js: Hotkey state management
useLocalStorage.ts: Type-safe localStorage wrapper
usePermissions.ts: System permission checks
usePython.ts: Python installation state
useSettings.ts: Application settings management
useWhisper.ts: Whisper model management

Services

ReasoningService.ts: AI processing for agent-addressed commands
- Detects when user addresses their named agent
- Routes to appropriate AI provider (OpenAI/Anthropic/Gemini)
- Removes agent name from final output
- Supports GPT-5, Claude Opus 4.1, and Gemini 2.5 models

Python Bridge

whisper_bridge.py: Standalone Python script for local Whisper
- Accepts audio file path and model selection
- Returns JSON with transcription result
- Handles FFmpeg path resolution for bundled executable
- 30-second timeout protection

Key Implementation Details

1. FFmpeg Integration

FFmpeg is bundled with the app and doesn't require system installation:

// FFmpeg is unpacked from ASAR to app.asar.unpacked/node_modules/ffmpeg-static/
// Python bridge receives FFmpeg path via environment variables:
// FFMPEG_PATH, FFMPEG_EXECUTABLE, FFMPEG_BINARY

2. Audio Recording Flow

User presses hotkey → MediaRecorder starts
Audio chunks collected in array
User presses hotkey again → Recording stops
Blob created from chunks → Converted to ArrayBuffer
Sent via IPC as Base64 string (size limits)
Main process writes to temporary file
Whisper processes file → Result sent back
Temporary file deleted

3. Local Whisper Models

Models stored in ~/.cache/whisper/:

tiny: 39MB (fastest, lowest quality)
base: 74MB (recommended balance)
small: 244MB (better quality)
medium: 769MB (high quality)
large: 1.5GB (best quality)
turbo: 809MB (fast with good quality)

4. Database Schema

CREATE TABLE transcriptions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
  original_text TEXT NOT NULL,
  processed_text TEXT,
  is_processed BOOLEAN DEFAULT 0,
  processing_method TEXT DEFAULT 'none',
  agent_name TEXT,
  error TEXT
);

5. Settings Storage

Settings stored in localStorage with these keys:

whisperModel: Selected Whisper model
useLocalWhisper: Boolean for local vs cloud
openaiApiKey: Encrypted API key
anthropicApiKey: Encrypted API key
geminiApiKey: Encrypted API key
language: Selected language code
agentName: User's custom agent name
reasoningModel: Selected AI model for processing (defaults to gpt-4o-mini)
reasoningProvider: AI provider (openai/anthropic/gemini/local)
hotkey: Custom hotkey configuration
hasCompletedOnboarding: Onboarding completion flag

6. Language Support

58 languages supported (see src/utils/languages.ts):

Each language has a two-letter code and label
"auto" for automatic detection
Passed to Whisper via --language parameter

7. Agent Naming System

User names their agent during onboarding (step 6/8)
Name stored in localStorage and database
ReasoningService detects "Hey [AgentName]" patterns
AI processes command and removes agent reference from output
Supports multiple AI providers:
- OpenAI (Now using Responses API as of September 2025):
  - GPT-5 Series (Nano/Mini/Full) - Latest models with fastest performance
  - GPT-4.1 Series (Nano/Mini/Full) with 1M context window
  - o-series reasoning models (o3/o3-pro/o4-mini) for deep reasoning tasks
  - GPT-4o multimodal series (4o/4o-mini) - default model
  - Legacy support for GPT-4 Turbo, GPT-4 classic, GPT-3.5 Turbo
- Anthropic (Via IPC bridge to avoid CORS):
  - Claude Opus 4.1 (claude-opus-4-1-20250805) - Frontier intelligence
  - Claude Sonnet 4 (claude-sonnet-4-20250514) - Latest balanced model
  - Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) - Balanced performance
  - Claude 3.5 Haiku (claude-3-5-haiku-20241022) - Fast and efficient
- Google Gemini (Direct API integration):
  - Gemini 2.5 Pro (gemini-2.5-pro) - Most intelligent with thinking capability
  - Gemini 2.5 Flash (gemini-2.5-flash) - High-performance with thinking
  - Gemini 2.5 Flash Lite (gemini-2.5-flash-lite) - Fast and low-cost
  - Gemini 2.0 Flash (gemini-2.0-flash) - 1M token context
- Local: Community models via LocalReasoningService (Qwen, LLaMA, Mistral)

8. API Integrations and Updates

OpenAI Responses API (September 2025):

Migrated from Chat Completions to new Responses API
Endpoint: https://api.openai.com/v1/responses
Simplified request format with input array instead of messages
New response format with output array containing typed items
Automatic handling of GPT-5 and o-series model requirements
No temperature parameter for newer models (GPT-5, o-series)

Anthropic Integration:

Routes through IPC handler to avoid CORS issues in renderer process
Uses main process for API calls with proper error handling
Model names use hyphens (e.g., claude-3-5-sonnet not claude-3.5-sonnet)

Gemini Integration:

Direct API calls from renderer process
Increased token limits for Gemini 2.5 Pro (2000 minimum)
Proper handling of thinking process in responses
Error handling for MAX_TOKENS finish reason

API Key Persistence:

All API keys now properly persist to .env file
Keys stored in environment variables and reloaded on app start
Centralized saveAllKeysToEnvFile() method ensures consistency

9. Debug Mode

Enable with --debug flag or OPENWHISPR_DEBUG=true:

Logs saved to platform-specific app data directory
Comprehensive logging of audio pipeline
FFmpeg path resolution details
Audio level analysis
Complete reasoning pipeline debugging with stage-by-stage logging

Development Guidelines

Adding New Features

New IPC Channel: Add to both ipcHandlers.js and preload.js
New Setting: Update useSettings.ts and SettingsPage.tsx
New UI Component: Follow shadcn/ui patterns in src/components/ui
New Manager: Create in src/helpers/, initialize in main.js

Testing Checklist

[ ] Test both local and cloud processing modes
[ ] Verify hotkey works globally
[ ] Check clipboard pasting on all platforms
[ ] Test with different audio input devices
[ ] Verify Python auto-installation
[ ] Test all Whisper models
[ ] Check agent naming functionality

Common Issues and Solutions

No Audio Detected:
- Check FFmpeg path resolution
- Verify microphone permissions
- Check audio levels in debug logs
Transcription Fails:
- Ensure Python/Whisper installed
- Check temporary file creation
- Verify FFmpeg is executable
Clipboard Not Working:
- macOS: Check accessibility permissions
- Use AppleScript fallback on macOS
Build Issues:
- Use npm run pack for unsigned builds (CSC_IDENTITY_AUTO_DISCOVERY=false)
- Signing requires Apple Developer account
- ASAR unpacking needed for FFmpeg/Python bridge
- afterSign.js automatically skips signing when CSC_IDENTITY_AUTO_DISCOVERY=false

Platform-Specific Notes

macOS:

Requires accessibility permissions for clipboard
Uses AppleScript for reliable pasting
Notarization needed for distribution
Shows in dock with indicator dot when running (LSUIElement: false)

Windows:

Python installer handles PATH automatically
No special permissions needed
NSIS installer for distribution

Linux:

Multiple package manager support
Standard XDG directories
AppImage for distribution

Code Style and Conventions

Use TypeScript for new React components
Follow existing patterns in helpers/
Descriptive error messages for users
Comprehensive debug logging
Clean up resources (files, listeners)
Handle edge cases gracefully

Performance Considerations

Whisper model size vs speed tradeoff
Audio blob size limits for IPC (10MB)
Temporary file cleanup
Memory usage with large models
Process timeout protection (30s)

Security Considerations

API keys stored in system keychain when possible
Context isolation enabled
No remote code execution
Sanitized file paths
Limited IPC surface area

Future Enhancements to Consider

Streaming transcription support
Custom wake word detection
Multi-language UI
Cloud model selection
Batch transcription
Export formats beyond clipboard

Wispr-Flow CLAUDE.md