CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Path Warning ⚠️

CRITICAL: The correct ecosystem path has changed!

OLD/WRONG: C:\Users\Earth\BEDROT PRODUCTIONS\BEDROT DATA LAKE
NEW/CORRECT: C:\Users\Earth\BEDROT PRODUCTIONS\bedrot-data-ecosystem

Component paths:

  • Data Lake: bedrot-data-ecosystem\data_lake\
  • Data Warehouse: bedrot-data-ecosystem\data_warehouse\
  • Dashboard: bedrot-data-ecosystem\data_dashboard\

Always verify you're in the correct directory before running commands!

Project Overview

BEDROT Data Ecosystem is a production-grade analytics infrastructure for music industry data, consisting of three integrated components:

  • Data Lake (data_lake/): Multi-zone ETL architecture for ingesting data from Spotify, TikTok, Meta Ads, DistroKid, Linktree, and TooLost (YouTube, MailChimp, and Instagram ingestion is planned but not yet implemented)
  • Data Warehouse (data_warehouse/): SQLite-based analytical database with normalized tables for streaming, financial, and social media metrics
  • Data Dashboard (data_dashboard/): Real-time Next.js/FastAPI dashboard displaying 20+ KPIs with live WebSocket updates

Zone Architecture

Data Lake Zones

  1. 1_landing/ - Raw data ingestion, timestamped files from extractors
  2. 2_raw/ - Validated, immutable copies in NDJSON/CSV format
  3. 3_staging/ - Cleaned, transformed, standardized data
  4. 4_curated/ - Business-ready datasets for analytics/dashboards
  5. 5_archive/ - Long-term storage (7+ year retention)
  6. 6_automated_cronjob/ - Master pipeline orchestration scripts

Service Implementation Status

| Service | Extractors | Cleaners | Priority | Critical Notes |
|---------|------------|----------|----------|----------------|
| TooLost | ✅ Multiple variants | ✅ All 3 | CRITICAL | JWT expires every 7 days! |
| Spotify | ✅ audience_extractor | ✅ All 3 | HIGH | Artists API |
| TikTok | ✅ 2 accounts | ✅ All 3 | HIGH | zonea0 + pig1987 |
| DistroKid | ✅ dk_auth | ✅ All 3 | MEDIUM | Streams + financials |
| Linktree | ✅ analytics_extractor | ✅ All 3 | MEDIUM | GraphQL |
| MetaAds | ✅ daily_campaigns | ✅ All 3 | LOW | Graph API |
| Instagram | ❌ Empty dirs | ❌ | - | Not implemented |
| YouTube | ❌ Empty dirs | ❌ | - | Not implemented |
| MailChimp | ❌ Empty dirs | ❌ | - | Not implemented |

Important Context: Semi-Manual Authentication Design

Authentication Philosophy

CRITICAL: The BEDROT Data Lake operates as a semi-manual system by design. This is not a limitation but a deliberate choice for compliance and security:

  1. Extractors NEVER authenticate automatically - All authentication is handled manually by the user
  2. 2FA Compliance - All services use two-factor authentication, requiring human interaction
  3. Cookie Management is a Convenience, Not Critical Infrastructure - Expired cookies mean manual re-authentication, not system failure
  4. User-Driven Process - The system respects service terms by requiring explicit user authentication

Cookie Management Role

The cookie system exists to reduce manual work, not eliminate it:

  • Storage: Preserves authentication state between runs
  • Monitoring: Warns when cookies are expiring (see the sketch after this list)
  • Refresh Assistance: Helps guide re-authentication when needed
  • NOT Critical Infrastructure: Cookie failures are inconveniences, not emergencies
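
A minimal sketch of the monitoring role described above, assuming Playwright-style cookie exports (a JSON list of objects with an epoch-seconds expires field) under src/<service>/cookies/; the file layout and field names are assumptions, not the repo's actual schema:

# Hypothetical expiry check - flags upcoming manual re-authentication
import json
import time
from pathlib import Path

def cookie_status(cookie_file: Path, warn_days: float = 2.0) -> str:
    cookies = json.loads(cookie_file.read_text())
    # Session cookies typically carry expires == -1; skip them
    expiries = [c["expires"] for c in cookies if c.get("expires", -1) > 0]
    if not expiries:
        return "no expiring cookies found"
    days_left = (min(expiries) - time.time()) / 86400
    if days_left < 0:
        return "EXPIRED - manual re-authentication needed (expected, not an emergency)"
    if days_left < warn_days:
        return f"expiring in {days_left:.1f} days - plan a manual refresh"
    return f"healthy ({days_left:.1f} days remaining)"

print(cookie_status(Path("src/toolost/cookies/toolost_cookies.json")))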

Interpreting Cookie-Related Failures

When you see extraction failures due to cookies:

  1. This is EXPECTED behavior - The system is working as designed
  2. Not a critical error - Just means manual authentication is needed
  3. Part of the workflow - Semi-manual means user intervention is normal
  4. Compliance-friendly - Ensures we're not automating around security measures

Development Commands

Testing

# Run all tests with coverage
pytest -ra --cov=src --cov-report=term-missing

# Run tests for specific service
pytest tests/spotify/ -v

# Run cookie refresh tests
pytest tests/test_cookie_refresh.py -v
pytest tests/test_integration.py -v
pytest tests/test_e2e_scenarios.py -v

Code Quality

# Format code
black src/
isort src/

# Lint code  
flake8 src/
mypy src/

Data Pipeline Execution

New Semi-Manual Pipeline with Authentication (Recommended)

# Interactive mode - prompts for manual auth when needed
cd "C:\Users\Earth\BEDROT PRODUCTIONS\bedrot-data-ecosystem\data_lake"
cronjob\run_bedrot_pipeline.bat

# Automated mode - skips services needing auth
cronjob\run_bedrot_pipeline.bat --automated

# Alternative with data warehouse ETL
cronjob\run_pipeline_with_auth.bat

Legacy Pipeline Commands

# Run full data lake pipeline (Windows)
data_lake/cronjob/run_datalake_cron.bat

# Run pipeline without extractors (cleaners only)
data_lake/cronjob/run_datalake_cron_no_extractors.bat

# Manual Python execution requires PROJECT_ROOT environment variable
set PROJECT_ROOT=%cd%
python src/spotify/extractors/spotify_audience_extractor.py
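
A hedged sketch of how an extractor might honor the PROJECT_ROOT convention; the error message and landing-zone layout are illustrative, not the repo's actual code:

# Hypothetical PROJECT_ROOT handling inside an extractor
import os
import sys
from pathlib import Path

project_root = os.environ.get("PROJECT_ROOT")
if not project_root:
    sys.exit("PROJECT_ROOT is not set - run `set PROJECT_ROOT=%cd%` first")

landing_dir = Path(project_root) / "1_landing"
landing_dir.mkdir(parents=True, exist_ok=True)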

Health Monitoring & Diagnostics

# Check pipeline health
python src\common\pipeline_health_monitor.py

# Check authentication status
python src\common\run_with_auth_check.py --check-only

# Run specific services with auth check
python src\common\run_with_auth_check.py spotify toolost

# Test pipeline components
python test_pipeline_components.py

Data Dashboard

# Frontend (Next.js)
cd data_dashboard
npm install
npm run dev        # Port 3000 with custom Socket.IO server
npm run build
npm run start
npm run lint
npm run type-check

# Backend (FastAPI)
cd data_dashboard/backend
pip install -r requirements.txt
python main.py     # Port 8000 with WebSocket support

Data Warehouse

cd data_warehouse
# Currently being reinitialized - structure pending
# Virtual environment preserved at .venv/
# Requirements preserved at requirements.txt

Architecture

System Data Flow

External APIs → Data Lake → Data Warehouse → Dashboard → Business Insights
     ↓             ↓             ↓              ↓              ↓
  [Collect]    [Process]    [Normalize]    [Visualize]    [Decide]

Data Lake Zones

landing/ → raw/ → staging/ → curated/ → archive/
   ↓        ↓        ↓          ↓          ↓
[Ingest] [Validate] [Clean] [Transform] [Preserve]

⚠️ CRITICAL Data Caveat: Spotify Stream Double-Counting

IMPORTANT: The curated zone contains Spotify streaming data from TWO separate sources:

  1. tidy_daily_streams.csv - Spotify + Apple Music streams from distributors (DistroKid/TooLost)
  2. spotify_audience_curated_*.csv - Spotify-only metrics from Spotify for Artists API

Never combine these datasets when calculating Spotify totals - you would be double-counting!

  • For total platform streams: Use tidy_daily_streams (includes both Spotify and Apple)
  • For Spotify-specific analytics: Use spotify_audience_curated
  • DO NOT add values from both datasets together (demonstrated in the sketch below)
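
An illustrative guard against the double-count, assuming hypothetical column names (streams, spotify_streams); check the actual curated schemas before relying on this:

import pandas as pd
from pathlib import Path

daily = pd.read_csv("4_curated/tidy_daily_streams.csv")  # Spotify + Apple (distributor)
latest = sorted(Path("4_curated").glob("spotify_audience_curated_*.csv"))[-1]
audience = pd.read_csv(latest)                           # Spotify-only (Artists API)

total_platform_streams = daily["streams"].sum()          # correct: cross-platform total
spotify_only_streams = audience["spotify_streams"].sum() # correct: Spotify-specific view

# WRONG: adding the two sources counts Spotify streams twice
# wrong_total = total_platform_streams + spotify_only_streams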

Key Components

Data Lake Organization (src/<platform>/):

  • extractors/ - API/web scraping scripts
  • cleaners/ - Zone promotion scripts (follow naming: <service>_landing2raw.py; see the sketch after this list)
  • cookies/ - Authentication cookies
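
A minimal sketch of a zone-promotion cleaner following that naming convention (src/newservice/cleaners/newservice_landing2raw.py); the validation logic and directory layout are assumptions for illustration:

# Hypothetical newservice_landing2raw.py
import json
import shutil
from pathlib import Path

LANDING = Path("1_landing/newservice")
RAW = Path("2_raw/newservice")

def promote() -> None:
    RAW.mkdir(parents=True, exist_ok=True)
    for src in LANDING.glob("*.json"):
        json.loads(src.read_text())   # validate before promotion
        dest = RAW / src.name
        if not dest.exists():         # raw zone is immutable
            shutil.copy2(src, dest)

if __name__ == "__main__":
    promote()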

Additional Directories:

  • 6_automated_cronjob/ - Automated pipeline execution scripts
  • src/common/cookie_refresh/ - Cookie refresh system with service strategies
  • Root CSV files - Ad campaign and catalog data exports

Data Warehouse (Currently being reinitialized):

  • SQLite database: bedrot_analytics.db
  • Preserved virtual environment and requirements
  • Will integrate with data_lake curated zone

Dashboard Architecture:

  • Frontend: Next.js 14, React 18, TypeScript, Tailwind CSS
  • Backend: FastAPI with WebSocket support
  • Real-time updates via Socket.IO

Cookie Authentication System

Service-Specific Expiration

| Service | Max Age | Refresh Interval | Strategy | 2FA |
|---------|---------|------------------|----------|-----|
| TooLost | 7 days | 5 days | JWT Manual | No |
| Spotify | 30 days | 7 days | OAuth Manual | No |
| TikTok Zone A0 | 30 days | 7 days | QR/Manual | Yes |
| TikTok PIG1987 | 30 days | 7 days | QR/Manual | Yes |
| DistroKid | 90 days | 10 days | Automated | No |
| Linktree | 30 days | 14 days | Standard | No |
| MetaAds | 60 days | 30 days | OAuth | No |

Cookie Management

# Check all services
python src/common/cookie_refresh/check_status.py

# Manual refresh
python src/common/cookie_refresh/manual_refresh.py --service toolost

# View dashboard (http://localhost:8080)
python src/common/cookie_refresh/start_dashboard.py

Common Tasks

Adding a New Data Source

  1. Create directory structure:

    mkdir -p src/newservice/{extractors,cleaners,cookies}
    touch src/newservice/__init__.py
    
  2. Implement extractor (src/newservice/extractors/newservice_extractor.py) - see the skeleton after this list

  3. Create three cleaners following naming convention

  4. Add tests in tests/newservice/

  5. No registration needed - pipelines auto-discover
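
A hypothetical extractor skeleton for step 2; the PROJECT_ROOT and timestamped-landing conventions come from this document, everything else is illustrative:

# src/newservice/extractors/newservice_extractor.py (sketch)
import json
import os
from datetime import datetime
from pathlib import Path

def extract() -> None:
    landing = Path(os.environ["PROJECT_ROOT"]) / "1_landing" / "newservice"
    landing.mkdir(parents=True, exist_ok=True)
    data = {"fetched_at": datetime.now().isoformat(), "records": []}  # replace with real fetch
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")                  # YYYYMMDD_HHMMSS
    (landing / f"newservice_{stamp}.json").write_text(json.dumps(data))

if __name__ == "__main__":
    extract()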

Running a Single Test

# Specific test file
pytest tests/spotify/test_spotify_extractor.py -v

# Specific test function
pytest tests/spotify/test_spotify_extractor.py::test_extract_data -v

# With coverage for module
pytest tests/spotify/ --cov=src.spotify --cov-report=term-missing

Debugging Failed Extractors

# Set environment and run with debug
cd data_lake
set PROJECT_ROOT=%cd%
python src/spotify/extractors/spotify_audience_extractor.py --debug

# Check cookies
python -c "from common.cookies import validate_cookies; print(validate_cookies('spotify'))"

# View logs
type logs\extractor_YYYYMMDD.log

Key Conventions

  • Environment: PROJECT_ROOT must be set for all scripts
  • File Naming: Timestamps as YYYYMMDD_HHMMSS
  • Data Formats: NDJSON for structured data, CSV for tabular (see the sketch after this list)
  • Testing: pytest with coverage targets
  • Authentication: Cookie-based, with a refresh-assistance system; extractors never authenticate automatically (semi-manual by design)
  • Deployment: Supports both Windows batch files and Linux scripts
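
A small sketch of the file-naming and NDJSON conventions above; the record contents are made up for illustration:

import json
from datetime import datetime

records = [{"artist": "zone.a0", "streams": 123}, {"artist": "pig1987", "streams": 456}]
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")            # YYYYMMDD_HHMMSS
with open(f"tiktok_analytics_{stamp}.ndjson", "w", encoding="utf-8") as fh:
    for rec in records:                                     # NDJSON: one JSON object per line
        fh.write(json.dumps(rec) + "\n")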

WebSocket Events (Dashboard)

// Client → Server
socket.emit('subscribe', { channel: 'kpis' })
socket.emit('unsubscribe', { channel: 'kpis' })

// Server → Client  
socket.on('kpi:update', (data) => { /* New KPI data */ })
socket.on('connection:status', (status) => { /* Connected/Disconnected */ })
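
A hedged sketch of the server side of these events using python-socketio on ASGI; the event and channel names match the snippet above, but the backend wiring itself is an assumption:

import socketio

sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")
app = socketio.ASGIApp(sio)  # run with: uvicorn module:app --port 8000

@sio.on("subscribe")
async def subscribe(sid, data):
    await sio.enter_room(sid, data["channel"])   # e.g. 'kpis'

@sio.on("unsubscribe")
async def unsubscribe(sid, data):
    await sio.leave_room(sid, data["channel"])

async def broadcast_kpis(kpis: dict) -> None:
    await sio.emit("kpi:update", kpis, room="kpis")  # only subscribed clients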

Data Quality Standards

  • SHA-256 hash deduplication tracked in _hashes.json (sketched after this list)
  • Archive existing curated files before overwriting
  • Validate data types and constraints in cleaners
  • Log all operations with timestamps
  • Handle errors gracefully with descriptive messages
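
A minimal sketch of the hash-deduplication pattern; the exact _hashes.json schema is an assumption:

import hashlib
import json
from pathlib import Path

HASH_FILE = Path("_hashes.json")

def is_duplicate(path: Path) -> bool:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    seen = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    if digest in seen:
        return True
    seen[digest] = path.name                     # record first sighting
    HASH_FILE.write_text(json.dumps(seen, indent=2))
    return False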

Performance Optimization

Data Lake:

  • Parallel processing for different services
  • Batch operations (1000 records/chunk)
  • Generator patterns for large files
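
A sketch of the generator and 1000-records-per-chunk patterns; the function names are illustrative:

import json
from itertools import islice
from pathlib import Path
from typing import Iterator

def read_ndjson(path: Path) -> Iterator[dict]:
    with path.open(encoding="utf-8") as fh:      # streams the file, never loads it whole
        for line in fh:
            yield json.loads(line)

def batches(records: Iterator[dict], size: int = 1000) -> Iterator[list]:
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk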

Data Warehouse:

  • Indexes on frequently queried columns
  • Regular VACUUM operations
  • Batch inserts with transactions
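
A hypothetical batch insert into bedrot_analytics.db; the table and columns are illustrative only:

import sqlite3

def insert_batch(rows: list) -> None:
    with sqlite3.connect("bedrot_analytics.db") as conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO daily_streams (date, platform, streams) VALUES (?, ?, ?)",
            rows,
        )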

Dashboard:

  • React.memo for expensive components
  • WebSocket throttling (max 1/second)
  • Query result caching

Security Considerations

  • Store credentials in environment variables only
  • Never commit cookies or API keys
  • Hash PII data before storage
  • Use HTTPS for all external APIs
  • Implement authentication for dashboard access

Current Status (September 2025)

Pipeline Health

  • Pipeline Status: Check with python src/common/pipeline_health_monitor.py
  • Cookie Status: Monitor with python src/common/cookie_refresh/check_status.py
  • Dashboard: View at http://localhost:8080 via python src/common/cookie_refresh/start_dashboard.py

Recent Changes (September 2025)

  • Fixed TikTok zone.a0 date range selector to properly select 365 days
  • Updated TikTok HTML selectors to match current UI structure
  • Fixed browser profile conflicts between TikTok accounts
  • Improved download button selectors for TikTok analytics
  • Fixed MetaAds cleaner nlargest() syntax error

Known Issues

  • TooLost JWT expires every 7 days
  • TikTok zone.a0 downloads may still show limited data (account limitation)
  • Some services have data stuck in landing/raw zones
  • Need weekly reminder for TooLost refresh

Understanding Log Outputs

Log File Locations

The data pipeline generates several types of logs in the logs/ directory:

  1. Pipeline Execution Logs (pipeline_YYYYMMDD_HHMM.log)

    • Main pipeline execution log from cron job runs
    • Contains step-by-step execution details
    • Includes cookie checks, extractor runs, cleaner runs, and health reports
  2. Pipeline Executor Logs (Enhanced logging as of 2025-07-15)

    • pipeline_executor.log - Detailed execution log with timestamps
    • pipeline_executor_errors.log - Error-only log for quick debugging
    • pipeline_executor_structured.jsonl - JSON-formatted logs for programmatic parsing (see the parsing sketch below)
  3. Service-Specific Logs

    • cookie_refresh.log - Cookie refresh operations
    • bedrot_pipeline.log - Semi-manual pipeline runs
    • bedrot_pipeline_errors.log - Error-only view
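
A small sketch for scanning the structured JSONL log; the field names (level, timestamp, message) are assumptions about the log schema:

import json
from pathlib import Path

for line in Path("logs/pipeline_executor_structured.jsonl").read_text().splitlines():
    entry = json.loads(line)
    if entry.get("level") == "ERROR":            # error-only view for quick debugging
        print(entry.get("timestamp"), entry.get("message"))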

Common Error Resolutions

| Error Pattern | Likely Cause | Resolution | Severity |
|--------------|--------------|------------|----------|
| TimeoutError: Page.wait_for_selector | Page didn't load or login required | Run manual auth - this is NORMAL | Low |
| No cookies found | Missing authentication | Run extractor manually to login | Low |
| Multiple extraction failures | Cookies expired across services | Batch re-authenticate all services | Low |
| attempted relative import | Script run incorrectly | Use proper execution method | High |
| no healthy upstream | Network/proxy issue | Check network, disable proxy | High |
| STALE data | Extraction hasn't run recently | Run specific extractor with auth | Medium |

Understanding Failure Severity

Low Severity (Cookie/Auth Issues):

  • Expected in semi-manual system
  • Requires user intervention
  • Not a system failure
  • Part of normal workflow

High Severity (System Issues):

  • Actual infrastructure problems
  • Code/configuration errors
  • Network connectivity issues
  • These need immediate fixing

Medium Severity (Data Issues):

  • Data quality concerns
  • Processing bottlenecks
  • May indicate auth OR system issues