# Bluegrass-Songbook CLAUDE.md
Python utilities for building the search index and managing songs.
## Build Scripts (scripts/lib)
### Pipeline Overview
The build pipeline now uses `works/` as the primary data source:

```text
PRIMARY (current):
  works/*/work.yaml + lead-sheet.pro → build_works_index.py → index.jsonl

LEGACY (migration complete):
  sources/*/parsed/*.pro → migrate_to_works.py → works/
```
Key files:
- `build_works_index.py` - PRIMARY: Builds index from `works/` directory
- `work_schema.py` - Defines work.yaml schema and validation
- `build_index.py` - LEGACY: Builds from `sources/` (kept for reference)
### Local vs CI Operations
Some operations require external APIs/databases and only run locally. Others run everywhere.
| Operation | Where | Cache File | Notes |
|-----------|-------|------------|-------|
| Build index | Everywhere | - | Core build, always runs |
| Harmonic analysis | Everywhere | - | Computes JamFriendly, Modal tags from chords |
| MusicBrainz tags | Local only | artist_tags.json | Requires local MB database on port 5440 |
| Grassiness scores | Local only | bluegrass_recordings.json, bluegrass_tagged.json | Song-level bluegrass detection |
| Strum Machine URLs | Local only | strum_machine_cache.json | API rate limited (10 req/sec) |
| TuneArch fetch | Local only | - | Fetches new instrumentals |
How caching works:
1. Run local command to populate cache (e.g., `refresh-tags`, `strum-machine-match`)
2. Commit the cache file to git
3. CI reads cache during build - no external API calls
Cache files (commit these after updating):
- `docs/data/artist_tags.json` - MusicBrainz artist → genre mappings
- `docs/data/strum_machine_cache.json` - Song title → Strum Machine URL mappings
- `docs/data/bluegrass_recordings.json` - Recordings by curated bluegrass artists
- `docs/data/bluegrass_tagged.json` - Recordings with MusicBrainz bluegrass tags
- `docs/data/grassiness_scores.json` - Computed grassiness scores per song
### Files
```text
scripts/lib/
├── build_works_index.py         # PRIMARY: Build index.jsonl from works/
├── work_schema.py               # work.yaml schema definition and validation
├── migrate_to_works.py          # Migrate sources/ → works/ structure
├── build_index.py               # LEGACY: Build index from sources/*.pro
├── build_posts.py               # Build blog posts manifest (posts.json)
├── enrich_songs.py              # Enrich .pro files (provenance, chord normalization)
├── tag_enrichment.py            # Tag enrichment (MusicBrainz + harmonic analysis)
├── query_artist_tags.py         # Optimized MusicBrainz artist tag queries
├── strum_machine.py             # Strum Machine API integration
├── fetch_tune.py                # Fetch tunes from TuneArch by URL
├── search_index.py              # Search index utilities and testing
├── add_song.py                  # Add a song to manual/parsed/
├── process_submission.py        # GitHub Action: process song-submission issues
├── process_correction.py        # GitHub Action: process song-correction issues
├── chord_counter.py             # Chord statistics utility
├── loc_counter.py               # Lines of code counter for analytics
├── export_genre_suggestions.py  # Export genre suggestions for review
├── batch_tag_songs.py           # Batch tag songs using Claude API
├── fetch_tag_overrides.py       # Fetch trusted user tag votes from Supabase
└── tagging/                     # Song-level tagging system
    ├── CLAUDE.md                # Detailed docs for grassiness scoring
    ├── build_artist_database.py # Build curated bluegrass artist database
    └── grassiness.py            # Bluegrass detection based on covers/tags
```
### Quick Commands

```bash
# Full pipeline: build index from works/
./scripts/bootstrap --quick

# Build index with tag refresh (local only, requires MusicBrainz)
./scripts/bootstrap --quick --refresh-tags

# Add a song manually
./scripts/utility add-song /path/to/song.pro

# Count chord usage across all songs
./scripts/utility count-chords

# Refresh tags from MusicBrainz (LOCAL ONLY - requires MB database)
./scripts/utility refresh-tags

# Match songs to Strum Machine (LOCAL ONLY - ~30 min for 17k songs)
./scripts/utility strum-machine-match
```
### Bootstrap Timing
Bootstrap now shows elapsed time and per-stage breakdown:
```text
Bootstrap complete! (45s total)

Timing breakdown:
  - Enrichment: 12s
  - Build index: 33s
```
### Performance Notes
The build pipeline uses pre-computed lookup dicts to avoid O(n*m) nested loops:
| Operation | Before | After | Speedup |
|-----------|--------|-------|---------|
| Strum Machine "the" matching | 17k × 52k = 884M | 52k + 17k = 69k | 12,800× faster |
| Grassiness title lookup | 17k × 56k = 952M | 56k + 17k = 73k | 13,000× faster |
These lookups are built once before the main song loop, then used for O(1) dict access.
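For illustration, a minimal sketch of the pattern; `normalize_title` and `build_lookup` are hypothetical stand-ins for the build's actual helpers:

```python
# Sketch of the precomputed-lookup pattern; helper names are illustrative.
def normalize_title(title: str) -> str:
    return " ".join(title.lower().split())

def build_lookup(cache: dict[str, str]) -> dict[str, str]:
    # One O(m) pass over the cache, done once before the song loop.
    return {normalize_title(k): v for k, v in cache.items()}

cache = {"Angeline The Baker": "<strum-machine-url>"}  # toy cache entry
lookup = build_lookup(cache)

for song in [{"title": "Angeline the Baker"}]:
    # O(1) dict probe per song instead of an O(m) scan per song.
    url = lookup.get(normalize_title(song["title"]))
```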
## enrich_songs.py
Enriches .pro files with provenance metadata and normalized chord patterns.
### What It Does
- Adds provenance metadata (`x_source`, `x_source_file`, `x_enriched`)
- Normalizes chord patterns within sections of the same type
- Skips protected files (human corrections are authoritative)
### Chord Pattern Normalization
Ensures consistent chord counts across verses/choruses of the same type:
```text
Before:                          After:
Verse 1: [G]Your cheating...     Verse 1: [G]Your cheating...
Verse 2: When tears come...      Verse 2: [G]When tears come...
                                          ↑ Added from canonical
```
Algorithm:
1. Group sections by type (verse, chorus, etc.)
2. Find canonical section (most chords, starts with chord)
3. For sections missing first chord, add canonical's first chord
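A simplified sketch of the algorithm, assuming sections are already parsed into dicts (the real script operates on ChordPro text; structure and names here are illustrative):

```python
from collections import defaultdict

def normalize_sections(sections: list[dict]) -> None:
    """Each section: {'type': 'verse', 'lines': [...], 'chords': [...]}."""
    by_type = defaultdict(list)
    for sec in sections:
        by_type[sec["type"]].append(sec)
    for group in by_type.values():
        # Canonical section: most chords, preferring one that opens with a chord.
        canonical = max(
            group,
            key=lambda s: (len(s["chords"]),
                           bool(s["lines"]) and s["lines"][0].startswith("[")),
        )
        if not canonical["chords"]:
            continue  # no chords anywhere in this group
        first_chord = canonical["chords"][0]
        for sec in group:
            # Prepend the canonical first chord where a section lacks one.
            if sec["lines"] and not sec["lines"][0].startswith("["):
                sec["lines"][0] = f"[{first_chord}]{sec['lines'][0]}"
```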
### Usage
```bash
# Enrich all sources
uv run python scripts/lib/enrich_songs.py

# Dry run (show what would change)
uv run python scripts/lib/enrich_songs.py --dry-run

# Single source only
uv run python scripts/lib/enrich_songs.py --source classic-country

# Single file (for testing)
uv run python scripts/lib/enrich_songs.py --file path/to/song.pro
```
### Protected Files

Files listed in `sources/{source}/protected.txt` are skipped. These are human-corrected files that should not be auto-modified.
## build_works_index.py (PRIMARY)

Generates `docs/data/index.jsonl` from the `works/` directory.
### What It Does
- Scans `works/*/work.yaml` for all works
- Reads work metadata (title, artist, composers, tags, parts)
- Reads lead sheet content from `lead-sheet.pro`
- Detects key and computes Nashville numbers
- Identifies tablature parts and includes their paths
- Applies fuzzy grouping to merge similar titles
- Matches to Strum Machine cache
- Outputs unified JSON index
### Version Grouping

Songs are grouped by `group_id` for the version picker. The grouping algorithm:
1. **Title normalization**: Lowercase, remove accents, strip parenthetical suffixes like `(Live)`, `(C)`, `(D)`
2. **Article removal**: Remove "the", "a", "an" so "Angeline the Baker" matches "Angeline Baker"
3. **Lyrics hash**: First 200 chars of lyrics distinguish different songs with same title
4. **Fuzzy matching**: Post-processing pass merges similar titles (85% similarity threshold):
   - Handles contractions: "Lovin'" ↔ "Loving"
   - Handles plurals: "Heartache" ↔ "Heartaches"
   - Handles compound words: "Home Town" ↔ "Hometown"
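A sketch of steps 1-2 plus the fuzzy pass, using `difflib` as a stand-in for whatever similarity measure the build actually uses:

```python
import re
import unicodedata
from difflib import SequenceMatcher

ARTICLES = {"the", "a", "an"}

def normalize_title(title: str) -> str:
    # Lowercase, strip accents, drop trailing parentheticals like "(Live)" or "(C)".
    t = unicodedata.normalize("NFKD", title.lower())
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = re.sub(r"\s*\([^)]*\)\s*$", "", t).strip()
    return " ".join(w for w in t.split() if w not in ARTICLES)

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # Fuzzy post-pass: merge groups whose normalized titles are near-identical.
    return SequenceMatcher(None, a, b).ratio() >= threshold

assert normalize_title("Angeline the Baker (C)") == normalize_title("Angeline Baker")
assert similar(normalize_title("Lovin' You"), normalize_title("Loving You"))
```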
### Source Priority
When determining the work's source for attribution:
1. `x_source` in lead-sheet content (highest priority) - e.g., `{meta: x_source tunearch}`
2. Lead-sheet part provenance from work.yaml
3. Tablature part provenance (fallback)
This ensures works with both a TuneArch lead sheet and a Banjo Hangout tab show "tunearch" as the source.
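Illustratively, the cascade looks like this (work.yaml assumed already loaded into dicts with the `parts`/`provenance` fields described below; the helper name is hypothetical):

```python
import re

def resolve_source(lead_sheet_text: str, work: dict) -> str | None:
    # 1. An x_source directive embedded in the lead-sheet content wins.
    m = re.search(r"\{meta:\s*x_source\s+([^}\s]+)\s*\}", lead_sheet_text)
    if m:
        return m.group(1)
    # 2. Fall back to the lead-sheet part's provenance from work.yaml,
    # 3. then to any tablature part's provenance.
    for wanted in ("lead-sheet", "tablature"):
        for part in work.get("parts", []):
            if part.get("type") == wanted:
                source = part.get("provenance", {}).get("source")
                if source:
                    return source
    return None
```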
### Tablature Attribution
Tablature parts include provenance for frontend attribution:
"tablature_parts": [{
"instrument": "banjo",
"file": "data/tabs/red-haired-boy-banjo.otf.json",
"source": "banjo-hangout",
"source_id": "1687",
"author": "schlange",
"source_page_url": "https://www.banjohangout.org/tab/browse.asp?m=detail&v=1687",
"author_url": "https://www.banjohangout.org/my/schlange"
}]
### Strum Machine Matching
Matches songs to Strum Machine backing tracks using cached results:
1. Normalize title (lowercase, strip parenthetical suffixes)
2. Try exact match in cache
3. Try without articles ("the", "a", "an")
4. Try matching cache keys with articles removed
This handles cases like "Angeline Baker (C)" matching "angeline the baker" in the cache.
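A sketch of that lookup cascade (assumes the cache maps normalized lowercase titles to URLs; helper names and the placeholder URL are illustrative):

```python
import re

def strip_articles(title: str) -> str:
    return " ".join(w for w in title.split() if w not in {"the", "a", "an"})

def match_strum_machine(title: str, cache: dict[str, str]) -> str | None:
    key = re.sub(r"\s*\([^)]*\)\s*$", "", title.lower()).strip()
    if key in cache:                          # exact match
        return cache[key]
    if strip_articles(key) in cache:          # query without articles
        return cache[strip_articles(key)]
    for cache_key, url in cache.items():      # cache keys without articles
        if strip_articles(cache_key) == strip_articles(key):
            return url
    return None

cache = {"angeline the baker": "<strum-machine-url>"}
assert match_strum_machine("Angeline Baker (C)", cache) == "<strum-machine-url>"
```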
### Usage

```bash
uv run python scripts/lib/build_works_index.py            # Full build
uv run python scripts/lib/build_works_index.py --no-tags  # Skip tag enrichment
```
### Output Format
```json
{
  "id": "blue-moon-of-kentucky",
  "title": "Blue Moon of Kentucky",
  "artist": "Patsy Cline",
  "composers": ["Bill Monroe"],
  "key": "C",
  "tags": ["ClassicCountry", "JamFriendly"],
  "content": "{meta: title...}[full ChordPro]",
  "tablature_parts": [
    {"type": "tablature", "instrument": "banjo", "path": "data/tabs/..."}
  ]
}
```
## work_schema.py

Defines the `work.yaml` schema and validation.
### Work Schema
```python
from dataclasses import dataclass

@dataclass
class Part:
    type: str          # 'lead-sheet', 'tablature', 'abc-notation'
    format: str        # 'chordpro', 'opentabformat', 'abc'
    file: str          # Relative path to file
    default: bool      # Is this the default part?
    instrument: str    # Optional: 'banjo', 'fiddle', 'guitar'
    provenance: dict   # Source info (source, source_file, imported_at)

@dataclass
class Work:
    id: str            # Slug (e.g., 'blue-moon-of-kentucky')
    title: str
    artist: str
    composers: list[str]
    default_key: str
    tags: list[str]
    parts: list[Part]
```
## build_index.py (LEGACY)

Generates `docs/data/index.jsonl` from all `.pro` files in `sources/`.
### What It Does
- Scans `sources/*/parsed/*.pro` for all songs
- Parses ChordPro metadata (title, artist, composer, version fields)
- Extracts lyrics (without chords) for search
- Detects key using diatonic heuristics
- Converts chords to Nashville numbers for chord search
- Computes `group_id` for song version grouping
- Removes exact duplicates (same content hash)
- Outputs unified JSON index
### Key Functions
```python
def parse_chordpro_metadata(content) -> dict:
    """Extract {meta: key value} and {key: value} directives.
    Includes version fields: x_version_label, x_version_type, etc."""

def detect_key(chords: list[str]) -> tuple[str, str]:
    """Detect key from chord list. Returns (key, mode)."""

def to_nashville(chord: str, key_name: str) -> str:
    """Convert chord to Nashville number given a key."""

def extract_lyrics(content: str) -> str:
    """Extract plain lyrics without chord markers."""

def normalize_for_grouping(text: str) -> str:
    """Normalize text for grouping comparison.
    Lowercases, removes accents, strips common suffixes."""

def compute_group_id(title: str, artist: str) -> str:
    """Compute base group ID from normalized title + artist."""

def compute_lyrics_hash(lyrics: str) -> str:
    """Hash first 200 chars of normalized lyrics.
    Used to distinguish different songs with same title."""
```
### Output Format
```json
{
  "songs": [
    {
      "id": "songfilename",
      "title": "Song Title",
      "artist": "Artist Name",
      "composer": "Writer Name",
      "first_line": "First line of lyrics...",
      "lyrics": "Lyrics for search (500 chars)",
      "content": "Full ChordPro content",
      "key": "G",
      "mode": "major",
      "nashville": ["I", "IV", "V"],
      "progression": ["I", "I", "IV", "V", "I"],
      "group_id": "abc123def456_12345678",
      "chord_count": 3,
      "version_label": "Simplified",
      "version_type": "simplified",
      "arrangement_by": "John Smith"
    }
  ]
}
```
### Version Grouping

Songs are grouped by `group_id`, which combines:
- Base hash: MD5 of normalized title + artist
- Lyrics hash: MD5 of first 200 chars of normalized lyrics
This ensures songs with the same title but different lyrics (different songs) get different group_ids, while true versions (same lyrics, different arrangements) share a group_id.
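A sketch of the combination (the real code splits this across `compute_group_id` and `compute_lyrics_hash`; the truncation lengths here are inferred from the example `group_id` above and may differ):

```python
import hashlib

def _norm(text: str) -> str:
    # Stand-in for normalize_for_grouping (lowercase, de-accent, strip suffixes).
    return " ".join(text.lower().split())

def group_id(title: str, artist: str, lyrics: str) -> str:
    # Base hash of normalized title + artist...
    base = hashlib.md5(f"{_norm(title)}|{_norm(artist)}".encode()).hexdigest()[:12]
    # ...plus a hash of the first 200 chars of normalized lyrics.
    lyr = hashlib.md5(_norm(lyrics)[:200].encode()).hexdigest()[:8]
    return f"{base}_{lyr}"  # e.g. "abc123def456_12345678"
```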
### Deduplication
Exact duplicates (identical content) are removed at build time. The first occurrence is kept.
### Key Detection Algorithm
Scores each possible key by:
- How many song chords fit the key's diatonic scale
- Bonus weight for tonic chord appearances
- Tie-breaking: prefer common keys (G, C, D, A, E, Am, Em)
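A toy version of the scorer, with a truncated diatonic table standing in for the real one:

```python
# Truncated tables for illustration; the real build covers all keys and modes.
DIATONIC = {
    "G": {"G", "Am", "Bm", "C", "D", "Em"},
    "C": {"C", "Dm", "Em", "F", "G", "Am"},
    "D": {"D", "Em", "F#m", "G", "A", "Bm"},
}
COMMON_KEYS = ["G", "C", "D", "A", "E", "Am", "Em"]

def detect_key(chords: list[str]) -> str:
    def score(key: str) -> tuple[int, int]:
        fits = sum(c in DIATONIC[key] for c in chords)   # diatonic fit
        tonic_bonus = 2 * sum(c == key for c in chords)  # tonic weighting
        # Tie-break toward common keys (lower index = more common).
        commonness = -COMMON_KEYS.index(key) if key in COMMON_KEYS else -99
        return (fits + tonic_bonus, commonness)
    return max(DIATONIC, key=score)

assert detect_key(["G", "C", "D", "G"]) == "G"
```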
## Tag System

Tags are added to songs during index build via `tag_enrichment.py`.
### Tag Taxonomy

| Category | Tags |
|----------|------|
| Genre | Bluegrass, ClassicCountry, OldTime, Gospel, Folk, HonkyTonk, Outlaw, Rockabilly, etc. |
| Vibe | JamFriendly, Modal, Jazzy |
| Structure | Instrumental, Waltz |
### Tag Sources (Priority Order)
1. **LLM tags** (primary) - Genre tags from Claude batch API (`llm_tags.json`)
2. **Harmonic analysis** - Vibe tags computed from chord content:
   - `JamFriendly`: ≤5 unique chords, has I-IV-V, no complex extensions
   - `Modal`: Has bVII chord (e.g., F in key of G)
   - `Jazzy`: Has 7th, 9th, dim, aug, or slash chords
3. **MusicBrainz artist tags** (fallback) - Only used if LLM tags unavailable
4. **Trusted user overrides** - Downvotes from trusted users exclude bad tags
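A sketch of the harmonic-analysis rules over a song's Nashville numbers (thresholds mirror the list above; the helper itself is illustrative, not the real `tag_enrichment.py` API):

```python
import re

def vibe_tags(nashville: list[str]) -> list[str]:
    tags = []
    unique = set(nashville)
    # "Complex extensions": 7ths, 9ths, diminished, augmented, slash chords.
    has_complex = any(re.search(r"7|9|dim|aug|/", c) for c in unique)
    if len(unique) <= 5 and {"I", "IV", "V"} <= unique and not has_complex:
        tags.append("JamFriendly")
    if any(c.startswith("bVII") for c in unique):
        tags.append("Modal")   # e.g. F in the key of G
    if has_complex:
        tags.append("Jazzy")
    return tags

assert vibe_tags(["I", "IV", "V", "I"]) == ["JamFriendly"]
assert vibe_tags(["I", "bVII", "IV"]) == ["Modal"]
```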
### Data Files
| File | Purpose |
|------|---------|
| docs/data/llm_tags.json | LLM-generated tags (primary source, checked into git) |
| docs/data/tag_overrides.json | Trusted user tag exclusions (checked into git) |
| docs/data/artist_tags.json | Cached MusicBrainz artist tags (fallback) |
### Build Workflow
Tags are applied automatically during every index build:
| Where | What happens |
|-------|--------------|
| Local or CI | tag_enrichment.py reads llm_tags.json → applies genre tags |
| Local or CI | Harmonic analysis runs → applies vibe tags (JamFriendly, Modal) |
| Local or CI | tag_overrides.json exclusions remove bad tags |
Normal flow: LLM tags are pre-computed and checked into git. CI uses them directly.
Re-tagging all songs (local only, requires Anthropic API key):
```bash
# Submit batch job (takes ~2 hours to process)
uv run python scripts/lib/batch_tag_songs.py

# Check status
uv run python scripts/lib/batch_tag_songs.py --status <batch_id>

# Fetch results when complete
uv run python scripts/lib/batch_tag_songs.py --results <batch_id>

# Rebuild index and commit
./scripts/bootstrap --quick
git add docs/data/llm_tags.json && git commit -m "Refresh LLM tags"
```
Syncing trusted user votes (local only, requires Supabase credentials):
```bash
./scripts/utility sync-tag-votes
git add docs/data/tag_overrides.json && git commit -m "Sync tag overrides"
```
## query_artist_tags.py
Optimized MusicBrainz queries using LATERAL joins with indexed lookups:
```python
# Query tags for artists (0.9s for 900 artists)
from query_artist_tags import query_artist_tags_batch

results = query_artist_tags_batch(['Bill Monroe', 'Hank Williams'])
# Returns: {'Bill Monroe': [('bluegrass', 45), ('country', 12), ...], ...}
```
## add_song.py

Adds a `.pro` file to `sources/manual/parsed/` and rebuilds the index.
```bash
./scripts/utility add-song ~/Downloads/my_song.pro
./scripts/utility add-song song.pro --skip-index-rebuild
```
## process_submission.py / process_correction.py
Called by GitHub Actions when issues are approved.
**Trigger**: Issue labeled `song-submission` + `approved` (or `song-correction`)
Process:
1. Extract ChordPro from issue body (`` ```chordpro `` block)
2. Extract song ID from issue body
3. Write to `sources/manual/parsed/{id}.pro`
4. Add to `protected.txt` (for corrections)
5. Rebuild index
6. Commit changes
## Metadata Parsing
The build script handles both formats:
```chordpro
# Our format
{meta: title Song Name}
{meta: artist Artist}

# Standard ChordPro format
{title: Song Name}
{artist: Artist}
```
Both are extracted and normalized.
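A sketch of a parser that accepts both styles (the actual implementation is `parse_chordpro_metadata` in `build_index.py`; this regex and helper are illustrative):

```python
import re

META_RE = re.compile(r"\{(?:meta:\s*(\w+)\s+([^}]*)|(\w+):\s*([^}]*))\}")

def parse_metadata(content: str) -> dict[str, str]:
    meta = {}
    for m in META_RE.finditer(content):
        key = m.group(1) or m.group(3)        # "{meta: k v}" form or "{k: v}" form
        value = (m.group(2) if m.group(1) else m.group(4)).strip()
        meta.setdefault(key.lower(), value)   # first directive wins
    return meta

song = "{meta: title Song Name}\n{artist: Artist}"
assert parse_metadata(song) == {"title": "Song Name", "artist": "Artist"}
```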
## Adding a New Source
To add songs from a new source:
1. Create `sources/{source-name}/parsed/` directory
2. Add `.pro` files there
3. Run `./scripts/bootstrap --quick` to rebuild index
The build script automatically scans all `sources/*/parsed/` directories.