
Bluegrass-Songbook CLAUDE.md

Build Scripts (scripts/lib)

Python utilities for building the search index and managing songs.

Pipeline Overview

The build pipeline now uses works/ as the primary data source:

PRIMARY (current):
works/*/work.yaml + lead-sheet.pro  →  build_works_index.py  →  index.jsonl

LEGACY (migration complete):
sources/*/parsed/*.pro  →  migrate_to_works.py  →  works/

Key files:

  • build_works_index.py - PRIMARY: Builds index from works/ directory
  • work_schema.py - Defines work.yaml schema and validation
  • build_index.py - LEGACY: Builds from sources/ (kept for reference)

Local vs CI Operations

Some operations require external APIs/databases and only run locally. Others run everywhere.

| Operation | Where | Cache File | Notes |
|-----------|-------|------------|-------|
| Build index | Everywhere | - | Core build, always runs |
| Harmonic analysis | Everywhere | - | Computes JamFriendly, Modal tags from chords |
| MusicBrainz tags | Local only | artist_tags.json | Requires local MB database on port 5440 |
| Grassiness scores | Local only | bluegrass_recordings.json, bluegrass_tagged.json | Song-level bluegrass detection |
| Strum Machine URLs | Local only | strum_machine_cache.json | API rate limited (10 req/sec) |
| TuneArch fetch | Local only | - | Fetches new instrumentals |

How caching works:

  1. Run a local command to populate the cache (e.g., refresh-tags, strum-machine-match)
  2. Commit the cache file to git
  3. CI reads the cache during build; no external API calls are made (see the sketch below)
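
A minimal sketch of that read pattern, assuming a hypothetical load_cache helper (the actual build scripts may differ):

import json
from pathlib import Path

def load_cache(path: str) -> dict:
    """Return the committed cache, or an empty dict when the file is missing."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

artist_tags = load_cache("docs/data/artist_tags.json")  # same file CI reads at build time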

Cache files (commit these after updating):

  • docs/data/artist_tags.json - MusicBrainz artist → genre mappings
  • docs/data/strum_machine_cache.json - Song title → Strum Machine URL mappings
  • docs/data/bluegrass_recordings.json - Recordings by curated bluegrass artists
  • docs/data/bluegrass_tagged.json - Recordings with MusicBrainz bluegrass tags
  • docs/data/grassiness_scores.json - Computed grassiness scores per song

Files

scripts/lib/
├── build_works_index.py  # PRIMARY: Build index.jsonl from works/
├── work_schema.py        # work.yaml schema definition and validation
├── migrate_to_works.py   # Migrate sources/ → works/ structure
├── build_index.py        # LEGACY: Build index from sources/*.pro
├── build_posts.py        # Build blog posts manifest (posts.json)
├── enrich_songs.py       # Enrich .pro files (provenance, chord normalization)
├── tag_enrichment.py     # Tag enrichment (MusicBrainz + harmonic analysis)
├── query_artist_tags.py  # Optimized MusicBrainz artist tag queries
├── strum_machine.py      # Strum Machine API integration
├── fetch_tune.py         # Fetch tunes from TuneArch by URL
├── search_index.py       # Search index utilities and testing
├── add_song.py           # Add a song to manual/parsed/
├── process_submission.py # GitHub Action: process song-submission issues
├── process_correction.py # GitHub Action: process song-correction issues
├── chord_counter.py      # Chord statistics utility
├── loc_counter.py        # Lines of code counter for analytics
├── export_genre_suggestions.py  # Export genre suggestions for review
├── batch_tag_songs.py    # Batch tag songs using Claude API
├── fetch_tag_overrides.py # Fetch trusted user tag votes from Supabase
└── tagging/              # Song-level tagging system
    ├── CLAUDE.md         # Detailed docs for grassiness scoring
    ├── build_artist_database.py  # Build curated bluegrass artist database
    └── grassiness.py     # Bluegrass detection based on covers/tags

Quick Commands

# Full pipeline: build index from works/
./scripts/bootstrap --quick

# Build index with tag refresh (local only, requires MusicBrainz)
./scripts/bootstrap --quick --refresh-tags

# Add a song manually
./scripts/utility add-song /path/to/song.pro

# Count chord usage across all songs
./scripts/utility count-chords

# Refresh tags from MusicBrainz (LOCAL ONLY - requires MB database)
./scripts/utility refresh-tags

# Match songs to Strum Machine (LOCAL ONLY - ~30 min for 17k songs)
./scripts/utility strum-machine-match

Bootstrap Timing

Bootstrap now shows the total elapsed time and a per-stage breakdown:

Bootstrap complete! (45s total)
  Timing breakdown:
    - Enrichment: 12s
    - Build index: 33s

Performance Notes

The build pipeline uses pre-computed lookup dicts to avoid O(n*m) nested loops:

| Operation | Before | After | Savings |
|-----------|--------|-------|---------|
| Strum Machine "the" matching | 17k × 52k = 884M | 52k + 17k = 69k | 12,800× faster |
| Grassiness title lookup | 17k × 56k = 952M | 56k + 17k = 73k | 13,000× faster |

These lookups are built once before the main song loop, then used for O(1) dict access.
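
An illustrative sketch of the pattern (names and data are hypothetical, not the actual build code):

def normalize(title: str) -> str:
    return title.lower().strip()

cache = {"Angeline the Baker": "url-1", "Foggy Mountain Breakdown": "url-2"}
songs = ["Angeline The Baker", "Nine Pound Hammer"]

# Naive O(n*m): scan every cache key for every song
# matches = [next((v for k, v in cache.items() if normalize(k) == normalize(s)), None) for s in songs]

# Precomputed O(n + m): build the dict once, then one O(1) lookup per song
lookup = {normalize(k): v for k, v in cache.items()}
matches = [lookup.get(normalize(s)) for s in songs]  # ["url-1", None]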

enrich_songs.py

Enriches .pro files with provenance metadata and normalized chord patterns.

What It Does

  1. Adds provenance metadata (x_source, x_source_file, x_enriched)
  2. Normalizes chord patterns within sections of the same type
  3. Skips protected files (human corrections are authoritative)

Chord Pattern Normalization

Ensures consistent chord counts across verses/choruses of the same type:

Before:                          After:
Verse 1: [G]Your cheating...     Verse 1: [G]Your cheating...
Verse 2: When tears come...      Verse 2: [G]When tears come...
                                          ↑ Added from canonical

Algorithm (see the sketch after this list):

  1. Group sections by type (verse, chorus, etc.)
  2. Find canonical section (most chords, starts with chord)
  3. For sections missing first chord, add canonical's first chord
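
A minimal sketch of the algorithm, assuming sections arrive as (type, text) pairs (the real enrich_songs.py operates on parsed ChordPro):

import re
from collections import defaultdict

CHORD = re.compile(r"^\[[^\]]+\]")

def normalize_first_chords(sections: list[tuple[str, str]]) -> list[str]:
    texts = [text for _, text in sections]
    by_type = defaultdict(list)
    for i, (kind, _) in enumerate(sections):
        by_type[kind].append(i)                     # 1. group sections by type
    for idxs in by_type.values():
        # 2. canonical section: most chords, preferring one that starts with a chord
        canon = max(idxs, key=lambda i: (texts[i].count("["), bool(CHORD.match(texts[i]))))
        first = CHORD.match(texts[canon])
        if not first:
            continue
        for i in idxs:
            if not CHORD.match(texts[i]):           # 3. missing its first chord
                texts[i] = first.group(0) + texts[i]
    return texts

normalize_first_chords([("verse", "[G]Your cheating..."), ("verse", "When tears come...")])
# -> ["[G]Your cheating...", "[G]When tears come..."]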

Usage

# Enrich all sources
uv run python scripts/lib/enrich_songs.py

# Dry run (show what would change)
uv run python scripts/lib/enrich_songs.py --dry-run

# Single source only
uv run python scripts/lib/enrich_songs.py --source classic-country

# Single file (for testing)
uv run python scripts/lib/enrich_songs.py --file path/to/song.pro

Protected Files

Files listed in sources/{source}/protected.txt are skipped. These are human-corrected files that should not be auto-modified.


build_works_index.py (PRIMARY)

Generates docs/data/index.jsonl from the works/ directory.

What It Does

  1. Scans works/*/work.yaml for all works
  2. Reads work metadata (title, artist, composers, tags, parts)
  3. Reads lead sheet content from lead-sheet.pro
  4. Detects key and computes Nashville numbers
  5. Identifies tablature parts and includes their paths
  6. Applies fuzzy grouping to merge similar titles
  7. Matches to Strum Machine cache
  8. Outputs unified JSON index

Version Grouping

Songs are grouped by group_id for the version picker. The grouping algorithm (sketched in code below):

  1. Title normalization: Lowercase, remove accents, strip parenthetical suffixes like (Live), (C), (D)
  2. Article removal: Remove "the", "a", "an" so "Angeline the Baker" matches "Angeline Baker"
  3. Lyrics hash: First 200 chars of lyrics distinguish different songs with same title
  4. Fuzzy matching: Post-processing pass merges similar titles (85% similarity threshold):
    • Handles contractions: "Lovin'" ↔ "Loving"
    • Handles plurals: "Heartache" ↔ "Heartaches"
    • Handles compound words: "Home Town" ↔ "Hometown"
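
A minimal sketch of steps 1-2 (function name is illustrative; the fuzzy pass in step 4 would additionally compare titles with a similarity ratio such as difflib.SequenceMatcher):

import re
import unicodedata

ARTICLES = {"the", "a", "an"}

def normalize_title(title: str) -> str:
    t = unicodedata.normalize("NFKD", title.lower())
    t = "".join(c for c in t if not unicodedata.combining(c))   # remove accents
    t = re.sub(r"\s*\([^)]*\)\s*$", "", t)                      # strip "(Live)", "(C)", ...
    return " ".join(w for w in t.split() if w not in ARTICLES)  # drop articles

assert normalize_title("Angeline the Baker (C)") == normalize_title("Angeline Baker")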

Source Priority

When determining the work's source for attribution:

  1. x_source in lead-sheet content (highest priority) - e.g., {meta: x_source tunearch}
  2. Lead-sheet part provenance from work.yaml
  3. Tablature part provenance (fallback)

This ensures works with both a TuneArch lead sheet and a Banjo Hangout tab show "tunearch" as the source.
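
A sketch of that fallback chain (function name and the loaded work.yaml structure are assumptions):

import re

def resolve_source(lead_sheet: str, work: dict) -> str | None:
    # 1. x_source directive embedded in the lead-sheet content
    m = re.search(r"\{meta:\s*x_source\s+([^}\s]+)\}", lead_sheet)
    if m:
        return m.group(1)
    # 2. lead-sheet part provenance, then 3. tablature part provenance
    for part_type in ("lead-sheet", "tablature"):
        for part in work.get("parts", []):
            source = part.get("provenance", {}).get("source")
            if part.get("type") == part_type and source:
                return source
    return None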

Tablature Attribution

Tablature parts include provenance for frontend attribution:

"tablature_parts": [{
  "instrument": "banjo",
  "file": "data/tabs/red-haired-boy-banjo.otf.json",
  "source": "banjo-hangout",
  "source_id": "1687",
  "author": "schlange",
  "source_page_url": "https://www.banjohangout.org/tab/browse.asp?m=detail&v=1687",
  "author_url": "https://www.banjohangout.org/my/schlange"
}]

Strum Machine Matching

Matches songs to Strum Machine backing tracks using cached results:

  1. Normalize title (lowercase, strip parenthetical suffixes)
  2. Try exact match in cache
  3. Try without articles ("the", "a", "an")
  4. Try matching cache keys with articles removed

This handles cases like "Angeline Baker (C)" matching "angeline the baker" in the cache.
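
A minimal sketch of that cascade (helper names are illustrative):

import re

def normalize(title: str) -> str:
    return re.sub(r"\s*\([^)]*\)\s*$", "", title.lower()).strip()   # 1. normalize

def strip_articles(title: str) -> str:
    return " ".join(w for w in title.split() if w not in {"the", "a", "an"})

def match_strum_machine(title: str, cache: dict) -> str | None:
    t = normalize(title)
    if t in cache:                                   # 2. exact match
        return cache[t]
    if strip_articles(t) in cache:                   # 3. title without articles
        return cache[strip_articles(t)]
    by_stripped = {strip_articles(k): v for k, v in cache.items()}
    return by_stripped.get(strip_articles(t))        # 4. cache keys without articles

match_strum_machine("Angeline Baker (C)", {"angeline the baker": "url"})  # -> "url"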

Usage

uv run python scripts/lib/build_works_index.py           # Full build
uv run python scripts/lib/build_works_index.py --no-tags # Skip tag enrichment

Output Format

{
  "id": "blue-moon-of-kentucky",
  "title": "Blue Moon of Kentucky",
  "artist": "Patsy Cline",
  "composers": ["Bill Monroe"],
  "key": "C",
  "tags": ["ClassicCountry", "JamFriendly"],
  "content": "{meta: title...}[full ChordPro]",
  "tablature_parts": [
    {"type": "tablature", "instrument": "banjo", "path": "data/tabs/..."}
  ]
}

work_schema.py

Defines the work.yaml schema and validation.

Work Schema

@dataclass
class Part:
    type: str           # 'lead-sheet', 'tablature', 'abc-notation'
    format: str         # 'chordpro', 'opentabformat', 'abc'
    file: str           # Relative path to file
    default: bool       # Is this the default part?
    instrument: str     # Optional: 'banjo', 'fiddle', 'guitar'
    provenance: dict    # Source info (source, source_file, imported_at)

@dataclass
class Work:
    id: str             # Slug (e.g., 'blue-moon-of-kentucky')
    title: str
    artist: str
    composers: list[str]
    default_key: str
    tags: list[str]
    parts: list[Part]
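
An illustrative work.yaml conforming to this schema (all values are hypothetical):

id: blue-moon-of-kentucky
title: Blue Moon of Kentucky
artist: Bill Monroe
composers:
  - Bill Monroe
default_key: A
tags:
  - Bluegrass
parts:
  - type: lead-sheet
    format: chordpro
    file: lead-sheet.pro
    default: true
    provenance:
      source: classic-country
      source_file: blue_moon_of_kentucky.pro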

build_index.py (LEGACY)

Generates docs/data/index.jsonl from all .pro files in sources/.

What It Does

  1. Scans sources/*/parsed/*.pro for all songs
  2. Parses ChordPro metadata (title, artist, composer, version fields)
  3. Extracts lyrics (without chords) for search
  4. Detects key using diatonic heuristics
  5. Converts chords to Nashville numbers for chord search
  6. Computes group_id for song version grouping
  7. Deduplicates exact duplicates (same content hash)
  8. Outputs unified JSON index

Key Functions

def parse_chordpro_metadata(content: str) -> dict:
    """Extract {meta: key value} and {key: value} directives.
    Includes version fields: x_version_label, x_version_type, etc."""

def detect_key(chords: list[str]) -> tuple[str, str]:
    """Detect key from chord list. Returns (key, mode)."""

def to_nashville(chord: str, key_name: str) -> str:
    """Convert chord to Nashville number given a key."""

def extract_lyrics(content: str) -> str:
    """Extract plain lyrics without chord markers."""

def normalize_for_grouping(text: str) -> str:
    """Normalize text for grouping comparison.
    Lowercases, removes accents, strips common suffixes."""

def compute_group_id(title: str, artist: str) -> str:
    """Compute base group ID from normalized title + artist."""

def compute_lyrics_hash(lyrics: str) -> str:
    """Hash first 200 chars of normalized lyrics.
    Used to distinguish different songs with same title."""

Output Format

{
  "songs": [
    {
      "id": "songfilename",
      "title": "Song Title",
      "artist": "Artist Name",
      "composer": "Writer Name",
      "first_line": "First line of lyrics...",
      "lyrics": "Lyrics for search (500 chars)",
      "content": "Full ChordPro content",
      "key": "G",
      "mode": "major",
      "nashville": ["I", "IV", "V"],
      "progression": ["I", "I", "IV", "V", "I"],
      "group_id": "abc123def456_12345678",
      "chord_count": 3,
      "version_label": "Simplified",
      "version_type": "simplified",
      "arrangement_by": "John Smith"
    }
  ]
}

Version Grouping

Songs are grouped by group_id, which combines:

  1. Base hash: MD5 of normalized title + artist
  2. Lyrics hash: MD5 of first 200 chars of normalized lyrics

This ensures songs with the same title but different lyrics (different songs) get different group_ids, while true versions (same lyrics, different arrangements) share a group_id.
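
A sketch of how the two hashes combine (truncation lengths are inferred from the example group_id above; normalize stands in for normalize_for_grouping):

import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def compute_group_id(title: str, artist: str, lyrics: str) -> str:
    base = hashlib.md5(f"{normalize(title)}|{normalize(artist)}".encode()).hexdigest()[:12]
    lyr = hashlib.md5(normalize(lyrics)[:200].encode()).hexdigest()[:8]
    return f"{base}_{lyr}"   # e.g. "abc123def456_12345678"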

Deduplication

Exact duplicates (identical content) are removed at build time. The first occurrence is kept.

Key Detection Algorithm

Scores each possible key by (see the sketch after this list):

  1. How many song chords fit the key's diatonic scale
  2. Bonus weight for tonic chord appearances
  3. Tie-breaking: prefer common keys (G, C, D, A, E, Am, Em)
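
A sketch of the heuristic (diatonic sets abbreviated; scoring weights are illustrative):

DIATONIC = {
    "G": {"G", "Am", "Bm", "C", "D", "Em"},
    "C": {"C", "Dm", "Em", "F", "G", "Am"},
    # ...one entry per candidate key
}
COMMON_KEYS = {"G", "C", "D", "A", "E", "Am", "Em"}

def detect_key(chords: list[str]) -> str:
    """The real detect_key also reports the mode."""
    def score(key: str):
        fits = sum(1 for c in chords if c in DIATONIC[key])   # 1. diatonic fit
        tonic_bonus = 2 * chords.count(key)                   # 2. tonic weight
        return (fits + tonic_bonus, key in COMMON_KEYS)       # 3. tie-break on common keys
    return max(DIATONIC, key=score)

detect_key(["G", "C", "D", "G"])  # -> "G"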

Tag System

Tags are added to songs during index build via tag_enrichment.py.

Tag Taxonomy

| Category | Tags |
|----------|------|
| Genre | Bluegrass, ClassicCountry, OldTime, Gospel, Folk, HonkyTonk, Outlaw, Rockabilly, etc. |
| Vibe | JamFriendly, Modal, Jazzy |
| Structure | Instrumental, Waltz |

Tag Sources (Priority Order)

  1. LLM tags (primary) - Genre tags from Claude batch API (llm_tags.json)
  2. Harmonic analysis - Vibe tags computed from chord content (sketched after this list):
    • JamFriendly: ≤5 unique chords, has I-IV-V, no complex extensions
    • Modal: Has bVII chord (e.g., F in key of G)
    • Jazzy: Has 7th, 9th, dim, aug, or slash chords
  3. MusicBrainz artist tags (fallback) - Only used if LLM tags unavailable
  4. Trusted user overrides - Downvotes from trusted users exclude bad tags
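
A minimal sketch of the harmonic-analysis rules (the real logic lives in tag_enrichment.py; the regex is an approximation):

import re

COMPLEX = re.compile(r"7|9|dim|aug|/")   # extensions and slash chords

def vibe_tags(nashville: list[str], chords: list[str]) -> list[str]:
    tags = []
    jazzy = any(COMPLEX.search(c) for c in chords)
    if len(set(chords)) <= 5 and {"I", "IV", "V"} <= set(nashville) and not jazzy:
        tags.append("JamFriendly")
    if "bVII" in nashville:
        tags.append("Modal")
    if jazzy:
        tags.append("Jazzy")
    return tags

vibe_tags(["I", "IV", "V", "bVII"], ["G", "C", "D", "F"])  # -> ["JamFriendly", "Modal"]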

Data Files

| File | Purpose |
|------|---------|
| docs/data/llm_tags.json | LLM-generated tags (primary source, checked into git) |
| docs/data/tag_overrides.json | Trusted user tag exclusions (checked into git) |
| docs/data/artist_tags.json | Cached MusicBrainz artist tags (fallback) |

Build Workflow

Tags are applied automatically during every index build:

| Where | What happens |
|-------|--------------|
| Local or CI | tag_enrichment.py reads llm_tags.json → applies genre tags |
| Local or CI | Harmonic analysis runs → applies vibe tags (JamFriendly, Modal) |
| Local or CI | tag_overrides.json exclusions remove bad tags |

Normal flow: LLM tags are pre-computed and checked into git. CI uses them directly.

Re-tagging all songs (local only, requires Anthropic API key):

# Submit batch job (takes ~2 hours to process)
uv run python scripts/lib/batch_tag_songs.py

# Check status
uv run python scripts/lib/batch_tag_songs.py --status <batch_id>

# Fetch results when complete
uv run python scripts/lib/batch_tag_songs.py --results <batch_id>

# Rebuild index and commit
./scripts/bootstrap --quick
git add docs/data/llm_tags.json && git commit -m "Refresh LLM tags"

Syncing trusted user votes (local only, requires Supabase credentials):

./scripts/utility sync-tag-votes
git add docs/data/tag_overrides.json && git commit -m "Sync tag overrides"

query_artist_tags.py

Optimized MusicBrainz queries using LATERAL joins with indexed lookups:

# Query tags for artists (0.9s for 900 artists)
from query_artist_tags import query_artist_tags_batch
results = query_artist_tags_batch(['Bill Monroe', 'Hank Williams'])
# Returns: {'Bill Monroe': [('bluegrass', 45), ('country', 12), ...], ...}

add_song.py

Adds a .pro file to sources/manual/parsed/ and rebuilds the index.

./scripts/utility add-song ~/Downloads/my_song.pro
./scripts/utility add-song song.pro --skip-index-rebuild

process_submission.py / process_correction.py

Called by GitHub Actions when issues are approved.

Trigger: Issue labeled song-submission + approved (or song-correction)

Process:

  1. Extract ChordPro from issue body (```chordpro block)
  2. Extract song ID from issue body
  3. Write to sources/manual/parsed/{id}.pro
  4. Add to protected.txt (for corrections)
  5. Rebuild index
  6. Commit changes

Metadata Parsing

The build script handles both formats:

# Our format
{meta: title Song Name}
{meta: artist Artist}

# Standard ChordPro format
{title: Song Name}
{artist: Artist}

Both are extracted and normalized.
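
A minimal sketch of parsing both styles (parse_chordpro_metadata in build_index.py is richer, e.g. it also captures x_* version fields):

import re

def parse_metadata(content: str) -> dict:
    meta = {}
    for line in content.splitlines():
        m = (re.match(r"\{meta:\s*(\w+)\s+(.*)\}", line)   # {meta: title Song Name}
             or re.match(r"\{(\w+):\s*(.*)\}", line))      # {title: Song Name}
        if m:
            meta[m.group(1)] = m.group(2).strip()
    return meta

parse_metadata("{meta: title Song Name}\n{artist: Artist}")
# -> {"title": "Song Name", "artist": "Artist"}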

Adding a New Source

To add songs from a new source:

  1. Create sources/{source-name}/parsed/ directory
  2. Add .pro files there
  3. Run ./scripts/bootstrap --quick to rebuild index

The build script automatically scans all sources/*/parsed/ directories.