CI Optimize Command

Analyze CI workflow performance, identify bottlenecks, and interactively apply optimizations.

Usage: /ci:optimize <workflow-name>

Examples: /ci:optimize lint or /ci:optimize publish

Key Documentation

Full docs: 1
Docker caching: 2
Cost reduction: 3
Dockerfile optimization: 4
Cache is King guide: 5
Dashboard: https://app.blacksmith.sh

Workflow

Phase 1: Data Collection

Workflow to analyze: $1

Validate workflow exists:

gh workflow list --json name,path | jq -r '.[] | select(.name | test("$1"; "i")) | .path'

If no match found, list available workflows and ask user to specify correct name.
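The lookup and its fallback can be sketched as two small helpers. This is a minimal sketch, assuming it runs in a plain shell where the workflow name arrives as a function argument rather than through `$1` template substitution; `find_workflow_paths` and `list_workflow_names` are hypothetical names, and the name is passed with `--arg` so quoting in the jq filter does not depend on the shell.

```shell
# Print the path of every workflow whose name matches a case-insensitive regex.
# Reads the JSON array produced by `gh workflow list --json name,path` on stdin.
find_workflow_paths() {
  jq -r --arg name "$1" '.[] | select(.name | test($name; "i")) | .path'
}

# Print every workflow name, one per line, for the "no match" fallback.
list_workflow_names() {
  jq -r '.[].name'
}
```

Possible usage: `gh workflow list --json name,path | find_workflow_paths lint`. If the result is empty, pipe the same JSON into `list_workflow_names` and ask the user to pick one.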

Fetch recent runs (last 30):

gh run list --workflow "$1" --limit 30 --json databaseId,startedAt,updatedAt,conclusion,createdAt

Calculate baseline metrics:

Average duration: (updatedAt - startedAt) in seconds
P50, P90, P95 percentiles
Success rate: conclusion == "success" count / total
Select the 3 most recent successful runs for detailed analysis
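The baseline metrics above could be computed with a single jq filter. This is a sketch under assumptions: `baseline_metrics` is a hypothetical helper name, percentiles use the nearest-rank method (the source does not say which method to use), and the input contains at least one run with both timestamps set.

```shell
# Reads `gh run list --json startedAt,updatedAt,conclusion` output on stdin
# and prints one JSON object of baseline metrics (durations in seconds).
baseline_metrics() {
  jq '
    # Nearest-rank percentile over an ascending-sorted array (assumed method).
    def pct($p): .[(((length * $p / 100) | ceil) - 1) | if . < 0 then 0 else . end];
    # Per-run duration: updatedAt - startedAt, sorted ascending.
    (map((.updatedAt | fromdateiso8601) - (.startedAt | fromdateiso8601)) | sort) as $d
    | {
        runs: length,
        avg_seconds: (($d | add) / ($d | length)),
        p50: ($d | pct(50)),
        p90: ($d | pct(90)),
        p95: ($d | pct(95)),
        success_rate: ((map(select(.conclusion == "success")) | length) / length)
      }
  '
}
```

Possible usage: pipe the `gh run list` command shown earlier into `baseline_metrics`. The `success_rate` field is a fraction between 0 and 1; multiply by 100 for the report.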

Get detailed step timings for representative runs:

gh run view RUN_ID --json jobs

Extract for each job:

Job name, duration
Each step: name, startedAt, completedAt
Calculate step duration: (completedAt - startedAt)
Sort steps by duration descending

Identify top 5 slowest steps across analyzed runs (with percentage of total time)
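The step extraction and ranking could look roughly like this. It is a sketch, not a definitive implementation: `slowest_steps` is a hypothetical helper, "total time" is taken to mean the sum of all step durations in the run, and steps without both timestamps are skipped. Steps that never ran may carry placeholder timestamps and would need extra filtering.

```shell
# Reads `gh run view RUN_ID --json jobs` output on stdin and prints the N
# slowest steps (default 5) as tab-separated fields:
# seconds, percent of summed step time, job name, step name.
slowest_steps() {
  jq -r --argjson n "${1:-5}" '
    [ .jobs[] | .name as $job | .steps[]?
      | select(.startedAt != null and .completedAt != null)
      | { job: $job, step: .name,
          seconds: ((.completedAt | fromdateiso8601) - (.startedAt | fromdateiso8601)) } ]
    | (map(.seconds) | add) as $total
    | sort_by(-.seconds) | .[:$n][]
    | [ .seconds, ((.seconds * 100 / $total) | round), .job, .step ] | @tsv
  '
}
```

Possible usage: `gh run view RUN_ID --json jobs | slowest_steps 5`, repeated for each representative run.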

Phase 2: Analysis

Detect optimization opportunities by checking workflow YAML and run data:

Read workflow file from path obtained in Phase 1 using Read tool.

Check for common issues:

Missing concurrency control (High Impact):
- Pattern: No concurrency: + cancel-in-progress: true on PR-triggered workflows
- Expected improvement: -30% wasted runs on rapid commits
Missing timeouts (Medium Impact):
- Pattern: No timeout-minutes: on jobs
- Recommend: 3x average duration (from metrics)
- Expected improvement: Prevent indefinite hangs
Slow setup steps (High Impact):
- Pattern: Setup steps (Go, Node, Docker) taking >30s
- Check: Is Blacksmith action used? (useblacksmith/setup-*)
- Expected improvement: -40-60% setup time
Disabled fail-fast on matrix (Low Impact):
- Pattern: strategy: matrix: with fail-fast: false (GitHub Actions defaults fail-fast to true, so an omitted key is not an issue)
- Expected improvement: -5-10% on failures
Dockerfile optimization (Medium Impact):
- If workflow contains Docker build steps, check Dockerfile
- Pattern: Single-stage, COPY . . before dependency layers
- Expected improvement: -20-40% build time
Missing workflow telemetry (Observability):
- Pattern: No catchpoint/workflow-telemetry-action step
- Benefit: CPU/memory/disk metrics for debugging
Wrong runner size (Cost/Performance):
- Compare job duration vs CPU cores used
- If job completes in <2min on 8vcpu runner, suggest 2vcpu
- If job takes >5min on 2vcpu with high CPU usage, suggest 8vcpu
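Some of the pattern checks above can be approximated with plain text matching before reading the file in detail. The sketch below is a heuristic only: `check_workflow` is a hypothetical helper, it does not parse YAML, and it only looks for a top-level `concurrency:` key, so job-level concurrency settings would be reported as missing. Treat each line of output as a hint to verify by hand.

```shell
# Print one line per setting that appears to be absent from a workflow file.
# $1 is the workflow path obtained in Phase 1.
check_workflow() {
  # Top-level concurrency block (job-level blocks are not detected).
  grep -Eq '^concurrency:' "$1" || echo "missing: concurrency"
  # Any job-level timeout-minutes key.
  grep -Eq '^[[:space:]]+timeout-minutes:' "$1" || echo "missing: timeout-minutes"
  # Any Blacksmith setup action (useblacksmith/setup-*).
  grep -Eq 'uses:[[:space:]]*useblacksmith/setup-' "$1" || echo "missing: useblacksmith/setup-* action"
  # The workflow telemetry step.
  grep -Eq 'uses:[[:space:]]*catchpoint/workflow-telemetry-action' "$1" || echo "missing: workflow telemetry"
  return 0
}
```

A missing Blacksmith action only matters when the workflow has setup steps at all, so that line should be combined with the step timings from Phase 1.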

Flag unusual patterns for researcher:

Any step taking >120s (2 minutes)
Success rate <70%
Duration spread: P95 more than 50% above P50 (P95/P50 ratio >1.5)
Unexplained recent performance degradation (compare first 10 vs last 10 runs)
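The run-level flags could be derived from the same `gh run list` output. This is a sketch with several assumptions: `flag_anomalies` is a hypothetical helper, percentiles use the nearest-rank method, the variance check is read as P95 exceeding 1.5 times P50, and the input is ordered newest run first. The source gives no threshold for "performance degradation", so the sketch only prints the ratio and leaves the judgment to the reader. The per-step >120s check needs the jobs data and is not covered here.

```shell
# Reads `gh run list --json startedAt,updatedAt,conclusion` output on stdin
# (assumed newest first) and prints one line per triggered flag, followed by
# the ratio of the newest-10 average duration to the oldest-10 average.
flag_anomalies() {
  jq -r '
    def avg: add / length;
    # Nearest-rank percentile over an ascending-sorted array (assumed method).
    def pct($p): .[(((length * $p / 100) | ceil) - 1) | if . < 0 then 0 else . end];
    map((.updatedAt | fromdateiso8601) - (.startedAt | fromdateiso8601)) as $d
    | ($d | sort) as $s
    | ((map(select(.conclusion == "success")) | length) / length) as $rate
    | (if $rate < 0.7 then "flag: success rate below 70%" else empty end),
      (if ($s | pct(95)) > 1.5 * ($s | pct(50)) then "flag: P95 more than 50% above P50" else empty end),
      "newest-10 avg / oldest-10 avg = \(($d[:10] | avg) / ($d[-10:] | avg))"
  '
}
```

With fewer than 20 runs the two windows overlap, so the ratio is only meaningful on the full 30-run sample.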

If flagged, use Task tool with subagent_type=researcher:

Investigate performance issue in $1 workflow:

Problem detected:

[SPECIFIC_ISSUE: e.g., "Setup Go step taking 58s, expected <20s"]
Baseline: [METRICS]

Research tasks:

Search Blacksmith documentation for optimization of [SPECIFIC_COMPONENT]
- Start with: 1
- Focus on caching strategies, runner configuration
Find similar workflow optimization case studies
Check GitHub Actions best practices for [ISSUE_TYPE]

Return:

Root cause analysis
Specific optimization recommendations with expected improvements
Configuration examples
Relevant documentation links

Phase 3: Report Generation

Generate markdown report:

## Workflow Performance Report: $1

### Baseline (Last 30 runs)

- **Average Duration**: [X]m [Y]s
- **P50**: [X]m [Y]s | **P90**: [X]m [Y]s | **P95**: [X]m [Y]s
- **Success Rate**: [Z]% ([SUCCESS]/30 successful)

### Top 5 Slowest Steps

1. [Step Name] - [X]s ([Y]% of total) [SLOW if >30s]
2. [Step Name] - [X]s ([Y]% of total)
3. [Step Name] - [X]s ([Y]% of total)
4. [Step Name] - [X]s ([Y]% of total)
5. [Step Name] - [X]s ([Y]% of total)

### Optimization Opportunities

#### HIGH IMPACT

[For each high-impact issue:]

**[N]. [Issue Title]** (Expected: [IMPROVEMENT])

- **Current**: [CURRENT_STATE]
- **Issue**: [DESCRIPTION]
- **Fix**: [SPECIFIC_ACTION]
- **Docs**: [BLACKSMITH_LINK]

#### MEDIUM IMPACT

[Same format]

#### LOW IMPACT

[Same format]

### Blacksmith Dashboard Recommendations

Review these metrics manually at https://app.blacksmith.sh:

- **Cache Analytics**: Check hit rates for Go modules, Docker layers
- **Job Monitoring**: Step-by-step duration trends, failure patterns
- **Cost Breakdown**: Spending by job, identify expensive operations
- **Test Analytics** (Beta): Failing test trends

### Next Steps

[Generated in Phase 4 based on user selections]

Phase 4: Interactive Fixes

Present optimizations to user using AskUserQuestion tool:

Question: "Which optimizations would you like to apply to $1?"
Multi-select: true
Options: [Generated from detected issues, max 4 highest-priority items]

Example options:

"Add concurrency control (cancel old runs on new commits)"
"Add timeout protection (15min based on P95 duration)"
"Add fail-fast to matrix strategy"
"Add workflow telemetry for observability"

Based on selections, use Task tool with subagent_type=editor:

Apply the following CI optimizations to $1 ([PATH]):

User selected:

[OPTIMIZATION_1]
[OPTIMIZATION_2] ...

For each optimization:

Add concurrency control: Add after on: block:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

Add timeout: Add to each job:

jobs:
  job-name:
    timeout-minutes: [RECOMMENDED_VALUE]

Add fail-fast: Modify matrix strategy:

strategy:
  fail-fast: true
  matrix:
    ...

Add workflow telemetry: Add step to each job (after checkout):

- name: Workflow Telemetry
  uses: catchpoint/workflow-telemetry-action@v2
  with:
    proc_trace_sys_enable: true
    comment_on_pr: true

Ensure:

Preserve existing workflow structure
Maintain proper YAML indentation
Don't duplicate existing keys
Test YAML validity after changes


If Dockerfile optimization selected, provide specific diff:

```markdown
Optimize [DOCKERFILE_PATH] for better layer caching:

Current issues:
- [ISSUE_1: e.g., "COPY . . before dependency installation"]
- [ISSUE_2: e.g., "Single-stage build includes build artifacts"]

Apply this optimization pattern:
[Provide specific Dockerfile diff based on detected language/framework]

Example for Go application:
- Separate dependency download layer (COPY go.mod go.sum)
- Separate build assets layer (COPY static/ templates/)
- Multi-stage build (builder + runtime stages)
- Order layers by change frequency (least to most)

See: [4]
```

Phase 5: Completion

After editor agent completes:

Summarize changes made
Show expected improvements (duration reduction, cost savings)
Recommend follow-up:
- Run workflow to verify improvements
- Check Blacksmith dashboard after 5-10 runs for cache metrics
- Consider applying same optimizations to other workflows

Notes

Command is read-only until user approves specific fixes in Phase 4
Researcher agent triggered only for anomalies (not routine optimizations)
All metrics calculated from gh CLI JSON output using jq
Blacksmith dashboard features (cache hit rates, cost data) require manual review
If workflow name ambiguous, list all workflows and ask for clarification
For workflows with matrix jobs, analyze each matrix combination separately
