Etcd Troubleshooting Skill

This document defines the Claude Code skill for troubleshooting etcd issues on two-node OpenShift clusters with fencing topology. When activated, Claude becomes an expert etcd/Pacemaker troubleshooter capable of iterative diagnosis and remediation.

Skill Overview

This skill enables Claude to:

  • Validate and test access to cluster components via Ansible and OpenShift CLI
  • Iteratively collect diagnostic data from Pacemaker, etcd, and OpenShift
  • Analyze symptoms and identify root causes
  • Propose and execute remediation steps
  • Verify fixes and adjust approach based on results
  • Communicate findings and next steps clearly throughout the diagnostic process

Step-by-Step Procedure

1. Validate Access

1.1 Ansible Inventory Validation:

  • Check if deploy/openshift-clusters/inventory.ini exists
  • Verify the inventory file has valid cluster node entries
  • Test SSH connectivity to cluster nodes using Ansible ping module
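
For example, the checks above can be run from the workstation (a minimal sketch; adjust the inventory path if your layout differs):

# Confirm the inventory parses and lists the expected cluster nodes
ansible-inventory -i deploy/openshift-clusters/inventory.ini --list

# Test SSH connectivity to the cluster nodes with the ping module
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m ping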

1.2 OpenShift Cluster Access Validation:

  • Test direct cluster access with oc version
  • If direct access fails, check for deploy/openshift-clusters/proxy.env
  • If proxy.env exists, source it before running oc commands
  • Verify cluster access with oc get nodes
  • Remember proxy requirement for all subsequent oc commands
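
A minimal sketch of this access check, assuming proxy.env exports standard proxy environment variables (e.g., HTTPS_PROXY):

# Try direct access first
oc version
oc get nodes

# If direct access fails and the proxy file exists, source it and retry
source deploy/openshift-clusters/proxy.env
oc get nodes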

IMPORTANT: No Cluster Access Scenario

If OpenShift cluster API access is unavailable (which is expected when etcd is down), all diagnostics and remediation must be performed via Ansible using direct VM access. The troubleshooting workflow remains fully functional using only:

  • Ansible ad-hoc commands to cluster VMs
  • Ansible playbooks for diagnostics collection
  • Direct SSH access to nodes via Ansible

When cluster access is unavailable:

  • ✓ You can still diagnose and fix etcd issues completely
  • ✓ All Pacemaker operations work via Ansible
  • ✓ All etcd container operations work via Ansible (podman commands)
  • ✓ All logs are accessible via Ansible (journalctl commands)
  • ✗ Cannot query OpenShift operators or cluster-level resources
  • ✗ Cannot use oc commands for verification (use Ansible equivalents instead)

This is a normal scenario when etcd is down - proceed with VM-based troubleshooting.
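
For example, an etcd health check that would normally go through oc can instead be run directly on the VMs using the ad-hoc pattern shown later in this document (sketch):

# Ansible equivalent of a cluster-level etcd health check
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell \
  -a "podman exec etcd etcdctl endpoint health -w table" -b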

2. Collect Data

Choose Your Diagnostic Approach

There are two approaches to data collection:

Quick Manual Triage (recommended for initial assessment)

Start with a few targeted commands to assess the situation:

  • pcs status - Check Pacemaker cluster state and failed actions
  • podman ps -a --filter name=etcd - Verify etcd containers are running
  • etcdctl endpoint health - Confirm etcd health

This takes ~30 seconds and is often sufficient to identify simple issues (stale failures, container restarts, etc.) that can be fixed immediately with pcs resource cleanup etcd.
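
When triaging remotely, the same checks can be issued in a single Ansible ad-hoc command (sketch; if the etcd container is not running, the final check reports an error, which is itself useful signal):

ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell \
  -a "pcs status; podman ps -a --filter name=etcd; podman exec etcd etcdctl endpoint health -w table" -b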

Full Diagnostic Collection (for complex/unclear issues)

If quick triage reveals complex problems or the root cause is unclear, run the comprehensive diagnostic script:

./helpers/etcd/collect-all-diagnostics.sh

This script (~5-10 minutes):

  • Validates Ansible and cluster access automatically
  • Collects all VM-level diagnostics via Ansible playbook
  • Collects OpenShift cluster-level data (if accessible)
  • Saves everything to /tmp/etcd-diagnostics-<timestamp>/
  • Generates a DIAGNOSTIC_REPORT.txt with analysis commands

Use the full collection when:

  • Quick triage doesn't reveal the cause
  • Multiple components appear affected
  • You need to preserve diagnostic data for later analysis
  • The issue involves cluster ID mismatches or split-brain scenarios

Manual Collection Commands

For manual data collection, use the commands below.

IMPORTANT: Target the Correct Host Group

  • All etcd/Pacemaker commands must target the cluster_vms host group (the OpenShift cluster nodes)
  • VM lifecycle commands (start/stop VMs) target the hypervisor host
  • Use Ansible ad-hoc commands with -m shell or run playbooks that target cluster_vms
  • All commands on cluster VMs require sudo/become privileges

Example Ansible targeting:

# Correct - targets cluster VMs
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b

# Incorrect - would target hypervisor
ansible hypervisor -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b

Pacemaker Status (on cluster_vms):

sudo pcs status
sudo pcs resource status
sudo pcs constraint list
sudo crm_mon -1

Etcd Container Status (on cluster_vms):

sudo podman ps -a --filter name=etcd
sudo podman inspect etcd
sudo podman logs --tail 100 etcd

Etcd Cluster Health (on cluster_vms):

sudo podman exec etcd etcdctl member list -w table
sudo podman exec etcd etcdctl endpoint health -w table
sudo podman exec etcd etcdctl endpoint status -w table

System Logs (on cluster_vms):

sudo journalctl -u pacemaker --since "1 hour ago" -n 200
sudo journalctl -u corosync --since "1 hour ago" -n 100
sudo journalctl --grep etcd --since "1 hour ago" -n 200

Cluster Attributes (on cluster_vms):

sudo crm_attribute --query --name standalone_node
sudo crm_attribute --query --name learner_node
sudo crm_attribute --query --name force_new_cluster --lifetime reboot

OpenShift Cluster Status (use proxy.env if needed):

oc get nodes -o wide
oc get co etcd -o yaml
oc get pods -n openshift-etcd
oc get mcp master -o yaml
oc get co --no-headers | grep -v "True.*False.*False"
oc get events -n openshift-etcd --sort-by='.lastTimestamp' | tail -50

3. Analyze Collected Data

Look for these key issues:

Cluster Quorum:

  • Corosync quorum status
  • Pacemaker partition state
  • Node online/offline status

Etcd Health:

  • Member list consistency
  • Leader election status
  • Endpoint health
  • Learner vs. voting member status
  • Cluster ID mismatches between nodes

Resource State:

  • Etcd resource running status
  • Failed actions in Pacemaker
  • Resource constraint violations

Common Error Patterns:

  • Certificate expiration/rotation issues
  • Network connectivity problems
  • Split-brain scenarios
  • Fencing failures
  • Data corruption indicators

OpenShift Integration:

  • Etcd operator status and conditions
  • Unexpected etcd pods in the openshift-etcd namespace (these should not exist in the two-node fencing (TNF) topology)
  • Machine config pool degradation
  • Cluster operator degradation related to etcd

4. Provide Troubleshooting Procedure

Based on your analysis, provide:

  1. Diagnosis Summary: Clear statement of identified issues
  2. Root Cause Analysis: Likely causes based on symptoms
  3. Step-by-Step Remediation:
    • Ordered steps to resolve issues
    • Specific commands to execute
    • Expected outcomes at each step
    • Rollback procedures if available
  4. Verification Steps: How to confirm the issue is resolved
  5. Prevention Recommendations: How to avoid recurrence

Analysis Guidelines

Component-Specific Analysis Functions

Pacemaker Cluster Analysis

Key Indicators:

  • Quorum status: pcs status shows "quorum" or "no quorum"
  • Node status: online, standby, offline, UNCLEAN
  • Resource status: Started, Stopped, Failed, Master/Slave
  • Failed actions count and descriptions

Analysis Questions:

  1. Do both nodes show as online in the cluster?
  2. Is quorum achieved? (Should show quorum with 2 nodes)
  3. Are there any failed actions for the etcd resource?
  4. Is stonith enabled and configured correctly?
  5. Are there any location/order/colocation constraints preventing etcd from starting?

Common Issues:

  • No quorum: One node is offline or network partition - check corosync logs
  • Failed actions: Resource failed to start - examine failure reason and run pcs resource cleanup
  • UNCLEAN node: Fencing failed - check fence agent configuration and BMC access
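
A short sketch of the underlying checks on a cluster VM (corosync-quorumtool is assumed to be installed alongside corosync):

# Quorum and membership details from corosync
sudo corosync-quorumtool -s

# Failure details for the etcd resource; clean up once the root cause is fixed
sudo pcs resource failcount show etcd
sudo pcs resource cleanup etcd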

Etcd Container Analysis

Key Indicators:

  • Container state: running, stopped, exited
  • Exit code if stopped (0 = clean, non-zero = error)
  • Container restart count
  • Last log messages from container

Analysis Questions:

  1. Is the etcd container running on both nodes?
  2. If stopped, what was the exit code?
  3. What do the last 20 lines of container logs show?
  4. Are there certificate errors in the logs?
  5. Are there network connectivity errors?

Common Issues:

  • Container not found: Pacemaker hasn't created it yet or resource is stopped
  • Exit code 1: Check logs for specific error (certs, permissions, corruption)
  • Repeated restarts: Likely configuration or persistent error - check logs

Etcd Cluster Health Analysis

Key Indicators:

  • Member list: number of members, their status (started/unstarted)
  • Endpoint health: healthy/unhealthy, latency
  • Endpoint status: leader election, raft index, DB size
  • Cluster ID consistency across nodes

Analysis Questions:

  1. How many members are in the member list?
  2. Are all members started?
  3. Is there a leader elected?
  4. Do both nodes show the same cluster ID in CIB attributes?
  5. Are raft indices progressing or stuck?

Common Issues:

  • Different cluster IDs: Nodes are in different etcd clusters - need force-new-cluster
  • No leader: Split-brain or quorum loss - check network and member list
  • Unstarted member: Node hasn't joined yet or failed to join - check logs
  • 3+ members: Unexpected member entries from previous configs - need cleanup
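
A sketch of stale-member cleanup (the member ID is a placeholder taken from the member list output; only remove members you have confirmed are stale):

# Identify the stale member's ID from the table output
sudo podman exec etcd etcdctl member list -w table

# Remove it by ID
sudo podman exec etcd etcdctl member remove <member-id>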

CIB Attributes Analysis

Key Indicators:

  • standalone_node: which node (if any) is running alone
  • learner_node: which node (if any) is rejoining
  • force_new_cluster: which node should bootstrap new cluster
  • cluster_id: must match between nodes in healthy state
  • member_id: etcd member ID for each node

Analysis Questions:

  1. Are there conflicting attributes set (e.g., both standalone and learner)?
  2. Do cluster_id values match between both nodes?
  3. Is force_new_cluster set when it shouldn't be?
  4. Are learner/standalone attributes stuck from previous operations?

Common Issues:

  • Stuck learner_node: Previous rejoin didn't complete - may need manual cleanup
  • Mismatched cluster_id: Nodes diverged - need force-new-cluster recovery
  • Stale force_new_cluster: Attribute survived reboot when it shouldn't - manual cleanup needed
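
If an attribute is confirmed stale, it can be inspected and cleared with crm_attribute (sketch; verify against the resource agent's expectations before deleting anything):

# Inspect the suspect attributes
sudo crm_attribute --query --name learner_node
sudo crm_attribute --query --name force_new_cluster --lifetime reboot

# Clear a stale attribute (example: a leftover learner_node marker)
sudo crm_attribute --name learner_node --delete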

System Logs Analysis

Key Patterns to Search:

Pacemaker Logs:

  • "Failed" - resource failures
  • "fencing" - stonith operations
  • "could not" - operation failures
  • "timeout" - timing issues
  • "certificate" - cert problems

Corosync Logs:

  • "quorum" - quorum changes
  • "lost" - connection losses
  • "join" - membership changes
  • "totem" - ring protocol issues

Etcd Logs:

  • "panic" - fatal errors
  • "error" - general errors
  • "certificate" - cert issues
  • "member" - membership changes
  • "leader" - leadership changes
  • "database space exceeded" - quota issues
  • "mvcc: database space exceeded" - DB full

Troubleshooting Decision Tree

Use this decision tree to systematically diagnose issues:

START: Etcd not working as expected
│
├─> Can you access cluster VMs via Ansible?
│   ├─ NO → Fix Ansible connectivity first (check inventory, SSH keys, ProxyJump)
│   └─ YES → Continue
│
├─> Is Pacemaker running on both nodes? (systemctl status pacemaker)
│   ├─ NO → Start Pacemaker: systemctl start pacemaker
│   └─ YES → Continue
│
├─> Do both nodes show as online in pcs status?
│   ├─ NO → Check which node is offline
│   │      ├─ Node shows UNCLEAN → Fencing failed
│   │      │  └─ ACTION: Check stonith status, fence agent config, BMC access
│   │      └─ Node shows offline → Network or Pacemaker issue
│   │         └─ ACTION: Check corosync logs, network connectivity
│   └─ YES → Continue
│
├─> Does cluster have quorum? (pcs status shows "quorum")
│   ├─ NO → Investigate corosync/quorum issues
│   │      └─ ACTION: Check corosync logs for membership changes
│   └─ YES → Continue
│
├─> Is etcd resource started? (pcs resource status)
│   ├─ NO → Check for failed actions
│   │      ├─ Failed actions present → Resource failed to start
│   │      │  └─ ACTION: Check failure reason, fix root cause, run pcs resource cleanup
│   │      └─ No failed actions → Check constraints
│   │         └─ ACTION: Review pcs constraint list, check node attributes
│   └─ YES → Continue
│
├─> Is etcd container running on expected nodes? (podman ps)
│   ├─ NO → Container not started or crashed
│   │      └─ ACTION: Check podman logs for errors (certs, corruption, config)
│   └─ YES → Continue
│
├─> Check cluster IDs in CIB attributes on both nodes
│   ├─ DIFFERENT → Nodes are in separate etcd clusters!
│   │      └─ ACTION: Use force-new-cluster helper to recover
│   └─ SAME → Continue
│
├─> Check etcd member list (podman exec etcd etcdctl member list)
│   ├─ Lists unexpected members (>2 members) → Stale members from previous config
│   │      └─ ACTION: Remove stale members with etcdctl member remove
│   ├─ Shows "unstarted" members → Node hasn't joined yet
│   │      └─ ACTION: Check logs on unstarted node, may need cleanup and rejoin
│   └─ Lists 2 members, both started → Continue
│
├─> Check etcd endpoint health (podman exec etcd etcdctl endpoint health)
│   ├─ Unhealthy → Network or performance issues
│   │      └─ ACTION: Check network latency, system load, disk I/O
│   └─ Healthy → Continue
│
├─> Check etcd endpoint status (podman exec etcd etcdctl endpoint status)
│   ├─ No leader → Leadership election failing
│   │      └─ ACTION: Check logs for raft errors, verify member communication
│   ├─ Leader elected but errors in logs → Operational issues
│   │      └─ ACTION: Investigate specific errors (disk full, corruption, etc.)
│   └─ Leader elected, no errors → Cluster appears healthy
│
└─> If still experiencing issues → Check OpenShift integration
    ├─ Etcd operator degraded? (oc get co etcd)
    │  └─ ACTION: Review operator conditions, check for cert rotation, API issues
    └─ Check for related operator degradation (oc get co)
       └─ ACTION: Review degraded operators, may indicate cluster-wide issues

Error Pattern Matching Guidelines

When analyzing logs and status output, look for these common patterns:

Certificate Issues

Symptoms:

  • "certificate has expired" in logs
  • "x509: certificate" errors
  • etcd container exits immediately
  • TLS handshake failures

Diagnosis:

# Check cert expiration on nodes
sudo podman exec etcd ls -la /etc/kubernetes/static-pod-resources/etcd-certs/
# Look at recent cert-related log messages
sudo journalctl --grep certificate --since "2 hours ago"

Resolution:

  • Wait for automatic cert rotation (if in progress)
  • Verify etcd operator is healthy and can rotate certs
  • Check machine config pool status for cert updates
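
To check expiration dates directly, something like the following can be used (sketch; it assumes openssl is available inside the etcd container, and the certificate filename is illustrative - pick an actual file from the directory listing):

# List certificate files under the etcd certs directory (same path as the diagnosis step)
sudo podman exec etcd ls -R /etc/kubernetes/static-pod-resources/etcd-certs/

# Print the expiration date of a specific certificate
sudo podman exec etcd openssl x509 -enddate -noout -in <path-to-cert>.crt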

Split-Brain / Cluster ID Mismatch

Symptoms:

  • Different cluster_id in CIB attributes between nodes
  • Nodes can't join each other's cluster
  • "cluster ID mismatch" in logs
  • Etcd won't start on one or both nodes

Diagnosis:

# Compare cluster IDs
ansible cluster_vms -i inventory.ini -m shell \
  -a "crm_attribute --query --name cluster_id" -b

Resolution (RECOMMENDED):

# Use the automated force-new-cluster helper playbook
ansible-playbook helpers/force-new-cluster.yml -i deploy/openshift-clusters/inventory.ini

This playbook:

  • Takes snapshots for safety
  • Clears conflicting CIB attributes
  • Auto-detects the etcd leader (or uses the first node with a running etcd, or falls back to inventory order)
  • Removes follower from member list
  • Handles all cleanup and recovery steps automatically

Resource Failures / Failed Actions

Symptoms:

  • pcs status shows "Failed Resource Actions"
  • Resource shows as "Stopped" but should be running
  • Migration failures

Diagnosis:

# Check detailed failure info
sudo pcs resource status --full
sudo pcs resource failcount show etcd

Resolution:

  1. Identify and fix root cause (see logs)
  2. Run: sudo pcs resource cleanup etcd
  3. Verify resource starts successfully

Fencing Failures

Symptoms:

  • Node shows as "UNCLEAN" in pcs status
  • "fencing failed" in logs
  • Stonith errors
  • Cluster can't recover from node failure

Diagnosis:

# Check stonith status and configuration
sudo pcs stonith status
sudo pcs stonith show
# Check fence agent logs
sudo journalctl -u pacemaker --grep fence --since "1 hour ago"

Resolution:

  • Verify BMC/RedFish access from both nodes
  • Check fence agent credentials
  • Ensure network connectivity to BMC interfaces
  • Review stonith timeout settings
  • Test fence agent manually: sudo fence_redfish -a <bmc_ip> -l <user> -p <pass> -o status

Key Context

Cluster States

  • Standalone: Single node running as "cluster-of-one"
  • Learner: Node rejoining cluster, not yet voting member
  • Force-new-cluster: Flag to bootstrap new cluster from single node

Critical Attributes

  • standalone_node - Which node is running standalone
  • learner_node - Which node is rejoining as learner
  • force_new_cluster - Bootstrap flag (lifetime: reboot)
  • cluster_id - Etcd cluster ID (must match on both nodes)

Failure Scenarios

Graceful Disruption (4.19+):

  • Pacemaker intercepts reboot
  • Removes node from etcd cluster
  • Cluster continues as single node
  • Node resyncs and rejoins on return

Ungraceful Disruption (4.20+):

  • Unreachable node is fenced (powered off)
  • Surviving node restarts etcd as cluster-of-one
  • New cluster ID is assigned
  • Failed node discards old DB and resyncs on restart

Available Remediation Tools

IMPORTANT: Prefer Automated Tools

When dealing with cluster recovery scenarios (split-brain, mismatched cluster IDs, both nodes down), always use the automated helper playbook first before attempting manual recovery:

ansible-playbook helpers/force-new-cluster.yml -i deploy/openshift-clusters/inventory.ini

This playbook handles all the complex steps safely and is the recommended approach. Manual steps should only be used if the playbook is unavailable or fails.

Pacemaker Resource Cleanup

Use pcs resource cleanup to clear failed resource states and retry operations:

# Clean up etcd resource on specific node
sudo pcs resource cleanup etcd <node-name>

# Clean up etcd resource on all nodes
sudo pcs resource cleanup etcd

When to use:

  • After fixing underlying issues (certificates, network, etc.)
  • When resource shows as failed but root cause is resolved
  • To retry resource start after transient failures
  • After manual CIB attribute changes
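
A minimal verification sketch after a cleanup:

# Clear the failed state, give Pacemaker a moment, then confirm the resource and etcd are healthy
sudo pcs resource cleanup etcd
sleep 15
sudo pcs resource status
sudo podman exec etcd etcdctl endpoint health -w table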

Force New Cluster Helper

Ansible playbook at helpers/force-new-cluster.yml automates cluster recovery when both nodes have stopped etcd or cluster IDs are mismatched.

What it does:

  1. Disables stonith temporarily for safety
  2. Takes etcd snapshots on both nodes (if etcd not running)
  3. Clears conflicting CIB attributes (learner_node, standalone_node)
  4. Sets force_new_cluster attribute on detected leader node
  5. Removes follower from etcd member list (if etcd running on leader)
  6. Runs pcs resource cleanup etcd on both nodes
  7. Re-enables stonith
  8. Verifies recovery

Usage:

ansible-playbook helpers/force-new-cluster.yml -i deploy/openshift-clusters/inventory.ini

When to use:

  • Both nodes show different etcd cluster IDs
  • Etcd is not running on either node and won't start
  • After ungraceful disruptions that left cluster in inconsistent state
  • Manual recovery attempts have failed
  • Need to bootstrap from one node as new cluster

Precautions:

  • Only use when normal recovery procedures fail
  • Ensure follower node can afford to lose its etcd data
  • Detected leader will become the source of truth
  • This creates a NEW cluster, follower will resync from leader
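
A post-recovery verification sketch, run from the workstation once the playbook completes (both nodes should report the same cluster_id and a healthy two-member etcd):

ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell \
  -a "crm_attribute --query --name cluster_id" -b
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell \
  -a "podman exec etcd etcdctl member list -w table" -b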

Reference Documentation

You have access to these slash commands for detailed information:

  • /etcd:etcd-ops-guide:clustering - Cluster membership operations
  • /etcd:etcd-ops-guide:recovery - Recovery procedures
  • /etcd:etcd-ops-guide:monitoring - Monitoring and health checks
  • /etcd:etcd-ops-guide:failures - Failure scenarios
  • /etcd:etcd-ops-guide:data_corruption - Data corruption handling

Pacemaker documentation is available in the .claude/commands/etcd/pacemaker/ directory.

Podman-etcd Resource Agent Source: To consult the resource agent source code, first fetch it from upstream:

./helpers/etcd/fetch-podman-etcd.sh

Then read the fetched copy at .claude/commands/etcd/pacemaker/podman-etcd.txt.

Output Format

Provide clear, concise diagnostics with:

  • Markdown formatting for readability
  • Code blocks for commands
  • Clear sections for diagnosis, remediation, and verification
  • Actionable next steps
  • Links to relevant files when referencing code or logs

Important Notes

  • Gracefully handle cases where some data collection steps fail
  • Provide useful output even with partial data
  • Warn user clearly if proxy.env is required but missing
  • Always use sudo/become for commands on cluster VMs via Ansible
  • Be specific about which node to run commands on when relevant
  • CRITICAL: Always target the cluster_vms host group for all etcd/Pacemaker operations
    • Never target the hypervisor host for etcd-related commands
    • The hypervisor is only for VM lifecycle management (virsh, kcli commands)
    • All Pacemaker, etcd container, and cluster diagnostics run on cluster VMs
  • When cluster API access is unavailable, rely exclusively on Ansible-based VM access
    • This is normal and expected when etcd is down
    • All troubleshooting can be completed without oc commands