Runbooks

Symptom-keyed incident runbooks — the page you open when something is broken. Each one starts from what you observe and walks to the fix.

Each runbook is keyed by the symptom (what you actually see), not by the system, because at 11pm you know “the order list is stale,” not “the cron on nvrbackup didn’t fire.” Find the row in the triage table that matches what you see, then follow the link.

Every runbook follows the same shape: Symptom → Likely cause → Diagnose → Fix → Verify. The diagnosis steps are concrete tool calls, not “investigate.”
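The shared shape can be sketched as a small data structure. This is an illustration only, not how the runbooks are actually stored: the `Runbook` class, the `VAGUE_WORDS` list, and the tool name `remote_example_tool` are all hypothetical. It shows one way to check a draft for empty sections and for diagnose steps that say “investigate” instead of naming a concrete call.

```python
from dataclasses import dataclass, field

# Section order every runbook follows, per the text above.
SECTIONS = ("symptom", "likely_cause", "diagnose", "fix", "verify")

# Words that mark a step as too vague to act on (illustrative list, an assumption).
VAGUE_WORDS = ("investigate", "look into", "check things")


@dataclass
class Runbook:
    symptom: str
    likely_cause: str
    diagnose: list[str] = field(default_factory=list)
    fix: str = ""
    verify: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name in SECTIONS if not getattr(self, name)]

    def vague_steps(self) -> list[str]:
        """Return diagnose steps that contain a vague word instead of a concrete call."""
        return [
            step for step in self.diagnose
            if any(word in step.lower() for word in VAGUE_WORDS)
        ]


# Hypothetical draft: remote_example_tool is a placeholder, not a real tool.
draft = Runbook(
    symptom="The master order list looks stale",
    likely_cause="The sync cron did not fire",
    diagnose=["Run remote_example_tool to read the cron log", "Investigate the sync"],
    fix="",
    verify="Confirm the order list shows fresh data",
)
print(draft.missing_sections())  # ['fix']
print(draft.vague_steps())       # ['Investigate the sync']
```

A draft that returns two empty lists has all five sections filled in and no obviously vague diagnosis steps; it says nothing about whether the steps are correct.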

Triage table

You’re seeing…                                               | Runbook
Claude can’t call Leif tools at all; everything times out    | Leif / MCP unreachable
A pricing import reports success but no products appear      | Pricing import landed nothing
The master order list / ESA buckets look stale               | RS ticket sync stale
finance_health_check, runpod_*, or a worker won’t respond    | A backend service is down
A Cloudflare-fronted site (e.g. super-ht.com) redirect-loops | Cloudflare redirect loop
The docs site won’t build / deploy on Cloudflare Pages       | Docs site deploy failure

Before you start

A few things that turn “broken” into “not actually broken,” worth checking first:

  • Right tool, right host. A huge share of “it’s broken” is a wrong-tool bug — local_* (Leif) vs remote_* (nvrbackup), cwa_get vs cwa_search. See the routing table.
  • Known-benign noise. The No module named 'models' line in the RS sync cron log is expected — don’t chase it. CWA’s command channel returning Output: "ERR" is a quirk, not a failure.
  • Ephemeral by design. A Connection refused from runpod_* usually means the pod is stopped, not that anything failed.

Related pages

  • Hosts: host inventory and the tool-routing table
  • Service Map: services and the tools that manage each
  • Tools: the full tool catalog with per-namespace references
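The known-benign checks under “Before you start” can be written down as a small lookup, so a message is compared against the expected noise before anyone chases it. This is a minimal sketch: the source labels (`rs_sync_cron`, `cwa_command`, `runpod`) and the function name are hypothetical, while the matched substrings and notes come from the checklist.

```python
from typing import Optional

# (source label, substring, note). The source labels are hypothetical names for
# this sketch; the substrings and notes are taken from the checklist.
KNOWN_BENIGN = [
    ("rs_sync_cron", "No module named 'models'", "expected line in the RS sync cron log"),
    ("cwa_command", 'Output: "ERR"', "CWA command-channel quirk, not a failure"),
    ("runpod", "Connection refused", "pod is probably stopped; runpod is ephemeral by design"),
]


def benign_note(source: str, message: str) -> Optional[str]:
    """Return a note if the message matches known-benign noise for that source, else None."""
    for known_source, needle, note in KNOWN_BENIGN:
        if source == known_source and needle in message:
            return note
    return None


print(benign_note("rs_sync_cron", "ModuleNotFoundError: No module named 'models'"))
print(benign_note("runpod", "request timed out"))
```

A `None` result only means the message is not on the benign list; it still needs the matching runbook. Matching on the source as well as the text is deliberate, since a `Connection refused` from anything other than `runpod_*` is not covered by the checklist.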