Runbooks
Symptom-keyed incident runbooks — the page you open when something is broken. Each one starts from what you observe and walks to the fix.
Each runbook is keyed by the symptom (what you actually see), not by the system, because at 11pm you know “the order list is stale,” not “the cron on nvrbackup didn’t fire.” Find the row in the triage table that matches your symptom and follow the link.
Every runbook follows the same shape: Symptom → Likely cause → Diagnose → Fix → Verify. The diagnosis steps are concrete tool calls, not “investigate.”
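As a rough illustration of that shape, the five sections can be modelled as a small record. This is only a sketch: the `Runbook` class, its field names, and the `is_actionable` check are hypothetical and not part of any tool described here, and the step strings are placeholders rather than real tool calls.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder phrases that do not count as a concrete diagnosis step (illustrative list).
VAGUE_STEPS = {"investigate", "look into it", "debug"}


@dataclass
class Runbook:
    """Hypothetical model of the Symptom -> Likely cause -> Diagnose -> Fix -> Verify shape."""
    symptom: str
    likely_cause: str
    diagnose: List[str] = field(default_factory=list)
    fix: List[str] = field(default_factory=list)
    verify: List[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        # A runbook needs at least one diagnosis step, and none may be a vague placeholder.
        return bool(self.diagnose) and all(
            step.strip().lower() not in VAGUE_STEPS for step in self.diagnose
        )

    def render(self) -> str:
        # Emit the sections in the fixed order every runbook follows.
        parts = [f"Symptom: {self.symptom}", f"Likely cause: {self.likely_cause}"]
        for title, steps in (("Diagnose", self.diagnose),
                             ("Fix", self.fix),
                             ("Verify", self.verify)):
            parts.append(f"{title}:")
            parts.extend(f"  {i}. {step}" for i, step in enumerate(steps, 1))
        return "\n".join(parts)


# Illustrative instance; the step text is a placeholder, not a documented procedure.
example = Runbook(
    symptom="The master order list / ESA buckets look stale",
    likely_cause="The cron on nvrbackup didn't fire",
    diagnose=["<concrete tool call goes here>"],
    fix=["<fix step goes here>"],
    verify=["<verification step goes here>"],
)
```

Calling `example.render()` produces the five sections in order, and a runbook whose only diagnosis step is "Investigate" fails `is_actionable()`.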
Triage table
| You’re seeing… | Runbook |
|---|---|
| Claude can’t call Leif tools at all; everything times out | Leif / MCP unreachable |
| A pricing import reports success but no products appear | Pricing import landed nothing |
| The master order list / ESA buckets look stale | RS ticket sync stale |
| `finance_health_check`, `runpod_*`, or a worker won’t respond | A backend service is down |
| A Cloudflare-fronted site (e.g. super-ht.com) redirect-loops | Cloudflare redirect loop |
| The docs site won’t build / deploy on Cloudflare Pages | Docs site deploy failure |
Before you start
A few things that turn “broken” into “not actually broken,” worth checking first:
- Right tool, right host. A huge share of “it’s broken” is a wrong-tool bug: `local_*` (Leif) vs `remote_*` (nvrbackup), `cwa_get` vs `cwa_search`. See the routing table.
- Known-benign noise. The `No module named 'models'` line in the RS sync cron log is expected; don’t chase it. CWA’s command channel returning `Output: "ERR"` is a quirk, not a failure.
- Ephemeral by design. A `Connection refused` from `runpod_*` usually means the pod is stopped, not that anything failed.
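The known-benign messages above could be captured in a small lookup so they are recognized before anyone opens a runbook. This is a minimal sketch under stated assumptions: the source labels (`rs_sync_cron`, `cwa_command`, `runpod`) and the `explain_if_benign` helper are hypothetical names invented for illustration. Only the quoted message fragments come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenignPattern:
    source: str  # hypothetical label for where the line was observed
    needle: str  # message fragment quoted on this page
    note: str    # why it is usually safe to ignore


KNOWN_BENIGN = (
    BenignPattern("rs_sync_cron", "No module named 'models'",
                  "expected in the RS sync cron log; don't chase it"),
    BenignPattern("cwa_command", 'Output: "ERR"',
                  "a quirk of the CWA command channel, not a failure"),
    BenignPattern("runpod", "Connection refused",
                  "usually means the pod is stopped; confirm before escalating"),
)


def explain_if_benign(source: str, line: str) -> Optional[str]:
    """Return a note if the line matches a known-benign pattern for that source, else None."""
    for pattern in KNOWN_BENIGN:
        # Match on source as well as text: the same message elsewhere may be a real error.
        if pattern.source == source and pattern.needle in line:
            return pattern.note
    return None
```

Matching on the source as well as the text is deliberate: a `Connection refused` from anything other than `runpod_*` is not covered by the note above and should still be treated as a real symptom.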
Related pages
- Hosts — host inventory and the tool-routing table
- Service Map — services and the tools that manage each
- Tools — the full tool catalog with per-namespace references