Runbooks

Symptom-keyed incident runbooks — the page you open when something is broken. Each one starts from what you observe and walks to the fix.

Each runbook is keyed by the symptom (what you actually see), not by the system, because at 11pm you know “the order list is stale,” not “the cron on nvrbackup didn’t fire.” Find the row that matches and follow the link.

Every runbook follows the same shape: Symptom → Likely cause → Diagnose → Fix → Verify. The diagnosis steps are concrete tool calls, not “investigate.”
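One way to picture that shape is as a small record with one field per stage. The sketch below is illustrative only: the `Runbook` class, its field names, and `missing_stages` are hypothetical and not part of any tool described here.

```python
from dataclasses import dataclass, field

# The five stages every runbook walks through, in order.
STAGES = ("Symptom", "Likely cause", "Diagnose", "Fix", "Verify")


@dataclass
class Runbook:
    """Hypothetical container for one runbook; fields mirror the stages."""

    symptom: str
    likely_cause: str
    diagnose: list[str] = field(default_factory=list)  # concrete tool calls
    fix: list[str] = field(default_factory=list)
    verify: list[str] = field(default_factory=list)

    def missing_stages(self) -> list[str]:
        """Return the names of stages that are still empty."""
        values = (self.symptom, self.likely_cause,
                  self.diagnose, self.fix, self.verify)
        return [name for name, value in zip(STAGES, values) if not value]


# A draft that only has its symptom filled in so far.
draft = Runbook(symptom="The master order list looks stale", likely_cause="")
print(draft.missing_stages())
```

A check like `missing_stages` is one possible way to catch a runbook that stops at “investigate” instead of listing concrete steps.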

Triage table

You’re seeing…	Runbook
Claude can’t call Leif tools at all; everything times out	Leif / MCP unreachable
A pricing import reports success but no products appear	Pricing import landed nothing
The master order list / ESA buckets look stale	RS ticket sync stale
`finance_health_check`, `runpod_*`, or a worker won’t respond	A backend service is down
A Cloudflare-fronted site (e.g. `super-ht.com`) redirect-loops	Cloudflare redirect loop
The docs site won’t build / deploy on Cloudflare Pages	Docs site deploy failure
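The table can also be read as a lookup from an observation to a runbook title. The following is a minimal sketch under assumptions: the runbook titles come from the table above, but the keywords and the `find_runbook` helper are made up for illustration and are not an official index.

```python
from typing import Optional

# (keywords, runbook title). Titles are from the triage table;
# the keywords are illustrative guesses, not an authoritative mapping.
TRIAGE = [
    (("leif", "times out"), "Leif / MCP unreachable"),
    (("pricing import",), "Pricing import landed nothing"),
    (("stale", "esa buckets"), "RS ticket sync stale"),
    (("finance_health_check", "runpod_", "worker"), "A backend service is down"),
    (("redirect",), "Cloudflare redirect loop"),
    (("build", "deploy"), "Docs site deploy failure"),
]


def find_runbook(observation: str) -> Optional[str]:
    """Return the first runbook whose keywords appear in the observation."""
    text = observation.lower()
    for keywords, runbook in TRIAGE:
        if any(keyword in text for keyword in keywords):
            return runbook
    return None  # no row matches; fall back to reading the table by hand


print(find_runbook("The order list is stale"))
print(find_runbook("super-ht.com redirect-loops"))
```

Rows are checked in table order, so an observation that mentions several symptoms resolves to the earliest matching row.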

Before you start

A few things that turn “broken” into “not actually broken,” worth checking first:

Right tool, right host. A huge share of “it’s broken” is a wrong-tool bug: `local_*` (Leif) vs `remote_*` (nvrbackup), `cwa_get` vs `cwa_search`. See the routing table.
Known-benign noise. The `No module named 'models'` line in the RS sync cron log is expected, so don’t chase it. CWA’s command channel returning `Output: "ERR"` is a quirk, not a failure.
Ephemeral by design. A `Connection refused` from `runpod_*` usually means the pod is stopped, not that anything failed.
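The known-benign cases above could be captured as a small filter applied before anyone starts chasing a log line. This is a sketch under assumptions: the matched strings are quoted from the checklist, while the source labels (`rs_sync_cron`, `cwa_command`, `runpod`) and the `explain_if_benign` helper are hypothetical names invented for the example.

```python
from typing import Optional

# (source label, substring to match, why it is usually harmless).
# Substrings come from the checklist; the source labels are hypothetical.
BENIGN = [
    ("rs_sync_cron", "No module named 'models'",
     "expected line in the RS sync cron log"),
    ("cwa_command", 'Output: "ERR"',
     "CWA command-channel quirk, not a failure"),
    ("runpod", "Connection refused",
     "usually means the pod is stopped"),
]


def explain_if_benign(source: str, line: str) -> Optional[str]:
    """Return a reason if the line is known-benign for that source, else None."""
    for benign_source, needle, reason in BENIGN:
        if source == benign_source and needle in line:
            return reason
    return None


print(explain_if_benign("rs_sync_cron",
                        "ModuleNotFoundError: No module named 'models'"))
print(explain_if_benign("leif", "Connection refused"))
```

Matching on the source as well as the text keeps the filter narrow: a `Connection refused` from anything other than a pod is not waved through.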

Related pages

Hosts: host inventory and the tool-routing table
Service Map: services and the tools that manage each
Tools: the full tool catalog with per-namespace references
