A backend service is down

Runbook for a backend that has stopped responding: the finance API on :3000 timing out, the RunPod pod refusing connections, or a worker that has stopped. Covers per-host diagnosis and restart.

Symptom: A tool that talks to a specific backend keeps failing while the rest of Leif works fine — finance_health_check times out, runpod_* returns Connection refused, or a host’s worker won’t respond.

Likely cause

A single backend service is stopped (or, for RunPod, the pod is stopped or its endpoint has rotated). This is a one-service outage, not a Leif outage: if other namespaces work, Leif itself is fine. Several of these have a benign explanation that looks like a failure:

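Backend                          | "Down" signal                        | Usually means
-------------------------------- | ------------------------------------ | ------------------------------------------
Finance API (:3000 on nvrbackup) | finance_health_check connect timeout | The backend is stopped, not broken
RunPod GPU pod                   | runpod_* Connection refused          | Pod is stopped or its SSH endpoint rotated
Pricing worker                   | Imports stuck pending                | Worker process died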
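The first two rows depend on how a TCP connect fails. The helper below is a rough sketch, not one of Leif's tools. It makes a plain socket connect and reports whether the attempt succeeded, was refused, or timed out. The 3-second default timeout is an arbitrary assumption.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout', or 'error'.

    Sketch only: this checks whether the port accepts connections,
    not whether the application behind it is healthy.
    """
    try:
        # create_connection resolves the host and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No reply within `timeout`: host unreachable, packets dropped, or no answer.
        return "timeout"
    except OSError:
        # Anything else, e.g. a name that does not resolve or an unreachable network.
        return "error"
```

A "timeout" from probe("nvrbackup", 3000) lines up with the finance row. A "refused" against a RunPod SSH port lines up with the RunPod row. A raw connect result cannot tell a stopped service from a firewall drop, so use it as a hint alongside the service status, not as a verdict.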

Diagnose

Finance API (:3000)

finance_health_check()
finance_app_execute_command(command="systemctl status <finance-service> --no-pager")

A connect timeout from finance_health_check usually means the :3000 backend isn’t running. Confirm this with the service status from the finance_app_* tools, which are scoped to the finance tree (see finance).
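In the systemctl status output returned by finance_app_execute_command, the line that matters is Active:. The sketch below assumes you already have that output as a string and extracts the state and sub-state from it. The helper names are hypothetical, and `<finance-service>` remains a placeholder.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "     Active: inactive (dead)" or
# "     Active: active (running) since Mon 2024-01-01 ...".
_ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", re.MULTILINE)


def active_state(status_output: str) -> Optional[Tuple[str, str]]:
    """Return (state, sub_state) from `systemctl status` text, or None if absent."""
    match = _ACTIVE_RE.search(status_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


def backend_running(status_output: str) -> bool:
    """True only when systemd reports the unit as 'active (running)'."""
    return active_state(status_output) == ("active", "running")
```

"inactive (dead)" fits the "stopped, not broken" reading in the table. "failed (Result: ...)" means the unit crashed or exited with an error, and the journal is worth reading before you restart it.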

RunPod pod

runpod_get_system_info()

Connection refused means either the pod is stopped or RunPod has assigned it a new IP and port since you last used it, because the endpoint is ephemeral. This is not a tool fault. Start the pod and confirm its current endpoint before relying on runpod_* again (see Hosts).
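The endpoint is ephemeral, so compare the endpoint you last used with the one currently shown for the pod before you retry. This sketch assumes the endpoint comes as an SSH command of the form `ssh root@<ip> -p <port> ...`. The parser and the idea of keeping a cached endpoint are illustrative assumptions and are not part of the runpod_* tools.

```python
import shlex
from typing import NamedTuple


class Endpoint(NamedTuple):
    user: str
    host: str
    port: int


# Only these ssh options are understood; each consumes the next argument.
_OPTS_WITH_VALUE = {"-p", "-i", "-o"}


def parse_ssh_command(command: str) -> Endpoint:
    """Parse 'ssh user@host [-p PORT] [-i KEY] [-o OPT]' into an Endpoint (port defaults to 22)."""
    args = shlex.split(command)
    if not args or args[0] != "ssh":
        raise ValueError(f"not an ssh command: {command!r}")
    port = 22
    target = None
    it = iter(args[1:])
    for arg in it:
        if arg in _OPTS_WITH_VALUE:
            value = next(it, None)
            if value is None:
                raise ValueError(f"missing value for {arg}")
            if arg == "-p":
                port = int(value)
        elif not arg.startswith("-") and target is None:
            target = arg
    if target is None or "@" not in target:
        raise ValueError(f"no user@host target in {command!r}")
    user, host = target.split("@", 1)
    return Endpoint(user, host, port)


def endpoint_rotated(cached: Endpoint, current: Endpoint) -> bool:
    """True if the pod's host or port differs from the one last used."""
    return (cached.host, cached.port) != (current.host, current.port)
```

If endpoint_rotated returns True, the refusal came from the stale address, and the current endpoint should replace the cached one.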

Pricing worker

See the Pricing import landed nothing runbook. Its stuck-pending path covers the worker restart.

Fix

  • Finance API stopped: start or restart the service with the finance_app_* tools (scoped to the finance source tree), then re-check health.

    finance_app_execute_command(command="sudo systemctl restart <finance-service>")
    finance_health_check()
  • RunPod stopped or rotated: start the pod from the RunPod console, then confirm its current endpoint. Once the pod is up, runpod_get_system_info() should respond.

  • Generic host service: restart it with the host’s own tool family (local_restart_service on Leif, shtops_restart_service on SHTops, pve_lxc_exec for a container) rather than the generic remote_* shell. See Shell & file tools.
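Each fix ends with a health re-check, and a freshly restarted service can take a few seconds to bind its port. The sketch below shows that restart-then-poll step. It assumes the restart and the health read are available as Python callables, where the read raises on failure. `restart_and_wait` is a hypothetical helper, not a Leif tool binding, and the attempt count and delays are arbitrary.

```python
import time
from typing import Callable


def restart_and_wait(
    restart: Callable[[], None],
    check: Callable[[], object],
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Run restart() once, then poll check() with exponential backoff.

    Returns the first successful check() result and re-raises the
    last failure if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    restart()
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return check()
        except Exception:  # any failure counts as "not healthy yet"
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ... with the defaults
```

Passing `sleep` as a parameter keeps the helper testable without real waits. Re-raising the last error keeps the real failure visible (for example, a connect timeout that persists after the restart) instead of masking it.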

Verify

Re-run the health read for the specific backend and confirm a clean response:

finance_health_check()      # finance
runpod_get_system_info()    # runpod
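If an incident touched several backends, you can wrap the health reads so each one reports clean or failed and one failure does not hide the others. This sketch assumes each read is exposed as a Python callable that raises on failure. The mapping keys are illustrative labels for finance_health_check and runpod_get_system_info and do not reflect how Leif invokes its tools.

```python
from typing import Callable, Dict


def verify(checks: Dict[str, Callable[[], object]]) -> Dict[str, str]:
    """Run every health read and map backend name to 'ok' or 'FAIL: <error>'."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # record the failure and keep checking the rest
            results[name] = f"FAIL: {type(exc).__name__}: {exc}"
    return results
```

This only treats "returned without raising" as clean. If a health read can return a payload that itself reports a problem, that payload still needs a look.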