CIS490/receiver/__main__.py
max 49eba2fd60 fleet-health: proactive alerts on the Pi + per-host doctor reports
Two pieces of self-monitoring so the maintainer isn't the alarm:

(2) Receiver-side fleet health monitor

cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):

  silent      — host shipped in last 24h but has been quiet >30 min
  fatal-only  — actively shipping but every PUT 4xx
  unstamped   — shipping without X-Cis490-Code-Commit header
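
The three symptoms above can be sketched as a small classifier. This is
a minimal illustration, not the code in check_fleet_health.py: the
PutRecord shape, the detect_symptoms name, and the choice to judge
"unstamped" from the newest PUT only are all assumptions. Only the
24h / 30 min thresholds and the symptom names come from the description
above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class PutRecord:
    # Hypothetical shape of one index entry; the real index format is not shown.
    ts: datetime                 # when the PUT arrived (timezone-aware)
    status: int                  # HTTP status the receiver answered with
    code_commit: Optional[str]   # value of X-Cis490-Code-Commit, if any


ACTIVE_WINDOW = timedelta(hours=24)    # host counts as "known" if seen in this window
SILENT_AFTER = timedelta(minutes=30)   # quiet for longer than this means "silent"


def detect_symptoms(records: list, now: datetime) -> set:
    """Return the set of symptom names for one host's PUT records."""
    recent = [r for r in records if now - r.ts <= ACTIVE_WINDOW]
    if not recent:
        return set()  # never shipped in the last 24h: nothing to alert on

    newest = max(recent, key=lambda r: r.ts)
    if now - newest.ts > SILENT_AFTER:
        return {"silent"}  # a silent host cannot also be "actively shipping"

    symptoms = set()
    shipping = [r for r in recent if now - r.ts <= SILENT_AFTER]
    if all(400 <= r.status < 500 for r in shipping):
        symptoms.add("fatal-only")
    if not newest.code_commit:  # assumption: judge by the newest PUT only
        symptoms.add("unstamped")
    return symptoms
```

A host that last shipped two hours ago yields {"silent"}; a host whose
only PUT in the last 30 minutes was answered with 422 yields
{"fatal-only"}.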

Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
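
A minimal sketch of the hour-bucket dedup described above. The function
names and the in-memory "seen" set are assumptions for illustration; how
the real script persists already-fired keys between timer runs is not
shown here.

```python
from datetime import datetime, timezone


def dedup_key(host: str, symptom: str, when: datetime) -> tuple:
    """Build the (host, symptom, hour-bucket) key for one alert."""
    # Truncate to the hour in UTC so every run inside the same hour
    # produces the same key. `when` is expected to be timezone-aware.
    bucket = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return (host, symptom, bucket)


def should_fire(seen: set, host: str, symptom: str, when: datetime) -> bool:
    """Return True only the first time a key is observed, recording it."""
    key = dedup_key(host, symptom, when)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With a 5-minute timer, runs at 12:00, 12:05, and 12:55 all share the
bucket 2026-05-02T12, so only the first one fires; the 13:00 run falls
into a new bucket and fires again.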

(3) Per-host doctor snapshots

Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:

  PUT /v1/host-health/<host>   →  /var/lib/cis490/host-health/<host>.json
  GET /v1/host-health          →  aggregate across all hosts

Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
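
The atomic-write property mentioned above is commonly achieved by
writing to a temporary file in the target directory and renaming it over
the destination. The sketch below shows that pattern; it is an
assumption about the approach, not the receiver's actual handler, and
write_host_health is a hypothetical name. It also assumes the host name
has already been validated, so it cannot escape health_root.

```python
import json
import os
import tempfile
from pathlib import Path


def write_host_health(health_root: Path, host: str, report: dict) -> Path:
    """Write <health_root>/<host>.json so readers never see a partial file."""
    health_root.mkdir(parents=True, exist_ok=True)
    target = health_root / f"{host}.json"

    # The temp file must live in the same directory (same filesystem)
    # for os.replace to be an atomic rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=health_root, prefix=f".{host}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(report, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, target)  # atomically swap in the new snapshot
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

A concurrent GET that reads the file sees either the previous snapshot
or the new one in full, never a half-written JSON document.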

ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.

Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.

Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:48:31 -05:00


from __future__ import annotations

import argparse
import logging
import os
from pathlib import Path

import uvicorn

from .app import make_app
from .config import ReceiverConfig
from .store import EpisodeStore
from .version_gate import VersionGate


def main() -> None:
    parser = argparse.ArgumentParser(prog="cis490-receiver")
    parser.add_argument(
        "--config",
        default=os.environ.get("CIS490_RECEIVER_CONFIG", "/etc/cis490/receiver.toml"),
        help="path to receiver TOML config",
    )
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

    cfg = ReceiverConfig.load(args.config)
    store = EpisodeStore(
        store_root=cfg.store_root,
        incoming_root=cfg.incoming_root,
        index_path=cfg.index_path,
    )

    version_gate = None
    if cfg.version_gate_enabled:
        # Auto-detect /opt/cis490/.git as a fallback so a forgejo blip
        # at startup doesn't reject every PUT with not-in-window. The
        # receiver service has read access to /opt under
        # ProtectSystem=strict, and that path is where the production
        # install lands — so it's the natural local source of truth.
        repo_path = cfg.version_gate_local_repo
        if repo_path is None and Path("/opt/cis490/.git").is_dir():
            repo_path = Path("/opt/cis490")
        version_gate = VersionGate(
            repo_path=repo_path,
            window=cfg.version_gate_window,
            forgejo_url=cfg.version_gate_forgejo_url,
            repo_owner=cfg.version_gate_repo_owner,
            repo_name=cfg.version_gate_repo_name,
            branch=cfg.version_gate_branch,
            auth_token=cfg.version_gate_auth_token,
        )

    app = make_app(
        store=store,
        max_episode_bytes=cfg.max_episode_bytes,
        bearer_token=cfg.bearer_token,
        version_gate=version_gate,
        health_root=cfg.health_root,
    )
    uvicorn.run(
        app,
        host=cfg.listen_host,
        port=cfg.listen_port,
        log_config=None,
    )


if __name__ == "__main__":
    main()