CIS490

Author SHA1 Message Date


max

05bf785f0a

fleet-health: exit 0 when alerts found (don't mark unit failed)

The detector previously returned 1 on alerts, which made systemd
mark cis490-fleet-health.service as 'failed' every tick that found
a sick host. That's the wrong UX — a detector finding a fault is
working correctly, not crashing. The alert is the signal (via
WARNING log + alerts.jsonl); the unit's success state should mean
"the detector itself ran cleanly." Test added.
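
The exit-code convention described above can be sketched roughly as follows. This is an illustration only, not the actual check_fleet_health.py: the names `run_detectors` and `main` are hypothetical, and the alert shape is assumed.

```python
import logging

log = logging.getLogger("fleet-health")


def run_detectors():
    """Hypothetical stand-in for the real detectors; returns a list of alert dicts."""
    return [{"host": "example-host", "symptom": "fatal-only"}]


def main(detect=run_detectors):
    """Return the process exit code for one detector run."""
    try:
        alerts = detect()
    except Exception:
        # The detector itself broke: the only case that should make
        # systemd mark the unit as failed.
        log.exception("detector crashed")
        return 1
    for alert in alerts:
        # The alert is the signal; it does not influence the exit code.
        log.warning("ALERT host=%s symptom=%s", alert["host"], alert["symptom"])
    return 0
```

A script entry point would pass the result of `main()` to `sys.exit`, so the unit goes red only when the detector itself fails.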

Caught while live-deploying on the Pi: the first run found
elliott-thinkpad fatal-only at 943×4xx + 1425×5xx and correctly
emitted the alert, but systemd showed the unit red, which would
have sent operators chasing the wrong problem.

Side note: the same first run also caught a real bug — pycache for
receiver.store on /opt/cis490 was stale after I deployed the new
app.py + store.py from main, causing 1464 × 500 responses. Cleared
the pycache and the index immediately resumed growing (4465 →
4515 in 30 seconds). The detector earned its keep on the very
first cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 13:51:20 -05:00

max

49eba2fd60

fleet-health: proactive alerts on the Pi + per-host doctor reports

Two pieces of self-monitoring so the maintainer isn't the alarm:

(2) Receiver-side fleet health monitor

cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):

  silent      — host shipped in last 24h but has been quiet >30 min
  fatal-only  — actively shipping but every PUT 4xx
  unstamped   — shipping without X-Cis490-Code-Commit header
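
One way the three symptoms above might be classified is sketched below. The `HostStats` record and its fields are hypothetical (the real index format is not shown here), and treating "actively shipping" as "seen within the last 30 min" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class HostStats:
    """Hypothetical per-host summary derived from the receiver index."""
    host: str
    last_seen: datetime      # time of the most recent PUT from this host
    ok: int = 0              # recent PUTs that succeeded
    client_errors: int = 0   # recent PUTs answered with 4xx
    unstamped: int = 0       # recent PUTs lacking X-Cis490-Code-Commit


def symptoms(stats, now):
    """Return the symptom names that apply to one host at time `now`."""
    quiet = now - stats.last_seen
    found = []
    # silent: shipped within the last 24h but quiet for more than 30 min
    if timedelta(minutes=30) < quiet <= timedelta(hours=24):
        found.append("silent")
    active = quiet <= timedelta(minutes=30)  # assumed meaning of "actively shipping"
    # fatal-only: shipping, but no PUT succeeded and at least one got a 4xx
    if active and stats.ok == 0 and stats.client_errors > 0:
        found.append("fatal-only")
    # unstamped: shipping without the code-commit header
    if active and stats.unstamped > 0:
        found.append("unstamped")
    return found
```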

Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
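
The hour-bucket dedup can be sketched like this. The function names and the in-memory `seen` set are assumptions for illustration; how the real script persists already-fired keys between runs is not stated here.

```python
from datetime import datetime, timedelta, timezone


def dedup_key(host, symptom, when):
    """Key an alert by host, symptom, and the UTC hour it fired in."""
    hour_bucket = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return (host, symptom, hour_bucket)


def fresh_alerts(candidates, seen, when):
    """Return candidates not yet emitted in this hour bucket, recording them in `seen`."""
    out = []
    for host, symptom in candidates:
        key = dedup_key(host, symptom, when)
        if key not in seen:
            seen.add(key)
            out.append((host, symptom))
    return out
```

With a 5-minute timer, a sustained fault yields the same key for every tick within one clock hour, so only the first tick in each hour emits an alert.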

(3) Per-host doctor snapshots

Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:

  PUT /v1/host-health/<host>   →  /var/lib/cis490/host-health/<host>.json
  GET /v1/host-health          →  aggregate across all hosts

Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
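
The atomic-write semantics mentioned above are commonly achieved with a temp file plus rename. The sketch below shows that general technique; `write_host_health` is a hypothetical helper, not the receiver's actual code, and the hostname check is a minimal assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def write_host_health(root, host, report):
    """Atomically write `report` as <root>/<host>.json and return the path."""
    # Reject names that could escape the target directory.
    if not host or host in (".", "..") or "/" in host or "\\" in host:
        raise ValueError(f"bad host name: {host!r}")
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    target = root / f"{host}.json"
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=root, prefix=f".{host}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(report, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic replace: readers see the old file or the new one, never a partial write.
        os.replace(tmp, target)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    return target
```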

ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.

Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.

Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 13:48:31 -05:00

2 commits