Two pieces of self-monitoring so the maintainer isn't the alarm:
(2) Receiver-side fleet health monitor
cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):
  silent: host shipped in the last 24h but has been quiet for >30 min
  fatal-only: actively shipping, but every PUT returns 4xx
  unstamped: shipping without the X-Cis490-Code-Commit header
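
A minimal sketch of how the three symptoms could be classified. This is
not the code in check_fleet_health.py: the Ship record, its field names,
and the detect() helper are hypothetical, and "actively shipping" is
assumed here to mean "at least one PUT in the last 30 min".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Ship:
    """Hypothetical parsed index row; the real index.jsonl fields may differ."""
    host: str
    ts: datetime                 # when the receiver handled the PUT
    status: int                  # HTTP status returned for the PUT
    code_commit: Optional[str]   # X-Cis490-Code-Commit value, None if absent


def detect(ships, now):
    """Return a set of (host, symptom) pairs for the three symptoms."""
    by_host = {}
    for s in ships:
        # Only hosts seen in the last 24h are considered part of the fleet.
        if now - s.ts <= timedelta(hours=24):
            by_host.setdefault(s.host, []).append(s)

    found = set()
    for host, rows in by_host.items():
        recent = [r for r in rows if now - r.ts <= timedelta(minutes=30)]
        if not recent:
            found.add((host, "silent"))
            continue
        if all(400 <= r.status < 500 for r in recent):
            found.add((host, "fatal-only"))
        # Assumption: one unstamped PUT in the window is enough to flag.
        if any(r.code_commit is None for r in recent):
            found.add((host, "unstamped"))
    return found
```

A host can be both fatal-only and unstamped in this sketch; silent
excludes the other two because there are no recent PUTs to inspect.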
Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
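
The hour-bucket dedup can be sketched as follows. The function names,
the key format, and the in-memory `seen` set are assumptions for
illustration; the real tool may instead rebuild its state from
alerts.jsonl on each run.

```python
from datetime import datetime, timezone


def dedup_key(host, symptom, now):
    """Build a key that is identical for every run within the same UTC hour."""
    bucket = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return f"{host}|{symptom}|{bucket}"


def new_alerts(found, seen, now):
    """Return alert records not yet emitted this hour and record their keys.

    found: iterable of (host, symptom) pairs from the detectors
    seen:  set of dedup keys already emitted (mutated in place)
    """
    out = []
    for host, symptom in sorted(found):
        key = dedup_key(host, symptom, now)
        if key in seen:
            continue  # already alerted for this host/symptom in this hour
        seen.add(key)
        out.append({"ts": now.isoformat(), "host": host, "symptom": symptom})
    return out
```

With a 5-minute timer, up to 12 runs share one bucket, so only the first
run in each hour produces a record for a fault that persists.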
(3) Per-host doctor snapshots
Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:
PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json
GET /v1/host-health → aggregate across all hosts
Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
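
For reference, atomic-write semantics of the kind the tests check are
commonly implemented as "write a temp file, then rename". The sketch
below is an assumption about the approach, not the receiver's actual
handler; write_snapshot() is a hypothetical name, and it does not
validate the host string, which a real endpoint would need to do
before building a path from it.

```python
import json
import os
import tempfile
from pathlib import Path


def write_snapshot(directory, host, payload):
    """Write <host>.json so a concurrent reader never sees a partial file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{host}.json"
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=f".{host}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, target)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    return target
```

A GET that aggregates across hosts can then read every *.json file in
the directory without locking, since each file is either the old or the
new snapshot in full.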
ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.
Both timers are wired into install-lab-host.sh. The same loop also
enables the previously added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.
Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
27 lines
794 B
Desktop File
[Unit]
Description=CIS490 fleet health check (silent / fatal-only / unstamped detection)
Documentation=https://maxgit.wg/spectral/CIS490
After=cis490-receiver.service

[Service]
Type=oneshot
# Reads /var/lib/cis490/index.jsonl + journalctl, writes
# /var/lib/cis490/alerts.jsonl. journalctl needs the systemd-journal
# group; this unit runs as root so we don't have to fiddle with that.
User=root
Group=root
WorkingDirectory=/opt/cis490
ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/check_fleet_health.py
StandardOutput=journal
StandardError=journal

# Hardening: read /var/lib/cis490 for index + alerts, write the
# alerts file there.
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/cis490

[Install]
WantedBy=multi-user.target