Commit graph

4 commits

Author SHA1 Message Date
max
49eba2fd60 fleet-health: proactive alerts on the Pi + per-host doctor reports
Two pieces of self-monitoring so the maintainer isn't the alarm:

(2) Receiver-side fleet health monitor

cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):

  silent      — host shipped in last 24h but has been quiet >30 min
  fatal-only  — actively shipping but every PUT 4xx
  unstamped   — shipping without X-Cis490-Code-Commit header

Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.

(3) Per-host doctor snapshots

Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:

  PUT /v1/host-health/<host>   →  /var/lib/cis490/host-health/<host>.json
  GET /v1/host-health          →  aggregate across all hosts

Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.

ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.

Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.

Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:48:31 -05:00
max
cc0c96953e version_gate: Forgejo as canonical commit source (no fs perms needed)
Initial git-log-based gate ran into a permission wall: the cis490
service user can't read /home/max/cis490/.git (ProtectHome=true +
home-dir mode). Switching the production source to the local Forgejo
HTTP API (already accessible to all WG peers, single source of truth
both lab hosts and the receiver pull from). When the maintainer
pushes new code to spectral/CIS490, the next 5-second cache refresh
sees the new commit and lab hosts can immediately ship under it.

VersionGate now takes either:
  - forgejo_url + repo_owner + repo_name + branch (+ optional
    auth_token for private repos): hits
    /api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n>
  - repo_path: dev-only fallback, runs `git log` locally

Local-git path retained for tests + the dev-only case.

receiver.toml.example gains forgejo_url/repo_owner/repo_name/branch
with auth_token commented; live-deployed receiver.toml on the Pi has
the spectral org + token.

Live state on the Pi: 41 valid hashes loaded, head=f8ad02b. Verified
end-to-end:
  bogus commit → 412 + remediation
  HEAD commit  → clears gate (fails downstream at sha-mismatch as
                 expected for the empty-body verify probe)

Test added: test_forgejo_backend_accepts_returned_commits stands up
a tiny canned-response HTTPServer in-process, exercises the parser
without depending on a live Forgejo instance. Brings test_version_gate
to 10 cases; total 158/158.
2026-05-01 01:42:45 -05:00
max
f8ad02b2d7 Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed)
Stops out-of-date lab hosts from polluting the dataset with episodes
generated by buggy code. The valid-commits set mirrors the maintainer's
working clone on the Pi automatically — when the maintainer pulls or
pushes a new commit, the receiver picks it up within the 5-second
cache TTL with no service restart.

Receiver changes:

- receiver/version_gate.py (new): VersionGate(repo_path, window).
  Each check() consults a frozenset of the last `window` commit
  hashes from `git -C <repo> log --format=%H -n <window>`, refreshed
  every 5s under a lock. Resilient to transient git failure (keeps
  prior cache so a flaky `git` doesn't lock out every shipper).

- receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check()
  before ingest. Rejects with:
    400 + remediation if header missing or malformed
    412 + remediation + your_commit + head_commit if not in window
  Remediation block is verbatim copy-pasteable into the lab-host
  shell:
    cd /opt/cis490 && sudo -u cis490 git pull origin main
    sudo /opt/cis490/scripts/install-lab-host.sh
    sudo systemctl restart cis490-orchestrator

- receiver/store.py: ingest_stream takes commit kwarg, stamps it on
  the index.jsonl row (new optional field). Backfilled rows from
  index_backfill.py also pull commit out of meta.json.

- receiver/config.py + etc/receiver.toml.example: new [version_gate]
  section. enabled=true, repo_path=/home/max/cis490, window=100 by
  default. Enabled toggle exists for emergency disable-and-collect.

Shipper changes:

- shipper/transport.py: ship_tarball() takes commit kwarg, sends
  X-Cis490-Code-Commit header. 412 maps to status='fatal' so the
  queue doesn't infinite-retry — operator must pull and reinstall
  before the next ship will succeed.

- shipper/queue.py: reads meta.json::code_version.commit per
  episode, passes through. On 412, logs the receiver's full
  remediation block at ERROR level so journalctl on the lab host
  shows exactly what to run.

Tests: 9 in test_version_gate (including 2 end-to-end via
starlette.testclient), 2 cover the boundary where new commits land
mid-cache and where missing-repo gracefully keeps prior cache.
157/157 total.

Index schema: existing rows stay valid (commit field is optional
on read). New rows from receiver-direct AND from index_backfill.py
include commit.
2026-05-01 01:38:50 -05:00
Maximus Gorog
83e111961d Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency
Implements docs/transport.md as a small Starlette app. The receiver streams
episode tarballs to disk, verifies sha256 against an X-Content-SHA256 header,
atomically renames into the store on success, and appends one row to a flat
index.jsonl. No DB. Idempotent re-PUTs return 200; conflicting bodies return
409. Optional bearer-token auth (mTLS terminates at Caddy in prod).

receiver/
  store.py        EpisodeStore: sha-verifying streaming ingest, atomic rename,
                  append-only index. No HTTP.
  app.py          make_app(): Starlette routes + bearer guard.
  config.py       ReceiverConfig.load(): TOML parser.
  __main__.py     uvicorn entrypoint, reads --config TOML.

tests/test_receiver.py — 13 tests via httpx.ASGITransport. Covers: 201 new,
200 idempotent replay, 409 conflict, 400 sha mismatch + cleanup, 400 missing/
short header, 400 bad id, 400 bad suffix, 413 too large, 401 bearer enforcement,
schema-version pass-through.

etc/cis490-receiver.service — systemd unit with hardening flags.
etc/receiver.toml.example — config template matching docs/deploy.md.

End-to-end smoke-tested with curl: 201 → 200 → 409 path verified, file
on disk, single index row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:04 -06:00