Commit graph

1 commit

Author SHA1 Message Date
max
98dcd4f9f8 lab-host: cis490-autoupdate.timer for self-healing on push
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required `git pull && install-
lab-host.sh` *on the host* — neither the on-device AI agent nor the
operator pulled in time, and from the receiver Pi I cannot reach in
(sshd off on the lab hosts).

Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.

Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
  to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
  protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
  failure); divergence + install failures exit non-zero so the
  journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
  when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
  through bring-up; no special "auto" flow.

The version-gate provides the quality boundary, so even if origin/
main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.

install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.

Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:59:31 -05:00