CIS490/scripts/auto-update.sh
max 98dcd4f9f8 lab-host: cis490-autoupdate.timer for self-healing on push
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required `git pull && install-
lab-host.sh` *on the host* — neither the on-device AI agent nor the
operator pulled in time, and from the receiver Pi I cannot reach in
(sshd off on the lab hosts).

Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.

Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
  to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
  protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
  failure); divergence + install failures exit non-zero so the
  journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
  when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
  through bring-up; no special "auto" flow.

The version-gate provides the quality boundary, so even if origin/
main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.

install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.

Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:59:31 -05:00

76 lines
3 KiB
Bash
Executable file

#!/usr/bin/env bash
# Lab-host auto-update. Pulls origin/main and re-runs install-lab-host.sh
# when there's a newer commit on the canonical remote.
#
# Run by cis490-autoupdate.timer. Idempotent; safe to re-invoke.
#
# Why this exists: when the receiver's commit-allow-list rolls forward,
# any lab host running older code starts getting 412/400 on every PUT.
# Without auto-update, that requires either the on-device AI agent or
# the operator to notice and run `git pull && install-lab-host.sh` —
# neither of which happens reliably (k-gamingcom + elliott-thinkpad
# both stalled silently on the post-cutover 2026-05-01 incident).
# With auto-update, hosts catch up within RandomizedDelaySec of the
# next timer fire (≤ 40 min) on their own.
#
# Safety:
# - git pull is `--ff-only` — never rewrites or merges; if local
# diverged from origin (operator hand-edit, partial install) it
# bails rather than guess.
# - install-lab-host.sh is the SAME script the operator runs by hand.
# No special "auto" path; we want exactly one path through bring-up.
# - On any failure we exit non-zero so systemd records it; the timer
# re-fires next interval. Failures don't disable the timer.
# - The version gate provides quality control: even if auto-update
# pulls a known-bad commit, the receiver's allow-list catches it
# downstream.
set -euo pipefail
INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"
log() { printf '[auto-update] %s\n' "$*" >&2; }
[[ -d "$INSTALL_ROOT/.git" ]] || {
log "no .git in $INSTALL_ROOT — auto-update only supports git checkouts"
exit 0
}
cd "$INSTALL_ROOT"
# All git ops run as the service user (the owner of $INSTALL_ROOT).
# Running as root would trip git's "dubious ownership" guard.
GIT() { sudo -u "$SERVICE_USER" git -C "$INSTALL_ROOT" "$@"; }
if ! GIT fetch --quiet origin main; then
log "git fetch failed — network blip or remote down; will retry next tick"
exit 0 # don't fail the unit; this is expected on offline hosts
fi
LOCAL="$(GIT rev-parse HEAD)"
REMOTE="$(GIT rev-parse origin/main)"
if [[ "$LOCAL" == "$REMOTE" ]]; then
log "up to date at ${LOCAL:0:12}"
exit 0
fi
# Branch divergence check — operator hand-edits or partial installs
# could leave HEAD on a non-main commit. We don't want to silently
# overwrite that.
if ! GIT merge-base --is-ancestor HEAD origin/main; then
log "WARN: local HEAD ${LOCAL:0:12} is not an ancestor of origin/main"
log " ${REMOTE:0:12} — refusing to fast-forward. Investigate via"
log " 'git -C $INSTALL_ROOT log --all --oneline -10' on the host."
exit 1
fi
log "updating ${LOCAL:0:12} -> ${REMOTE:0:12}"
GIT pull --ff-only --quiet origin main
# install-lab-host.sh handles VERSION re-stamp, queue drain, daemon-reload,
# and systemctl restart of the lab-host services. Pass control to it
# directly via exec so its exit code is ours.
log "re-running install-lab-host.sh to apply new code"
exec "$INSTALL_ROOT/scripts/install-lab-host.sh"