Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required `git pull &&
install-lab-host.sh` *on the host*, but neither the on-device AI agent
nor the operator pulled in time, and from the receiver Pi I cannot
reach in (sshd is off on the lab hosts).
Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.
Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
failure); divergence + install failures exit non-zero so the
journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
through bring-up; no special "auto" flow.
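For reference, a minimal sketch of the ancestor check from the list
above, run against a throwaway repo. The branch names (upstream, behind,
diverged) and the function name are hypothetical stand-ins; only
`git merge-base --is-ancestor` is what auto-update.sh actually calls.

```bash
#!/usr/bin/env bash
# Sketch only: shows how `git merge-base --is-ancestor` separates a host
# that is merely behind from one that has diverged. Nothing here touches
# /opt/cis490; it works in a temp dir and cleans up after itself.

demo_ancestor_check() {
    local repo branch
    repo="$(mktemp -d)"
    # Identity and signing are set inline so the demo does not depend on
    # the caller's git config.
    local g=(git -C "$repo"
             -c user.name=demo -c user.email=demo@example.invalid
             -c commit.gpgsign=false -c init.defaultBranch=main)

    "${g[@]}" init --quiet
    "${g[@]}" commit --quiet --allow-empty -m "base"
    "${g[@]}" branch upstream    # stands in for origin/main
    "${g[@]}" branch behind      # host that simply fell behind
    "${g[@]}" branch diverged    # host carrying a local hand-edit

    "${g[@]}" checkout --quiet upstream
    "${g[@]}" commit --quiet --allow-empty -m "upstream moves forward"
    "${g[@]}" checkout --quiet diverged
    "${g[@]}" commit --quiet --allow-empty -m "local hand-edit"

    for branch in behind diverged; do
        # Exit status 0 means "$branch" is an ancestor of upstream.
        if "${g[@]}" merge-base --is-ancestor "$branch" upstream; then
            echo "$branch: fast-forward is safe"
        else
            echo "$branch: refuse, not an ancestor"
        fi
    done
    rm -rf "$repo"
}
```

Calling `demo_ancestor_check` should report the behind branch as safe to
fast-forward and the diverged branch as refused, which is the same split
the script uses to choose between pulling and exiting non-zero.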
The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.
install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.
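The timer unit itself isn't quoted in this message. A sketch of what
cis490-autoupdate.timer might look like, under assumptions: only the
30-min interval and RandomizedDelaySec=10min come from the text above;
OnBootSec, the Description, the cis490-autoupdate.service name, and the
render_autoupdate_timer helper are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: prints a timer unit consistent with the description above.
# Values marked "assumed" are not taken from the repository.

render_autoupdate_timer() {
    cat <<'EOF'
[Unit]
Description=Periodic lab-host auto-update (sketch)

[Timer]
# assumed: first run shortly after boot
OnBootSec=5min
# from the commit message: 30-min interval, up to 10 min of jitter
OnUnitActiveSec=30min
RandomizedDelaySec=10min
# assumed service name; systemd defaults to the same-named .service anyway
Unit=cis490-autoupdate.service

[Install]
WantedBy=timers.target
EOF
}
```

Writing that output to /etc/systemd/system/cis490-autoupdate.timer and
then running `systemctl daemon-reload` and `systemctl enable --now
cis490-autoupdate.timer` would be the usual activation path. Enabling an
already-enabled timer changes nothing, which is presumably why the
install step can repeat it on every run.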
Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
76 lines
3 KiB
Bash
Executable file
#!/usr/bin/env bash
# Lab-host auto-update. Pulls origin/main and re-runs install-lab-host.sh
# when there's a newer commit on the canonical remote.
#
# Run by cis490-autoupdate.timer. Idempotent; safe to re-invoke.
#
# Why this exists: when the receiver's commit-allow-list rolls forward,
# any lab host running older code starts getting 412/400 on every PUT.
# Without auto-update, that requires either the on-device AI agent or
# the operator to notice and run `git pull && install-lab-host.sh`,
# neither of which happens reliably (k-gamingcom + elliott-thinkpad
# both stalled silently on the post-cutover 2026-05-01 incident).
# With auto-update, hosts catch up within RandomizedDelaySec of the
# next timer fire (≤ 40 min: the 30-min interval plus up to 10 min of
# random delay) on their own.
#
# Safety:
#   - git pull is `--ff-only`: it never rewrites or merges; if local
#     diverged from origin (operator hand-edit, partial install) it
#     bails rather than guess.
#   - install-lab-host.sh is the SAME script the operator runs by hand.
#     No special "auto" path; we want exactly one path through bring-up.
#   - On divergence or install failure we exit non-zero so systemd
#     records it. A failed fetch exits 0, because offline hosts are
#     expected. Either way the timer re-fires next interval; failures
#     don't disable the timer.
#   - The version gate provides quality control: even if auto-update
#     pulls a known-bad commit, the receiver's allow-list catches it
#     downstream.

set -euo pipefail

INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"

log() { printf '[auto-update] %s\n' "$*" >&2; }

[[ -d "$INSTALL_ROOT/.git" ]] || {
    log "no .git in $INSTALL_ROOT; auto-update only supports git checkouts"
    exit 0
}

cd "$INSTALL_ROOT"

# All git ops run as the service user (the owner of $INSTALL_ROOT).
# Running as root would trip git's "dubious ownership" guard.
GIT() { sudo -u "$SERVICE_USER" git -C "$INSTALL_ROOT" "$@"; }

if ! GIT fetch --quiet origin main; then
    log "git fetch failed (network blip or remote down); will retry next tick"
    exit 0  # don't fail the unit; this is expected on offline hosts
fi

LOCAL="$(GIT rev-parse HEAD)"
REMOTE="$(GIT rev-parse origin/main)"

if [[ "$LOCAL" == "$REMOTE" ]]; then
    log "up to date at ${LOCAL:0:12}"
    exit 0
fi

# Branch divergence check: operator hand-edits or partial installs
# could leave HEAD on a non-main commit. We don't want to silently
# overwrite that.
if ! GIT merge-base --is-ancestor HEAD origin/main; then
    log "WARN: local HEAD ${LOCAL:0:12} is not an ancestor of origin/main"
    log "      ${REMOTE:0:12}; refusing to fast-forward. Investigate via"
    log "      'git -C $INSTALL_ROOT log --all --oneline -10' on the host."
    exit 1
fi

log "updating ${LOCAL:0:12} -> ${REMOTE:0:12}"
GIT pull --ff-only --quiet origin main

# install-lab-host.sh handles VERSION re-stamp, queue drain, daemon-reload,
# and systemctl restart of the lab-host services. Pass control to it
# directly via exec so its exit code is ours.
log "re-running install-lab-host.sh to apply new code"
exec "$INSTALL_ROOT/scripts/install-lab-host.sh"