From 98dcd4f9f8df006d77406d534a5315dc35dda605 Mon Sep 17 00:00:00 2001 From: max Date: Fri, 1 May 2026 16:59:31 -0500 Subject: [PATCH] lab-host: cis490-autoupdate.timer for self-healing on push MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Today's incident: post-cutover, k-gamingcom went silent and elliott-thinkpad kept shipping pre-stamp episodes that the receiver gate 400'd in a 2300+ PUT loop. Both required `git pull && install- lab-host.sh` *on the host* — neither the on-device AI agent nor the operator pulled in time, and from the receiver Pi I cannot reach in (sshd off on the lab hosts). Fix the recurrence directly: a 30-min systemd timer that does git fetch + (if behind) ff-only pull + re-run install-lab-host.sh. Hosts catch up on the next tick on their own — no human or agent action required. Mechanics: - scripts/auto-update.sh runs as root, drops to cis490 for git ops to satisfy /opt/cis490 ownership ("dubious ownership" guard). - Refuses ff if local HEAD isn't an ancestor of origin/main — protects operator hand-edits from silent overwrite. - Network failures exit 0 (offline is normal, don't pin a unit failure); divergence + install failures exit non-zero so the journal records what broke. - RandomizedDelaySec=10min on the timer prevents thundering-herd when several hosts boot together. - Hands off to install-lab-host.sh via exec — exactly one path through bring-up; no special "auto" flow. The version-gate provides the quality boundary, so even if origin/ main moves forward unsafely, the receiver's allow-list still controls what lands in the index. install-lab-host.sh enables cis490-autoupdate.timer on every run, idempotent — existing hosts pick it up the next time they pull manually. Filed Forgejo #18 with the canonical command for elliott-thinkpad + k-gamingcom to bootstrap themselves out of the current incident (auto-update doesn't help them retroactively — it has to be running *before* the cutover to catch the next one). Co-Authored-By: Claude Opus 4.7 (1M context) --- AGENTS.md | 11 +++++ etc/cis490-autoupdate.service | 21 ++++++++++ etc/cis490-autoupdate.timer | 18 +++++++++ scripts/auto-update.sh | 76 +++++++++++++++++++++++++++++++++++ scripts/install-lab-host.sh | 12 ++++++ 5 files changed, 138 insertions(+) create mode 100644 etc/cis490-autoupdate.service create mode 100644 etc/cis490-autoupdate.timer create mode 100755 scripts/auto-update.sh diff --git a/AGENTS.md b/AGENTS.md index 2d92a15..f012970 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -101,6 +101,17 @@ pinning — no Rapid7 registration), and Tier-4 real-malware samples from theZoo (no API key, no signup). The orchestrator switches to Tier-3 episodes automatically once the prereqs are on disk. +**Hosts self-update.** `install-lab-host.sh` enables +`cis490-autoupdate.timer`, which runs every 30 min (with up to 10 min +of randomized delay) and does `git fetch + git pull --ff-only + +install-lab-host.sh` whenever origin/main has moved. So once a host +has done the canonical bring-up ONCE, it self-heals on every +subsequent maintainer push — you don't need to remember to pull. The +timer logs to `journalctl -u cis490-autoupdate.service`. If the +host's checkout has diverged from origin (operator hand-edits, +half-applied changes), auto-update bails rather than guessing — that +shows up as a unit failure with a clear log message. + If `index.jsonl` doesn't grow within a wave-interval (~60 s after `systemctl enable --now`), run `cis490-doctor` again. The most common silent failures it catches: diff --git a/etc/cis490-autoupdate.service b/etc/cis490-autoupdate.service new file mode 100644 index 0000000..ee05d17 --- /dev/null +++ b/etc/cis490-autoupdate.service @@ -0,0 +1,21 @@ +[Unit] +Description=CIS490 lab-host auto-update from origin/main +Documentation=https://maxgit.wg/spectral/CIS490 +After=network-online.target wg-quick@wg0.service +# We don't Want network-online so that a host that's offline just +# silently skips the update tick instead of pinning a unit failure. + +[Service] +Type=oneshot +# Runs as root because install-lab-host.sh writes to /etc/, /opt/, and +# calls systemctl. The script drops to the cis490 user via `sudo -u` +# for the git fetch + pull. +ExecStart=/opt/cis490/scripts/auto-update.sh +StandardOutput=journal +StandardError=journal + +[Install] +# The TIMER is what gets enabled, not the service itself. We still set +# WantedBy here so that an operator can `systemctl start +# cis490-autoupdate.service` manually for a one-shot pull. +WantedBy=multi-user.target diff --git a/etc/cis490-autoupdate.timer b/etc/cis490-autoupdate.timer new file mode 100644 index 0000000..ce1eda0 --- /dev/null +++ b/etc/cis490-autoupdate.timer @@ -0,0 +1,18 @@ +[Unit] +Description=Run CIS490 lab-host auto-update every 30 minutes +Documentation=https://maxgit.wg/spectral/CIS490 + +[Timer] +# 5 min after boot so a freshly-flashed host catches up promptly. +OnBootSec=5min +# Then every 30 min. RandomizedDelaySec spreads across hosts so +# the receiver / forgejo aren't hammered all at once when several +# lab hosts come up together. +OnUnitActiveSec=30min +RandomizedDelaySec=10min +# If the host was off when a tick was due, run on next boot. +Persistent=true +Unit=cis490-autoupdate.service + +[Install] +WantedBy=timers.target diff --git a/scripts/auto-update.sh b/scripts/auto-update.sh new file mode 100755 index 0000000..522aaf3 --- /dev/null +++ b/scripts/auto-update.sh @@ -0,0 +1,76 @@ +#!/usr/bin/env bash +# Lab-host auto-update. Pulls origin/main and re-runs install-lab-host.sh +# when there's a newer commit on the canonical remote. +# +# Run by cis490-autoupdate.timer. Idempotent; safe to re-invoke. +# +# Why this exists: when the receiver's commit-allow-list rolls forward, +# any lab host running older code starts getting 412/400 on every PUT. +# Without auto-update, that requires either the on-device AI agent or +# the operator to notice and run `git pull && install-lab-host.sh` — +# neither of which happens reliably (k-gamingcom + elliott-thinkpad +# both stalled silently on the post-cutover 2026-05-01 incident). +# With auto-update, hosts catch up within RandomizedDelaySec of the +# next timer fire (≤ 40 min) on their own. +# +# Safety: +# - git pull is `--ff-only` — never rewrites or merges; if local +# diverged from origin (operator hand-edit, partial install) it +# bails rather than guess. +# - install-lab-host.sh is the SAME script the operator runs by hand. +# No special "auto" path; we want exactly one path through bring-up. +# - On any failure we exit non-zero so systemd records it; the timer +# re-fires next interval. Failures don't disable the timer. +# - The version gate provides quality control: even if auto-update +# pulls a known-bad commit, the receiver's allow-list catches it +# downstream. + +set -euo pipefail + +INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}" +SERVICE_USER="${SERVICE_USER:-cis490}" + +log() { printf '[auto-update] %s\n' "$*" >&2; } + +[[ -d "$INSTALL_ROOT/.git" ]] || { + log "no .git in $INSTALL_ROOT — auto-update only supports git checkouts" + exit 0 +} + +cd "$INSTALL_ROOT" + +# All git ops run as the service user (the owner of $INSTALL_ROOT). +# Running as root would trip git's "dubious ownership" guard. +GIT() { sudo -u "$SERVICE_USER" git -C "$INSTALL_ROOT" "$@"; } + +if ! GIT fetch --quiet origin main; then + log "git fetch failed — network blip or remote down; will retry next tick" + exit 0 # don't fail the unit; this is expected on offline hosts +fi + +LOCAL="$(GIT rev-parse HEAD)" +REMOTE="$(GIT rev-parse origin/main)" + +if [[ "$LOCAL" == "$REMOTE" ]]; then + log "up to date at ${LOCAL:0:12}" + exit 0 +fi + +# Branch divergence check — operator hand-edits or partial installs +# could leave HEAD on a non-main commit. We don't want to silently +# overwrite that. +if ! GIT merge-base --is-ancestor HEAD origin/main; then + log "WARN: local HEAD ${LOCAL:0:12} is not an ancestor of origin/main" + log " ${REMOTE:0:12} — refusing to fast-forward. Investigate via" + log " 'git -C $INSTALL_ROOT log --all --oneline -10' on the host." + exit 1 +fi + +log "updating ${LOCAL:0:12} -> ${REMOTE:0:12}" +GIT pull --ff-only --quiet origin main + +# install-lab-host.sh handles VERSION re-stamp, queue drain, daemon-reload, +# and systemctl restart of the lab-host services. Pass control to it +# directly via exec so its exit code is ours. +log "re-running install-lab-host.sh to apply new code" +exec "$INSTALL_ROOT/scripts/install-lab-host.sh" diff --git a/scripts/install-lab-host.sh b/scripts/install-lab-host.sh index 0f2f88e..113a8f1 100755 --- a/scripts/install-lab-host.sh +++ b/scripts/install-lab-host.sh @@ -117,7 +117,19 @@ install -m 0644 "$REPO_ROOT/etc/cis490-shipper.service" \ /etc/systemd/system/cis490-shipper.service install -m 0644 "$REPO_ROOT/etc/cis490-orchestrator.service" \ /etc/systemd/system/cis490-orchestrator.service +# Auto-update: a 30-min timer that does git fetch + (if behind) pull +# and re-run this script. Prevents host-falls-behind incidents when +# the receiver's allow-list rolls forward and an on-device agent +# fails to act on the 412 remediation. See AGENTS.md "Auto-update". +install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.service" \ + /etc/systemd/system/cis490-autoupdate.service +install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.timer" \ + /etc/systemd/system/cis490-autoupdate.timer systemctl daemon-reload +# Enable the timer immediately — the operator gets self-healing on the +# next 30-min tick without an extra `systemctl enable`. Idempotent. +systemctl enable --now cis490-autoupdate.timer 2>/dev/null || \ + log "WARN: could not enable cis490-autoupdate.timer (will retry next install)" # --- 5. config template (only on first install) ----------------------- if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then