lab-host: cis490-autoupdate.timer for self-healing on push

Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required
`git pull && install-lab-host.sh` *on the host* — neither the
on-device AI agent nor the operator pulled in time, and from the
receiver Pi I cannot reach in (sshd off on the lab hosts).

Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.

Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
  to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
  protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
  failure); divergence + install failures exit non-zero so the
  journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
  when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
  through bring-up; no special "auto" flow.

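The fast-forward guard from the list above can be tried on a throwaway repository. This is a minimal sketch, not the shipped script: the `classify` helper, the `g` wrapper, the temp directory, and the commit messages are made up for illustration. Only `git rev-parse` and `git merge-base --is-ancestor` mirror what auto-update.sh relies on.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: classify HEAD relative to a target commit the
# same way the auto-update check does.
classify() {  # usage: classify <repo> <target-rev>
    local repo="$1" target="$2" head tgt
    head="$(git -C "$repo" rev-parse HEAD)"
    tgt="$(git -C "$repo" rev-parse "$target")"
    if [[ "$head" == "$tgt" ]]; then
        echo up-to-date
    elif git -C "$repo" merge-base --is-ancestor HEAD "$target"; then
        echo behind        # safe to fast-forward
    else
        echo diverged      # refuse: local commits are not on the target
    fi
}

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
# Illustrative wrapper so the demo does not depend on global git config.
g() { git -C "$work" -c user.name=demo -c user.email=demo@example.invalid "$@"; }

g init --quiet
g commit --quiet --allow-empty -m base
base="$(g rev-parse HEAD)"
g commit --quiet --allow-empty -m upstream
upstream="$(g rev-parse HEAD)"

state_current="$(classify "$work" "$upstream")"

# Rewind to the older commit: HEAD is now strictly behind the target.
g checkout --quiet --detach "$base"
state_behind="$(classify "$work" "$upstream")"

# Add a local-only commit: HEAD and the target have now diverged.
g commit --quiet --allow-empty -m "operator hand-edit"
state_diverged="$(classify "$work" "$upstream")"

printf '%s\n' "$state_current" "$state_behind" "$state_diverged"
```

The three printed states are `up-to-date`, `behind`, and `diverged`; only the middle one would lead to a pull.
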
The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.

install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.

Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 20ff76c1e0
commit 98dcd4f9f8
5 changed files with 138 additions and 0 deletions

11
AGENTS.md

@@ -101,6 +101,17 @@ pinning — no Rapid7 registration), and Tier-4 real-malware samples
from theZoo (no API key, no signup). The orchestrator switches to
Tier-3 episodes automatically once the prereqs are on disk.

**Hosts self-update.** `install-lab-host.sh` enables
`cis490-autoupdate.timer`, which runs every 30 min (with up to 10 min
of randomized delay) and does `git fetch + git pull --ff-only +
install-lab-host.sh` whenever origin/main has moved. So once a host
has done the canonical bring-up ONCE, it self-heals on every
subsequent maintainer push — you don't need to remember to pull. The
timer logs to `journalctl -u cis490-autoupdate.service`. If the
host's checkout has diverged from origin (operator hand-edits,
half-applied changes), auto-update bails rather than guessing — that
shows up as a unit failure with a clear log message.

If `index.jsonl` doesn't grow within a wave-interval (~60 s after
`systemctl enable --now`), run `cis490-doctor` again. The most
common silent failures it catches:

21
etc/cis490-autoupdate.service
Normal file

@@ -0,0 +1,21 @@
[Unit]
Description=CIS490 lab-host auto-update from origin/main
Documentation=https://maxgit.wg/spectral/CIS490
After=network-online.target wg-quick@wg0.service
# No Wants=network-online.target on purpose: a host that's offline just
# silently skips the update tick instead of pinning a unit failure.

[Service]
Type=oneshot
# Runs as root because install-lab-host.sh writes to /etc/, /opt/, and
# calls systemctl. The script drops to the cis490 user via `sudo -u`
# for the git fetch + pull.
ExecStart=/opt/cis490/scripts/auto-update.sh
StandardOutput=journal
StandardError=journal

[Install]
# The TIMER is what gets enabled, not the service itself. A manual
# `systemctl start cis490-autoupdate.service` works without this
# section; WantedBy only applies if the service itself is enabled.
WantedBy=multi-user.target

18
etc/cis490-autoupdate.timer
Normal file

@@ -0,0 +1,18 @@
[Unit]
Description=Run CIS490 lab-host auto-update every 30 minutes
Documentation=https://maxgit.wg/spectral/CIS490

[Timer]
# 5 min after boot so a freshly-flashed host catches up promptly.
OnBootSec=5min
# Then every 30 min. RandomizedDelaySec spreads across hosts so
# the receiver / forgejo aren't hammered all at once when several
# lab hosts come up together.
OnUnitActiveSec=30min
RandomizedDelaySec=10min
# Persistent= only affects OnCalendar= timers; OnBootSec= covers boot.
Persistent=true
Unit=cis490-autoupdate.service

[Install]
WantedBy=timers.target

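As a rough check on the schedule above, the arithmetic behind the "≤ 40 min" figure quoted in scripts/auto-update.sh can be written out. This is a sketch only: the variable names are illustrative, and it ignores systemd's AccuracySec= slack and the run time of the update itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values copied from the [Timer] section, in minutes.
on_boot=5        # OnBootSec
interval=30      # OnUnitActiveSec
max_jitter=10    # RandomizedDelaySec (delay is drawn from 0..10 min)

# Upper bound on the gap between one activation and the next, and on
# the first run after boot.
worst_gap=$(( interval + max_jitter ))
worst_first_run=$(( on_boot + max_jitter ))

echo "max gap between runs: ${worst_gap} min"
echo "first run after boot within: ${worst_first_run} min"
```

With the values in this unit the bounds come out to 40 and 15 minutes.
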
76
scripts/auto-update.sh
Executable file

@@ -0,0 +1,76 @@
#!/usr/bin/env bash
# Lab-host auto-update. Pulls origin/main and re-runs install-lab-host.sh
# when there's a newer commit on the canonical remote.
#
# Run by cis490-autoupdate.timer. Idempotent; safe to re-invoke.
#
# Why this exists: when the receiver's commit-allow-list rolls forward,
# any lab host running older code starts getting 412/400 on every PUT.
# Without auto-update, that requires either the on-device AI agent or
# the operator to notice and run `git pull && install-lab-host.sh` —
# neither of which happens reliably (k-gamingcom + elliott-thinkpad
# both stalled silently on the post-cutover 2026-05-01 incident).
# With auto-update, hosts catch up within RandomizedDelaySec of the
# next timer fire (≤ 40 min) on their own.
#
# Safety:
#   - git pull is `--ff-only` — never rewrites or merges; if local
#     diverged from origin (operator hand-edit, partial install) it
#     bails rather than guess.
#   - install-lab-host.sh is the SAME script the operator runs by hand.
#     No special "auto" path; we want exactly one path through bring-up.
#   - Divergence, pull, or install failures exit non-zero so systemd logs
#     them; a failed fetch exits 0. The timer re-fires either way.
#   - The version gate provides quality control: even if auto-update
#     pulls a known-bad commit, the receiver's allow-list catches it
#     downstream.

set -euo pipefail

INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"

log() { printf '[auto-update] %s\n' "$*" >&2; }

[[ -d "$INSTALL_ROOT/.git" ]] || {
    log "no .git in $INSTALL_ROOT — auto-update only supports git checkouts"
    exit 0
}

cd "$INSTALL_ROOT"

# All git ops run as the service user (the owner of $INSTALL_ROOT).
# Running as root would trip git's "dubious ownership" guard.
GIT() { sudo -u "$SERVICE_USER" git -C "$INSTALL_ROOT" "$@"; }

if ! GIT fetch --quiet origin main; then
    log "git fetch failed — network blip or remote down; will retry next tick"
    exit 0  # don't fail the unit; this is expected on offline hosts
fi

LOCAL="$(GIT rev-parse HEAD)"
REMOTE="$(GIT rev-parse origin/main)"

if [[ "$LOCAL" == "$REMOTE" ]]; then
    log "up to date at ${LOCAL:0:12}"
    exit 0
fi

# Branch divergence check — operator hand-edits or partial installs
# could leave HEAD on a non-main commit. We don't want to silently
# overwrite that.
if ! GIT merge-base --is-ancestor HEAD origin/main; then
    log "WARN: local HEAD ${LOCAL:0:12} is not an ancestor of origin/main"
    log " ${REMOTE:0:12} — refusing to fast-forward. Investigate via"
    log " 'git -C $INSTALL_ROOT log --all --oneline -10' on the host."
    exit 1
fi

log "updating ${LOCAL:0:12} -> ${REMOTE:0:12}"
GIT pull --ff-only --quiet origin main

# install-lab-host.sh handles VERSION re-stamp, queue drain, daemon-reload,
# and systemctl restart of the lab-host services. Pass control to it
# directly via exec so its exit code is ours.
log "re-running install-lab-host.sh to apply new code"
exec "$INSTALL_ROOT/scripts/install-lab-host.sh"

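The closing `exec` is what makes the installer's exit status become the status systemd records for the unit. A small sketch of that behaviour, using inline `bash -c` stand-ins (hypothetical, not the real installer) in place of install-lab-host.sh:

```shell
#!/usr/bin/env bash
set -uo pipefail

# Stand-in for auto-update.sh handing off to an installer that fails
# with status 3. exec replaces the wrapper process, so nothing runs
# after it and the child's status is what the caller sees.
failed_status=0
bash -c 'exec bash -c "exit 3"' || failed_status=$?

# Same hand-off with an installer that succeeds.
ok_status=0
bash -c 'exec bash -c "exit 0"' || ok_status=$?

echo "failing installer -> $failed_status, succeeding installer -> $ok_status"
```

This prints `failing installer -> 3, succeeding installer -> 0`, which is why a broken install shows up as a failed cis490-autoupdate.service run rather than being swallowed by the wrapper.
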
scripts/install-lab-host.sh

@@ -117,7 +117,19 @@ install -m 0644 "$REPO_ROOT/etc/cis490-shipper.service" \
    /etc/systemd/system/cis490-shipper.service
install -m 0644 "$REPO_ROOT/etc/cis490-orchestrator.service" \
    /etc/systemd/system/cis490-orchestrator.service
# Auto-update: a 30-min timer that does git fetch + (if behind) pull
# and re-run this script. Prevents host-falls-behind incidents when
# the receiver's allow-list rolls forward and an on-device agent
# fails to act on the 412 remediation. See AGENTS.md "Hosts self-update".
install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.service" \
    /etc/systemd/system/cis490-autoupdate.service
install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.timer" \
    /etc/systemd/system/cis490-autoupdate.timer
systemctl daemon-reload
# Enable the timer immediately — the operator gets self-healing on the
# next 30-min tick without an extra `systemctl enable`. Idempotent.
systemctl enable --now cis490-autoupdate.timer 2>/dev/null || \
    log "WARN: could not enable cis490-autoupdate.timer (will retry next install)"

# --- 5. config template (only on first install) -----------------------
if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then

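The `systemctl enable --now ... || log` line in this hunk is a best-effort guard: a failed enable is reported but does not abort the rest of the install. A sketch of the pattern, assuming the installer runs under `set -e` (not visible in this hunk) and using a hypothetical `enable_timer` stand-in that always fails, plus an illustrative `log` helper:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative logger; the real installer's log helper is not shown here.
log() { printf '[install] %s\n' "$*" >&2; }

# Hypothetical stand-in for `systemctl enable --now ...` on a host
# where the call fails (for example, no systemd inside a container).
enable_timer() { return 1; }

reached_next_step=no

# The failing command is the left-hand side of `||`, so `set -e` does
# not terminate the script; the warning runs instead.
enable_timer 2>/dev/null || \
    log "WARN: could not enable timer (will retry next install)"

reached_next_step=yes
echo "continued to next step: $reached_next_step"
```

The warning goes to stderr and the script carries on, so a later install run gets another chance to enable the timer.
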