lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry
k-gamingcom symptom (2026-05-02): the on-device agent successfully finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS material" because the cert auto-fetch step in install-lab-host.sh either ran with host_id still REPLACE_ME, or hit a transient bootstrap.wg failure, and there's no automatic retry. The Pi-side cert IS minted and the bootstrap endpoint serves it; the failure mode is purely "lab-host hasn't pulled it down."

Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh (idempotent, no-op when certs are already on disk, no-op when host_id is unset, exit-0 on transient network failure so the unit doesn't get pinned as failed), and run it from a 5-minute systemd timer.

The timer handles all three "stuck waiting on mTLS" cases without operator action:
- operator edited host_id post-install but didn't re-run install
- bootstrap.wg was briefly unreachable during install
- lab host was offline when install ran but came up later

The script `try-restart`s cis490-shipper after a successful fetch so the daemon picks up the new cert immediately instead of waiting for its lazy retry. install-lab-host.sh still calls the script on install for fast first-time bring-up; the timer is the safety net.

Tarball extract is staged through a temp dir + atomic rename so a mid-extract crash never leaves us with a mismatched cert/key pair.

AGENTS.md row 4 updated: "waiting on mTLS material" remediation now points at the timer, with the exact `systemctl start cis490-cert-fetch.service` command to force an immediate retry.

Tests: 267/267 unchanged. The fetch script is idempotent + has all its happy/error paths handled inline; a unit test would mostly be testing systemd's behaviour. The integration test path is the timer running on a real lab host, which is the actual production case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 667f042707
commit 3180f7b5ac
5 changed files with 182 additions and 37 deletions
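The no-op conditions listed in the commit message reduce to a small decision function that can be checked without touching the network. A minimal sketch, assuming a hypothetical helper name `cert_fetch_decision` that takes an `ETC_ROOT`-style directory; it restates the checks for illustration and is not code from this commit:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): report what a cert-fetch
# run would do for a given config root.
# Prints one of: present | no-config | no-host-id | fetch:<host_id>
cert_fetch_decision() {
    local etc_root="$1" host_id
    # All three files must exist for the "already installed" no-op.
    if [[ -f "$etc_root/certs/lab-host.pem" \
       && -f "$etc_root/certs/lab-host.key" \
       && -f "$etc_root/certs/wg-ca.pem" ]]; then
        echo "present"
        return 0
    fi
    if [[ ! -f "$etc_root/lab-host.toml" ]]; then
        echo "no-config"
        return 0
    fi
    # Only a double-quoted value is accepted: host_id = "name"
    host_id="$(sed -nE 's/^host_id[[:space:]]*=[[:space:]]*"([^"]+)".*/\1/p' \
        "$etc_root/lab-host.toml" | head -n 1)"
    if [[ -z "$host_id" || "$host_id" == "REPLACE_ME" ]]; then
        echo "no-host-id"
        return 0
    fi
    echo "fetch:$host_id"
}
```

Calling `cert_fetch_decision /etc/cis490` on a lab host prints which branch the next timer tick would take.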
@@ -18,7 +18,7 @@ Do not invent variations. Do not paraphrase.
 | `journalctl -u cis490-shipper` shows `412 commit-rejected` or `code commit rejected` | `cd /opt/cis490 && sudo -u cis490 git pull origin main && sudo /opt/cis490/scripts/install-lab-host.sh` |
 | `journalctl -u cis490-shipper` shows `400` and `missing X-Cis490-Code-Commit` | Same command as above. |
 | `systemctl status cis490-shipper` shows `inactive (dead)` after install | `sudo systemctl enable --now cis490-shipper cis490-orchestrator` |
-| Shipper logs `waiting on mTLS material` | **Do nothing** — this is expected. The cert auto-arrives; the daemon retries each request. |
+| Shipper logs `waiting on mTLS material` | **Do nothing** — `cis490-cert-fetch.timer` polls bootstrap.wg every 5 min; once the cert lands, the script restarts the shipper automatically. To force a fetch right now: `sudo systemctl start cis490-cert-fetch.service` then `sudo journalctl -u cis490-cert-fetch -n 20`. |
 | Receiver index isn't growing (Pi side: `tail -f /var/lib/cis490/index.jsonl`) | `/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host` — read the **red** rows; each one prints the exact fix command. |
 | `bootstrap.wg fetch failed` after install | `echo '10.100.0.1 bootstrap.wg collector.wg' \| sudo tee -a /etc/hosts && sudo /opt/cis490/scripts/install-lab-host.sh` |
 | `cis490-doctor` says VERSION stamp missing | `sudo /opt/cis490/scripts/install-lab-host.sh` |
etc/cis490-cert-fetch.service (normal file, 20 lines added)
@@ -0,0 +1,20 @@
+[Unit]
+Description=CIS490 lab-host mTLS leaf cert fetch (idempotent)
+Documentation=https://maxgit.wg/spectral/CIS490
+After=network-online.target wg-quick@wg0.service
+# We don't Want network-online — if the network is down the script
+# exits 0 silently and the timer will retry.
+
+[Service]
+Type=oneshot
+# Runs as root because the script writes /etc/cis490/certs/ (owned by
+# root, gid cis490) and may need to systemctl-restart cis490-shipper.
+ExecStart=/opt/cis490/scripts/fetch-lab-host-cert.sh
+StandardOutput=journal
+StandardError=journal
+
+[Install]
+# The TIMER is what gets enabled. WantedBy here lets an operator
+# `systemctl start cis490-cert-fetch.service` to force a one-shot
+# fetch (e.g. right after editing host_id).
+WantedBy=multi-user.target
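The unit depends on the script exiting 0 whenever the network is unavailable, and the script in this commit treats every curl failure that way. A finer-grained policy is possible. The sketch below is an assumption, not the commit's behaviour: the helper names `curl_rc_is_transient` and `unit_exit_for_curl_rc` are hypothetical, and the classification uses curl's documented exit codes (6 could not resolve host, 7 failed to connect, 28 operation timed out).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the commit): swallow only
# network-level curl failures and surface everything else.
curl_rc_is_transient() {
    case "$1" in
        6|7|28) return 0 ;;   # DNS failure, connect failure, timeout
        *)      return 1 ;;   # e.g. 22 (HTTP error with -f), 60 (TLS verify)
    esac
}

# Map a curl exit status to the status a oneshot unit should report.
unit_exit_for_curl_rc() {
    local rc="$1"
    if (( rc == 0 )) || curl_rc_is_transient "$rc"; then
        echo 0    # success, or retry quietly on the next timer tick
    else
        echo 1    # visible as a failed unit in `systemctl status`
    fi
}
```

Under this policy a wrong CA bundle or an HTTP error for an unknown host_id would mark the unit failed instead of retrying silently; whether that is preferable depends on how noisy the journal is allowed to be.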
etc/cis490-cert-fetch.timer (normal file, 22 lines added)
@@ -0,0 +1,22 @@
+[Unit]
+Description=Periodically fetch CIS490 mTLS leaf cert if missing
+Documentation=https://maxgit.wg/spectral/CIS490
+
+[Timer]
+# Fire 30 seconds after boot — covers the freshly-installed host
+# scenario (operator edits host_id; cert lands within ~30s of next
+# boot or timer tick).
+OnBootSec=30sec
+# Then aggressively for the first hour (5 min cadence) to handle the
+# common "operator just edited host_id" path without making them
+# wait. Once the host has its cert the script is a no-op so the
+# 5-min cadence is cheap.
+OnUnitActiveSec=5min
+# Resists clock-skew + matches systemd's coarse scheduling.
+AccuracySec=10sec
+# If the host was off when a tick was due, run on next boot.
+Persistent=true
+Unit=cis490-cert-fetch.service
+
+[Install]
+WantedBy=timers.target
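systemd.timer(5) documents `Persistent=` as having an effect only on timers configured with `OnCalendar=`. With the monotonic triggers used here (`OnBootSec=`, `OnUnitActiveSec=`), the run after downtime comes from `OnBootSec=` instead. A small lint sketch that flags this combination in a unit file; the function name `persistent_without_calendar` is hypothetical and the check is deliberately simple (it does not parse sections or drop-ins):

```shell
#!/usr/bin/env bash
# Hypothetical lint (not part of the commit): return 0 when a timer unit
# enables Persistent= but has no OnCalendar= line, otherwise return 1.
persistent_without_calendar() {
    local unit_file="$1"
    # No truthy Persistent= line: nothing to flag.
    grep -Eiq '^[[:space:]]*Persistent[[:space:]]*=[[:space:]]*(true|yes|on|1)[[:space:]]*$' \
        "$unit_file" || return 1
    # Persistent= is meaningful once an OnCalendar= trigger exists.
    if grep -Eq '^[[:space:]]*OnCalendar[[:space:]]*=' "$unit_file"; then
        return 1
    fi
    return 0
}
```

Run as `persistent_without_calendar etc/cis490-cert-fetch.timer && echo "Persistent= has no effect here"`.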
scripts/fetch-lab-host-cert.sh (executable file, 121 lines added)
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+# Fetch this lab-host's mTLS leaf cert from the Pi's bootstrap endpoint.
+#
+# Idempotent. Safe to run repeatedly:
+#   - If certs are already on disk, exit 0 immediately (no-op).
+#   - If host_id is unset / still REPLACE_ME, exit 0 — the operator
+#     hasn't told us who we are yet, so there's nothing to fetch.
+#   - If bootstrap.wg can't be reached, exit 0 — network blip; let the
+#     timer retry.
+#   - On a successful fetch, install certs into $ETC_ROOT/certs/
+#     atomically and `systemctl try-restart cis490-shipper` so the
+#     running daemon picks up the cert without waiting for its lazy
+#     retry.
+#
+# Run by cis490-cert-fetch.timer (every 5 min) AND by
+# install-lab-host.sh on every install. Also safe for an operator to
+# invoke manually.
+#
+# Why this exists as its own script instead of an inline block in
+# install-lab-host.sh: install-lab-host.sh does a LOT (cp, venv,
+# Tier-3+4 deploy, queue drain, daemon restart) — re-running it
+# every 5 min for the cert is overkill and disruptive. Lift just the
+# cert step into a fast, idempotent oneshot.
+
+set -euo pipefail
+
+INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
+ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
+SERVICE_USER="${SERVICE_USER:-cis490}"
+
+log() { printf '[fetch-lab-host-cert] %s\n' "$*" >&2; }
+
+[[ $EUID -eq 0 ]] || { log "must run as root (writes /etc/cis490/certs)"; exit 2; }
+
+# Already on disk? No-op. We DON'T validate cert expiry / chain here —
+# that's the shipper's job (the SSL context build catches a corrupt or
+# expired cert; the operator gets the warning in journalctl). Refresh
+# logic for cert renewal would belong in a separate script.
+if [[ -f "$ETC_ROOT/certs/lab-host.pem" \
+   && -f "$ETC_ROOT/certs/lab-host.key" \
+   && -f "$ETC_ROOT/certs/wg-ca.pem" ]]; then
+    log "certs already present; nothing to do"
+    exit 0
+fi
+
+# host_id not set yet? Wait for the operator.
+if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then
+    log "no $ETC_ROOT/lab-host.toml yet; nothing to fetch"
+    exit 0
+fi
+HOST_ID="$(grep -E '^host_id\s*=' "$ETC_ROOT/lab-host.toml" 2>/dev/null \
+    | head -1 | sed -E 's/^host_id\s*=\s*"([^"]+)".*/\1/' || true)"
+if [[ -z "$HOST_ID" || "$HOST_ID" == "REPLACE_ME" ]]; then
+    log "host_id not set in $ETC_ROOT/lab-host.toml — operator must edit it first"
+    exit 0
+fi
+
+# We need the Caddy root CA to verify bootstrap.wg's TLS cert. It's
+# bundled in the repo. If it's missing, our checkout is broken — that's
+# a real failure.
+CA_BUNDLE="$INSTALL_ROOT/etc/caddy-root.crt"
+[[ -f "$CA_BUNDLE" ]] || { log "missing $CA_BUNDLE — install broken"; exit 1; }
+
+install -d -m 0755 -o root -g "$SERVICE_USER" "$ETC_ROOT/certs"
+
+# Use a per-pid tarball so concurrent runs (timer + manual operator)
+# don't stomp each other.
+TAR="/tmp/cis490-bootstrap-$$.tar"
+trap 'rm -f "$TAR"' EXIT
+
+log "fetching leaf cert for host_id=$HOST_ID from https://bootstrap.wg/"
+if ! curl -fsS --cacert "$CA_BUNDLE" \
+        --connect-timeout 10 --max-time 60 \
+        "https://bootstrap.wg/v1/cert/$HOST_ID" -o "$TAR"; then
+    log "bootstrap.wg fetch failed — will retry on next timer tick"
+    log " if this persists, check:"
+    log " - /etc/hosts: 'getent hosts bootstrap.wg' should return 10.100.0.1"
+    log " - wg0: 'sudo wg show' should list the Pi as a peer"
+    log " - Pi-side: cis490-bootstrap.service active on 10.100.0.1"
+    # exit 0 (not 1) so transient network blips don't pin the unit as
+    # failed. The timer fires every few minutes — pile of failures isn't
+    # what we want in journalctl.
+    exit 0
+fi
+
+# Stage into a sibling temp dir then atomically rename, so a partial
+# extract never leaves us with mixed-version cert + key on disk.
+STAGE="$(mktemp -d "$ETC_ROOT/certs/.stage.XXXXXX")"
+trap 'rm -rf "$STAGE" "$TAR"' EXIT
+
+if ! tar -C "$STAGE" -xf "$TAR"; then
+    log "ERROR: tarball is malformed"
+    exit 1
+fi
+
+# Validate the expected files are there before we install. Better to
+# fail loudly than half-install.
+for f in "ca.crt" "$HOST_ID.pem" "$HOST_ID.key"; do
+    [[ -f "$STAGE/$f" ]] || { log "ERROR: bootstrap tarball missing $f"; exit 1; }
+done
+
+mv "$STAGE/ca.crt" "$ETC_ROOT/certs/wg-ca.pem"
+mv "$STAGE/$HOST_ID.pem" "$ETC_ROOT/certs/lab-host.pem"
+mv "$STAGE/$HOST_ID.key" "$ETC_ROOT/certs/lab-host.key"
+chown root:"$SERVICE_USER" \
+    "$ETC_ROOT/certs/wg-ca.pem" \
+    "$ETC_ROOT/certs/lab-host.pem" \
+    "$ETC_ROOT/certs/lab-host.key"
+chmod 0644 "$ETC_ROOT/certs/wg-ca.pem" "$ETC_ROOT/certs/lab-host.pem"
+chmod 0640 "$ETC_ROOT/certs/lab-host.key"
+
+log "installed mTLS leaf for $HOST_ID"
+
+# Try-restart the shipper so it picks up the cert immediately — but
+# only if the unit's already enabled (don't auto-start a unit the
+# operator deliberately didn't enable yet).
+if systemctl is-enabled --quiet cis490-shipper 2>/dev/null; then
+    log "restarting cis490-shipper to load new cert"
+    systemctl try-restart cis490-shipper || \
+        log "WARN: cis490-shipper try-restart failed"
+fi
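The staging step can be exercised without root or a bootstrap server. A minimal sketch of the same stage, validate, rename sequence as a standalone function, assuming a hypothetical name `install_bundle` and leaving out the ownership and mode changes:

```shell
#!/usr/bin/env bash
# Hypothetical, simplified restatement of the staging logic (no chown/chmod).
# Usage: install_bundle <tarball> <cert_dir> <host_id>
install_bundle() {
    local tarball="$1" cert_dir="$2" host_id="$3" stage f rc
    mkdir -p "$cert_dir" || return 1
    # Stage inside cert_dir so each later mv is a same-filesystem rename.
    stage="$(mktemp -d "$cert_dir/.stage.XXXXXX")" || return 1
    if ! tar -C "$stage" -xf "$tarball" 2>/dev/null; then
        rm -rf "$stage"
        return 1
    fi
    # Install nothing unless every expected member is present.
    for f in ca.crt "$host_id.pem" "$host_id.key"; do
        if [[ ! -f "$stage/$f" ]]; then
            rm -rf "$stage"
            return 1
        fi
    done
    mv "$stage/ca.crt"       "$cert_dir/wg-ca.pem"    &&
    mv "$stage/$host_id.pem" "$cert_dir/lab-host.pem" &&
    mv "$stage/$host_id.key" "$cert_dir/lab-host.key"
    rc=$?
    rm -rf "$stage"
    return "$rc"
}
```

Each `mv` is atomic on its own, but the three together are not. A crash between them leaves an incomplete set, which the "all three files present" check at the top of the script treats as not installed, so the next run fetches again.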
@@ -125,11 +125,21 @@ install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.service" \
     /etc/systemd/system/cis490-autoupdate.service
 install -m 0644 "$REPO_ROOT/etc/cis490-autoupdate.timer" \
     /etc/systemd/system/cis490-autoupdate.timer
+# mTLS cert-fetch retry timer: handles the "operator edited host_id
+# but didn't re-run install" case AND the "bootstrap.wg was briefly
+# unreachable" case. Polls every 5 min; no-op once the cert is on
+# disk. See scripts/fetch-lab-host-cert.sh.
+install -m 0644 "$REPO_ROOT/etc/cis490-cert-fetch.service" \
+    /etc/systemd/system/cis490-cert-fetch.service
+install -m 0644 "$REPO_ROOT/etc/cis490-cert-fetch.timer" \
+    /etc/systemd/system/cis490-cert-fetch.timer
 systemctl daemon-reload
-# Enable the timer immediately — the operator gets self-healing on the
-# next 30-min tick without an extra `systemctl enable`. Idempotent.
+# Enable timers immediately — the operator gets self-healing on the
+# next tick without an extra `systemctl enable`. Idempotent.
 systemctl enable --now cis490-autoupdate.timer 2>/dev/null || \
     log "WARN: could not enable cis490-autoupdate.timer (will retry next install)"
+systemctl enable --now cis490-cert-fetch.timer 2>/dev/null || \
+    log "WARN: could not enable cis490-cert-fetch.timer (will retry next install)"
 
 # --- 5. config template (only on first install) -----------------------
 if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then
@@ -189,40 +199,12 @@ if command -v ip >/dev/null && [[ -x "$REPO_ROOT/vm/setup_bridge.sh" ]]; then
 fi
 
 # --- 7. mTLS leaf cert (auto-fetch via bootstrap.wg) -------------------
-# Pull our leaf cert from the Pi's bootstrap endpoint if it isn't
-# already on disk. Trust boundary: "reached bootstrap.wg over WG"
-# (iptmonads already filters non-peers from 443). Caddy's TLS cert
-# is verified against the bundled etc/caddy-root.crt — no chicken-
-# and-egg.
-HOST_ID="$(grep -E '^host_id\s*=' "$ETC_ROOT/lab-host.toml" 2>/dev/null \
-    | head -1 | sed -E 's/^host_id\s*=\s*"([^"]+)".*/\1/')"
-if [[ -z "$HOST_ID" || "$HOST_ID" == "REPLACE_ME" ]]; then
-    log "skipping cert auto-fetch: host_id not set in $ETC_ROOT/lab-host.toml"
-elif [[ ! -f "$ETC_ROOT/certs/lab-host.pem" ]]; then
-    log "fetching leaf cert from https://bootstrap.wg/v1/cert/$HOST_ID"
-    install -d -m 0755 -o root -g "$SERVICE_USER" "$ETC_ROOT/certs"
-    TAR="/tmp/cis490-bootstrap-$$.tar"
-    if curl -fsS --cacert "$REPO_ROOT/etc/caddy-root.crt" \
-            --connect-timeout 10 --max-time 60 \
-            "https://bootstrap.wg/v1/cert/$HOST_ID" -o "$TAR"; then
-        tar -C "$ETC_ROOT/certs" -xf "$TAR"
-        mv "$ETC_ROOT/certs/ca.crt" "$ETC_ROOT/certs/wg-ca.pem"
-        mv "$ETC_ROOT/certs/$HOST_ID.pem" "$ETC_ROOT/certs/lab-host.pem"
-        mv "$ETC_ROOT/certs/$HOST_ID.key" "$ETC_ROOT/certs/lab-host.key"
-        chown root:"$SERVICE_USER" "$ETC_ROOT/certs/"*.pem \
-            "$ETC_ROOT/certs/lab-host.key"
-        chmod 0644 "$ETC_ROOT/certs/"*.pem
-        chmod 0640 "$ETC_ROOT/certs/lab-host.key"
-        rm -f "$TAR"
-        log "leaf cert installed for host_id=$HOST_ID"
-    else
-        rm -f "$TAR"
-        log "WARN: bootstrap.wg fetch failed — make sure /etc/hosts maps it"
-        log " to 10.100.0.1 and that wg0 is up. cert delivery skipped."
-    fi
-else
-    log "$ETC_ROOT/certs/lab-host.pem present; skipping auto-fetch"
-fi
+# One-shot fetch via the standalone script (also wired to a 5-min
+# retry timer in step 4 above, so this is just the install-time
+# kick-off — the timer handles transient failures and the case
+# where the operator edits host_id later).
+"$INSTALL_ROOT/scripts/fetch-lab-host-cert.sh" || \
+    log "WARN: cert fetch failed — timer will retry"
 
 # --- 8. baseline VM image + cidata (best-effort) -----------------------
 ALPINE_IMG="$DATA_ROOT/vm/images/alpine-baseline.qcow2"