CIS490/etc/cis490-cert-fetch.service
max 3180f7b5ac lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry
k-gamingcom symptom (2026-05-02): the on-device agent successfully
finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS
material" because the cert auto-fetch step in install-lab-host.sh
either ran with host_id still REPLACE_ME, or hit a transient
bootstrap.wg failure, and there's no automatic retry. The Pi-side
cert IS minted and the bootstrap endpoint serves it — the failure
mode is purely "lab-host hasn't pulled it down."

Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh
(idempotent, no-op when certs are already on disk, no-op when host_id
is unset, exit-0 on transient network failure so the unit doesn't
get pinned as failed), and run it from a 5-minute systemd timer.
The timer handles all three "stuck waiting on mTLS" cases without
operator action:

  - operator edited host_id post-install but didn't re-run install
  - bootstrap.wg was briefly unreachable during install
  - lab host was offline when install ran but came up later
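
The 5-minute timer described above could look roughly like the unit
printed below. This is a sketch, not the committed
cis490-cert-fetch.timer: the function name print_cert_fetch_timer and
the OnBootSec=1min first-run delay are assumptions; only the 5-minute
interval and the service name come from this commit.

```shell
#!/bin/sh
# Hypothetical helper: print a timer unit that re-runs the fetch service.
# The real cis490-cert-fetch.timer may differ; OnBootSec is an assumption.
print_cert_fetch_timer() {
    cat <<'EOF'
[Unit]
Description=Retry CIS490 lab-host mTLS cert fetch every 5 minutes

[Timer]
# First run shortly after boot (assumed delay), then again 5 minutes
# after each activation of the service.
OnBootSec=1min
OnUnitActiveSec=5min
Unit=cis490-cert-fetch.service

[Install]
# Timers are enabled via timers.target, not multi-user.target.
WantedBy=timers.target
EOF
}
```

Redirecting the output to /etc/systemd/system/cis490-cert-fetch.timer,
then running `systemctl daemon-reload` and `systemctl enable --now
cis490-cert-fetch.timer`, would activate it.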

The script `try-restart`s cis490-shipper after a successful fetch
so the daemon picks up the new cert immediately instead of waiting
for its lazy retry. install-lab-host.sh still calls the script
on install for fast first-time bring-up — the timer is the safety
net.
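
A minimal sketch of the guard logic described above, not the actual
scripts/fetch-lab-host-cert.sh. The function names, the client.crt and
client.key file names, and passing the fetch command as arguments are
all assumptions made for illustration.

```shell
#!/bin/sh
# Decide what one timer tick should do.
# Prints one of: skip-unset, skip-present, fetch.
# File names client.crt/client.key are hypothetical.
cert_fetch_action() {
    host_id=$1
    cert_dir=$2
    if [ -z "$host_id" ] || [ "$host_id" = "REPLACE_ME" ]; then
        echo skip-unset          # host_id not configured yet
    elif [ -s "$cert_dir/client.crt" ] && [ -s "$cert_dir/client.key" ]; then
        echo skip-present        # certs already on disk
    else
        echo fetch
    fi
}

# Run one tick: cert_fetch_tick HOST_ID CERT_DIR FETCH_CMD [ARGS...]
# Always returns 0, so a transient failure never marks the unit failed.
cert_fetch_tick() {
    host_id=$1
    cert_dir=$2
    shift 2
    case $(cert_fetch_action "$host_id" "$cert_dir") in
        fetch)
            if "$@"; then
                # try-restart only restarts the shipper if it is running.
                systemctl try-restart cis490-shipper.service || true
            else
                echo "cert fetch failed; the timer will retry" >&2
            fi
            ;;
    esac
    return 0
}
```

Returning 0 on every path is what keeps the oneshot unit out of the
failed state while the timer keeps retrying.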

Tarball extract is staged through a temp dir + atomic rename so a
mid-extract crash never leaves us with a mismatched cert/key pair.
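
One way to stage the extract is sketched below; the committed script
may do it differently. The function name install_cert_bundle and the
client.crt/client.key file names are assumptions. The staging dir is
created inside the destination so each `mv` is a same-filesystem
rename, and the cert is moved last so a "both files present" check only
passes once the pair is complete.

```shell
#!/bin/sh
# Hypothetical helper: install_cert_bundle TARBALL DEST_DIR
# Extracts into a staging dir, validates, then renames files into place.
install_cert_bundle() {
    tarball=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # Stage inside DEST_DIR so the later renames stay on one filesystem.
    stage=$(mktemp -d "$dest/.stage.XXXXXX") || return 1
    rc=1
    if tar -xf "$tarball" -C "$stage" \
        && [ -s "$stage/client.key" ] && [ -s "$stage/client.crt" ]; then
        # Key first, cert last: a crash between the two leaves no cert,
        # so the next run simply fetches again.
        if mv -f "$stage/client.key" "$dest/client.key" \
            && mv -f "$stage/client.crt" "$dest/client.crt"; then
            rc=0
        fi
    fi
    # A failed or partial extract only ever touched the staging dir.
    rm -rf "$stage"
    return "$rc"
}
```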

AGENTS.md row 4 updated: "waiting on mTLS material" remediation now
points at the timer, with the exact `systemctl start
cis490-cert-fetch.service` command to force an immediate retry.

Tests: 267/267 still pass; none added. The fetch script is idempotent
and handles its success and error paths inline, so a unit test would
mostly be testing systemd's behaviour. The meaningful integration test
is the timer running on a real lab host, which is the production case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:30:16 -05:00


[Unit]
Description=CIS490 lab-host mTLS leaf cert fetch (idempotent)
Documentation=https://maxgit.wg/spectral/CIS490
After=network-online.target wg-quick@wg0.service
# No Wants=network-online.target: if the network is down the script
# exits 0 silently and the timer will retry.
[Service]
Type=oneshot
# Runs as root because the script writes /etc/cis490/certs/ (owned by
# root, gid cis490) and may need to systemctl-restart cis490-shipper.
ExecStart=/opt/cis490/scripts/fetch-lab-host-cert.sh
StandardOutput=journal
StandardError=journal
[Install]
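# The TIMER is what gets enabled. An operator can still run
# `systemctl start cis490-cert-fetch.service` to force a one-shot
# fetch (e.g. right after editing host_id); that works whether or not
# this unit is enabled. WantedBy only takes effect if the service
# itself is enabled, in which case it also runs once at boot.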
WantedBy=multi-user.target