k-gamingcom symptom (2026-05-02): the on-device agent successfully finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS material" because the cert auto-fetch step in install-lab-host.sh either ran with host_id still REPLACE_ME, or hit a transient bootstrap.wg failure, and there's no automatic retry. The Pi-side cert IS minted and the bootstrap endpoint serves it; the failure mode is purely "lab-host hasn't pulled it down."

Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh (idempotent, no-op when certs are already on disk, no-op when host_id is unset, exit-0 on transient network failure so the unit doesn't get pinned as failed), and run it from a 5-minute systemd timer.

The timer handles all three "stuck waiting on mTLS" cases without operator action:

- operator edited host_id post-install but didn't re-run install
- bootstrap.wg was briefly unreachable during install
- lab host was offline when install ran but came up later

The script `try-restart`s cis490-shipper after a successful fetch so the daemon picks up the new cert immediately instead of waiting for its lazy retry.

install-lab-host.sh still calls the script on install for fast first-time bring-up; the timer is the safety net.

Tarball extract is staged through a temp dir + atomic rename so a mid-extract crash never leaves us with a mismatched cert/key pair.

AGENTS.md row 4 updated: "waiting on mTLS material" remediation now points at the timer, with the exact `systemctl start cis490-cert-fetch.service` command to force an immediate retry.

Tests: 267/267 unchanged. The fetch script is idempotent + has all its happy/error paths handled inline; a unit test would mostly be testing systemd's behaviour. The integration test path is the timer running on a real lab host, which is the actual production case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
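The fetch flow described above could look roughly like the sketch below. It is not the contents of scripts/fetch-lab-host-cert.sh: the certificate directory, the file names client.crt and client.key, the BOOTSTRAP_URL variable, and the helper names needs_fetch, fetch_tarball, install_pair and main are all assumptions made for illustration. The sketch renames each file individually out of the staging directory, key first, so a cert on disk implies its key is already there; the real script may stage differently.

```shell
#!/bin/sh
# Illustrative sketch only. Paths, file names and BOOTSTRAP_URL are
# assumptions, not values taken from the real script.
CERT_DIR="${CERT_DIR:-/etc/cis490/certs}"   # assumed cert location
HOST_ID="${HOST_ID:-REPLACE_ME}"            # assumed to be read from config
BOOTSTRAP_URL="${BOOTSTRAP_URL:-}"          # assumed: URL of this host's cert tarball

# Return 0 only when a fetch is actually needed.
needs_fetch() {
    # No-op while host_id is unset or still the placeholder.
    if [ -z "$HOST_ID" ] || [ "$HOST_ID" = "REPLACE_ME" ]; then
        return 1
    fi
    # No-op when both halves of the pair are already on disk.
    if [ -s "$CERT_DIR/client.crt" ] && [ -s "$CERT_DIR/client.key" ]; then
        return 1
    fi
    return 0
}

# Download the tarball to the path given as $1 (hypothetical endpoint).
fetch_tarball() {
    curl --fail --silent --show-error --max-time 30 \
        --output "$1" "$BOOTSTRAP_URL"
}

# Extract into a staging dir, verify both files, then rename into place.
install_pair() {
    tarball=$1
    mkdir -p "$CERT_DIR" || return 1
    # Stage inside CERT_DIR so each mv is a same-filesystem rename.
    stage=$(mktemp -d "$CERT_DIR/.stage.XXXXXX") || return 1
    if tar -xf "$tarball" -C "$stage" \
        && [ -s "$stage/client.crt" ] && [ -s "$stage/client.key" ]; then
        chmod 600 "$stage/client.key"
        # Key first: a crash between the two renames leaves no cert,
        # so the next tick simply fetches again.
        mv -f "$stage/client.key" "$CERT_DIR/client.key" \
            && mv -f "$stage/client.crt" "$CERT_DIR/client.crt"
        rc=$?
    else
        rc=1
    fi
    rm -rf "$stage"
    return "$rc"
}

main() {
    needs_fetch || return 0
    tmp=$(mktemp) || return 0
    if fetch_tarball "$tmp" && install_pair "$tmp"; then
        # Let the shipper pick up the new cert right away.
        systemctl try-restart cis490-shipper.service
    else
        echo "cert fetch failed; retrying on next timer tick" >&2
    fi
    rm -f "$tmp"
    return 0   # never pin the unit as failed on a transient error
}
```

A real script would end with a call to `main`; it is left out here so the functions can be sourced and exercised without touching the network or systemd.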
22 lines
766 B
SYSTEMD
[Unit]
Description=Periodically fetch CIS490 mTLS leaf cert if missing
Documentation=https://maxgit.wg/spectral/CIS490

[Timer]
# Fire 30 seconds after boot. Covers the freshly-installed host
# scenario (operator edits host_id; cert lands within ~30s of next
# boot or timer tick).
OnBootSec=30sec
# Then every 5 minutes after each run, with no cutoff, to handle the
# common "operator just edited host_id" path without making them
# wait. Once the host has its cert the script is a no-op so the
# 5-min cadence is cheap.
OnUnitActiveSec=5min
# Let systemd fire within a 10 s window of the scheduled time
# (the default window is 1 min), so ticks can still be coalesced
# with other wakeups.
AccuracySec=10sec
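# Intended to catch up on a tick missed while the host was off.
# Note: Persistent= only affects OnCalendar= timers, so with the
# monotonic triggers above it is OnBootSec= that covers the
# "host was off" case.
Persistent=true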
Unit=cis490-cert-fetch.service

[Install]
WantedBy=timers.target