Three robustness items off the future-work list:
1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
daemon sends READY=1 after queue construction and WATCHDOG=1 once
per scan pass via a heartbeat callback wired into run_forever.
Restart=on-failure only catches process death; silent stalls
(deadlock, hung tar subprocess, blocked I/O past timeout) used to
leave a hung daemon running while the data backlog grew. Now systemd
kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.
Verified end-to-end against systemd via `systemd-run --transient
--property=Type=notify --property=WatchdogSec=10`: unit transitions
to active on READY=1; SIGSTOP'ing the process triggers
`Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
exactly t+10s, then unit goes failed → restart cycle.
2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
forever as fatal episodes piled up. New ShipperConfig fields:
    quarantine_keep_days = 30             # opt-out: 0 disables
    quarantine_cleanup_interval_s = 3600  # gate so 5s tick doesn't
                                          # statx() the whole tree
Cleanup runs at the start of run_once() but is gated to once per
hour. Removed entries are logged.
3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
journal and surfaces 412/400/transient patterns as red/yellow rows
with the canonical fix command. An on-device agent running
cis490_doctor.py now sees one line ("12 ship(s) rejected as
out-of-window") instead of needing to grep the journal.
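
Sketch of the watchdog handshake from item 1, assuming the daemon talks
to systemd over the NOTIFY_SOCKET datagram socket directly rather than
through a helper library. `run_passes` is a hypothetical, bounded
stand-in for run_forever; the real loop and its callback signature may
differ.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one datagram to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset (not running under Type=notify).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    addr = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), addr)
    return True


def run_passes(scan_once, heartbeat, passes):
    """Hypothetical stand-in for run_forever, bounded so it can be tested."""
    for _ in range(passes):
        scan_once()  # if this hangs, no WATCHDOG=1 is sent and systemd steps in
        try:
            heartbeat()
        except Exception:
            pass  # a failing heartbeat must not stop the scan loop
```

In this sketch the daemon would call `sd_notify("READY=1")` once the
queue exists and pass `lambda: sd_notify("WATCHDOG=1")` as the
heartbeat.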
Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; when both error classes are present, the
doctor prioritises 412 (more actionable).
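
Sketch of the gated quarantine cleanup from item 2, covering the three
behaviours the tests name (keep_days, gate, opt-out). `QuarantineCleaner`
and `maybe_clean` are hypothetical names, and ageing entries by
directory mtime is an assumption; the shipper may key on something else.

```python
import shutil
import time
from pathlib import Path


class QuarantineCleaner:
    def __init__(self, root, keep_days=30, interval_s=3600):
        self.root = Path(root)
        self.keep_days = keep_days
        self.interval_s = interval_s
        self._last_run = None  # timestamp of the last real cleanup pass

    def maybe_clean(self, now=None):
        """Remove expired entries; return the names that were removed."""
        now = time.time() if now is None else now
        if self.keep_days <= 0:
            return []  # opt-out: 0 disables cleanup
        if self._last_run is not None and now - self._last_run < self.interval_s:
            return []  # gated: the frequent tick must not walk the tree
        self._last_run = now
        if not self.root.is_dir():
            return []
        cutoff = now - self.keep_days * 86400
        removed = []
        for entry in sorted(self.root.iterdir()):
            if entry.is_dir() and entry.stat().st_mtime < cutoff:
                shutil.rmtree(entry)
                removed.append(entry.name)  # caller logs these
        return removed
```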
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the post-cutover diagnosis:
1. version_gate: forgejo → local git fallback. If forgejo refresh
returns empty AND a local repo path is configured, retry against
`git log` from the local checkout. The receiver service runs on
the same Pi as forgejo, so a simultaneous restart used to leave
the gate's cache empty and reject every PUT with not-in-window.
Auto-detects /opt/cis490/.git when the operator hasn't set
local_repo_path explicitly — that path is always present on a
production receiver and ProtectSystem=strict still allows reads.
Logs `source=git-fallback` so this isn't silent.
2. shipper/queue: sweep orphaned outbox tarballs. The lifecycle
invariant is `outbox/<id>.tar.zst exists ⇒ episodes/<id>/ exists`
— broken historically by the now-fixed fatal-loop, by operator
`rm` of an episode dir, or by an OS crash between rename(2) and
the post-ship cleanup. Without sweeping, dead bytes pile up
forever. New _sweep_outbox runs at the start of every scan,
bounded by the file count in outbox/.
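
Sketch of the orphan sweep from item 2, assuming the invariant is
checked by stripping the suffix and looking for episodes/<id>/.
`sweep_outbox` is a hypothetical public stand-in for _sweep_outbox, and
the `.partial` suffix for half-written tarballs is an assumption; the
real naming may differ.

```python
from pathlib import Path

TARBALL_SUFFIX = ".tar.zst"
PARTIAL_SUFFIX = ".tar.zst.partial"  # assumed name for half-written tarballs


def sweep_outbox(outbox, episodes):
    """Delete outbox files whose episode directory no longer exists."""
    removed = []
    for path in sorted(Path(outbox).iterdir()):  # one pass, bounded by file count
        if not path.is_file():
            continue
        name = path.name
        if name.endswith(PARTIAL_SUFFIX):
            episode_id = name[: -len(PARTIAL_SUFFIX)]
        elif name.endswith(TARBALL_SUFFIX):
            episode_id = name[: -len(TARBALL_SUFFIX)]
        else:
            continue  # not ours; leave unknown files alone
        if not (Path(episodes) / episode_id).is_dir():
            path.unlink()
            removed.append(name)
    return removed
```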
Tests cover: fallback fires when forgejo unreachable + repo_path set;
no fallback when repo_path None (opt-in); orphan tarball + partial
get swept on the next pass; live tarballs untouched.
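
Sketch of the fallback order those tests exercise, assuming the gate
only needs a list of commit hashes. `refresh_window`,
`resolve_repo_path`, and `commits_from_git` are hypothetical names, and
the limit of 50 commits is illustrative; the real window size is not
stated here.

```python
import subprocess
from pathlib import Path

DEFAULT_REPO = Path("/opt/cis490")


def commits_from_git(repo_path, limit=50):
    """Read recent commit hashes from a local checkout via `git log`."""
    result = subprocess.run(
        ["git", "-C", str(repo_path), "log", f"--max-count={limit}", "--format=%H"],
        check=True, capture_output=True, text=True, timeout=10,
    )
    return result.stdout.split()


def resolve_repo_path(configured, default=DEFAULT_REPO):
    """Use the operator's path if set, else auto-detect the default checkout."""
    if configured is not None:
        return Path(configured)
    return Path(default) if (Path(default) / ".git").exists() else None


def refresh_window(fetch_forgejo, local_repo_path=None, fetch_git=commits_from_git):
    """Return (commits, source); fall back to git only when forgejo is empty."""
    commits = fetch_forgejo()
    if commits:
        return commits, "forgejo"
    if local_repo_path is None:
        return [], "forgejo"  # no fallback configured
    return fetch_git(local_repo_path), "git-fallback"  # caller logs the source
```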
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why services weren't starting after the gate went live:
1. install-lab-host.sh self-copy. The receiver's 400 remediation tells
the agent to `cd /opt/cis490 && git pull && sudo
./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT
and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same
file"; `set -e` aborts before the systemd units install or anything
restarts. Detect the same-dir case and skip the cp; chown still
runs.
2. Services never restart. install-lab-host.sh and install-tier-3-4.sh
both ended by *telling the operator* to restart, then exiting. The
running shipper/orchestrator kept executing pre-gate code from the
old module objects, so new `code_version` stamping never reached an
episode. Both scripts now `systemctl restart` the units they own
when those units are enabled.
3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't
move the episode out of `data/episodes/`. Next scan re-tarred and
re-PUT the same dir, getting 400 again. With 4465+ pre-stamp
episodes on k-gamingcom this burned ~1 PUT/sec across 5+ hours of
receiver log. Fatal episodes now move to data/quarantine/<id>/ with
a quarantine_reason.json beside them; the outbox tarball is
deleted.
4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a
one-shot that scans data/episodes/ and quarantines anything without
a 40-char-hex code_version.commit. Wired into install-lab-host.sh
step 9 so a re-install drains the queue automatically. Idempotent;
safe to run while the shipper is active.
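
Sketch of the stamp check from item 4, assuming each episode directory
carries a JSON metadata file with a nested code_version.commit field.
The file name `meta.json` and the helper `is_stamped` are hypothetical;
tools/quarantine_unstamped.py may read a different file.

```python
import json
import re
from pathlib import Path

# git prints full SHA-1 hashes as 40 lowercase hex characters.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def is_stamped(episode_dir, meta_name="meta.json"):
    """True when the episode has a 40-char-hex code_version.commit."""
    try:
        meta = json.loads((Path(episode_dir) / meta_name).read_text())
    except (OSError, ValueError):
        return False  # missing or unreadable metadata counts as unstamped
    if not isinstance(meta, dict):
        return False
    code_version = meta.get("code_version")
    commit = code_version.get("commit") if isinstance(code_version, dict) else None
    return isinstance(commit, str) and _COMMIT_RE.fullmatch(commit) is not None
```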
Tests cover the queue's new fatal-quarantine path and every drain
behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision).
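
Sketch of the fatal-quarantine path those tests cover, assuming the
reason file is written inside the quarantined directory.
`quarantine_episode` and the JSON shape are hypothetical, and raising on
a name collision is a placeholder for whatever policy queue.py applies.

```python
import json
import shutil
from pathlib import Path


def quarantine_episode(data_root, episode_id, reason):
    """Move a fatal episode out of the scan path and drop its tarball."""
    data_root = Path(data_root)
    src = data_root / "episodes" / episode_id
    dest = data_root / "quarantine" / episode_id
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        raise FileExistsError(dest)  # collision policy is left to the caller
    shutil.move(str(src), str(dest))  # the next scan no longer sees the episode
    (dest / "quarantine_reason.json").write_text(json.dumps({"reason": reason}))
    # The tarball is now dead weight; remove it if it was ever built.
    (data_root / "outbox" / f"{episode_id}.tar.zst").unlink(missing_ok=True)
    return dest
```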
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First-boot bring-up enables cis490-shipper before the Pi has issued the
mTLS leaf, so ssl.create_default_context(cafile=...) raised
FileNotFoundError out of __init__ and systemd crash-looped the unit
every RestartSec=5. Now the transport pre-flights the configured
ca_bundle / client_cert / client_key paths, raises a recoverable
_CertNotReadyError, and ping/ship_tarball retry the build on each
request, so the daemon self-heals without a restart once the cert
lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the deployment loop end-to-end on the CIS490 side:
shipper/
  config.py     ShipperConfig (host_id, paths, receiver endpoint, mTLS)
  transport.py  httpx-based PUT + ping with mTLS + bearer support
  queue.py      scan data/episodes/, tar+zstd via system zstd, ship,
                retire to data/shipped/. Idempotent across crashes per
                the state machine in docs/transport.md.
  __main__.py   CLI: --ping (smoke test), --once (one pass), or daemon
receiver/app.py: new POST /v1/ping that requires the same auth as PUT
/v1/episodes but writes nothing. Used by `cis490-shipper --ping`
during lab-host bring-up to verify the WG/Caddy/mTLS path before
shipping any real bytes.
etc/
  cis490-shipper.service       systemd unit for the lab-host shipper
  cis490-orchestrator.service  systemd unit for the lab-host queue
                               (kept disabled by default until queue
                               mode lands)
  lab-host.toml.example        config template
scripts/
  install-lab-host.sh  idempotent installer; verifies prereqs,
                       creates cis490 service user, syncs repo to
                       /opt/cis490, builds venv, drops systemd units
                       and config template
  install-receiver.sh  same, for the receiver role on the central WG
                       node (Pi5 in our setup)
tests/test_shipper.py  11 end-to-end tests against a real Uvicorn
                       server hosting the receiver app. Exercises
                       ping, tar+ship, idempotent re-ship, 409
                       conflict, transient (receiver down), tarball
                       round-trip via system zstd.
AGENTS.md  guidance for AI agents working on this and sibling repos.
           Headline: when you hit an issue you can't fully fix in
           scope, file a Forgejo issue rather than leaving a TODO.
51/51 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>