Three robustness items off the future-work list:
1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
daemon sends READY=1 after queue construction and WATCHDOG=1 once
per scan pass via a heartbeat callback wired into run_forever.
Restart=on-failure only catches process death; silent stalls
(deadlock, hung tar subprocess, blocked I/O past timeout) used to
leave a hung process alive while the data backlog kept growing. Now
systemd kills and restarts the daemon if no WATCHDOG=1 arrives within
180s.
Verified end-to-end against systemd via `systemd-run --transient
--property=Type=notify --property=WatchdogSec=10`: unit transitions
to active on READY=1; SIGSTOP'ing the process triggers
`Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
exactly t+10s, then unit goes failed → restart cycle.
2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
forever as fatal episodes piled up. New ShipperConfig fields:
    quarantine_keep_days = 30              # opt-out: 0 disables
    quarantine_cleanup_interval_s = 3600   # gate so 5s tick doesn't
                                           # statx() the whole tree
Cleanup runs at the start of run_once() but is gated to once per
hour. Removed entries logged.
3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
journal and surfaces 412/400/transient patterns as red/yellow rows
with the canonical fix command. An on-device agent running
cis490_doctor.py now sees one line ("12 ship(s) rejected as
out-of-window") instead of needing to grep the journal.
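
A rough sketch of the item 3 classification, not the actual doctor code. The regexes, the red/yellow mapping, the result strings, and the names tail_journal/classify are all assumptions for illustration; only the cis490-shipper unit name and the 412-first priority come from this commit.

```python
import re
import subprocess
from typing import List, Optional, Tuple

# Assumed log shapes; the real shipper messages may look different.
# Order matters: the first pattern with a hit wins, so 412 outranks 400.
PATTERNS = [
    ("412", "red", re.compile(r"HTTP 412")),
    ("400", "red", re.compile(r"HTTP 400")),
    ("transient", "yellow",
     re.compile(r"timed out|timeout|connection (?:refused|reset)", re.I)),
]


def tail_journal(unit: str = "cis490-shipper",
                 since: str = "-10min") -> Optional[List[str]]:
    """Return recent journal lines, or None if journalctl is unusable."""
    try:
        proc = subprocess.run(
            ["journalctl", "-u", unit, f"--since={since}",
             "--no-pager", "-o", "cat"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None  # e.g. permission denied
    return proc.stdout.splitlines()


def classify(lines: Optional[List[str]]) -> Tuple[str, str]:
    """Collapse a journal tail into one (colour, summary) row."""
    if lines is None:
        return ("yellow", "journal unreadable (journalctl denied or missing)")
    if not lines:
        return ("green", "no shipper log lines in window")
    for name, colour, rx in PATTERNS:
        hits = sum(1 for line in lines if rx.search(line))
        if hits:
            return (colour, f"{hits} line(s) matched {name}")
    return ("green", "no shipping errors")
```

Usage would be classify(tail_journal()); keeping the parser separate from the journalctl call is what lets each case be tested from plain lists of strings.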
Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; when both error classes are present together,
412 (more actionable) is prioritised.
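
For illustration, the three quarantine cleanup behaviours tested above (keep_days, gate, opt-out) could be sketched like this. QuarantineCleaner and maybe_clean are hypothetical names, and using entry mtime as the age is an assumption; the real logic lives in the shipper and is driven by the ShipperConfig fields.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


class QuarantineCleaner:
    """Hypothetical helper: age out quarantine entries, at most once per interval."""

    def __init__(self, root: Path, keep_days: int = 30,
                 interval_s: float = 3600.0) -> None:
        self.root = root
        self.keep_days = keep_days
        self.interval_s = interval_s
        self._last_run: Optional[float] = None

    def maybe_clean(self, now: Optional[float] = None) -> List[Path]:
        """Remove entries older than keep_days; return what was removed."""
        now = time.time() if now is None else now
        if self.keep_days <= 0:
            return []  # opt-out: 0 disables cleanup
        if self._last_run is not None and now - self._last_run < self.interval_s:
            return []  # gate: the frequent tick must not rescan the tree
        self._last_run = now
        if not self.root.is_dir():
            return []
        cutoff = now - self.keep_days * 86400
        removed: List[Path] = []
        for entry in sorted(self.root.iterdir()):
            if entry.lstat().st_mtime >= cutoff:
                continue  # still inside the retention window
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry, ignore_errors=True)
            else:
                entry.unlink()
            removed.append(entry)  # caller can log these
        return removed
```

Passing `now` explicitly keeps the gate testable without sleeping; in the daemon it would simply be called with no argument at the top of each pass.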
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Unit]
Description=CIS490 lab-host episode shipper
Documentation=https://maxgit.wg/spectral/CIS490
# WG must be up before the shipper can reach the receiver.
After=network-online.target wg-quick@wg0.service
Wants=network-online.target
Requires=wg-quick@wg0.service

[Service]
# Type=notify so systemd waits for sd_notify("READY=1") before
# considering the unit started, and so WatchdogSec= can kick in.
# Without this, Restart=on-failure only catches process crashes;
# silent stalls (deadlock, blocked I/O past timeout, hung tar
# subprocess) leave a zombie running with the data backlog growing.
Type=notify
NotifyAccess=main
WatchdogSec=180
User=cis490
Group=cis490
WorkingDirectory=/opt/cis490
ExecStart=/opt/cis490/.venv/bin/python -m shipper --config /etc/cis490/lab-host.toml
Restart=on-failure
RestartSec=5

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/cis490
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
LockPersonality=true
RestrictNamespaces=true
RestrictRealtime=true
SystemCallArchitectures=native

[Install]
WantedBy=multi-user.target
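
The daemon side of the Type=notify / WatchdogSec=180 contract above can be sketched as follows. This is a minimal stdlib-only illustration, not the shipper's actual code: sd_notify here is a hand-rolled helper that writes to the NOTIFY_SOCKET datagram socket, and run_passes is a hypothetical, bounded stand-in for run_forever.

```python
import logging
import os
import socket
from typing import Callable

log = logging.getLogger("shipper.watchdog")


def sd_notify(state: str) -> bool:
    """Send one state string (e.g. "READY=1") to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by a Type=notify unit, so it still runs from a plain shell.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # "@" marks a Linux abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True


def run_passes(scan_once: Callable[[], None],
               heartbeat: Callable[[], object],
               passes: int) -> None:
    """Hypothetical stand-in for run_forever: one heartbeat per scan pass."""
    for _ in range(passes):
        scan_once()
        try:
            heartbeat()
        except Exception:
            # Assumption: a broken heartbeat must not stop shipping. If
            # pings keep failing, systemd's watchdog fires on its own.
            log.exception("heartbeat callback failed")
```

A caller would send sd_notify("READY=1") once the queue is built, then pass `lambda: sd_notify("WATCHDOG=1")` as the heartbeat. Because the ping only happens after a completed scan pass, a stalled pass stops the pings and lets WatchdogSec expire.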