CIS490/tools
max f9b2e5c4e6 shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors
Three robustness items off the future-work list:

1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
   daemon sends READY=1 after queue construction and WATCHDOG=1 once
   per scan pass via a heartbeat callback wired into run_forever.
   Restart=on-failure only catches process death — silent stalls
   (deadlock, hung tar subprocess, blocked I/O past timeout) used to
   leave a zombie running with the data backlog growing. Now systemd
   kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.

   Verified end-to-end against systemd via `systemd-run --transient
   --property=Type=notify --property=WatchdogSec=10`: unit transitions
   to active on READY=1; SIGSTOP'ing the process triggers
   `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
   exactly t+10s, then unit goes failed → restart cycle.

2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
   forever as fatal episodes piled up. New ShipperConfig fields:
     quarantine_keep_days = 30           # opt-out: 0 disables
     quarantine_cleanup_interval_s = 3600 # gate so 5s tick doesn't
                                          # statx() the whole tree
   Cleanup runs at the start of run_once() but is gated to once per
   hour. Removed entries logged.

3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
   journal and surfaces 412/400/transient patterns as red/yellow rows
   with the canonical fix command. An on-device agent running
   cis490_doctor.py now sees one line ("12 ship(s) rejected as
   out-of-window") instead of needing to grep the journal.

Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; both error classes prioritise 412 (more
actionable) when present together.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:02:59 -05:00
..
auto_fetch_samples.py auto_fetch_samples: pick Linux i386 ELF; manifest matches theZoo 2026-05-01 03:28:26 -05:00
build_cidata.py Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
cis490_doctor.py shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors 2026-05-01 12:02:59 -05:00
fetch_sample.py Close out the open issues: bridge pcap wiring, perf collector, Tier-4 2026-04-30 00:17:49 -05:00
index_backfill.py Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed) 2026-05-01 01:38:50 -05:00
index_reader.py Close out the deployment-readiness gaps 2026-04-30 00:31:55 -05:00
load_mimic.py Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
plot_envelope.py Close out the deployment-readiness gaps 2026-04-30 00:31:55 -05:00
prune_episodes.py Multi-signal prune classifier: rescue valid episodes /proc misses 2026-04-30 19:10:01 -05:00
quarantine_unstamped.py fix: lab-host install loop after commit-gate cutover 2026-05-01 11:36:21 -05:00
run_envelope_demo.py Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
run_fleet.py fleet: rotate exploit modules per (host, slot, ep); Tier 3 by default 2026-04-30 02:22:49 -05:00
run_real_vm_demo.py runners: take savevm baseline-v1 after boot so revert_at_* actually works 2026-04-30 02:37:05 -05:00
run_tier3_demo.py runners: take savevm baseline-v1 after boot so revert_at_* actually works 2026-04-30 02:37:05 -05:00
show_envelope.sh Interactive envelope plot via WebAgg (browser-based) 2026-04-29 00:06:22 -06:00
verify_tier3_local.py tools/verify_tier3_local.py: Pi-runnable Tier-3 verifier 2026-05-01 03:41:21 -05:00
vm_load_controller.py Fix workload-silent false-positive on Alpine busybox guests (closes #15) 2026-04-30 17:28:48 -05:00
vm_serial.py Tier 2: real Alpine VM, real workload, real envelope 2026-04-29 08:38:53 -06:00