CIS490/shipper/config.py
max f9b2e5c4e6 shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors
Three robustness items off the future-work list:

1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
   daemon sends READY=1 after queue construction and WATCHDOG=1 once
   per scan pass via a heartbeat callback wired into run_forever.
   Restart=on-failure only catches process death; silent stalls
   (deadlock, hung tar subprocess, blocked I/O past timeout) used to
   leave a hung daemon running while the data backlog grew. Now systemd
   kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.

   Verified end-to-end against systemd via `systemd-run --transient
   --property=Type=notify --property=WatchdogSec=10`: unit transitions
   to active on READY=1; SIGSTOP'ing the process triggers
   `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
   exactly t+10s, then unit goes failed → restart cycle.

2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
   forever as fatal episodes piled up. New ShipperConfig fields:
     quarantine_keep_days = 30           # opt-out: 0 disables
     quarantine_cleanup_interval_s = 3600 # gate so 5s tick doesn't
                                          # statx() the whole tree
   Cleanup runs at the start of run_once() but is gated to at most
   once per hour. Removed entries are logged.

3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
   journal and surfaces 412/400/transient patterns as red/yellow rows
   with the canonical fix command. An on-device agent running
   cis490_doctor.py now sees one line ("12 ship(s) rejected as
   out-of-window") instead of needing to grep the journal.
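
The classification in item 3 might look roughly like the sketch below.
It is an illustration only, not the code in cis490_doctor.py: the
shipper's real log format is not shown in this commit, so the regexes,
the classify_journal name, the (severity, message) row shape, and the
choice of red for rejections and yellow for transient or unreadable
journals are all assumptions. Only the categories and the "ship(s)
rejected as out-of-window" wording come from the text above.

```python
import re
from collections import Counter
from typing import Optional, Tuple

# Hypothetical log patterns; the real shipper log lines may differ.
PATTERNS = {
    "412": re.compile(r"\bHTTP 412\b"),
    "400": re.compile(r"\bHTTP 400\b"),
    "transient": re.compile(r"\bHTTP 5\d\d\b|timed out|connection refused", re.I),
}


def classify_journal(text: Optional[str]) -> Tuple[str, str]:
    """Return (severity, message) for a window of shipper journal output.

    Pass None when journalctl could not be read (e.g. permission denied).
    """
    if text is None:
        return ("yellow", "journal unreadable (journalctl denied)")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ("green", "no shipper log lines in window")

    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # count each line under one category only

    # 412 is reported first: it is the more actionable rejection.
    if counts["412"]:
        return ("red", f"{counts['412']} ship(s) rejected as out-of-window")
    if counts["400"]:
        return ("red", f"{counts['400']} ship(s) rejected as bad request")
    if counts["transient"]:
        return ("yellow", f"{counts['transient']} transient ship failure(s)")
    return ("green", "no shipping errors in window")
```

A caller would feed this the captured output of a journalctl query for
the cis490-shipper unit and print the returned row.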

Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; when both error classes are present together,
412 takes priority (more actionable).
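
For reference, the watchdog heartbeat described in item 1 can be
sketched without any systemd bindings, since sd_notify is just a
datagram sent to the Unix socket named in $NOTIFY_SOCKET. The sketch
below is a minimal illustration under assumptions: the run_forever
signature, the max_passes parameter (added so the loop can terminate in
a test), and the helper names are hypothetical and not taken from the
shipper source.

```python
import os
import socket
import time
from typing import Callable, Optional


def sd_notify(state: str) -> bool:
    """Send one notification datagram; return False when not under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def heartbeat() -> None:
    """Tell systemd the daemon is still making progress."""
    sd_notify("WATCHDOG=1")


def run_forever(run_once: Callable[[], None],
                heartbeat: Callable[[], None],
                scan_interval_s: float = 5.0,
                max_passes: Optional[int] = None) -> None:
    """Hypothetical scan loop: READY=1 once, then one heartbeat per pass."""
    sd_notify("READY=1")
    passes = 0
    while max_passes is None or passes < max_passes:
        run_once()
        try:
            heartbeat()
        except Exception:
            # A failing heartbeat must not stop shipping; if it keeps
            # failing, the systemd watchdog will restart the daemon.
            pass
        passes += 1
        time.sleep(scan_interval_s)
```

Because the heartbeat only fires after run_once() returns, a pass that
hangs stops the WATCHDOG=1 stream and lets WatchdogSec expire.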

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:02:59 -05:00


"""Lab-host shipper config, loaded from /etc/cis490/lab-host.toml."""
from __future__ import annotations

import tomllib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class ReceiverEndpoint:
    url: str  # e.g. "https://collector.wg"
    ca_bundle: Path | None = None
    client_cert: Path | None = None
    client_key: Path | None = None
    bearer_token: str | None = None
    verify_tls: bool = True


@dataclass(frozen=True)
class ShipperConfig:
    host_id: str
    data_root: Path  # Lab-host data root; episodes/, outbox/, shipped/ live here.
    receiver: ReceiverEndpoint

    # Daemon mode: how often to scan for new done.marker files.
    scan_interval_s: float = 5.0

    # PUT timeout per episode. Tarballs are bounded by max_episode_bytes;
    # at WG speeds this is well under 60s for a typical episode.
    request_timeout_s: float = 60.0

    # Backoff schedule on transient (5xx / network) failures, in seconds,
    # capped at the last entry. The shipper's scan loop will pick the
    # episode up again on the next pass regardless.
    backoff_seconds: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 120.0, 300.0)

    # Local retention before pruning data/shipped/.
    keep_local_for_days: int = 7

    # Quarantine retention. Episodes that the receiver permanently
    # rejected (400/412) sit here as evidence; without an upper bound
    # they grow forever. Set to 0 to disable cleanup (operator
    # responsibility).
    quarantine_keep_days: int = 30

    # How often the quarantine cleanup pass actually runs. Gated
    # because a 5-second scan tick checking mtimes against a
    # 30-day-old cutoff is wasteful; once an hour is plenty.
    quarantine_cleanup_interval_s: float = 3600.0

    @property
    def episodes_dir(self) -> Path:
        return self.data_root / "episodes"

    @property
    def outbox_dir(self) -> Path:
        return self.data_root / "outbox"

    @property
    def shipped_dir(self) -> Path:
        return self.data_root / "shipped"

    @property
    def quarantine_dir(self) -> Path:
        # Episodes the receiver has refused permanently (4xx other than
        # 409; typically 400 missing-commit or 412 not-in-window). They
        # don't belong in shipped/ (we have nothing to compare against)
        # and re-shipping them would just re-burn the queue.
        return self.data_root / "quarantine"

    @classmethod
    def load(cls, path: str | Path) -> "ShipperConfig":
        with open(path, "rb") as f:
            data = tomllib.load(f)

        host_id = data.get("host_id")
        if not isinstance(host_id, str) or not host_id:
            raise ValueError("lab-host config: host_id (string) required at top level")

        paths = data.get("paths", {})
        data_root = Path(paths.get("data_root", "/var/lib/cis490/data")).resolve()

        rcv = data.get("receiver", {})
        url = rcv.get("url")
        if not isinstance(url, str) or not url:
            raise ValueError("lab-host config: receiver.url required")
        receiver = ReceiverEndpoint(
            url=url.rstrip("/"),
            ca_bundle=_optional_path(rcv.get("ca_bundle")),
            client_cert=_optional_path(rcv.get("client_cert")),
            client_key=_optional_path(rcv.get("client_key")),
            bearer_token=rcv.get("bearer_token"),
            verify_tls=bool(rcv.get("verify_tls", True)),
        )

        retention = data.get("retention", {})
        return cls(
            host_id=host_id,
            data_root=data_root,
            receiver=receiver,
            scan_interval_s=float(data.get("shipper", {}).get("scan_interval_s", 5.0)),
            request_timeout_s=float(data.get("shipper", {}).get("request_timeout_s", 60.0)),
            keep_local_for_days=int(retention.get("keep_local_for_days", 7)),
            quarantine_keep_days=int(retention.get("quarantine_keep_days", 30)),
        )


def _optional_path(v: object) -> Path | None:
    if v in (None, ""):
        return None
    if isinstance(v, str):
        return Path(v).expanduser()
    raise TypeError(f"expected path string, got {type(v).__name__}")
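
The two quarantine settings above (quarantine_keep_days and
quarantine_cleanup_interval_s) could drive a gated cleanup pass along
the following lines. This is a sketch under assumptions, not the
shipper's actual cleanup code: the QuarantineCleaner class, its
maybe_cleanup method, and the use of entry mtime as the age signal are
hypothetical. A single `now` timestamp is used for both the gate and
the age cutoff to keep the example easy to test.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


class QuarantineCleaner:
    """Hypothetical helper: prune old quarantine entries, at most once per interval."""

    def __init__(self, quarantine_dir: Path, keep_days: int = 30,
                 interval_s: float = 3600.0) -> None:
        self.quarantine_dir = quarantine_dir
        self.keep_days = keep_days
        self.interval_s = interval_s
        self._last_run: Optional[float] = None

    def maybe_cleanup(self, now: Optional[float] = None) -> List[Path]:
        """Return the entries removed by this call (empty if gated or disabled)."""
        now = time.time() if now is None else now
        if self.keep_days <= 0:
            return []  # 0 disables cleanup entirely
        if self._last_run is not None and now - self._last_run < self.interval_s:
            return []  # gate: a pass already ran within the interval
        self._last_run = now
        if not self.quarantine_dir.is_dir():
            return []

        cutoff = now - self.keep_days * 86400
        removed: List[Path] = []
        for entry in sorted(self.quarantine_dir.iterdir()):
            if entry.stat().st_mtime < cutoff:
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()
                removed.append(entry)
        return removed
```

Calling maybe_cleanup() at the start of every scan pass stays cheap,
because all calls inside the interval return before touching the
filesystem; the caller can log the returned paths.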