17 commits

Author SHA1 Message Date
Max Gorog
bfb1c491f8 PIPELINE.md is canonical; rewrite AGENTS.md; delete FIXYOURSELF.md
PIPELINE.md is the canonical plan for the data-collection / emulation
/ labelling pipeline. It supersedes any guidance in AGENTS.md,
README.md, or other repo docs that contradicts it (§17). Future
sessions read it before changing anything in the pipeline.

AGENTS.md is rewritten to point at PIPELINE.md as canonical and to
strip the prescriptive symptom→fix table that absorbed producer-side
defects instead of fixing them (§7.1 compensating-layer pattern).

FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The
states it covered are made impossible by the §4.6 acceptance gate
landing later in §5; recovering from a state that shouldn't exist is
itself the bandaid we're removing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:04:43 -05:00
max
3180f7b5ac lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry
k-gamingcom symptom (2026-05-02): the on-device agent successfully
finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS
material" because the cert auto-fetch step in install-lab-host.sh
either ran with host_id still REPLACE_ME, or hit a transient
bootstrap.wg failure, and there's no automatic retry. The Pi-side
cert IS minted and the bootstrap endpoint serves it — the failure
mode is purely "lab-host hasn't pulled it down."

Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh
(idempotent, no-op when certs are already on disk, no-op when host_id
is unset, exit-0 on transient network failure so the unit doesn't
get pinned as failed), and run it from a 5-minute systemd timer.
The timer handles all three "stuck waiting on mTLS" cases without
operator action:

  - operator edited host_id post-install but didn't re-run install
  - bootstrap.wg was briefly unreachable during install
  - lab host was offline when install ran but came up later

The script `try-restart`s cis490-shipper after a successful fetch
so the daemon picks up the new cert immediately instead of waiting
for its lazy retry. install-lab-host.sh still calls the script
on install for fast first-time bring-up — the timer is the safety
net.

Tarball extract is staged through a temp dir + atomic rename so a
mid-extract crash never leaves us with a mismatched cert/key pair.
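
The temp-dir-plus-rename staging described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/fetch-lab-host-cert.sh (which is a shell script). The function name `extract_atomically` and its arguments are hypothetical, and the sketch assumes the destination directory does not exist yet, which matches the script being a no-op when certs are already on disk.

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_atomically(tarball: Path, dest: Path) -> None:
    """Extract `tarball` so that `dest` is either fully populated or absent.

    Hypothetical helper: assumes `dest` does not exist yet.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to dest so the final rename stays on one filesystem,
    # which is what makes the rename atomic.
    staging = Path(tempfile.mkdtemp(prefix=dest.name + ".", dir=dest.parent))
    try:
        with tarfile.open(tarball) as tar:
            # The archive is assumed to come from a trusted endpoint.
            tar.extractall(staging)
        # A crash before this line leaves only the staging dir behind;
        # dest never holds a half-written cert/key pair.
        os.rename(staging, dest)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

A reader of `dest` therefore sees either no directory at all or a complete cert and key that came from the same tarball.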

AGENTS.md row 4 updated: "waiting on mTLS material" remediation now
points at the timer, with the exact
`systemctl start cis490-cert-fetch.service` command to force an
immediate retry.

Tests: 267/267 unchanged. The fetch script is idempotent + has all
its happy/error paths handled inline; a unit test would mostly be
testing systemd's behaviour. The integration test path is the timer
running on a real lab host, which is the actual production case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:30:16 -05:00
max
d1e1b1132d FIXYOURSELF.md: explicit decision tree for stuck lab hosts
The auto-update timer (98dcd4f) covers the routine case of a host
falling behind origin/main. It deliberately refuses to fast-forward
when local HEAD isn't an ancestor of origin/main — the right call
for safety, but it leaves on-device agents with no automatic path
out when they (or an operator) made a local commit.

That's exactly the elliott-thinkpad incident: ~31,738 episodes
shipped over 19 hours, all stamped with local commit 5568d77 that
isn't on origin/main, all 412'd. Auto-update can't fix it; the
on-device agent had no doc telling it what to do.

FIXYOURSELF.md is that doc. Pure decision tree, six branches
(behind / diverged / no-network / no-git / dirty-tree / clean) each
with verbatim commands and the order to try them. The diverged-HEAD
branch (§B) is the elliott-thinkpad case and offers three resolutions
(push, reset, file-issue-and-wait) so an agent that doesn't have
push permission isn't backed into discarding work.

Linked from the AGENTS.md top-of-file symptom table so a smaller
model finds it without having to know the filename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:53:16 -05:00
max
98dcd4f9f8 lab-host: cis490-autoupdate.timer for self-healing on push
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required
`git pull && install-lab-host.sh` *on the host*; neither the on-device
AI agent nor the operator pulled in time, and from the receiver Pi I
cannot reach in (sshd off on the lab hosts).

Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.

Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
  to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
  protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
  failure); divergence + install failures exit non-zero so the
  journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
  when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
  through bring-up; no special "auto" flow.
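
The fast-forward guard and exit-code policy listed above can be sketched in Python. scripts/auto-update.sh itself is a shell script, so this only illustrates the decision logic; the names `Action`, `is_ancestor`, `decide`, and `exit_status` are hypothetical. The one external fact relied on is that `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B.

```python
import subprocess
from enum import Enum


class Action(Enum):
    UP_TO_DATE = "up-to-date"
    FAST_FORWARD = "fast-forward"
    DIVERGED = "diverged"


def is_ancestor(repo: str, older: str, newer: str) -> bool:
    """True when `older` is an ancestor of `newer` in the repo at `repo`."""
    # Exit 0 means "is an ancestor"; any other status (1 = not an
    # ancestor, 128 = git error) is treated as "no" in this sketch.
    proc = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", older, newer],
        check=False,
    )
    return proc.returncode == 0


def decide(head: str, upstream: str, head_is_ancestor: bool) -> Action:
    """Pick the update action from the two commit ids and the ancestry check."""
    if head == upstream:
        return Action.UP_TO_DATE
    # Only fast-forward when local HEAD is strictly behind upstream;
    # a local commit (ahead or diverged) must not be overwritten.
    return Action.FAST_FORWARD if head_is_ancestor else Action.DIVERGED


def exit_status(action: Action) -> int:
    """Divergence exits non-zero so the journal records it."""
    return 1 if action is Action.DIVERGED else 0
```

Keeping `decide` free of subprocess calls makes the policy easy to test without a git checkout.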

The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.

install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.

Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:59:31 -05:00
max
20ff76c1e0 AGENTS.md: prescriptive symptom→command table for on-device agents
Smaller models running on lab hosts read AGENTS.md top-to-bottom and
need explicit if-this-then-that. Restructure to put a decision-tree
table at the very top mapping every realistic symptom to the exact
command to run (verbatim — no paraphrasing instruction). Adds an
unambiguous HARD RULES list.

Also fixes accumulated drift:

- Tier-4 section had two contradictory descriptions (theZoo flow +
  legacy MalwareBazaar flow). Removed the MalwareBazaar paragraphs;
  the table's MALWAREBAZAAR_API_KEY env var is gone (theZoo needs no
  auth). The "DO NOT push API key" bullet was about a flow that no
  longer exists.

- Canonical bring-up step 6 said the Metasploitable2 download was
  "registration-walled" requiring an operator-supplied URL+sha256.
  Not true since the SourceForge mirror + TOFU pinning fix —
  install-lab-host.sh handles it. Removed the manual step entirely
  and noted Tier-3+4 are part of step 1.

- The "Three install bugs in 95ac56a" historical table was churn that
  doesn't help current agents. Replaced with a generic
  "outdated-clone? pull main and re-run install-lab-host.sh" block
  that explicitly enumerates what the install script does (VERSION
  stamp, queue drain, daemon-reload+restart, watchdog).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:19:53 -05:00
max
f294e97875 AGENTS.md: how to recover from 400/412 commit-rejected loops
Smaller models running as on-device agents need a direct, prescriptive
remediation block for the gate-failure modes — the receiver's response
body is good but only visible if the agent reads journalctl carefully.
Document the exact sequence (git pull → install-lab-host.sh) and what
the install script now does on its own (drain pre-stamp queue, restart
services). Also calls out the two anti-patterns we don't want agents
trying: silencing the shipper to stop log noise, or fabricating a
code_version field to bypass the gate.
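
As a rough illustration of how a shipper could tell these gate failures apart from transient ones, here is a minimal sketch. It is not the shipper's actual code: `Outcome` and `classify` are hypothetical names, and treating 5xx responses or a missing response as "retry later" is an assumption. Only the 400/412 gate rejections come from the text above.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SHIPPED = "shipped"
    GATE_REJECTED = "gate-rejected"  # fix: git pull, then install-lab-host.sh
    RETRY_LATER = "retry-later"
    UNCLASSIFIED = "unclassified"


def classify(status: Optional[int]) -> Outcome:
    """Map a PUT response status (None = no response) to an outcome."""
    if status is None:
        # Assumption: an unreachable receiver is transient.
        return Outcome.RETRY_LATER
    if 200 <= status < 300:
        return Outcome.SHIPPED
    if status in (400, 412):
        # Retrying the same bytes cannot succeed; the host's code
        # version has to change first.
        return Outcome.GATE_REJECTED
    if status >= 500:
        # Assumption: server-side errors are transient.
        return Outcome.RETRY_LATER
    return Outcome.UNCLASSIFIED
```

Separating `GATE_REJECTED` from `RETRY_LATER` is what avoids a tight re-PUT loop on episodes the gate will never accept.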

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:46:04 -05:00
max
265f3ad313 Tier-4 sample source: theZoo (no auth, no operator action)
Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo).
theZoo is a public security-research repo with hundreds of malware
samples organized by family, password-protected with the well-known
password 'infected'. No API key, no signup, nothing for an operator to
do, which is what zero-touch tier-4 actually means.

Changes:

- tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB)
  to /var/lib/cis490/theZoo on first run, then for each manifest
  family without a sha256 it locates a matching Binaries/<Name>
  dir, extracts the .zip with password 'infected', picks the largest
  non-text payload as the binary, sha256s it, stages at
  samples/store/<sha256>, and rewrites manifest.toml in place
  (atomic tempfile + os.replace, stat preserved). Exit semantics are
  mandatory: non-zero if no real samples landed.

- scripts/install-tier-3-4.sh: dropped the MB-key resolution chain
  (env var → local file → bootstrap.wg fetch). Now just runs
  auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4
  remains as the explicit override but is documented as defeating
  the project.

- bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service:
  removed the /v1/secret/<name> endpoint and the --secrets-root flag.
  Dead code now that no API key needs distributing. Live-rolled
  back on the Pi (404 verified post-restart, stale /etc/cis490/secrets
  dir removed).

- scripts/set-malwarebazaar-key.sh: deleted. No MB key means no
  one-time operator step.

- tests/test_bootstrap_secrets.py: deleted (route removed).

- AGENTS.md: rewrote tier-4 section to reflect zero-operator model.
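
The "largest non-text payload, sha256, stage at samples/store/<sha256>" step can be sketched as below. This is a minimal illustration rather than the code in tools/auto_fetch_samples.py: `looks_binary`, `pick_payload`, and `stage_sample` are hypothetical names, and the NUL-byte check used to decide what counts as "non-text" is an assumption.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def looks_binary(path: Path) -> bool:
    """Assumed heuristic: a NUL byte in the first 4 KiB means non-text."""
    with path.open("rb") as fh:
        return b"\x00" in fh.read(4096)


def pick_payload(extracted: Path) -> Optional[Path]:
    """Return the largest non-text file under `extracted`, or None."""
    candidates = [
        p for p in extracted.rglob("*") if p.is_file() and looks_binary(p)
    ]
    return max(candidates, key=lambda p: p.stat().st_size, default=None)


def stage_sample(payload: Path, store: Path) -> str:
    """Copy `payload` to store/<sha256> and return the hex digest."""
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(payload, store / digest)
    return digest
```

Naming the staged file after its own digest means a later reader can verify the sample without consulting the manifest.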

148/148 tests pass. Bootstrap service rolled back live.
2026-05-01 01:17:50 -05:00
max
5d0e8e33a9 Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:

1. bootstrap.wg /v1/secret/<name> endpoint
   - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
     over the same trust boundary as the cert endpoint (WG mesh,
     iptmonads-gated). Strict allow-list — only `malwarebazaar`
     resolves; everything else 404s. Secret returned as bare text
     with Cache-Control: no-store. Live-verified on the Pi.
   - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
     200 with token, 404 unknown name, 500 on empty file.

2. install-tier-3-4.sh: Tier 4 is no longer optional
   - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
     → https://bootstrap.wg/v1/secret/malwarebazaar.
   - Caches the bootstrap-fetched key locally so re-runs are offline.
   - If all three resolution paths fail, dies with the exact
     remediation command for the operator (one-time set-malwarebazaar-key.sh
     on the Pi).
   - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
     works for emergency overrides but logs a warning that the host
     will produce only mimics). Deploy fails if zero binaries land
     in samples/store/ — no silent mimic-only fallback.
   - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.

3. scripts/set-malwarebazaar-key.sh
   - Pi-side helper: one operator command per fleet, ever. Accepts
     key via env or stdin, validates length, drops at the right
     path with the right perms. Lab hosts pull the rest automatically.
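
The three-step key resolution in item 2 (env var, then local cache file, then bootstrap fetch, with the fetched key cached) can be sketched like this. It is an illustration only, since install-tier-3-4.sh is a shell script: `resolve_key` is a hypothetical name, and the network fetch is passed in as a callable so the sketch stays self-contained.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_key(
    env: Mapping[str, str],
    cache_file: Path,
    fetch: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the API key from the first source that has one, else None."""
    # 1. Environment variable wins.
    key = env.get("MALWAREBAZAAR_API_KEY", "").strip()
    if key:
        return key
    # 2. Locally cached token, so re-runs work offline.
    if cache_file.is_file():
        key = cache_file.read_text().strip()
        if key:
            return key
    # 3. Fetch from the bootstrap endpoint and cache the result.
    key = (fetch() or "").strip()
    if key:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(key + "\n")
        cache_file.chmod(0o600)  # the token is a secret
        return key
    return None
```

A `None` result corresponds to the case where the installer dies with the remediation command.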

AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.

152/152 tests pass. Bootstrap service updated live on the Pi.
2026-05-01 00:44:41 -05:00
max
683bfe9ce6 Tier 3 + Tier 4 auto-deploy: zero operator interaction
Replaces the manual runbook with scripts that just work. install-lab-host.sh
now runs the full Tier-3 deploy automatically as its 8th step (after the
mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set.

Changes:

- install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when
  metasploit-framework isn't present (was: bail with "install manually").
  apt-get and dnf paths both go through the same omnibus script with
  DEBIAN_FRONTEND=noninteractive. Idempotent.

- fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL
  (https://downloads.sourceforge.net/project/metasploitable/...) so no
  operator URL is required. sha256 is now optional and TOFU-pinned —
  first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256;
  subsequent runs verify against that. Skips if qcow2 already present.

- scripts/install-tier-3-4.sh (new): orchestrates the four steps
  (msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional
  Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4
  env knobs for partial deploys.

- tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set,
  queries MB by each manifest entry's `family` (signature match), pulls
  the first match via fetch_sample.py, and rewrites manifest.toml in
  place (atomic tempfile + os.replace, preserving stat). Skips entries
  that already have sha256.

- install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh
  automatically when mTLS certs are on disk. --skip-tier3 flag for
  operators who want Tier 2 only. Skipped silently before certs land
  so first-pass install (host_id=REPLACE_ME) still works.

- AGENTS.md: rewrote the Tier-3 section to point at the one-shot
  script. Removed the old multi-command runbook so on-device agents
  can't accidentally follow stale steps.
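
The TOFU (trust-on-first-use) pinning that fetch-metasploitable2.sh applies to the image hash can be sketched as follows. This is a minimal Python illustration of the idea, not the shell script itself; `sha256_of` and `verify_tofu` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large images do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tofu(image: Path, pin_file: Path) -> bool:
    """Record the hash on first use; afterwards require an exact match."""
    digest = sha256_of(image)
    if not pin_file.exists():
        # First run: trust what we downloaded and pin it.
        pin_file.write_text(digest + "\n")
        return True
    recorded = pin_file.read_text().split()
    return bool(recorded) and recorded[0] == digest
```

TOFU only detects changes after the first download; it cannot tell whether the first download was itself the intended image.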

Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key
present) from a single sudo invocation. No operator picks for image
URLs, no manual metasploit installs, no manual manifest edits.
2026-04-30 23:12:08 -05:00
max
02b9d0a645 Tier 3 + Tier 4 deploy runbook in AGENTS.md
Repo has all the code paths for Tier 3 (real exploit fire via msfrpcd)
and Tier 4 (real malware execution via chunked upload), but neither
lab host has run a single Tier-3 episode because msfrpcd and the
Metasploitable2 image aren't deployed there. 3009 episodes in flight
to date are all Tier 2 (mimic workloads in clean Alpine), which is
useful pipeline-validation data but cannot answer the actual research
question.

This commit makes the deploy push-button:

- AGENTS.md: new "Tier 3 + Tier 4 deploy" section listing the three
  prereqs (install-msfrpcd.sh, fetch-metasploitable2.sh, setup_bridge.sh),
  the foreground verify command (run_tier3_demo.py), and the Tier-4
  promotion path (MB API key → fetch_sample.py → manifest edit →
  orchestrator restart).

- samples/manifest.toml: clearer per-entry comment showing the
  4-step sha256 → real-binary promotion path. Replaces the earlier
  "TBD" placeholder which suggested a single edit unlocks Tier 4
  when in fact you need to fetch the binary too.

The fleet runner already auto-detects msfrpcd (orchestrator/fleet.py
_msfrpcd_available()); once the lab-host operator-AI lands the
prereqs, episodes flip to Tier 3 with no orchestrator config change.
Tier 4 follows automatically the next time the deterministic
selector picks a sample whose sha256 file exists in samples/store/.
2026-04-30 22:57:23 -05:00
max
321ea63803 Multi-signal prune classifier: rescue valid episodes /proc misses
A laptop-class lab host (elliott-thinkpad) running 14 parallel fleet
slots can't deliver host /proc CPU% signal for the bursty profiles —
the per-VM share gets buried under contention. But the workloads ARE
running: qmp blockstats record 90+ MB written during infected_running
for io-walk episodes, netflow shows real packet bursts for
scan-and-dial, and the in-guest agent (when alive) shows load_1m
deltas the host can't see.

The classifier now cross-checks four sources before flagging an
episode:
  - /proc CPU% medians (host-side qemu)
  - netflow byte totals (bridge_pcap)
  - qmp blockstats per-phase DELTA (cumulative counters; deltas
    matter, not raw values)
  - guest-agent load_1m

An episode is flagged only if every available source agrees there is
no inter-phase signal. Missing sources are "unknown", not "flat".
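
The cross-check rule can be written down as a small predicate. This is a sketch, not the code in prune_episodes.py: `should_flag` and the source keys are hypothetical, and declining to flag when no source at all is available is an assumption the text does not settle.

```python
from typing import Mapping, Optional


def should_flag(signals: Mapping[str, Optional[bool]]) -> bool:
    """Decide whether an episode is flagged as having no workload signal.

    Each value is True (source saw inter-phase signal), False (source
    was flat), or None (source missing, i.e. "unknown").
    """
    available = [seen for seen in signals.values() if seen is not None]
    if not available:
        # Assumption: with nothing to go on, do not flag.
        return False
    # Flag only when every available source is flat.
    return not any(available)
```

For example, a flat /proc reading combined with real netflow bursts yields `False`, so the episode is kept.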

Time-base bug also fixed: phase mapping now uses t_wall_ns (which
all sources stamp from CLOCK_REALTIME) rather than t_mono_ns —
netflow uses qemu boot-monotonic, /proc uses orchestrator-relative,
they don't share a number line.

Result on the live receiver:
  - 1067 active episodes, 100% kept under the new logic
  - 143 episodes rescued from a previous false-positive archive
  - Only the 9 genuinely-broken pre-Sample-propagation elliott-lab
    episodes remain archived (no-sample + no-workload-events)

Two new tests (test_flat_proc_rescued_by_netflow,
test_flat_everywhere_still_flags) pin the boundary so a future
regression surfaces immediately.

AGENTS.md gains a "classifier is multi-source" section explaining
the cross-check and the t_wall_ns invariant.
2026-04-30 19:10:01 -05:00
max
2707709299 Fix workload-silent false-positive on Alpine busybox guests (closes #15)
On-device agent (k-gamingcom) ran the diagnostic probe sequence and
proved the workload IS running on Alpine — yes saturating the vCPU,
loadavg=1.05, three yes PIDs visible — but two busybox incompatibilities
made every episode look silent:

1. _probe() used `pgrep -c yes`. The -c flag is a procps-ng feature
   that busybox pgrep lacks. busybox pgrep exits 1 with a usage banner;
   the `|| echo 0` fallback then reported yes=0 every time. Switched to
   `pgrep yes | wc -l`, which both pgrep variants support.

2. _wrap_loop appended `disown` after the nohup-backgrounded script.
   busybox sh / ash have no disown builtin, so each infected_running
   phase printed `sh: disown: not found` into run()'s captured output.
   The script kept running (nohup gives SIGHUP immunity, which is
   what disown was for), but the spurious error is now gone.

Cross-validation in the classifier:
- prune_episodes.py: workload-silent now requires the probe AND
  host-side /proc CPU envelope (flat-cpu) to AGREE. A probe-only zero
  is treated as the busybox false-positive and dropped. This means
  the 244 already-on-disk episodes from elliott-thinkpad and
  k-gamingcom are correctly classified without re-collecting.

Test coverage:
- test_workload_silent_flag updated to require both signals
- test_workload_silent_suppressed_when_host_cpu_real new regression
  for the busybox false-positive

AGENTS.md gains a "Don't trust the in-guest probe alone" section with
the busybox-vs-procps gotcha + a list of busybox-incompatible patterns
to avoid in any new in-guest diagnostic.
2026-04-30 17:28:48 -05:00
max
f6d7d07837 Make mTLS bring-up unmistakable for on-device agents
Sysadmin observed lab-host agents still trying to "secure the
connection" — minting certs, generating CSRs, or otherwise reinventing
a cert-delivery flow that's already automated through bootstrap.wg.
Three reinforcements so an agent reading any of the three surfaces
(AGENTS.md, install script output, journalctl) gets the same message:

- AGENTS.md gains a top-of-file "do not mint your own certs" callout
  + a dedicated "Securing the connection (mTLS)" section with the
  one fix (re-run install-lab-host.sh after setting host_id) and an
  explicit "what NOT to do" list (no openssl, no copy from another
  host, no verify_tls=false).

- install-lab-host.sh's FIRST-INSTALL NEXT STEPS now spells out that
  the cert auto-fetch is silently skipped while host_id is REPLACE_ME,
  and that the operator MUST re-run the script after editing host_id.
  Step 2 is now "RE-RUN THIS SCRIPT" with a DO NOT openssl warning.

- The shipper's "waiting on mTLS material" warning now embeds the
  exact remediation command + a pointer to AGENTS.md, so an agent
  reading journalctl without ever opening the repo still gets it.

Tests: 12/12 in test_shipper still pass; warning string change is
not asserted on (only the dataclass error field).
2026-04-30 16:23:44 -05:00
max
c80a36d3ae AGENTS.md: prescriptive guidance for smaller models on lab hosts
Smaller (non-4.7) Claude models act as on-device agents on CIS490 lab
hosts and have hit the install gotchas that became issues #10–#12.
Their reports describe symptoms well but miss inferred context — so
this expands the runbook with explicit "do this, not that" notes:

- run tools from /opt/cis490 not a clone (CWD-on-sys.path trap)
- shipper "waiting on mTLS material" is expected and self-heals; do
  not try to fix it manually
- table of the three install bugs already closed in main, so a fresh
  agent can recognize the symptom and pull instead of re-filing
- "fix one red row at a time" rather than batching attempts

Closes nothing new; this is the followup to #10/#11/#12 promised
during their resolution.
2026-04-30 16:19:09 -05:00
max
6f8b744c33 cis490-doctor + AGENTS.md operator runbook + louder install script
Adds the missing diagnostic + onboarding tools so an agent (AI or
human) handed a fresh lab host can get to "shipping data" without
re-deriving every step from logs.

tools/cis490_doctor.py — one-shot health check that walks the full
stack from the bottom up. Each row is green/yellow/red with an
exact fix command for the red rows. Checks:
  - repo: branch, tree-clean, distance from origin/main
  - install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml,
    /etc/cis490/lab-host.env
  - mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify
  - systemd: cis490-{shipper,orchestrator,receiver} active state
  - net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg
  - vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2,
    cidata.iso
  - tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level)
  - end-to-end: cis490-shipper --ping
Modes: --role {lab-host,receiver}, --json (machine-readable),
--no-tier3 (skip optional checks). Exits non-zero on any red row.
ANSI color (auto-disabled on non-tty / NO_COLOR).
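
The row model and exit policy described above can be sketched as follows. This is an illustration under assumed names (`Row`, `exit_code`, `use_color`, `render`), not the contents of tools/cis490_doctor.py.

```python
from dataclasses import dataclass
from typing import List, Mapping, TextIO

ANSI = {"green": "\033[32m", "yellow": "\033[33m", "red": "\033[31m"}
RESET = "\033[0m"


@dataclass
class Row:
    name: str
    status: str  # "green", "yellow", or "red"
    fix: str = ""  # exact command to run; only shown for red rows


def exit_code(rows: List[Row]) -> int:
    """Non-zero as soon as any row is red; yellow rows are warnings."""
    return 1 if any(r.status == "red" for r in rows) else 0


def use_color(stream: TextIO, env: Mapping[str, str]) -> bool:
    """Color only on a tty, and never when NO_COLOR is set."""
    return stream.isatty() and not env.get("NO_COLOR")


def render(rows: List[Row], color: bool) -> str:
    lines = []
    for row in rows:
        tag = row.status.upper()
        if color:
            tag = f"{ANSI[row.status]}{tag}{RESET}"
        line = f"[{tag}] {row.name}"
        if row.status == "red" and row.fix:
            line += f"  fix: {row.fix}"
        lines.append(line)
    return "\n".join(lines)
```

Carrying the fix command on the row itself keeps the diagnosis and the remedy on the same output line.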

AGENTS.md gains a "How a lab host gets to shipping data" canonical
flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh →
install-lab-host.sh → cis490-doctor → systemctl enable. Plus an
"on-demand episode" recipe + a "smallest E2E test" snippet for
agents that need to verify the pipe without waiting on the timer.
The strict "cloning the repo by itself does nothing" callout makes
the failure mode mu and elliott-lab hit explicit.

scripts/install-lab-host.sh prints a 5-step banner on first install
that points at cis490_doctor.py + the deploy-cis490-cert.sh flow,
plus an always-printed footer warning that "cloning + running
launchers manually is NOT enough." Same message the AGENTS.md
section reinforces.

Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime"
claim that turned out to be untrue because no cis490-shipper
service was running on elliott-lab — exactly the case this
diagnostic tool targets).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:11:57 -05:00
max
c89dbe29e7 README + AGENTS.md: reflect fleet, driver v2, all 4 collectors
README:
- Intro now describes the multi-host fleet + cross-host sample
  diversity as the primary workflow.
- Tier 2 section: profile-driven workload table replaces the old
  "yes / dd" description.
- New Tier 3 section: covers driver v2 dispatch + setup automation
  scripts.
- Tier maturity table refreshed (1, 2 ; 3  code /  image; 4 🚧).
- Telemetry-sources table moved into the per-tier story so the
  oracle-vs-feature split is visible from the top of the doc.
- Status section restructured by section (Pipeline, Telemetry,
  Orchestrator + drivers, Fleet) instead of a flat list. Cross-links
  to the new Forgejo issues for the remaining gaps:
    #4 — Tier 4 MalwareBazaar fetcher
    #5 — source 3 (perf stat)
    #6 — bridge pcap per-episode wiring
- Quick-start sections rewritten:
    1) "fleet mode (the primary workflow)" with --capacity + --waves
    2) "single episode, no fleet" covering both Tier 2 + Tier 3
    3) "multi-host fleet — how cross-host diversity works" explains
       the deterministic per-(host, slot, ep) selection mechanism
- Repo-layout table updated to include shipper/, scripts/, AGENTS.md,
  and the workloads/fleet additions.
- Deploying section: replaces the "TODO scaffolds" wording with the
  actual sudo install-receiver / install-lab-host / wg-pki bring-up
  flow that's running on the Pi today.
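
One way a deterministic per-(host, slot, ep) selection could work is to hash the triple and index into the sample list. The sketch below is an assumption about the general technique, not the selector the fleet runner actually uses; `select_sample` and the key format are hypothetical.

```python
import hashlib
from typing import Sequence


def select_sample(
    host_id: str, slot: int, episode: int, samples: Sequence[str]
) -> str:
    """Pick a sample that depends only on (host_id, slot, episode)."""
    if not samples:
        raise ValueError("no samples to choose from")
    key = f"{host_id}/{slot}/{episode}".encode()
    # sha256 is stable across processes and machines, unlike hash().
    idx = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(samples)
    return samples[idx]
```

Because the host id is part of the key, two hosts running the same slot and episode numbers tend to land on different samples, with no coordination between them.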

AGENTS.md: adds a "don't put off the hard parts" convention as the
first item under Other conventions, with explicit guidance on when
"deferred-with-reason" is legitimate (genuine operator artifact
missing) and the requirement to file an issue + automate the
bring-up so it Just Works once the artifact lands.

86/86 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:11:35 -05:00
max
7c9f9582ca Lab-host shipper + receiver /v1/ping + install scripts
Implements the deployment loop end-to-end on the CIS490 side:

shipper/
  config.py      ShipperConfig (host_id, paths, receiver endpoint, mTLS)
  transport.py   httpx-based PUT + ping with mTLS + bearer support
  queue.py       scan data/episodes/, tar+zstd via system zstd, ship,
                 retire to data/shipped/. Idempotent across crashes per
                 the state machine in docs/transport.md.
  __main__.py    CLI: --ping (smoke test), --once (one pass), or daemon
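
A single scan-ship-retire pass of the kind queue.py performs can be sketched as below. This is a simplified illustration and does not reproduce the state machine in docs/transport.md: `ship_pending` is a hypothetical name, the tar+zstd step is left out, and the upload is passed in as a callable.

```python
import os
from pathlib import Path
from typing import Callable, List


def ship_pending(
    episodes_dir: Path, shipped_dir: Path, ship: Callable[[Path], bool]
) -> List[str]:
    """Ship each episode directory; retire it only after a successful ship."""
    shipped_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for episode in sorted(p for p in episodes_dir.iterdir() if p.is_dir()):
        if not ship(episode):
            # Leave it in place; the next pass will try again.
            continue
        # Retire with a rename so an episode is never in both places.
        os.rename(episode, shipped_dir / episode.name)
        done.append(episode.name)
    return done
```

In this sketch a crash between a successful ship and the rename causes a re-ship on the next pass, which is why the receiver side needs to tolerate seeing the same episode twice.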

receiver/app.py: new POST /v1/ping that requires the same auth as PUT
  /v1/episodes but writes nothing. Used by `cis490-shipper --ping`
  during lab-host bring-up to verify the WG/Caddy/mTLS path before
  shipping any real bytes.

etc/
  cis490-shipper.service       systemd unit for the lab-host shipper
  cis490-orchestrator.service  systemd unit for the lab-host queue
                               (kept disabled by default until queue
                               mode lands)
  lab-host.toml.example        config template

scripts/
  install-lab-host.sh   idempotent installer; verifies prereqs,
                        creates cis490 service user, syncs repo to
                        /opt/cis490, builds venv, drops systemd units
                        and config template
  install-receiver.sh   same, for the receiver role on the central WG
                        node (Pi5 in our setup)

tests/test_shipper.py  11 end-to-end tests against a real Uvicorn
                       server hosting the receiver app. Exercises
                       ping, tar+ship, idempotent re-ship, 409
                       conflict, transient (receiver down), tarball
                       round-trip via system zstd.

AGENTS.md  guidance for AI agents working on this and sibling repos.
           Headline: when you hit an issue you can't fully fix in
           scope, file a Forgejo issue rather than leaving a TODO.

51/51 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:41:32 -05:00