README + AGENTS.md: reflect fleet, driver v2, all 4 collectors

README:
- Intro now describes the multi-host fleet + cross-host sample
  diversity as the primary workflow.
- Tier 2 section: profile-driven workload table replaces the old
  "yes / dd" description.
- New Tier 3 section: covers driver v2 dispatch + setup automation
  scripts.
- Tier maturity table refreshed (1, 2 ✅; 3 ✅ code / ⏳ image; 4 🚧).
- Telemetry-sources table moved into the per-tier story so the
  oracle-vs-feature split is visible from the top of the doc.
- Status section restructured by section (Pipeline, Telemetry,
  Orchestrator + drivers, Fleet) instead of a flat list. Cross-links
  to the new Forgejo issues for the remaining gaps:
    #4 — Tier 4 MalwareBazaar fetcher
    #5 — source 3 (perf stat)
    #6 — bridge pcap per-episode wiring
- Quick-start sections rewritten:
    1) "fleet mode (the primary workflow)" with --capacity + --waves
    2) "single episode, no fleet" covering both Tier 2 + Tier 3
    3) "multi-host fleet — how cross-host diversity works" explains
       the deterministic per-(host, slot, ep) selection mechanism
- Repo-layout table updated to include shipper/, scripts/, AGENTS.md,
  and the workloads/fleet additions.
- Deploying section: replaces the "TODO scaffolds" wording with the
  actual sudo install-receiver / install-lab-host / wg-pki bring-up
  flow that's running on the Pi today.

AGENTS.md: adds a "don't put off the hard parts" convention as the
first item under Other conventions, with explicit guidance on when
"deferred-with-reason" is legitimate (genuine operator artifact
missing) and the requirement to file an issue + automate the
bring-up so it Just Works once the artifact lands.

86/86 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
max 2026-04-30 00:11:35 -05:00
parent b80986d99c
commit c89dbe29e7
2 changed files with 214 additions and 87 deletions


@ -79,6 +79,15 @@ commits.
## Other conventions
- **Don't put off the hard parts.** Reserve "deferred-with-reason" for
genuine blockers (binary not present on this machine, external
service unreachable). Anything you *could* do but find awkward
(bridge setup, cross-arch quirks, fleet concurrency) should be done. The
user has flagged this twice when work was scoped down prematurely.
When something genuinely is blocked by an operator artifact, file
the Forgejo issue and *automate the bring-up* (e.g., installer
script + sha256-verifying fetcher) so the moment the artifact lands
it Just Works.
- **Naming:** never coin USB / device / service names on the user's
behalf. Ask first. Reusing an old name is especially bad.
- **`/etc` configs:** `Read` first, copy second. Never overwrite a

292
README.md

@ -4,9 +4,16 @@ Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
watches performance metrics on a real device, decides whether the device has
been breached, and triggers a hardware-level reset when confidence is high
enough. This repository covers the **dataset side** — we run public malware
samples against intentionally vulnerable Linux VMs and capture labeled
time-series telemetry that mirrors what the deployed model would see in the
field.
samples (and behavior-matched mimics) against intentionally vulnerable Linux
VMs and capture labeled time-series telemetry that mirrors what the deployed
model would see in the field.
Concretely, every lab host on the WireGuard mesh detects how much capacity
it has, spins up that many concurrent VMs, gives each VM a *different*
malware profile from the manifest, and ships the resulting labeled episode
tarballs to the central receiver on the Pi over mTLS. Running the same
fleet on multiple hosts gives novel, non-overlapping data per host with no
coordinator — see [Multi-host fleet](#multi-host-fleet) below.
The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).
@ -22,15 +29,33 @@ the set of timestamped phase transitions written to `labels.jsonl` —
sharing a monotonic clock with the metric rows so anything aligned in
time can be aligned in code.
### Tier 2 — *real Alpine VM, real workload driven from inside the guest*
### Tier 2 — *real Alpine VM, profile-driven workload inside the guest*
This is the closest we get to real-malware behaviour without yet running
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
guest, **and the load is generated inside the guest** by busybox
``yes`` (CPU saturation) and ``dd`` (disk bursts), driven over the
serial console by `tools/vm_load_controller.py`. Every phase transition
in `labels.jsonl` corresponds to an actual command issued inside the
real VM.
guest plus three more sources running concurrently (QMP, bridge pcap,
in-guest agent — see *Telemetry sources* below). The *load* itself is
generated inside the guest by a profile-matched shell command from
[`exploits/workloads.py`](exploits/workloads.py), driven over the
serial console by [`tools/vm_load_controller.py`](tools/vm_load_controller.py).
Each sample's `profile` (from [`samples/manifest.toml`](samples/manifest.toml))
dispatches to a different in-session workload, so the envelope each
VM produces is observably different per family — exactly the variance
the ML model needs to learn:
| profile          | shape                                                    |
|------------------|----------------------------------------------------------|
| `cpu-saturate`   | sustained 1-vCPU saturation (XMRig)                      |
| `scan-and-dial`  | SYN-style probes across the bridge subnet + dial-home    |
| `io-walk`        | fs traversal + 4 KiB urandom writes (ransomware)         |
| `bursty-c2`      | long idle + periodic 3-packet egress burst (Dridex)      |
| `low-and-slow`   | minimal CPU + periodic memory churn (Kovter / fileless)  |
| `shell-resident` | one long-lived TCP socket + periodic command ticks (RAT) |
Every phase transition in `labels.jsonl` corresponds to an actual
command issued inside the real VM, and `meta.json` records which
sample / profile / kind drove it.
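
Because the labels and the metric rows share one monotonic clock, attaching a phase to each telemetry row reduces to a sorted lookup. The following is a minimal sketch of that alignment, not the project's loader: the field names `t_mono` and `phase` are assumptions made for illustration and may not match the real `labels.jsonl` schema.

```python
import bisect
import json


def load_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_rows(labels, rows, *, ts_key="t_mono", phase_key="phase"):
    """Attach the phase in effect at each row's timestamp.

    ts_key and phase_key are assumed field names, not the project's schema.
    Rows that precede the first transition get phase None.
    """
    labels = sorted(labels, key=lambda r: r[ts_key])
    starts = [r[ts_key] for r in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        out.append({**row, "phase": phase})
    return out
```

A row stamped exactly at a transition takes the new phase, which is one reasonable convention; the real pipeline may choose differently.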
![Real Alpine VM envelope](docs/images/real-vm-envelope.png)
@ -41,10 +66,20 @@ controller killing the load process inside the VM. The
infected_running → dormant → infected_running re-entry is the textbook
envelope that justifies the whole project framing.
Reproduce with:
Reproduce one episode (profile-driven via `--sample` or the `SAMPLE_NAME`
environment variable; without either, it falls back to the v1 yes-loop):
```sh
uv run python tools/run_real_vm_demo.py --data-root data
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample xmrig-cryptominer
```
Or run the **fleet** — one wave of `max_concurrent` parallel episodes,
each slot pulling a different sample from the manifest:
```sh
uv run python tools/run_fleet.py --capacity # see what the host can do
uv run python tools/run_fleet.py --waves 1 --data-root data
```
### Tier 1 — *real Alpine VM, idle baseline*
@ -67,21 +102,46 @@ above produces from real KVM behaviour.
![Synthetic envelope (host-side mimic)](docs/images/synthetic-envelope.png)
### What's still missing for the real-malware envelope
### Tier 3 — *real exploit fire, profile-matched workload (Driver v2)*
The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
msgpack-over-HTTPS msfrpc client + `MSFExploitDriver`. With a
[`Sample`](samples/manifest.py) supplied, the driver dispatches the
post-exploit `infected_running` workload through
[`exploits/workloads.py`](exploits/workloads.py) — same six profiles
as Tier 2, so a fleet wave produces matched envelopes whether or not
an exploit fires. Without a sample, the v1 yes-loop path is preserved
for smoke runs.
First canned module: `exploits/modules/vsftpd_234_backdoor.toml`
(Metasploitable2's CVE-2011-2523). [`scripts/install-msfrpcd.sh`](scripts/install-msfrpcd.sh)
sets up `msfrpcd` (loopback only) as a hardened systemd unit;
[`scripts/fetch-metasploitable2.sh`](scripts/fetch-metasploitable2.sh)
pulls + sha256-verifies a target image from operator-supplied URL.
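
The fetch script itself is shell, but the verification step it performs can be illustrated in Python. This is a sketch under assumptions: `sha256_of` and `verify_image` are hypothetical helper names, and the expected digest is whatever the operator supplies alongside the URL.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large images never sit fully in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_image(path, expected_sha256):
    """Return the digest if it matches the operator-supplied value, else raise."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"sha256 mismatch for {path}: got {actual}")
    return actual
```

Refusing to continue on a mismatch, rather than logging and proceeding, keeps a corrupted or substituted image from ever being booted.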
### Tier maturity
| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
| 2 — real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🟡 driver landed, integration pending |
| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |
| 1 — real VM, idle | confidence the collectors read real KVM behaviour | ✅ done |
| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | ✅ done |
| 3 — real VM, real exploit fire + profile workload | honest `armed → infecting` transitions, driver v2 dispatch | ✅ code; ⏳ awaiting Metasploitable2 image + msfrpcd on a lab host |
| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | 🚧 manifest schema ready (`sample.sha256` + `kind=real`); fetcher TBD |
The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
msgpack-over-HTTPS msfrpc client plus an `MSFExploitDriver` plugged
into the orchestrator as the `on_phase` callback. First canned module:
`exploits/modules/vsftpd_234_backdoor.toml` (Metasploitable2's
CVE-2011-2523). End-to-end integration needs `msfrpcd` running and a
Metasploitable2 image at `vm/images/`, which is the next bring-up step.
### Telemetry sources (all four wire into one episode dir)
| # | Source                         | Vantage      | Role                 |
|---|--------------------------------|--------------|----------------------|
| 1 | host `/proc/<qemu_pid>`        | outside      | oracle (label only)  |
| 2 | QEMU QMP queries               | outside      | oracle (label only)  |
| 3 | `perf stat -p <qemu_pid>`      | outside      | oracle (planned)     |
| 4 | Bridge pcap → 100 ms netflow   | gateway-side | feature (deployable) |
| 5 | In-guest agent (virtio-serial) | inside       | feature (deployable) |
Sources 1, 2, 4, 5 are live as of this commit. The deploy/oracle split
follows [`docs/threat-model.md`](docs/threat-model.md): only sources
4 + 5 are usable as model *features* in the field — sources 1, 2, 3
exist as labeling oracles only.
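
One way to keep the split from being violated by accident at training time is to encode the table as data and derive the feature set from it. A small sketch, assuming a hypothetical `Source` record; the role and liveness values are copied from the table above, everything else is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    number: int
    name: str
    vantage: str
    role: str   # "oracle" (labeling only) or "feature" (deployable)
    live: bool  # False while the collector is still planned


SOURCES = (
    Source(1, "host /proc/<qemu_pid>", "outside", "oracle", True),
    Source(2, "QEMU QMP queries", "outside", "oracle", True),
    Source(3, "perf stat -p <qemu_pid>", "outside", "oracle", False),
    Source(4, "bridge pcap -> 100 ms netflow", "gateway-side", "feature", True),
    Source(5, "in-guest agent (virtio-serial)", "inside", "feature", True),
)


def sources_by_role(role, *, live_only=True, sources=SOURCES):
    """Return source numbers with the given role, optionally only live ones."""
    return [s.number for s in sources
            if s.role == role and (s.live or not live_only)]
```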
For an interactive view of any episode (zoom/pan/hover), run:
@ -92,87 +152,133 @@ tools/show_envelope.sh data/episodes/<episode_id>
---
## Status
## Status (86/86 tests passing as of `b80986d`)
- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — running on Pi5 via Caddy + mTLS (wg-pki client CA)
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
- ✅ Host /proc oracle collector (source 1) @ 10 Hz
- ✅ **QMP collector** (source 2) — query-status / query-blockstats / query-stats, 1 Hz
- ✅ **Bridge pcap** (source 4) — pure-Python pcap parser + 100 ms-bucketed netflow.jsonl
- ✅ **In-guest agent** (source 5) — virtio-serial; cidata-embedded for first-boot install on Alpine; host-side reader re-stamps to host clock
- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
- ✅ Real VM (Alpine 3.21 cloud-init under KVM)
- ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed; `scripts/install-msfrpcd.sh` automates msfrpcd setup; `scripts/fetch-metasploitable2.sh` pulls + verifies the target image (URL+sha256 from operator). Driver v2 (sample-profile-driven workloads) is the next step for ML diversity.
- ✅ **Shipper** — lab-host ↔ Pi receiver via tar+zstd PUT over WG with mTLS; `--ping` smoke mode
- ✅ **Fleet runner** — host-capacity-aware concurrency (`tools/run_fleet.py`); resource detector reserves cores + RAM headroom; sample manifest with deterministic per-(host, slot, episode) selection so every host on the network produces *novel, varied, labeled* data
- ✅ **Sample manifest** — six initial profiles (cryptominer / botnet / ransomware / banking-trojan / fileless / RAT). Real-malware fetch from MalwareBazaar is the Tier-4 follow-up.
**Pipeline (lab-host → Pi → tarball stored)**
- ✅ Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
- ✅ `POST /v1/ping` smoke endpoint (writes nothing, exercises the full auth path)
- ✅ Shipper (`shipper/`) — tar+zstd, retry/backoff, `--ping` mode
- ✅ Caddy `collector.wg` block (in `spectral/caddy`)
- ✅ Lab-host install script + systemd units (`scripts/install-lab-host.sh`, `etc/cis490-{shipper,orchestrator}.service`)
- ✅ Receiver install script (`scripts/install-receiver.sh`)
- ✅ wg-pki client-CA bootstrap + per-host leaf issuance (in `spectral/wg-pki`)
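
The shipper's retry/backoff behaviour is not spelled out here, so the following is only a generic sketch of exponential backoff with jitter. `with_backoff`, its defaults, and the choice of `OSError` as the retryable error are all assumptions, not the shipper's actual code.

```python
import random
import time


def with_backoff(op, *, attempts=5, base_delay=1.0, max_delay=60.0,
                 retry_on=(OSError,), sleep=time.sleep):
    """Call op() until it succeeds or attempts run out.

    The delay doubles after each failure, is capped at max_delay, and gets
    up to 50% random jitter so several hosts do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Since the receiver is described as idempotent, repeating a PUT after an ambiguous failure should be safe, which is what makes a blind retry like this reasonable.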
> **Topology note:** in this project the **Pi5 is the WireGuard-side
> *collector*** that receives episode tarballs from one or more lab hosts.
> It is *not* the deployment target for the model. The deployment target is
> generic ("any constrained Linux device"). See
**Telemetry**
- ✅ Source 1 — host `/proc/<qemu_pid>` @ 10 Hz
- ✅ Source 2 — QEMU QMP @ 1 Hz
- ✅ Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep). Per-episode wiring in `EpisodeRunner` is tracked in [#6](http://maxgit.wg/spectral/CIS490/issues/6).
- ✅ Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine
- 🚧 Source 3 — `perf stat -p <qemu_pid>` ([#5](http://maxgit.wg/spectral/CIS490/issues/5))
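
To make the 100 ms netflow bucketing concrete, here is a minimal sketch of the aggregation step. It assumes packets arrive as `(ts_sec, ts_usec, length)` tuples, mirroring the classic pcap record header; the output keys `t_ms`, `packets`, and `bytes` are hypothetical and may differ from the real `netflow.jsonl` rows.

```python
from collections import defaultdict

BUCKET_MS = 100


def bucketize(packets, *, bucket_ms=BUCKET_MS):
    """Aggregate (ts_sec, ts_usec, length) tuples into fixed-width buckets.

    Integer arithmetic avoids float rounding at bucket edges. Output field
    names are illustrative, not the project's schema.
    """
    buckets = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts_sec, ts_usec, length in packets:
        idx = (ts_sec * 1000 + ts_usec // 1000) // bucket_ms
        buckets[idx]["packets"] += 1
        buckets[idx]["bytes"] += length
    return [{"t_ms": idx * bucket_ms, **buckets[idx]} for idx in sorted(buckets)]
```

Empty buckets are simply absent in this sketch; a training pipeline that wants a dense series would need to fill them with zeros.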
**Orchestrator + drivers**
- ✅ Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
- ✅ Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
- ✅ Tier 3 driver v2 — `MSFExploitDriver` + msfrpc client + per-sample workload dispatch; first canned module `vsftpd_234_backdoor.toml`
- ⏳ Tier 3 integration — needs operator to drop a Metasploitable2 image + run `scripts/install-msfrpcd.sh` on a lab host
- 🚧 Tier 4 — MalwareBazaar fetch by sha256 (manifest schema is ready; tracked in [#4](http://maxgit.wg/spectral/CIS490/issues/4))
**Fleet (multi-VM, multi-host data generation)**
- ✅ Resource-aware capacity detector (cores / RAM / load) — `orchestrator/fleet.py`
- ✅ Concurrent slot runner — `tools/run_fleet.py`
- ✅ Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order
> **Topology note:** the **Pi5 is the WireGuard-side *collector*** that
> receives episode tarballs from one or more lab hosts. It is *not* the
> deployment target for the model. The deployment target is generic
> ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).
---
<details>
<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>
<summary><b>Quick start — fleet mode (the primary workflow)</b></summary>
```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
# One-time setup.
uv sync
# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data
# 1. Build the cidata ISO with the in-guest agent baked in.
uv run python tools/build_cidata.py vm/images/cidata.iso
# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/<episode_id>
# 2. See what this host is sized for.
uv run python tools/run_fleet.py --capacity
# cores: 4 (reserve 1)
# ram: 7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
# load: 1m=0.51
# caps: by_cores=3, by_ram=13, by_load=3
# --> max_concurrent VMs: 3
# Or open an interactive plot in your browser:
# 3. Run one wave (= max_concurrent parallel episodes, each with a
# different sample profile).
uv run python tools/run_fleet.py --waves 1 --data-root data
# 4. Plot any episode (matplotlib WebAgg).
tools/show_envelope.sh data/episodes/<episode_id>
```
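
The `--capacity` output above is consistent with taking the minimum of three independent caps. The sketch below reproduces those sample numbers; it is not the code in `orchestrator/fleet.py`, and the `by_load` rule in particular (cores minus the rounded-up 1-minute load) is a guess that happens to match the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    by_cores: int
    by_ram: int
    by_load: int

    @property
    def max_concurrent(self):
        """The tightest of the three caps wins."""
        return min(self.by_cores, self.by_ram, self.by_load)


def estimate_capacity(*, cores, avail_mib, load_1m,
                      reserve_cores=1, headroom_mib=1024, per_vm_mib=320):
    """Estimate how many VMs fit; defaults mirror the sample output above."""
    by_cores = max(0, cores - reserve_cores)
    by_ram = max(0, (avail_mib - headroom_mib) // per_vm_mib)
    # Assumed rule: each unit of load (rounded up) costs one VM slot.
    by_load = max(0, cores - math.ceil(load_1m))
    return Capacity(by_cores, by_ram, by_load)
```

With `cores=4`, `avail_mib=5223`, and `load_1m=0.51` this yields caps of 3, 13, and 3, so the host runs 3 concurrent VMs.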
The data lands in `data/episodes/<ulid>/`:
Each episode dir contains:
```
meta.json episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl orchestrator actions (snapshot_load, phase_transition, episode_end)
meta.json episode metadata (image, sample, profile, fleet capacity)
events.jsonl orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
labels.jsonl one row per phase transition — THIS is the envelope
telemetry-proc.jsonl host /proc sampler at 10 Hz
telemetry-proc.jsonl source 1: host /proc sampler @ 10 Hz
telemetry-qmp.jsonl source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
telemetry-guest.jsonl source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
network.pcap source 4: tcpdump on br-malware
netflow.jsonl source 4: 100 ms-bucketed pcap aggregation
done.marker written last; the shipper only sees finished episodes
```
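
Because `done.marker` is written last, anything that consumes episodes can treat its presence as the commit point. A minimal sketch of that check, assuming the `data/episodes/<ulid>/` layout shown above; `finished_episodes` is a hypothetical helper, not the shipper's API.

```python
from pathlib import Path


def finished_episodes(data_root):
    """Return episode dirs under <data_root>/episodes that have a done.marker.

    Directories still being written (no marker yet) are skipped, so a reader
    never sees a half-finished episode.
    """
    episodes = Path(data_root) / "episodes"
    if not episodes.is_dir():
        return []
    return sorted(
        p for p in episodes.iterdir()
        if p.is_dir() and (p / "done.marker").is_file()
    )
```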
</details>
<details>
<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>
The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
QMP/monitor sockets and pidfile. The orchestrator then samples the real
`qemu-system` process.
<summary><b>Quick start — single episode, no fleet</b></summary>
```sh
# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)
# Tier 2 (no exploit, profile-driven workload):
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample mirai-class-bot
# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh
# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid $QPID --duration 20
# Plot:
tools/show_envelope.sh data/episodes/<episode_id>
# Tier 3 (real exploit fire via msfrpcd):
MSFRPC_PASSWORD="$(. /etc/cis490/msfrpc.env; echo "$MSFRPC_PASSWORD")" \
    uv run python tools/run_tier3_demo.py \
    --module vsftpd_234_backdoor \
    --sample ransomware-mimic \
    --data-root data
```
The idle-VM envelope shape is distinct from the synthetic load: periodic
~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
late-boot disk write. That's a real KVM guest you're seeing.
</details>
<details>
<summary><b>Multi-host fleet — how cross-host diversity works</b></summary>
Each lab host's `host_id` (set in `/etc/cis490/lab-host.toml`) seeds a
deterministic walk through the sample catalog:
```python
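# samples/manifest.py
import hashlib

def select(self, *, host_id, slot, episode_index):
    seed = f"{host_id}|{slot}|{episode_index}"
    digest = hashlib.sha256(seed.encode()).digest()
    # Assumed reading of `[:8]`: first 8 digest bytes as a big-endian int.
    idx = int.from_bytes(digest[:8], "big") % len(self.samples)
    return self.samples[idx]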
```
So:
- `host=alice slot=0 ep=0` and `host=bob slot=0 ep=0` usually pick
*different* samples: with six samples a match happens about 1 time in 6
(test asserts < 25% collision over 20 trials).
- A single host walks the entire catalog within ~`len(manifest)` waves
(test confirms full coverage in 200 episodes).
- No coordinator needed — every host independently produces non-overlapping
data, and `meta.fleet.host_id` + `meta.sample.name` make the join trivial
at training time.
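
The collision and coverage properties listed above can be checked with a standalone script. This sketch re-implements the selection rule outside the manifest class under one assumed reading of the hash step (first 8 digest bytes as a big-endian integer); `select_index`, `collision_rate`, and `coverage` are hypothetical names, not the repository's test helpers.

```python
import hashlib


def select_index(host_id, slot, episode_index, n_samples):
    """Deterministically map (host, slot, episode) to a catalog index."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % n_samples


def collision_rate(host_a, host_b, n_samples, trials, slot=0):
    """Fraction of episodes where two hosts pick the same index."""
    hits = sum(
        select_index(host_a, slot, ep, n_samples)
        == select_index(host_b, slot, ep, n_samples)
        for ep in range(trials)
    )
    return hits / trials


def coverage(host_id, n_samples, episodes, slot=0):
    """Set of catalog indices one host visits over the given episodes."""
    return {select_index(host_id, slot, ep, n_samples) for ep in range(episodes)}
```

Over many trials the collision rate should settle near `1 / n_samples`, and a single host should have visited every index long before 200 episodes.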
The fleet runner shells out to the same `tools/run_real_vm_demo.py` per
slot, with `SLOT` / `RUN_DIR` / `SAMPLE_NAME` env passed through to the
launcher. Each VM gets its own QMP socket, agent socket, hostfwd port
range, and episode dir, so concurrency is collision-free up to the
capacity ceiling.
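
A sketch of how per-slot isolation might be derived from the slot number alone. The `SLOT`, `RUN_DIR`, and `SAMPLE_NAME` variable names come from the description above; the run directory root, base port, and ports-per-slot values are made up for illustration and are not the runner's real defaults.

```python
import os


def slot_env(slot, sample_name, *, run_root="/tmp/cis490-fleet",
             base_port=20000, ports_per_slot=10, base_env=None):
    """Build the child environment and hostfwd port range for one slot.

    run_root, base_port, and ports_per_slot are illustrative assumptions.
    Each slot gets a disjoint port range and its own run directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SLOT": str(slot),
        "RUN_DIR": f"{run_root}/slot-{slot}",
        "SAMPLE_NAME": sample_name,
    })
    first = base_port + slot * ports_per_slot
    return env, range(first, first + ports_per_slot)
```

The returned `env` could be passed to `subprocess.run(..., env=env)` when launching the per-slot demo script; deriving everything from the slot index is what removes the need for any locking between slots.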
</details>
@ -188,15 +294,18 @@ late-boot disk write. That's a real KVM guest you're seeing.
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| `tools/` | Demo runners, load mimic, plot scripts |
| [`exploits/`](exploits/README.md) | MSF RPC client + driver + per-module TOML configs (Tier 3) |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `orchestrator/` | Episode runner + `fleet.py` (capacity detection, concurrent slot driver) |
| `collectors/` | One module per telemetry source: `proc_qemu`, `qmp`, `pcap`, `guest_agent` |
| `receiver/` | Starlette app: PUT `/v1/episodes` + POST `/v1/ping`, sha256-verified, idempotent |
| `shipper/` | Lab-host-side: scan `data/episodes/`, tar+zstd, PUT over mTLS, retry/backoff |
| `vm/` | Launch scripts (`launch_demo.sh`, `launch_target.sh`), `setup_bridge.sh`, in-guest agent at `vm/guest-agent/cis490_agent.py`. qcow2 images and pcap captures gitignored. |
| `tools/` | `run_fleet.py`, `run_real_vm_demo.py`, `run_tier3_demo.py`, `build_cidata.py`, `plot_envelope.py`, `show_envelope.sh` |
| [`exploits/`](exploits/README.md) | MSF RPC client (`msfrpc.py`), `driver.py` (v2 with sample dispatch), `workloads.py` (six profile-matched in-session loops), per-module TOML configs |
| [`samples/`](samples/manifest.toml) | Sample manifest + loader. Binaries land at `samples/store/<sha256>` (gitignored). |
| `scripts/` | `install-{lab-host,receiver,msfrpcd}.sh`, `fetch-metasploitable2.sh` |
| `training/` | Model training code (deferred — schema first) |
| `etc/` | systemd units and config templates installed by the deploy scripts |
| `etc/` | systemd units and config templates (`cis490-{receiver,shipper,orchestrator}.service`, `lab-host.toml.example`, `receiver.toml.example`) |
| [`AGENTS.md`](AGENTS.md) | Conventions for AI agents working on this and sibling spectral repos |
</details>
@ -237,17 +346,26 @@ Two roles, one bootstrap command each. Detailed in
`index.jsonl`. Runs on the Pi5 in our setup.
```sh
# On a lab host:
./scripts/install-lab-host.sh # (TODO — currently bring up by hand per docs/deploy.md)
# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh # (TODO — same)
sudo ./scripts/install-receiver.sh
# Add the collector.wg block to spectral/caddy (already merged), then:
sudo systemctl enable --now cis490-receiver
# One-time, on the Pi: bootstrap the CIS490 client CA.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh
# On each lab host: enroll via wg-enroll first, then:
sudo ./scripts/install-lab-host.sh
# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
```
For now both bootstrap scripts are scaffolds; the units and configs they
install live in `etc/`. The receiver itself works today
(`uv run python -m receiver --config etc/receiver.toml.example` — modify
paths).
The orchestrator service runs `tools/run_fleet.py --waves 1` per
invocation with `Restart=always`, giving a continuous stream of
fresh-sample episodes per host. The shipper picks them up as
`done.marker` files appear and PUTs them to `https://collector.wg`.
For mTLS leaf-cert minting: `spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>`.
</details>