Scaffold project: docs, repo skeleton, transport + deploy design

Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet: schema and topology are decided first so collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose).

Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00
commit fa1574a0a6
parent 7a0fefc02e
14 changed files with 1080 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,48 @@
+# Disk images and snapshots
 *.iso
+*.img
+*.qcow2
+*.qcow2.*
+*.vmdk
+*.vdi
+*.raw
+vm/images/
+vm/snapshots/
+
+# Telemetry output
+data/episodes/
+*.pcap
+*.pcapng
+
+# Malware samples — NEVER commit binaries
+samples/store/
+*.bin
+*.elf
+*.exe
+*.dll
+*.so.malware
+
+# Python
+__pycache__/
+*.py[cod]
+.venv/
+venv/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+*.egg-info/
+dist/
+build/
+
+# Editor
+.vscode/
+.idea/
+*.swp
+.DS_Store
+
+# Local secrets (never commit)
+.env
+.env.local
+secrets.toml
+*.pat
+*.token
--- a/README.md
+++ b/README.md
@@ -0,0 +1,51 @@
+# CIS490 — Behavioral Malware Detection Dataset & Model
+
+Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches
+performance metrics on a real device, decides whether the device has been breached,
+and triggers a hardware-level reset when confidence is high enough.
+
+This repository covers the **dataset side** of that pipeline: we run real, public
+malware samples against intentionally vulnerable Linux VMs and capture labeled
+time-series telemetry that mirrors what the same model would see in deployment on
+a Raspberry Pi or similarly constrained target.
+
+The work is grounded in the trust-over-time scoring model from
+[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803) and a related
+proprietary follow-on that pairs detection with blockchain-anchored hardware reset.
+
+## What lives where
+
+| Path | What it holds |
+|---|---|
+| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
+| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
+| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
+| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
+| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
+| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
+| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
+| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
+| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
+| `exploits/` | Metasploit resource scripts for repeatable exploitation |
+| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
+| `training/` | Model training code (deferred — schema first) |
+
+## Quick orientation
+
+1. **Why VMs?** We need a clean snapshot/revert loop and we need to run real malware
+   without burning hardware. KVM gives us both at near-native speed.
+2. **Why is the network isolated?** A host-only bridge keeps malware off the
+   internet and off the WG overlay. Capture on that bridge is the **lab-side
+   observer**, standing in for the gateway (a Pi5 in the field) that would watch
+   the device's traffic in a deployed setting.
+3. **Why JSONL and not a database (yet)?** Storage is schema-last: row schemas are
+   fixed up front, but we collect first and decide the storage shape after we see
+   what's actually useful. JSONL is append-only, so a crash costs at most a torn
+   final line, and it reshapes trivially into Postgres/Timescale/Parquet later.
+4. **Why two models?** One trained only on features that exist on a real Pi
+   (*deployable*), one trained on everything, including host-side QEMU-only
+   features (*oracle*). The accuracy gap measures how much detection power a
+   privileged rootkit can take from the deployed model. See
+   [docs/threat-model.md](docs/threat-model.md).
+
+## Status
+
+Project bootstrap. Skeleton, documentation, and design decisions in place;
+collection and orchestration code in progress.
--- a/collectors/README.md
+++ b/collectors/README.md
@@ -0,0 +1,23 @@
+# collectors/
+
+One module per telemetry source. All collectors:
+
+- Receive an `episode_id`, an output directory, and a shared `t_mono_origin_ns`.
+- Write JSONL into `data/episodes/<episode_id>/telemetry-<name>.jsonl` (the pcap
+  collector writes `network.pcap` and `netflow.jsonl` instead).
+- Stamp every row with the same `t_mono_ns` / `t_wall_ns` clock pair.
+- Stamp every row with `source` and `available_in_deployment` (true/false).
+- Exit cleanly on `SIGTERM` from the orchestrator.
+
+| Module | Source | Vantage | Role |
+|---|---|---|---|
+| `proc_qemu.py` | host `/proc/<qemu_pid>/{stat,io,status,schedstat}` | outside guest | oracle |
+| `qmp.py` | QEMU QMP `query-stats`, `query-blockstats`, netdev | outside guest | oracle |
+| `perf_qemu.py` | `perf stat -p <qemu_pid>` | outside guest | oracle |
+| `pcap.py` | `tcpdump -i br-malware`, bucketed | gateway-side | feature |
+| `guest_agent.py` | virtio-serial reader, parses agent JSONL | inside guest | feature |
+
+The in-guest agent itself (a small Python+psutil program that runs on the
+guest and writes to `/dev/virtio-ports/cis490.guest.agent`) lives under
+`vm/guest-agent/` because it is shipped *into* the guest at image-build time.
+
+See [`docs/data-model.md`](../docs/data-model.md) for row schemas.
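The shared collector contract above can be sketched as a small row writer. This is a minimal illustration, not the project's implementation: the class name `JsonlWriter` and its constructor arguments are hypothetical, and `SIGTERM` handling is left out for brevity.

```python
import json
import time
from pathlib import Path


class JsonlWriter:
    """Appends one JSON object per line and stamps the common fields.

    Hypothetical helper: only the field names come from the docs.
    """

    def __init__(self, episode_dir, name, source, available_in_deployment,
                 t_mono_origin_ns):
        self.path = Path(episode_dir) / f"telemetry-{name}.jsonl"
        self.source = source
        self.available = available_in_deployment
        self.origin = t_mono_origin_ns
        self._fh = open(self.path, "a", encoding="utf-8")

    def write(self, fields):
        row = {
            # Monotonic time relative to the shared episode origin.
            "t_mono_ns": time.monotonic_ns() - self.origin,
            "t_wall_ns": time.time_ns(),
            "source": self.source,
            "available_in_deployment": self.available,
        }
        row.update(fields)
        self._fh.write(json.dumps(row) + "\n")
        self._fh.flush()  # keep the file current if the collector is killed

    def close(self):
        self._fh.close()
```

Flushing after every row trades some throughput for a file that stays readable up to the last complete line.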
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,107 @@
+# Architecture
+
+## One-paragraph summary
+
+A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
+guest. Each episode boots from a clean snapshot, captures a baseline, fires a
+known exploit, drops a public malware sample, observes the infection envelope,
+and reverts the snapshot. Telemetry is captured from five channels
+simultaneously, all stamped with the host monotonic clock so rows align. The
+output of an episode is a self-contained directory of JSONL files plus a pcap.
+
+## Lab topology
+
+```
+---------------------------------------------------------------+
+|  VM HOST  (this machine, /home/maximus/.env/qemu)             |
+|                                                               |
+|   +-----------------------+      +------------------------+   |
+|   |  KVM guest            |      |  orchestrator (host)   |   |
+|   |  (Metasploitable2,    |      |  - snapshot loop       |   |
+|   |   1 vCPU, capped)     |      |  - exploit driver      |   |
+|   |                       |<====>|  - phase labeler       |   |
+|   |  in-guest agent  -----|virtio|                        |   |
+|   |                       |serial|  collectors:           |   |
+|   |  vNIC ----------------|      |   * host /proc/qemu_pid|   |
+|   |        |              |      |   * QMP query-stats    |   |
+|   +--------|--------------+      |   * perf -p qemu_pid   |   |
+|            |                     |   * tcpdump on br      |   |
+|            v                     |   * guest agent rx     |   |
+|   br-malware (host-only, NO NAT) |                        |   |
+|            |                     +-----------|------------+   |
+|            +--- isolated, no internet                    |    |
+|                                                          v    |
+|                                                  data/episodes/
+----------------------------------------------------------|----+
+                                                           | (later)
+                                                           v
+                                              WG overlay -> Pi5 (DB + ingest)
+```
+
+The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG
+overlay, no DNS. The orchestrator also blocks egress with nftables on the host
+as a belt-and-suspenders measure.
+
+## Why KVM, not TCG and not Docker
+
+| Option | Speed | Determinism | Real OS isolation | Verdict |
+|---|---|---|---|---|
+| TCG `-icount` | slow | bit-exact replay | yes | overkill — ML wants noise |
+| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
+| Docker | fastest | low | shares host kernel — unsafe for malware | ruled out |
+
+KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
+multiplies dataset size for a fixed wall-clock budget. The "constrained
+single-threaded device" framing from the project goal is preserved by pinning to
+1 vCPU and applying a host cgroup CPU cap.
+
+## The episode state machine
+
+```
+  snapshot_load(baseline)
+        |
+        v
+  [clean]  ---- record T_baseline seconds of idle telemetry ----+
+        |                                                       |
+        v                                                       |
+  [armed]  ---- exploit module fires; session opens ------------+
+        |                                                       |
+        v                                                       |
+  [infecting]  ---- sample uploaded + executed -----------------+
+        |                                                       |
+        v                                                       |
+  [infected_running]  ---- observe T_active seconds ------------+
+        |                                                       |
+        v                                                       |
+  [dormant]  ---- (optional) wait for sample's idle window ----+
+        |                                                       |
+        v                                                       |
+  [reverting]  ---- snapshot_load(baseline); episode ends -----+
+                                                                |
+                                                                v
+                                                     write meta.json + close jsonl
+```
+
+Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
+moment the orchestrator takes the action*, not inferred from metrics afterward.
+This is what makes the dataset honestly labeled.
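The state machine and its action-time labeling can be sketched as follows. This is an assumption-laden illustration: the `PhaseLabeler` class and the `ALLOWED` table are hypothetical, and the table encodes only the happy path drawn in the diagram (failure paths such as a failed exploit are not modeled).

```python
import json
import time
from enum import Enum


class Phase(str, Enum):
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


# Transitions taken from the diagram; dormant is optional.
ALLOWED = {
    None: {Phase.CLEAN},
    Phase.CLEAN: {Phase.ARMED},
    Phase.ARMED: {Phase.INFECTING},
    Phase.INFECTING: {Phase.INFECTED_RUNNING},
    Phase.INFECTED_RUNNING: {Phase.DORMANT, Phase.REVERTING},
    Phase.DORMANT: {Phase.REVERTING},
}


class PhaseLabeler:
    def __init__(self, out, t_mono_origin_ns):
        self.out = out  # any object with write(str), e.g. an open file
        self.origin = t_mono_origin_ns
        self.current = None

    def transition(self, phase, reason):
        if phase not in ALLOWED.get(self.current, set()):
            raise ValueError(f"illegal transition {self.current} -> {phase}")
        # The row is stamped when the action is taken, not derived later.
        row = {
            "t_mono_ns": time.monotonic_ns() - self.origin,
            "phase": phase.value,
            "prev": self.current.value if self.current else None,
            "reason": reason,
        }
        self.out.write(json.dumps(row) + "\n")
        self.current = phase
```

Rejecting illegal transitions at write time keeps a buggy orchestrator from producing a label file that silently skips a phase.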
+
+## Why the lab topology mirrors deployment
+
+In the field, the ML model runs on a real Pi or constrained device. Whatever
+sees the device's network traffic from outside (router, gateway, hypervisor) is
+the **gateway observer**. In our lab, the host-only bridge plays exactly that
+role — bridge-side pcap features at training time map 1:1 to gateway-side
+NetFlow/pcap features at inference time. This is what makes
+*train/serve parity* possible for the network channel even though we'll later
+run on bare metal.
+
+See [`threat-model.md`](threat-model.md) for the rest of the parity story
+(host-side QEMU features must NOT be used as inputs to the deployable model;
+they serve as labeling oracles and as inputs to the evaluation-only oracle model).
+
+## Out of scope for this repo
+
+- Authoring novel malware or zero-day exploits.
+- Detection-evasion research targeting other vendors' AV.
+- Production deployment of the trained model — that lives in a separate repo.
--- a/docs/data-model.md
+++ b/docs/data-model.md
@@ -0,0 +1,205 @@
+# Data Model
+
+JSONL only, no database, schema-last. Each episode is a self-contained directory.
+
+## Per-episode layout
+
+```
+data/episodes/<episode_id>/
+  meta.json               # one-time, written at start; updated at end with summary
+  events.jsonl            # orchestrator actions, one row per event
+  labels.jsonl            # phase transitions, one row per transition
+  telemetry-proc.jsonl    # source 1 (oracle)        host /proc/<qemu_pid>
+  telemetry-qmp.jsonl     # source 2 (oracle)        QEMU QMP queries
+  telemetry-perf.jsonl    # source 3 (oracle)        perf stat -p <qemu_pid>
+  telemetry-guest.jsonl   # source 5 (feature)       in-guest agent over virtio-serial
+  network.pcap            # source 4 raw             tcpdump -i br-malware
+  netflow.jsonl           # source 4 bucketed        100ms aggregations of pcap
+  stderr.log              # raw qemu + agent logs
+  done.marker             # written last; marks the episode complete (see transport.md)
+```
+
+`<episode_id>` is a [ULID](https://github.com/ulid/spec) — sortable by time,
+unique without coordination, URL-safe.
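For illustration, a ULID can be generated with the standard library alone. This sketch follows the ULID layout (48-bit millisecond timestamp, 80 random bits, Crockford base32); the function name `new_ulid` is hypothetical, and it does not implement the spec's optional monotonic mode for IDs created within the same millisecond.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms=None):
    """Return a 26-character ULID string."""
    ts = time.time_ns() // 1_000_000 if now_ms is None else now_ms
    if not 0 <= ts < (1 << 48):
        raise ValueError("timestamp does not fit in 48 bits")
    # 128-bit value: timestamp in the high bits, randomness in the low bits.
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, sorting episode directories by name sorts them by creation time (to millisecond resolution).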
+
+## Common fields on every telemetry row
+
+| Field | Type | Notes |
+|---|---|---|
+| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, in ns relative to the episode origin (`t_mono_origin_ns`) |
+| `t_wall_ns` | int | host wall clock, ns since epoch |
+| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
+| `available_in_deployment` | bool | **true = feature, false = oracle** |
+
+The `available_in_deployment` flag is denormalized onto every row so downstream
+loaders don't have to look up a separate manifest to filter for the realistic
+model.
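A loader can use the flag directly. The sketch below is a hypothetical helper (`load_rows` is not part of the repo) that treats a missing flag as "not deployable", which is the conservative choice for the realistic model.

```python
import json


def load_rows(lines, deployable_only=True):
    """Parse JSONL lines, optionally keeping only deployable features."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Rows without the flag are excluded from the realistic view.
        if deployable_only and not row.get("available_in_deployment", False):
            continue
        rows.append(row)
    return rows
```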
+
+## meta.json schema
+
+```json
+{
+  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
+  "schema_version": 1,
+  "started_at_wall": "2026-04-28T22:30:00Z",
+  "ended_at_wall": "2026-04-28T22:33:01Z",
+  "git_commit": "<sha>",
+  "host_fingerprint": {
+    "kernel": "6.18.8",
+    "qemu_version": "9.0.0",
+    "cpu_model": "...",
+    "smt_off": true
+  },
+  "vm": {
+    "image_name": "metasploitable2",
+    "image_sha256": "...",
+    "vcpus": 1,
+    "ram_mib": 512,
+    "cgroup_cpu_cap": "800ms/1s",
+    "snapshot_name": "baseline-v1"
+  },
+  "exploit": {
+    "framework": "metasploit",
+    "module": "exploit/multi/samba/usermap_script",
+    "rport": 445,
+    "rhost": "10.200.0.10"
+  },
+  "sample": {
+    "name": "linux.miner.xmrig.elf",
+    "sha256": "...",
+    "source": "MalwareBazaar",
+    "first_seen": "2024-...",
+    "category": "miner"
+  },
+  "schedule": {
+    "baseline_seconds": 30,
+    "infected_seconds": 90,
+    "dormant_seconds": 60
+  },
+  "result": {
+    "phases_observed": ["clean","armed","infecting","infected_running","dormant"],
+    "exploit_succeeded": true,
+    "sample_executed": true,
+    "snapshot_revert_ok": true
+  }
+}
+```
+
+## events.jsonl
+
+One row per orchestrator action. Tells you exactly what happened and when.
+
+```json
+{"t_mono_ns": 0,         "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
+{"t_mono_ns": 30100000000,"t_wall_ns": 1745875230100000000,"event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
+{"t_mono_ns": 31250000000,"t_wall_ns": 1745875231250000000,"event": "session_open", "session_id": 1}
+{"t_mono_ns": 31300000000,"t_wall_ns": 1745875231300000000,"event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
+{"t_mono_ns": 31400000000,"t_wall_ns": 1745875231400000000,"event": "sample_executed", "pid_reported_by_guest": 1042}
+{"t_mono_ns": 181400000000,"t_wall_ns": 1745875381400000000,"event": "snapshot_revert", "snapshot": "baseline-v1"}
+```
+
+## labels.jsonl
+
+```json
+{"t_mono_ns": 0,            "phase": "clean",           "prev": null,                "reason": "snapshot_loaded"}
+{"t_mono_ns": 30100000000,  "phase": "armed",           "prev": "clean",             "reason": "exploit_module_running"}
+{"t_mono_ns": 31250000000,  "phase": "infecting",       "prev": "armed",             "reason": "session_open"}
+{"t_mono_ns": 31400000000,  "phase": "infected_running","prev": "infecting",         "reason": "sample_executed"}
+{"t_mono_ns": 121400000000, "phase": "dormant",         "prev": "infected_running",  "reason": "scheduler_transition"}
+```
+
+### Phase enum (closed)
+
+```
+clean              — known-good, post-snapshot-load, pre-exploit
+armed              — exploit module is running but no session yet
+infecting          — session opened, sample landing/starting
+infected_running   — sample is actively producing observable behavior
+dormant            — sample is present but idle (sleep timer, beacon interval)
+reverting          — snapshot_load triggered, episode ending
+```
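Because labels are transition rows rather than per-sample tags, a loader has to join telemetry to phases by time. One way to do that, sketched here with a hypothetical `phase_at` helper, is a binary search over the transition timestamps.

```python
from bisect import bisect_right


def phase_at(labels, t_mono_ns):
    """Return the phase active at t_mono_ns.

    labels: label rows sorted by t_mono_ns, as in labels.jsonl.
    Returns None for times before the first transition.
    """
    times = [row["t_mono_ns"] for row in labels]
    i = bisect_right(times, t_mono_ns) - 1
    return labels[i]["phase"] if i >= 0 else None


labels = [
    {"t_mono_ns": 0, "phase": "clean"},
    {"t_mono_ns": 30_100_000_000, "phase": "armed"},
    {"t_mono_ns": 31_250_000_000, "phase": "infecting"},
    {"t_mono_ns": 31_400_000_000, "phase": "infected_running"},
]
print(phase_at(labels, 31_300_000_000))  # infecting
```

A transition timestamp belongs to the new phase, so a sample taken exactly at `30100000000` is labeled `armed`.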
+
+## telemetry-proc.jsonl (source 1, oracle)
+
+```json
+{
+  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
+  "source": "host_proc", "available_in_deployment": false,
+  "cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
+  "rss_bytes": 542113792, "vsize_bytes": 1842933760,
+  "io_read_bytes": 0, "io_write_bytes": 4096,
+  "voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
+  "minor_faults": 412, "major_faults": 0
+}
+```
+
+## telemetry-qmp.jsonl (source 2, oracle)
+
+```json
+{
+  "t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
+  "source": "host_qmp", "available_in_deployment": false,
+  "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
+  "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
+  "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
+}
+```
+
+## telemetry-perf.jsonl (source 3, oracle)
+
+```json
+{
+  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
+  "source": "host_perf", "available_in_deployment": false,
+  "cycles": 184213104, "instructions": 121987001,
+  "cache_references": 1041213, "cache_misses": 38104,
+  "branches": 24198421, "branch_misses": 412004,
+  "page_faults": 12, "context_switches": 18,
+  "ipc": 0.66, "cache_miss_rate": 0.0366
+}
+```
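The two derived fields in the row above follow from the raw counters: `ipc` is instructions per cycle and `cache_miss_rate` is misses per cache reference. A small sketch (the function name `derive_perf_ratios` is hypothetical) that guards against zero denominators on idle intervals:

```python
def derive_perf_ratios(row):
    """Compute ipc and cache_miss_rate from raw perf counters."""
    cycles = row["cycles"]
    refs = row["cache_references"]
    return {
        "ipc": row["instructions"] / cycles if cycles else None,
        "cache_miss_rate": row["cache_misses"] / refs if refs else None,
    }


sample = {
    "cycles": 184213104, "instructions": 121987001,
    "cache_references": 1041213, "cache_misses": 38104,
}
ratios = derive_perf_ratios(sample)
print(round(ratios["ipc"], 2), round(ratios["cache_miss_rate"], 4))  # 0.66 0.0366
```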
+
+## netflow.jsonl (source 4, feature)
+
+Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.
+
+```json
+{
+  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
+  "source": "bridge_pcap", "available_in_deployment": true,
+  "bucket_ms": 100,
+  "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
+  "unique_dst_ips": 0, "unique_dst_ports": 0,
+  "syn_count": 0, "fin_count": 0, "rst_count": 0,
+  "dns_query_count": 0, "tcp_new_flows": 0
+}
+```
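The bucketing step can be sketched as below. Several things here are assumptions rather than documented behavior: the input packet shape, the convention that "in" and "out" are relative to the guest, and stamping each row with the bucket's start time. Only a subset of the row's counters is computed, and empty buckets are not emitted, whereas the all-zero sample row suggests the real collector does emit them.

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms, matching "bucket_ms": 100


def bucket_packets(packets, guest_ip):
    """Aggregate packets into fixed-width buckets keyed by bucket start.

    packets: iterable of dicts with t_mono_ns, src, dst, length
    (hypothetical shape; a real collector would parse these from the pcap).
    """
    buckets = defaultdict(lambda: {"pkts_in": 0, "pkts_out": 0,
                                   "bytes_in": 0, "bytes_out": 0,
                                   "dst_ips": set()})
    for p in packets:
        b = buckets[(p["t_mono_ns"] // BUCKET_NS) * BUCKET_NS]
        if p["src"] == guest_ip:      # traffic leaving the guest
            b["pkts_out"] += 1
            b["bytes_out"] += p["length"]
            b["dst_ips"].add(p["dst"])
        else:                         # traffic arriving at the guest
            b["pkts_in"] += 1
            b["bytes_in"] += p["length"]
    rows = []
    for start in sorted(buckets):
        b = buckets[start]
        rows.append({
            "t_mono_ns": start,
            "source": "bridge_pcap",
            "available_in_deployment": True,
            "bucket_ms": BUCKET_NS // 1_000_000,
            "pkts_in": b["pkts_in"], "pkts_out": b["pkts_out"],
            "bytes_in": b["bytes_in"], "bytes_out": b["bytes_out"],
            "unique_dst_ips": len(b["dst_ips"]),
        })
    return rows
```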
+
+## telemetry-guest.jsonl (source 5, feature)
+
+```json
+{
+  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
+  "source": "guest_agent", "available_in_deployment": true,
+  "cpu_pct_total": 12.4, "load_1m": 0.41,
+  "mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
+  "thermal_milli_c": 47200,
+  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
+  "top_procs": [
+    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
+    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
+  ],
+  "listen_ports": [22, 80, 445]
+}
+```
+
+## Versioning
+
+`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep
+old episodes untouched; loaders dispatch on version.
+
+## Ingest later
+
+When we move to a database (Timescale most likely), each `telemetry-*.jsonl`
+becomes one hypertable, partitioned by `t_wall_ns`, indexed on
+`(episode_id, source)`. The deployment-tag flag becomes a column we filter on
+when materializing the realistic-model training view.
--- a/docs/deploy.md
+++ b/docs/deploy.md
@@ -0,0 +1,138 @@
+# Deploy
+
+Two roles. One install command each.
+
+## Roles
+
+| Role | Where it runs | What it does |
+|---|---|---|
+| `lab-host` | any KVM-capable Linux box on WG | runs episodes, ships completed episodes to the receiver |
+| `receiver` | Pi5 (or any always-on WG node) | accepts ship uploads, stores tarballs + `index.jsonl` |
+
+## Lab host install
+
+```sh
+git clone https://maxgit.wg/spectral/CIS490.git
+cd CIS490
+./scripts/install-lab-host.sh
+```
+
+The installer:
+
+1. Verifies KVM (`/dev/kvm` exists, user in `kvm` group).
+2. Installs system deps via the host package manager (qemu, tcpdump,
+   linux-tools/perf, zstd, python ≥ 3.11).
+3. Bootstraps a [`uv`](https://github.com/astral-sh/uv)-managed venv at
+   `.venv/` and installs the pinned Python deps from `uv.lock`.
+4. Drops two systemd units into `/etc/systemd/system/`:
+   - `cis490-orchestrator.service` — runs the episode loop on a queue
+   - `cis490-shipper.service` — watches `data/episodes/` and ships completed
+     episodes
+5. Writes a config template to `/etc/cis490/lab-host.toml` (idempotent — only
+   on first install).
+
+You finish by editing `/etc/cis490/lab-host.toml` to point at your receiver
+and to enroll your lab host's WG-issued client cert, then:
+
+```sh
+sudo systemctl enable --now cis490-orchestrator cis490-shipper
+```
+
+### `lab-host.toml`
+
+```toml
+host_id = "lab-host-1"
+
+[paths]
+data_root = "/var/lib/cis490/data"
+samples_store = "/var/lib/cis490/samples/store"
+qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"
+
+[receiver]
+url = "https://collector.wg"
+client_cert = "/etc/cis490/certs/lab-host-1.pem"
+client_key = "/etc/cis490/certs/lab-host-1.key"
+ca_bundle = "/etc/cis490/certs/wg-ca.pem"
+
+[episode]
+baseline_seconds = 30
+infected_seconds = 90
+dormant_seconds = 60
+
+[retention]
+keep_local_for_days = 7
+prune_at_disk_pct = 80
+```
+
+## Receiver install
+
+On the Pi5 (or designated central node):
+
+```sh
+git clone https://maxgit.wg/spectral/CIS490.git
+cd CIS490
+./scripts/install-receiver.sh
+```
+
+The installer:
+
+1. Installs Python ≥ 3.11 + zstd + a small ASGI server (uvicorn).
+2. Bootstraps the same `uv`-managed venv.
+3. Drops `cis490-receiver.service` listening on `127.0.0.1:8443` (TLS
+   terminated by the existing Caddy in `spectral/caddy`, which already binds
+   `*.wg`).
+4. Writes a config template to `/etc/cis490/receiver.toml`.
+
+Caddy block (added to your `spectral/caddy` config) for the receiver:
+
+```caddy
+collector.wg {
+    tls internal
+    reverse_proxy 127.0.0.1:8443 {
+        transport http {
+            tls
+            tls_client_auth /etc/cis490/certs/wg-ca.pem
+        }
+    }
+}
+```
+
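+(The intent is for mTLS to terminate at the receiver, not Caddy, so the
+receiver sees the client cert and can enforce per-host policies later. Note
+that `reverse_proxy` terminates the client's TLS session at Caddy, so this
+block still needs either TLS passthrough or client-cert verification at Caddy
+with the verified identity forwarded to the receiver.)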
+
+### `receiver.toml`
+
+```toml
+listen_addr = "127.0.0.1:8443"
+store_root = "/var/lib/cis490/episodes"
+incoming_root = "/var/lib/cis490/incoming"
+index_path = "/var/lib/cis490/index.jsonl"
+ca_bundle = "/etc/cis490/certs/wg-ca.pem"
+
+[limits]
+max_episode_bytes = 268_435_456    # 256 MiB
+```
+
+## Day-2 operations
+
+```sh
+# How many episodes have been shipped?
+ssh collector.wg 'wc -l /var/lib/cis490/index.jsonl'
+
+# What's in the outbox on a lab host? (failed/pending shipments)
+ls /var/lib/cis490/data/outbox/
+
+# Tail the orchestrator log
+journalctl -u cis490-orchestrator -f
+
+# Tail the shipper log
+journalctl -u cis490-shipper -f
+```
+
+## Updating
+
+```sh
+git pull
+./scripts/install-lab-host.sh   # idempotent; re-syncs deps and units
+sudo systemctl restart cis490-orchestrator cis490-shipper
+```
--- a/docs/lab-setup.md
+++ b/docs/lab-setup.md
@@ -0,0 +1,145 @@
+# Lab Setup
+
+How to bring up the host, build the guest, and verify the snapshot loop.
+
+## Host prerequisites
+
+```
+qemu-system-x86_64   >= 8.0
+qemu-img             >= 8.0
+bridge-utils
+tcpdump / tshark
+linux-tools-common           (for `perf`)
+zstd
+python >= 3.11
+uv                           (https://github.com/astral-sh/uv)
+```
+
+`scripts/install-lab-host.sh` installs all of these and wires up systemd —
+see [`deploy.md`](deploy.md).
+
+KVM must be enabled in the kernel and the user must be in the `kvm` group:
+
+```
+ls /dev/kvm                  # must exist
+groups                       # must include kvm
+```
+
+## Network: host-only malware bridge
+
+`br-malware` (10.200.0.1/24) is the only network the guest sees, and it is
+host-only — no NAT, no upstream route. The host's WG interface is on a
+*separate* link (`wg0`) used only for shipping completed episodes to the
+collector; the bridge and WG never touch.
+
+| Interface | Purpose |
+|---|---|
+| `br-malware` (10.200.0.1/24) | host-only bridge, only NIC attached to the guest |
+| guest `eth0` | DHCP from a dnsmasq bound only to `br-malware` |
+| host WG (`wg0`) | shipping channel to the collector — not connected to the bridge |
+
+> Detailed firewall rules and the egress-drop safety net are out of scope for
+> this document and live in the deploy script. The relevant invariant for
+> readers is: **the guest cannot route off `br-malware`, period.**
+
+## Guest: Metasploitable 2
+
+1. Download from the [Rapid7 mirror](https://information.rapid7.com/download-metasploitable-2017.html)
+   (verify sha256 against the published value before use).
+2. Convert VMware → qcow2:
+
+   ```
+   qemu-img convert -O qcow2 -p Metasploitable.vmdk metasploitable2.qcow2
+   ```
+
+3. First boot (no snapshot yet) — let it come up, log in (msfadmin/msfadmin),
+   confirm services are listening on the expected ports, shut down cleanly.
+4. Take the baseline snapshot:
+
+   ```
+   qemu-img snapshot -c baseline-v1 metasploitable2.qcow2
+   ```
+
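+   Reverting an internal qcow2 snapshot is fast, which makes it the
+   "factory reset" mechanism for every episode. Because this snapshot is taken
+   with the VM shut down, it captures disk state only, so each episode still
+   boots the guest. Capturing RAM state as well would need `savevm` from the
+   QEMU monitor on a running VM.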
+
+## Single-vCPU constrained-device emulation
+
+```
+-cpu host -smp 1,sockets=1,cores=1,threads=1
+-m 512
+-machine type=q35,accel=kvm
+```
+
+Plus a host-side cgroup CPU cap on the QEMU process (e.g. 80% of one core) so
+the guest behaves like a small, constrained device under load.
+
+## Telemetry channels
+
+### virtio-serial for the in-guest agent
+
+```
+-device virtio-serial-pci
+-chardev socket,path=/run/qemu/guest-agent.sock,server=on,wait=off,id=ga
+-device virtserialport,chardev=ga,name=cis490.guest.agent
+```
+
+The in-guest agent opens `/dev/virtio-ports/cis490.guest.agent` and writes
+JSONL to it. Host side, the orchestrator reads from the unix socket. The channel
+involves no network, so malware cannot reach it over `br-malware`; malware with
+root in the guest can still kill or spoof the agent, which the threat model covers.
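The host-side reader can be sketched as a line-buffered loop over the unix socket. QEMU is the socket server here (`server=on`), so the host process connects as a client. The function names are hypothetical, and the choice to skip unparseable lines rather than abort is an assumption: the guest is untrusted, so torn or garbage lines should not take the collector down.

```python
import json
import socket


def connect_agent(path="/run/qemu/guest-agent.sock"):
    """Connect to the chardev socket that QEMU exposes for the agent port."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    return s


def read_agent_rows(sock, bufsize=4096):
    """Yield parsed JSON rows from a connected stream socket until EOF."""
    buf = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if not line.strip():
                continue
            try:
                yield json.loads(line)
            except ValueError:
                # Covers JSON and encoding errors; skip the bad line.
                continue
```

A real collector would re-stamp each row with the host clock on receipt, since the guest's own timestamps are not trustworthy.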
+
+### QMP for live oracle queries
+
+```
+-qmp unix:/run/qemu/qmp.sock,server=on,wait=off
+```
+
+The orchestrator polls `query-stats`, `query-blockstats`, and netdev stats over
+this socket.
+
+### perf stat on the QEMU process
+
+```
+perf stat -p <qemu_pid> -I 100 \
+  -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,context-switches \
+  -x , -o telemetry-perf.csv
+```
+
+The collector tails the CSV, parses, and emits JSONL.
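A sketch of the parsing step follows. It assumes the interval CSV layout `timestamp,value,unit,event,...` that `perf stat -I ... -x ,` commonly produces; the exact columns vary across perf versions, so check them against real output before relying on this. The function name is hypothetical.

```python
def parse_perf_interval_csv(lines):
    """Group per-event CSV lines into one dict per interval timestamp.

    Assumed column order: time, value, unit, event, then extra fields.
    """
    rows = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(",")
        if len(parts) < 4:
            continue
        t_s, value, _unit, event = parts[:4]
        try:
            count = int(value)
        except ValueError:
            count = None  # e.g. "<not counted>"
        # cache-misses -> cache_misses, matching the JSONL field names.
        rows.setdefault(float(t_s), {})[event.replace("-", "_")] = count
    return [dict(t_interval_s=t, **counters) for t, counters in sorted(rows.items())]
```

The interval timestamp is seconds since perf started, so the collector still has to map it onto the episode's `t_mono_ns` axis.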
+
+### tcpdump on `br-malware`
+
+```
+tcpdump -i br-malware -w network.pcap -B 4096 -s 200
+```
+
+Post-process to `netflow.jsonl` with 100ms buckets.
+
+## Snapshot loop sanity check
+
+A green light before any data collection:
+
+1. `qemu-img snapshot -l metasploitable2.qcow2` shows `baseline-v1`.
+2. Boot the VM with the qcow2.
+3. Touch a file in the guest. Shut down.
+4. `qemu-img snapshot -a baseline-v1 metasploitable2.qcow2`.
+5. Boot again. The file is gone. ✅
+
+## Safety checks before running real samples
+
+- `ip route show table all | grep br-malware` shows only the connected
+  `10.200.0.0/24` route, with no route off the bridge.
+- `dig @10.200.0.1 example.com` from a guest fails (no DNS for malware).
+- The host's WG interface is **not** bridged to `br-malware`.
+
+(See `scripts/install-lab-host.sh` for the firewall plumbing — it isn't the
+focus of this project.)
+
+## Where to put VMs and snapshots
+
+```
+vm/images/      # qcow2 disk images (gitignored)
+vm/snapshots/   # named snapshot exports if we ever externalize them
+```
+
+Both directories are gitignored. The repo only carries the *recipes* for
+reproducing them.
--- a/docs/threat-model.md
+++ b/docs/threat-model.md
@@ -0,0 +1,94 @@
+# Threat Model & Train/Serve Parity
+
+The single most important design rule in this project:
+
+> **A feature used by the deployed model must exist on the deployed device.**
+
+Violating this rule produces a model that scores 99% in the lab and is useless in
+the field. This document spells out which features fall on which side of that
+line, and why we still bother capturing both.
+
+## The setting
+
+The deployed model runs on a real, non-virtualized device — a Raspberry Pi, an
+IoT endpoint, or similar. It tries to detect the moment that device gets
+breached. Two adversarial facts shape the design:
+
+1. **Malware can lie to in-device tools.** A sufficiently privileged rootkit can
+   hook `/proc`, intercept `perf_event_open`, and hide its own processes.
+2. **There is no host-side QEMU view.** The deployed device is the actual
+   machine. Nothing is watching it from outside *the OS itself*.
+
+So the model has two trustworthy floors:
+
+- **In-device features that survive most malware** (perf counters via the syscall
+  interface, thermals, gross resource counters): possible to falsify in principle,
+  but in practice most commodity malware doesn't bother.
+- **Off-device features at the gateway** (network telemetry observed by an
+  upstream router/gateway): physics-bound, because the malware cannot hide the
+  bytes it sends from an observer beyond the NIC.
+
+## Two roles: features vs. oracles
+
+Every measurement we capture in the lab gets one of two roles:
+
+| Role | What it's used for | Available in deployment? |
+|---|---|---|
+| **Feature** | Input to the trained model | **Must be yes** |
+| **Oracle** | Ground-truth labeling during training only | No — but we have it in the lab |
+
+The oracle channels (host `/proc/<qemu_pid>`, QMP `query-stats`,
+`perf -p qemu_pid`) are how we know with certainty what the guest is *actually*
+doing, not what it claims to be doing. We use that certainty to assign correct
+labels in `labels.jsonl`. The realistic (deployable) model never sees them at
+training time; only the evaluation-only oracle model described below does.
+
+## Channel taxonomy
+
+| # | Channel | Vantage | Role | Why |
+|---|---|---|---|---|
+| 1 | Host `/proc/<qemu_pid>` | outside guest | oracle | doesn't exist on real device |
+| 2 | QEMU QMP `query-stats`, `query-blockstats` | outside guest | oracle | same |
+| 3 | `perf stat -p <qemu_pid>` | outside guest | oracle | same |
+| 4 | Bridge-side pcap (`tcpdump -i br-malware`) | gateway | **feature** | matches Pi5 gateway in field |
+| 5 | In-guest `/proc/*`, `perf_event_open`, `/sys/class/thermal/*` | inside guest | **feature** | same exact source on real device |
+
+Note: in-guest features (5) are the same syscall surfaces we'd read on a real
+Pi. The data we capture from them in the lab and the data we capture from them
+on a Pi at deployment are pulled from identical kernel APIs — that is what makes
+parity hold.
+
+## The two-model evaluation strategy
+
+We will train two classifiers from the same dataset:
+
+1. **Realistic model** — trained only on features (channels 4 + 5).
+   *This is what would be deployed.*
+2. **Oracle model** — trained on everything (channels 1–5).
+   *This is the upper bound on what was learnable from this dataset.*
+
+The interesting numbers are:
+
+- The realistic model's accuracy → **how well the deployable system actually works**.
+- The gap between realistic and oracle → **how much detection power a privileged
+  rootkit can take away from us by lying** (the cost of in-device measurement
+  being adversarial).
+- Realistic-without-pcap vs. realistic-with-pcap → **how much the gateway
+  observer is worth** (the value of off-device telemetry).
+
+Reporting all three is the paper-worthy result — not just one accuracy number.
+
+## Using oracle disagreement as a label
+
+When the in-guest agent reports CPU at 8% and the host says the QEMU process is
+burning 70%, that *is* the rootkit-active signal — and it's more reliable than
+any single in-guest measurement. Episodes where this divergence appears get a
+high-confidence `infected_running` label even if the in-guest data alone looks
+ambiguous. This is the practical payoff of capturing both sides.
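The divergence check can be sketched from two consecutive `telemetry-proc.jsonl` rows and one guest reading. The function names, the 30-point threshold, and the default `clk_tck=100` are illustrative assumptions (the real tick rate comes from `os.sysconf("SC_CLK_TCK")`). The QEMU process also burns CPU on emulation overhead, so the threshold would need calibrating against clean baselines.

```python
def host_cpu_pct(prev, curr, clk_tck=100):
    """CPU usage of the QEMU process between two host_proc rows, in percent."""
    d_jiffies = (
        (curr["cpu_user_jiffies"] + curr["cpu_sys_jiffies"])
        - (prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"])
    )
    d_seconds = (curr["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
    return 100.0 * d_jiffies / clk_tck / d_seconds


def diverges(guest_cpu_pct, host_pct, threshold_pct=30.0):
    """True when the host sees far more CPU use than the guest admits to."""
    return (host_pct - guest_cpu_pct) > threshold_pct


prev = {"t_mono_ns": 0, "cpu_user_jiffies": 100, "cpu_sys_jiffies": 20}
curr = {"t_mono_ns": 1_000_000_000, "cpu_user_jiffies": 160, "cpu_sys_jiffies": 30}
host = host_cpu_pct(prev, curr)  # 70 jiffies over 1 s at 100 Hz
print(host, diverges(8.0, host))  # 70.0 True
```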
+
+## What we are not claiming
+
+- We are not claiming to detect kernel rootkits robustly from in-guest data alone.
+  The oracle/feature gap will quantify the limit.
+- We are not claiming the trained model is safe to deploy without the gateway
+  observer in production — for the strongest threat model, gateway-side fusion
+  is required.
--- a/docs/transport.md
+++ b/docs/transport.md
@ -0,0 +1,164 @@
+# Transport — Centralized Episode Collection over WG
+
+The dataset lives wherever it is convenient to train from. In our setup that is
+the Pi5 (or whatever the team designates as the central collector), reachable
+over the WG overlay at `<receiver-host>.wg`. This document describes how
+episodes get from a lab host to the central collector.
+
+## Design goals
+
+1. **Easy to deploy.** One config file, one systemd unit per side. No DB
+   required to start collecting.
+2. **WG-native.** Sender and receiver both live on the WG overlay; transport is
+   just HTTPS over WG. We use the existing wg-pki CA for mTLS.
+3. **Idempotent.** Re-shipping the same episode is safe and cheap; the
+   receiver responds 200 if the bytes already match.
+4. **Crash-safe.** Lab host crash mid-episode does not corrupt the central
+   store. Receiver crash mid-upload leaves no partial visible.
+5. **Schema-free.** The receiver does not parse JSONL; it stores tarballs and
+   an append-only index. The schema lives only at training time.
+
+## What gets shipped
+
+A complete episode directory is tarred and zstd-compressed:
+
+```
+data/episodes/<episode_id>/   →   <episode_id>.tar.zst
+```
+
+The orchestrator marks an episode complete by writing a `done.marker` file as
+its *last* write to the directory, after `meta.json` is finalized. The shipper
+only considers directories that contain `done.marker`, so partially written
+episodes are invisible to it.
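
A sketch of both sides of the marker convention, using only the standard library. The function names are illustrative, not an existing API in this repo.

```python
from pathlib import Path

MARKER = "done.marker"


def mark_complete(episode_dir: Path) -> None:
    """Orchestrator side: call only after meta.json is finalized."""
    if not (episode_dir / "meta.json").is_file():
        raise FileNotFoundError("meta.json must exist before done.marker")
    (episode_dir / MARKER).touch()


def complete_episodes(episodes_root: Path) -> list[Path]:
    """Shipper side: return only directories that contain done.marker."""
    return sorted(
        d for d in episodes_root.iterdir()
        if d.is_dir() and (d / MARKER).is_file()
    )
```

Because the marker is the last thing written, its presence is the only completeness check the shipper needs.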
+
+## Wire protocol
+
+```
+PUT https://<receiver-host>.wg/v1/episodes/<host_id>/<episode_id>.tar.zst
+    Content-Type: application/zstd
+    Content-Length: <bytes>
+    X-Content-SHA256: <sha256-of-body>
+    X-Schema-Version: 1
+    X-Lab-Host: <host_id>
+    X-Episode-Id: <episode_id>
+    body: <the tar.zst bytes>
+```
+
+Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver
+trusts only certs issued by that CA.
+
+Responses:
+
+| Status | Meaning |
+|---|---|
+| 201 | Stored; new |
+| 200 | Already present with matching sha256; nothing to do |
+| 409 | Already present with **different** sha256; receiver refuses to overwrite |
+| Other 4xx | Bad request (missing header, malformed id, body sha256 mismatch, etc.) |
+| 5xx | Server error; sender retries with backoff |
+
+There is no DELETE. Episodes are immutable once shipped.
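
A sketch of how the sender might assemble this request with the standard library. It builds the request without sending it; hex encoding of the digest is an assumption, since the header is only specified as `<sha256-of-body>`.

```python
import hashlib
import urllib.request


def build_put(receiver_host: str, host_id: str, episode_id: str,
              body: bytes) -> urllib.request.Request:
    """Build (but do not send) the upload request described above."""
    url = f"https://{receiver_host}/v1/episodes/{host_id}/{episode_id}.tar.zst"
    headers = {
        "Content-Type": "application/zstd",
        "Content-Length": str(len(body)),
        # Assumption: the digest travels as lowercase hex.
        "X-Content-SHA256": hashlib.sha256(body).hexdigest(),
        "X-Schema-Version": "1",
        "X-Lab-Host": host_id,
        "X-Episode-Id": episode_id,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")
```

Actually sending it would additionally need an `ssl.SSLContext` with the client certificate loaded via `load_cert_chain`, which is left out here.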
+
+## Sender (`shipper`) state machine
+
+```
+  scan data/episodes/
+        |
+        v
+  for each <id>/done.marker:
+        |
+        v
+  tar+zstd → data/outbox/<id>.tar.zst.partial
+        |
+        v
+  rename → data/outbox/<id>.tar.zst        (atomic; visible to retry loop)
+        |
+        v
+  PUT to receiver
+        |
+        +-- 200/201 → mv data/episodes/<id> data/shipped/<id>;
+        |              rm data/outbox/<id>.tar.zst
+        |
+        +-- 409 → log mismatch, leave files in place, alert (manual triage)
+        |
+        +-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry
+```
+
+The shipper does the same scan on every wake-up, so a crash mid-tar or
+mid-PUT is harmless: the next pass redoes the interrupted step, and a repeated
+PUT is safe because uploads are idempotent.
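
The retry schedule and response handling from the diagram could be sketched as follows. The action names are hypothetical, and the treatment of 4xx codes other than 409 is an assumption, since the diagram does not cover them.

```python
from typing import Optional


def backoff_delays(base: float = 1.0, cap: float = 300.0):
    """Yield 1s, 2s, 4s, 8s, ... capped at 5 minutes."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def next_action(status: Optional[int]) -> str:
    """Map a PUT outcome to the shipper's next step (None = network error)."""
    if status in (200, 201):
        return "archive"   # move to data/shipped/, drop the outbox tarball
    if status == 409:
        return "alert"     # leave files in place for manual triage
    if status is None or status >= 500:
        return "retry"     # back off and try again
    return "alert"         # other 4xx: assumed non-retryable
```

Keeping the mapping in one pure function makes the shipper's behavior easy to unit-test without a live receiver.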
+
+## Receiver state machine
+
+```
+  PUT body received
+        |
+        v
+  stream into /var/lib/cis490/incoming/<host>/<id>.tar.zst.partial
+        |
+        v
+  compute sha256 while streaming
+        |
+        +-- mismatch with header → 400, delete partial
+        |
+        +-- match:
+              |
+              v
+          if final path exists:
+              |
+              +-- existing sha256 == new sha256 → 200, delete partial
+              |
+              +-- existing sha256 != new sha256 → 409, delete partial
+          else:
+              |
+              v
+          atomic rename → /var/lib/cis490/episodes/<host>/<id>.tar.zst
+              |
+              v
+          append index.jsonl row
+              |
+              v
+          201
+```
+
+`index.jsonl` row:
+
+```json
+{
+  "received_at_wall": "2026-04-28T22:31:43Z",
+  "host_id": "lab-host-1",
+  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
+  "sha256": "...",
+  "size_bytes": 8412331,
+  "schema_version": 1
+}
+```
+
+That index is the closest thing to a database we have until we decide on one.
+A trainer can stream it to know what episodes exist, then untar on demand.
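
The receiver's storage logic, independent of any web framework, might look like this sketch. The directory layout under `root`, the location of `index.jsonl`, and the function names are assumptions; the index row omits `received_at_wall` and `schema_version` for brevity.

```python
import hashlib
import json
import os
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash an existing file in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store_episode(root: Path, host: str, episode_id: str,
                  chunks, claimed_sha256: str) -> int:
    """Return the HTTP status the receiver would send for this upload."""
    partial = root / "incoming" / host / f"{episode_id}.tar.zst.partial"
    final = root / "episodes" / host / f"{episode_id}.tar.zst"
    partial.parent.mkdir(parents=True, exist_ok=True)
    final.parent.mkdir(parents=True, exist_ok=True)

    digest, size = hashlib.sha256(), 0
    with partial.open("wb") as f:
        for chunk in chunks:       # hash while streaming to disk
            digest.update(chunk)
            f.write(chunk)
            size += len(chunk)
    actual = digest.hexdigest()

    if actual != claimed_sha256:
        partial.unlink()
        return 400
    if final.exists():
        partial.unlink()
        return 200 if sha256_file(final) == actual else 409
    os.replace(partial, final)     # atomic when both paths share a filesystem
    row = {"host_id": host, "episode_id": episode_id,
           "sha256": actual, "size_bytes": size}
    with (root / "index.jsonl").open("a") as idx:
        idx.write(json.dumps(row) + "\n")
    return 201
```

This sketch does not guard against two concurrent uploads of the same episode, nor against a crash between the rename and the index append; a real receiver would need to decide how to handle both.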
+
+## Why not just rsync?
+
+`rsync` works, but:
+
+- No schema-version tagging at the protocol layer.
+- No clean way to enforce "immutable once written".
+- mTLS via WG-issued certs is more uniform with the rest of the overlay than
+  ssh-key juggling.
+- A tiny FastAPI receiver is also a natural place to add ingest-time hooks
+  later (e.g. emit a Matrix notification on successful receipt, kick off a
+  training run when N new episodes arrive).
+
+We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a
+class project that is unlikely.
+
+## Operational notes
+
+- **Disk on lab host.** The shipper keeps episodes locally in
+  `data/shipped/<id>/` until a retention pass prunes them. Default retention:
+  7 days *or* 80% disk usage, whichever comes first.
+- **Disk on receiver.** No retention enforced by default — the central store
+  is the dataset.
+- **Backpressure.** If the receiver is unreachable (WG down, Pi rebooting),
+  the shipper accumulates tarballs in `data/outbox/`. No data is lost as long
+  as the lab host has disk space for the backlog.
+- **Multiple lab hosts.** Each writes under its own `<host_id>/` prefix. No
+  coordination needed; episode ids are globally unique (ULID).
--- a/exploits/README.md
+++ b/exploits/README.md
@ -0,0 +1,12 @@
+# exploits/
+
+Metasploit resource scripts (`*.rc`) that drive specific exploit modules
+deterministically — same inputs, same module options, every time.
+
+Each script:
+- Sets `RHOSTS` to the guest's bridge IP.
+- Sets a payload that opens a session usable for sample upload + execute.
+- Avoids any options that introduce randomness in the exploit fire timing
+  (so that the `armed → infecting` transition lands at a predictable offset).
+
+These scripts pair with public Metasploit modules. We do not author exploits.
--- a/orchestrator/README.md
+++ b/orchestrator/README.md
@ -0,0 +1,21 @@
+# orchestrator/
+
+The state machine that drives a single **episode**:
+
+```
+snapshot_load → clean → armed → infecting → infected_running → dormant → reverting
+```
+
+Responsibilities:
+
+- Bring up the host-only bridge and verify isolation before the guest starts.
+- Boot the guest from a named snapshot.
+- Spawn the five telemetry collectors (`collectors/`) with a shared episode id
+  and shared monotonic clock origin.
+- Drive the Metasploit Framework over RPC to fire the configured exploit module.
+- Upload + execute the configured malware sample once a session is open.
+- Emit phase transitions to `labels.jsonl` *at the moment the action is taken*.
+- Revert the snapshot at episode end.
+- Write `meta.json` with the result summary.
+
+Implementation lives in this directory and is imported as `orchestrator.*`.
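
A minimal sketch of the phase sequence and label emission, assuming a strictly linear episode with no failure transitions. The `Episode` class and the `labels.jsonl` field names (`episode_id`, `phase`, `t_mono`) are illustrative; the authoritative row schema is in the data-model doc.

```python
import json
import time
from enum import Enum


class Phase(str, Enum):
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


ORDER = list(Phase)  # Enum iteration preserves definition order


class Episode:
    def __init__(self, episode_id, labels_file):
        self.episode_id = episode_id
        self.labels_file = labels_file
        self.origin = time.monotonic()  # shared monotonic clock origin
        self.phase = None

    def advance(self):
        """Enter the next phase and write its label row immediately."""
        idx = 0 if self.phase is None else ORDER.index(self.phase) + 1
        if idx >= len(ORDER):
            raise RuntimeError("episode already finished")
        nxt = ORDER[idx]
        row = {
            "episode_id": self.episode_id,
            "phase": nxt.value,
            "t_mono": time.monotonic() - self.origin,
        }
        self.labels_file.write(json.dumps(row) + "\n")
        self.labels_file.flush()
        self.phase = nxt
        return nxt
```

Calling `advance()` in the same code path that performs the action keeps label timestamps aligned with what actually happened, rather than with when a later summary was written.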
--- a/samples/README.md
+++ b/samples/README.md
@ -0,0 +1,33 @@
+# samples/
+
+**Sample binaries are NEVER committed to this repo.** This directory holds:
+
+- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata
+  (source, category, expected behavior, target CVE).
+- `fetch.py` — script that pulls samples from configured sources
+  (MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them
+  under `samples/store/` (gitignored).
+- Per-sample notes in markdown describing observed behavior in our lab.
+
+`samples/store/` lives only on the lab host. It is gitignored *and* should
+sit on a disk that is not auto-mounted on developer workstations.
+
+## Manifest entry shape (placeholder)
+
+```yaml
+samples:
+  - name: linux.miner.xmrig.elf
+    sha256: "..."                # pinned
+    source: MalwareBazaar
+    category: miner
+    target_cve: null              # cryptominers are usually post-exploit payloads
+    behavior: "high CPU, periodic stratum protocol traffic"
+    pairs_with_exploit: exploit/multi/samba/usermap_script
+```
+
+## Safety rules
+
+- Only download to the lab host, never to a developer workstation.
+- Verify sha256 immediately, before any other read.
+- Keep the directory on a path that is *not* shared or exported over the WG overlay.
+- Re-verify sha256 before each detonation; refuse to run on mismatch.
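
The verify-before-detonation rule could be enforced with a helper along these lines. `verify_sample` and `SampleMismatch` are hypothetical names, not existing code in this repo.

```python
import hashlib
from pathlib import Path


class SampleMismatch(Exception):
    """Raised when a sample's bytes do not match its pinned sha256."""


def verify_sample(path: Path, expected_sha256: str) -> None:
    """Hash the file in chunks and refuse to proceed on mismatch."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if actual != expected_sha256.lower():
        raise SampleMismatch(f"{path.name}: expected {expected_sha256}, got {actual}")
```

Raising instead of returning a boolean means a caller that forgets to check the result still cannot run a mismatched sample.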
--- a/training/README.md
+++ b/training/README.md
@ -0,0 +1,23 @@
+# training/
+
+Deferred until the dataset has substance. The plan, recorded so we don't lose
+it:
+
+1. Two models will be trained from the same episodes:
+   - **Realistic** — features only (`available_in_deployment: true`).
+   - **Oracle** — all rows, regardless of the deployment flag.
+2. Baseline architecture: a rolling-window feature builder + a gradient-boosted
+   trees classifier (XGBoost or LightGBM). Cheap, strong, interpretable.
+3. Window: 1–5 second sliding windows with per-channel summary stats
+   (mean, std, p95, slope, count of zero buckets).
+4. Target: the phase enum from `labels.jsonl`, projected onto each window's
+   center timestamp.
+5. Evaluation:
+   - Held-out *samples* (not just held-out time slices) — generalization to
+     unseen malware matters more than within-sample accuracy.
+   - Confusion matrix + per-phase precision/recall.
+   - Realistic vs. oracle gap, reported.
+6. Stretch: trust-over-time scoring per the IEEE 9881803 paper, with a reset
+   threshold tuned for low false-positive cost.
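
The per-channel summary stats from item 3 could be computed as in this standard-library sketch. It assumes each window arrives as a list of evenly spaced bucket values, uses the nearest-rank definition for p95, and measures slope per bucket index; all of these are illustrative choices, not decisions the plan has made.

```python
import math
import statistics


def window_stats(values):
    """Summary stats for one channel over one window of evenly spaced buckets."""
    if not values:
        raise ValueError("window must contain at least one bucket")
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std over the window
    ordered = sorted(values)
    p95 = ordered[min(n - 1, math.ceil(0.95 * n) - 1)]  # nearest-rank percentile
    # Least-squares slope against the bucket index 0..n-1.
    x_mean = (n - 1) / 2
    denom = sum((i - x_mean) ** 2 for i in range(n))
    if denom:
        slope = sum((i - x_mean) * (v - mean) for i, v in enumerate(values)) / denom
    else:
        slope = 0.0  # a single bucket has no trend
    return {
        "mean": mean,
        "std": std,
        "p95": p95,
        "slope": slope,
        "zero_buckets": sum(1 for v in values if v == 0),
    }
```

For a steadily rising window such as `[0, 1, 2, 3, 4]` the slope comes out as one unit per bucket, which is the kind of trend signal a ramping miner would produce.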
+
+See [`docs/threat-model.md`](../docs/threat-model.md) for why this split exists.
--- a/vm/README.md
+++ b/vm/README.md
@ -0,0 +1,17 @@
+# vm/
+
+Recipes and helpers for building and snapshotting guest VMs. Disk images and
+snapshots themselves are gitignored — this directory carries the *how*, not
+the bytes.
+
+```
+vm/
+  images/                # qcow2 staging (gitignored)
+  snapshots/             # exported snapshots if needed (gitignored)
+  guest-agent/           # in-guest telemetry agent (shipped into the guest)
+  metasploitable2.md     # download/convert/snapshot procedure (TODO)
+  custom-debian/         # cloud-init for our own vulnerable Debian (TODO)
+```
+
+See [`docs/lab-setup.md`](../docs/lab-setup.md) for the full host + guest
+bring-up procedure.