# Architecture

## One-paragraph summary

A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux guest. Each episode boots from a clean snapshot, captures a baseline, fires a known exploit, drops a public malware sample, observes the infection envelope, and reverts the snapshot. Telemetry is captured from five vantage points simultaneously, all stamped with the host monotonic clock so rows align. The output of an episode is a self-contained directory of JSONL files plus a pcap.

## Lab topology

```
+---------------------------------------------------------------+
|  VM HOST (this machine, /home/maximus/.env/qemu)              |
|                                                               |
|  +-----------------------+      +------------------------+    |
|  |  KVM guest            |      | orchestrator (host)    |    |
|  |  (Metasploitable2,    |      |  - snapshot loop       |    |
|  |   1 vCPU, capped)     |      |  - exploit driver      |    |
|  |                       |<====>|  - phase labeler       |    |
|  |  in-guest agent  -----|virtio|                        |    |
|  |                       |serial|  collectors:           |    |
|  |  vNIC ----------------|      |   * host /proc/qemu_pid|    |
|  |        |              |      |   * QMP query-stats    |    |
|  +--------|--------------+      |   * perf -p qemu_pid   |    |
|           |                     |   * tcpdump on br      |    |
|           v                     |   * guest agent rx     |    |
|  br-malware (host-only, NO NAT) |                        |    |
|           |                     +-----------|------------+    |
|           +--- isolated, no internet        |                 |
|                                             v                 |
|                                             data/episodes/    |
+----------------------------------------------------------|----+
                                                           | (later)
                                                           v
                                            WG overlay -> Pi5 (DB + ingest)
```

The malware bridge `br-malware` is **host-only**: no NAT, no route to the WG overlay, no DNS. The orchestrator also blocks egress with nftables on the host as a belt-and-suspenders measure.

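
The shared-clock convention from the summary can be illustrated with a small sketch. It assumes that every collector, and the phase labeler, writes JSONL rows carrying a timestamp read from Python's `time.monotonic_ns()` on the host. The `RowWriter` class, the `merge_by_time` helper, and the `t_mono_ns` and `source` field names are hypothetical and are not taken from this repo's actual schema; the phase names come from the episode state machine described later in this document.

```python
import io
import json
import time
from typing import Any, Callable, TextIO


class RowWriter:
    """Append JSONL rows stamped with a shared monotonic clock (hypothetical helper)."""

    def __init__(self, sink: TextIO, source: str,
                 clock: Callable[[], int] = time.monotonic_ns) -> None:
        self.sink = sink
        self.source = source
        self.clock = clock  # injectable so tests can use a fake clock

    def write(self, **fields: Any) -> dict:
        # Stamp at the moment of writing, then emit exactly one JSON object per line.
        row = {"t_mono_ns": self.clock(), "source": self.source, **fields}
        self.sink.write(json.dumps(row) + "\n")
        return row


def merge_by_time(*streams: str) -> list[dict]:
    """Interleave rows from several JSONL blobs by their shared timestamp."""
    rows = [json.loads(line) for s in streams for line in s.splitlines() if line]
    return sorted(rows, key=lambda r: r["t_mono_ns"])


if __name__ == "__main__":
    labels, bridge = io.StringIO(), io.StringIO()
    labeler = RowWriter(labels, "labels")
    net = RowWriter(bridge, "bridge")
    labeler.write(phase="clean")
    net.write(packets=0)
    labeler.write(phase="armed")
    for row in merge_by_time(labels.getvalue(), bridge.getvalue()):
        print(row)
```

Because every row carries a reading from the same clock, a phase label can be joined to telemetry by timestamp comparison alone. A monotonic clock is used in this sketch rather than wall-clock time because wall-clock time can be stepped backward or forward while an episode is running.
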
## Why KVM, not TCG and not Docker

| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill; ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel (unsafe for malware) | ruled out |

KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly multiplies dataset size for a fixed wall-clock budget. The "constrained single-threaded device" framing from the project goal is preserved by pinning to 1 vCPU and applying a host cgroup CPU cap.

## The episode state machine

```
snapshot_load(baseline)
   |
   v
[clean] ---- record T_baseline seconds of idle telemetry ----+
   |                                                         |
   v                                                         |
[armed] ---- exploit module fires; session opens ------------+
   |                                                         |
   v                                                         |
[infecting] ---- sample uploaded + executed -----------------+
   |                                                         |
   v                                                         |
[infected_running] ---- observe T_active seconds ------------+
   |                                                         |
   v                                                         |
[dormant] ---- (optional) wait for sample's idle window -----+
   |                                                         |
   v                                                         |
[reverting] ---- snapshot_load(baseline); episode ends ------+
                                                             |
                                                             v
                                               write meta.json + close jsonl
```

Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the moment the orchestrator takes the action*, not inferred from metrics afterward. This is what makes the dataset honestly labeled.

## Why the lab topology mirrors deployment

In the field, the ML model runs on some real, non-virtualized constrained Linux device. The specific form factor doesn't matter, only that it has its own kernel and isn't living under our hypervisor.

> **Not the Pi5.** The Pi5 in this project is the WireGuard-side *collector*
> for episode tarballs (see [`transport.md`](transport.md)). It is not the
> device the model deploys to.


If a deployment topology happens to include an upstream observer that sees the device's network traffic (router, gateway, hypervisor), that observer is a useful additional vantage point for the model; call it the **gateway observer**. In our lab, the host-only bridge plays exactly that role: bridge-side pcap features at training time map 1:1 to gateway-side NetFlow/pcap features at inference time *if* such a gateway exists in deployment. Whether one does is a deployment decision outside the scope of this dataset repo.

See [`threat-model.md`](threat-model.md) for the rest of the parity story (host-side QEMU features must NOT be used as model inputs; they are labeling oracles only).

## Out of scope for this repo

- Authoring novel malware or zero-day exploits.
- Detection-evasion research targeting other vendors' AV.
- Production deployment of the trained model; that lives in a separate repo.