# Architecture

## One-paragraph summary

A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
guest. Each episode boots from a clean snapshot, captures a baseline, fires a
known exploit, drops a public malware sample, observes the infection envelope,
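and reverts the snapshot. Telemetry is captured from five vantage points
simultaneously (host `/proc`, QMP `query-stats`, `perf`, bridge `tcpdump`, and
the in-guest agent), and every row is stamped with the host monotonic clock so
that rows from different collectors align. The output of an episode is a
self-contained directory of JSONL files plus a pcap.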
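
As a rough illustration of why the shared clock matters, the sketch below attaches a phase to each telemetry row by timestamp lookup. The field names `t_mono_ns` and `phase` are assumptions made for this example, not a schema defined by this repo.

```python
import bisect
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_rows(labels, rows):
    """Return copies of `rows` with a `phase` field taken from `labels`.

    Both inputs are lists of dicts carrying a `t_mono_ns` timestamp
    (hypothetical field name). A row gets the phase of the most recent
    label at or before its timestamp, or None if it predates every label.
    """
    labels = sorted(labels, key=lambda r: r["t_mono_ns"])
    times = [r["t_mono_ns"] for r in labels]
    out = []
    for row in rows:
        # Index of the last label whose timestamp is <= the row timestamp.
        i = bisect.bisect_right(times, row["t_mono_ns"]) - 1
        phase = labels[i]["phase"] if i >= 0 else None
        out.append({**row, "phase": phase})
    return out
```

Because every collector stamps rows with the same host clock, the same lookup works for any of the JSONL streams without per-collector offset correction.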

## Lab topology

```
+---------------------------------------------------------------+
|  VM HOST  (this machine, /home/maximus/.env/qemu)             |
|                                                               |
|   +-----------------------+      +------------------------+   |
|   |  KVM guest            |      |  orchestrator (host)   |   |
|   |  (Metasploitable2,    |      |  - snapshot loop       |   |
|   |   1 vCPU, capped)     |      |  - exploit driver      |   |
|   |                       |<====>|  - phase labeler       |   |
|   |  in-guest agent  -----|virtio|                        |   |
|   |                       |serial|  collectors:           |   |
|   |  vNIC ----------------|      |   * host /proc/qemu_pid|   |
|   |        |              |      |   * QMP query-stats    |   |
|   +--------|--------------+      |   * perf -p qemu_pid   |   |
|            |                     |   * tcpdump on br      |   |
|            v                     |   * guest agent rx     |   |
|   br-malware (host-only, NO NAT) |                        |   |
|            |                     +-----------|------------+   |
|            +--- isolated, no internet                    |    |
|                                                          v    |
|                                                  data/episodes/
+----------------------------------------------------------|----+
                                                           | (later)
                                                           v
                                              WG overlay -> Pi5 (DB + ingest)
```
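
The host `/proc/qemu_pid` collector in the diagram could look something like the sketch below, which reads CPU tick counters for the QEMU process from `/proc/<pid>/stat`. The function names and the returned keys are illustrative assumptions; the field positions follow the documented `proc(5)` layout.

```python
import os


def parse_proc_stat(stat_line):
    """Extract state and CPU tick counters from one /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the LAST ')' before tokenizing.
    """
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return {
        "state": rest[0],
        "utime_ticks": int(rest[11]),
        "stime_ticks": int(rest[12]),
    }


def ticks_to_seconds(ticks):
    """Convert clock ticks to seconds using the kernel's tick rate."""
    return ticks / os.sysconf("SC_CLK_TCK")


def sample_qemu_cpu(qemu_pid):
    """Read one CPU sample for the QEMU process (Linux only)."""
    with open(f"/proc/{qemu_pid}/stat") as f:
        return parse_proc_stat(f.read())
```

Sampling this at a fixed interval and differencing consecutive tick counts gives per-interval CPU time for the whole guest as seen from the host.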

The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG
overlay, no DNS. The orchestrator also blocks egress with nftables on the host
as a belt-and-suspenders measure.
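
A minimal sketch of what that nftables egress block might look like, rendered from Python so the orchestrator can template the bridge name. The table name `malware_lab` and the exact rules are assumptions for illustration, not the project's actual ruleset; review any such rules against your own host before loading them.

```python
def render_egress_block(bridge="br-malware", table="malware_lab"):
    """Render an nftables ruleset that drops traffic forwarded via `bridge`.

    Only the forward hook is filtered, so packets the host would route
    between the bridge and any other interface are dropped, while traffic
    between the host itself and the guest (input/output hooks) is untouched.
    """
    return (
        f"table inet {table} {{\n"
        f"    chain forward {{\n"
        f"        type filter hook forward priority 0; policy accept;\n"
        f'        iifname "{bridge}" drop\n'
        f'        oifname "{bridge}" drop\n'
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Print the ruleset so it can be inspected before being applied.
    print(render_egress_block(), end="")
```

Leaving the input and output hooks alone is deliberate in this sketch: the exploit driver on the host still needs to reach the guest over the bridge.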

## Why KVM, not TCG and not Docker

| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill — ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel — unsafe for malware | ruled out |

KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
multiplies dataset size for a fixed wall-clock budget. The "constrained
single-threaded device" framing from the project goal is preserved by pinning to
1 vCPU and applying a host cgroup CPU cap.
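
The two constraints can be sketched as follows, assuming cgroup v2, where a CPU cap is written to `cpu.max` as `<quota_us> <period_us>`. The 25% figure in the usage note is an arbitrary example value; this document does not specify the actual cap.

```python
def cpu_max_line(fraction, period_us=100_000):
    """Format a cgroup v2 `cpu.max` value capping CPU at `fraction` of one core.

    For example, a fraction of 0.5 with the default period allows 50 ms of
    CPU time in every 100 ms window.
    """
    if fraction <= 0:
        raise ValueError("fraction must be positive")
    quota_us = round(fraction * period_us)
    return f"{quota_us} {period_us}"


def qemu_cpu_args():
    """QEMU arguments for a KVM-accelerated guest with a single vCPU."""
    return ["-accel", "kvm", "-smp", "1"]
```

For instance, `cpu_max_line(0.25)` produces the string to write into the `cpu.max` file of the cgroup that holds the QEMU process, and `qemu_cpu_args()` is spliced into the rest of the QEMU command line.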

## The episode state machine

```
  snapshot_load(baseline)
        |
        v
  [clean]  ---- record T_baseline seconds of idle telemetry ----+
        |                                                       |
        v                                                       |
  [armed]  ---- exploit module fires; session opens ------------+
        |                                                       |
        v                                                       |
  [infecting]  ---- sample uploaded + executed -----------------+
        |                                                       |
        v                                                       |
  [infected_running]  ---- observe T_active seconds ------------+
        |                                                       |
        v                                                       |
  [dormant]  ---- (optional) wait for sample's idle window -----+
        |                                                       |
        v                                                       |
  [reverting]  ---- snapshot_load(baseline); episode ends ------+
                                                                |
                                                                v
                                                     write meta.json + close jsonl
```

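Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
moment the orchestrator takes the action*, not inferred from metrics afterward.
Each label therefore records something the orchestrator actually did, which is
what makes the dataset honestly labeled: the labels never depend on the
telemetry they are used to annotate.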
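
One way to sketch such a labeler is shown below. The class name, the `t_mono_ns` and `phase` field names, and the rule that only `dormant` may be skipped are assumptions drawn from the diagram above, not the orchestrator's actual code.

```python
import json
import time

# Phase order taken from the state machine diagram.
PHASES = ["clean", "armed", "infecting", "infected_running", "dormant", "reverting"]


class PhaseLabeler:
    """Writes one JSONL row per phase transition, stamped at call time."""

    def __init__(self, sink, clock=time.monotonic_ns):
        self._sink = sink    # any object with write() and flush()
        self._clock = clock  # injectable so tests can use a fake clock
        self._index = -1     # no phase entered yet

    def enter(self, phase):
        """Record entry into `phase`; call this as the action is taken."""
        nxt = PHASES.index(phase)  # raises ValueError for unknown phases
        skips_dormant = (
            self._index >= 0
            and PHASES[self._index] == "infected_running"
            and phase == "reverting"
        )
        if nxt - self._index != 1 and not skips_dormant:
            raise ValueError(f"illegal transition to {phase!r}")
        row = {"t_mono_ns": self._clock(), "phase": phase}
        self._sink.write(json.dumps(row) + "\n")
        self._sink.flush()  # keep the label on disk even if the episode crashes
        self._index = nxt
        return row
```

In use, the orchestrator would open `labels.jsonl` once per episode and call `enter("armed")` on the line immediately before firing the exploit module, so the timestamp reflects the action rather than its observed effect.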

## Why the lab topology mirrors deployment

In the field, the ML model runs on some real, non-virtualized constrained
Linux device — the specific form factor doesn't matter, only that it has its
own kernel and isn't living under our hypervisor.

> **Not the Pi5.** The Pi5 in this project is the WireGuard-side *collector*
> for episode tarballs (see [`transport.md`](transport.md)). It is not the
> device the model deploys to.

If a deployment topology happens to include an upstream observer that sees
the device's network traffic (router, gateway, hypervisor), that observer is
a useful additional vantage point for the model — call it the **gateway
observer**. In our lab, the host-only bridge plays exactly that role:
bridge-side pcap features at training time map 1:1 to gateway-side
NetFlow/pcap features at inference time *if* such a gateway exists in
deployment. Whether one does is a deployment decision outside the scope of
this dataset repo.

See [`threat-model.md`](threat-model.md) for the rest of the parity story. In
short, host-side QEMU features must NOT be used as model inputs; they are
labeling oracles only.

## Out of scope for this repo

- Authoring novel malware or zero-day exploits.
- Detection-evasion research targeting other vendors' AV.
- Production deployment of the trained model — that lives in a separate repo.