CIS490/docs/architecture.md
Maximus Gorog 970698af83 Synthetic envelope demo: phase-driven load mimic + plotter
End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.

Components:
- tools/load_mimic.py        phase-driven load generator. Reads phase
                             commands on stdin; CPU/IO behavior matches
                             the named phase (clean=idle, armed=light burst,
                             infecting=disk burst+CPU, infected_running=
                             CPU saturation+stratum-shaped writes,
                             dormant=quieter than clean).
- tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with
                             a default 85s schedule that includes the
                             classic infected_running → dormant → re-entry
                             pattern.
- tools/plot_envelope.py     reads telemetry + labels from an episode dir,
                             writes envelope.png with colored phase bands.

orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.

Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:53:20 -06:00

# Architecture
## One-paragraph summary
A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
guest. Each episode boots from a clean snapshot, captures a baseline, fires a
known exploit, drops a public malware sample, observes the infection envelope,
and reverts the snapshot. Telemetry is captured from five vantage points
simultaneously, all stamped with the host monotonic clock so rows align. The
output of an episode is a self-contained directory of JSONL files plus a pcap.
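
As a minimal sketch of how rows from several collectors could share one timeline, the snippet below stamps each row with Python's `time.monotonic_ns()` and interleaves already-ordered streams by that stamp. The field names `t_mono_ns` and `source`, and all three helper functions, are hypothetical and are not taken from the repository.

```python
import heapq
import json
import time
from typing import Iterable, Iterator


def stamp_row(source: str, payload: dict) -> dict:
    """Attach the host monotonic clock (nanoseconds) to one telemetry row."""
    # "t_mono_ns" and "source" are hypothetical field names for illustration.
    return {"t_mono_ns": time.monotonic_ns(), "source": source, **payload}


def merge_streams(*streams: Iterable[dict]) -> Iterator[dict]:
    """Interleave per-collector streams (each already time-ordered) into one timeline."""
    return heapq.merge(*streams, key=lambda row: row["t_mono_ns"])


def to_jsonl(rows: Iterable[dict]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


if __name__ == "__main__":
    proc_rows = [{"t_mono_ns": 10, "source": "proc"}, {"t_mono_ns": 30, "source": "proc"}]
    pcap_rows = [{"t_mono_ns": 20, "source": "pcap"}]
    print(to_jsonl(merge_streams(proc_rows, pcap_rows)), end="")
```

Because every stamp comes from the same host clock, merging is a plain ordered interleave and needs no per-source offset correction. Rows that originate inside the guest would presumably be stamped when the host receives them.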
## Lab topology
```
+---------------------------------------------------------------+
|  VM HOST (this machine, /home/maximus/.env/qemu)              |
|                                                               |
|   +-----------------------+      +------------------------+   |
|   |  KVM guest            |      |  orchestrator (host)   |   |
|   |  (Metasploitable2,    |      |   - snapshot loop      |   |
|   |   1 vCPU, capped)     |      |   - exploit driver     |   |
|   |                       |<====>|   - phase labeler      |   |
|   |  in-guest agent  -----|virtio|                        |   |
|   |                       |serial|  collectors:           |   |
|   |  vNIC ----------------|      |   * host /proc/qemu_pid|   |
|   |        |              |      |   * QMP query-stats    |   |
|   +--------|--------------+      |   * perf -p qemu_pid   |   |
|            |                     |   * tcpdump on br      |   |
|            v                     |   * guest agent rx     |   |
|   br-malware (host-only, NO NAT) |                        |   |
|            |                     +-----------|------------+   |
|            +--- isolated, no internet        |                |
|                                              v                |
|                                             data/episodes/    |
+----------------------------------------------------------|----+
                                                           | (later)
                                                           v
                                            WG overlay -> Pi5 (DB + ingest)
```
The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG
overlay, no DNS. The orchestrator also blocks egress with nftables on the host
as a belt-and-suspenders measure.
## Why KVM, not TCG and not Docker
| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill — ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel (unsafe for malware) | ruled out |

KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
multiplies dataset size for a fixed wall-clock budget. The "constrained
single-threaded device" framing from the project goal is preserved by pinning to
1 vCPU and applying a host cgroup CPU cap.
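
A minimal sketch of the two knobs named above, assuming the host uses cgroup v2, where the `cpu.max` file takes a `"<quota_us> <period_us>"` pair. The function names and the 50% example cap are illustrative; the actual cap this project uses is not stated here.

```python
def cpu_max_value(cap_fraction: float, period_us: int = 100_000) -> str:
    """Render a cgroup v2 cpu.max line for a fractional cap on one CPU."""
    if not 0 < cap_fraction <= 1:
        raise ValueError("cap_fraction must be in (0, 1] for a single-vCPU guest")
    # The kernel enforces a minimum quota of 1 ms, so clamp very small caps.
    quota_us = max(1000, round(cap_fraction * period_us))
    return f"{quota_us} {period_us}"


def qemu_cpu_args() -> list[str]:
    """QEMU flags for hardware acceleration with a single virtual CPU."""
    return ["-enable-kvm", "-smp", "1"]


if __name__ == "__main__":
    # Hypothetical 50% cap: the guest may use 50 ms of CPU per 100 ms period.
    print(cpu_max_value(0.5))
    print(" ".join(qemu_cpu_args()))
```

The resulting string would be written to the `cpu.max` file of whichever cgroup contains the QEMU process; that cgroup's path depends on host configuration and is left out.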
## The episode state machine
```
snapshot_load(baseline)
   |
   v
[clean] ---- record T_baseline seconds of idle telemetry ----+
   |                                                         |
   v                                                         |
[armed] ---- exploit module fires; session opens ------------+
   |                                                         |
   v                                                         |
[infecting] ---- sample uploaded + executed -----------------+
   |                                                         |
   v                                                         |
[infected_running] ---- observe T_active seconds ------------+
   |                                                         |
   v                                                         |
[dormant] ---- (optional) wait for sample's idle window -----+
   |                                                         |
   v                                                         |
[reverting] ---- snapshot_load(baseline); episode ends ------+
                                                             |
                                                             v
                                        write meta.json + close jsonl
```
Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
moment the orchestrator takes the action*, not inferred from metrics afterward.
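Labels are therefore ground truth about what the orchestrator did and when, not
a reconstruction from the same telemetry a model will later be trained on.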
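
The labeling rule can be sketched as a loop that writes one label row per transition, immediately before triggering that phase's action. This is an illustration under assumptions, not the repository's orchestrator: `walk_schedule`, its parameters, and the `t_mono_ns`/`phase` field names are hypothetical. Only the phase names come from the state machine above.

```python
import io
import json
import time
from typing import Callable, Iterable, Optional, TextIO

# Phase names as drawn in the state machine.
PHASES = ("clean", "armed", "infecting", "infected_running", "dormant", "reverting")


def walk_schedule(
    schedule: Iterable[tuple[str, float]],
    labels: TextIO,
    act: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], int] = time.monotonic_ns,
) -> None:
    """Emit one label per phase transition, then act, then hold the phase."""
    for phase, hold_s in schedule:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        # The label is written at the point of action, never derived from metrics.
        labels.write(json.dumps({"t_mono_ns": clock(), "phase": phase}) + "\n")
        labels.flush()
        if act is not None:
            act(phase)  # hypothetical hook, e.g. fire the exploit or revert
        sleep(hold_s)


if __name__ == "__main__":
    buf = io.StringIO()
    walk_schedule([("clean", 0.0), ("armed", 0.0)], buf, sleep=lambda _s: None)
    print(buf.getvalue(), end="")
```

Injecting `sleep` and `clock` keeps the loop testable without waiting out real phase durations.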
## Why the lab topology mirrors deployment
In the field, the ML model runs on some real, non-virtualized constrained
Linux device — the specific form factor doesn't matter, only that it has its
own kernel and isn't living under our hypervisor.
> **Not the Pi5.** The Pi5 in this project is the WireGuard-side *collector*
> for episode tarballs (see [`transport.md`](transport.md)). It is not the
> device the model deploys to.

If a deployment topology happens to include an upstream observer that sees
the device's network traffic (router, gateway, hypervisor), that observer is
a useful additional vantage point for the model — call it the **gateway
observer**. In our lab, the host-only bridge plays exactly that role:
bridge-side pcap features at training time map 1:1 to gateway-side
NetFlow/pcap features at inference time *if* such a gateway exists in
deployment. Whether one does is a deployment decision outside the scope of
this dataset repo.

See [`threat-model.md`](threat-model.md) for the rest of the parity story
(host-side QEMU features must NOT be used as model inputs — they are labeling
oracles only).
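
One way to enforce that rule in a training pipeline is to partition rows by collector before any feature extraction. The sketch below assumes each row carries a `source` tag. The tag names, and the assignment of collectors to each set, are guesses based on the topology diagram; `threat-model.md` remains the authority on which features are allowed.

```python
from typing import Iterable

# Hypothetical source tags. Host-side views of the QEMU process: oracle only.
ORACLE_ONLY = frozenset({"host_proc", "qmp", "perf"})
# Vantage points assumed to have a deployment-time counterpart.
MODEL_INPUT = frozenset({"guest_agent", "bridge_pcap"})


def partition(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split telemetry rows into (model inputs, labeling oracles)."""
    inputs: list[dict] = []
    oracles: list[dict] = []
    for row in rows:
        source = row["source"]
        if source in ORACLE_ONLY:
            oracles.append(row)
        elif source in MODEL_INPUT:
            inputs.append(row)
        else:
            # Fail closed: an unclassified source must not leak into training.
            raise ValueError(f"unclassified telemetry source: {source!r}")
    return inputs, oracles
```

Raising on an unknown tag means a newly added collector has to be classified explicitly before its rows can reach the model.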
## Out of scope for this repo
- Authoring novel malware or zero-day exploits.
- Detection-evasion research targeting other vendors' AV.
- Production deployment of the trained model — that lives in a separate repo.