End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load: validates the
collection + labeling shape before a real VM is involved.

Components:
  - tools/load_mimic.py         phase-driven load generator. Reads phase
                                commands on stdin; CPU/IO behavior matches
                                the named phase (clean=idle, armed=light burst,
                                infecting=disk burst+CPU, infected_running=
                                CPU saturation+stratum-shaped writes,
                                dormant=quieter than clean).
  - tools/run_envelope_demo.py  spawns load_mimic, drives EpisodeRunner with
                                a default 85s schedule that includes the
                                classic infected_running → dormant → re-entry
                                pattern.
  - tools/plot_envelope.py      reads telemetry + labels from an episode dir,
                                writes envelope.png with colored phase bands.

orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible: existing single-phase tests still green.

Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
116 lines
5.7 KiB
Markdown
# Architecture

## One-paragraph summary

A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
guest. Each episode boots from a clean snapshot, captures a baseline, fires a
known exploit, drops a public malware sample, observes the infection envelope,
and reverts the snapshot. Telemetry is captured from five vantage points
simultaneously, all stamped with the host monotonic clock so rows align. The
output of an episode is a self-contained directory of JSONL files plus a pcap.

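To make the shared-clock idea concrete, here is a minimal sketch of stamping rows from two collectors with one monotonic clock and merging them by timestamp. The field names (`t_mono_ns`, `source`), the helper names, and the sample values are assumptions for illustration only; the repo's actual JSONL schema may differ.

```python
import heapq
import json
import time


def stamp(source: str, payload: dict) -> dict:
    """Attach a host monotonic timestamp (hypothetical field name: t_mono_ns)."""
    return {"t_mono_ns": time.monotonic_ns(), "source": source, **payload}


def merge_by_clock(*streams):
    """Merge per-collector row lists, each already ordered by t_mono_ns."""
    return list(heapq.merge(*streams, key=lambda row: row["t_mono_ns"]))


# Made-up sample rows from two hypothetical collectors.
proc_rows = [stamp("host_proc", {"cpu_pct": 3.1})]
pcap_rows = [stamp("bridge_pcap", {"bytes": 1420})]
proc_rows.append(stamp("host_proc", {"cpu_pct": 97.4}))

for row in merge_by_clock(proc_rows, pcap_rows):
    print(json.dumps(row))
```

Because every row carries a timestamp from the same clock, the merge needs no per-collector offset correction; a wall-clock timestamp would not give that guarantee if the host clock were adjusted mid-episode.
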
## Lab topology

```
+---------------------------------------------------------------+
|  VM HOST (this machine, /home/maximus/.env/qemu)              |
|                                                               |
|  +-----------------------+      +------------------------+    |
|  |  KVM guest            |      |  orchestrator (host)   |    |
|  |  (Metasploitable2,    |      |   - snapshot loop      |    |
|  |   1 vCPU, capped)     |      |   - exploit driver     |    |
|  |                       |<====>|   - phase labeler      |    |
|  |  in-guest agent ------|virtio|                        |    |
|  |                       |serial|  collectors:           |    |
|  |  vNIC ----------------|      |   * host /proc/qemu_pid|    |
|  |        |              |      |   * QMP query-stats    |    |
|  +--------|--------------+      |   * perf -p qemu_pid   |    |
|           |                     |   * tcpdump on br      |    |
|           v                     |   * guest agent rx     |    |
|  br-malware (host-only, NO NAT) |                        |    |
|           |                     +-----------|------------+    |
|           +--- isolated, no internet        |                 |
|                                             v                 |
|                                             data/episodes/    |
+----------------------------------------------------------|----+
                                                           | (later)
                                                           v
                                            WG overlay -> Pi5 (DB + ingest)
```

The malware bridge `br-malware` is **host-only**: no NAT, no route to the WG
overlay, no DNS. The orchestrator also blocks egress with nftables on the host
as a belt-and-suspenders measure.

## Why KVM, not TCG and not Docker

| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill; ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel; unsafe for malware | ruled out |

KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
multiplies dataset size for a fixed wall-clock budget. The "constrained
single-threaded device" framing from the project goal is preserved by pinning to
1 vCPU and applying a host cgroup CPU cap.

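The document does not say how the cgroup cap is applied. As one possible reading, assuming cgroup v2, the cap is a `"<quota> <period>"` line (both in microseconds) written to the group's `cpu.max` file. The sketch below only builds that line; the function name and the 50% figure are hypothetical.

```python
def cpu_max_line(fraction_of_one_cpu: float, period_us: int = 100_000) -> str:
    """Build the '<quota> <period>' string that cgroup v2 expects in cpu.max."""
    if not 0 < fraction_of_one_cpu <= 1:
        raise ValueError("fraction must be in (0, 1] for a single-vCPU guest")
    quota_us = round(period_us * fraction_of_one_cpu)
    return f"{quota_us} {period_us}"


# Hypothetical cap: let the QEMU process use at most half of one host core.
print(cpu_max_line(0.5))  # 50000 100000
```

Writing that string into the `cpu.max` file of the cgroup holding the QEMU process would enforce the cap; the cgroup path depends on how the VM is launched and is left out here.
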
## The episode state machine

```
snapshot_load(baseline)
   |
   v
[clean] ---- record T_baseline seconds of idle telemetry ----+
   |                                                         |
   v                                                         |
[armed] ---- exploit module fires; session opens ------------+
   |                                                         |
   v                                                         |
[infecting] ---- sample uploaded + executed -----------------+
   |                                                         |
   v                                                         |
[infected_running] ---- observe T_active seconds ------------+
   |                                                         |
   v                                                         |
[dormant] ---- (optional) wait for sample's idle window -----+
   |                                                         |
   v                                                         |
[reverting] ---- snapshot_load(baseline); episode ends ------+
   |
   v
write meta.json + close jsonl
```

Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
moment the orchestrator takes the action*, not inferred from metrics afterward.
This is what makes the dataset honestly labeled.

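The labeling rule can be sketched as a loop that writes the label and performs the phase's action in the same step. This is an illustration under assumptions, not the orchestrator's actual code: `run_episode`, the `actions` mapping, and the `t_mono_ns` field are hypothetical names, and the label is written immediately before the action here, which the document does not specify.

```python
import io
import json
import time

PHASES = ["clean", "armed", "infecting", "infected_running", "dormant", "reverting"]


def run_episode(actions, label_sink, clock=time.monotonic_ns):
    """Walk PHASES in order; write each label, then run that phase's action."""
    for phase in PHASES:
        label = {"t_mono_ns": clock(), "phase": phase}
        label_sink.write(json.dumps(label) + "\n")
        actions.get(phase, lambda: None)()  # phases without an action are no-ops


performed = []
labels = io.StringIO()  # stands in for labels.jsonl
run_episode({"armed": lambda: performed.append("fire_exploit")}, labels)
print(labels.getvalue(), end="")
```

The point of the design is that the label timestamp comes from the code path that causes the transition, so no later heuristic over CPU or network metrics is needed to decide where a phase began.
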
## Why the lab topology mirrors deployment

In the field, the ML model runs on some real, non-virtualized constrained
Linux device. The specific form factor doesn't matter, only that it has its
own kernel and isn't living under our hypervisor.

> **Not the Pi5.** The Pi5 in this project is the WireGuard-side *collector*
> for episode tarballs (see [`transport.md`](transport.md)). It is not the
> device the model deploys to.

If a deployment topology happens to include an upstream observer that sees
the device's network traffic (router, gateway, hypervisor), that observer is
a useful additional vantage point for the model; call it the **gateway
observer**. In our lab, the host-only bridge plays exactly that role:
bridge-side pcap features at training time map 1:1 to gateway-side
NetFlow/pcap features at inference time *if* such a gateway exists in
deployment. Whether one does is a deployment decision outside the scope of
this dataset repo.

See [`threat-model.md`](threat-model.md) for the rest of the parity story
(host-side QEMU features must NOT be used as model inputs; they are labeling
oracles only).

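One way to enforce the parity rule when building a training set is to drop rows by vantage point before they reach the model. The sketch below assumes each row carries a `source` tag; the tag names and the split between oracle-only and deployable sources are hypothetical and should be checked against `threat-model.md`.

```python
# Hypothetical source tags for the host-side QEMU collectors (labeling oracles).
ORACLE_ONLY_SOURCES = {"host_proc", "qmp", "perf"}


def model_inputs(rows):
    """Keep only rows from vantage points that can also exist in deployment."""
    return [row for row in rows if row["source"] not in ORACLE_ONLY_SOURCES]


rows = [
    {"source": "qmp", "value": 1},
    {"source": "bridge_pcap", "value": 2},
    {"source": "guest_agent", "value": 3},
]
print([row["source"] for row in model_inputs(rows)])
```
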
## Out of scope for this repo

- Authoring novel malware or zero-day exploits.
- Detection-evasion research targeting other vendors' AV.
- Production deployment of the trained model; that lives in a separate repo.