End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.
Components:
- tools/load_mimic.py: phase-driven load generator. Reads phase
  commands on stdin; CPU/IO behavior matches the named phase
  (clean=idle, armed=light burst, infecting=disk burst+CPU,
  infected_running=CPU saturation+stratum-shaped writes,
  dormant=quieter than clean).
- tools/run_envelope_demo.py: spawns load_mimic, drives EpisodeRunner
  with a default 85s schedule that includes the classic
  infected_running → dormant → re-entry pattern.
- tools/plot_envelope.py: reads telemetry + labels from an episode dir,
  writes envelope.png with colored phase bands.
orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.
Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Threat Model & Train/Serve Parity
The single most important design rule in this project:
A feature used by the deployed model must exist on the deployed device.
Violating this rule produces a model that scores 99% in the lab and is useless in the field. This document spells out which features fall on which side of that line, and why we still bother capturing both.
The setting
The deployed model runs on a real, non-virtualized device — typically a constrained Linux endpoint (server, IoT box, edge node — the specific form factor doesn't matter, only that it has its own kernel and isn't running under our hypervisor). It tries to detect the moment that device gets breached.
Not the Pi5. In our project topology, the Pi5 is the WireGuard-side collector that receives episode tarballs from lab hosts. It is not the deployment target for the model. Don't conflate the two roles.
Two adversarial facts shape the design:
- Malware can lie to in-device tools. A sufficiently privileged rootkit can hook /proc, intercept perf_event_open, and hide its own processes.
- There is no host-side QEMU view. The deployed device is the actual machine. Nothing is watching it from outside the OS itself.
So the model has two trustworthy floors:
- In-device features that survive most malware (perf counters via the syscall interface, thermals, gross resource counters). A rootkit could falsify these in principle, but in practice most commodity malware doesn't bother.
- Off-device features at the gateway (network telemetry observed by an upstream router/gateway). These are physics-bound: malware cannot hide bytes once they leave the NIC.
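As a concrete instance of a "gross resource counter", the sketch below derives a CPU busy fraction from two samples of the aggregate `cpu` line in /proc/stat. It is a minimal illustration, not the project's collector: the function names `parse_cpu_line` and `busy_fraction` are hypothetical, and it assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (total_jiffies, idle_jiffies) from the aggregate 'cpu' line.

    Hypothetical helper. Only the first eight counters are summed so that
    guest time, which the kernel already folds into user time, is not
    counted twice.
    """
    fields = line.split()
    if len(fields) < 5 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:9]]
    # idle is the 4th counter; iowait (5th) is treated as idle when present.
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return sum(values), idle


def busy_fraction(before: str, after: str) -> float:
    """CPU busy fraction in [0, 1] between two samples of the 'cpu' line."""
    total_a, idle_a = parse_cpu_line(before)
    total_b, idle_b = parse_cpu_line(after)
    delta_total = total_b - total_a
    if delta_total <= 0:
        return 0.0  # no time elapsed, or counters went backwards
    return 1.0 - (idle_b - idle_a) / delta_total
```

On a Linux device the two inputs would be the first line of /proc/stat read a short interval apart. The same code runs unchanged in the lab guest and on the deployed device, which is the property the rule above depends on.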
Two roles: features vs. oracles
Every measurement we capture in the lab gets one of two roles:
| Role | What it's used for | Available in deployment? |
|---|---|---|
| Feature | Input to the trained model | Must be yes |
| Oracle | Ground-truth labeling during training only | No — but we have it in the lab |
The oracle channels (host /proc/<qemu_pid>, QMP query-stats,
perf stat -p <qemu_pid>) are how we know with certainty what the guest is
actually doing, as opposed to what it claims to be doing. We use that
certainty to assign correct labels in labels.jsonl. The deployable model
never sees them as input features, even at training time.
Channel taxonomy
| # | Channel | Vantage | Role | Why |
|---|---|---|---|---|
| 1 | Host /proc/<qemu_pid> | outside guest | oracle | doesn't exist on real device |
| 2 | QEMU QMP query-stats, query-blockstats | outside guest | oracle | same |
| 3 | perf stat -p <qemu_pid> | outside guest | oracle | same |
| 4 | Bridge-side pcap (tcpdump -i br-malware) | gateway | feature | matches the gateway observer in the field |
| 5 | In-guest /proc/*, perf_event_open, /sys/class/thermal/* | inside guest | feature | same exact source on real device |
Note: in-guest features (5) come from the same kernel interfaces we'd read on a real device. The data we capture from them in the lab and the data we capture from them at deployment are pulled from identical kernel APIs, and that is what makes parity hold.
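One way to keep the feature/oracle split enforceable in code is a small channel registry that the training pipeline consults before building a feature matrix. The sketch below is an assumption about how that could look, not the project's actual layout: the `Channel` dataclass, the short channel names, and `deployable_columns` are all hypothetical. Only the channel numbers, vantages, and roles come from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Channel:
    number: int
    name: str      # hypothetical short name, for illustration only
    vantage: str
    role: str      # "feature" or "oracle"


# Mirrors the channel taxonomy table.
CHANNELS = [
    Channel(1, "host_proc_qemu", "outside guest", "oracle"),
    Channel(2, "qmp_stats", "outside guest", "oracle"),
    Channel(3, "host_perf_qemu", "outside guest", "oracle"),
    Channel(4, "bridge_pcap", "gateway", "feature"),
    Channel(5, "in_guest", "inside guest", "feature"),
]


def deployable_columns(columns_by_channel: Dict[int, List[str]]) -> List[str]:
    """Keep only columns from channels whose role is 'feature'.

    Oracle columns are dropped, so they cannot leak into the inputs of
    the model that is meant to be deployed.
    """
    feature_ids = {c.number for c in CHANNELS if c.role == "feature"}
    return [
        column
        for number in sorted(columns_by_channel)
        if number in feature_ids
        for column in columns_by_channel[number]
    ]
```

For example, `deployable_columns({1: ["host_cpu"], 4: ["pcap_bytes"], 5: ["guest_cpu"]})` keeps only the pcap and in-guest columns.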
The two-model evaluation strategy
We will train two classifiers from the same dataset:
- Realistic model — trained only on features (channels 4 + 5). This is what would be deployed.
- Oracle model — trained on everything (channels 1–5). This is the upper bound on what was learnable from this dataset.
The interesting numbers are:
- The realistic model's accuracy → how well the deployable system actually works.
- The gap between realistic and oracle → how much detection power a privileged rootkit can take away from us by lying (the cost of in-device measurement being adversarial).
- Realistic-without-pcap vs. realistic-with-pcap → how much the gateway observer is worth (the value of off-device telemetry).
Reporting all three is the paper-worthy result — not just one accuracy number.
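The three numbers can be computed from per-model predictions on a shared held-out set. The following is a minimal sketch under stated assumptions: the function names and the dictionary keys `realistic`, `oracle`, and `realistic_no_pcap` are hypothetical, and plain accuracy stands in for whatever metric the evaluation finally uses.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def parity_report(
    y_true: Sequence[str], preds: Dict[str, Sequence[str]]
) -> Dict[str, float]:
    """Summarize the three quantities described above.

    `preds` is assumed to hold predictions from three models, keyed as
    'realistic' (channels 4 + 5), 'oracle' (channels 1-5), and
    'realistic_no_pcap' (channel 5 only).
    """
    acc = {name: accuracy(y_true, p) for name, p in preds.items()}
    return {
        "realistic_accuracy": acc["realistic"],
        # Detection power lost to adversarial in-device measurement.
        "oracle_gap": acc["oracle"] - acc["realistic"],
        # What the gateway observer adds on top of in-device features.
        "gateway_value": acc["realistic"] - acc["realistic_no_pcap"],
    }
```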
Using oracle disagreement as a label
When the in-guest agent reports CPU at 8% and the host says the QEMU process is
burning 70%, that is the rootkit-active signal — and it's more reliable than
any single in-guest measurement. Episodes where this divergence appears get a
high-confidence infected_running label even if the in-guest data alone looks
ambiguous. This is the practical payoff of capturing both sides.
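A divergence check of this kind could be sketched as follows. The function name, the 40-point gap threshold, and the three-sample persistence window are illustrative assumptions rather than values from this project. The sketch also assumes both series are already time-aligned and normalized to the same 0-100 scale.

```python
from typing import Optional, Sequence, Tuple


def divergence_label(
    guest_cpu_pct: Sequence[float],
    host_cpu_pct: Sequence[float],
    min_gap_pct: float = 40.0,   # assumed threshold, not a tuned value
    min_consecutive: int = 3,    # assumed persistence window, in samples
) -> Optional[Tuple[str, str]]:
    """Return ('infected_running', 'high') on sustained divergence.

    Divergence means the host-side view of the QEMU process reports far
    more CPU than the in-guest agent does. Returns None when there is no
    sustained gap, leaving the label to other evidence.
    """
    run = 0
    for guest, host in zip(guest_cpu_pct, host_cpu_pct):
        run = run + 1 if (host - guest) >= min_gap_pct else 0
        if run >= min_consecutive:
            return ("infected_running", "high")
    return None
```

Requiring several consecutive samples is a design choice to keep a single scheduling hiccup or sampling skew from producing a high-confidence label.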
What we are not claiming
- We are not claiming to detect kernel rootkits robustly from in-guest data alone. The oracle/feature gap will quantify the limit.
- We are not claiming the trained model is safe to deploy without the gateway observer in production — for the strongest threat model, gateway-side fusion is required.