Maximus Gorog 970698af83 Synthetic envelope demo: phase-driven load mimic + plotter

End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.

Components:
- tools/load_mimic.py        phase-driven load generator. Reads phase
                             commands on stdin; CPU/IO behavior matches
                             the named phase (clean=idle, armed=light burst,
                             infecting=disk burst+CPU, infected_running=
                             CPU saturation+stratum-shaped writes,
                             dormant=quieter than clean).
- tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with
                             a default 85s schedule that includes the
                             classic infected_running → dormant → re-entry
                             pattern.
- tools/plot_envelope.py     reads telemetry + labels from an episode dir,
                             writes envelope.png with colored phase bands.

orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.

Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 23:53:20 -06:00


Threat Model & Train/Serve Parity

The single most important design rule in this project:

A feature used by the deployed model must exist on the deployed device.

Violating this rule produces a model that scores 99% in the lab and is useless in the field. This document spells out which features fall on which side of that line, and why we still bother capturing both.

The setting

The deployed model runs on a real, non-virtualized device — typically a constrained Linux endpoint (server, IoT box, edge node — the specific form factor doesn't matter, only that it has its own kernel and isn't running under our hypervisor). It tries to detect the moment that device gets breached.

Not the Pi5. In our project topology, the Pi5 is the WireGuard-side collector that receives episode tarballs from lab hosts. It is not the deployment target for the model. Don't conflate the two roles.

Two adversarial facts shape the design:

Malware can lie to in-device tools. A sufficiently-privileged rootkit can hook /proc, intercept perf_event_open, and hide its own processes.
There is no host-side QEMU view. The deployed device is the actual machine. Nothing is watching it from outside the OS itself.

So the model has two trustworthy floors:

In-device features that survive most malware (perf counters via the syscall interface, thermals, gross resource counters): easy to falsify in principle, but in practice most commodity malware doesn't bother.
Off-device features at the gateway (network telemetry observed by an upstream router/gateway): physics-bound, because malware cannot hide bytes once they have left the NIC.

Two roles: features vs. oracles

Every measurement we capture in the lab gets one of two roles:

Role	What it's used for	Available in deployment?
Feature	Input to the trained model	Must be yes
Oracle	Ground-truth labeling during training only	No — but we have it in the lab

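The oracle channels (host /proc/<qemu_pid>, QMP query-stats, perf stat -p <qemu_pid>) are how we know with certainty what the guest is actually doing, as opposed to what it claims to be doing. We use that certainty to assign correct labels in labels.jsonl. The deployable (realistic) model never receives them as inputs, neither at training time nor in the field; only the lab-only oracle model described below is trained on them.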

Channel taxonomy

#	Channel	Vantage	Role	Why
1	Host `/proc/<qemu_pid>`	outside guest	oracle	doesn't exist on real device
2	QEMU QMP `query-stats`, `query-blockstats`	outside guest	oracle	same
3	`perf stat -p <qemu_pid>`	outside guest	oracle	same
4	Bridge-side pcap (`tcpdump -i br-malware`)	gateway	feature	matches the gateway observer in the field
5	In-guest `/proc/`, `perf_event_open`, `/sys/class/thermal/`	inside guest	feature	same exact source on real device

Note: in-guest features (5) are the same syscall surfaces we'd read on a real deployed device. The data we capture from them in the lab and the data we capture from them on that device at deployment are pulled from identical kernel APIs, and that is what makes parity hold.
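The taxonomy above can be encoded as data, so that the feature/oracle split is enforced in code rather than by convention. The following is a minimal sketch, not the project's actual implementation: the `Channel` class, the short channel names, and the helpers `channels_for` and `assert_parity` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    number: int
    name: str      # hypothetical short identifier, not from the repo
    vantage: str   # "outside guest", "gateway", or "inside guest"
    role: str      # "feature" or "oracle"


# Mirrors the five rows of the channel taxonomy table.
CHANNELS = [
    Channel(1, "host_proc_qemu", "outside guest", "oracle"),
    Channel(2, "qmp_stats", "outside guest", "oracle"),
    Channel(3, "perf_stat_qemu", "outside guest", "oracle"),
    Channel(4, "bridge_pcap", "gateway", "feature"),
    Channel(5, "in_guest", "inside guest", "feature"),
]


def channels_for(model: str) -> list:
    """Return the channel numbers a given model may train on."""
    if model == "realistic":
        return [c.number for c in CHANNELS if c.role == "feature"]
    if model == "oracle":
        return [c.number for c in CHANNELS]
    raise ValueError(f"unknown model kind: {model!r}")


def assert_parity(used_channels) -> None:
    """Raise if a deployable model would consume an oracle channel."""
    oracle_numbers = {c.number for c in CHANNELS if c.role == "oracle"}
    leaked = sorted(oracle_numbers & set(used_channels))
    if leaked:
        raise ValueError(f"oracle channels used as features: {leaked}")
```

Calling `assert_parity(channels_for("realistic"))` before training the deployable model would turn a train/serve parity violation into an immediate failure instead of a misleading lab score.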

The two-model evaluation strategy

We will train two classifiers from the same dataset:

Realistic model — trained only on features (channels 4 + 5). This is what would be deployed.
Oracle model — trained on everything (channels 1–5). This is the upper bound on what was learnable from this dataset.

The interesting numbers are:

The realistic model's accuracy → how well the deployable system actually works.
The gap between realistic and oracle → how much detection power a privileged rootkit can take away from us by lying (the cost of in-device measurement being adversarial).
Realistic-without-pcap vs. realistic-with-pcap → how much the gateway observer is worth (the value of off-device telemetry).

Reporting all three is the paper-worthy result — not just one accuracy number.
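The three numbers reduce to one accuracy and two differences. The sketch below shows that arithmetic under the assumption that each model has already been evaluated to a single accuracy on the same held-out episodes; the function name and dictionary keys are hypothetical, and no real results are implied.

```python
def evaluation_summary(acc_realistic: float,
                       acc_oracle: float,
                       acc_realistic_no_pcap: float) -> dict:
    """Combine three accuracies into the three reported quantities.

    acc_realistic          -- model trained on channels 4 + 5
    acc_oracle             -- model trained on channels 1-5
    acc_realistic_no_pcap  -- model trained on channel 5 only
    """
    return {
        # How well the deployable system works.
        "realistic_accuracy": acc_realistic,
        # Detection power lost by not having the outside-guest view.
        "oracle_gap": acc_oracle - acc_realistic,
        # What the gateway observer adds on top of in-device features.
        "gateway_value": acc_realistic - acc_realistic_no_pcap,
    }
```

Keeping the three quantities in one record makes it harder to report the realistic accuracy in isolation.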

Using oracle disagreement as a label

When the in-guest agent reports CPU at 8% and the host says the QEMU process is burning 70%, that is the rootkit-active signal — and it's more reliable than any single in-guest measurement. Episodes where this divergence appears get a high-confidence infected_running label even if the in-guest data alone looks ambiguous. This is the practical payoff of capturing both sides.
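A minimal sketch of that labeling rule follows. It assumes both CPU readings are already normalized to the same 0-100 scale; the 30-point threshold, the function names, and the label fields are hypothetical and do not reflect the actual labels.jsonl schema.

```python
def rootkit_divergence(guest_cpu_pct: float,
                       host_cpu_pct: float,
                       min_gap_pct: float = 30.0) -> bool:
    """True when the host sees much more CPU use than the guest reports.

    min_gap_pct is an assumed threshold; it would need tuning against
    the hypervisor's own overhead on clean episodes.
    """
    return (host_cpu_pct - guest_cpu_pct) >= min_gap_pct


def label_episode(guest_cpu_pct: float,
                  host_cpu_pct: float,
                  fallback_label: str,
                  min_gap_pct: float = 30.0) -> dict:
    """Upgrade to a high-confidence infected_running label on divergence."""
    if rootkit_divergence(guest_cpu_pct, host_cpu_pct, min_gap_pct):
        return {"phase": "infected_running",
                "confidence": "high",
                "source": "oracle_disagreement"}
    # No divergence: keep whatever label the episode already had.
    return {"phase": fallback_label,
            "confidence": "normal",
            "source": "fallback"}
```

With the figures from the paragraph above (guest 8%, host 70%), the gap is 62 points, so this sketch would assign `infected_running` regardless of the fallback label.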

What we are not claiming

We are not claiming to detect kernel rootkits robustly from in-guest data alone. The oracle/feature gap will quantify the limit.
We are not claiming the trained model is safe to deploy without the gateway observer in production — for the strongest threat model, gateway-side fusion is required.
