End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.
Components:
- tools/load_mimic.py: phase-driven load generator. Reads phase commands on
  stdin; CPU/IO behavior matches the named phase (clean=idle, armed=light
  burst, infecting=disk burst+CPU, infected_running=CPU saturation+
  stratum-shaped writes, dormant=quieter than clean).
- tools/run_envelope_demo.py: spawns load_mimic, drives EpisodeRunner with a
  default 85s schedule that includes the classic
  infected_running → dormant → re-entry pattern.
- tools/plot_envelope.py: reads telemetry + labels from an episode dir, writes
  envelope.png with colored phase bands.
orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.
Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
102 lines
4.6 KiB
Markdown
# Threat Model & Train/Serve Parity

The single most important design rule in this project:

> **A feature used by the deployed model must exist on the deployed device.**

Violating this rule produces a model that scores 99% in the lab and is useless in
the field. This document spells out which features fall on which side of that
line, and why we still bother capturing both.

## The setting

The deployed model runs on a real, non-virtualized device, typically a
constrained Linux endpoint (server, IoT box, or edge node; the specific form
factor doesn't matter, only that it has its own kernel and isn't running
under our hypervisor). It tries to detect the moment that device gets
breached.

> **Not the Pi5.** In our project topology, the Pi5 is the WireGuard-side
> *collector* that receives episode tarballs from lab hosts. It is *not* the
> deployment target for the model. Don't conflate the two roles.

Two adversarial facts shape the design:

1. **Malware can lie to in-device tools.** A sufficiently privileged rootkit can
   hook `/proc`, intercept `perf_event_open`, and hide its own processes.
2. **There is no host-side QEMU view.** The deployed device is the actual
   machine. Nothing is watching it from outside *the OS itself*.

So the model has two trustworthy floors:

- **In-device features that survive most malware** (perf counters via the syscall
  interface, thermals, gross resource counters): easy to lie to in principle,
  but in practice most commodity malware doesn't bother.
- **Off-device features at the gateway** (network telemetry observed by an
  upstream router/gateway): physics-bound, because the malware cannot hide
  bytes from an upstream observer once they leave the NIC.

## Two roles: features vs. oracles

Every measurement we capture in the lab gets one of two roles:

| Role | What it's used for | Available in deployment? |
|---|---|---|
| **Feature** | Input to the trained model | **Must be yes** |
| **Oracle** | Ground-truth labeling during training only | No, but we have it in the lab |

The oracle channels (host `/proc/<qemu_pid>`, QMP `query-stats`,
`perf stat -p <qemu_pid>`) are how we know with certainty what the guest is
*actually* doing, not what it claims to be doing. We use that certainty to assign
correct labels in `labels.jsonl`. The deployed model never sees these channels as
input features; they only shape the labels it is trained against.

## Channel taxonomy

| # | Channel | Vantage | Role | Why |
|---|---|---|---|---|
| 1 | Host `/proc/<qemu_pid>` | outside guest | oracle | doesn't exist on real device |
| 2 | QEMU QMP `query-stats`, `query-blockstats` | outside guest | oracle | same |
| 3 | `perf stat -p <qemu_pid>` | outside guest | oracle | same |
| 4 | Bridge-side pcap (`tcpdump -i br-malware`) | gateway | **feature** | matches the upstream gateway observer in the field |
| 5 | In-guest `/proc/*`, `perf_event_open`, `/sys/class/thermal/*` | inside guest | **feature** | same exact source on real device |

Note: in-guest features (5) are the same syscall surfaces we'd read on a real
deployed device. The data we capture from them in the lab and the data we capture
from them at deployment are pulled from identical kernel APIs, which is what
makes parity hold.

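
The taxonomy above can be encoded so that parity is checked mechanically rather than by convention. The following is a minimal sketch, not the project's actual code: the short channel names, the `Channel` dataclass, and both helper functions are hypothetical and exist only to illustrate the rule that model inputs may come from feature channels only.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    FEATURE = "feature"  # observable on the deployed device or its gateway
    ORACLE = "oracle"    # lab-only vantage, used for labeling


@dataclass(frozen=True)
class Channel:
    number: int   # row number from the taxonomy table
    name: str     # hypothetical short name, chosen for this sketch
    vantage: str
    role: Role


# Mirrors the five rows of the channel taxonomy table.
CHANNELS = (
    Channel(1, "host_proc_qemu", "outside guest", Role.ORACLE),
    Channel(2, "qmp_stats", "outside guest", Role.ORACLE),
    Channel(3, "host_perf_qemu", "outside guest", Role.ORACLE),
    Channel(4, "bridge_pcap", "gateway", Role.FEATURE),
    Channel(5, "in_guest", "inside guest", Role.FEATURE),
)


def deployable_channels(channels=CHANNELS):
    """Return the channel numbers a deployed model is allowed to consume."""
    return [c.number for c in channels if c.role is Role.FEATURE]


def assert_parity(model_inputs, channels=CHANNELS):
    """Raise ValueError if any model input comes from an oracle-only channel."""
    allowed = set(deployable_channels(channels))
    leaked = sorted(set(model_inputs) - allowed)
    if leaked:
        raise ValueError(f"oracle channels used as features: {leaked}")
```

Calling `assert_parity` on the input list of any model intended for deployment would turn a train/serve parity violation into a hard failure at training time instead of a surprise in the field.
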
## The two-model evaluation strategy

We will train two classifiers from the same dataset:

1. **Realistic model**: trained only on features (channels 4 + 5).
   *This is what would be deployed.*
2. **Oracle model**: trained on everything (channels 1–5).
   *This is the upper bound on what was learnable from this dataset.*

The interesting numbers are:

- The realistic model's accuracy → **how well the deployable system actually works**.
- The gap between realistic and oracle → **how much detection power a privileged
  rootkit can take away from us by lying** (the cost of in-device measurement
  being adversarial).
- Realistic-without-pcap vs. realistic-with-pcap → **how much the gateway
  observer is worth** (the value of off-device telemetry).

Reporting all three is the paper-worthy result, not just one accuracy number.

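
The three numbers reduce to simple differences between held-out accuracies. As a sketch under stated assumptions (the function name and the dictionary keys are hypothetical, and the three accuracies are assumed to be measured on the same held-out split), the report could be assembled like this:

```python
def parity_report(acc_realistic, acc_oracle, acc_realistic_no_pcap):
    """Combine three held-out accuracies (each in [0, 1]) into the headline numbers.

    acc_realistic         -- model trained on channels 4 + 5
    acc_oracle            -- model trained on channels 1-5
    acc_realistic_no_pcap -- model trained on channel 5 only
    """
    for value in (acc_realistic, acc_oracle, acc_realistic_no_pcap):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    return {
        # How well the deployable system works.
        "deployable_accuracy": acc_realistic,
        # Detection power that exists only from the outside-guest vantage.
        "oracle_gap": acc_oracle - acc_realistic,
        # What the gateway observer adds on top of in-device features.
        "gateway_value": acc_realistic - acc_realistic_no_pcap,
    }
```

A negative `oracle_gap` or `gateway_value` would not be an error here; it would suggest the extra channels added noise on this dataset, which is itself worth reporting.
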
## Using oracle disagreement as a label

When the in-guest agent reports CPU at 8% and the host says the QEMU process is
burning 70%, that *is* the rootkit-active signal, and it's more reliable than
any single in-guest measurement. Episodes where this divergence appears get a
high-confidence `infected_running` label even if the in-guest data alone looks
ambiguous. This is the practical payoff of capturing both sides.

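
The divergence rule can be sketched as a small labeling helper. This is illustrative only: the 30-point default threshold, the function name, and the shape of the returned record are assumptions, not the project's `labels.jsonl` schema. Both readings are assumed to be percentages on the same scale (for example, normalized to the guest's total vCPU capacity).

```python
def label_from_divergence(guest_cpu_pct, host_cpu_pct, threshold_pct=30.0):
    """Return a high-confidence label when the host sees far more CPU use
    than the in-guest agent reports; otherwise return None.

    threshold_pct is a hypothetical tuning value, not taken from the project.
    """
    divergence = host_cpu_pct - guest_cpu_pct
    if divergence >= threshold_pct:
        return {
            "phase": "infected_running",
            "confidence": "high",
            "divergence_pct": divergence,
        }
    # Small or negative divergence: the two vantages agree closely enough
    # that this rule has nothing to say about the sample.
    return None
```

With the example from the text (guest reports 8%, host observes 70%), the divergence is 62 points, so the helper returns an `infected_running` record; a guest reading of 8% against a host reading of 10% returns `None`.
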
## What we are not claiming

- We are not claiming to detect kernel rootkits robustly from in-guest data alone.
  The oracle/feature gap will quantify the limit.
- We are not claiming the trained model is safe to deploy without the gateway
  observer in production. For the strongest threat model, gateway-side fusion
  is required.