From 233390a40e7c2f503ba507646cd88b79f1eac952 Mon Sep 17 00:00:00 2001
From: Max Gorog

+ Each lab host on the WireGuard mesh boots a real Alpine VM, runs
+ a profile-driven workload inside it, and samples
+ The counter on the left is the running total, sourced from the
+ receiver's
The task is multi-class classification:
the target is one of five mutually-exclusive phase labels.
Not regression (no continuous target), not ranking
@@ -894,15 +923,16 @@
Literature on behaviour-based malware detection is rich but
uneven. Most published results either (a) use richer
telemetry than what a constrained host actually exports, or
- (b) frame evaluation in ways that hide the cross-host
- generalization problem. The card on the left summarises the
+ (b) frame evaluation in ways that hide same-sample overfit
+ (training and testing on the same malware instances). The card on the left summarises the
gap.
This project asks three concrete questions:
RQ1. How well can a per-window classifier
identify workload phases from
- RQ2. Does the model still work when test
- episodes come from a host the training set never saw?
+ RQ2. Does the model still work on
+ RQ3. Of the standard sequence-model
families (RNN, GRU, LSTM, CNN, Transformer) plus a
non-parametric baseline (KNN) and a tabular baseline
@@ -927,7 +957,7 @@
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
GRU, LSTM, and Transformer all reuse the same standardization,
schema-hashed checkpoint format, class-weighted CE loss,
- and held-out-by-host evaluation — so the comparison is
+ and held-out-by-sample evaluation — so the comparison is
genuinely apples-to-apples.
The detector's per-window verdict feeds two downstream
loops: a fleet-wide trust score that
@@ -953,19 +983,6 @@
- Each lab host on the WireGuard mesh boots a real Alpine VM, runs
- a profile-driven workload inside it, and samples
- The counter on the left is the running total, sourced from the
- receiver's
- Held-out by host. Train and validate on
- one machine; test on a different machine. A model that
- wins by memorising the train host's idle profile loses
- here, which is what you want — a fleet detector has to
- generalize across hosts it never saw at training time.
Held-out by
Macro-F1, not accuracy. The dataset is
heavily skewed: roughly half the labelled time is
- RNN, GRU, LSTM — recurrent models that read the
- window one timestep at a time and carry state forward. Cheap,
- mature, easy to interpret.
- BERT-style transformer — the window becomes a
- sequence of "tokens"; attention captures cross-position context
- instead of accumulating it through a hidden state. More
- parameters, more compute, more room to overfit a small dataset.
- Same input, same labels, four different inductive biases. The
- comparison on the left is the punchline of the whole project.
+ RNN, GRU, LSTM — recurrent models that read the
+ window one timestep at a time and carry state forward. Cheap,
+ mature, easy to interpret.
+ BERT-style transformer — the window becomes a
+ sequence of "tokens"; attention captures cross-position context
+ instead of accumulating it through a hidden state. More
+ parameters, more compute, more room to overfit a small dataset.
+ Same input, same labels, four different inductive biases. The
+ comparison on the left is the punchline of the whole project.
- Cross-host as the eval axis.
- Held-out-by-host is reported as a first-class number
- alongside held-out-by-sample — the two often disagree by
- ~0.4 macro-F1, and only the cross-host number predicts
- real fleet behaviour.
+ Held-out-by-sample, profile-stratified.
+ Hosts in the fleet are uniform — same orchestrator, same
+ workload, just different production rates — so we split by
+ malware
sample_name, profile-stratified
sample_name: the specific instances in
+ the test set never appear during training.
+ Generalization axis is "unseen malware", not
+ "unseen device". Two profiles with only one sample
+ (cpu-saturate, low-and-slow) are excluded from
+ held-out-by-sample eval and reported separately.
sample_names never appear during
+ training.
Collecting the dataset
+ /proc/<qemu_pid> at 10 Hz. Every ~30 seconds
+ the labeled tarball is shipped to this Pi over mTLS.
index.jsonl on disk. The sparkline is the
+ arrival rate over the last sixty seconds — proof that the deck
+ is reading live data, not a fixed slide.
Why detect at all?
@@ -877,8 +906,8 @@
trained on twelve channels of /proc telemetry
classify five workload phases (clean / armed / infecting /
infected_running / dormant) accurately enough to drive
- automated containment, and generalize across hosts
- and malware profiles it has never seen during training?
+ automated containment, and generalize to malware
+ sample_names it has never seen during training?
/proc alone, with
no syscall traces and no kernel hooks?
sample_names the training set never saw —
+ i.e., new instances of malware profiles it does know?
Collecting the dataset
- /proc/<qemu_pid> at 10 Hz. Every ~30 seconds
- the labeled tarball is shipped to this Pi over mTLS.
index.jsonl on disk. The sparkline is the
- arrival rate over the last sixty seconds.
A multi-host fleet
@@ -1044,11 +1061,15 @@
split recipe, the primary metric, and what we measure next
to accuracy. The temptation is to report a single big
number; we report a number you can argue with.
- sample_name,
+ profile-stratified. The fleet is uniform — every
+ host runs the same orchestrator and the same set of
+ profiles — so we don't split by device. Both hosts
+ contribute data to train, val, and test. What's held out is
+ specific malware instances: the
+ sample_names in the test set never appear
+ during training. The model has to generalize to unseen
+ samples, not unseen devices.
infected_running and only ~5 % is
@@ -1062,21 +1083,6 @@
Sequence models
- How we trained them
@@ -1093,6 +1099,21 @@
Sequence models
+ Nearest-neighbor as a sanity check
@@ -1157,11 +1178,13 @@
trained on. Loading a model against a different schema
fails fast. Without this, retroactive comparison silently
scores models on misaligned columns and reports nonsense.
- sample_name instead of by device. The
+ generalization claim is "unseen malware sample", tested on
+ the same population of hosts that contributed the training
+ data.
/proc provides.
- KNN cross-host gap. KNN scores val
- macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
- held-out one. Instance-based memorization of the training
- host's feature space — informative as a baseline, not a
- deployment candidate.
+ KNN val ↔ test gap. KNN scores val
+ macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
+ held-out sample_names. Instance-based
+ memorization of the specific training samples — informative
+ as a baseline, not a deployment candidate.
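
Reviewer note: the held-out-by-sample, profile-stratified split that this patch describes could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's actual code. The function name `split_by_sample`, the episode dictionary keys (`host`, `profile`, `sample_name`), and the `test_frac` default are all hypothetical. The only behaviour taken from the patch text is that test `sample_name`s never appear in training, that hosts are not used as a split key, and that profiles with a single sample are set aside and reported separately.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, test_frac=0.25, seed=0):
    """Split episodes so that test sample_names never appear in train.

    `episodes` is assumed to be a list of dicts with hypothetical keys
    'profile' and 'sample_name'. Returns (train, test, excluded), where
    `excluded` holds episodes from profiles that have only one sample.
    """
    # Collect the distinct sample names seen under each profile.
    samples_by_profile = defaultdict(set)
    for ep in episodes:
        samples_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    test_keys = set()
    excluded_profiles = set()
    for profile in sorted(samples_by_profile):
        names = sorted(samples_by_profile[profile])
        if len(names) < 2:
            # A single-sample profile cannot be both trained on and held out.
            excluded_profiles.add(profile)
            continue
        rng.shuffle(names)
        # Hold out at least one sample, but always leave one for training.
        n_test = min(max(1, round(len(names) * test_frac)), len(names) - 1)
        test_keys.update((profile, name) for name in names[:n_test])

    train, test, excluded = [], [], []
    for ep in episodes:
        if ep["profile"] in excluded_profiles:
            excluded.append(ep)
        elif (ep["profile"], ep["sample_name"]) in test_keys:
            test.append(ep)
        else:
            train.append(ep)
    return train, test, excluded
```

Because the split key is `(profile, sample_name)` and the host is ignored, every host can contribute episodes to both sides, which matches the "unseen malware, not unseen device" framing in the hunks above.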