diff --git a/training/dashboard/static/index.html b/training/dashboard/static/index.html
index 3e6abd0..e11ab00 100644
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -161,7 +161,23 @@
-
+
+
sample_name, profile-stratified
sample_name: the specific instances in
+ the test set never appear during training.
+ Generalization axis is "unseen malware", not
+ "unseen device". Two profiles with only one sample
+ (cpu-saturate, low-and-slow) are excluded from
+ held-out-by-sample eval and reported separately.
sample_names never appear during
+ training.
Each lab host on the WireGuard mesh boots a real Alpine VM, runs
+ a profile-driven workload inside it, and samples
+ /proc/<qemu_pid> at 10 Hz. Every ~30 seconds
+ the labeled tarball is shipped to this Pi over mTLS.
The counter on the left is the running total, sourced from the
+ receiver's index.jsonl on disk. The sparkline is the
+ arrival rate over the last sixty seconds — proof that the deck
+ is reading live data, not a fixed slide.
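Reviewer sketch of the 10 Hz sampling step described above. The text only says the collector samples /proc/<qemu_pid>; which files it reads is not stated, so reading `/proc/<pid>/stat` is an assumption here, and `parse_stat` and `sample` are hypothetical helper names. Field positions follow the proc(5) layout (utime is field 14, stime 15, rss 24).

```python
import time


def parse_stat(line: str) -> dict:
    """Parse one /proc/<pid>/stat line into a few counters (assumed feature set)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "utime_ticks": int(fields[11]),  # field 14
        "stime_ticks": int(fields[12]),  # field 15
        "rss_pages": int(fields[21]),    # field 24
    }


def sample(pid: int, hz: float = 10.0, n: int = 5) -> list:
    """Take n samples at roughly hz samples per second (Linux only)."""
    period = 1.0 / hz
    out = []
    next_t = time.monotonic()
    for _ in range(n):
        with open(f"/proc/{pid}/stat") as f:
            out.append(parse_stat(f.read()))
        # Schedule against a monotonic deadline so parse time does not drift the rate.
        next_t += period
        time.sleep(max(0.0, next_t - time.monotonic()))
    return out
```

Calling `sample(qemu_pid)` on a lab host would return five dicts spaced about 100 ms apart; labelling and tarball packaging are out of scope for this sketch.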
/proc telemetry
classify five workload phases (clean / armed / infecting /
infected_running / dormant) accurately enough to drive
- automated containment, and generalize across hosts
- and malware profiles it has never seen during training?
+ automated containment, and generalize to malware
+ sample_names it has never seen during training?
The task is multi-class classification: the target is one of five mutually-exclusive phase labels. Not regression (no continuous target), not ranking
@@ -894,15 +923,16 @@
Literature on behaviour-based malware detection is rich but uneven. Most published results either (a) use richer telemetry than what a constrained host actually exports, or
- (b) frame evaluation in ways that hide the cross-host
- generalization problem. The card on the left summarises the
+ (b) frame evaluation in ways that hide same-sample overfit
+ (training and testing on the same malware instances). The card on the left summarises the
gap.
This project asks three concrete questions:
RQ1. How well can a per-window classifier
identify workload phases from /proc alone, with
no syscall traces and no kernel hooks?
RQ2. Does the model still work when test
- episodes come from a host the training set never saw?
+RQ2. Does the model still work on
+ sample_names the training set never saw —
+ i.e., new instances of malware profiles it does know?
RQ3. Of the standard sequence-model families (RNN, GRU, LSTM, CNN, Transformer) plus a non-parametric baseline (KNN) and a tabular baseline
@@ -927,7 +957,7 @@
into one shared training loop. KNN, GBT, MLP, CNN, RNN, GRU, LSTM, and Transformer all reuse the same standardization, schema-hashed checkpoint format, class-weighted CE loss,
- and held-out-by-host evaluation, so the comparison is
+ and held-out-by-sample evaluation, so the comparison is
genuinely apples-to-apples.
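Reviewer sketch of what "class-weighted CE loss" could look like. The diff does not say how the weights are derived, so the inverse-frequency ("balanced") scheme below is an assumption, as are the names `balanced_weights` and `weighted_ce`. The normalisation divides by the sum of the per-sample weights, which is one common convention for a weighted mean.

```python
import math
from collections import Counter

PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def balanced_weights(labels, classes=PHASES):
    """Inverse-frequency weights: n / (k * count_c). Assumed scheme, not from the source."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    # Classes absent from `labels` get no weight entry (they cannot occur in the loss).
    return {c: n / (k * counts[c]) for c in classes if counts[c] > 0}


def weighted_ce(probs, labels, weights):
    """Weighted cross-entropy.

    probs:  list of dicts mapping class name -> predicted probability
    labels: list of true class names, same length as probs
    """
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den
```

With weights like these, a rare phase contributes as much to the loss as a common one, which is the usual motivation for pairing the loss with a macro-averaged metric.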
The detector's per-window verdict feeds two downstream loops: a fleet-wide trust score that
@@ -953,19 +983,6 @@
Each lab host on the WireGuard mesh boots a real Alpine VM, runs
- a profile-driven workload inside it, and samples
- /proc/<qemu_pid> at 10 Hz. Every ~30 seconds
- the labeled tarball is shipped to this Pi over mTLS.
The counter on the left is the running total, sourced from the
- receiver's index.jsonl on disk. The sparkline is the
- arrival rate over the last sixty seconds.
Held-out by host. Train and validate on
- one machine; test on a different machine. A model that
- wins by memorising the train host's idle profile loses
- here, which is what you want: a fleet detector has to
- generalize across hosts it never saw at training time.
+Held-out by sample_name,
+ profile-stratified. The fleet is uniform — every
+ host runs the same orchestrator and the same set of
+ profiles — so we don't split by device. Both hosts
+ contribute data to train, val, and test. What's held out is
+ specific malware instances: the
+ sample_names in the test set never appear
+ during training. The model has to generalize to unseen
+ samples, not unseen devices.
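Reviewer sketch of a held-out-by-sample_name, profile-stratified split as described above. The episode schema (dicts with `profile`, `sample_name`, `host` keys), the function name `split_by_sample`, and the `test_frac` default are all hypothetical. Profiles with a single sample are routed to a separate bucket, matching the note that cpu-saturate and low-and-slow are reported separately. A validation split is omitted for brevity.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, test_frac=0.25, seed=0):
    """Return (train, test, separate) lists of episodes.

    Within each profile, a fraction of sample_names is held out for test,
    so no test sample_name appears in train. Hosts are not used for splitting.
    """
    by_profile = defaultdict(set)
    for ep in episodes:
        by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    test_keys, singletons = set(), set()
    for profile in sorted(by_profile):
        names = sorted(by_profile[profile])
        if len(names) < 2:
            # Cannot hold out a sample and still train on the profile.
            singletons.add(profile)
            continue
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_frac))
        n_test = min(n_test, len(names) - 1)  # keep at least one sample for training
        test_keys.update((profile, name) for name in names[:n_test])

    train, test, separate = [], [], []
    for ep in episodes:
        if ep["profile"] in singletons:
            separate.append(ep)
        elif (ep["profile"], ep["sample_name"]) in test_keys:
            test.append(ep)
        else:
            train.append(ep)
    return train, test, separate
```

Because the split key is (profile, sample_name) and never the host, every host can contribute episodes to both sides, which is the property the paragraph above relies on.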
Macro-F1, not accuracy. The dataset is
heavily skewed: roughly half the labelled time is
infected_running and only ~5 % is
@@ -1062,21 +1083,6 @@
RNN, GRU, LSTM: recurrent models that read the
- window one timestep at a time and carry state forward. Cheap,
- mature, easy to interpret.
-BERT-style transformer: the window becomes a
- sequence of "tokens"; attention captures cross-position context
- instead of accumulating it through a hidden state. More
- parameters, more compute, more room to overfit a small dataset.
-Same input, same labels, four different inductive biases. The
- comparison on the left is the punchline of the whole project.
-RNN, GRU, LSTM: recurrent models that read the
+ window one timestep at a time and carry state forward. Cheap,
+ mature, easy to interpret.
+BERT-style transformer: the window becomes a
+ sequence of "tokens"; attention captures cross-position context
+ instead of accumulating it through a hidden state. More
+ parameters, more compute, more room to overfit a small dataset.
+Same input, same labels, four different inductive biases. The
+ comparison on the left is the punchline of the whole project.
+Cross-host as the eval axis.
- Held-out-by-host is reported as a first-class number
- alongside held-out-by-sample; the two often disagree by
- ~0.4 macro-F1, and only the cross-host number predicts
- real fleet behaviour.
+Held-out-by-sample, profile-stratified.
+ Hosts in the fleet are uniform — same orchestrator, same
+ workload, just different production rates — so we split by
+ malware sample_name instead of by device. The
+ generalization claim is "unseen malware sample", tested on
+ the same population of hosts that contributed the training
+ data.
/proc provides.
- KNN cross-host gap. KNN scores val
- macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
- held-out one. Instance-based memorization of the training
- host's feature space: informative as a baseline, not a
- deployment candidate.
+KNN val ↔ test gap. KNN scores val
+ macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
+ held-out sample_names. Instance-based
+ memorization of the specific training samples — informative
+ as a baseline, not a deployment candidate.
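Reviewer sketch of the metric behind the val and test numbers quoted above. This is a plain-Python macro-F1 for illustration, not the project's evaluation code; `macro_f1` and `accuracy` are hypothetical names, and scoring a class with no true or predicted instances as 0.0 is a choice made here. The toy labels are invented to show how accuracy and macro-F1 diverge on a skewed dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted phase matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (invented data): a model that always predicts the majority phase.
y_true = ["infected_running"] * 9 + ["dormant"]
y_pred = ["infected_running"] * 10
classes = ["infected_running", "dormant"]
```

On this toy data `accuracy(y_true, y_pred)` is 0.9 while `macro_f1(y_true, y_pred, classes)` is 9/19, roughly 0.47, because the missed minority class contributes an F1 of zero to the average.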