diff --git a/training/dashboard/static/index.html b/training/dashboard/static/index.html
index 3e6abd0..e11ab00 100644
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -161,7 +161,23 @@ - + +
+
+
episodes ingested
+
0
+
+ 0.0 / sec · last 60 s · + total bytes on disk: 0 B +
+ + + + +
+
+ +
what detection unlocks
@@ -263,8 +279,9 @@
- -
-
-
episodes ingested
-
0
-
- 0.0 / sec · last 60 s · - total bytes on disk: 0 B -
- - - - -
-
-
@@ -460,13 +461,18 @@
split recipe
-
train ∪ val: elliott-thinkpad
-
test: k-gamingcom
-
held-out by host so the test set - measures cross-device generalization, not in-distribution - self-prediction. A 90 % accuracy that comes from - recognising the host's idle profile is worthless for - a fleet detector.
+
train / val / test: held-out by + sample_name, profile-stratified
+
both hosts contribute to all three slices
+
the fleet is uniform — every + host runs the same orchestrator and every profile — + so we don't split by host. We split by malware + sample_name: the specific instances in + the test set never appear during training. + Generalization axis is "unseen malware", not + "unseen device". Two profiles with only one sample + (cpu-saturate, low-and-slow) are excluded from + held-out-by-sample eval and reported separately.
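Reviewer sketch of the split recipe described above, not the code this change ships. It assumes each episode is a dict with `profile` and `sample_name` keys (hypothetical field names), and that a profile needs at least three distinct samples to fill all three slices. The text only says single-sample profiles are excluded, so the threshold of three and the 70/15/15 fractions are assumptions.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, fractions=(0.70, 0.15, 0.15), min_samples=3, seed=0):
    """Assign whole malware samples to train/val/test, one profile at a time.

    `episodes` is an iterable of dicts with the hypothetical keys
    "profile" and "sample_name". Returns (assignment, excluded), where
    assignment maps (profile, sample_name) -> "train" | "val" | "test" and
    excluded is the set of profiles with too few samples to hold any out.
    Keep min_samples >= 3 so every slice receives at least one sample.
    """
    samples_by_profile = defaultdict(set)
    for ep in episodes:
        samples_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    assignment, excluded = {}, set()
    for profile in sorted(samples_by_profile):
        names = sorted(samples_by_profile[profile])
        if len(names) < min_samples:
            excluded.add(profile)  # reported separately, not split
            continue
        rng.shuffle(names)
        # At least one sample per held-out slice; the rest go to train.
        n_val = max(1, round(len(names) * fractions[1]))
        n_test = max(1, round(len(names) * fractions[2]))
        n_train = len(names) - n_val - n_test
        for i, name in enumerate(names):
            if i < n_train:
                slice_name = "train"
            elif i < n_train + n_val:
                slice_name = "val"
            else:
                slice_name = "test"
            assignment[(profile, name)] = slice_name
    return assignment, excluded


def slice_of(episode, assignment):
    """Look up an episode's slice; the host plays no part in the decision."""
    return assignment.get((episode["profile"], episode["sample_name"]))
```

Because the lookup key ignores the host, episodes of the same sample recorded on different hosts always land in the same slice, which is what keeps test-set samples out of training.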
@@ -495,25 +501,19 @@
reported alongside accuracy
μs / window — inference cost at batch=64
-
cross-host gap — val − test macro-F1
-
latency translates to containment - lag; the gap is the honest measure of generalization. - Both are plotted on the perf scene.
+
val ↔ test gap — val − test macro-F1
+
latency translates to + containment lag; the val ↔ test gap is the honest + measure of how much accuracy survives the move from + "samples we saw" to "samples we didn't". Both plot + on the perf scene.
- -
-
-
sequence models · accuracy on held-out samples
-
-
-
- - +
how we trained the sequence models
@@ -530,6 +530,14 @@
+ +
+
+
sequence models · accuracy on held-out samples
+
+
+
+
@@ -622,12 +630,14 @@
-
cross-host as the eval axis
-
Held-out-by-host - is reported as a first-class number alongside - held-out-by-sample. The two often disagree by 0.4 - macro-F1, and only the cross-host number predicts - fleet behaviour.
+
held-out-by-sample as the eval axis
+
The hosts in the + fleet are uniform — same orchestrator, same workload, + different production rates. The generalization claim + is therefore "unseen malware sample", tested on the + same population of devices the training data came + from. Profile-stratified so every profile gets fair + train/val/test cells.
@@ -724,10 +734,12 @@
two-host fleet
-
Cross-host generalization - is reported between exactly two machines - (elliott-thinkpad → k-gamingcom). N-host claims need - more hosts on the WireGuard mesh.
+
Both hosts contribute + to train, val, and test, but the device population + is small (n = 2). Adding more hosts on the WireGuard + mesh wouldn't change the split recipe but would make + the dataset more representative of real-world + hardware variety.
@@ -754,12 +766,12 @@
-
KNN cross-host gap
+
KNN val ↔ test gap
KNN scores val - macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on - the held-out k-gamingcom. Instance-based memorization - of the training host's feature space — informative - as a baseline, but not a deployment candidate.
+ macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 + on held-out sample_names. Instance-based memorization + of the specific training samples — informative as a + baseline, not a deployment candidate.
@@ -778,9 +790,12 @@ /proc-only telemetry can classify workload phases at multi-class macro-F1 well above chance. -
  • Held-out-by-host evaluation is the - right generalization axis; held-out-by-sample - overstates real fleet performance by 0.3+ F1.
  • +
  • Held-out-by-sample, + profile-stratified, is the right generalization + axis: both fleet hosts contribute to all three + slices, and the test set's + sample_names never appear during + training.
  • The recurrent family (LSTM/GRU) and Transformer sit on the upper-left of the accuracy-vs-cost frontier; KNN and @@ -835,6 +850,20 @@ +
    +
    +

    Collecting the dataset

    +

    Each lab host on the WireGuard mesh boots a real Alpine VM, runs + a profile-driven workload inside it, and samples + /proc/<qemu_pid> at 10 Hz. Every ~30 seconds + the labeled tarball is shipped to this Pi over mTLS.

    +

    The counter on the left is the running total, sourced from the + receiver's index.jsonl on disk. The sparkline is the + arrival rate over the last sixty seconds — proof that the deck + is reading live data, not a fixed slide.
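Reviewer sketch of how a running total and a sixty-second arrival rate could be derived from `index.jsonl`-style records. The record layout is not shown in this change, so the `received_at` field (epoch seconds) is a hypothetical name, and the sketch assumes records are appended in time order.

```python
import json
from collections import deque


class ArrivalRate:
    """Episodes per second over a trailing window (60 s by default)."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._arrivals = deque()  # timestamps, oldest first

    def observe(self, timestamp):
        self._arrivals.append(timestamp)

    def rate(self, now):
        # Drop arrivals that have aged out of the window, then average.
        cutoff = now - self.window_s
        while self._arrivals and self._arrivals[0] <= cutoff:
            self._arrivals.popleft()
        return len(self._arrivals) / self.window_s


def replay(index_lines, now, field="received_at"):
    """Return (total episodes, arrival rate) from index.jsonl-style lines.

    `field` is an assumed key holding the arrival time in epoch seconds.
    """
    tracker = ArrivalRate()
    total = 0
    for line in index_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        tracker.observe(float(record[field]))
        total += 1
    return total, tracker.rate(now)
```

The total counts every record ever written, while the rate only counts records inside the window, so the two numbers move independently when ingestion stalls.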

    +
    +
    +

    Why detect at all?

    @@ -877,8 +906,8 @@ trained on twelve channels of /proc telemetry classify five workload phases (clean / armed / infecting / infected_running / dormant) accurately enough to drive - automated containment, and generalize across hosts - and malware profiles it has never seen during training?

    + automated containment, and generalize to malware + sample_names it has never seen during training?

    The task is multi-class classification: the target is one of five mutually-exclusive phase labels. Not regression (no continuous target), not ranking @@ -894,15 +923,16 @@

    Literature on behaviour-based malware detection is rich but uneven. Most published results either (a) use richer telemetry than what a constrained host actually exports, or - (b) frame evaluation in ways that hide the cross-host - generalization problem. The card on the left summarises the + (b) frame evaluation in ways that hide same-sample overfit + (training and testing on the same malware instances). The card on the left summarises the gap.

    This project asks three concrete questions:

    RQ1. How well can a per-window classifier identify workload phases from /proc alone, with no syscall traces and no kernel hooks?

    -

    RQ2. Does the model still work when test - episodes come from a host the training set never saw?

    +

    RQ2. Does the model still work on + sample_names the training set never saw — + i.e., new instances of malware profiles it does know?

    RQ3. Of the standard sequence-model families (RNN, GRU, LSTM, CNN, Transformer) plus a non-parametric baseline (KNN) and a tabular baseline @@ -927,7 +957,7 @@ into one shared training loop. KNN, GBT, MLP, CNN, RNN, GRU, LSTM, and Transformer all reuse the same standardization, schema-hashed checkpoint format, class-weighted CE loss, - and held-out-by-host evaluation — so the comparison is + and held-out-by-sample evaluation — so the comparison is genuinely apples-to-apples.

    The detector's per-window verdict feeds two downstream loops: a fleet-wide trust score that @@ -953,19 +983,6 @@

    -
    -
    -

    Collecting the dataset

    -

    Each lab host on the WireGuard mesh boots a real Alpine VM, runs - a profile-driven workload inside it, and samples - /proc/<qemu_pid> at 10 Hz. Every ~30 seconds - the labeled tarball is shipped to this Pi over mTLS.

    -

    The counter on the left is the running total, sourced from the - receiver's index.jsonl on disk. The sparkline is the - arrival rate over the last sixty seconds.

    -
    -
    -

    A multi-host fleet

    @@ -1044,11 +1061,15 @@ split recipe, the primary metric, and what we measure next to accuracy. The temptation is to report a single big number; we report a number you can argue with.

    -

    Held-out by host. Train and validate on - one machine; test on a different machine. A model that - wins by memorising the train host's idle profile loses - here, which is what you want — a fleet detector has to - generalize across hosts it never saw at training time.

    +

    Held-out by sample_name, + profile-stratified. The fleet is uniform — every + host runs the same orchestrator and the same set of + profiles — so we don't split by device. Both hosts + contribute data to train, val, and test. What's held out is + specific malware instances: the + sample_names in the test set never appear + during training. The model has to generalize to unseen + samples, not unseen devices.

    Macro-F1, not accuracy. The dataset is heavily skewed: roughly half the labelled time is infected_running and only ~5 % is @@ -1062,21 +1083,6 @@

    -
    -
    -

    Sequence models

    -

    RNN, GRU, LSTM — recurrent models that read the - window one timestep at a time and carry state forward. Cheap, - mature, easy to interpret.

    -

    BERT-style transformer — the window becomes a - sequence of "tokens"; attention captures cross-position context - instead of accumulating it through a hidden state. More - parameters, more compute, more room to overfit a small dataset.

    -

    Same input, same labels, four different inductive biases. The - comparison on the left is the punchline of the whole project.

    -
    -
    -

    How we trained them

    @@ -1093,6 +1099,21 @@
    +
    +
    +

    Sequence models

    +

    RNN, GRU, LSTM — recurrent models that read the + window one timestep at a time and carry state forward. Cheap, + mature, easy to interpret.

    +

    BERT-style transformer — the window becomes a + sequence of "tokens"; attention captures cross-position context + instead of accumulating it through a hidden state. More + parameters, more compute, more room to overfit a small dataset.

    +

    Same input, same labels, four different inductive biases. The + comparison on the left is the punchline of the whole project.

    +
    +
    +

    Nearest-neighbor as a sanity check

    @@ -1157,11 +1178,13 @@ trained on. Loading a model against a different schema fails fast. Without this, retroactive comparison silently scores models on misaligned columns and reports nonsense.

    -

    Cross-host as the eval axis. - Held-out-by-host is reported as a first-class number - alongside held-out-by-sample — the two often disagree by - ~0.4 macro-F1, and only the cross-host number predicts - real fleet behaviour.

    +

    Held-out-by-sample, profile-stratified. + Hosts in the fleet are uniform — same orchestrator, same + workload, just different production rates — so we split by + malware sample_name instead of by device. The + generalization claim is "unseen malware sample", tested on + the same population of hosts that contributed the training + data.

    @@ -1236,11 +1259,11 @@ behaviours fall inside a single sample. Detection of millisecond-scale privilege checks would need faster telemetry than /proc provides.

    -

    KNN cross-host gap. KNN scores val - macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the - held-out one. Instance-based memorization of the training - host's feature space — informative as a baseline, not a - deployment candidate.

    +

    KNN val ↔ test gap. KNN scores val + macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on + held-out sample_names. Instance-based + memorization of the specific training samples — informative + as a baseline, not a deployment candidate.
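Reviewer sketch of the val ↔ test gap as defined in this change (val macro-F1 minus test macro-F1). With the KNN figures quoted above, the gap is roughly 0.74 − 0.13 = 0.61. The function names are illustrative, and scoring a class that is never true and never predicted as 0 is an assumed convention, not something the text specifies.

```python
PHASES = ("clean", "armed", "infecting", "infected_running", "dormant")


def macro_f1(y_true, y_pred, labels=PHASES):
    """Unweighted mean of per-class F1, so rare phases count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def val_test_gap(val, test, labels=PHASES):
    """val and test are (y_true, y_pred) pairs; returns val minus test macro-F1."""
    return macro_f1(*val, labels) - macro_f1(*test, labels)
```

A large positive gap means the model does much better on validation windows than on windows from held-out `sample_name`s, which is the memorization signature described for KNN.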