deck: reorder + correct eval framing to held-out-by-sample

REORDER
- collect (big-number ingest counter) moved from #7 to #2; it sits right after the title as the dataset-quantity hook
- training-code moved from #15 to #14; "how we trained" now appears before "what we got" (models accuracy bars)

EVAL FRAMING CORRECTION
The fleet hosts are uniform (every host runs every profile, just at different rates), so the actual split is held-out-by-sample (profile-stratified), NOT held-out-by-host. Both hosts contribute to train, val, AND test. The generalization claim is "unseen malware sample_name", not "unseen device".

Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was "cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" → "generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never saw" → "sample_names the training set never saw"); literature-gap bullet flipped from "cross-host generalization" to "sample-stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" → "held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts contribute to train/val/test"; "KNN cross-host gap" → "KNN val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as primary axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:59:22 -05:00
commit 233390a40e
parent db9f013969
1 changed file with 131 additions and 108 deletions
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -161,7 +161,23 @@
          </div>
        </div>

-        <!-- 2. motivation — what detection unlocks -->
+        <!-- 2. collect — big-number hook right after the title -->
+        <div class="stage-view" data-view="collect">
+          <div class="metric-stack">
+            <div class="metric-eyebrow">episodes ingested</div>
+            <div class="metric-big" id="ingest-total">0</div>
+            <div class="metric-sub">
+              <span id="ingest-rate">0.0</span> / sec · last 60 s ·
+              total bytes on disk: <span id="ingest-bytes">0 B</span>
+            </div>
+            <svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
+              <path id="ingest-spark-fill" d=""></path>
+              <path id="ingest-spark-path" d=""></path>
+            </svg>
+          </div>
+        </div>
+
+        <!-- 3. motivation — what detection unlocks -->
        <div class="stage-view" data-view="motivation">
          <div class="metric-stack metric-stack-wide motivation-stack">
            <div class="metric-eyebrow">what detection unlocks</div>
@@ -263,8 +279,9 @@
                <ul class="research-list">
                  <li><strong>/proc-only signal</strong> — most work
                    assumes syscalls or kernel hooks</li>
-                  <li><strong>Cross-host generalization</strong> — eval
-                    splits often hide it (held-out by sample, not host)</li>
+                  <li><strong>Sample-stratified evaluation</strong> —
+                    papers often hide same-sample overfit by training
+                    and testing on the same malware instances</li>
                  <li><strong>Real-time per-window classification</strong>
                    for containment, not post-hoc batch labelling</li>
                  <li><strong>Side-by-side cell-choice comparison</strong>
@@ -309,7 +326,7 @@
                <text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
                <text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
                <text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
-                <text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">cross-host eval · class-weighted CE · early stop on val macro-F1</text>
+                <text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
              </g>
              <g class="pipeline-stage">
                <rect x="60" y="350" width="200" height="60" rx="4"/>
@@ -356,22 +373,6 @@
          </div>
        </div>

-        <!-- 3. collect -->
-        <div class="stage-view" data-view="collect">
-          <div class="metric-stack">
-            <div class="metric-eyebrow">episodes ingested</div>
-            <div class="metric-big" id="ingest-total">0</div>
-            <div class="metric-sub">
-              <span id="ingest-rate">0.0</span> / sec · last 60 s ·
-              total bytes on disk: <span id="ingest-bytes">0 B</span>
-            </div>
-            <svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
-              <path id="ingest-spark-fill" d=""></path>
-              <path id="ingest-spark-path" d=""></path>
-            </svg>
-          </div>
-        </div>
-
        <!-- 4. hosts -->
        <div class="stage-view" data-view="hosts">
          <div class="metric-stack">
@@ -460,13 +461,18 @@
              <div class="eval-block">
                <div class="eval-block-title">split recipe</div>
                <div class="eval-block-body">
-                  <div><strong>train ∪ val:</strong> elliott-thinkpad</div>
-                  <div><strong>test:</strong> k-gamingcom</div>
-                  <div class="eval-detail">held-out by host so the test set
-                    measures cross-device generalization, not in-distribution
-                    self-prediction. A 90 % accuracy that comes from
-                    recognising the host's idle profile is worthless for
-                    a fleet detector.</div>
+                  <div><strong>train / val / test:</strong> held-out by
+                    <code>sample_name</code>, profile-stratified</div>
+                  <div><strong>both hosts</strong> contribute to all three slices</div>
+                  <div class="eval-detail">the fleet is uniform — every
+                    host runs the same orchestrator and every profile —
+                    so we don't split by host. We split by malware
+                    <code>sample_name</code>: the specific instances in
+                    the test set never appear during training.
+                    Generalization axis is "unseen malware", not
+                    "unseen device". Two profiles with only one sample
+                    (cpu-saturate, low-and-slow) are excluded from
+                    held-out-by-sample eval and reported separately.</div>
                </div>
              </div>
              <div class="eval-block">
@@ -495,25 +501,19 @@
                <div class="eval-block-title">reported alongside accuracy</div>
                <div class="eval-block-body">
                  <div><strong>μs / window</strong> — inference cost at batch=64</div>
-                  <div><strong>cross-host gap</strong> — val − test macro-F1</div>
-                  <div class="eval-detail">latency translates to containment
-                    lag; the gap is the honest measure of generalization.
-                    Both are plotted on the perf scene.</div>
+                  <div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
+                  <div class="eval-detail">latency translates to
+                    containment lag; the val ↔ test gap is the honest
+                    measure of how much accuracy survives the move from
+                    "samples we saw" to "samples we didn't". Both are
+                    plotted on the perf scene.</div>
                </div>
              </div>
            </div>
          </div>
        </div>

-        <!-- 10. models -->
-        <div class="stage-view" data-view="models">
-          <div class="metric-stack">
-            <div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
-            <div class="model-bars" id="model-bars"></div>
-          </div>
-        </div>
-
-        <!-- 10. training-code — how we trained the sequence models -->
+        <!-- training-code — how we trained, before showing results -->
        <div class="stage-view" data-view="training-code">
          <div class="metric-stack metric-stack-wide">
            <div class="metric-eyebrow">how we trained the sequence models</div>
@@ -530,6 +530,14 @@
          </div>
        </div>

+        <!-- models — accuracy bars (results after training-code) -->
+        <div class="stage-view" data-view="models">
+          <div class="metric-stack">
+            <div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
+            <div class="model-bars" id="model-bars"></div>
+          </div>
+        </div>
+
        <!-- 11. knn — interactive 3-D scatter with mode toggle -->
        <div class="stage-view" data-view="knn">
          <div class="metric-stack">
@@ -622,12 +630,14 @@
              <div class="motivation-card">
                <div class="motivation-card-marker mc-recover"></div>
                <div class="motivation-card-body">
-                  <div class="motivation-card-title">cross-host as the eval axis</div>
-                  <div class="motivation-card-text">Held-out-by-host
-                    is reported as a first-class number alongside
-                    held-out-by-sample. The two often disagree by 0.4
-                    macro-F1, and only the cross-host number predicts
-                    fleet behaviour.</div>
+                  <div class="motivation-card-title">held-out-by-sample as the eval axis</div>
+                  <div class="motivation-card-text">The hosts in the
+                    fleet are uniform — same orchestrator, same workload,
+                    different production rates. The generalization claim
+                    is therefore "unseen malware sample", tested on the
+                    same population of devices the training data came
+                    from. Profile-stratified so every profile gets fair
+                    train/val/test cells.</div>
                </div>
              </div>
            </div>
@@ -724,10 +734,12 @@
                <div class="motivation-card-marker mc-armed"></div>
                <div class="motivation-card-body">
                  <div class="motivation-card-title">two-host fleet</div>
-                  <div class="motivation-card-text">Cross-host generalization
-                    is reported between exactly two machines
-                    (elliott-thinkpad → k-gamingcom). N-host claims need
-                    more hosts on the WireGuard mesh.</div>
+                  <div class="motivation-card-text">Both hosts contribute
+                    to train, val, and test, but the device population
+                    is small (n = 2). Adding more hosts on the WireGuard
+                    mesh wouldn't change the split recipe but would make
+                    the dataset more representative of real-world
+                    hardware variety.</div>
                </div>
              </div>
              <div class="motivation-card">
@@ -754,12 +766,12 @@
              <div class="motivation-card">
                <div class="motivation-card-marker mc-armed"></div>
                <div class="motivation-card-body">
-                  <div class="motivation-card-title">KNN cross-host gap</div>
+                  <div class="motivation-card-title">KNN val ↔ test gap</div>
                  <div class="motivation-card-text">KNN scores val
-                    macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on
-                    the held-out k-gamingcom. Instance-based memorization
-                    of the training host's feature space — informative
-                    as a baseline, but not a deployment candidate.</div>
+                    macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
+                    on held-out sample_names. Instance-based memorization
+                    of the specific training samples — informative as a
+                    baseline, not a deployment candidate.</div>
                </div>
              </div>
            </div>
@@ -778,9 +790,12 @@
                    <strong>/proc-only telemetry</strong> can classify
                    workload phases at multi-class macro-F1 well above
                    chance.</li>
-                  <li>Held-out-<strong>by-host</strong> evaluation is the
-                    right generalization axis; held-out-by-sample
-                    overstates real fleet performance by 0.3+ F1.</li>
+                  <li>Held-out-by-<strong>sample</strong>,
+                    profile-stratified, is the right generalization
+                    axis: both fleet hosts contribute to all three
+                    slices, and the test set's
+                    <code>sample_name</code>s never appear during
+                    training.</li>
                  <li>The recurrent family (LSTM/GRU) and Transformer
                    sit on the upper-left of the
                    <strong>accuracy-vs-cost frontier</strong>; KNN and
@@ -835,6 +850,20 @@
        </div>
      </section>

+      <section class="scene" data-stage="collect">
+        <div class="prose">
+          <h2>Collecting the dataset</h2>
+          <p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
+            a profile-driven workload inside it, and samples
+            <code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
+            the labeled tarball is shipped to this Pi over mTLS.</p>
+          <p>The counter on the left is the running total, sourced from the
+            receiver's <code>index.jsonl</code> on disk. The sparkline is the
+            arrival rate over the last sixty seconds — proof that the deck
+            is reading live data, not a fixed slide.</p>
+        </div>
+      </section>
+
      <section class="scene" data-stage="motivation">
        <div class="prose">
          <h2>Why detect at all?</h2>
@@ -877,8 +906,8 @@
            trained on twelve channels of <code>/proc</code> telemetry
            classify five workload phases (clean / armed / infecting /
            infected_running / dormant) accurately enough to drive
-            automated containment, <em>and</em> generalize across hosts
-            and malware profiles it has never seen during training?</p>
+            automated containment, <em>and</em> generalize to malware
+            <code>sample_name</code>s it has never seen during training?</p>
          <p>The task is <strong>multi-class classification</strong>:
            the target is one of five mutually-exclusive phase labels.
            Not regression (no continuous target), not ranking
@@ -894,15 +923,16 @@
          <p>Literature on behaviour-based malware detection is rich but
            uneven. Most published results either (a) use richer
            telemetry than what a constrained host actually exports, or
-            (b) frame evaluation in ways that hide the cross-host
-            generalization problem. The card on the left summarises the
+            (b) frame evaluation in ways that hide same-sample overfit
+            (training and testing on the same malware instances). The card on the left summarises the
            gap.</p>
          <p>This project asks three concrete questions:</p>
          <p><strong>RQ1.</strong> How well can a per-window classifier
            identify workload phases from <code>/proc</code> alone, with
            no syscall traces and no kernel hooks?</p>
-          <p><strong>RQ2.</strong> Does the model still work when test
-            episodes come from a host the training set never saw?</p>
+          <p><strong>RQ2.</strong> Does the model still work on
+            <code>sample_name</code>s the training set never saw —
+            i.e., new instances of malware profiles it does know?</p>
          <p><strong>RQ3.</strong> Of the standard sequence-model
            families (RNN, GRU, LSTM, CNN, Transformer) plus a
            non-parametric baseline (KNN) and a tabular baseline
@@ -927,7 +957,7 @@
            into one shared training loop. KNN, GBT, MLP, CNN, RNN,
            GRU, LSTM, and Transformer all reuse the same standardization,
            schema-hashed checkpoint format, class-weighted CE loss,
-            and held-out-by-host evaluation — so the comparison is
+            and held-out-by-sample evaluation — so the comparison is
            genuinely apples-to-apples.</p>
          <p>The detector's per-window verdict feeds two downstream
            loops: a fleet-wide <strong>trust score</strong> that
@@ -953,19 +983,6 @@
        </div>
      </section>

-      <section class="scene" data-stage="collect">
-        <div class="prose">
-          <h2>Collecting the dataset</h2>
-          <p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
-            a profile-driven workload inside it, and samples
-            <code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
-            the labeled tarball is shipped to this Pi over mTLS.</p>
-          <p>The counter on the left is the running total, sourced from the
-            receiver's <code>index.jsonl</code> on disk. The sparkline is the
-            arrival rate over the last sixty seconds.</p>
-        </div>
-      </section>
-
      <section class="scene" data-stage="hosts">
        <div class="prose">
          <h2>A multi-host fleet</h2>
@@ -1044,11 +1061,15 @@
            split recipe, the primary metric, and what we measure next
            to accuracy. The temptation is to report a single big
            number; we report a number you can argue with.</p>
-          <p><strong>Held-out by host.</strong> Train and validate on
-            one machine; test on a different machine. A model that
-            wins by memorising the train host's idle profile loses
-            here, which is what you want — a fleet detector has to
-            generalize across hosts it never saw at training time.</p>
+          <p><strong>Held-out by <code>sample_name</code>,
+            profile-stratified.</strong> The fleet is uniform — every
+            host runs the same orchestrator and the same set of
+            profiles — so we don't split by device. Both hosts
+            contribute data to train, val, and test. What's held out is
+            specific malware <em>instances</em>: the
+            <code>sample_name</code>s in the test set never appear
+            during training. The model has to generalize to unseen
+            samples, not unseen devices.</p>
          <p><strong>Macro-F1, not accuracy.</strong> The dataset is
            heavily skewed: roughly half the labelled time is
            <code>infected_running</code> and only ~5 % is
@@ -1062,21 +1083,6 @@
        </div>
      </section>

-      <section class="scene" data-stage="models">
-        <div class="prose">
-          <h2>Sequence models</h2>
-          <p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
-            window one timestep at a time and carry state forward. Cheap,
-            mature, easy to interpret.</p>
-          <p><strong>BERT-style transformer</strong> — the window becomes a
-            sequence of "tokens"; attention captures cross-position context
-            instead of accumulating it through a hidden state. More
-            parameters, more compute, more room to overfit a small dataset.</p>
-          <p>Same input, same labels, four different inductive biases. The
-            comparison on the left is the punchline of the whole project.</p>
-        </div>
-      </section>
-
      <section class="scene" data-stage="training-code">
        <div class="prose">
          <h2>How we trained them</h2>
@@ -1093,6 +1099,21 @@
        </div>
      </section>

+      <section class="scene" data-stage="models">
+        <div class="prose">
+          <h2>Sequence models</h2>
+          <p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
+            window one timestep at a time and carry state forward. Cheap,
+            mature, easy to interpret.</p>
+          <p><strong>BERT-style transformer</strong> — the window becomes a
+            sequence of "tokens"; attention captures cross-position context
+            instead of accumulating it through a hidden state. More
+            parameters, more compute, more room to overfit a small dataset.</p>
+          <p>Same input, same labels, four different inductive biases. The
+            comparison on the left is the punchline of the whole project.</p>
+        </div>
+      </section>
+
      <section class="scene" data-stage="knn">
        <div class="prose">
          <h2>Nearest-neighbor as a sanity check</h2>
@@ -1157,11 +1178,13 @@
            trained on. Loading a model against a different schema
            fails fast. Without this, retroactive comparison silently
            scores models on misaligned columns and reports nonsense.</p>
-          <p><strong>Cross-host as the eval axis.</strong>
-            Held-out-by-host is reported as a first-class number
-            alongside held-out-by-sample — the two often disagree by
-            ~0.4 macro-F1, and only the cross-host number predicts
-            real fleet behaviour.</p>
+          <p><strong>Held-out-by-sample, profile-stratified.</strong>
+            Hosts in the fleet are uniform — same orchestrator, same
+            workload, just different production rates — so we split by
+            malware <code>sample_name</code> instead of by device. The
+            generalization claim is "unseen malware sample", tested on
+            the same population of hosts that contributed the training
+            data.</p>
        </div>
      </section>

@@ -1236,11 +1259,11 @@
            behaviours fall inside a single sample. Detection of
            millisecond-scale privilege checks would need faster
            telemetry than <code>/proc</code> provides.</p>
-          <p><strong>KNN cross-host gap.</strong> KNN scores val
-            macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
-            held-out one. Instance-based memorization of the training
-            host's feature space — informative as a baseline, not a
-            deployment candidate.</p>
+          <p><strong>KNN val ↔ test gap.</strong> KNN scores val
+            macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
+            held-out <code>sample_name</code>s. Instance-based
+            memorization of the specific training samples — informative
+            as a baseline, not a deployment candidate.</p>
        </div>
      </section>
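
The held-out-by-sample, profile-stratified split that this commit's copy describes could look roughly like the sketch below. It is not the project's actual splitter: the function name `split_by_sample`, the episode keys (`profile`, `sample_name`, `host`), the default fractions, and the rule "a profile needs at least three samples to be split" are all assumptions made for illustration. The only behaviour taken from the deck text is that whole `sample_name`s are assigned to exactly one slice, that the assignment is done per profile, and that profiles with too few samples are excluded and reported separately.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign whole sample_names to train/val/test, one profile at a time.

    `episodes` is an iterable of dicts with the (hypothetical) keys
    "profile", "sample_name" and "host". Returns (splits, excluded):
    `splits` maps "train"/"val"/"test" to lists of episodes, `excluded`
    holds episodes whose profile has too few samples to hold any out.
    """
    episodes = list(episodes)

    # Collect the distinct sample_names seen under each profile.
    names_by_profile = defaultdict(set)
    for ep in episodes:
        names_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    assignment = {}            # (profile, sample_name) -> split name
    excluded_profiles = set()

    for profile in sorted(names_by_profile):
        names = sorted(names_by_profile[profile])
        if len(names) < 3:
            # Assumption: each slice needs at least one sample per profile.
            excluded_profiles.add(profile)
            continue
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_frac))
        n_val = max(1, round(len(names) * val_frac))
        if len(names) - n_test - n_val < 1:
            raise ValueError(f"fractions leave no training samples for {profile}")
        for i, name in enumerate(names):
            if i < n_test:
                assignment[(profile, name)] = "test"
            elif i < n_test + n_val:
                assignment[(profile, name)] = "val"
            else:
                assignment[(profile, name)] = "train"

    # Route every episode by its sample's assignment; host is never consulted,
    # so any host that ran a sample lands in that sample's slice.
    splits = {"train": [], "val": [], "test": []}
    excluded = []
    for ep in episodes:
        if ep["profile"] in excluded_profiles:
            excluded.append(ep)
        else:
            splits[assignment[(ep["profile"], ep["sample_name"])]].append(ep)
    return splits, excluded
```

Because the host field plays no part in the assignment, both hosts end up in all three slices whenever both have run each sample, which matches the "both hosts contribute to train, val, and test" framing. The val − test macro-F1 gap would then be computed on the `val` and `test` lists returned here.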