From 233390a40e7c2f503ba507646cd88b79f1eac952 Mon Sep 17 00:00:00 2001
From: Max Gorog <mgorog@gmail.com>
Date: Fri, 8 May 2026 15:59:22 -0500
Subject: [PATCH] deck: reorder + correct eval framing to held-out-by-sample
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

REORDER
- collect (big-number ingest counter) moved from #7 to #2 — sits
  right after the title as the dataset-quantity hook
- training-code moved from #15 to #14 — "how we trained" now
  appears before "what we got" (models accuracy bars)

EVAL FRAMING CORRECTION
The fleet hosts are uniform — every host runs every profile, just
at different rates — so the actual split is held-out-by-sample
(profile-stratified), NOT held-out-by-host. Both hosts contribute
to train, val, AND test. The generalization claim is "unseen
malware sample_name", not "unseen device".
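
For reference, a minimal sketch of what "held-out-by-sample,
profile-stratified" means here. This is not the training code: the
function name, the episode dict keys ("profile", "sample_name",
"host"), the slice fractions, and the "fewer than three samples means
excluded" rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole sample_names to train/val/test, one profile at a time.

    Hypothetical helper: `episodes` is an iterable of dicts with the
    assumed keys "profile", "sample_name", and "host". Host is never
    used for splitting, so every host can land in every slice.
    """
    episodes = list(episodes)
    rng = random.Random(seed)

    # Collect the distinct sample_names seen under each profile.
    names_by_profile = defaultdict(set)
    for ep in episodes:
        names_by_profile[ep["profile"]].add(ep["sample_name"])

    slice_of = {}   # (profile, sample_name) -> "train" | "val" | "test"
    excluded = []   # profiles too small to fill all three slices
    for profile in sorted(names_by_profile):
        names = sorted(names_by_profile[profile])
        n = len(names)
        if n < 3:
            excluded.append(profile)
            continue
        rng.shuffle(names)
        # At least one sample_name per slice, and at least one left for train.
        n_test = min(max(1, round(n * test_frac)), n - 2)
        n_val = min(max(1, round(n * val_frac)), n - 1 - n_test)
        for i, name in enumerate(names):
            if i < n_test:
                which = "test"
            elif i < n_test + n_val:
                which = "val"
            else:
                which = "train"
            slice_of[(profile, name)] = which

    # Every episode follows its sample_name into exactly one slice.
    splits = {"train": [], "val": [], "test": []}
    for ep in episodes:
        key = (ep["profile"], ep["sample_name"])
        if key in slice_of:
            splits[slice_of[key]].append(ep)
    return splits, excluded
```

Because assignment is keyed on (profile, sample_name) and ignores
host, no sample_name can straddle two slices, while episodes from
either host can appear in any of them.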

Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was
  "cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" →
  "generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never
  saw" → "sample_names the training set never saw"); literature-gap
  bullet flipped from "cross-host generalization" to "sample-
  stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" →
  "held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts
  contribute to train/val/test"; "KNN cross-host gap" → "KNN
  val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as
  primary axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 training/dashboard/static/index.html | 239 +++++++++++++++------------
 1 file changed, 131 insertions(+), 108 deletions(-)
diff --git a/training/dashboard/static/index.html b/training/dashboard/static/index.html
index 3e6abd0..e11ab00 100644
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -161,7 +161,23 @@
           </div>
         </div>
 
-        <!-- 2. motivation — what detection unlocks -->
+        <!-- 2. collect — big-number hook right after the title -->
+        <div class="stage-view" data-view="collect">
+          <div class="metric-stack">
+            <div class="metric-eyebrow">episodes ingested</div>
+            <div class="metric-big" id="ingest-total">0</div>
+            <div class="metric-sub">
+              <span id="ingest-rate">0.0</span> / sec · last 60 s ·
+              total bytes on disk: <span id="ingest-bytes">0 B</span>
+            </div>
+            <svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
+              <path id="ingest-spark-fill" d=""></path>
+              <path id="ingest-spark-path" d=""></path>
+            </svg>
+          </div>
+        </div>
+
+        <!-- 3. motivation — what detection unlocks -->
         <div class="stage-view" data-view="motivation">
           <div class="metric-stack metric-stack-wide motivation-stack">
             <div class="metric-eyebrow">what detection unlocks</div>
@@ -263,8 +279,9 @@
                 <ul class="research-list">
                   <li><strong>/proc-only signal</strong> — most work
                     assumes syscalls or kernel hooks</li>
-                  <li><strong>Cross-host generalization</strong> — eval
-                    splits often hide it (held-out by sample, not host)</li>
+                  <li><strong>Sample-stratified evaluation</strong> —
+                    papers often hide same-sample overfit by training
+                    and testing on the same malware instances</li>
                   <li><strong>Real-time per-window classification</strong>
                     for containment, not post-hoc batch labelling</li>
                   <li><strong>Side-by-side cell-choice comparison</strong>
@@ -309,7 +326,7 @@
                 <text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
                 <text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
                 <text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
-                <text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">cross-host eval · class-weighted CE · early stop on val macro-F1</text>
+                <text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
               </g>
               <g class="pipeline-stage">
                 <rect x="60" y="350" width="200" height="60" rx="4"/>
@@ -356,22 +373,6 @@
           </div>
         </div>
 
-        <!-- 3. collect -->
-        <div class="stage-view" data-view="collect">
-          <div class="metric-stack">
-            <div class="metric-eyebrow">episodes ingested</div>
-            <div class="metric-big" id="ingest-total">0</div>
-            <div class="metric-sub">
-              <span id="ingest-rate">0.0</span> / sec · last 60 s ·
-              total bytes on disk: <span id="ingest-bytes">0 B</span>
-            </div>
-            <svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
-              <path id="ingest-spark-fill" d=""></path>
-              <path id="ingest-spark-path" d=""></path>
-            </svg>
-          </div>
-        </div>
-
         <!-- 4. hosts -->
         <div class="stage-view" data-view="hosts">
           <div class="metric-stack">
@@ -460,13 +461,18 @@
               <div class="eval-block">
                 <div class="eval-block-title">split recipe</div>
                 <div class="eval-block-body">
-                  <div><strong>train ∪ val:</strong> elliott-thinkpad</div>
-                  <div><strong>test:</strong> k-gamingcom</div>
-                  <div class="eval-detail">held-out by host so the test set
-                    measures cross-device generalization, not in-distribution
-                    self-prediction. A 90 % accuracy that comes from
-                    recognising the host's idle profile is worthless for
-                    a fleet detector.</div>
+                  <div><strong>train / val / test:</strong> held-out by
+                    <code>sample_name</code>, profile-stratified</div>
+                  <div><strong>both hosts</strong> contribute to all three slices</div>
+                  <div class="eval-detail">the fleet is uniform — every
+                    host runs the same orchestrator and every profile —
+                    so we don't split by host. We split by malware
+                    <code>sample_name</code>: the specific instances in
+                    the test set never appear during training.
+                    The generalization axis is "unseen malware", not
+                    "unseen device". Two profiles with only one sample
+                    (cpu-saturate, low-and-slow) are excluded from
+                    held-out-by-sample eval and reported separately.</div>
                 </div>
               </div>
               <div class="eval-block">
@@ -495,25 +501,19 @@
                 <div class="eval-block-title">reported alongside accuracy</div>
                 <div class="eval-block-body">
                   <div><strong>μs / window</strong> — inference cost at batch=64</div>
-                  <div><strong>cross-host gap</strong> — val − test macro-F1</div>
-                  <div class="eval-detail">latency translates to containment
-                    lag; the gap is the honest measure of generalization.
-                    Both are plotted on the perf scene.</div>
+                  <div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
+                  <div class="eval-detail">latency translates to
+                    containment lag; the val ↔ test gap is the honest
+                    measure of how much accuracy survives the move from
+                    "samples we saw" to "samples we didn't". Both are
+                    plotted on the perf scene.</div>
                 </div>
               </div>
             </div>
           </div>
         </div>
 
-        <!-- 10. models -->
-        <div class="stage-view" data-view="models">
-          <div class="metric-stack">
-            <div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
-            <div class="model-bars" id="model-bars"></div>
-          </div>
-        </div>
-
-        <!-- 10. training-code — how we trained the sequence models -->
+        <!-- training-code — how we trained, before showing results -->
         <div class="stage-view" data-view="training-code">
           <div class="metric-stack metric-stack-wide">
             <div class="metric-eyebrow">how we trained the sequence models</div>
@@ -530,6 +530,14 @@
           </div>
         </div>
 
+        <!-- models — accuracy bars (results after training-code) -->
+        <div class="stage-view" data-view="models">
+          <div class="metric-stack">
+            <div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
+            <div class="model-bars" id="model-bars"></div>
+          </div>
+        </div>
+
         <!-- 11. knn — interactive 3-D scatter with mode toggle -->
         <div class="stage-view" data-view="knn">
           <div class="metric-stack">
@@ -622,12 +630,14 @@
               <div class="motivation-card">
                 <div class="motivation-card-marker mc-recover"></div>
                 <div class="motivation-card-body">
-                  <div class="motivation-card-title">cross-host as the eval axis</div>
-                  <div class="motivation-card-text">Held-out-by-host
-                    is reported as a first-class number alongside
-                    held-out-by-sample. The two often disagree by 0.4
-                    macro-F1, and only the cross-host number predicts
-                    fleet behaviour.</div>
+                  <div class="motivation-card-title">held-out-by-sample as the eval axis</div>
+                  <div class="motivation-card-text">The hosts in the
+                    fleet are uniform — same orchestrator, same workload,
+                    different production rates. The generalization claim
+                    is therefore "unseen malware sample", tested on the
+                    same population of devices the training data came
+                    from. Splits are profile-stratified so every profile
+                    gets fair train/val/test cells.</div>
                 </div>
               </div>
             </div>
@@ -724,10 +734,12 @@
                 <div class="motivation-card-marker mc-armed"></div>
                 <div class="motivation-card-body">
                   <div class="motivation-card-title">two-host fleet</div>
-                  <div class="motivation-card-text">Cross-host generalization
-                    is reported between exactly two machines
-                    (elliott-thinkpad → k-gamingcom). N-host claims need
-                    more hosts on the WireGuard mesh.</div>
+                  <div class="motivation-card-text">Both hosts contribute
+                    to train, val, and test, but the device population
+                    is small (n = 2). Adding more hosts on the WireGuard
+                    mesh wouldn't change the split recipe but would make
+                    the dataset more representative of real-world
+                    hardware variety.</div>
                 </div>
               </div>
               <div class="motivation-card">
@@ -754,12 +766,12 @@
               <div class="motivation-card">
                 <div class="motivation-card-marker mc-armed"></div>
                 <div class="motivation-card-body">
-                  <div class="motivation-card-title">KNN cross-host gap</div>
+                  <div class="motivation-card-title">KNN val ↔ test gap</div>
                   <div class="motivation-card-text">KNN scores val
-                    macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on
-                    the held-out k-gamingcom. Instance-based memorization
-                    of the training host's feature space — informative
-                    as a baseline, but not a deployment candidate.</div>
+                    macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
+                    on held-out sample_names. Instance-based memorization
+                    of the specific training samples — informative as a
+                    baseline, not a deployment candidate.</div>
                 </div>
               </div>
             </div>
@@ -778,9 +790,12 @@
                     <strong>/proc-only telemetry</strong> can classify
                     workload phases at multi-class macro-F1 well above
                     chance.</li>
-                  <li>Held-out-<strong>by-host</strong> evaluation is the
-                    right generalization axis; held-out-by-sample
-                    overstates real fleet performance by 0.3+ F1.</li>
+                  <li>Held-out-by-<strong>sample</strong>,
+                    profile-stratified, is the right generalization
+                    axis: both fleet hosts contribute to all three
+                    slices, and the test set's
+                    <code>sample_name</code>s never appear during
+                    training.</li>
                   <li>The recurrent family (LSTM/GRU) and Transformer
                     sit on the upper-left of the
                     <strong>accuracy-vs-cost frontier</strong>; KNN and
@@ -835,6 +850,20 @@
         </div>
       </section>
 
+      <section class="scene" data-stage="collect">
+        <div class="prose">
+          <h2>Collecting the dataset</h2>
+          <p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
+            a profile-driven workload inside it, and samples
+            <code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
+            the labeled tarball is shipped to this Pi over mTLS.</p>
+          <p>The counter on the left is the running total, sourced from the
+            receiver's <code>index.jsonl</code> on disk. The sparkline is the
+            arrival rate over the last sixty seconds — proof that the deck
+            is reading live data, not a fixed slide.</p>
+        </div>
+      </section>
+
       <section class="scene" data-stage="motivation">
         <div class="prose">
           <h2>Why detect at all?</h2>
@@ -877,8 +906,8 @@
             trained on twelve channels of <code>/proc</code> telemetry
             classify five workload phases (clean / armed / infecting /
             infected_running / dormant) accurately enough to drive
-            automated containment, <em>and</em> generalize across hosts
-            and malware profiles it has never seen during training?</p>
+            automated containment, <em>and</em> generalize to malware
+            <code>sample_name</code>s it has never seen during training?</p>
           <p>The task is <strong>multi-class classification</strong>:
             the target is one of five mutually-exclusive phase labels.
             Not regression (no continuous target), not ranking
@@ -894,15 +923,16 @@
           <p>Literature on behaviour-based malware detection is rich but
             uneven. Most published results either (a) use richer
             telemetry than what a constrained host actually exports, or
-            (b) frame evaluation in ways that hide the cross-host
-            generalization problem. The card on the left summarises the
+            (b) frame evaluation in ways that hide same-sample overfit
+            (training and testing on the same malware instances). The card on the left summarises the
             gap.</p>
           <p>This project asks three concrete questions:</p>
           <p><strong>RQ1.</strong> How well can a per-window classifier
             identify workload phases from <code>/proc</code> alone, with
             no syscall traces and no kernel hooks?</p>
-          <p><strong>RQ2.</strong> Does the model still work when test
-            episodes come from a host the training set never saw?</p>
+          <p><strong>RQ2.</strong> Does the model still work on
+            <code>sample_name</code>s the training set never saw —
+            i.e., new instances of malware profiles it does know?</p>
           <p><strong>RQ3.</strong> Of the standard sequence-model
             families (RNN, GRU, LSTM, CNN, Transformer) plus a
             non-parametric baseline (KNN) and a tabular baseline
@@ -927,7 +957,7 @@
             into one shared training loop. KNN, GBT, MLP, CNN, RNN,
             GRU, LSTM, and Transformer all reuse the same standardization,
             schema-hashed checkpoint format, class-weighted CE loss,
-            and held-out-by-host evaluation — so the comparison is
+            and held-out-by-sample evaluation — so the comparison is
             genuinely apples-to-apples.</p>
           <p>The detector's per-window verdict feeds two downstream
             loops: a fleet-wide <strong>trust score</strong> that
@@ -953,19 +983,6 @@
         </div>
       </section>
 
-      <section class="scene" data-stage="collect">
-        <div class="prose">
-          <h2>Collecting the dataset</h2>
-          <p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
-            a profile-driven workload inside it, and samples
-            <code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
-            the labeled tarball is shipped to this Pi over mTLS.</p>
-          <p>The counter on the left is the running total, sourced from the
-            receiver's <code>index.jsonl</code> on disk. The sparkline is the
-            arrival rate over the last sixty seconds.</p>
-        </div>
-      </section>
-
       <section class="scene" data-stage="hosts">
         <div class="prose">
           <h2>A multi-host fleet</h2>
@@ -1044,11 +1061,15 @@
             split recipe, the primary metric, and what we measure next
             to accuracy. The temptation is to report a single big
             number; we report a number you can argue with.</p>
-          <p><strong>Held-out by host.</strong> Train and validate on
-            one machine; test on a different machine. A model that
-            wins by memorising the train host's idle profile loses
-            here, which is what you want — a fleet detector has to
-            generalize across hosts it never saw at training time.</p>
+          <p><strong>Held-out by <code>sample_name</code>,
+            profile-stratified.</strong> The fleet is uniform — every
+            host runs the same orchestrator and the same set of
+            profiles — so we don't split by device. Both hosts
+            contribute data to train, val, and test. What's held out is
+            specific malware <em>instances</em>: the
+            <code>sample_name</code>s in the test set never appear
+            during training. The model has to generalize to unseen
+            samples, not unseen devices.</p>
           <p><strong>Macro-F1, not accuracy.</strong> The dataset is
             heavily skewed: roughly half the labelled time is
             <code>infected_running</code> and only ~5 % is
@@ -1062,21 +1083,6 @@
         </div>
       </section>
 
-      <section class="scene" data-stage="models">
-        <div class="prose">
-          <h2>Sequence models</h2>
-          <p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
-            window one timestep at a time and carry state forward. Cheap,
-            mature, easy to interpret.</p>
-          <p><strong>BERT-style transformer</strong> — the window becomes a
-            sequence of "tokens"; attention captures cross-position context
-            instead of accumulating it through a hidden state. More
-            parameters, more compute, more room to overfit a small dataset.</p>
-          <p>Same input, same labels, four different inductive biases. The
-            comparison on the left is the punchline of the whole project.</p>
-        </div>
-      </section>
-
       <section class="scene" data-stage="training-code">
         <div class="prose">
           <h2>How we trained them</h2>
@@ -1093,6 +1099,21 @@
         </div>
       </section>
 
+      <section class="scene" data-stage="models">
+        <div class="prose">
+          <h2>Sequence models</h2>
+          <p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
+            window one timestep at a time and carry state forward. Cheap,
+            mature, easy to interpret.</p>
+          <p><strong>BERT-style transformer</strong> — the window becomes a
+            sequence of "tokens"; attention captures cross-position context
+            instead of accumulating it through a hidden state. More
+            parameters, more compute, more room to overfit a small dataset.</p>
+          <p>Same input, same labels, four different inductive biases. The
+            comparison on the left is the punchline of the whole project.</p>
+        </div>
+      </section>
+
       <section class="scene" data-stage="knn">
         <div class="prose">
           <h2>Nearest-neighbor as a sanity check</h2>
@@ -1157,11 +1178,13 @@
             trained on. Loading a model against a different schema
             fails fast. Without this, retroactive comparison silently
             scores models on misaligned columns and reports nonsense.</p>
-          <p><strong>Cross-host as the eval axis.</strong>
-            Held-out-by-host is reported as a first-class number
-            alongside held-out-by-sample — the two often disagree by
-            ~0.4 macro-F1, and only the cross-host number predicts
-            real fleet behaviour.</p>
+          <p><strong>Held-out-by-sample, profile-stratified.</strong>
+            Hosts in the fleet are uniform — same orchestrator, same
+            workload, just different production rates — so we split by
+            malware <code>sample_name</code> instead of by device. The
+            generalization claim is "unseen malware sample", tested on
+            the same population of hosts that contributed the training
+            data.</p>
         </div>
       </section>
 
@@ -1236,11 +1259,11 @@
             behaviours fall inside a single sample. Detection of
             millisecond-scale privilege checks would need faster
             telemetry than <code>/proc</code> provides.</p>
-          <p><strong>KNN cross-host gap.</strong> KNN scores val
-            macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
-            held-out one. Instance-based memorization of the training
-            host's feature space — informative as a baseline, not a
-            deployment candidate.</p>
+          <p><strong>KNN val ↔ test gap.</strong> KNN scores val
+            macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
+            held-out <code>sample_name</code>s. Instance-based
+            memorization of the specific training samples — informative
+            as a baseline, not a deployment candidate.</p>
         </div>
       </section>