deck: reorder + correct eval framing to held-out-by-sample

REORDER
- collect (big-number ingest counter) moved from #7 to #2 — sits
  right after the title as the dataset-quantity hook
- training-code moved from #15 to #14 — "how we trained" now
  appears before "what we got" (models accuracy bars)

EVAL FRAMING CORRECTION
The fleet hosts are uniform — every host runs every profile, just
at different rates — so the actual split is held-out-by-sample
(profile-stratified), NOT held-out-by-host. Both hosts contribute
to train, val, AND test. The generalization claim is "unseen
malware sample_name", not "unseen device".
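
A minimal Python sketch of the split described above, not the project's
actual code. It assumes each episode is a dict with `host`, `profile`,
and `sample_name` keys. The function name, the fraction defaults, and
the placeholder profile and sample names are hypothetical; only the two
host names and `cpu-saturate` come from this change.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign episodes to train/val/test by sample_name, one profile at a time.

    The host field is never consulted, so every host can land in every slice.
    """
    # Collect the distinct sample_names seen under each profile.
    names_by_profile = defaultdict(set)
    for ep in episodes:
        names_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    slice_of = {}  # (profile, sample_name) -> slice label
    for profile in sorted(names_by_profile):
        names = sorted(names_by_profile[profile])
        if len(names) < 2:
            # A single sample cannot be both trained on and held out.
            for name in names:
                slice_of[(profile, name)] = "excluded"
            continue
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_frac))
        # Assumption: a two-sample profile skips val so train is never empty.
        n_val = max(1, round(len(names) * val_frac)) if len(names) >= 3 else 0
        for i, name in enumerate(names):
            if i < n_test:
                slice_of[(profile, name)] = "test"
            elif i < n_test + n_val:
                slice_of[(profile, name)] = "val"
            else:
                slice_of[(profile, name)] = "train"

    splits = {"train": [], "val": [], "test": [], "excluded": []}
    for ep in episodes:
        splits[slice_of[(ep["profile"], ep["sample_name"])]].append(ep)
    return splits


# Toy data: "profile-a" and all sample names are placeholders.
hosts = ["elliott-thinkpad", "k-gamingcom"]
catalog = {"profile-a": ["a1", "a2", "a3", "a4"], "cpu-saturate": ["cs-1"]}
episodes = [
    {"host": h, "profile": p, "sample_name": s}
    for h in hosts
    for p, names in catalog.items()
    for s in names
]
splits = split_by_sample(episodes)
```

Because the slice is chosen per `(profile, sample_name)` pair and every
sample in the toy data runs on both hosts, each of train, val, and test
ends up with episodes from both machines while sharing no sample.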

Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was
  "cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" →
  "generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never
  saw" → "sample_names the training set never saw"); literature-gap
  bullet flipped from "cross-host generalization" to "sample-
  stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" →
  "held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts
  contribute to train/val/test"; "KNN cross-host gap" → "KNN
  val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as
  primary axis
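
The val ↔ test gap named in the list above can be sketched as follows.
This is an illustrative Python sketch, not the project's code: the
function names are hypothetical, the gap is assumed to be val macro-F1
minus test macro-F1, and a class with no true or predicted windows is
assumed to score 0.

```python
PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def macro_f1(y_true, y_pred, labels=PHASES):
    """Unweighted mean of per-class F1, so rare phases count as much as common ones."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never appears and is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def val_test_gap(val_true, val_pred, test_true, test_pred, labels=PHASES):
    """Positive when the model does better on val than on held-out samples."""
    return macro_f1(val_true, val_pred, labels) - macro_f1(test_true, test_pred, labels)
```

A gap near zero means the score carries over to unseen `sample_name`s;
a large positive gap is the memorization pattern described for KNN.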

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max Gorog 2026-05-08 15:59:22 -05:00
parent db9f013969
commit 233390a40e


@ -161,7 +161,23 @@
</div>
</div>
<!-- 2. motivation — what detection unlocks -->
<!-- 2. collect — big-number hook right after the title -->
<div class="stage-view" data-view="collect">
<div class="metric-stack">
<div class="metric-eyebrow">episodes ingested</div>
<div class="metric-big" id="ingest-total">0</div>
<div class="metric-sub">
<span id="ingest-rate">0.0</span> / sec · last 60 s ·
total bytes on disk: <span id="ingest-bytes">0 B</span>
</div>
<svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
<path id="ingest-spark-fill" d=""></path>
<path id="ingest-spark-path" d=""></path>
</svg>
</div>
</div>
<!-- 3. motivation — what detection unlocks -->
<div class="stage-view" data-view="motivation">
<div class="metric-stack metric-stack-wide motivation-stack">
<div class="metric-eyebrow">what detection unlocks</div>
@ -263,8 +279,9 @@
<ul class="research-list">
<li><strong>/proc-only signal</strong> — most work
assumes syscalls or kernel hooks</li>
<li><strong>Cross-host generalization</strong> — eval
splits often hide it (held-out by sample, not host)</li>
<li><strong>Sample-stratified evaluation</strong> —
papers often hide same-sample overfit by training
and testing on the same malware instances</li>
<li><strong>Real-time per-window classification</strong>
for containment, not post-hoc batch labelling</li>
<li><strong>Side-by-side cell-choice comparison</strong>
@ -309,7 +326,7 @@
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">cross-host eval · class-weighted CE · early stop on val macro-F1</text>
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
</g>
<g class="pipeline-stage">
<rect x="60" y="350" width="200" height="60" rx="4"/>
@ -356,22 +373,6 @@
</div>
</div>
<!-- 3. collect -->
<div class="stage-view" data-view="collect">
<div class="metric-stack">
<div class="metric-eyebrow">episodes ingested</div>
<div class="metric-big" id="ingest-total">0</div>
<div class="metric-sub">
<span id="ingest-rate">0.0</span> / sec · last 60 s ·
total bytes on disk: <span id="ingest-bytes">0 B</span>
</div>
<svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
<path id="ingest-spark-fill" d=""></path>
<path id="ingest-spark-path" d=""></path>
</svg>
</div>
</div>
<!-- 4. hosts -->
<div class="stage-view" data-view="hosts">
<div class="metric-stack">
@ -460,13 +461,18 @@
<div class="eval-block">
<div class="eval-block-title">split recipe</div>
<div class="eval-block-body">
<div><strong>train + val:</strong> elliott-thinkpad</div>
<div><strong>test:</strong> k-gamingcom</div>
<div class="eval-detail">held-out by host so the test set
measures cross-device generalization, not in-distribution
self-prediction. A 90 % accuracy that comes from
recognising the host's idle profile is worthless for
a fleet detector.</div>
<div><strong>train / val / test:</strong> held-out by
<code>sample_name</code>, profile-stratified</div>
<div><strong>both hosts</strong> contribute to all three slices</div>
<div class="eval-detail">the fleet is uniform — every
host runs the same orchestrator and every profile —
so we don't split by host. We split by malware
<code>sample_name</code>: the specific instances in
the test set never appear during training.
Generalization axis is "unseen malware", not
"unseen device". Two profiles with only one sample
(cpu-saturate, low-and-slow) are excluded from
held-out-by-sample eval and reported separately.</div>
</div>
</div>
<div class="eval-block">
@ -495,25 +501,19 @@
<div class="eval-block-title">reported alongside accuracy</div>
<div class="eval-block-body">
<div><strong>μs / window</strong> — inference cost at batch=64</div>
<div><strong>cross-host gap</strong> — val − test macro-F1</div>
<div class="eval-detail">latency translates to containment
lag; the gap is the honest measure of generalization.
Both are plotted on the perf scene.</div>
<div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
<div class="eval-detail">latency translates to
containment lag; the val ↔ test gap is the honest
measure of how much accuracy survives the move from
"samples we saw" to "samples we didn't". Both are
plotted on the perf scene.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 10. models -->
<div class="stage-view" data-view="models">
<div class="metric-stack">
<div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
<div class="model-bars" id="model-bars"></div>
</div>
</div>
<!-- 10. training-code — how we trained the sequence models -->
<!-- training-code — how we trained, before showing results -->
<div class="stage-view" data-view="training-code">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">how we trained the sequence models</div>
@ -530,6 +530,14 @@
</div>
</div>
<!-- models — accuracy bars (results after training-code) -->
<div class="stage-view" data-view="models">
<div class="metric-stack">
<div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
<div class="model-bars" id="model-bars"></div>
</div>
</div>
<!-- 11. knn — interactive 3-D scatter with mode toggle -->
<div class="stage-view" data-view="knn">
<div class="metric-stack">
@ -622,12 +630,14 @@
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">cross-host as the eval axis</div>
<div class="motivation-card-text">Held-out-by-host
is reported as a first-class number alongside
held-out-by-sample. The two often disagree by 0.4
macro-F1, and only the cross-host number predicts
fleet behaviour.</div>
<div class="motivation-card-title">held-out-by-sample as the eval axis</div>
<div class="motivation-card-text">The hosts in the
fleet are uniform — same orchestrator, same workload,
different production rates. The generalization claim
is therefore "unseen malware sample", tested on the
same population of devices the training data came
from. Profile-stratified so every profile gets fair
train/val/test cells.</div>
</div>
</div>
</div>
@ -724,10 +734,12 @@
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">two-host fleet</div>
<div class="motivation-card-text">Cross-host generalization
is reported between exactly two machines
(elliott-thinkpad → k-gamingcom). N-host claims need
more hosts on the WireGuard mesh.</div>
<div class="motivation-card-text">Both hosts contribute
to train, val, and test, but the device population
is small (n = 2). Adding more hosts on the WireGuard
mesh wouldn't change the split recipe but would make
the dataset more representative of real-world
hardware variety.</div>
</div>
</div>
<div class="motivation-card">
@ -754,12 +766,12 @@
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">KNN cross-host gap</div>
<div class="motivation-card-title">KNN val ↔ test gap</div>
<div class="motivation-card-text">KNN scores val
macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on
the held-out k-gamingcom. Instance-based memorization
of the training host's feature space — informative
as a baseline, but not a deployment candidate.</div>
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
on held-out sample_names. Instance-based memorization
of the specific training samples — informative as a
baseline, not a deployment candidate.</div>
</div>
</div>
</div>
@ -778,9 +790,12 @@
<strong>/proc-only telemetry</strong> can classify
workload phases at multi-class macro-F1 well above
chance.</li>
<li>Held-out-<strong>by-host</strong> evaluation is the
right generalization axis; held-out-by-sample
overstates real fleet performance by 0.3+ F1.</li>
<li>Held-out-by-<strong>sample</strong>,
profile-stratified, is the right generalization
axis: both fleet hosts contribute to all three
slices, and the test set's
<code>sample_name</code>s never appear during
training.</li>
<li>The recurrent family (LSTM/GRU) and Transformer
sit on the upper-left of the
<strong>accuracy-vs-cost frontier</strong>; KNN and
@ -835,6 +850,20 @@
</div>
</section>
<section class="scene" data-stage="collect">
<div class="prose">
<h2>Collecting the dataset</h2>
<p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
a profile-driven workload inside it, and samples
<code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
the labeled tarball is shipped to this Pi over mTLS.</p>
<p>The counter on the left is the running total, sourced from the
receiver's <code>index.jsonl</code> on disk. The sparkline is the
arrival rate over the last sixty seconds — proof that the deck
is reading live data, not a fixed slide.</p>
</div>
</section>
<section class="scene" data-stage="motivation">
<div class="prose">
<h2>Why detect at all?</h2>
@ -877,8 +906,8 @@
trained on twelve channels of <code>/proc</code> telemetry
classify five workload phases (clean / armed / infecting /
infected_running / dormant) accurately enough to drive
automated containment, <em>and</em> generalize across hosts
and malware profiles it has never seen during training?</p>
automated containment, <em>and</em> generalize to malware
<code>sample_name</code>s it has never seen during training?</p>
<p>The task is <strong>multi-class classification</strong>:
the target is one of five mutually-exclusive phase labels.
Not regression (no continuous target), not ranking
@ -894,15 +923,16 @@
<p>Literature on behaviour-based malware detection is rich but
uneven. Most published results either (a) use richer
telemetry than what a constrained host actually exports, or
(b) frame evaluation in ways that hide the cross-host
generalization problem. The card on the left summarises the
(b) frame evaluation in ways that hide same-sample overfit
(training and testing on the same malware instances). The card on the left summarises the
gap.</p>
<p>This project asks three concrete questions:</p>
<p><strong>RQ1.</strong> How well can a per-window classifier
identify workload phases from <code>/proc</code> alone, with
no syscall traces and no kernel hooks?</p>
<p><strong>RQ2.</strong> Does the model still work when test
episodes come from a host the training set never saw?</p>
<p><strong>RQ2.</strong> Does the model still work on
<code>sample_name</code>s the training set never saw —
i.e., new instances of malware profiles it does know?</p>
<p><strong>RQ3.</strong> Of the standard sequence-model
families (RNN, GRU, LSTM, CNN, Transformer) plus a
non-parametric baseline (KNN) and a tabular baseline
@ -927,7 +957,7 @@
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
GRU, LSTM, and Transformer all reuse the same standardization,
schema-hashed checkpoint format, class-weighted CE loss,
and held-out-by-host evaluation — so the comparison is
and held-out-by-sample evaluation — so the comparison is
genuinely apples-to-apples.</p>
<p>The detector's per-window verdict feeds two downstream
loops: a fleet-wide <strong>trust score</strong> that
@ -953,19 +983,6 @@
</div>
</section>
<section class="scene" data-stage="collect">
<div class="prose">
<h2>Collecting the dataset</h2>
<p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
a profile-driven workload inside it, and samples
<code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
the labeled tarball is shipped to this Pi over mTLS.</p>
<p>The counter on the left is the running total, sourced from the
receiver's <code>index.jsonl</code> on disk. The sparkline is the
arrival rate over the last sixty seconds.</p>
</div>
</section>
<section class="scene" data-stage="hosts">
<div class="prose">
<h2>A multi-host fleet</h2>
@ -1044,11 +1061,15 @@
split recipe, the primary metric, and what we measure next
to accuracy. The temptation is to report a single big
number; we report a number you can argue with.</p>
<p><strong>Held-out by host.</strong> Train and validate on
one machine; test on a different machine. A model that
wins by memorising the train host's idle profile loses
here, which is what you want — a fleet detector has to
generalize across hosts it never saw at training time.</p>
<p><strong>Held-out by <code>sample_name</code>,
profile-stratified.</strong> The fleet is uniform — every
host runs the same orchestrator and the same set of
profiles — so we don't split by device. Both hosts
contribute data to train, val, and test. What's held out is
specific malware <em>instances</em>: the
<code>sample_name</code>s in the test set never appear
during training. The model has to generalize to unseen
samples, not unseen devices.</p>
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
heavily skewed: roughly half the labelled time is
<code>infected_running</code> and only ~5 % is
@ -1062,21 +1083,6 @@
</div>
</section>
<section class="scene" data-stage="models">
<div class="prose">
<h2>Sequence models</h2>
<p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
window one timestep at a time and carry state forward. Cheap,
mature, easy to interpret.</p>
<p><strong>BERT-style transformer</strong> — the window becomes a
sequence of "tokens"; attention captures cross-position context
instead of accumulating it through a hidden state. More
parameters, more compute, more room to overfit a small dataset.</p>
<p>Same input, same labels, four different inductive biases. The
comparison on the left is the punchline of the whole project.</p>
</div>
</section>
<section class="scene" data-stage="training-code">
<div class="prose">
<h2>How we trained them</h2>
@ -1093,6 +1099,21 @@
</div>
</section>
<section class="scene" data-stage="models">
<div class="prose">
<h2>Sequence models</h2>
<p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
window one timestep at a time and carry state forward. Cheap,
mature, easy to interpret.</p>
<p><strong>BERT-style transformer</strong> — the window becomes a
sequence of "tokens"; attention captures cross-position context
instead of accumulating it through a hidden state. More
parameters, more compute, more room to overfit a small dataset.</p>
<p>Same input, same labels, four different inductive biases. The
comparison on the left is the punchline of the whole project.</p>
</div>
</section>
<section class="scene" data-stage="knn">
<div class="prose">
<h2>Nearest-neighbor as a sanity check</h2>
@ -1157,11 +1178,13 @@
trained on. Loading a model against a different schema
fails fast. Without this, retroactive comparison silently
scores models on misaligned columns and reports nonsense.</p>
<p><strong>Cross-host as the eval axis.</strong>
Held-out-by-host is reported as a first-class number
alongside held-out-by-sample — the two often disagree by
~0.4 macro-F1, and only the cross-host number predicts
real fleet behaviour.</p>
<p><strong>Held-out-by-sample, profile-stratified.</strong>
Hosts in the fleet are uniform — same orchestrator, same
workload, just different production rates — so we split by
malware <code>sample_name</code> instead of by device. The
generalization claim is "unseen malware sample", tested on
the same population of hosts that contributed the training
data.</p>
</div>
</section>
@ -1236,11 +1259,11 @@
behaviours fall inside a single sample. Detection of
millisecond-scale privilege checks would need faster
telemetry than <code>/proc</code> provides.</p>
<p><strong>KNN cross-host gap.</strong> KNN scores val
macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
held-out one. Instance-based memorization of the training
host's feature space — informative as a baseline, not a
deployment candidate.</p>
<p><strong>KNN val ↔ test gap.</strong> KNN scores val
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
held-out <code>sample_name</code>s. Instance-based
memorization of the specific training samples — informative
as a baseline, not a deployment candidate.</p>
</div>
</section>