deck: reorder + correct eval framing to held-out-by-sample
REORDER
- collect (big-number ingest counter) moved from #7 to #2: sits right after the title as the dataset-quantity hook
- training-code moved from #15 to #14: "how we trained" now appears before "what we got" (models accuracy bars)

EVAL FRAMING CORRECTION
The fleet hosts are uniform (every host runs every profile, just at different rates), so the actual split is held-out-by-sample (profile-stratified), NOT held-out-by-host. Both hosts contribute to train, val, AND test. The generalization claim is "unseen malware sample_name", not "unseen device".

Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was "cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" → "generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never saw" → "sample_names the training set never saw"); literature-gap bullet flipped from "cross-host generalization" to "sample-stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" → "held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts contribute to train/val/test"; "KNN cross-host gap" → "KNN val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as primary axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent db9f013969
commit 233390a40e
1 changed file with 131 additions and 108 deletions
@@ -161,7 +161,23 @@
</div>
|
||||
</div>
|
||||
|
||||
<!-- 2. motivation — what detection unlocks -->
|
||||
<!-- 2. collect — big-number hook right after the title -->
|
||||
<div class="stage-view" data-view="collect">
|
||||
<div class="metric-stack">
|
||||
<div class="metric-eyebrow">episodes ingested</div>
|
||||
<div class="metric-big" id="ingest-total">0</div>
|
||||
<div class="metric-sub">
|
||||
<span id="ingest-rate">0.0</span> / sec · last 60 s ·
|
||||
total bytes on disk: <span id="ingest-bytes">0 B</span>
|
||||
</div>
|
||||
<svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
|
||||
<path id="ingest-spark-fill" d=""></path>
|
||||
<path id="ingest-spark-path" d=""></path>
|
||||
</svg>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 3. motivation — what detection unlocks -->
|
||||
<div class="stage-view" data-view="motivation">
|
||||
<div class="metric-stack metric-stack-wide motivation-stack">
|
||||
<div class="metric-eyebrow">what detection unlocks</div>
|
||||
@@ -263,8 +279,9 @@
<ul class="research-list">
|
||||
<li><strong>/proc-only signal</strong> — most work
|
||||
assumes syscalls or kernel hooks</li>
|
||||
<li><strong>Cross-host generalization</strong> — eval
|
||||
splits often hide it (held-out by sample, not host)</li>
|
||||
<li><strong>Sample-stratified evaluation</strong> —
|
||||
papers often hide same-sample overfit by training
|
||||
and testing on the same malware instances</li>
|
||||
<li><strong>Real-time per-window classification</strong>
|
||||
for containment, not post-hoc batch labelling</li>
|
||||
<li><strong>Side-by-side cell-choice comparison</strong>
|
||||
@@ -309,7 +326,7 @@
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
|
||||
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
|
||||
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
|
||||
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">cross-host eval · class-weighted CE · early stop on val macro-F1</text>
|
||||
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
|
||||
</g>
|
||||
<g class="pipeline-stage">
|
||||
<rect x="60" y="350" width="200" height="60" rx="4"/>
|
||||
@@ -356,22 +373,6 @@
</div>
|
||||
</div>
|
||||
|
||||
<!-- 3. collect -->
|
||||
<div class="stage-view" data-view="collect">
|
||||
<div class="metric-stack">
|
||||
<div class="metric-eyebrow">episodes ingested</div>
|
||||
<div class="metric-big" id="ingest-total">0</div>
|
||||
<div class="metric-sub">
|
||||
<span id="ingest-rate">0.0</span> / sec · last 60 s ·
|
||||
total bytes on disk: <span id="ingest-bytes">0 B</span>
|
||||
</div>
|
||||
<svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
|
||||
<path id="ingest-spark-fill" d=""></path>
|
||||
<path id="ingest-spark-path" d=""></path>
|
||||
</svg>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 4. hosts -->
|
||||
<div class="stage-view" data-view="hosts">
|
||||
<div class="metric-stack">
|
||||
@@ -460,13 +461,18 @@
<div class="eval-block">
|
||||
<div class="eval-block-title">split recipe</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>train ∪ val:</strong> elliott-thinkpad</div>
|
||||
<div><strong>test:</strong> k-gamingcom</div>
|
||||
<div class="eval-detail">held-out by host so the test set
|
||||
measures cross-device generalization, not in-distribution
|
||||
self-prediction. A 90 % accuracy that comes from
|
||||
recognising the host's idle profile is worthless for
|
||||
a fleet detector.</div>
|
||||
<div><strong>train / val / test:</strong> held-out by
|
||||
<code>sample_name</code>, profile-stratified</div>
|
||||
<div><strong>both hosts</strong> contribute to all three slices</div>
|
||||
<div class="eval-detail">the fleet is uniform — every
|
||||
host runs the same orchestrator and every profile —
|
||||
so we don't split by host. We split by malware
|
||||
<code>sample_name</code>: the specific instances in
|
||||
the test set never appear during training.
|
||||
The generalization axis is "unseen malware", not
|
||||
"unseen device". Two profiles with only one sample
|
||||
(cpu-saturate, low-and-slow) are excluded from
|
||||
held-out-by-sample eval and reported separately.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="eval-block">
|
||||
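The split recipe introduced in this hunk (held out by `sample_name`, profile-stratified, both hosts in every slice) could be sketched roughly as below. This is a minimal illustration and not the project's actual code: the episode keys (`host`, `profile`, `sample_name`), the function name `split_by_sample`, the 20 % slice fractions, and the `min_samples=3` threshold are all assumptions. The deck itself only states that profiles with a single sample are excluded and reported separately.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, val_frac=0.2, test_frac=0.2, min_samples=3, seed=0):
    """Profile-stratified split where each sample_name lands in exactly one slice.

    `episodes` is a list of dicts with hypothetical keys "host", "profile"
    and "sample_name". The host is deliberately ignored when splitting.
    """
    # Collect the distinct sample_names seen under each profile.
    names_by_profile = defaultdict(set)
    for ep in episodes:
        names_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    slice_of = {}  # (profile, sample_name) -> "train" | "val" | "test"
    for profile in sorted(names_by_profile):
        names = sorted(names_by_profile[profile])
        if len(names) < min_samples:
            continue  # too few samples to hold one out; reported separately
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_frac))
        n_val = max(1, round(len(names) * val_frac))
        if n_test + n_val >= len(names):
            raise ValueError(f"fractions leave no training samples for {profile!r}")
        for i, name in enumerate(names):
            if i < n_test:
                slice_of[(profile, name)] = "test"
            elif i < n_test + n_val:
                slice_of[(profile, name)] = "val"
            else:
                slice_of[(profile, name)] = "train"

    # Route every episode to its sample's slice; unassigned profiles are excluded.
    splits = {"train": [], "val": [], "test": [], "excluded": []}
    for ep in episodes:
        key = (ep["profile"], ep["sample_name"])
        splits[slice_of.get(key, "excluded")].append(ep)
    return splits
```

Because the assignment is keyed on `sample_name` and never on `host`, episodes from both machines end up in whichever slice their sample was assigned to, which matches the "both hosts contribute to all three slices" line in the block above.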
@@ -495,25 +501,19 @@
<div class="eval-block-title">reported alongside accuracy</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>μs / window</strong> — inference cost at batch=64</div>
|
||||
<div><strong>cross-host gap</strong> — val − test macro-F1</div>
|
||||
<div class="eval-detail">latency translates to containment
|
||||
lag; the gap is the honest measure of generalization.
|
||||
Both are plotted on the perf scene.</div>
|
||||
<div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
|
||||
<div class="eval-detail">latency translates to
|
||||
containment lag; the val ↔ test gap is the honest
|
||||
measure of how much accuracy survives the move from
"samples we saw" to "samples we didn't". Both are
plotted on the perf scene.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 10. models -->
|
||||
<div class="stage-view" data-view="models">
|
||||
<div class="metric-stack">
|
||||
<div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
|
||||
<div class="model-bars" id="model-bars"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 10. training-code — how we trained the sequence models -->
|
||||
<!-- training-code — how we trained, before showing results -->
|
||||
<div class="stage-view" data-view="training-code">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">how we trained the sequence models</div>
|
||||
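The "val ↔ test gap" row in this hunk is defined as val macro-F1 minus test macro-F1. A dependency-free sketch of that computation follows. The five phase labels come from the deck's problem statement; the function names and the choice to score an entirely absent class as 0 are assumptions, and the project may well use a library metric instead.

```python
PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def macro_f1(y_true, y_pred, labels=PHASES):
    """Unweighted mean of per-class F1, so rare phases count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true and no predicted windows scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def val_test_gap(val_true, val_pred, test_true, test_pred, labels=PHASES):
    """val macro-F1 minus test macro-F1; a larger gap means less survives on unseen samples."""
    return macro_f1(val_true, val_pred, labels) - macro_f1(test_true, test_pred, labels)
```

Averaging per-class F1 without support weighting is what makes the metric sensitive to the rare phases: a model that always predicts the majority phase can reach high accuracy while its macro-F1 stays low.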
@@ -530,6 +530,14 @@
</div>
|
||||
</div>
|
||||
|
||||
<!-- models — accuracy bars (results after training-code) -->
|
||||
<div class="stage-view" data-view="models">
|
||||
<div class="metric-stack">
|
||||
<div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
|
||||
<div class="model-bars" id="model-bars"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 11. knn — interactive 3-D scatter with mode toggle -->
|
||||
<div class="stage-view" data-view="knn">
|
||||
<div class="metric-stack">
|
||||
@@ -622,12 +630,14 @@
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-recover"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">cross-host as the eval axis</div>
|
||||
<div class="motivation-card-text">Held-out-by-host
|
||||
is reported as a first-class number alongside
|
||||
held-out-by-sample. The two often disagree by 0.4
|
||||
macro-F1, and only the cross-host number predicts
|
||||
fleet behaviour.</div>
|
||||
<div class="motivation-card-title">held-out-by-sample as the eval axis</div>
|
||||
<div class="motivation-card-text">The hosts in the
|
||||
fleet are uniform — same orchestrator, same workload,
|
||||
different production rates. The generalization claim
|
||||
is therefore "unseen malware sample", tested on the
|
||||
same population of devices the training data came
|
||||
from. The split is profile-stratified, so every profile
is represented in train, val, and test.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
@@ -724,10 +734,12 @@
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">two-host fleet</div>
|
||||
<div class="motivation-card-text">Cross-host generalization
|
||||
is reported between exactly two machines
|
||||
(elliott-thinkpad → k-gamingcom). N-host claims need
|
||||
more hosts on the WireGuard mesh.</div>
|
||||
<div class="motivation-card-text">Both hosts contribute
|
||||
to train, val, and test, but the device population
|
||||
is small (n = 2). Adding more hosts on the WireGuard
|
||||
mesh wouldn't change the split recipe but would make
|
||||
the dataset more representative of real-world
|
||||
hardware variety.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
@@ -754,12 +766,12 @@
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">KNN cross-host gap</div>
|
||||
<div class="motivation-card-title">KNN val ↔ test gap</div>
|
||||
<div class="motivation-card-text">KNN scores val
|
||||
macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on
|
||||
the held-out k-gamingcom. Instance-based memorization
|
||||
of the training host's feature space — informative
|
||||
as a baseline, but not a deployment candidate.</div>
|
||||
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
|
||||
on held-out <code>sample_name</code>s. Instance-based memorization
|
||||
of the specific training samples — informative as a
|
||||
baseline, not a deployment candidate.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
@@ -778,9 +790,12 @@
<strong>/proc-only telemetry</strong> can classify
|
||||
workload phases at multi-class macro-F1 well above
|
||||
chance.</li>
|
||||
<li>Held-out-<strong>by-host</strong> evaluation is the
|
||||
right generalization axis; held-out-by-sample
|
||||
overstates real fleet performance by 0.3+ F1.</li>
|
||||
<li>Held-out-by-<strong>sample</strong>,
|
||||
profile-stratified, is the right generalization
|
||||
axis: both fleet hosts contribute to all three
|
||||
slices, and the test set's
|
||||
<code>sample_name</code>s never appear during
|
||||
training.</li>
|
||||
<li>The recurrent family (LSTM/GRU) and Transformer
|
||||
sit on the upper-left of the
|
||||
<strong>accuracy-vs-cost frontier</strong>; KNN and
|
||||
@@ -835,6 +850,20 @@
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="collect">
|
||||
<div class="prose">
|
||||
<h2>Collecting the dataset</h2>
|
||||
<p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
|
||||
a profile-driven workload inside it, and samples
<code>/proc/&lt;qemu_pid&gt;</code> at 10 Hz. Every ~30 seconds
|
||||
the labeled tarball is shipped to this Pi over mTLS.</p>
|
||||
<p>The counter on the left is the running total, sourced from the
|
||||
receiver's <code>index.jsonl</code> on disk. The sparkline is the
|
||||
arrival rate over the last sixty seconds — proof that the deck
|
||||
is reading live data, not a fixed slide.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="motivation">
|
||||
<div class="prose">
|
||||
<h2>Why detect at all?</h2>
|
||||
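The "Collecting the dataset" prose added in this hunk describes a running total read from the receiver's `index.jsonl` and a sparkline of the arrival rate over the last sixty seconds. One way that rate could be computed is sketched below. The `received_at` field name, the bucket count, and all function names are hypothetical, since the diff does not show the real index layout or the deck's script.

```python
import json


def load_arrivals(jsonl_lines, field="received_at"):
    """Parse one arrival timestamp (seconds) per non-empty JSON line.

    The field name is hypothetical; the real index.jsonl layout is not shown.
    """
    return [float(json.loads(line)[field]) for line in jsonl_lines if line.strip()]


def ingest_rate(arrivals, now, window_s=60.0):
    """Episodes per second over the half-open window (now - window_s, now]."""
    recent = sum(1 for t in arrivals if now - window_s < t <= now)
    return recent / window_s


def sparkline_counts(arrivals, now, window_s=60.0, n_buckets=12):
    """Arrival counts per equal-width bucket, oldest bucket first."""
    start = now - window_s
    width = window_s / n_buckets
    counts = [0] * n_buckets
    for t in arrivals:
        if start < t <= now:
            # Clamp so an arrival exactly at `now` falls in the last bucket.
            idx = min(int((t - start) / width), n_buckets - 1)
            counts[idx] += 1
    return counts
```

The bucket counts would map onto the `ingest-spark-path` points and the rate onto the `ingest-rate` span, under the assumption that the page polls the receiver for fresh data.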
@@ -877,8 +906,8 @@
trained on twelve channels of <code>/proc</code> telemetry
|
||||
classify five workload phases (clean / armed / infecting /
|
||||
infected_running / dormant) accurately enough to drive
|
||||
automated containment, <em>and</em> generalize across hosts
|
||||
and malware profiles it has never seen during training?</p>
|
||||
automated containment, <em>and</em> generalize to malware
|
||||
<code>sample_name</code>s it has never seen during training?</p>
|
||||
<p>The task is <strong>multi-class classification</strong>:
|
||||
the target is one of five mutually-exclusive phase labels.
|
||||
Not regression (no continuous target), not ranking
|
||||
@@ -894,15 +923,16 @@
<p>Literature on behaviour-based malware detection is rich but
|
||||
uneven. Most published results either (a) use richer
|
||||
telemetry than what a constrained host actually exports, or
|
||||
(b) frame evaluation in ways that hide the cross-host
|
||||
generalization problem. The card on the left summarises the
|
||||
(b) frame evaluation in ways that hide same-sample overfit
(training and testing on the same malware instances). The
card on the left summarises the gap.</p>
|
||||
<p>This project asks three concrete questions:</p>
|
||||
<p><strong>RQ1.</strong> How well can a per-window classifier
|
||||
identify workload phases from <code>/proc</code> alone, with
|
||||
no syscall traces and no kernel hooks?</p>
|
||||
<p><strong>RQ2.</strong> Does the model still work when test
|
||||
episodes come from a host the training set never saw?</p>
|
||||
<p><strong>RQ2.</strong> Does the model still work on
|
||||
<code>sample_name</code>s the training set never saw —
|
||||
i.e., new instances of malware profiles it does know?</p>
|
||||
<p><strong>RQ3.</strong> Of the standard sequence-model
|
||||
families (RNN, GRU, LSTM, CNN, Transformer) plus a
|
||||
non-parametric baseline (KNN) and a tabular baseline
|
||||
@@ -927,7 +957,7 @@
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
|
||||
GRU, LSTM, and Transformer all reuse the same standardization,
|
||||
schema-hashed checkpoint format, class-weighted CE loss,
|
||||
and held-out-by-host evaluation — so the comparison is
|
||||
and held-out-by-sample evaluation — so the comparison is
|
||||
genuinely apples-to-apples.</p>
|
||||
<p>The detector's per-window verdict feeds two downstream
|
||||
loops: a fleet-wide <strong>trust score</strong> that
|
||||
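This hunk's context mentions a class-weighted CE loss shared by every model. The deck does not say how the weights are derived, so the sketch below assumes the common "balanced" inverse-frequency scheme, w_c = N / (K · n_c), and a weighted mean normalised by the total weight. The function names and the dict-based probability format are illustrative only.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, classes):
    """w_c = N / (K * n_c): rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    # Assumption: every class occurs at least once in the training labels.
    return {c: n / (k * counts[c]) for c in classes}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight.

    `probs` is a list of dicts mapping class name to predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)
```

With weights like these, a mistake on a rare phase costs more than the same mistake on a common one, which pulls in the same direction as early stopping on val macro-F1.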
@@ -953,19 +983,6 @@
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="collect">
|
||||
<div class="prose">
|
||||
<h2>Collecting the dataset</h2>
|
||||
<p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
|
||||
a profile-driven workload inside it, and samples
<code>/proc/&lt;qemu_pid&gt;</code> at 10 Hz. Every ~30 seconds
|
||||
the labeled tarball is shipped to this Pi over mTLS.</p>
|
||||
<p>The counter on the left is the running total, sourced from the
|
||||
receiver's <code>index.jsonl</code> on disk. The sparkline is the
|
||||
arrival rate over the last sixty seconds.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="hosts">
|
||||
<div class="prose">
|
||||
<h2>A multi-host fleet</h2>
|
||||
@@ -1044,11 +1061,15 @@
split recipe, the primary metric, and what we measure next
|
||||
to accuracy. The temptation is to report a single big
|
||||
number; we report a number you can argue with.</p>
|
||||
<p><strong>Held-out by host.</strong> Train and validate on
|
||||
one machine; test on a different machine. A model that
|
||||
wins by memorising the train host's idle profile loses
|
||||
here, which is what you want — a fleet detector has to
|
||||
generalize across hosts it never saw at training time.</p>
|
||||
<p><strong>Held-out by <code>sample_name</code>,
|
||||
profile-stratified.</strong> The fleet is uniform — every
|
||||
host runs the same orchestrator and the same set of
|
||||
profiles — so we don't split by device. Both hosts
|
||||
contribute data to train, val, and test. What's held out is
|
||||
specific malware <em>instances</em>: the
|
||||
<code>sample_name</code>s in the test set never appear
|
||||
during training. The model has to generalize to unseen
|
||||
samples, not unseen devices.</p>
|
||||
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
|
||||
heavily skewed: roughly half the labelled time is
|
||||
<code>infected_running</code> and only ~5 % is
|
||||
@@ -1062,21 +1083,6 @@
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="models">
|
||||
<div class="prose">
|
||||
<h2>Sequence models</h2>
|
||||
<p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
|
||||
window one timestep at a time and carry state forward. Cheap,
|
||||
mature, easy to interpret.</p>
|
||||
<p><strong>BERT-style transformer</strong> — the window becomes a
|
||||
sequence of "tokens"; attention captures cross-position context
|
||||
instead of accumulating it through a hidden state. More
|
||||
parameters, more compute, more room to overfit a small dataset.</p>
|
||||
<p>Same input, same labels, four different inductive biases. The
|
||||
comparison on the left is the punchline of the whole project.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="training-code">
|
||||
<div class="prose">
|
||||
<h2>How we trained them</h2>
|
||||
@@ -1093,6 +1099,21 @@
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="models">
|
||||
<div class="prose">
|
||||
<h2>Sequence models</h2>
|
||||
<p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
|
||||
window one timestep at a time and carry state forward. Cheap,
|
||||
mature, easy to interpret.</p>
|
||||
<p><strong>BERT-style transformer</strong> — the window becomes a
|
||||
sequence of "tokens"; attention captures cross-position context
|
||||
instead of accumulating it through a hidden state. More
|
||||
parameters, more compute, more room to overfit a small dataset.</p>
|
||||
<p>Same input, same labels, four different inductive biases. The
|
||||
comparison on the left is the punchline of the whole project.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="knn">
|
||||
<div class="prose">
|
||||
<h2>Nearest-neighbor as a sanity check</h2>
|
||||
@@ -1157,11 +1178,13 @@
trained on. Loading a model against a different schema
|
||||
fails fast. Without this, retroactive comparison silently
|
||||
scores models on misaligned columns and reports nonsense.</p>
|
||||
<p><strong>Cross-host as the eval axis.</strong>
|
||||
Held-out-by-host is reported as a first-class number
|
||||
alongside held-out-by-sample — the two often disagree by
|
||||
~0.4 macro-F1, and only the cross-host number predicts
|
||||
real fleet behaviour.</p>
|
||||
<p><strong>Held-out-by-sample, profile-stratified.</strong>
|
||||
Hosts in the fleet are uniform — same orchestrator, same
|
||||
workload, just different production rates — so we split by
|
||||
malware <code>sample_name</code> instead of by device. The
|
||||
generalization claim is "unseen malware sample", tested on
|
||||
the same population of hosts that contributed the training
|
||||
data.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
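The context lines at the top of this hunk describe schema-hashed checkpoints: loading a model against a different schema fails fast instead of silently scoring misaligned columns. A minimal sketch of that idea is below. The hashing scheme (SHA-256 over the ordered channel names), the `schema_hash` checkpoint key, and the exception name are assumptions, since the diff does not show the real checkpoint format.

```python
import hashlib
import json


class SchemaMismatchError(ValueError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(channels):
    """SHA-256 over the ordered channel names; reordering changes the hash."""
    payload = json.dumps(list(channels), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_schema(checkpoint, channels):
    """Fail fast if the checkpoint's stored hash does not match the live schema."""
    actual = schema_hash(channels)
    expected = checkpoint["schema_hash"]  # hypothetical checkpoint key
    if actual != expected:
        raise SchemaMismatchError(
            f"checkpoint schema {expected[:12]} != current schema {actual[:12]}"
        )
```

Hashing the ordered list rather than a set is deliberate in this sketch: two schemas with the same channels in a different column order would otherwise compare equal, which is exactly the misalignment the check is meant to catch.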
@@ -1236,11 +1259,11 @@
behaviours fall inside a single sample. Detection of
|
||||
millisecond-scale privilege checks would need faster
|
||||
telemetry than <code>/proc</code> provides.</p>
|
||||
<p><strong>KNN cross-host gap.</strong> KNN scores val
|
||||
macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
|
||||
held-out one. Instance-based memorization of the training
|
||||
host's feature space — informative as a baseline, not a
|
||||
deployment candidate.</p>
|
||||
<p><strong>KNN val ↔ test gap.</strong> KNN scores val
|
||||
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
|
||||
held-out <code>sample_name</code>s. Instance-based
|
||||
memorization of the specific training samples — informative
|
||||
as a baseline, not a deployment candidate.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
|