deck: remove the nine inserted scenes
Per the user's request — the rubric-derived scenes I added in one sweep weren't tied closely enough to their actual project narrative and ate up presentation time. Reverting to the pre-insertion deck: removed problem-statement / research-questions / solution-overview / evaluation-setup / theoretical / practical / design-principles / limitations / conclusion-future kept (user-requested earlier in the session) motivation (with the IEEE 9881803 citation) live (A100 inference scene) CSS rules and references/* sidecar files for the removed scenes are left in place as harmless dead code; they can be cleaned up later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
ed5f729ff0
commit
53d2b80009
1 changed files with 2 additions and 667 deletions
|
|
@ -219,144 +219,7 @@
|
|||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 3. problem-statement — what we're solving + task type -->
|
||||
<div class="stage-view" data-view="problem-statement">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">the problem · single sentence + numbers</div>
|
||||
<div class="problem-claim">
|
||||
<div class="problem-claim-text">Classify each ten-second window of fleet
|
||||
<code>/proc</code> telemetry into one of five workload phases —
|
||||
accurately enough to drive automated containment.</div>
|
||||
</div>
|
||||
<div class="problem-stats">
|
||||
<div class="problem-stat">
|
||||
<div class="problem-stat-num">5</div>
|
||||
<div class="problem-stat-lbl">phase classes<br><code>clean</code> → <code>infected_running</code></div>
|
||||
</div>
|
||||
<div class="problem-stat">
|
||||
<div class="problem-stat-num">12</div>
|
||||
<div class="problem-stat-lbl"><code>/proc</code> channels<br>no syscalls, no kernel hooks</div>
|
||||
</div>
|
||||
<div class="problem-stat">
|
||||
<div class="problem-stat-num">10s</div>
|
||||
<div class="problem-stat-lbl">classification window<br>100 samples × 12 channels</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="problem-task">
|
||||
<span class="problem-task-label">task type:</span>
|
||||
<span class="problem-task-value">multi-class classification</span>
|
||||
<span class="problem-task-detail">— five mutually-exclusive
|
||||
phase labels, balanced via class-weighted cross-entropy.
|
||||
Not regression (no continuous target), not ranking
|
||||
(downstream policy is a categorical containment decision).</span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 4. research-questions — literature gaps and questions -->
|
||||
<div class="stage-view" data-view="research-questions">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">literature gaps · positioning the work</div>
|
||||
<div class="research-grid">
|
||||
<div class="research-col">
|
||||
<div class="research-col-title">what prior work covers</div>
|
||||
<ul class="research-list">
|
||||
<li><strong>LSTM on syscall traces</strong> in VMs —
|
||||
deeper telemetry than <code>/proc</code></li>
|
||||
<li><strong>Transformer on per-process resource metrics</strong>
|
||||
— related signal, single-host eval</li>
|
||||
<li><strong>BERT on system logs</strong> (LogBERT) —
|
||||
text-form telemetry, not numeric channels</li>
|
||||
<li><strong>Insider-threat LSTM on event logs</strong>
|
||||
(DANTE) — categorical events, not continuous</li>
|
||||
<li><strong>Network-behaviour trust establishment</strong>
|
||||
(IEEE 9881803) — cross-device aggregation,
|
||||
not per-host classifier</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="research-col">
|
||||
<div class="research-col-title">what's missing</div>
|
||||
<ul class="research-list">
|
||||
<li><strong>/proc-only signal</strong> — most work
|
||||
assumes syscalls or kernel hooks</li>
|
||||
<li><strong>Sample-stratified evaluation</strong> —
|
||||
papers often hide same-sample overfit by training
|
||||
and testing on the same malware instances</li>
|
||||
<li><strong>Real-time per-window classification</strong>
|
||||
for containment, not post-hoc batch labelling</li>
|
||||
<li><strong>Side-by-side cell-choice comparison</strong>
|
||||
(RNN/GRU/LSTM/CNN/Transformer) on one dataset</li>
|
||||
<li><strong>Direct integration</strong> with a
|
||||
fleet-wide trust score, not standalone output</li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 5. solution-overview — pipeline block diagram -->
|
||||
<div class="stage-view" data-view="solution-overview">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">pipeline · what each stage produces</div>
|
||||
<svg class="pipeline-svg" viewBox="0 0 800 480"
|
||||
xmlns="http://www.w3.org/2000/svg"
|
||||
preserveAspectRatio="xMidYMid meet">
|
||||
<g class="pipeline-stage">
|
||||
<rect x="20" y="40" width="140" height="60" rx="4"/>
|
||||
<text x="90" y="68" text-anchor="middle">fleet hosts</text>
|
||||
<text x="90" y="86" text-anchor="middle" class="pipeline-detail">/proc · 10 Hz</text>
|
||||
</g>
|
||||
<g class="pipeline-stage">
|
||||
<rect x="200" y="40" width="140" height="60" rx="4"/>
|
||||
<text x="270" y="68" text-anchor="middle">receiver (Pi)</text>
|
||||
<text x="270" y="86" text-anchor="middle" class="pipeline-detail">bearer auth</text>
|
||||
</g>
|
||||
<g class="pipeline-stage">
|
||||
<rect x="380" y="40" width="140" height="60" rx="4"/>
|
||||
<text x="450" y="68" text-anchor="middle">episode store</text>
|
||||
<text x="450" y="86" text-anchor="middle" class="pipeline-detail">zstd · tar</text>
|
||||
</g>
|
||||
<g class="pipeline-stage">
|
||||
<rect x="560" y="40" width="220" height="60" rx="4"/>
|
||||
<text x="670" y="68" text-anchor="middle">windowing + features</text>
|
||||
<text x="670" y="86" text-anchor="middle" class="pipeline-detail">10 s · 100 samples × 12 ch</text>
|
||||
</g>
|
||||
<g class="pipeline-stage pipeline-stage-models">
|
||||
<rect x="180" y="170" width="440" height="120" rx="4"/>
|
||||
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
|
||||
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
|
||||
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
|
||||
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
|
||||
</g>
|
||||
<g class="pipeline-stage">
|
||||
<rect x="60" y="350" width="200" height="60" rx="4"/>
|
||||
<text x="160" y="378" text-anchor="middle">per-window phase</text>
|
||||
<text x="160" y="396" text-anchor="middle" class="pipeline-detail">5-class softmax</text>
|
||||
</g>
|
||||
<g class="pipeline-stage pipeline-stage-final">
|
||||
<rect x="300" y="350" width="200" height="60" rx="4"/>
|
||||
<text x="400" y="378" text-anchor="middle">trust score</text>
|
||||
<text x="400" y="396" text-anchor="middle" class="pipeline-detail">+ network signals (9881803)</text>
|
||||
</g>
|
||||
<g class="pipeline-stage pipeline-stage-final">
|
||||
<rect x="540" y="350" width="220" height="60" rx="4"/>
|
||||
<text x="650" y="378" text-anchor="middle">containment + reset</text>
|
||||
<text x="650" y="396" text-anchor="middle" class="pipeline-detail">snapshot rollback</text>
|
||||
</g>
|
||||
<g class="pipeline-arrow" fill="none">
|
||||
<path d="M160 70 L200 70" />
|
||||
<path d="M340 70 L380 70" />
|
||||
<path d="M520 70 L560 70" />
|
||||
<path d="M670 100 L670 130 L400 130 L400 170" />
|
||||
<path d="M400 290 L400 320 L160 320 L160 350" />
|
||||
<path d="M260 380 L300 380" />
|
||||
<path d="M500 380 L540 380" />
|
||||
</g>
|
||||
</svg>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 6. stack — Python stack & libraries used in the project -->
|
||||
<!-- stack — Python stack & libraries used in the project -->
|
||||
<div class="stage-view" data-view="stack">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">the stack behind the live data on the right</div>
|
||||
|
|
@ -453,66 +316,6 @@
|
|||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 9. evaluation-setup — splits, metrics, baselines -->
|
||||
<div class="stage-view" data-view="evaluation-setup">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">evaluation setup · how the numbers get made</div>
|
||||
<div class="eval-blocks">
|
||||
<div class="eval-block">
|
||||
<div class="eval-block-title">split recipe</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>train / val / test:</strong> held-out by
|
||||
<code>sample_name</code>, profile-stratified</div>
|
||||
<div><strong>both hosts</strong> contribute to all three slices</div>
|
||||
<div class="eval-detail">the fleet is uniform — every
|
||||
host runs the same orchestrator and every profile —
|
||||
so we don't split by host. We split by malware
|
||||
<code>sample_name</code>: the specific instances in
|
||||
the test set never appear during training.
|
||||
Generalization axis is "unseen malware", not
|
||||
"unseen device". Two profiles with only one sample
|
||||
(cpu-saturate, low-and-slow) are excluded from
|
||||
held-out-by-sample eval and reported separately.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="eval-block">
|
||||
<div class="eval-block-title">primary metric</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>macro-F1</strong> averaged across the five phases</div>
|
||||
<div class="eval-detail">accuracy lies under class
|
||||
imbalance — ~50 % <code>infected_running</code>,
|
||||
~5 % <code>armed</code>. A constant majority predictor
|
||||
hits 0.5 accuracy. macro-F1 averages per-class F1,
|
||||
so rare phases actually count toward the score.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="eval-block">
|
||||
<div class="eval-block-title">baselines compared</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>KNN</strong> — non-parametric, instance-based</div>
|
||||
<div><strong>GBT (XGBoost)</strong> — tabular non-NN</div>
|
||||
<div><strong>MLP</strong> — feedforward ablation</div>
|
||||
<div><strong>CNN</strong> — local-pattern ablation</div>
|
||||
<div><strong>RNN / GRU / LSTM</strong> — recurrent family</div>
|
||||
<div><strong>Transformer</strong> — attention</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="eval-block">
|
||||
<div class="eval-block-title">reported alongside accuracy</div>
|
||||
<div class="eval-block-body">
|
||||
<div><strong>μs / window</strong> — inference cost at batch=64</div>
|
||||
<div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
|
||||
<div class="eval-detail">latency translates to
|
||||
containment lag; the val ↔ test gap is the honest
|
||||
measure of how much accuracy survives the move from
|
||||
"samples we saw" to "samples we didn't". Both plot
|
||||
on the perf scene.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- training-code — how we trained, before showing results -->
|
||||
<div class="stage-view" data-view="training-code">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
|
|
@ -600,234 +403,6 @@
|
|||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 15. theoretical-contributions -->
|
||||
<div class="stage-view" data-view="theoretical">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">theoretical contributions · what's new methodologically</div>
|
||||
<div class="motivation-cards">
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-trust"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">window-centre labelling</div>
|
||||
<div class="motivation-card-text">A 10-second
|
||||
classification window is labelled by the phase that
|
||||
occupies its centre, not by majority vote across the
|
||||
window. Cleaner training signal at phase boundaries,
|
||||
and avoids the spurious "ambiguous" class.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-contain"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">schema-hashed checkpoints</div>
|
||||
<div class="motivation-card-text">Each checkpoint
|
||||
embeds a hash of the feature schema; loading a model
|
||||
against the wrong schema fails fast instead of
|
||||
silently scoring on misaligned columns. Makes
|
||||
retroactive comparison reproducible.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-recover"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">held-out-by-sample as the eval axis</div>
|
||||
<div class="motivation-card-text">The hosts in the
|
||||
fleet are uniform — same orchestrator, same workload,
|
||||
different production rates. The generalization claim
|
||||
is therefore "unseen malware sample", tested on the
|
||||
same population of devices the training data came
|
||||
from. Profile-stratified so every profile gets fair
|
||||
train/val/test cells.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 16. practical-contributions -->
|
||||
<div class="stage-view" data-view="practical">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">practical contributions · what others can use</div>
|
||||
<div class="motivation-cards">
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-trust"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">/proc-only deployment</div>
|
||||
<div class="motivation-card-text">No syscall hooks, no
|
||||
eBPF, no kernel module — runs on hosts that don't
|
||||
permit deep instrumentation. The detector is one
|
||||
Python service plus a model file.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-contain"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">producer-agnostic dashboard</div>
|
||||
<div class="motivation-card-text">The deck consumes
|
||||
typed events; the inference loop runs anywhere
|
||||
(Pi, A100, cloud) and just POSTs back. Same UI for
|
||||
a lab demo and an operational console.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-recover"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">labelled dataset on disk</div>
|
||||
<div class="motivation-card-text">78,000+ episodes,
|
||||
five phases, two hosts, six attack profiles —
|
||||
archived in zstd-compressed tarballs with a
|
||||
schema-versioned format. Ready for downstream
|
||||
work without re-running the orchestrator.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 17. design-principles -->
|
||||
<div class="stage-view" data-view="design-principles">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">design principles · patterns that emerged</div>
|
||||
<div class="motivation-cards">
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-trust"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">one loop, many models</div>
|
||||
<div class="motivation-card-text">Every NN architecture
|
||||
plugs into the same training loop — class weights,
|
||||
AMP, cosine LR, early stop. Architecture changes
|
||||
don't ripple into orchestration.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-contain"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">typed events as contract</div>
|
||||
<div class="motivation-card-text">Producers and
|
||||
consumers agree on dataclasses
|
||||
(<code>events.py</code>), not free-form dicts.
|
||||
Adding a new scene means adding a new dataclass;
|
||||
adding a new producer means importing it.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-recover"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">two-agent path ownership</div>
|
||||
<div class="motivation-card-text">Dashboard work and
|
||||
model work live in two parallel sessions with a
|
||||
documented path-ownership boundary. Merges go
|
||||
through git with explicit rebases instead of a
|
||||
shared workspace.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 18. limitations -->
|
||||
<div class="stage-view" data-view="limitations">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">limitations · the honest list</div>
|
||||
<div class="motivation-cards">
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">two-host fleet</div>
|
||||
<div class="motivation-card-text">Both hosts contribute
|
||||
to train, val, and test, but the device population
|
||||
is small (n = 2). Adding more hosts on the WireGuard
|
||||
mesh wouldn't change the split recipe but would make
|
||||
the dataset more representative of real-world
|
||||
hardware variety.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">synthetic attack profiles</div>
|
||||
<div class="motivation-card-text">Six profiles cover the
|
||||
main shapes (cpu-saturate, ransomware-lite, bursty-c2,
|
||||
fork-bomb, crypto-miner, distccd-exec) but real-world
|
||||
malware can sit between or outside these envelopes.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">10 Hz sampling floor</div>
|
||||
<div class="motivation-card-text">Sub-100ms attack
|
||||
behaviours fall inside a single sample. Detection of
|
||||
extremely short-lived attacks (millisecond-scale
|
||||
privilege checks) requires faster sampling than
|
||||
<code>/proc</code> currently provides.</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="motivation-card">
|
||||
<div class="motivation-card-marker mc-armed"></div>
|
||||
<div class="motivation-card-body">
|
||||
<div class="motivation-card-title">KNN val ↔ test gap</div>
|
||||
<div class="motivation-card-text">KNN scores val
|
||||
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
|
||||
on held-out sample_names. Instance-based memorization
|
||||
of the specific training samples — informative as a
|
||||
baseline, not a deployment candidate.</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- 19. conclusion-future — summary + unsupervised next steps -->
|
||||
<div class="stage-view" data-view="conclusion-future">
|
||||
<div class="metric-stack metric-stack-wide">
|
||||
<div class="metric-eyebrow">conclusion + future work</div>
|
||||
<div class="conclusion-grid">
|
||||
<div class="conclusion-col">
|
||||
<div class="conclusion-col-title">what we showed</div>
|
||||
<ul class="conclusion-list">
|
||||
<li>A per-host detector trained on
|
||||
<strong>/proc-only telemetry</strong> can classify
|
||||
workload phases at multi-class macro-F1 well above
|
||||
chance.</li>
|
||||
<li>Held-out-by-<strong>sample</strong>,
|
||||
profile-stratified, is the right generalization
|
||||
axis: both fleet hosts contribute to all three
|
||||
slices, and the test set's
|
||||
<code>sample_name</code>s never appear during
|
||||
training.</li>
|
||||
<li>The recurrent family (LSTM/GRU) and Transformer
|
||||
sit on the upper-left of the
|
||||
<strong>accuracy-vs-cost frontier</strong>; KNN and
|
||||
GBT round out the comparison as honest baselines.</li>
|
||||
<li>The detector slots into a wider <strong>trust /
|
||||
containment / recovery</strong> loop — the per-host
|
||||
verdict isn't the final answer, it's one input.</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="conclusion-col">
|
||||
<div class="conclusion-col-title">next steps · unsupervised</div>
|
||||
<ul class="conclusion-list">
|
||||
<li><strong>Clustering</strong> the unlabeled tail of
|
||||
new fleet data (KMeans / HDBSCAN) to surface novel
|
||||
workload shapes the supervised model has no class
|
||||
for — a self-training feedback loop.</li>
|
||||
<li><strong>Anomaly detection</strong> on the
|
||||
last-layer embedding (one-class SVM, isolation forest)
|
||||
so a "none of the five known phases" verdict is
|
||||
available alongside the classifier output.</li>
|
||||
<li><strong>Self-supervised pretraining</strong> on
|
||||
the much larger pool of unlabeled telemetry from
|
||||
operational hosts; supervised fine-tune on the
|
||||
smaller orchestrated dataset.</li>
|
||||
<li><strong>Embedding visualisation</strong> via
|
||||
UMAP / t-SNE for human-in-the-loop labelling of
|
||||
the unlabeled tail (already prototyped in scene 12).</li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
<button id="next-fab" class="fab" data-no-advance title="Next (→)">▼</button>
|
||||
|
|
@ -893,80 +468,6 @@
|
|||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="problem-statement">
|
||||
<div class="prose">
|
||||
<h2>Problem statement</h2>
|
||||
<p>Today's behaviour-based IDS systems rely on syscall traces,
|
||||
kernel hooks, or rich endpoint agents that can't ship to
|
||||
constrained or untrusted hosts. We want a detector that
|
||||
runs on the only telemetry every modern Linux already
|
||||
exports — <code>/proc</code> — and labels each ten-second
|
||||
window of activity with the phase the workload is in.</p>
|
||||
<p><strong>Research question.</strong> Can a sequence model
|
||||
trained on twelve channels of <code>/proc</code> telemetry
|
||||
classify five workload phases (clean / armed / infecting /
|
||||
infected_running / dormant) accurately enough to drive
|
||||
automated containment, <em>and</em> generalize to malware
|
||||
<code>sample_name</code>s it has never seen during training?</p>
|
||||
<p>The task is <strong>multi-class classification</strong>:
|
||||
the target is one of five mutually-exclusive phase labels.
|
||||
Not regression (no continuous target), not ranking
|
||||
(downstream policy is a categorical containment decision).
|
||||
We deliberately chose 10-second windows so detection
|
||||
latency stays bounded for a real fleet.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="research-questions">
|
||||
<div class="prose">
|
||||
<h2>Research gaps + questions</h2>
|
||||
<p>Literature on behaviour-based malware detection is rich but
|
||||
uneven. Most published results either (a) use richer
|
||||
telemetry than what a constrained host actually exports, or
|
||||
(b) frame evaluation in ways that hide same-sample overfit
|
||||
(training and testing on the same malware instances). The card on the left summarises the
|
||||
gap.</p>
|
||||
<p>This project asks three concrete questions:</p>
|
||||
<p><strong>RQ1.</strong> How well can a per-window classifier
|
||||
identify workload phases from <code>/proc</code> alone, with
|
||||
no syscall traces and no kernel hooks?</p>
|
||||
<p><strong>RQ2.</strong> Does the model still work on
|
||||
<code>sample_name</code>s the training set never saw —
|
||||
i.e., new instances of malware profiles it does know?</p>
|
||||
<p><strong>RQ3.</strong> Of the standard sequence-model
|
||||
families (RNN, GRU, LSTM, CNN, Transformer) plus a
|
||||
non-parametric baseline (KNN) and a tabular baseline
|
||||
(gradient-boosted trees), which trade off accuracy and
|
||||
inference cost best for a deployment that has to run on a
|
||||
constrained host?</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="solution-overview">
|
||||
<div class="prose">
|
||||
<h2>Proposed solution</h2>
|
||||
<p>A single end-to-end pipeline turns raw <code>/proc</code>
|
||||
telemetry on a fleet host into a per-window phase verdict
|
||||
in under a second. Each stage of the diagram on the left
|
||||
is a thin, independently-deployable component — the
|
||||
receiver doesn't know what model is running; the model
|
||||
doesn't know where the episode came from.</p>
|
||||
<p>The <strong>model zoo</strong> is the key abstraction:
|
||||
every model class registers itself by name, declares its
|
||||
input kind (summary features or window tensors), and plugs
|
||||
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
|
||||
GRU, LSTM, and Transformer all reuse the same standardization,
|
||||
schema-hashed checkpoint format, class-weighted CE loss,
|
||||
and held-out-by-sample evaluation — so the comparison is
|
||||
genuinely apples-to-apples.</p>
|
||||
<p>The detector's per-window verdict feeds two downstream
|
||||
loops: a fleet-wide <strong>trust score</strong> that
|
||||
combines local classification with network-behaviour
|
||||
signals (per IEEE 9881803), and a <strong>fast-recovery</strong>
|
||||
snapshot rollback when an infection time is known.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="stack">
|
||||
<div class="prose">
|
||||
<h2>Live, not staged</h2>
|
||||
|
|
@ -1054,35 +555,6 @@
|
|||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="evaluation-setup">
|
||||
<div class="prose">
|
||||
<h2>Evaluation setup</h2>
|
||||
<p>Three choices anchor every result on the next slides — the
|
||||
split recipe, the primary metric, and what we measure next
|
||||
to accuracy. The temptation is to report a single big
|
||||
number; we report a number you can argue with.</p>
|
||||
<p><strong>Held-out by <code>sample_name</code>,
|
||||
profile-stratified.</strong> The fleet is uniform — every
|
||||
host runs the same orchestrator and the same set of
|
||||
profiles — so we don't split by device. Both hosts
|
||||
contribute data to train, val, and test. What's held out is
|
||||
specific malware <em>instances</em>: the
|
||||
<code>sample_name</code>s in the test set never appear
|
||||
during training. The model has to generalize to unseen
|
||||
samples, not unseen devices.</p>
|
||||
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
|
||||
heavily skewed: roughly half the labelled time is
|
||||
<code>infected_running</code> and only ~5 % is
|
||||
<code>armed</code>. A "predict the majority class"
|
||||
baseline already hits 0.5 accuracy. Macro-F1 averages F1
|
||||
across all five phases so rare classes count.</p>
|
||||
<p><strong>Latency reported with accuracy.</strong> A model
|
||||
that's one F1 point better but ten milliseconds slower
|
||||
may still be the wrong choice for an on-host detector.
|
||||
The perf scene plots both axes so the trade-off is visible.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="training-code">
|
||||
<div class="prose">
|
||||
<h2>How we trained them</h2>
|
||||
|
|
@ -1161,143 +633,6 @@
|
|||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="theoretical">
|
||||
<div class="prose">
|
||||
<h2>Theoretical contributions</h2>
|
||||
<p>Three methodological claims this project makes — small in
|
||||
isolation, but together they change how the comparison is
|
||||
run. Each shows up explicitly in the codebase.</p>
|
||||
<p><strong>Window-centre labelling.</strong> Instead of
|
||||
majority-voting phase labels across each 10-second window
|
||||
(which creates noisy boundaries), we label each window by
|
||||
the phase that occupies its centre. Cleaner training
|
||||
signal at transitions, no spurious "ambiguous" class.</p>
|
||||
<p><strong>Schema-hashed checkpoints.</strong> Every
|
||||
checkpoint embeds a hash of the feature schema it was
|
||||
trained on. Loading a model against a different schema
|
||||
fails fast. Without this, retroactive comparison silently
|
||||
scores models on misaligned columns and reports nonsense.</p>
|
||||
<p><strong>Held-out-by-sample, profile-stratified.</strong>
|
||||
Hosts in the fleet are uniform — same orchestrator, same
|
||||
workload, just different production rates — so we split by
|
||||
malware <code>sample_name</code> instead of by device. The
|
||||
generalization claim is "unseen malware sample", tested on
|
||||
the same population of hosts that contributed the training
|
||||
data.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="practical">
|
||||
<div class="prose">
|
||||
<h2>Practical contributions</h2>
|
||||
<p>What others can pick up and use from this project — beyond
|
||||
the published numbers.</p>
|
||||
<p><strong>/proc-only deployment.</strong> The detector needs
|
||||
no syscall hooks, no eBPF, no kernel module. It runs on
|
||||
hosts that don't permit deeper instrumentation — a small
|
||||
VM, a container with limited capabilities, an embedded
|
||||
device. One Python service plus a model file.</p>
|
||||
<p><strong>Producer-agnostic dashboard.</strong> The deck
|
||||
consumes typed events
|
||||
(<code>training/dashboard/events.py</code>); the inference
|
||||
loop runs anywhere — Pi, A100, cloud — and just POSTs back.
|
||||
Same UI for a lab demo and an operational console.</p>
|
||||
<p><strong>Labelled dataset on disk.</strong> 78 000+
|
||||
episodes across two hosts and six attack profiles, archived
|
||||
in zstd-compressed tarballs with a schema-versioned format.
|
||||
Anyone reproducing or extending this work can start from
|
||||
the dataset directly without re-running the orchestrator.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="design-principles">
|
||||
<div class="prose">
|
||||
<h2>Design principles</h2>
|
||||
<p>Three patterns that emerged during the project and earned
|
||||
their keep enough that we'd repeat them.</p>
|
||||
<p><strong>One loop, many models.</strong> Every NN
|
||||
architecture plugs into the same training loop — class
|
||||
weights, AMP autocast, cosine LR with warmup, gradient
|
||||
clipping, early stop on val macro-F1. Architecture changes
|
||||
don't ripple into orchestration, and adding a new model
|
||||
class costs ~80 lines.</p>
|
||||
<p><strong>Typed events as contract.</strong> Producers and
|
||||
consumers agree on dataclasses, not free-form dicts.
|
||||
Adding a new dashboard scene means adding a new dataclass;
|
||||
adding a new producer means importing it. Static checking
|
||||
and editor autocomplete do most of the work that a
|
||||
schema-validation library would do at runtime.</p>
|
||||
<p><strong>Two-agent path ownership.</strong> Dashboard work
|
||||
and model work live in two parallel sessions with a
|
||||
documented path-ownership boundary
|
||||
(<code>training/dashboard/</code> vs everywhere else).
|
||||
Merges go through git with explicit rebases instead of a
|
||||
shared workspace — slow up front, fewer subtle stomps
|
||||
over time.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="limitations">
|
||||
<div class="prose">
|
||||
<h2>Limitations</h2>
|
||||
<p>What this project cannot honestly claim — and why each
|
||||
line on the left matters for how the results should be read.</p>
|
||||
<p><strong>Two-host fleet.</strong> Cross-host generalization
|
||||
is reported between exactly two machines; it's the right
|
||||
<em>shape</em> of evaluation but not a population claim.
|
||||
More hosts on the WireGuard mesh would let us report
|
||||
distributional bounds rather than single point comparisons.</p>
|
||||
<p><strong>Synthetic attack profiles.</strong> Our six
|
||||
profiles cover the main behavioural envelopes
|
||||
(cpu-saturate, ransomware-lite, bursty-c2, fork-bomb,
|
||||
crypto-miner, distccd-exec) but real-world malware can
|
||||
sit between or outside these envelopes. Generalization to
|
||||
unseen profiles is reported via held-out-by-sample, but
|
||||
in-the-wild distribution shift is unknown.</p>
|
||||
<p><strong>10 Hz sampling floor.</strong> Sub-100ms
|
||||
behaviours fall inside a single sample. Detection of
|
||||
millisecond-scale privilege checks would need faster
|
||||
telemetry than <code>/proc</code> provides.</p>
|
||||
<p><strong>KNN val ↔ test gap.</strong> KNN scores val
|
||||
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
|
||||
held-out <code>sample_name</code>s. Instance-based
|
||||
memorization of the specific training samples — informative
|
||||
as a baseline, not a deployment candidate.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="conclusion-future">
|
||||
<div class="prose">
|
||||
<h2>Conclusion + future work</h2>
|
||||
<p>A per-host classifier trained on <code>/proc</code>-only
|
||||
telemetry can identify workload phases at multi-class
|
||||
macro-F1 well above chance and slot into a wider
|
||||
trust / containment / recovery loop. The recurrent family
|
||||
(LSTM/GRU) and Transformer sit on the upper-left of the
|
||||
accuracy-vs-cost frontier; KNN and GBT are honest baselines.
|
||||
Held-out-by-host evaluation is the right generalization
|
||||
axis — held-out-by-sample overstates real fleet
|
||||
performance by 0.3+ F1.</p>
|
||||
<p><strong>Unsupervised next steps.</strong> The natural
|
||||
extensions are unsupervised:</p>
|
||||
<p>• <strong>Clustering</strong> the unlabeled tail of new
|
||||
fleet data (KMeans / HDBSCAN) to surface novel workload
|
||||
shapes the supervised model has no class for — a
|
||||
self-training feedback loop that enrolls new phases as
|
||||
the fleet grows.</p>
|
||||
<p>• <strong>Anomaly detection</strong> on the last-layer
|
||||
embedding (one-class SVM, isolation forest) so a "none of
|
||||
the five known phases" verdict is available alongside the
|
||||
classifier output.</p>
|
||||
<p>• <strong>Self-supervised pretraining</strong> on the much
|
||||
larger pool of unlabeled telemetry from operational hosts;
|
||||
supervised fine-tune on the smaller orchestrated dataset.</p>
|
||||
<p>• <strong>Embedding visualisation</strong> via UMAP /
|
||||
t-SNE for human-in-the-loop labelling — already prototyped
|
||||
in the KNN scene's interactive 3-D scatter.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="scene" data-stage="references">
|
||||
<div class="prose">
|
||||
<h2>References</h2>
|
||||
|
|
@ -1313,6 +648,6 @@
|
|||
</article>
|
||||
</div>
|
||||
|
||||
<script src="/static/dashboard.js?v=5316d1d8"></script>
|
||||
<script src="/static/dashboard.js?v=960c0baa"></script>
|
||||
</body>
|
||||
</html>
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue