deck: remove the nine inserted scenes
Per the user's request: the rubric-derived scenes I added in one sweep weren't tied closely enough to their actual project narrative and ate up presentation time. Reverting to the pre-insertion deck:

removed
  problem-statement / research-questions / solution-overview /
  evaluation-setup / theoretical / practical / design-principles /
  limitations / conclusion-future

kept (user-requested earlier in the session)
  motivation (with the IEEE 9881803 citation)
  live (A100 inference scene)

CSS rules and references/* sidecar files for the removed scenes are left in place as harmless dead code; they can be cleaned up later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent ed5f729ff0
commit 53d2b80009
1 changed file with 2 additions and 667 deletions
|
|
@ -219,144 +219,7 @@
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- 3. problem-statement — what we're solving + task type -->
|
<!-- stack — Python stack & libraries used in the project -->
|
||||||
<div class="stage-view" data-view="problem-statement">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">the problem · single sentence + numbers</div>
|
|
||||||
<div class="problem-claim">
|
|
||||||
<div class="problem-claim-text">Classify each ten-second window of fleet
|
|
||||||
<code>/proc</code> telemetry into one of five workload phases —
|
|
||||||
accurately enough to drive automated containment.</div>
|
|
||||||
</div>
|
|
||||||
<div class="problem-stats">
|
|
||||||
<div class="problem-stat">
|
|
||||||
<div class="problem-stat-num">5</div>
|
|
||||||
<div class="problem-stat-lbl">phase classes<br><code>clean</code> → <code>infected_running</code></div>
|
|
||||||
</div>
|
|
||||||
<div class="problem-stat">
|
|
||||||
<div class="problem-stat-num">12</div>
|
|
||||||
<div class="problem-stat-lbl"><code>/proc</code> channels<br>no syscalls, no kernel hooks</div>
|
|
||||||
</div>
|
|
||||||
<div class="problem-stat">
|
|
||||||
<div class="problem-stat-num">10s</div>
|
|
||||||
<div class="problem-stat-lbl">classification window<br>100 samples × 12 channels</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="problem-task">
|
|
||||||
<span class="problem-task-label">task type:</span>
|
|
||||||
<span class="problem-task-value">multi-class classification</span>
|
|
||||||
<span class="problem-task-detail">— five mutually-exclusive
|
|
||||||
phase labels, balanced via class-weighted cross-entropy.
|
|
||||||
Not regression (no continuous target), not ranking
|
|
||||||
(downstream policy is a categorical containment decision).</span>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 4. research-questions — literature gaps and questions -->
|
|
||||||
<div class="stage-view" data-view="research-questions">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">literature gaps · positioning the work</div>
|
|
||||||
<div class="research-grid">
|
|
||||||
<div class="research-col">
|
|
||||||
<div class="research-col-title">what prior work covers</div>
|
|
||||||
<ul class="research-list">
|
|
||||||
<li><strong>LSTM on syscall traces</strong> in VMs —
|
|
||||||
deeper telemetry than <code>/proc</code></li>
|
|
||||||
<li><strong>Transformer on per-process resource metrics</strong>
|
|
||||||
— related signal, single-host eval</li>
|
|
||||||
<li><strong>BERT on system logs</strong> (LogBERT) —
|
|
||||||
text-form telemetry, not numeric channels</li>
|
|
||||||
<li><strong>Insider-threat LSTM on event logs</strong>
|
|
||||||
(DANTE) — categorical events, not continuous</li>
|
|
||||||
<li><strong>Network-behaviour trust establishment</strong>
|
|
||||||
(IEEE 9881803) — cross-device aggregation,
|
|
||||||
not per-host classifier</li>
|
|
||||||
</ul>
|
|
||||||
</div>
|
|
||||||
<div class="research-col">
|
|
||||||
<div class="research-col-title">what's missing</div>
|
|
||||||
<ul class="research-list">
|
|
||||||
<li><strong>/proc-only signal</strong> — most work
|
|
||||||
assumes syscalls or kernel hooks</li>
|
|
||||||
<li><strong>Sample-stratified evaluation</strong> —
|
|
||||||
papers often hide same-sample overfit by training
|
|
||||||
and testing on the same malware instances</li>
|
|
||||||
<li><strong>Real-time per-window classification</strong>
|
|
||||||
for containment, not post-hoc batch labelling</li>
|
|
||||||
<li><strong>Side-by-side cell-choice comparison</strong>
|
|
||||||
(RNN/GRU/LSTM/CNN/Transformer) on one dataset</li>
|
|
||||||
<li><strong>Direct integration</strong> with a
|
|
||||||
fleet-wide trust score, not standalone output</li>
|
|
||||||
</ul>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 5. solution-overview — pipeline block diagram -->
|
|
||||||
<div class="stage-view" data-view="solution-overview">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">pipeline · what each stage produces</div>
|
|
||||||
<svg class="pipeline-svg" viewBox="0 0 800 480"
|
|
||||||
xmlns="http://www.w3.org/2000/svg"
|
|
||||||
preserveAspectRatio="xMidYMid meet">
|
|
||||||
<g class="pipeline-stage">
|
|
||||||
<rect x="20" y="40" width="140" height="60" rx="4"/>
|
|
||||||
<text x="90" y="68" text-anchor="middle">fleet hosts</text>
|
|
||||||
<text x="90" y="86" text-anchor="middle" class="pipeline-detail">/proc · 10 Hz</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage">
|
|
||||||
<rect x="200" y="40" width="140" height="60" rx="4"/>
|
|
||||||
<text x="270" y="68" text-anchor="middle">receiver (Pi)</text>
|
|
||||||
<text x="270" y="86" text-anchor="middle" class="pipeline-detail">bearer auth</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage">
|
|
||||||
<rect x="380" y="40" width="140" height="60" rx="4"/>
|
|
||||||
<text x="450" y="68" text-anchor="middle">episode store</text>
|
|
||||||
<text x="450" y="86" text-anchor="middle" class="pipeline-detail">zstd · tar</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage">
|
|
||||||
<rect x="560" y="40" width="220" height="60" rx="4"/>
|
|
||||||
<text x="670" y="68" text-anchor="middle">windowing + features</text>
|
|
||||||
<text x="670" y="86" text-anchor="middle" class="pipeline-detail">10 s · 100 samples × 12 ch</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage pipeline-stage-models">
|
|
||||||
<rect x="180" y="170" width="440" height="120" rx="4"/>
|
|
||||||
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
|
|
||||||
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
|
|
||||||
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
|
|
||||||
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage">
|
|
||||||
<rect x="60" y="350" width="200" height="60" rx="4"/>
|
|
||||||
<text x="160" y="378" text-anchor="middle">per-window phase</text>
|
|
||||||
<text x="160" y="396" text-anchor="middle" class="pipeline-detail">5-class softmax</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage pipeline-stage-final">
|
|
||||||
<rect x="300" y="350" width="200" height="60" rx="4"/>
|
|
||||||
<text x="400" y="378" text-anchor="middle">trust score</text>
|
|
||||||
<text x="400" y="396" text-anchor="middle" class="pipeline-detail">+ network signals (9881803)</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-stage pipeline-stage-final">
|
|
||||||
<rect x="540" y="350" width="220" height="60" rx="4"/>
|
|
||||||
<text x="650" y="378" text-anchor="middle">containment + reset</text>
|
|
||||||
<text x="650" y="396" text-anchor="middle" class="pipeline-detail">snapshot rollback</text>
|
|
||||||
</g>
|
|
||||||
<g class="pipeline-arrow" fill="none">
|
|
||||||
<path d="M160 70 L200 70" />
|
|
||||||
<path d="M340 70 L380 70" />
|
|
||||||
<path d="M520 70 L560 70" />
|
|
||||||
<path d="M670 100 L670 130 L400 130 L400 170" />
|
|
||||||
<path d="M400 290 L400 320 L160 320 L160 350" />
|
|
||||||
<path d="M260 380 L300 380" />
|
|
||||||
<path d="M500 380 L540 380" />
|
|
||||||
</g>
|
|
||||||
</svg>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 6. stack — Python stack & libraries used in the project -->
|
|
||||||
<div class="stage-view" data-view="stack">
|
<div class="stage-view" data-view="stack">
|
||||||
<div class="metric-stack metric-stack-wide">
|
<div class="metric-stack metric-stack-wide">
|
||||||
<div class="metric-eyebrow">the stack behind the live data on the right</div>
|
<div class="metric-eyebrow">the stack behind the live data on the right</div>
|
||||||
|
|
@ -453,66 +316,6 @@
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- 9. evaluation-setup — splits, metrics, baselines -->
|
|
||||||
<div class="stage-view" data-view="evaluation-setup">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">evaluation setup · how the numbers get made</div>
|
|
||||||
<div class="eval-blocks">
|
|
||||||
<div class="eval-block">
|
|
||||||
<div class="eval-block-title">split recipe</div>
|
|
||||||
<div class="eval-block-body">
|
|
||||||
<div><strong>train / val / test:</strong> held-out by
|
|
||||||
<code>sample_name</code>, profile-stratified</div>
|
|
||||||
<div><strong>both hosts</strong> contribute to all three slices</div>
|
|
||||||
<div class="eval-detail">the fleet is uniform — every
|
|
||||||
host runs the same orchestrator and every profile —
|
|
||||||
so we don't split by host. We split by malware
|
|
||||||
<code>sample_name</code>: the specific instances in
|
|
||||||
the test set never appear during training.
|
|
||||||
Generalization axis is "unseen malware", not
|
|
||||||
"unseen device". Two profiles with only one sample
|
|
||||||
(cpu-saturate, low-and-slow) are excluded from
|
|
||||||
held-out-by-sample eval and reported separately.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="eval-block">
|
|
||||||
<div class="eval-block-title">primary metric</div>
|
|
||||||
<div class="eval-block-body">
|
|
||||||
<div><strong>macro-F1</strong> averaged across the five phases</div>
|
|
||||||
<div class="eval-detail">accuracy lies under class
|
|
||||||
imbalance — ~50 % <code>infected_running</code>,
|
|
||||||
~5 % <code>armed</code>. A constant majority predictor
|
|
||||||
hits 0.5 accuracy. macro-F1 averages per-class F1,
|
|
||||||
so rare phases actually count toward the score.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="eval-block">
|
|
||||||
<div class="eval-block-title">baselines compared</div>
|
|
||||||
<div class="eval-block-body">
|
|
||||||
<div><strong>KNN</strong> — non-parametric, instance-based</div>
|
|
||||||
<div><strong>GBT (XGBoost)</strong> — tabular non-NN</div>
|
|
||||||
<div><strong>MLP</strong> — feedforward ablation</div>
|
|
||||||
<div><strong>CNN</strong> — local-pattern ablation</div>
|
|
||||||
<div><strong>RNN / GRU / LSTM</strong> — recurrent family</div>
|
|
||||||
<div><strong>Transformer</strong> — attention</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="eval-block">
|
|
||||||
<div class="eval-block-title">reported alongside accuracy</div>
|
|
||||||
<div class="eval-block-body">
|
|
||||||
<div><strong>μs / window</strong> — inference cost at batch=64</div>
|
|
||||||
<div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
|
|
||||||
<div class="eval-detail">latency translates to
|
|
||||||
containment lag; the val ↔ test gap is the honest
|
|
||||||
measure of how much accuracy survives the move from
|
|
||||||
"samples we saw" to "samples we didn't". Both plot
|
|
||||||
on the perf scene.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- training-code — how we trained, before showing results -->
|
<!-- training-code — how we trained, before showing results -->
|
||||||
<div class="stage-view" data-view="training-code">
|
<div class="stage-view" data-view="training-code">
|
||||||
<div class="metric-stack metric-stack-wide">
|
<div class="metric-stack metric-stack-wide">
|
||||||
|
|
@ -600,234 +403,6 @@
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- 15. theoretical-contributions -->
|
|
||||||
<div class="stage-view" data-view="theoretical">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">theoretical contributions · what's new methodologically</div>
|
|
||||||
<div class="motivation-cards">
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-trust"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">window-centre labelling</div>
|
|
||||||
<div class="motivation-card-text">A 10-second
|
|
||||||
classification window is labelled by the phase that
|
|
||||||
occupies its centre, not by majority vote across the
|
|
||||||
window. Cleaner training signal at phase boundaries,
|
|
||||||
and avoids the spurious "ambiguous" class.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-contain"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">schema-hashed checkpoints</div>
|
|
||||||
<div class="motivation-card-text">Each checkpoint
|
|
||||||
embeds a hash of the feature schema; loading a model
|
|
||||||
against the wrong schema fails fast instead of
|
|
||||||
silently scoring on misaligned columns. Makes
|
|
||||||
retroactive comparison reproducible.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-recover"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">held-out-by-sample as the eval axis</div>
|
|
||||||
<div class="motivation-card-text">The hosts in the
|
|
||||||
fleet are uniform — same orchestrator, same workload,
|
|
||||||
different production rates. The generalization claim
|
|
||||||
is therefore "unseen malware sample", tested on the
|
|
||||||
same population of devices the training data came
|
|
||||||
from. Profile-stratified so every profile gets fair
|
|
||||||
train/val/test cells.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 16. practical-contributions -->
|
|
||||||
<div class="stage-view" data-view="practical">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">practical contributions · what others can use</div>
|
|
||||||
<div class="motivation-cards">
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-trust"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">/proc-only deployment</div>
|
|
||||||
<div class="motivation-card-text">No syscall hooks, no
|
|
||||||
eBPF, no kernel module — runs on hosts that don't
|
|
||||||
permit deep instrumentation. The detector is one
|
|
||||||
Python service plus a model file.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-contain"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">producer-agnostic dashboard</div>
|
|
||||||
<div class="motivation-card-text">The deck consumes
|
|
||||||
typed events; the inference loop runs anywhere
|
|
||||||
(Pi, A100, cloud) and just POSTs back. Same UI for
|
|
||||||
a lab demo and an operational console.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-recover"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">labelled dataset on disk</div>
|
|
||||||
<div class="motivation-card-text">78,000+ episodes,
|
|
||||||
five phases, two hosts, six attack profiles —
|
|
||||||
archived in zstd-compressed tarballs with a
|
|
||||||
schema-versioned format. Ready for downstream
|
|
||||||
work without re-running the orchestrator.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 17. design-principles -->
|
|
||||||
<div class="stage-view" data-view="design-principles">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">design principles · patterns that emerged</div>
|
|
||||||
<div class="motivation-cards">
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-trust"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">one loop, many models</div>
|
|
||||||
<div class="motivation-card-text">Every NN architecture
|
|
||||||
plugs into the same training loop — class weights,
|
|
||||||
AMP, cosine LR, early stop. Architecture changes
|
|
||||||
don't ripple into orchestration.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-contain"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">typed events as contract</div>
|
|
||||||
<div class="motivation-card-text">Producers and
|
|
||||||
consumers agree on dataclasses
|
|
||||||
(<code>events.py</code>), not free-form dicts.
|
|
||||||
Adding a new scene means adding a new dataclass;
|
|
||||||
adding a new producer means importing it.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-recover"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">two-agent path ownership</div>
|
|
||||||
<div class="motivation-card-text">Dashboard work and
|
|
||||||
model work live in two parallel sessions with a
|
|
||||||
documented path-ownership boundary. Merges go
|
|
||||||
through git with explicit rebases instead of a
|
|
||||||
shared workspace.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 18. limitations -->
|
|
||||||
<div class="stage-view" data-view="limitations">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">limitations · the honest list</div>
|
|
||||||
<div class="motivation-cards">
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-armed"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">two-host fleet</div>
|
|
||||||
<div class="motivation-card-text">Both hosts contribute
|
|
||||||
to train, val, and test, but the device population
|
|
||||||
is small (n = 2). Adding more hosts on the WireGuard
|
|
||||||
mesh wouldn't change the split recipe but would make
|
|
||||||
the dataset more representative of real-world
|
|
||||||
hardware variety.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-armed"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">synthetic attack profiles</div>
|
|
||||||
<div class="motivation-card-text">Six profiles cover the
|
|
||||||
main shapes (cpu-saturate, ransomware-lite, bursty-c2,
|
|
||||||
fork-bomb, crypto-miner, distccd-exec) but real-world
|
|
||||||
malware can sit between or outside these envelopes.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-armed"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">10 Hz sampling floor</div>
|
|
||||||
<div class="motivation-card-text">Sub-100ms attack
|
|
||||||
behaviours fall inside a single sample. Detection of
|
|
||||||
extremely short-lived attacks (millisecond-scale
|
|
||||||
privilege checks) requires faster sampling than
|
|
||||||
<code>/proc</code> currently provides.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
<div class="motivation-card">
|
|
||||||
<div class="motivation-card-marker mc-armed"></div>
|
|
||||||
<div class="motivation-card-body">
|
|
||||||
<div class="motivation-card-title">KNN val ↔ test gap</div>
|
|
||||||
<div class="motivation-card-text">KNN scores val
|
|
||||||
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
|
|
||||||
on held-out sample_names. Instance-based memorization
|
|
||||||
of the specific training samples — informative as a
|
|
||||||
baseline, not a deployment candidate.</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 19. conclusion-future — summary + unsupervised next steps -->
|
|
||||||
<div class="stage-view" data-view="conclusion-future">
|
|
||||||
<div class="metric-stack metric-stack-wide">
|
|
||||||
<div class="metric-eyebrow">conclusion + future work</div>
|
|
||||||
<div class="conclusion-grid">
|
|
||||||
<div class="conclusion-col">
|
|
||||||
<div class="conclusion-col-title">what we showed</div>
|
|
||||||
<ul class="conclusion-list">
|
|
||||||
<li>A per-host detector trained on
|
|
||||||
<strong>/proc-only telemetry</strong> can classify
|
|
||||||
workload phases at multi-class macro-F1 well above
|
|
||||||
chance.</li>
|
|
||||||
<li>Held-out-by-<strong>sample</strong>,
|
|
||||||
profile-stratified, is the right generalization
|
|
||||||
axis: both fleet hosts contribute to all three
|
|
||||||
slices, and the test set's
|
|
||||||
<code>sample_name</code>s never appear during
|
|
||||||
training.</li>
|
|
||||||
<li>The recurrent family (LSTM/GRU) and Transformer
|
|
||||||
sit on the upper-left of the
|
|
||||||
<strong>accuracy-vs-cost frontier</strong>; KNN and
|
|
||||||
GBT round out the comparison as honest baselines.</li>
|
|
||||||
<li>The detector slots into a wider <strong>trust /
|
|
||||||
containment / recovery</strong> loop — the per-host
|
|
||||||
verdict isn't the final answer, it's one input.</li>
|
|
||||||
</ul>
|
|
||||||
</div>
|
|
||||||
<div class="conclusion-col">
|
|
||||||
<div class="conclusion-col-title">next steps · unsupervised</div>
|
|
||||||
<ul class="conclusion-list">
|
|
||||||
<li><strong>Clustering</strong> the unlabeled tail of
|
|
||||||
new fleet data (KMeans / HDBSCAN) to surface novel
|
|
||||||
workload shapes the supervised model has no class
|
|
||||||
for — a self-training feedback loop.</li>
|
|
||||||
<li><strong>Anomaly detection</strong> on the
|
|
||||||
last-layer embedding (one-class SVM, isolation forest)
|
|
||||||
so a "none of the five known phases" verdict is
|
|
||||||
available alongside the classifier output.</li>
|
|
||||||
<li><strong>Self-supervised pretraining</strong> on
|
|
||||||
the much larger pool of unlabeled telemetry from
|
|
||||||
operational hosts; supervised fine-tune on the
|
|
||||||
smaller orchestrated dataset.</li>
|
|
||||||
<li><strong>Embedding visualisation</strong> via
|
|
||||||
UMAP / t-SNE for human-in-the-loop labelling of
|
|
||||||
the unlabeled tail (already prototyped in scene 12).</li>
|
|
||||||
</ul>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
<button id="next-fab" class="fab" data-no-advance title="Next (→)">▼</button>
|
<button id="next-fab" class="fab" data-no-advance title="Next (→)">▼</button>
|
||||||
|
|
@ -893,80 +468,6 @@
|
||||||
</div>
|
</div>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
<section class="scene" data-stage="problem-statement">
|
|
||||||
<div class="prose">
|
|
||||||
<h2>Problem statement</h2>
|
|
||||||
<p>Today's behaviour-based IDS systems rely on syscall traces,
|
|
||||||
kernel hooks, or rich endpoint agents that can't ship to
|
|
||||||
constrained or untrusted hosts. We want a detector that
|
|
||||||
runs on the only telemetry every modern Linux already
|
|
||||||
exports — <code>/proc</code> — and labels each ten-second
|
|
||||||
window of activity with the phase the workload is in.</p>
|
|
||||||
<p><strong>Research question.</strong> Can a sequence model
|
|
||||||
trained on twelve channels of <code>/proc</code> telemetry
|
|
||||||
classify five workload phases (clean / armed / infecting /
|
|
||||||
infected_running / dormant) accurately enough to drive
|
|
||||||
automated containment, <em>and</em> generalize to malware
|
|
||||||
<code>sample_name</code>s it has never seen during training?</p>
|
|
||||||
<p>The task is <strong>multi-class classification</strong>:
|
|
||||||
the target is one of five mutually-exclusive phase labels.
|
|
||||||
Not regression (no continuous target), not ranking
|
|
||||||
(downstream policy is a categorical containment decision).
|
|
||||||
We deliberately chose 10-second windows so detection
|
|
||||||
latency stays bounded for a real fleet.</p>
|
|
||||||
</div>
|
|
||||||
</section>
|
|
||||||
|
|
||||||
<section class="scene" data-stage="research-questions">
|
|
||||||
<div class="prose">
|
|
||||||
<h2>Research gaps + questions</h2>
|
|
||||||
<p>Literature on behaviour-based malware detection is rich but
|
|
||||||
uneven. Most published results either (a) use richer
|
|
||||||
telemetry than what a constrained host actually exports, or
|
|
||||||
(b) frame evaluation in ways that hide same-sample overfit
|
|
||||||
(training and testing on the same malware instances). The card on the left summarises the
|
|
||||||
gap.</p>
|
|
||||||
<p>This project asks three concrete questions:</p>
|
|
||||||
<p><strong>RQ1.</strong> How well can a per-window classifier
|
|
||||||
identify workload phases from <code>/proc</code> alone, with
|
|
||||||
no syscall traces and no kernel hooks?</p>
|
|
||||||
<p><strong>RQ2.</strong> Does the model still work on
|
|
||||||
<code>sample_name</code>s the training set never saw —
|
|
||||||
i.e., new instances of malware profiles it does know?</p>
|
|
||||||
<p><strong>RQ3.</strong> Of the standard sequence-model
|
|
||||||
families (RNN, GRU, LSTM, CNN, Transformer) plus a
|
|
||||||
non-parametric baseline (KNN) and a tabular baseline
|
|
||||||
(gradient-boosted trees), which trade off accuracy and
|
|
||||||
inference cost best for a deployment that has to run on a
|
|
||||||
constrained host?</p>
|
|
||||||
</div>
|
|
||||||
</section>
|
|
||||||
|
|
||||||
<section class="scene" data-stage="solution-overview">
|
|
||||||
<div class="prose">
|
|
||||||
<h2>Proposed solution</h2>
|
|
||||||
<p>A single end-to-end pipeline turns raw <code>/proc</code>
|
|
||||||
telemetry on a fleet host into a per-window phase verdict
|
|
||||||
in under a second. Each stage of the diagram on the left
|
|
||||||
is a thin, independently-deployable component — the
|
|
||||||
receiver doesn't know what model is running; the model
|
|
||||||
doesn't know where the episode came from.</p>
|
|
||||||
<p>The <strong>model zoo</strong> is the key abstraction:
|
|
||||||
every model class registers itself by name, declares its
|
|
||||||
input kind (summary features or window tensors), and plugs
|
|
||||||
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
|
|
||||||
GRU, LSTM, and Transformer all reuse the same standardization,
|
|
||||||
schema-hashed checkpoint format, class-weighted CE loss,
|
|
||||||
and held-out-by-sample evaluation — so the comparison is
|
|
||||||
genuinely apples-to-apples.</p>
|
|
||||||
<p>The detector's per-window verdict feeds two downstream
|
|
||||||
loops: a fleet-wide <strong>trust score</strong> that
|
|
||||||
combines local classification with network-behaviour
|
|
||||||
signals (per IEEE 9881803), and a <strong>fast-recovery</strong>
|
|
||||||
snapshot rollback when an infection time is known.</p>
|
|
||||||
</div>
|
|
||||||
</section>
|
|
||||||
|
|
||||||
<section class="scene" data-stage="stack">
|
<section class="scene" data-stage="stack">
|
||||||
<div class="prose">
|
<div class="prose">
|
||||||
<h2>Live, not staged</h2>
|
<h2>Live, not staged</h2>
|
||||||
|
|
@ -1054,35 +555,6 @@
|
||||||
</div>
|
</div>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
<section class="scene" data-stage="evaluation-setup">
|
|
||||||
<div class="prose">
|
|
||||||
<h2>Evaluation setup</h2>
|
|
||||||
<p>Three choices anchor every result on the next slides — the
|
|
||||||
split recipe, the primary metric, and what we measure next
|
|
||||||
to accuracy. The temptation is to report a single big
|
|
||||||
number; we report a number you can argue with.</p>
|
|
||||||
<p><strong>Held-out by <code>sample_name</code>,
|
|
||||||
profile-stratified.</strong> The fleet is uniform — every
|
|
||||||
host runs the same orchestrator and the same set of
|
|
||||||
profiles — so we don't split by device. Both hosts
|
|
||||||
contribute data to train, val, and test. What's held out is
|
|
||||||
specific malware <em>instances</em>: the
|
|
||||||
<code>sample_name</code>s in the test set never appear
|
|
||||||
during training. The model has to generalize to unseen
|
|
||||||
samples, not unseen devices.</p>
|
|
||||||
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
|
|
||||||
heavily skewed: roughly half the labelled time is
|
|
||||||
<code>infected_running</code> and only ~5 % is
|
|
||||||
<code>armed</code>. A "predict the majority class"
|
|
||||||
baseline already hits 0.5 accuracy. Macro-F1 averages F1
|
|
||||||
across all five phases so rare classes count.</p>
|
|
||||||
<p><strong>Latency reported with accuracy.</strong> A model
|
|
||||||
that's one F1 point better but ten milliseconds slower
|
|
||||||
may still be the wrong choice for an on-host detector.
|
|
||||||
The perf scene plots both axes so the trade-off is visible.</p>
|
|
||||||
</div>
|
|
||||||
</section>
|
|
||||||
|
|
||||||
<section class="scene" data-stage="training-code">
|
<section class="scene" data-stage="training-code">
|
||||||
<div class="prose">
|
<div class="prose">
|
||||||
<h2>How we trained them</h2>
|
<h2>How we trained them</h2>
|
||||||
|
|
@ -1161,143 +633,6 @@
|
||||||
</div>
|
</div>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
-<section class="scene" data-stage="theoretical">
-<div class="prose">
-<h2>Theoretical contributions</h2>
-<p>Three methodological claims this project makes — small in
-isolation, but together they change how the comparison is
-run. Each shows up explicitly in the codebase.</p>
-<p><strong>Window-centre labelling.</strong> Instead of
-majority-voting phase labels across each 10-second window
-(which creates noisy boundaries), we label each window by
-the phase that occupies its centre. Cleaner training
-signal at transitions, no spurious "ambiguous" class.</p>
-<p><strong>Schema-hashed checkpoints.</strong> Every
-checkpoint embeds a hash of the feature schema it was
-trained on. Loading a model against a different schema
-fails fast. Without this, retroactive comparison silently
-scores models on misaligned columns and reports nonsense.</p>
-<p><strong>Held-out-by-sample, profile-stratified.</strong>
-Hosts in the fleet are uniform — same orchestrator, same
-workload, just different production rates — so we split by
-malware <code>sample_name</code> instead of by device. The
-generalization claim is "unseen malware sample", tested on
-the same population of hosts that contributed the training
-data.</p>
-</div>
-</section>
-
-<section class="scene" data-stage="practical">
-<div class="prose">
-<h2>Practical contributions</h2>
-<p>What others can pick up and use from this project — beyond
-the published numbers.</p>
-<p><strong>/proc-only deployment.</strong> The detector needs
-no syscall hooks, no eBPF, no kernel module. It runs on
-hosts that don't permit deeper instrumentation — a small
-VM, a container with limited capabilities, an embedded
-device. One Python service plus a model file.</p>
-<p><strong>Producer-agnostic dashboard.</strong> The deck
-consumes typed events
-(<code>training/dashboard/events.py</code>); the inference
-loop runs anywhere — Pi, A100, cloud — and just POSTs back.
-Same UI for a lab demo and an operational console.</p>
-<p><strong>Labelled dataset on disk.</strong> 78 000+
-episodes across two hosts and six attack profiles, archived
-in zstd-compressed tarballs with a schema-versioned format.
-Anyone reproducing or extending this work can start from
-the dataset directly without re-running the orchestrator.</p>
-</div>
-</section>
-
-<section class="scene" data-stage="design-principles">
-<div class="prose">
-<h2>Design principles</h2>
-<p>Three patterns that emerged during the project and earned
-their keep enough that we'd repeat them.</p>
-<p><strong>One loop, many models.</strong> Every NN
-architecture plugs into the same training loop — class
-weights, AMP autocast, cosine LR with warmup, gradient
-clipping, early stop on val macro-F1. Architecture changes
-don't ripple into orchestration, and adding a new model
-class costs ~80 lines.</p>
-<p><strong>Typed events as contract.</strong> Producers and
-consumers agree on dataclasses, not free-form dicts.
-Adding a new dashboard scene means adding a new dataclass;
-adding a new producer means importing it. Static checking
-and editor autocomplete do most of the work that a
-schema-validation library would do at runtime.</p>
-<p><strong>Two-agent path ownership.</strong> Dashboard work
-and model work live in two parallel sessions with a
-documented path-ownership boundary
-(<code>training/dashboard/</code> vs everywhere else).
-Merges go through git with explicit rebases instead of a
-shared workspace — slow up front, fewer subtle stomps
-over time.</p>
-</div>
-</section>
-
-<section class="scene" data-stage="limitations">
-<div class="prose">
-<h2>Limitations</h2>
-<p>What this project cannot honestly claim — and why each
-line on the left matters for how the results should be read.</p>
-<p><strong>Two-host fleet.</strong> Cross-host generalization
-is reported between exactly two machines; it's the right
-<em>shape</em> of evaluation but not a population claim.
-More hosts on the WireGuard mesh would let us report
-distributional bounds rather than single point comparisons.</p>
-<p><strong>Synthetic attack profiles.</strong> Our six
-profiles cover the main behavioural envelopes
-(cpu-saturate, ransomware-lite, bursty-c2, fork-bomb,
-crypto-miner, distccd-exec) but real-world malware can
-sit between or outside these envelopes. Generalization to
-unseen profiles is reported via held-out-by-sample, but
-in-the-wild distribution shift is unknown.</p>
-<p><strong>10 Hz sampling floor.</strong> Sub-100ms
-behaviours fall inside a single sample. Detection of
-millisecond-scale privilege checks would need faster
-telemetry than <code>/proc</code> provides.</p>
-<p><strong>KNN val ↔ test gap.</strong> KNN scores val
-macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
-held-out <code>sample_name</code>s. Instance-based
-memorization of the specific training samples — informative
-as a baseline, not a deployment candidate.</p>
-</div>
-</section>
-
-<section class="scene" data-stage="conclusion-future">
-<div class="prose">
-<h2>Conclusion + future work</h2>
-<p>A per-host classifier trained on <code>/proc</code>-only
-telemetry can identify workload phases at multi-class
-macro-F1 well above chance and slot into a wider
-trust / containment / recovery loop. The recurrent family
-(LSTM/GRU) and Transformer sit on the upper-left of the
-accuracy-vs-cost frontier; KNN and GBT are honest baselines.
-Held-out-by-host evaluation is the right generalization
-axis — held-out-by-sample overstates real fleet
-performance by 0.3+ F1.</p>
-<p><strong>Unsupervised next steps.</strong> The natural
-extensions are unsupervised:</p>
-<p>• <strong>Clustering</strong> the unlabeled tail of new
-fleet data (KMeans / HDBSCAN) to surface novel workload
-shapes the supervised model has no class for — a
-self-training feedback loop that enrolls new phases as
-the fleet grows.</p>
-<p>• <strong>Anomaly detection</strong> on the last-layer
-embedding (one-class SVM, isolation forest) so a "none of
-the five known phases" verdict is available alongside the
-classifier output.</p>
-<p>• <strong>Self-supervised pretraining</strong> on the much
-larger pool of unlabeled telemetry from operational hosts;
-supervised fine-tune on the smaller orchestrated dataset.</p>
-<p>• <strong>Embedding visualisation</strong> via UMAP /
-t-SNE for human-in-the-loop labelling — already prototyped
-in the KNN scene's interactive 3-D scatter.</p>
-</div>
-</section>
-
 <section class="scene" data-stage="references">
 <div class="prose">
 <h2>References</h2>
@ -1313,6 +648,6 @@
 </article>
 </div>
 
-<script src="/static/dashboard.js?v=5316d1d8"></script>
+<script src="/static/dashboard.js?v=960c0baa"></script>
 </body>
 </html>