deck: remove the nine inserted scenes

Per the user's request — the rubric-derived scenes I added in one
sweep weren't tied closely enough to their actual project narrative
and ate up presentation time. Reverting to the pre-insertion deck:

removed
  problem-statement / research-questions / solution-overview /
  evaluation-setup / theoretical / practical / design-principles /
  limitations / conclusion-future

kept (user-requested earlier in the session)
  motivation (with the IEEE 9881803 citation)
  live (A100 inference scene)

CSS rules and references/* sidecar files for the removed scenes
are left in place as harmless dead code; they can be cleaned up
later.
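
A minimal sketch for that later cleanup, assuming the dead rules can be
recognised by class-name prefix. The prefixes are read off the class names
in the removed stage-views of this diff; `SAMPLE_CSS` is a made-up stand-in
for the real stylesheet, whose path is not shown here. `motivation-*` is
deliberately left out because the kept motivation scene uses those classes.

```python
import re

# Prefixes seen only in the removed stage-views of this diff (an assumption:
# the rest of the deck was not checked for other users of these classes).
REMOVED_PREFIXES = ("problem-", "research-", "pipeline-", "eval-", "conclusion-")

# Matches a class selector such as ".problem-stat-num".
CLASS_RE = re.compile(r"\.([A-Za-z_][\w-]*)")


def dead_classes(css_text, prefixes=REMOVED_PREFIXES):
    """Return sorted class names in css_text that start with a removed prefix."""
    # Blank out declaration blocks so values like "1.5em" are never scanned.
    selectors_only = re.sub(r"\{[^}]*\}", "{}", css_text)
    found = {name for name in CLASS_RE.findall(selectors_only)
             if name.startswith(prefixes)}
    return sorted(found)


# Hypothetical stylesheet fragment, for illustration only.
SAMPLE_CSS = """
.problem-claim { font-size: 1.5em; }
.motivation-card, .eval-block-title { color: red; }
.pipeline-svg text { fill: #ccc; }
.metric-eyebrow { letter-spacing: .1em; }
"""

if __name__ == "__main__":
    for name in dead_classes(SAMPLE_CSS):
        print(name)
```

The regex only looks at selectors, so it will miss classes referenced from
JavaScript; treat the output as a candidate list to review, not a delete list.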

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max Gorog 2026-05-08 19:06:57 -05:00
parent ed5f729ff0
commit 53d2b80009


@ -219,144 +219,7 @@
</div> </div>
</div> </div>
<!-- 3. problem-statement — what we're solving + task type --> <!-- stack — Python stack & libraries used in the project -->
<div class="stage-view" data-view="problem-statement">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">the problem · single sentence + numbers</div>
<div class="problem-claim">
<div class="problem-claim-text">Classify each ten-second window of fleet
<code>/proc</code> telemetry into one of five workload phases —
accurately enough to drive automated containment.</div>
</div>
<div class="problem-stats">
<div class="problem-stat">
<div class="problem-stat-num">5</div>
<div class="problem-stat-lbl">phase classes<br><code>clean</code><code>infected_running</code></div>
</div>
<div class="problem-stat">
<div class="problem-stat-num">12</div>
<div class="problem-stat-lbl"><code>/proc</code> channels<br>no syscalls, no kernel hooks</div>
</div>
<div class="problem-stat">
<div class="problem-stat-num">10s</div>
<div class="problem-stat-lbl">classification window<br>100 samples × 12 channels</div>
</div>
</div>
<div class="problem-task">
<span class="problem-task-label">task type:</span>
<span class="problem-task-value">multi-class classification</span>
<span class="problem-task-detail">— five mutually-exclusive
phase labels, balanced via class-weighted cross-entropy.
Not regression (no continuous target), not ranking
(downstream policy is a categorical containment decision).</span>
</div>
</div>
</div>
<!-- 4. research-questions — literature gaps and questions -->
<div class="stage-view" data-view="research-questions">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">literature gaps · positioning the work</div>
<div class="research-grid">
<div class="research-col">
<div class="research-col-title">what prior work covers</div>
<ul class="research-list">
<li><strong>LSTM on syscall traces</strong> in VMs —
deeper telemetry than <code>/proc</code></li>
<li><strong>Transformer on per-process resource metrics</strong>
— related signal, single-host eval</li>
<li><strong>BERT on system logs</strong> (LogBERT) —
text-form telemetry, not numeric channels</li>
<li><strong>Insider-threat LSTM on event logs</strong>
(DANTE) — categorical events, not continuous</li>
<li><strong>Network-behaviour trust establishment</strong>
(IEEE 9881803) — cross-device aggregation,
not per-host classifier</li>
</ul>
</div>
<div class="research-col">
<div class="research-col-title">what's missing</div>
<ul class="research-list">
<li><strong>/proc-only signal</strong> — most work
assumes syscalls or kernel hooks</li>
<li><strong>Sample-stratified evaluation</strong>
papers often hide same-sample overfit by training
and testing on the same malware instances</li>
<li><strong>Real-time per-window classification</strong>
for containment, not post-hoc batch labelling</li>
<li><strong>Side-by-side cell-choice comparison</strong>
(RNN/GRU/LSTM/CNN/Transformer) on one dataset</li>
<li><strong>Direct integration</strong> with a
fleet-wide trust score, not standalone output</li>
</ul>
</div>
</div>
</div>
</div>
<!-- 5. solution-overview — pipeline block diagram -->
<div class="stage-view" data-view="solution-overview">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">pipeline · what each stage produces</div>
<svg class="pipeline-svg" viewBox="0 0 800 480"
xmlns="http://www.w3.org/2000/svg"
preserveAspectRatio="xMidYMid meet">
<g class="pipeline-stage">
<rect x="20" y="40" width="140" height="60" rx="4"/>
<text x="90" y="68" text-anchor="middle">fleet hosts</text>
<text x="90" y="86" text-anchor="middle" class="pipeline-detail">/proc · 10 Hz</text>
</g>
<g class="pipeline-stage">
<rect x="200" y="40" width="140" height="60" rx="4"/>
<text x="270" y="68" text-anchor="middle">receiver (Pi)</text>
<text x="270" y="86" text-anchor="middle" class="pipeline-detail">bearer auth</text>
</g>
<g class="pipeline-stage">
<rect x="380" y="40" width="140" height="60" rx="4"/>
<text x="450" y="68" text-anchor="middle">episode store</text>
<text x="450" y="86" text-anchor="middle" class="pipeline-detail">zstd · tar</text>
</g>
<g class="pipeline-stage">
<rect x="560" y="40" width="220" height="60" rx="4"/>
<text x="670" y="68" text-anchor="middle">windowing + features</text>
<text x="670" y="86" text-anchor="middle" class="pipeline-detail">10 s · 100 samples × 12 ch</text>
</g>
<g class="pipeline-stage pipeline-stage-models">
<rect x="180" y="170" width="440" height="120" rx="4"/>
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">held-out-by-sample · class-weighted CE · early stop on val macro-F1</text>
</g>
<g class="pipeline-stage">
<rect x="60" y="350" width="200" height="60" rx="4"/>
<text x="160" y="378" text-anchor="middle">per-window phase</text>
<text x="160" y="396" text-anchor="middle" class="pipeline-detail">5-class softmax</text>
</g>
<g class="pipeline-stage pipeline-stage-final">
<rect x="300" y="350" width="200" height="60" rx="4"/>
<text x="400" y="378" text-anchor="middle">trust score</text>
<text x="400" y="396" text-anchor="middle" class="pipeline-detail">+ network signals (9881803)</text>
</g>
<g class="pipeline-stage pipeline-stage-final">
<rect x="540" y="350" width="220" height="60" rx="4"/>
<text x="650" y="378" text-anchor="middle">containment + reset</text>
<text x="650" y="396" text-anchor="middle" class="pipeline-detail">snapshot rollback</text>
</g>
<g class="pipeline-arrow" fill="none">
<path d="M160 70 L200 70" />
<path d="M340 70 L380 70" />
<path d="M520 70 L560 70" />
<path d="M670 100 L670 130 L400 130 L400 170" />
<path d="M400 290 L400 320 L160 320 L160 350" />
<path d="M260 380 L300 380" />
<path d="M500 380 L540 380" />
</g>
</svg>
</div>
</div>
<!-- 6. stack — Python stack & libraries used in the project -->
<div class="stage-view" data-view="stack"> <div class="stage-view" data-view="stack">
<div class="metric-stack metric-stack-wide"> <div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">the stack behind the live data on the right</div> <div class="metric-eyebrow">the stack behind the live data on the right</div>
@ -453,66 +316,6 @@
</div> </div>
</div> </div>
<!-- 9. evaluation-setup — splits, metrics, baselines -->
<div class="stage-view" data-view="evaluation-setup">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">evaluation setup · how the numbers get made</div>
<div class="eval-blocks">
<div class="eval-block">
<div class="eval-block-title">split recipe</div>
<div class="eval-block-body">
<div><strong>train / val / test:</strong> held-out by
<code>sample_name</code>, profile-stratified</div>
<div><strong>both hosts</strong> contribute to all three slices</div>
<div class="eval-detail">the fleet is uniform — every
host runs the same orchestrator and every profile —
so we don't split by host. We split by malware
<code>sample_name</code>: the specific instances in
the test set never appear during training.
Generalization axis is "unseen malware", not
"unseen device". Two profiles with only one sample
(cpu-saturate, low-and-slow) are excluded from
held-out-by-sample eval and reported separately.</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">primary metric</div>
<div class="eval-block-body">
<div><strong>macro-F1</strong> averaged across the five phases</div>
<div class="eval-detail">accuracy lies under class
imbalance — ~50 % <code>infected_running</code>,
~5 % <code>armed</code>. A constant majority predictor
hits 0.5 accuracy. macro-F1 averages per-class F1,
so rare phases actually count toward the score.</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">baselines compared</div>
<div class="eval-block-body">
<div><strong>KNN</strong> — non-parametric, instance-based</div>
<div><strong>GBT (XGBoost)</strong> — tabular non-NN</div>
<div><strong>MLP</strong> — feedforward ablation</div>
<div><strong>CNN</strong> — local-pattern ablation</div>
<div><strong>RNN / GRU / LSTM</strong> — recurrent family</div>
<div><strong>Transformer</strong> — attention</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">reported alongside accuracy</div>
<div class="eval-block-body">
<div><strong>μs / window</strong> — inference cost at batch=64</div>
<div><strong>val ↔ test gap</strong> — val − test macro-F1</div>
<div class="eval-detail">latency translates to
containment lag; the val ↔ test gap is the honest
measure of how much accuracy survives the move from
"samples we saw" to "samples we didn't". Both plot
on the perf scene.</div>
</div>
</div>
</div>
</div>
</div>
<!-- training-code — how we trained, before showing results --> <!-- training-code — how we trained, before showing results -->
<div class="stage-view" data-view="training-code"> <div class="stage-view" data-view="training-code">
<div class="metric-stack metric-stack-wide"> <div class="metric-stack metric-stack-wide">
@ -600,234 +403,6 @@
</div> </div>
</div> </div>
<!-- 15. theoretical-contributions -->
<div class="stage-view" data-view="theoretical">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">theoretical contributions · what's new methodologically</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">window-centre labelling</div>
<div class="motivation-card-text">A 10-second
classification window is labelled by the phase that
occupies its centre, not by majority vote across the
window. Cleaner training signal at phase boundaries,
and avoids the spurious "ambiguous" class.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">schema-hashed checkpoints</div>
<div class="motivation-card-text">Each checkpoint
embeds a hash of the feature schema; loading a model
against the wrong schema fails fast instead of
silently scoring on misaligned columns. Makes
retroactive comparison reproducible.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">held-out-by-sample as the eval axis</div>
<div class="motivation-card-text">The hosts in the
fleet are uniform — same orchestrator, same workload,
different production rates. The generalization claim
is therefore "unseen malware sample", tested on the
same population of devices the training data came
from. Profile-stratified so every profile gets fair
train/val/test cells.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 16. practical-contributions -->
<div class="stage-view" data-view="practical">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">practical contributions · what others can use</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">/proc-only deployment</div>
<div class="motivation-card-text">No syscall hooks, no
eBPF, no kernel module — runs on hosts that don't
permit deep instrumentation. The detector is one
Python service plus a model file.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">producer-agnostic dashboard</div>
<div class="motivation-card-text">The deck consumes
typed events; the inference loop runs anywhere
(Pi, A100, cloud) and just POSTs back. Same UI for
a lab demo and an operational console.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">labelled dataset on disk</div>
<div class="motivation-card-text">78,000+ episodes,
five phases, two hosts, six attack profiles —
archived in zstd-compressed tarballs with a
schema-versioned format. Ready for downstream
work without re-running the orchestrator.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 17. design-principles -->
<div class="stage-view" data-view="design-principles">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">design principles · patterns that emerged</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">one loop, many models</div>
<div class="motivation-card-text">Every NN architecture
plugs into the same training loop — class weights,
AMP, cosine LR, early stop. Architecture changes
don't ripple into orchestration.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">typed events as contract</div>
<div class="motivation-card-text">Producers and
consumers agree on dataclasses
(<code>events.py</code>), not free-form dicts.
Adding a new scene means adding a new dataclass;
adding a new producer means importing it.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">two-agent path ownership</div>
<div class="motivation-card-text">Dashboard work and
model work live in two parallel sessions with a
documented path-ownership boundary. Merges go
through git with explicit rebases instead of a
shared workspace.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 18. limitations -->
<div class="stage-view" data-view="limitations">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">limitations · the honest list</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">two-host fleet</div>
<div class="motivation-card-text">Both hosts contribute
to train, val, and test, but the device population
is small (n = 2). Adding more hosts on the WireGuard
mesh wouldn't change the split recipe but would make
the dataset more representative of real-world
hardware variety.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">synthetic attack profiles</div>
<div class="motivation-card-text">Six profiles cover the
main shapes (cpu-saturate, ransomware-lite, bursty-c2,
fork-bomb, crypto-miner, distccd-exec) but real-world
malware can sit between or outside these envelopes.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">10 Hz sampling floor</div>
<div class="motivation-card-text">Sub-100ms attack
behaviours fall inside a single sample. Detection of
extremely short-lived attacks (millisecond-scale
privilege checks) requires faster sampling than
<code>/proc</code> currently provides.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">KNN val ↔ test gap</div>
<div class="motivation-card-text">KNN scores val
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13
on held-out sample_names. Instance-based memorization
of the specific training samples — informative as a
baseline, not a deployment candidate.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 19. conclusion-future — summary + unsupervised next steps -->
<div class="stage-view" data-view="conclusion-future">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">conclusion + future work</div>
<div class="conclusion-grid">
<div class="conclusion-col">
<div class="conclusion-col-title">what we showed</div>
<ul class="conclusion-list">
<li>A per-host detector trained on
<strong>/proc-only telemetry</strong> can classify
workload phases at multi-class macro-F1 well above
chance.</li>
<li>Held-out-by-<strong>sample</strong>,
profile-stratified, is the right generalization
axis: both fleet hosts contribute to all three
slices, and the test set's
<code>sample_name</code>s never appear during
training.</li>
<li>The recurrent family (LSTM/GRU) and Transformer
sit on the upper-left of the
<strong>accuracy-vs-cost frontier</strong>; KNN and
GBT round out the comparison as honest baselines.</li>
<li>The detector slots into a wider <strong>trust /
containment / recovery</strong> loop — the per-host
verdict isn't the final answer, it's one input.</li>
</ul>
</div>
<div class="conclusion-col">
<div class="conclusion-col-title">next steps · unsupervised</div>
<ul class="conclusion-list">
<li><strong>Clustering</strong> the unlabeled tail of
new fleet data (KMeans / HDBSCAN) to surface novel
workload shapes the supervised model has no class
for — a self-training feedback loop.</li>
<li><strong>Anomaly detection</strong> on the
last-layer embedding (one-class SVM, isolation forest)
so a "none of the five known phases" verdict is
available alongside the classifier output.</li>
<li><strong>Self-supervised pretraining</strong> on
the much larger pool of unlabeled telemetry from
operational hosts; supervised fine-tune on the
smaller orchestrated dataset.</li>
<li><strong>Embedding visualisation</strong> via
UMAP / t-SNE for human-in-the-loop labelling of
the unlabeled tail (already prototyped in scene 12).</li>
</ul>
</div>
</div>
</div>
</div>
</div> </div>
<button id="next-fab" class="fab" data-no-advance title="Next (→)"></button> <button id="next-fab" class="fab" data-no-advance title="Next (→)"></button>
@ -893,80 +468,6 @@
</div> </div>
</section> </section>
<section class="scene" data-stage="problem-statement">
<div class="prose">
<h2>Problem statement</h2>
<p>Today's behaviour-based IDS systems rely on syscall traces,
kernel hooks, or rich endpoint agents that can't ship to
constrained or untrusted hosts. We want a detector that
runs on the only telemetry every modern Linux already
exports — <code>/proc</code> — and labels each ten-second
window of activity with the phase the workload is in.</p>
<p><strong>Research question.</strong> Can a sequence model
trained on twelve channels of <code>/proc</code> telemetry
classify five workload phases (clean / armed / infecting /
infected_running / dormant) accurately enough to drive
automated containment, <em>and</em> generalize to malware
<code>sample_name</code>s it has never seen during training?</p>
<p>The task is <strong>multi-class classification</strong>:
the target is one of five mutually-exclusive phase labels.
Not regression (no continuous target), not ranking
(downstream policy is a categorical containment decision).
We deliberately chose 10-second windows so detection
latency stays bounded for a real fleet.</p>
</div>
</section>
<section class="scene" data-stage="research-questions">
<div class="prose">
<h2>Research gaps + questions</h2>
<p>Literature on behaviour-based malware detection is rich but
uneven. Most published results either (a) use richer
telemetry than what a constrained host actually exports, or
(b) frame evaluation in ways that hide same-sample overfit
(training and testing on the same malware instances). The card on the left summarises the
gap.</p>
<p>This project asks three concrete questions:</p>
<p><strong>RQ1.</strong> How well can a per-window classifier
identify workload phases from <code>/proc</code> alone, with
no syscall traces and no kernel hooks?</p>
<p><strong>RQ2.</strong> Does the model still work on
<code>sample_name</code>s the training set never saw —
i.e., new instances of malware profiles it does know?</p>
<p><strong>RQ3.</strong> Of the standard sequence-model
families (RNN, GRU, LSTM, CNN, Transformer) plus a
non-parametric baseline (KNN) and a tabular baseline
(gradient-boosted trees), which trade off accuracy and
inference cost best for a deployment that has to run on a
constrained host?</p>
</div>
</section>
<section class="scene" data-stage="solution-overview">
<div class="prose">
<h2>Proposed solution</h2>
<p>A single end-to-end pipeline turns raw <code>/proc</code>
telemetry on a fleet host into a per-window phase verdict
in under a second. Each stage of the diagram on the left
is a thin, independently-deployable component — the
receiver doesn't know what model is running; the model
doesn't know where the episode came from.</p>
<p>The <strong>model zoo</strong> is the key abstraction:
every model class registers itself by name, declares its
input kind (summary features or window tensors), and plugs
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
GRU, LSTM, and Transformer all reuse the same standardization,
schema-hashed checkpoint format, class-weighted CE loss,
and held-out-by-sample evaluation — so the comparison is
genuinely apples-to-apples.</p>
<p>The detector's per-window verdict feeds two downstream
loops: a fleet-wide <strong>trust score</strong> that
combines local classification with network-behaviour
signals (per IEEE 9881803), and a <strong>fast-recovery</strong>
snapshot rollback when an infection time is known.</p>
</div>
</section>
<section class="scene" data-stage="stack"> <section class="scene" data-stage="stack">
<div class="prose"> <div class="prose">
<h2>Live, not staged</h2> <h2>Live, not staged</h2>
@ -1054,35 +555,6 @@
</div> </div>
</section> </section>
<section class="scene" data-stage="evaluation-setup">
<div class="prose">
<h2>Evaluation setup</h2>
<p>Three choices anchor every result on the next slides — the
split recipe, the primary metric, and what we measure next
to accuracy. The temptation is to report a single big
number; we report a number you can argue with.</p>
<p><strong>Held-out by <code>sample_name</code>,
profile-stratified.</strong> The fleet is uniform — every
host runs the same orchestrator and the same set of
profiles — so we don't split by device. Both hosts
contribute data to train, val, and test. What's held out is
specific malware <em>instances</em>: the
<code>sample_name</code>s in the test set never appear
during training. The model has to generalize to unseen
samples, not unseen devices.</p>
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
heavily skewed: roughly half the labelled time is
<code>infected_running</code> and only ~5 % is
<code>armed</code>. A "predict the majority class"
baseline already hits 0.5 accuracy. Macro-F1 averages F1
across all five phases so rare classes count.</p>
<p><strong>Latency reported with accuracy.</strong> A model
that's one F1 point better but ten milliseconds slower
may still be the wrong choice for an on-host detector.
The perf scene plots both axes so the trade-off is visible.</p>
</div>
</section>
<section class="scene" data-stage="training-code"> <section class="scene" data-stage="training-code">
<div class="prose"> <div class="prose">
<h2>How we trained them</h2> <h2>How we trained them</h2>
@ -1161,143 +633,6 @@
</div> </div>
</section> </section>
<section class="scene" data-stage="theoretical">
<div class="prose">
<h2>Theoretical contributions</h2>
<p>Three methodological claims this project makes — small in
isolation, but together they change how the comparison is
run. Each shows up explicitly in the codebase.</p>
<p><strong>Window-centre labelling.</strong> Instead of
majority-voting phase labels across each 10-second window
(which creates noisy boundaries), we label each window by
the phase that occupies its centre. Cleaner training
signal at transitions, no spurious "ambiguous" class.</p>
<p><strong>Schema-hashed checkpoints.</strong> Every
checkpoint embeds a hash of the feature schema it was
trained on. Loading a model against a different schema
fails fast. Without this, retroactive comparison silently
scores models on misaligned columns and reports nonsense.</p>
<p><strong>Held-out-by-sample, profile-stratified.</strong>
Hosts in the fleet are uniform — same orchestrator, same
workload, just different production rates — so we split by
malware <code>sample_name</code> instead of by device. The
generalization claim is "unseen malware sample", tested on
the same population of hosts that contributed the training
data.</p>
</div>
</section>
<section class="scene" data-stage="practical">
<div class="prose">
<h2>Practical contributions</h2>
<p>What others can pick up and use from this project — beyond
the published numbers.</p>
<p><strong>/proc-only deployment.</strong> The detector needs
no syscall hooks, no eBPF, no kernel module. It runs on
hosts that don't permit deeper instrumentation — a small
VM, a container with limited capabilities, an embedded
device. One Python service plus a model file.</p>
<p><strong>Producer-agnostic dashboard.</strong> The deck
consumes typed events
(<code>training/dashboard/events.py</code>); the inference
loop runs anywhere — Pi, A100, cloud — and just POSTs back.
Same UI for a lab demo and an operational console.</p>
<p><strong>Labelled dataset on disk.</strong> 78 000+
episodes across two hosts and six attack profiles, archived
in zstd-compressed tarballs with a schema-versioned format.
Anyone reproducing or extending this work can start from
the dataset directly without re-running the orchestrator.</p>
</div>
</section>
<section class="scene" data-stage="design-principles">
<div class="prose">
<h2>Design principles</h2>
<p>Three patterns that emerged during the project and earned
their keep enough that we'd repeat them.</p>
<p><strong>One loop, many models.</strong> Every NN
architecture plugs into the same training loop — class
weights, AMP autocast, cosine LR with warmup, gradient
clipping, early stop on val macro-F1. Architecture changes
don't ripple into orchestration, and adding a new model
class costs ~80 lines.</p>
<p><strong>Typed events as contract.</strong> Producers and
consumers agree on dataclasses, not free-form dicts.
Adding a new dashboard scene means adding a new dataclass;
adding a new producer means importing it. Static checking
and editor autocomplete do most of the work that a
schema-validation library would do at runtime.</p>
<p><strong>Two-agent path ownership.</strong> Dashboard work
and model work live in two parallel sessions with a
documented path-ownership boundary
(<code>training/dashboard/</code> vs everywhere else).
Merges go through git with explicit rebases instead of a
shared workspace — slow up front, fewer subtle stomps
over time.</p>
</div>
</section>
<section class="scene" data-stage="limitations">
<div class="prose">
<h2>Limitations</h2>
<p>What this project cannot honestly claim — and why each
line on the left matters for how the results should be read.</p>
<p><strong>Two-host fleet.</strong> Cross-host generalization
is reported between exactly two machines; it's the right
<em>shape</em> of evaluation but not a population claim.
More hosts on the WireGuard mesh would let us report
distributional bounds rather than single point comparisons.</p>
<p><strong>Synthetic attack profiles.</strong> Our six
profiles cover the main behavioural envelopes
(cpu-saturate, ransomware-lite, bursty-c2, fork-bomb,
crypto-miner, distccd-exec) but real-world malware can
sit between or outside these envelopes. Generalization to
unseen profiles is reported via held-out-by-sample, but
in-the-wild distribution shift is unknown.</p>
<p><strong>10 Hz sampling floor.</strong> Sub-100ms
behaviours fall inside a single sample. Detection of
millisecond-scale privilege checks would need faster
telemetry than <code>/proc</code> provides.</p>
<p><strong>KNN val ↔ test gap.</strong> KNN scores val
macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
held-out <code>sample_name</code>s. Instance-based
memorization of the specific training samples — informative
as a baseline, not a deployment candidate.</p>
</div>
</section>
<section class="scene" data-stage="conclusion-future">
<div class="prose">
<h2>Conclusion + future work</h2>
<p>A per-host classifier trained on <code>/proc</code>-only
telemetry can identify workload phases at multi-class
macro-F1 well above chance and slot into a wider
trust / containment / recovery loop. The recurrent family
(LSTM/GRU) and Transformer sit on the upper-left of the
accuracy-vs-cost frontier; KNN and GBT are honest baselines.
Held-out-by-host evaluation is the right generalization
axis — held-out-by-sample overstates real fleet
performance by 0.3+ F1.</p>
<p><strong>Unsupervised next steps.</strong> The natural
extensions are unsupervised:</p>
<p><strong>Clustering</strong> the unlabeled tail of new
fleet data (KMeans / HDBSCAN) to surface novel workload
shapes the supervised model has no class for — a
self-training feedback loop that enrolls new phases as
the fleet grows.</p>
<p><strong>Anomaly detection</strong> on the last-layer
embedding (one-class SVM, isolation forest) so a "none of
the five known phases" verdict is available alongside the
classifier output.</p>
<p><strong>Self-supervised pretraining</strong> on the much
larger pool of unlabeled telemetry from operational hosts;
supervised fine-tune on the smaller orchestrated dataset.</p>
<p><strong>Embedding visualisation</strong> via UMAP /
t-SNE for human-in-the-loop labelling — already prototyped
in the KNN scene's interactive 3-D scatter.</p>
</div>
</section>
<section class="scene" data-stage="references"> <section class="scene" data-stage="references">
<div class="prose"> <div class="prose">
<h2>References</h2> <h2>References</h2>
@ -1313,6 +648,6 @@
</article> </article>
</div> </div>
<script src="/static/dashboard.js?v=5316d1d8"></script> <script src="/static/dashboard.js?v=960c0baa"></script>
</body> </body>
</html> </html>
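
The removed evaluation-setup scene argues that accuracy is misleading under class imbalance and that macro-F1 is the better primary metric. A minimal sketch of that arithmetic, assuming a made-up set of 100 windows: the 50 % `infected_running` and 5 % `armed` shares come from the scene text, while the even 15 % split across the other three phases is an assumption for illustration only.

```python
from collections import Counter

PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted phase equals the true phase."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=PHASES):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label counts per 100 windows (only the first two shares are
# taken from the deck text; the rest are assumed).
COUNTS = {"infected_running": 50, "armed": 5,
          "clean": 15, "infecting": 15, "dormant": 15}

y_true = [label for label, n in COUNTS.items() for _ in range(n)]
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)  # constant majority-class predictor

if __name__ == "__main__":
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
    print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Under these assumed counts the constant predictor reaches 0.5 accuracy, while its macro-F1 is 2/15 (about 0.133), because four of the five per-class F1 scores are zero.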