CIS490/training/dashboard/static/index.html
Max Gorog db9f013969 deck: 9 new scenes to meet CIS-490 assignment-guide rubric
Five required + four optional slides, slotted into the existing flow
without renumbering the visible deck UI:

REQUIRED
- problem-statement (after motivation): single-sentence problem,
  three numeric stat cards, explicit task-type justification
  (multi-class classification, why not regression/ranking)
- research-questions (after problem-statement): two-column literature
  gap layout + RQ1/RQ2/RQ3
- solution-overview (after research-questions): inline-SVG block
  diagram of the pipeline (fleet hosts → receiver → episodes →
  windowing → model zoo → per-window phase → trust score →
  containment + reset)
- evaluation-setup (between chunking and models): four blocks
  covering split recipe, primary metric, baselines compared, and
  what's reported alongside accuracy. Each block leads with the
  *why*, matching the assignment's "explain not only what will be
  measured but why" requirement.
- conclusion-future (before references): two-column "what we showed"
  + unsupervised next steps (clustering / anomaly / SSL pretrain /
  embedding viz). Addresses Section 8 of the assignment guide.

OPTIONAL
- theoretical-contributions: window-centre labelling,
  schema-hashed checkpoints, cross-host as eval axis
- practical-contributions: /proc-only deployment,
  producer-agnostic dashboard, labelled dataset on disk
- design-principles: one-loop-many-models, typed events as
  contract, two-agent path ownership
- limitations: two-host fleet, synthetic profiles, 10 Hz floor,
  KNN cross-host gap

Plus references/links.md gains four real online references (PyTorch,
XGBoost, scikit-learn, proc(5)) bringing the citation count from 8
to 12 — over the assignment's 10-source minimum.

CSS additions cover the new layouts (.problem-claim, .problem-stats,
.research-grid, .pipeline-svg + .pipeline-stage / .pipeline-arrow,
.eval-blocks, .conclusion-grid). Limitations cards reuse the
motivation-card pattern with an armed-phase amber marker for the
"warning" feel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:32:50 -05:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>CIS490 — live</title>
<link rel="stylesheet" href="/static/dashboard.css?v=0ef6cb6d">
</head>
<body>
<!-- SVG filter defs for the lava-lamp goo effect. Width/height 0
so it doesn't take layout space; the filter is referenced by
CSS via filter: url(#goo). -->
<svg class="goo-defs" width="0" height="0" aria-hidden="true">
<defs>
<filter id="goo">
<feGaussianBlur in="SourceGraphic" stdDeviation="22" result="blur"/>
<feColorMatrix in="blur" mode="matrix" result="goo" values="
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 26 -12"/>
<feBlend in="SourceGraphic" in2="goo"/>
</filter>
</defs>
</svg>
<!-- Theme background layers — exactly one is visible at a time,
selected by body[data-theme]. The blobs / bubbles / beams
inside drift / lava / laser are generated by JS so the count
and statistical-distribution sliders actually take effect. -->
<div class="bg-canvas" id="bg-canvas" aria-hidden="true">
<div class="bg-tint"></div>
<div class="bg-drift" id="bg-drift"></div>
<div class="bg-lava">
<div class="goo-container" id="bg-lava-bubbles"></div>
</div>
<div class="bg-vaporwave">
<div class="vw-sky"></div>
<!-- Scanlines BEFORE sun: the sun's solid disc occludes
scanlines inside its area so they can't beat against the
sun's venetian-blind stripes (the same kind of moiré
that previously appeared between scanlines and the
perspective floor — same shape, smaller scale). -->
<div class="vw-scanlines"></div>
<div class="vw-sun"><div class="vw-sun-blinds"></div></div>
<div class="vw-horizon"></div>
<div class="vw-floor"><div class="vw-floor-grid"></div></div>
</div>
<div class="bg-laser" id="bg-laser-beams"></div>
</div>
<!-- Right-half sidebar theme panel. Slides in/out via the
`is-open` class — we don't use the `hidden` attribute because
the transform animation needs the panel to stay rendered. -->
<div class="theme-panel" id="theme-panel">
<div class="theme-panel-header">
<span class="theme-title">theme · OKLCH</span>
<button id="theme-close" class="ghost icon" title="Close (t)">×</button>
</div>
<label class="theme-row">
<span>background</span>
<select id="theme-bg">
<option value="black">black (still)</option>
<option value="drift">drift (soft blobs)</option>
<option value="lava">lava lamp (goo metaballs)</option>
<option value="vaporwave">vaporwave</option>
<option value="laser">laser show</option>
</select>
</label>
<div class="theme-wheel-block">
<div class="theme-wheel" id="theme-wheel">
<div class="wheel-disc"></div>
<div class="wheel-rim"></div>
<div class="wheel-markers" id="wheel-markers"></div>
</div>
<div class="theme-sliders">
<label>L · lightness · <span id="theme-l-val">70</span>%
<input type="range" id="theme-l" min="20" max="95" value="70" step="1"></label>
<label>C · chroma · <span id="theme-c-val">0.15</span>
<input type="range" id="theme-c" min="0" max="0.4" value="0.15" step="0.005"></label>
<label>H · hue · <span id="theme-h-val">250</span>°
<input type="range" id="theme-h" min="0" max="360" value="250" step="1"></label>
</div>
</div>
<div class="theme-sliders theme-harmony-block">
<label>colors · count · <span id="theme-count-val">3</span>
<input type="range" id="theme-count" min="1" max="6" value="3" step="1"></label>
<label>spread · angular range · <span id="theme-spread-val">60</span>°
<input type="range" id="theme-spread" min="0" max="300" value="60" step="1"></label>
<div class="theme-harmony-hint" id="theme-harmony-hint"></div>
</div>
<details class="theme-advanced">
<summary>advanced — palette ladder</summary>
<div class="theme-sliders">
<label>L variance · per-color lightness ladder · <span id="theme-lvar-val">0</span>
<input type="range" id="theme-lvar" min="0" max="40" value="0" step="1"></label>
<label>C variance · per-color chroma ladder · <span id="theme-cvar-val">0.00</span>
<input type="range" id="theme-cvar" min="0" max="0.15" value="0" step="0.005"></label>
</div>
</details>
<div class="theme-row">
<span>palette</span>
<div class="theme-swatches" id="theme-swatches"></div>
</div>
<details class="theme-advanced" open>
<summary>animation · global</summary>
<div class="theme-sliders">
<label>speed · <span id="theme-speed-val">1.00</span>×
<input type="range" id="theme-speed" min="0.1" max="4" value="1" step="0.05"></label>
<label>blur · <span id="theme-blur-val">0</span> px
<input type="range" id="theme-blur" min="0" max="40" value="0" step="1"></label>
<label>tint strength · <span id="theme-tint-val">0.10</span>
<input type="range" id="theme-tint" min="0" max="0.6" value="0.1" step="0.02"></label>
<label>content backdrop · <span id="theme-backdrop-val">0.30</span>
<input type="range" id="theme-backdrop" min="0" max="1" value="0.3" step="0.05"></label>
</div>
</details>
<!-- Per-theme settings — dynamically built by JS from the THEMES
spec; only the section matching state.background is shown. -->
<div id="theme-bg-settings"></div>
<div class="theme-meta-row">
<code id="theme-meta">oklch(70% 0.15 250)</code>
<button id="theme-reset" class="ghost">reset</button>
</div>
</div>
<header class="topbar">
<span class="brand">CIS490</span>
<span id="status" class="status">connecting…</span>
<span class="spacer"></span>
<span class="counter"><span id="scene-idx">1</span> / <span id="scene-total">1</span></span>
<button id="prev-btn" class="ghost icon" title="Previous (← / k)"></button>
<button id="next-btn" class="ghost icon" title="Next (→ / space / j)"></button>
<button id="click-nav-btn" class="ghost" title="Click on the stage to advance to the next slide (c)">click-nav: off</button>
<button id="demo-btn" class="ghost" title="Toggle local synthetic data">demo: off</button>
<button id="theme-btn" class="ghost" title="Theme panel (t)">theme</button>
</header>
<div class="layout">
<div class="canvas-wrapper" id="stage-col">
<div class="stage">
<!-- 1. intro -->
<div class="stage-view" data-view="intro">
<div class="bg-grid"></div>
<div class="intro-block">
<div class="intro-eyebrow">cis490 · live fleet telemetry</div>
<div class="intro-title">behavioral<br>malware<br>detection</div>
</div>
</div>
<!-- 2. motivation — what detection unlocks -->
<div class="stage-view" data-view="motivation">
<div class="metric-stack metric-stack-wide motivation-stack">
<div class="metric-eyebrow">what detection unlocks</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">network-level trust scoring</div>
<div class="motivation-card-text">A noisy on-device classifier becomes
useful when its verdict feeds a fleet-wide trust score —
peers, gateways, and traffic patterns vote together. A
single host's signal is fragile; combined network
behaviour is much harder to spoof.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">containment before pivot</div>
<div class="motivation-card-text">"Infected" is actionable: quarantine
the device's credentials, drop its traffic at the
gateway, stop lateral movement before the attacker
pivots to a neighbor. Detection latency directly
bounds blast radius.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">fast post-attack reset</div>
<div class="motivation-card-text">With a known infection time you can
roll a device back to a snapshot taken before the
compromise — no forensic dwell time, no guessing how
far back to roll. Recovery becomes a one-button
operation instead of a week of cleanup.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 3. problem-statement — what we're solving + task type -->
<div class="stage-view" data-view="problem-statement">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">the problem · single sentence + numbers</div>
<div class="problem-claim">
<div class="problem-claim-text">Classify each ten-second window of fleet
<code>/proc</code> telemetry into one of five workload phases —
accurately enough to drive automated containment.</div>
</div>
<div class="problem-stats">
<div class="problem-stat">
<div class="problem-stat-num">5</div>
<div class="problem-stat-lbl">phase classes<br><code>clean</code> &hellip; <code>infected_running</code></div>
</div>
<div class="problem-stat">
<div class="problem-stat-num">12</div>
<div class="problem-stat-lbl"><code>/proc</code> channels<br>no syscalls, no kernel hooks</div>
</div>
<div class="problem-stat">
<div class="problem-stat-num">10s</div>
<div class="problem-stat-lbl">classification window<br>100 samples × 12 channels</div>
</div>
</div>
<div class="problem-task">
<span class="problem-task-label">task type:</span>
<span class="problem-task-value">multi-class classification</span>
<span class="problem-task-detail">over five mutually exclusive
phase labels, with class imbalance handled by class-weighted
cross-entropy. Not regression (no continuous target), not ranking
(downstream policy is a categorical containment decision).</span>
</div>
</div>
</div>
<!-- 4. research-questions — literature gaps and questions -->
<div class="stage-view" data-view="research-questions">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">literature gaps · positioning the work</div>
<div class="research-grid">
<div class="research-col">
<div class="research-col-title">what prior work covers</div>
<ul class="research-list">
<li><strong>LSTM on syscall traces</strong> in VMs —
deeper telemetry than <code>/proc</code></li>
<li><strong>Transformer on per-process resource metrics</strong>
— related signal, single-host eval</li>
<li><strong>BERT on system logs</strong> (LogBERT) —
text-form telemetry, not numeric channels</li>
<li><strong>Insider-threat LSTM on event logs</strong>
(DANTE) — categorical events, not continuous</li>
<li><strong>Network-behaviour trust establishment</strong>
(IEEE 9881803) — cross-device aggregation,
not per-host classifier</li>
</ul>
</div>
<div class="research-col">
<div class="research-col-title">what's missing</div>
<ul class="research-list">
<li><strong>/proc-only signal</strong> — most work
assumes syscalls or kernel hooks</li>
<li><strong>Cross-host generalization</strong> — eval
splits often hide it (held-out by sample, not host)</li>
<li><strong>Real-time per-window classification</strong>
for containment, not post-hoc batch labelling</li>
<li><strong>Side-by-side cell-choice comparison</strong>
(RNN/GRU/LSTM/CNN/Transformer) on one dataset</li>
<li><strong>Direct integration</strong> with a
fleet-wide trust score, not standalone output</li>
</ul>
</div>
</div>
</div>
</div>
<!-- 5. solution-overview — pipeline block diagram -->
<div class="stage-view" data-view="solution-overview">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">pipeline · what each stage produces</div>
<svg class="pipeline-svg" viewBox="0 0 800 480"
xmlns="http://www.w3.org/2000/svg"
preserveAspectRatio="xMidYMid meet">
<g class="pipeline-stage">
<rect x="20" y="40" width="140" height="60" rx="4"/>
<text x="90" y="68" text-anchor="middle">fleet hosts</text>
<text x="90" y="86" text-anchor="middle" class="pipeline-detail">/proc · 10 Hz</text>
</g>
<g class="pipeline-stage">
<rect x="200" y="40" width="140" height="60" rx="4"/>
<text x="270" y="68" text-anchor="middle">receiver (Pi)</text>
<text x="270" y="86" text-anchor="middle" class="pipeline-detail">bearer auth</text>
</g>
<g class="pipeline-stage">
<rect x="380" y="40" width="140" height="60" rx="4"/>
<text x="450" y="68" text-anchor="middle">episode store</text>
<text x="450" y="86" text-anchor="middle" class="pipeline-detail">zstd · tar</text>
</g>
<g class="pipeline-stage">
<rect x="560" y="40" width="220" height="60" rx="4"/>
<text x="670" y="68" text-anchor="middle">windowing + features</text>
<text x="670" y="86" text-anchor="middle" class="pipeline-detail">10 s · 100 samples × 12 ch</text>
</g>
<g class="pipeline-stage pipeline-stage-models">
<rect x="180" y="170" width="440" height="120" rx="4"/>
<text x="400" y="198" text-anchor="middle" class="pipeline-stage-title">model zoo</text>
<text x="400" y="226" text-anchor="middle" class="pipeline-detail">KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer</text>
<text x="400" y="252" text-anchor="middle" class="pipeline-detail">trained per (model × split-recipe)</text>
<text x="400" y="276" text-anchor="middle" class="pipeline-detail-mini">cross-host eval · class-weighted CE · early stop on val macro-F1</text>
</g>
<g class="pipeline-stage">
<rect x="60" y="350" width="200" height="60" rx="4"/>
<text x="160" y="378" text-anchor="middle">per-window phase</text>
<text x="160" y="396" text-anchor="middle" class="pipeline-detail">5-class softmax</text>
</g>
<g class="pipeline-stage pipeline-stage-final">
<rect x="300" y="350" width="200" height="60" rx="4"/>
<text x="400" y="378" text-anchor="middle">trust score</text>
<text x="400" y="396" text-anchor="middle" class="pipeline-detail">+ network signals (9881803)</text>
</g>
<g class="pipeline-stage pipeline-stage-final">
<rect x="540" y="350" width="220" height="60" rx="4"/>
<text x="650" y="378" text-anchor="middle">containment + reset</text>
<text x="650" y="396" text-anchor="middle" class="pipeline-detail">snapshot rollback</text>
</g>
<g class="pipeline-arrow" fill="none">
<path d="M160 70 L200 70" />
<path d="M340 70 L380 70" />
<path d="M520 70 L560 70" />
<path d="M670 100 L670 130 L400 130 L400 170" />
<path d="M400 290 L400 320 L160 320 L160 350" />
<path d="M260 380 L300 380" />
<path d="M500 380 L540 380" />
</g>
</svg>
</div>
</div>
<!-- 6. stack — Python stack & libraries used in the project -->
<div class="stage-view" data-view="stack">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">the stack behind the live data on the right</div>
<div class="code-grid">
<div class="code-card">
<div class="code-card-header">pyproject.toml</div>
<pre class="code" id="code-pyproject"></pre>
</div>
<div class="code-card">
<div class="code-card-header">receiver/app.py · file header</div>
<pre class="code" id="code-receiver"></pre>
</div>
</div>
</div>
</div>
<!-- 3. collect -->
<div class="stage-view" data-view="collect">
<div class="metric-stack">
<div class="metric-eyebrow">episodes ingested</div>
<div class="metric-big" id="ingest-total">0</div>
<div class="metric-sub">
<span id="ingest-rate">0.0</span> / sec · last 60 s ·
total bytes on disk: <span id="ingest-bytes">0 B</span>
</div>
<svg class="sparkline" id="ingest-spark" viewBox="0 0 600 120" preserveAspectRatio="none">
<path id="ingest-spark-fill" d=""></path>
<path id="ingest-spark-path" d=""></path>
</svg>
</div>
</div>
<!-- 4. hosts -->
<div class="stage-view" data-view="hosts">
<div class="metric-stack">
<div class="metric-eyebrow">per-host shipping</div>
<div class="bars" id="host-bars">
<div class="awaiting">awaiting snapshot…</div>
</div>
</div>
</div>
<!-- 5. db — episode database explorer -->
<div class="stage-view" data-view="db">
<div class="metric-stack metric-stack-wide">
<div class="db-header">
<div class="metric-eyebrow">episode database · last 200 records</div>
<div class="db-count" id="db-count">0 of 0</div>
</div>
<div class="db-controls">
<div class="db-tabs" id="db-tabs"></div>
<input class="db-search" id="db-search" type="text"
placeholder="filter by host / id / sha…" />
</div>
<div class="db-table-wrap">
<table class="db-table">
<thead>
<tr>
<th>host</th>
<th>episode_id</th>
<th>received</th>
<th>size</th>
</tr>
</thead>
<tbody id="db-tbody"></tbody>
</table>
</div>
<div class="db-detail" id="db-detail" hidden>
<div class="db-detail-meta" id="db-detail-meta"></div>
<div class="db-detail-chart-wrap">
<svg class="db-detail-chart" id="db-detail-chart"
viewBox="0 0 1000 360" preserveAspectRatio="none"></svg>
</div>
<div class="db-detail-legend" id="db-detail-legend"></div>
</div>
</div>
</div>
<!-- 6. baseline -->
<div class="stage-view" data-view="baseline">
<div class="metric-stack">
<div class="metric-eyebrow" id="phase-mix-eyebrow">phase mix · sampling dataset…</div>
<div class="phase-stack" id="phase-stack"></div>
<div class="phase-legend" id="phase-legend"></div>
<div class="metric-sub" id="phase-mix-sub">computing the phase
distribution across a random sample of episodes on disk.
A clean fleet sits mostly in <code>clean</code>; skew toward
<code>infecting</code> / <code>infected_running</code>
reflects time spent under attack workloads.</div>
</div>
</div>
<!-- 7. attacks -->
<div class="stage-view" data-view="attacks">
<div class="metric-stack">
<div class="metric-eyebrow">attack envelopes · /proc signature per profile</div>
<div class="profile-grid" id="profile-grid"></div>
</div>
</div>
<!-- 8. chunking -->
<div class="stage-view" data-view="chunking">
<div class="metric-stack">
<div class="metric-eyebrow">10-second windows · model input shape</div>
<div class="chunk-rule" id="chunk-rule"></div>
<div class="chunk-row" id="chunk-row"></div>
<div class="chunk-axis" id="chunk-axis"></div>
<div class="metric-sub">each window: 100 samples (10 Hz × 10 s),
labeled by the phase that occupies its center.</div>
</div>
</div>
<!-- 9. evaluation-setup — splits, metrics, baselines -->
<div class="stage-view" data-view="evaluation-setup">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">evaluation setup · how the numbers get made</div>
<div class="eval-blocks">
<div class="eval-block">
<div class="eval-block-title">split recipe</div>
<div class="eval-block-body">
<div><strong>train + val:</strong> elliott-thinkpad</div>
<div><strong>test:</strong> k-gamingcom</div>
<div class="eval-detail">held-out by host so the test set
measures cross-device generalization, not in-distribution
self-prediction. A 90 % accuracy that comes from
recognising the host's idle profile is worthless for
a fleet detector.</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">primary metric</div>
<div class="eval-block-body">
<div><strong>macro-F1</strong> averaged across the five phases</div>
<div class="eval-detail">accuracy lies under class
imbalance — ~50 % <code>infected_running</code>,
~5 % <code>armed</code>. A constant majority predictor
hits 0.5 accuracy. macro-F1 averages per-class F1,
so rare phases actually count toward the score.</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">baselines compared</div>
<div class="eval-block-body">
<div><strong>KNN</strong> — non-parametric, instance-based</div>
<div><strong>GBT (XGBoost)</strong> — tabular non-NN</div>
<div><strong>MLP</strong> — feedforward ablation</div>
<div><strong>CNN</strong> — local-pattern ablation</div>
<div><strong>RNN / GRU / LSTM</strong> — recurrent family</div>
<div><strong>Transformer</strong> — attention</div>
</div>
</div>
<div class="eval-block">
<div class="eval-block-title">reported alongside accuracy</div>
<div class="eval-block-body">
<div><strong>μs / window:</strong> inference cost at batch=64</div>
<div><strong>cross-host gap:</strong> val &minus; test macro-F1</div>
<div class="eval-detail">latency translates to containment
lag; the gap is the honest measure of generalization.
Both are plotted on the perf scene.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 10. models -->
<div class="stage-view" data-view="models">
<div class="metric-stack">
<div class="metric-eyebrow">sequence models · accuracy on held-out samples</div>
<div class="model-bars" id="model-bars"></div>
</div>
</div>
<!-- 10. training-code — how we trained the sequence models -->
<div class="stage-view" data-view="training-code">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">how we trained the sequence models</div>
<div class="code-grid">
<div class="code-card">
<div class="code-card-header">training/models/lstm.py</div>
<pre class="code" id="code-train-lstm"></pre>
</div>
<div class="code-card">
<div class="code-card-header">training/trainer/_loop.py · train_nn</div>
<pre class="code" id="code-train-loop"></pre>
</div>
</div>
</div>
</div>
<!-- 11. knn — interactive 3-D scatter with mode toggle -->
<div class="stage-view" data-view="knn">
<div class="metric-stack">
<div class="metric-eyebrow">window features · 3-D projection · drag to rotate</div>
<div class="scatter3d-controls">
<div class="scatter3d-modes">
<button class="scatter3d-mode active" data-mode="phase">phase (ground truth)</button>
<button class="scatter3d-mode" data-mode="predicted">KNN-predicted label</button>
<button class="scatter3d-mode" data-mode="cluster">cluster id</button>
</div>
<button class="scatter3d-reset">reset view</button>
</div>
<div class="scatter3d-wrap">
<canvas class="scatter3d" id="knn-scatter-canvas"></canvas>
</div>
<div class="phase-legend" id="knn-legend"></div>
</div>
</div>
<!-- 13. references — PDF viewer with tabs + description -->
<div class="stage-view" data-view="references">
<div class="metric-stack metric-stack-wide ref-stack">
<div class="metric-eyebrow">references · papers, notes, prior work</div>
<div class="ref-tabs" id="ref-tabs"></div>
<div class="ref-content">
<div class="ref-viewer-wrap">
<iframe class="ref-viewer" id="ref-viewer"
title="reference viewer"
sandbox="allow-same-origin allow-scripts allow-popups allow-forms"></iframe>
</div>
<div class="ref-description" id="ref-description"></div>
</div>
</div>
</div>
<!-- 12. perf -->
<div class="stage-view" data-view="perf">
<div class="metric-stack">
<div class="metric-eyebrow">accuracy vs inference cost</div>
<svg class="scatter" id="perf-scatter" viewBox="0 0 600 360" preserveAspectRatio="xMidYMid meet"></svg>
<div class="metric-sub">x: μs / window (lower is better) ·
y: held-out accuracy (higher is better).</div>
</div>
</div>
<!-- 14. live — fleet-wide live detections feed -->
<div class="stage-view" data-view="live">
<div class="metric-stack metric-stack-wide live-stack">
<div class="live-stats">
<span class="live-stats-eye">live detections</span>
<span class="live-stats-dot" id="live-stats-hosts">0 hosts</span>
<span class="live-stats-dot" id="live-stats-rate">0 / sec</span>
<span class="live-stats-dot" id="live-stats-model">model: —</span>
<span class="live-stats-dot" id="live-stats-acc">hit-rate: —</span>
</div>
<div class="live-lanes" id="live-lanes"></div>
<div class="live-latest" id="live-latest">
<div class="live-latest-empty">awaiting <code>live_detection</code> events from the inference loop</div>
</div>
</div>
</div>
<!-- 15. theoretical-contributions -->
<div class="stage-view" data-view="theoretical">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">theoretical contributions · what's new methodologically</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">window-centre labelling</div>
<div class="motivation-card-text">A 10-second
classification window is labelled by the phase that
occupies its centre, not by majority vote across the
window. Cleaner training signal at phase boundaries,
and avoids the spurious "ambiguous" class.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">schema-hashed checkpoints</div>
<div class="motivation-card-text">Each checkpoint
embeds a hash of the feature schema; loading a model
against the wrong schema fails fast instead of
silently scoring on misaligned columns. Makes
retroactive comparison reproducible.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">cross-host as the eval axis</div>
<div class="motivation-card-text">Held-out-by-host
is reported as a first-class number alongside
held-out-by-sample. The two often disagree by 0.4
macro-F1, and only the cross-host number predicts
fleet behaviour.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 16. practical-contributions -->
<div class="stage-view" data-view="practical">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">practical contributions · what others can use</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">/proc-only deployment</div>
<div class="motivation-card-text">No syscall hooks, no
eBPF, no kernel module — runs on hosts that don't
permit deep instrumentation. The detector is one
Python service plus a model file.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">producer-agnostic dashboard</div>
<div class="motivation-card-text">The deck consumes
typed events; the inference loop runs anywhere
(Pi, A100, cloud) and just POSTs back. Same UI for
a lab demo and an operational console.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">labelled dataset on disk</div>
<div class="motivation-card-text">78,000+ episodes,
five phases, two hosts, six attack profiles —
archived in zstd-compressed tarballs with a
schema-versioned format. Ready for downstream
work without re-running the orchestrator.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 17. design-principles -->
<div class="stage-view" data-view="design-principles">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">design principles · patterns that emerged</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-trust"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">one loop, many models</div>
<div class="motivation-card-text">Every NN architecture
plugs into the same training loop — class weights,
AMP, cosine LR, early stop. Architecture changes
don't ripple into orchestration.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-contain"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">typed events as contract</div>
<div class="motivation-card-text">Producers and
consumers agree on dataclasses
(<code>events.py</code>), not free-form dicts.
Adding a new scene means adding a new dataclass;
adding a new producer means importing it.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-recover"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">two-agent path ownership</div>
<div class="motivation-card-text">Dashboard work and
model work live in two parallel sessions with a
documented path-ownership boundary. Merges go
through git with explicit rebases instead of a
shared workspace.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 18. limitations -->
<div class="stage-view" data-view="limitations">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">limitations · the honest list</div>
<div class="motivation-cards">
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">two-host fleet</div>
<div class="motivation-card-text">Cross-host generalization
is reported between exactly two machines
(elliott-thinkpad → k-gamingcom). N-host claims need
more hosts on the WireGuard mesh.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">synthetic attack profiles</div>
<div class="motivation-card-text">Six profiles cover the
main shapes (cpu-saturate, ransomware-lite, bursty-c2,
fork-bomb, crypto-miner, distccd-exec) but real-world
malware can sit between or outside these envelopes.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">10 Hz sampling floor</div>
<div class="motivation-card-text">Sub-100 ms attack
behaviours fall inside a single sample. Detecting
extremely short-lived attacks (millisecond-scale
privilege checks) requires sampling faster than the
current 10 Hz <code>/proc</code> polling.</div>
</div>
</div>
<div class="motivation-card">
<div class="motivation-card-marker mc-armed"></div>
<div class="motivation-card-body">
<div class="motivation-card-title">KNN cross-host gap</div>
<div class="motivation-card-text">KNN scores val
macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on
the held-out k-gamingcom. It memorizes the training
host's feature space: informative as a baseline, but
not a deployment candidate.</div>
</div>
</div>
</div>
</div>
</div>
<!-- 19. conclusion-future — summary + unsupervised next steps -->
<div class="stage-view" data-view="conclusion-future">
<div class="metric-stack metric-stack-wide">
<div class="metric-eyebrow">conclusion + future work</div>
<div class="conclusion-grid">
<div class="conclusion-col">
<div class="conclusion-col-title">what we showed</div>
<ul class="conclusion-list">
<li>A per-host detector trained on
<strong>/proc-only telemetry</strong> can classify
workload phases at multi-class macro-F1 well above
chance.</li>
<li>Held-out-<strong>by-host</strong> evaluation is the
right generalization axis; held-out-by-sample
overstates real fleet performance by 0.3+ F1.</li>
<li>The recurrent family (LSTM/GRU) and Transformer
sit on the upper-left of the
<strong>accuracy-vs-cost frontier</strong>; KNN and
GBT round out the comparison as honest baselines.</li>
<li>The detector slots into a wider <strong>trust /
containment / recovery</strong> loop — the per-host
verdict isn't the final answer, it's one input.</li>
</ul>
</div>
<div class="conclusion-col">
<div class="conclusion-col-title">next steps · unsupervised</div>
<ul class="conclusion-list">
<li><strong>Clustering</strong> the unlabeled tail of
new fleet data (KMeans / HDBSCAN) to surface novel
workload shapes the supervised model has no class
for — a self-training feedback loop.</li>
<li><strong>Anomaly detection</strong> on the
last-layer embedding (one-class SVM, isolation forest)
so a "none of the five known phases" verdict is
available alongside the classifier output.</li>
<li><strong>Self-supervised pretraining</strong> on
the much larger pool of unlabeled telemetry from
operational hosts; supervised fine-tune on the
smaller orchestrated dataset.</li>
<li><strong>Embedding visualisation</strong> via
UMAP / t-SNE for human-in-the-loop labelling of
the unlabeled tail (already prototyped in the KNN scene).</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<button id="next-fab" class="fab" data-no-advance title="Next (→)"></button>
</div>
<article class="article">
<section class="scene" data-stage="intro">
<div class="prose">
<p class="lede">Most malware doesn't look like malware in a database
— it looks like a process behaving badly.</p>
<p>An <strong>intrusion detection system</strong> spots the bad
behavior; an <strong>intrusion prevention system</strong> stops it.
Both depend on knowing what bad behavior <em>looks like</em> at the
level of telemetry the device can actually see.</p>
<p>This deck is the live face of the dataset we're building to teach
a model that distinction — every panel on the left is a slice of
real data shipping in right now.</p>
<p class="hint">scroll, click, or → to advance</p>
</div>
</section>
<section class="scene" data-stage="motivation">
<div class="prose">
<h2>Why detect at all?</h2>
<p>Knowing a device is compromised is the precondition for everything
else. A classifier that says "this host is infected right now"
turns into three concrete operational capabilities — and each
one rewards a faster, more confident detector.</p>
<p><strong>Trust scoring across the network.</strong> Recent work
on per-device trust establishment
(<a href="https://ieeexplore.ieee.org/document/9881803"
target="_blank" rel="noopener">IEEE 9881803</a>) argues that
on-device metrics alone aren't enough — a fleet has to combine
local classifier verdicts with network-behaviour signals
(peer observations, gateway traffic patterns, inter-host
relationships) to score trust reliably. Our per-host detector
is one input to that broader signal.</p>
<p><strong>Containment.</strong> Once a host is flagged, the
gateway can drop its traffic and the IAM layer can revoke
credentials before lateral movement begins. Detection
latency translates directly into how much of the network
an attacker reaches.</p>
<p><strong>Quick recovery.</strong> A confirmed infection time
lets you restore from a snapshot taken just before the
compromise — no forensic dwell time, no guessing how far
back to roll. The recovery path becomes a one-button operation
instead of a week of cleanup.</p>
</div>
</section>
<section class="scene" data-stage="problem-statement">
<div class="prose">
<h2>Problem statement</h2>
<p>Today's behaviour-based intrusion detection systems rely on syscall traces,
kernel hooks, or rich endpoint agents that can't ship to
constrained or untrusted hosts. We want a detector that
runs on the only telemetry every modern Linux already
exports — <code>/proc</code> — and labels each ten-second
window of activity with the phase the workload is in.</p>
<p><strong>Research question.</strong> Can a sequence model
trained on twelve channels of <code>/proc</code> telemetry
classify five workload phases (clean / armed / infecting /
infected_running / dormant) accurately enough to drive
automated containment, <em>and</em> generalize across hosts
and malware profiles it has never seen during training?</p>
<p>The task is <strong>multi-class classification</strong>:
the target is one of five mutually-exclusive phase labels.
Not regression (no continuous target), not ranking
(downstream policy is a categorical containment decision).
We deliberately chose 10-second windows so detection
latency stays bounded for a real fleet.</p>
</div>
</section>
<section class="scene" data-stage="research-questions">
<div class="prose">
<h2>Research gaps + questions</h2>
<p>Literature on behaviour-based malware detection is rich but
uneven. Most published results either (a) use richer
telemetry than what a constrained host actually exports, or
(b) frame evaluation in ways that hide the cross-host
generalization problem. The card on the left summarises the
gap.</p>
<p>This project asks three concrete questions:</p>
<p><strong>RQ1.</strong> How well can a per-window classifier
identify workload phases from <code>/proc</code> alone, with
no syscall traces and no kernel hooks?</p>
<p><strong>RQ2.</strong> Does the model still work when test
episodes come from a host the training set never saw?</p>
<p><strong>RQ3.</strong> Of the standard sequence-model
families (RNN, GRU, LSTM, CNN, Transformer) plus a
non-parametric baseline (KNN) and a tabular baseline
(gradient-boosted trees), which trade off accuracy and
inference cost best for a deployment that has to run on a
constrained host?</p>
</div>
</section>
<section class="scene" data-stage="solution-overview">
<div class="prose">
<h2>Proposed solution</h2>
<p>A single end-to-end pipeline turns raw <code>/proc</code>
telemetry on a fleet host into a per-window phase verdict
in under a second. Each stage of the diagram on the left
is a thin, independently-deployable component — the
receiver doesn't know what model is running; the model
doesn't know where the episode came from.</p>
<p>The <strong>model zoo</strong> is the key abstraction:
every model class registers itself by name, declares its
input kind (summary features or window tensors), and plugs
into one shared training loop. KNN, GBT, MLP, CNN, RNN,
GRU, LSTM, and Transformer all reuse the same standardization,
schema-hashed checkpoint format, and held-out-by-host
evaluation, and the neural models share one class-weighted
cross-entropy loss, so the comparison is genuinely
apples-to-apples.</p>
<p>The detector's per-window verdict feeds two downstream
loops: a fleet-wide <strong>trust score</strong> that
combines local classification with network-behaviour
signals (per IEEE 9881803), and a <strong>fast-recovery</strong>
snapshot rollback when an infection time is known.</p>
</div>
</section>
<section class="scene" data-stage="stack">
<div class="prose">
<h2>Live, not staged</h2>
<p>Every panel from here on is real data from real devices —
counters, bars, the episode database, all driven by the
<code>cis490-receiver</code> service running on this Pi as
you scroll.</p>
<p>The code on the left is how it gets here. Four runtime deps:
<strong>starlette</strong> + <strong>uvicorn</strong> for the
async HTTP and WebSocket surface, <strong>msgpack</strong>
to talk to Metasploit's RPC, and <strong>pycdlib</strong> to build the
lab-VM cidata ISOs. Everything else is the standard library,
and every dep is annotated with a one-line reason it's there.</p>
</div>
</section>
<section class="scene" data-stage="collect">
<div class="prose">
<h2>Collecting the dataset</h2>
<p>Each lab host on the WireGuard mesh boots a real Alpine VM, runs
a profile-driven workload inside it, and samples
<code>/proc/&lt;qemu_pid&gt;</code> at 10&nbsp;Hz. Every ~30&nbsp;seconds
the labeled tarball is shipped to this Pi over mTLS.</p>
<p>The counter on the left is the running total, sourced from the
receiver's <code>index.jsonl</code> on disk. The sparkline is the
arrival rate over the last sixty seconds.</p>
</div>
</section>
<section class="scene" data-stage="hosts">
<div class="prose">
<h2>A multi-host fleet</h2>
<p>Running the same orchestrator on multiple hosts gives novel,
non-overlapping data per host — no central coordinator. Each host
pulls a different slice of the manifest, so the dataset grows in
parallel.</p>
<p>The numbers below are absolute episode counts on disk, refreshed
from <code>/var/lib/cis490/episodes/&lt;host&gt;/</code> every
thirty seconds.</p>
</div>
</section>
<section class="scene" data-stage="db">
<div class="prose">
<h2>The dataset, browsable</h2>
<p>Every row is one labeled episode tarball stored at
<code>/var/lib/cis490/episodes/&lt;host&gt;/&lt;id&gt;.tar.zst</code>
after the receiver verifies its SHA-256 and writes it through.</p>
<p>Filter by host with the tabs, or grep by host / episode id /
sha with the search box. Click a row for the full
<code>index.jsonl</code> record. The view holds the most recent
two hundred records — older history is on disk, indexable
from the receiver.</p>
</div>
</section>
<section class="scene" data-stage="baseline">
<div class="prose">
<h2>A baseline of normal</h2>
<p>Before we can detect a deviation, we have to know what the fleet
looks like across a wide slice of its life. The stacked bar
aggregates ground-truth phase labels across hundreds of randomly
sampled episodes from the dataset on disk — weighted by the time
the workload actually spent in each phase, not just the count of
transitions.</p>
<p>If the model only ever sees <code>clean</code>, it overfits to
"everything is fine." The phase schedule fixes that by forcing
every run to walk through every phase, which is why
<code>infected_running</code> dominates the mix — that's where
the labelled attack workload sits.</p>
</div>
</section>
<section class="scene" data-stage="attacks">
<div class="prose">
<h2>Linking attack to telemetry</h2>
<p>The same six profiles run across every host, and each one
produces a different envelope in <code>/proc</code>. A
cryptominer pegs one core for minutes. A bursty C2 channel sits
idle, then exhales three packets. Ransomware walks the
filesystem and saturates I/O.</p>
<p>The thumbnails on the left are the canonical envelopes the
model has to learn to recognize — same axes, different shapes.
That shape difference is what makes detection tractable.</p>
</div>
</section>
<section class="scene" data-stage="chunking">
<div class="prose">
<h2>Ten-second windows</h2>
<p>Models eat fixed-size inputs. We chop each episode into
10-second windows — 100 samples per window at 10&nbsp;Hz — and
label each window with the phase that occupies its center.</p>
<p>Window size is a knob. Too short and the model can't see slow
envelopes (low-and-slow malware, idle C2). Too long and you can't
react fast enough to be a useful prevention signal. Ten seconds
is the starting point we tune around.</p>
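<p>A minimal sketch of the chunking step, assuming non-overlapping
windows and one ground-truth phase label per sample. The function
name <code>centre_labelled_windows</code> and the toy episode are
illustrative, not taken from the project's code.</p>

```python
import numpy as np

WINDOW = 100  # samples per window: 10 s at 10 Hz


def centre_labelled_windows(samples, phases, window=WINDOW):
    """Split an episode into non-overlapping windows labelled by their centre sample.

    samples: array of shape (T, channels); phases: sequence of T per-sample labels.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    samples = np.asarray(samples)
    n = len(samples) // window
    xs = samples[: n * window].reshape(n, window, samples.shape[1])
    ys = [phases[i * window + window // 2] for i in range(n)]
    return xs, ys


# Toy episode: 250 samples, 12 channels; clean for 140 samples, then armed.
episode = np.zeros((250, 12))
labels = ["clean"] * 140 + ["armed"] * 110
xs, ys = centre_labelled_windows(episode, labels)
print(xs.shape, ys)
```

<p>The second window straddles the transition, and its label comes
from whichever phase holds the centre sample.</p>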
</div>
</section>
<section class="scene" data-stage="evaluation-setup">
<div class="prose">
<h2>Evaluation setup</h2>
<p>Three choices anchor every result on the next slides — the
split recipe, the primary metric, and what we measure next
to accuracy. The temptation is to report a single big
number; we report a number you can argue with.</p>
<p><strong>Held-out by host.</strong> Train and validate on
one machine; test on a different machine. A model that
wins by memorising the train host's idle profile loses
here, which is what you want — a fleet detector has to
generalize across hosts it never saw at training time.</p>
<p><strong>Macro-F1, not accuracy.</strong> The dataset is
heavily skewed: roughly half the labelled time is
<code>infected_running</code> and only ~5 % is
<code>armed</code>. A "predict the majority class"
baseline already hits 0.5 accuracy. Macro-F1 averages F1
across all five phases so rare classes count.</p>
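<p>To make the gap concrete, here is a small sketch that scores a
majority-class baseline both ways. The label mix is illustrative
only (half <code>infected_running</code>, 5&nbsp;% <code>armed</code>,
the rest invented); it is not the real dataset distribution.</p>

```python
PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def macro_f1(y_true, y_pred, classes=PHASES):
    """Unweighted mean of per-class F1; a class that is never hit scores 0."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative label mix, not measured data.
y_true = (["infected_running"] * 50 + ["clean"] * 20 + ["dormant"] * 15
          + ["infecting"] * 10 + ["armed"] * 5)
y_pred = ["infected_running"] * len(y_true)  # majority-class baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))
```

<p>On this toy mix the baseline reaches 0.5 accuracy but only about
0.13 macro-F1, because four of the five classes contribute an F1 of
zero.</p>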
<p><strong>Latency reported with accuracy.</strong> A model
that's one F1 point better but ten milliseconds slower
may still be the wrong choice for an on-host detector.
The perf scene plots both axes so the trade-off is visible.</p>
</div>
</section>
<section class="scene" data-stage="models">
<div class="prose">
<h2>Sequence models</h2>
<p><strong>RNN, GRU, LSTM</strong> — recurrent models that read the
window one timestep at a time and carry state forward. Cheap,
mature, easy to interpret.</p>
<p><strong>BERT-style transformer</strong> — the window becomes a
sequence of "tokens"; attention captures cross-position context
instead of accumulating it through a hidden state. More
parameters, more compute, more room to overfit a small dataset.</p>
<p>Same input, same labels, four different inductive biases. The
comparison on the left is the punchline of the whole project.</p>
</div>
</section>
<section class="scene" data-stage="training-code">
<div class="prose">
<h2>How we trained them</h2>
<p>One trainer per model — load the windowed dataset, define the
network, train, evaluate. Same shape for RNN, GRU, LSTM, BERT,
so you can read all four side-by-side and the only differences
are the architecture itself.</p>
<p>The code on the left is the LSTM trainer.
PyTorch's <code>DataLoader</code> handles batching,
<code>nn.LSTM</code> is one line, the loop is six.
No custom loss, no rate schedule, no manual batching —
anything fancier has to earn its place by beating the simple
version on held-out samples.</p>
</div>
</section>
<section class="scene" data-stage="knn">
<div class="prose">
<h2>Nearest-neighbor as a sanity check</h2>
<p>Before anything fancy: engineer summary features per window
(mean, std, p95, slope, zero-bucket counts per channel) and run
<strong>KNN</strong> in that feature space.</p>
<p>If the phase clusters separate visibly in two dimensions, KNN
already does most of the work and a deep model is only buying
marginal improvement. If they don't separate, you've learned
something about the feature engineering before training a single
epoch.</p>
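<p>A sketch of the per-window summary features, under two
assumptions: "slope" is a least-squares linear fit over the window,
and "zero-bucket count" is the number of samples that are exactly
zero. The function name <code>summary_features</code> is
illustrative.</p>

```python
import numpy as np


def summary_features(window):
    """Collapse one (timesteps, channels) window into a flat feature vector.

    Per channel: mean, std, 95th percentile, least-squares slope, zero-sample count.
    """
    window = np.asarray(window, dtype=float)
    t = np.arange(window.shape[0])
    slope = np.polyfit(t, window, 1)[0]  # row 0 holds the slope of each channel
    feats = [
        window.mean(axis=0),
        window.std(axis=0),
        np.percentile(window, 95, axis=0),
        slope,
        (window == 0).sum(axis=0),
    ]
    return np.concatenate(feats)


window = np.zeros((100, 12))
window[:, 0] = np.arange(100)  # channel 0 ramps by 1 per sample
vec = summary_features(window)
print(vec.shape)
```

<p>With twelve channels and five statistics each, every window
becomes a 60-dimensional point that KNN can compare by distance.</p>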
</div>
</section>
<section class="scene" data-stage="perf">
<div class="prose">
<h2>Accuracy vs complexity</h2>
<p>Bigger models earn better numbers on the validation set, but
they also need more parameters, more inference time, and more
memory at the edge. The deployed model has to fit on the device
it's protecting.</p>
<p>The scatter on the left is the usable trade-off curve: every
point above and to the left of where you currently sit is a
reachable upgrade. The point in the bottom-right is a model
you'd never ship.</p>
</div>
</section>
<section class="scene" data-stage="live">
<div class="prose">
<h2>Catching attacks live</h2>
<p>Real episodes arrive from the fleet, get chunked into ten-second
windows, and a deployed model labels each window in flight. The
heavy models can offload inference to an <strong>A100</strong>
so the receiver never blocks on a forward pass — predictions
stream back as they finish.</p>
<p>Each row on the stage is a host; each cell is one ten-second
window painted by the model's predicted phase. A clean run
cruises blue; an attack profile pushes the lane through
<code>armed</code> → <code>infecting</code> →
<code>infected_running</code>. When ground truth catches up,
mismatched cells get a hatched overlay so you can spot where
the model disagrees with the orchestrator. The callout below
holds the most recent prediction with model name,
confidence, and round-trip latency.</p>
</div>
</section>
<section class="scene" data-stage="theoretical">
<div class="prose">
<h2>Theoretical contributions</h2>
<p>Three methodological claims this project makes — small in
isolation, but together they change how the comparison is
run. Each shows up explicitly in the codebase.</p>
<p><strong>Window-centre labelling.</strong> Instead of
majority-voting phase labels across each 10-second window
(which creates noisy boundaries), we label each window by
the phase that occupies its centre. Cleaner training
signal at transitions, no spurious "ambiguous" class.</p>
<p><strong>Schema-hashed checkpoints.</strong> Every
checkpoint embeds a hash of the feature schema it was
trained on. Loading a model against a different schema
fails fast. Without this, retroactive comparison silently
scores models on misaligned columns and reports nonsense.</p>
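<p>One way the fail-fast check could look, as a sketch only: the
checkpoint layout, the helper names, and the channel names below are
hypothetical, and the project's actual format may differ.</p>

```python
import hashlib
import json


def schema_hash(channels):
    """Stable SHA-256 over the ordered channel names."""
    payload = json.dumps(list(channels), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_checkpoint(weights, channels):
    """Bundle weights with the hash of the schema they were trained on."""
    return {"schema_hash": schema_hash(channels), "weights": weights}


def load_checkpoint(ckpt, channels):
    """Return the weights, or raise if the checkpoint used another schema."""
    if ckpt["schema_hash"] != schema_hash(channels):
        raise ValueError("feature schema mismatch: refusing to load checkpoint")
    return ckpt["weights"]


# Hypothetical channel names, for illustration only.
ckpt = make_checkpoint({"w": [0.1, 0.2]}, ["cpu_user", "cpu_sys", "rss"])
load_checkpoint(ckpt, ["cpu_user", "cpu_sys", "rss"])  # same schema: loads
try:
    load_checkpoint(ckpt, ["cpu_sys", "cpu_user", "rss"])  # reordered: rejected
except ValueError as err:
    print(err)
```

<p>Hashing the ordered list means a reordered column set is rejected
as well as a renamed one, which is the misalignment case the check
exists to catch.</p>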
<p><strong>Cross-host as the eval axis.</strong>
Held-out-by-host is reported as a first-class number
alongside held-out-by-sample — the two often disagree by
~0.4 macro-F1, and only the cross-host number predicts
real fleet behaviour.</p>
</div>
</section>
<section class="scene" data-stage="practical">
<div class="prose">
<h2>Practical contributions</h2>
<p>What others can pick up and use from this project — beyond
the published numbers.</p>
<p><strong>/proc-only deployment.</strong> The detector needs
no syscall hooks, no eBPF, no kernel module. It runs on
hosts that don't permit deeper instrumentation — a small
VM, a container with limited capabilities, an embedded
device. One Python service plus a model file.</p>
<p><strong>Producer-agnostic dashboard.</strong> The deck
consumes typed events
(<code>training/dashboard/events.py</code>); the inference
loop runs anywhere — Pi, A100, cloud — and just POSTs back.
Same UI for a lab demo and an operational console.</p>
<p><strong>Labelled dataset on disk.</strong> 78 000+
episodes across two hosts and six attack profiles, archived
in zstd-compressed tarballs with a schema-versioned format.
Anyone reproducing or extending this work can start from
the dataset directly without re-running the orchestrator.</p>
</div>
</section>
<section class="scene" data-stage="design-principles">
<div class="prose">
<h2>Design principles</h2>
<p>Three patterns that emerged during the project and earned
their keep enough that we'd repeat them.</p>
<p><strong>One loop, many models.</strong> Every NN
architecture plugs into the same training loop — class
weights, AMP autocast, cosine LR with warmup, gradient
clipping, early stop on val macro-F1. Architecture changes
don't ripple into orchestration, and adding a new model
class costs ~80 lines.</p>
<p><strong>Typed events as contract.</strong> Producers and
consumers agree on dataclasses, not free-form dicts.
Adding a new dashboard scene means adding a new dataclass;
adding a new producer means importing it. Static checking
and editor autocomplete do most of the work that a
schema-validation library would do at runtime.</p>
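<p>A sketch of what one such event might look like. The class
<code>WindowPrediction</code>, its fields, and the
<code>encode</code> helper are hypothetical and are not copied from
<code>training/dashboard/events.py</code>.</p>

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class WindowPrediction:
    """One per-window verdict sent from a producer to the dashboard."""
    host: str
    model: str
    phase: str
    confidence: float
    latency_ms: float


def encode(event):
    """Serialise an event with a type tag so a consumer can dispatch on it."""
    return json.dumps({"type": type(event).__name__, **asdict(event)})


event = WindowPrediction("elliott-thinkpad", "lstm", "armed", 0.91, 12.5)
print(encode(event))
```

<p>A producer that misspells a field fails at construction time,
before anything reaches the wire.</p>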
<p><strong>Two-agent path ownership.</strong> Dashboard work
and model work live in two parallel sessions with a
documented path-ownership boundary
(<code>training/dashboard/</code> vs everywhere else).
Merges go through git with explicit rebases instead of a
shared workspace — slow up front, fewer subtle stomps
over time.</p>
</div>
</section>
<section class="scene" data-stage="limitations">
<div class="prose">
<h2>Limitations</h2>
<p>What this project cannot honestly claim — and why each
line on the left matters for how the results should be read.</p>
<p><strong>Two-host fleet.</strong> Cross-host generalization
is reported between exactly two machines; it's the right
<em>shape</em> of evaluation but not a population claim.
More hosts on the WireGuard mesh would let us report
distributional bounds rather than single point comparisons.</p>
<p><strong>Synthetic attack profiles.</strong> Our six
profiles cover the main behavioural envelopes
(cpu-saturate, ransomware-lite, bursty-c2, fork-bomb,
crypto-miner, distccd-exec) but real-world malware can
sit between or outside these envelopes. Generalization to
unseen profiles is reported via held-out-by-sample, but
in-the-wild distribution shift is unknown.</p>
<p><strong>10 Hz sampling floor.</strong> Behaviours shorter
than 100&nbsp;ms fall inside a single sampling interval.
Detection of millisecond-scale privilege checks would need
faster telemetry than our 10&nbsp;Hz <code>/proc</code> poll provides.</p>
<p><strong>KNN cross-host gap.</strong> KNN scores val
macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the
held-out one. That gap is instance-based memorization of the
training host's feature space: informative as a baseline, not a
deployment candidate.</p>
</div>
</section>
<section class="scene" data-stage="conclusion-future">
<div class="prose">
<h2>Conclusion + future work</h2>
<p>A per-host classifier trained on <code>/proc</code>-only
telemetry can identify workload phases at multi-class
macro-F1 well above chance and slot into a wider
trust / containment / recovery loop. The recurrent family
(LSTM/GRU) and Transformer sit on the upper-left of the
accuracy-vs-cost frontier; KNN and GBT are honest baselines.
Held-out-by-host evaluation is the right generalization
axis — held-out-by-sample overstates real fleet
performance by 0.3+ F1.</p>
<p><strong>Next steps.</strong> The natural
extensions are unsupervised:</p>
<p><strong>Clustering</strong> the unlabeled tail of new
fleet data (KMeans / HDBSCAN) to surface novel workload
shapes the supervised model has no class for — a
self-training feedback loop that enrolls new phases as
the fleet grows.</p>
<p><strong>Anomaly detection</strong> on the last-layer
embedding (one-class SVM, isolation forest) so a "none of
the five known phases" verdict is available alongside the
classifier output.</p>
<p><strong>Self-supervised pretraining</strong> on the much
larger pool of unlabeled telemetry from operational hosts;
supervised fine-tune on the smaller orchestrated dataset.</p>
<p><strong>Embedding visualisation</strong> via UMAP /
t-SNE for human-in-the-loop labelling — already prototyped
in the KNN scene's interactive 3-D scatter.</p>
</div>
</section>
<section class="scene" data-stage="references">
<div class="prose">
<h2>References</h2>
<p>The papers, notes, and prior work this project leans on.
Pick a tab on the left to load the document; the viewer
takes the bulk of the stage so you can scroll through
without leaving the deck.</p>
<p class="hint">end of deck · ← to flip back</p>
</div>
</section>
<div class="scene-end-spacer"></div>
</article>
</div>
<script src="/static/dashboard.js?v=7c6859eb"></script>
</body>
</html>