From 53d2b80009c826354d0d9dc6971dea7b0b4a4150 Mon Sep 17 00:00:00 2001
From: Max Gorog
Date: Fri, 8 May 2026 19:06:57 -0500
Subject: [PATCH] deck: remove the nine inserted scenes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per the user's request: the rubric-derived scenes I added in one sweep
weren't tied closely enough to their actual project narrative and ate
up presentation time. Reverting to the pre-insertion deck:

  removed
    problem-statement / research-questions / solution-overview /
    evaluation-setup / theoretical / practical / design-principles /
    limitations / conclusion-future

  kept (user-requested earlier in the session)
    motivation (with the IEEE 9881803 citation)
    live (A100 inference scene)

CSS rules and references/* sidecar files for the removed scenes are
left in place as harmless dead code; they can be cleaned up later.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 training/dashboard/static/index.html | 669 +--------------------------
 1 file changed, 2 insertions(+), 667 deletions(-)

diff --git a/training/dashboard/static/index.html b/training/dashboard/static/index.html
index 1130ffe..166ce5d 100644
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -219,144 +219,7 @@
-
-
-
-
the problem · single sentence + numbers
-
-
Classify each ten-second window of fleet - /proc telemetry into one of five workload phases — - accurately enough to drive automated containment.
-
-
-
-
5
-
phase classes
cleaninfected_running
-
-
-
12
-
/proc channels
no syscalls, no kernel hooks
-
-
-
10s
-
classification window
100 samples × 12 channels
-
-
-
- task type: - multi-class classification - — five mutually-exclusive - phase labels, balanced via class-weighted cross-entropy. - Not regression (no continuous target), not ranking - (downstream policy is a categorical containment decision). -
-
-
- - -
-
-
literature gaps · positioning the work
-
-
-
what prior work covers
-
    -
  • LSTM on syscall traces in VMs — - deeper telemetry than /proc
  • -
  • Transformer on per-process resource metrics - — related signal, single-host eval
  • -
  • BERT on system logs (LogBERT) — - text-form telemetry, not numeric channels
  • -
  • Insider-threat LSTM on event logs - (DANTE) — categorical events, not continuous
  • -
  • Network-behaviour trust establishment - (IEEE 9881803) — cross-device aggregation, - not per-host classifier
  • -
-
-
-
what's missing
-
    -
  • /proc-only signal — most work - assumes syscalls or kernel hooks
  • -
  • Sample-stratified evaluation — - papers often hide same-sample overfit by training - and testing on the same malware instances
  • -
  • Real-time per-window classification - for containment, not post-hoc batch labelling
  • -
  • Side-by-side cell-choice comparison - (RNN/GRU/LSTM/CNN/Transformer) on one dataset
  • -
  • Direct integration with a - fleet-wide trust score, not standalone output
  • -
-
-
-
-
- - -
-
-
pipeline · what each stage produces
-
- fleet hosts
- /proc · 10 Hz
-
- receiver (Pi)
- bearer auth
-
- episode store
- zstd · tar
-
- windowing + features
- 10 s · 100 samples × 12 ch
-
- model zoo
- KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer
- trained per (model × split-recipe)
- held-out-by-sample · class-weighted CE · early stop on val macro-F1
-
- per-window phase
- 5-class softmax
-
- trust score
- + network signals (9881803)
-
- containment + reset
- snapshot rollback
-
-
- - +
the stack behind the live data on the right
@@ -453,66 +316,6 @@
- -
-
-
evaluation setup · how the numbers get made
-
-
-
split recipe
-
-
train / val / test: held-out by - sample_name, profile-stratified
-
both hosts contribute to all three slices
-
the fleet is uniform — every - host runs the same orchestrator and every profile — - so we don't split by host. We split by malware - sample_name: the specific instances in - the test set never appear during training. - Generalization axis is "unseen malware", not - "unseen device". Two profiles with only one sample - (cpu-saturate, low-and-slow) are excluded from - held-out-by-sample eval and reported separately.
-
-
-
-
primary metric
-
-
macro-F1 averaged across the five phases
-
accuracy lies under class - imbalance — ~50 % infected_running, - ~5 % armed. A constant majority predictor - hits 0.5 accuracy. macro-F1 averages per-class F1, - so rare phases actually count toward the score.
-
-
-
-
baselines compared
-
-
KNN — non-parametric, instance-based
-
GBT (XGBoost) — tabular non-NN
-
MLP — feedforward ablation
-
CNN — local-pattern ablation
-
RNN / GRU / LSTM — recurrent family
-
Transformer — attention
-
-
-
-
reported alongside accuracy
-
-
μs / window — inference cost at batch=64
-
val ↔ test gap — val − test macro-F1
-
latency translates to - containment lag; the val ↔ test gap is the honest - measure of how much accuracy survives the move from - "samples we saw" to "samples we didn't". Both plot - on the perf scene.
-
-
-
-
-
-
@@ -600,234 +403,6 @@
- -
-
-
theoretical contributions · what's new methodologically
-
-
-
-
-
window-centre labelling
-
A 10-second - classification window is labelled by the phase that - occupies its centre, not by majority vote across the - window. Cleaner training signal at phase boundaries, - and avoids the spurious "ambiguous" class.
-
-
-
-
-
-
schema-hashed checkpoints
-
Each checkpoint - embeds a hash of the feature schema; loading a model - against the wrong schema fails fast instead of - silently scoring on misaligned columns. Makes - retroactive comparison reproducible.
-
-
-
-
-
-
held-out-by-sample as the eval axis
-
The hosts in the - fleet are uniform — same orchestrator, same workload, - different production rates. The generalization claim - is therefore "unseen malware sample", tested on the - same population of devices the training data came - from. Profile-stratified so every profile gets fair - train/val/test cells.
-
-
-
-
-
- - -
-
-
practical contributions · what others can use
-
-
-
-
-
/proc-only deployment
-
No syscall hooks, no - eBPF, no kernel module — runs on hosts that don't - permit deep instrumentation. The detector is one - Python service plus a model file.
-
-
-
-
-
-
producer-agnostic dashboard
-
The deck consumes - typed events; the inference loop runs anywhere - (Pi, A100, cloud) and just POSTs back. Same UI for - a lab demo and an operational console.
-
-
-
-
-
-
labelled dataset on disk
-
78,000+ episodes, - five phases, two hosts, six attack profiles — - archived in zstd-compressed tarballs with a - schema-versioned format. Ready for downstream - work without re-running the orchestrator.
-
-
-
-
-
- - -
-
-
design principles · patterns that emerged
-
-
-
-
-
one loop, many models
-
Every NN architecture - plugs into the same training loop — class weights, - AMP, cosine LR, early stop. Architecture changes - don't ripple into orchestration.
-
-
-
-
-
-
typed events as contract
-
Producers and - consumers agree on dataclasses - (events.py), not free-form dicts. - Adding a new scene means adding a new dataclass; - adding a new producer means importing it.
-
-
-
-
-
-
two-agent path ownership
-
Dashboard work and - model work live in two parallel sessions with a - documented path-ownership boundary. Merges go - through git with explicit rebases instead of a - shared workspace.
-
-
-
-
-
- - -
-
-
limitations · the honest list
-
-
-
-
-
two-host fleet
-
Both hosts contribute - to train, val, and test, but the device population - is small (n = 2). Adding more hosts on the WireGuard - mesh wouldn't change the split recipe but would make - the dataset more representative of real-world - hardware variety.
-
-
-
-
-
-
synthetic attack profiles
-
Six profiles cover the - main shapes (cpu-saturate, ransomware-lite, bursty-c2, - fork-bomb, crypto-miner, distccd-exec) but real-world - malware can sit between or outside these envelopes.
-
-
-
-
-
-
10 Hz sampling floor
-
Sub-100ms attack - behaviours fall inside a single sample. Detection of - extremely short-lived attacks (millisecond-scale - privilege checks) requires faster sampling than - /proc currently provides.
-
-
-
-
-
-
KNN val ↔ test gap
-
KNN scores val - macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 - on held-out sample_names. Instance-based memorization - of the specific training samples — informative as a - baseline, not a deployment candidate.
-
-
-
-
-
- - -
-
-
conclusion + future work
-
-
-
what we showed
-
    -
  • A per-host detector trained on - /proc-only telemetry can classify - workload phases at multi-class macro-F1 well above - chance.
  • -
  • Held-out-by-sample, - profile-stratified, is the right generalization - axis: both fleet hosts contribute to all three - slices, and the test set's - sample_names never appear during - training.
  • -
  • The recurrent family (LSTM/GRU) and Transformer - sit on the upper-left of the - accuracy-vs-cost frontier; KNN and - GBT round out the comparison as honest baselines.
  • -
  • The detector slots into a wider trust / - containment / recovery loop — the per-host - verdict isn't the final answer, it's one input.
  • -
-
-
-
next steps · unsupervised
-
    -
  • Clustering the unlabeled tail of - new fleet data (KMeans / HDBSCAN) to surface novel - workload shapes the supervised model has no class - for — a self-training feedback loop.
  • -
  • Anomaly detection on the - last-layer embedding (one-class SVM, isolation forest) - so a "none of the five known phases" verdict is - available alongside the classifier output.
  • -
  • Self-supervised pretraining on - the much larger pool of unlabeled telemetry from - operational hosts; supervised fine-tune on the - smaller orchestrated dataset.
  • -
  • Embedding visualisation via - UMAP / t-SNE for human-in-the-loop labelling of - the unlabeled tail (already prototyped in scene 12).
  • -
-
-
-
-
@@ -893,80 +468,6 @@ -
-
-

Problem statement

-

Today's behaviour-based IDS systems rely on syscall traces, - kernel hooks, or rich endpoint agents that can't ship to - constrained or untrusted hosts. We want a detector that - runs on the only telemetry every modern Linux already - exports — /proc — and labels each ten-second - window of activity with the phase the workload is in.

-

Research question. Can a sequence model - trained on twelve channels of /proc telemetry - classify five workload phases (clean / armed / infecting / - infected_running / dormant) accurately enough to drive - automated containment, and generalize to malware - sample_names it has never seen during training?

-

The task is multi-class classification: - the target is one of five mutually-exclusive phase labels. - Not regression (no continuous target), not ranking - (downstream policy is a categorical containment decision). - We deliberately chose 10-second windows so detection - latency stays bounded for a real fleet.

-
-
- -
-
-

Research gaps + questions

-

Literature on behaviour-based malware detection is rich but - uneven. Most published results either (a) use richer - telemetry than what a constrained host actually exports, or - (b) frame evaluation in ways that hide same-sample overfit - (training and testing on the same malware instances). The card on the left summarises the - gap.

-

This project asks three concrete questions:

-

RQ1. How well can a per-window classifier - identify workload phases from /proc alone, with - no syscall traces and no kernel hooks?

-

RQ2. Does the model still work on - sample_names the training set never saw — - i.e., new instances of malware profiles it does know?

-

RQ3. Of the standard sequence-model - families (RNN, GRU, LSTM, CNN, Transformer) plus a - non-parametric baseline (KNN) and a tabular baseline - (gradient-boosted trees), which trade off accuracy and - inference cost best for a deployment that has to run on a - constrained host?

-
-
- -
-
-

Proposed solution

-

A single end-to-end pipeline turns raw /proc - telemetry on a fleet host into a per-window phase verdict - in under a second. Each stage of the diagram on the left - is a thin, independently-deployable component — the - receiver doesn't know what model is running; the model - doesn't know where the episode came from.

-

The model zoo is the key abstraction: - every model class registers itself by name, declares its - input kind (summary features or window tensors), and plugs - into one shared training loop. KNN, GBT, MLP, CNN, RNN, - GRU, LSTM, and Transformer all reuse the same standardization, - schema-hashed checkpoint format, class-weighted CE loss, - and held-out-by-sample evaluation — so the comparison is - genuinely apples-to-apples.

-

The detector's per-window verdict feeds two downstream - loops: a fleet-wide trust score that - combines local classification with network-behaviour - signals (per IEEE 9881803), and a fast-recovery - snapshot rollback when an infection time is known.

-
-
-

Live, not staged

@@ -1054,35 +555,6 @@
-
-
-

Evaluation setup

-

Three choices anchor every result on the next slides — the - split recipe, the primary metric, and what we measure next - to accuracy. The temptation is to report a single big - number; we report a number you can argue with.

-

Held-out by sample_name, - profile-stratified. The fleet is uniform — every - host runs the same orchestrator and the same set of - profiles — so we don't split by device. Both hosts - contribute data to train, val, and test. What's held out is - specific malware instances: the - sample_names in the test set never appear - during training. The model has to generalize to unseen - samples, not unseen devices.

-

Macro-F1, not accuracy. The dataset is - heavily skewed: roughly half the labelled time is - infected_running and only ~5 % is - armed. A "predict the majority class" - baseline already hits 0.5 accuracy. Macro-F1 averages F1 - across all five phases so rare classes count.

-

Latency reported with accuracy. A model - that's one F1 point better but ten milliseconds slower - may still be the wrong choice for an on-host detector. - The perf scene plots both axes so the trade-off is visible.

-
-
-

How we trained them

@@ -1161,143 +633,6 @@
-
-
-

Theoretical contributions

-

Three methodological claims this project makes — small in - isolation, but together they change how the comparison is - run. Each shows up explicitly in the codebase.

-

Window-centre labelling. Instead of - majority-voting phase labels across each 10-second window - (which creates noisy boundaries), we label each window by - the phase that occupies its centre. Cleaner training - signal at transitions, no spurious "ambiguous" class.

-

Schema-hashed checkpoints. Every - checkpoint embeds a hash of the feature schema it was - trained on. Loading a model against a different schema - fails fast. Without this, retroactive comparison silently - scores models on misaligned columns and reports nonsense.

-

Held-out-by-sample, profile-stratified. - Hosts in the fleet are uniform — same orchestrator, same - workload, just different production rates — so we split by - malware sample_name instead of by device. The - generalization claim is "unseen malware sample", tested on - the same population of hosts that contributed the training - data.

-
-
- -
-
-

Practical contributions

-

What others can pick up and use from this project — beyond - the published numbers.

-

/proc-only deployment. The detector needs - no syscall hooks, no eBPF, no kernel module. It runs on - hosts that don't permit deeper instrumentation — a small - VM, a container with limited capabilities, an embedded - device. One Python service plus a model file.

-

Producer-agnostic dashboard. The deck - consumes typed events - (training/dashboard/events.py); the inference - loop runs anywhere — Pi, A100, cloud — and just POSTs back. - Same UI for a lab demo and an operational console.

-

Labelled dataset on disk. 78 000+ - episodes across two hosts and six attack profiles, archived - in zstd-compressed tarballs with a schema-versioned format. - Anyone reproducing or extending this work can start from - the dataset directly without re-running the orchestrator.

-
-
- -
-
-

Design principles

-

Three patterns that emerged during the project and earned - their keep enough that we'd repeat them.

-

One loop, many models. Every NN - architecture plugs into the same training loop — class - weights, AMP autocast, cosine LR with warmup, gradient - clipping, early stop on val macro-F1. Architecture changes - don't ripple into orchestration, and adding a new model - class costs ~80 lines.

-

Typed events as contract. Producers and - consumers agree on dataclasses, not free-form dicts. - Adding a new dashboard scene means adding a new dataclass; - adding a new producer means importing it. Static checking - and editor autocomplete do most of the work that a - schema-validation library would do at runtime.

-

Two-agent path ownership. Dashboard work - and model work live in two parallel sessions with a - documented path-ownership boundary - (training/dashboard/ vs everywhere else). - Merges go through git with explicit rebases instead of a - shared workspace — slow up front, fewer subtle stomps - over time.

-
-
- -
-
-

Limitations

-

What this project cannot honestly claim — and why each - line on the left matters for how the results should be read.

-

Two-host fleet. Cross-host generalization - is reported between exactly two machines; it's the right - shape of evaluation but not a population claim. - More hosts on the WireGuard mesh would let us report - distributional bounds rather than single point comparisons.

-

Synthetic attack profiles. Our six - profiles cover the main behavioural envelopes - (cpu-saturate, ransomware-lite, bursty-c2, fork-bomb, - crypto-miner, distccd-exec) but real-world malware can - sit between or outside these envelopes. Generalization to - unseen profiles is reported via held-out-by-sample, but - in-the-wild distribution shift is unknown.

-

10 Hz sampling floor. Sub-100ms - behaviours fall inside a single sample. Detection of - millisecond-scale privilege checks would need faster - telemetry than /proc provides.

-

KNN val ↔ test gap. KNN scores val - macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on - held-out sample_names. Instance-based - memorization of the specific training samples — informative - as a baseline, not a deployment candidate.

-
-
- -
-
-

Conclusion + future work

-

A per-host classifier trained on /proc-only - telemetry can identify workload phases at multi-class - macro-F1 well above chance and slot into a wider - trust / containment / recovery loop. The recurrent family - (LSTM/GRU) and Transformer sit on the upper-left of the - accuracy-vs-cost frontier; KNN and GBT are honest baselines. - Held-out-by-host evaluation is the right generalization - axis — held-out-by-sample overstates real fleet - performance by 0.3+ F1.

-

Unsupervised next steps. The natural - extensions are unsupervised:

-

Clustering the unlabeled tail of new - fleet data (KMeans / HDBSCAN) to surface novel workload - shapes the supervised model has no class for — a - self-training feedback loop that enrolls new phases as - the fleet grows.

-

Anomaly detection on the last-layer - embedding (one-class SVM, isolation forest) so a "none of - the five known phases" verdict is available alongside the - classifier output.

-

Self-supervised pretraining on the much - larger pool of unlabeled telemetry from operational hosts; - supervised fine-tune on the smaller orchestrated dataset.

-

Embedding visualisation via UMAP / - t-SNE for human-in-the-loop labelling — already prototyped - in the KNN scene's interactive 3-D scatter.

-
-
-

References

@@ -1313,6 +648,6 @@
- +