diff --git a/training/dashboard/static/index.html b/training/dashboard/static/index.html
index 1130ffe..166ce5d 100644
--- a/training/dashboard/static/index.html
+++ b/training/dashboard/static/index.html
@@ -219,144 +219,7 @@
-
-
/proc telemetry into one of five workload phases —
- accurately enough to drive automated containment.
clean → infected_running
/proc channels
/proc
sample_name, profile-stratified
sample_name: the specific instances in
- the test set never appear during training.
- Generalization axis is "unseen malware", not
- "unseen device". Two profiles with only one sample
- (cpu-saturate, low-and-slow) are excluded from
- held-out-by-sample eval and reported separately.
infected_running,
- ~5 % armed. A constant majority predictor
- hits 0.5 accuracy. macro-F1 averages per-class F1,
- so rare phases actually count toward the score.
events.py), not free-form dicts.
- Adding a new scene means adding a new dataclass;
- adding a new producer means importing it.
/proc currently provides.
sample_names never appear during
- training.
Today's behaviour-based IDS systems rely on syscall traces,
- kernel hooks, or rich endpoint agents that can't ship to
- constrained or untrusted hosts. We want a detector that
- runs on the only telemetry every modern Linux already
- exports — /proc — and labels each ten-second
- window of activity with the phase the workload is in.
Research question. Can a sequence model
- trained on twelve channels of /proc telemetry
- classify five workload phases (clean / armed / infecting /
- infected_running / dormant) accurately enough to drive
- automated containment, and generalize to malware
- sample_names it has never seen during training?
The task is multi-class classification: - the target is one of five mutually-exclusive phase labels. - Not regression (no continuous target), not ranking - (downstream policy is a categorical containment decision). - We deliberately chose 10-second windows so detection - latency stays bounded for a real fleet.
-Literature on behaviour-based malware detection is rich but - uneven. Most published results either (a) use richer - telemetry than what a constrained host actually exports, or - (b) frame evaluation in ways that hide same-sample overfit - (training and testing on the same malware instances). The card on the left summarises the - gap.
-This project asks three concrete questions:
-RQ1. How well can a per-window classifier
- identify workload phases from /proc alone, with
- no syscall traces and no kernel hooks?
RQ2. Does the model still work on
- sample_names the training set never saw —
- i.e., new instances of malware profiles it does know?
RQ3. Of the standard sequence-model - families (RNN, GRU, LSTM, CNN, Transformer) plus a - non-parametric baseline (KNN) and a tabular baseline - (gradient-boosted trees), which trade off accuracy and - inference cost best for a deployment that has to run on a - constrained host?
-A single end-to-end pipeline turns raw /proc
- telemetry on a fleet host into a per-window phase verdict
- in under a second. Each stage of the diagram on the left
- is a thin, independently-deployable component — the
- receiver doesn't know what model is running; the model
- doesn't know where the episode came from.
The model zoo is the key abstraction: - every model class registers itself by name, declares its - input kind (summary features or window tensors), and plugs - into one shared training loop. KNN, GBT, MLP, CNN, RNN, - GRU, LSTM, and Transformer all reuse the same standardization, - schema-hashed checkpoint format, class-weighted CE loss, - and held-out-by-sample evaluation — so the comparison is - genuinely apples-to-apples.
-The detector's per-window verdict feeds two downstream - loops: a fleet-wide trust score that - combines local classification with network-behaviour - signals (per IEEE 9881803), and a fast-recovery - snapshot rollback when an infection time is known.
-Three choices anchor every result on the next slides — the - split recipe, the primary metric, and what we measure next - to accuracy. The temptation is to report a single big - number; we report a number you can argue with.
-Held-out by sample_name,
- profile-stratified. The fleet is uniform — every
- host runs the same orchestrator and the same set of
- profiles — so we don't split by device. Both hosts
- contribute data to train, val, and test. What's held out is
- specific malware instances: the
- sample_names in the test set never appear
- during training. The model has to generalize to unseen
- samples, not unseen devices.
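A rough sketch of a held-out-by-sample, profile-stratified split follows. It assumes each episode is a dict with `profile` and `sample_name` keys (an assumption, the on-disk schema is not shown here), produces only train and test for brevity, and sets aside profiles that have a single sample so they can be reported separately. The function name and the `test_frac` and `seed` parameters are illustrative.

```python
import random
from collections import defaultdict


def split_by_sample(episodes, test_frac=0.2, seed=0):
    """Hold out whole sample_names per profile; no sample_name spans both sides."""
    names_by_profile = defaultdict(set)
    for ep in episodes:
        names_by_profile[ep["profile"]].add(ep["sample_name"])

    rng = random.Random(seed)
    held_out = set()
    single_sample_profiles = set()
    for profile in sorted(names_by_profile):
        names = sorted(names_by_profile[profile])
        if len(names) < 2:
            # Cannot hold out a sample and still train on the profile.
            single_sample_profiles.add(profile)
            continue
        rng.shuffle(names)
        # At least one held-out name, and at least one name left for training.
        n_test = min(len(names) - 1, max(1, round(len(names) * test_frac)))
        held_out.update((profile, name) for name in names[:n_test])

    train, test, separate = [], [], []
    for ep in episodes:
        key = (ep["profile"], ep["sample_name"])
        if ep["profile"] in single_sample_profiles:
            separate.append(ep)
        elif key in held_out:
            test.append(ep)
        else:
            train.append(ep)
    return train, test, separate


# Toy data: two profiles with five samples each (three episodes per sample),
# plus one single-sample profile.
episodes = [
    {"profile": p, "sample_name": f"{p}-{i}"}
    for p in ("fork-bomb", "crypto-miner")
    for i in range(5)
    for _ in range(3)
] + [{"profile": "cpu-saturate", "sample_name": "cpu-saturate-0"}]
train, test, separate = split_by_sample(episodes)
```

Stratifying per profile keeps every multi-sample profile represented on both sides of the split, so the test set measures unseen instances of known profiles.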
Macro-F1, not accuracy. The dataset is
- heavily skewed: roughly half the labelled time is
- infected_running and only ~5 % is
- armed. A "predict the majority class"
- baseline already hits 0.5 accuracy. Macro-F1 averages F1
- across all five phases so rare classes count.
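To make the gap between the two metrics concrete, here is a small worked example in plain Python. The 50 % `infected_running` and 5 % `armed` shares come from the text above; the even split of the remaining 45 % across the other three phases is an assumption made only so the numbers add up.

```python
PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# 100 windows: 50 infected_running, 5 armed, 15 each for the rest (assumed).
y_true = (
    ["infected_running"] * 50
    + ["armed"] * 5
    + ["clean"] * 15
    + ["infecting"] * 15
    + ["dormant"] * 15
)
# Constant majority-class predictor.
y_pred = ["infected_running"] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, PHASES)
```

Under these assumptions the constant predictor reaches accuracy 0.5, but its macro-F1 is only 2/15 (about 0.13): it scores F1 = 2/3 on the majority class and 0 on the other four.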
Latency reported with accuracy. A model - that's one F1 point better but ten milliseconds slower - may still be the wrong choice for an on-host detector. - The perf scene plots both axes so the trade-off is visible.
-Three methodological claims this project makes — small in - isolation, but together they change how the comparison is - run. Each shows up explicitly in the codebase.
-Window-centre labelling. Instead of - majority-voting phase labels across each 10-second window - (which creates noisy boundaries), we label each window by - the phase that occupies its centre. Cleaner training - signal at transitions, no spurious "ambiguous" class.
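Window-centre labelling can be sketched as a lookup of the phase active at each window's midpoint. The sketch below assumes non-overlapping 10-second windows and a sorted list of `(start_time, phase)` transitions beginning at time 0; the function name and the transition format are hypothetical, not taken from the codebase.

```python
from bisect import bisect_right


def centre_labels(transitions, duration, window=10.0):
    """Label each full window by the phase active at its centre.

    transitions: sorted list of (start_time_seconds, phase), first start at 0.
    """
    starts = [t for t, _ in transitions]
    labels = []
    for i in range(int(duration // window)):
        centre = i * window + window / 2
        # Index of the last transition that starts at or before the centre.
        idx = bisect_right(starts, centre) - 1
        labels.append(transitions[idx][1])
    return labels


# A short "armed" phase (12 s to 14 s) never covers a window centre,
# so it produces no label in this toy episode.
transitions = [
    (0.0, "clean"),
    (12.0, "armed"),
    (14.0, "infecting"),
    (27.0, "infected_running"),
]
labels = centre_labels(transitions, duration=40.0)
```

The toy episode also shows a cost of the rule: a phase shorter than one window can be skipped entirely if it does not overlap a centre.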
-Schema-hashed checkpoints. Every - checkpoint embeds a hash of the feature schema it was - trained on. Loading a model against a different schema - fails fast. Without this, retroactive comparison silently - scores models on misaligned columns and reports nonsense.
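One way to get the fail-fast behaviour described above is to hash the ordered feature column names and store the digest next to the weights. This is a sketch under assumptions: the real checkpoint format is not shown in the diff, so the dict layout, `schema_hash`, `make_checkpoint`, `load_checkpoint`, and the example column names are all hypothetical.

```python
import hashlib
import json


class SchemaMismatch(ValueError):
    """Raised when a checkpoint was trained on a different feature schema."""


def schema_hash(columns):
    """SHA-256 over the ordered column names; reordering changes the hash."""
    payload = json.dumps(list(columns), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_checkpoint(weights, columns):
    return {"schema_hash": schema_hash(columns), "weights": weights}


def load_checkpoint(ckpt, columns):
    """Return the weights only if the caller's schema matches the stored one."""
    expected = schema_hash(columns)
    if ckpt["schema_hash"] != expected:
        raise SchemaMismatch(
            f"checkpoint schema {ckpt['schema_hash'][:12]} != current {expected[:12]}"
        )
    return ckpt["weights"]


# Illustrative column names only.
columns = ["cpu_user", "cpu_system", "mem_rss"]
ckpt = make_checkpoint({"w": [0.1, 0.2, 0.3]}, columns)
weights = load_checkpoint(ckpt, columns)
```

Hashing the ordered list means that a reordered or renamed column is rejected at load time instead of silently misaligning inputs.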
-Held-out-by-sample, profile-stratified.
- Hosts in the fleet are uniform — same orchestrator, same
- workload, just different production rates — so we split by
- malware sample_name instead of by device. The
- generalization claim is "unseen malware sample", tested on
- the same population of hosts that contributed the training
- data.
What others can pick up and use from this project — beyond - the published numbers.
-/proc-only deployment. The detector needs - no syscall hooks, no eBPF, no kernel module. It runs on - hosts that don't permit deeper instrumentation — a small - VM, a container with limited capabilities, an embedded - device. One Python service plus a model file.
-Producer-agnostic dashboard. The deck
- consumes typed events
- (training/dashboard/events.py); the inference
- loop runs anywhere — Pi, A100, cloud — and just POSTs back.
- Same UI for a lab demo and an operational console.
Labelled dataset on disk. 78 000+ - episodes across two hosts and six attack profiles, archived - in zstd-compressed tarballs with a schema-versioned format. - Anyone reproducing or extending this work can start from - the dataset directly without re-running the orchestrator.
-Three patterns that emerged during the project and earned - their keep enough that we'd repeat them.
-One loop, many models. Every NN - architecture plugs into the same training loop — class - weights, AMP autocast, cosine LR with warmup, gradient - clipping, early stop on val macro-F1. Architecture changes - don't ripple into orchestration, and adding a new model - class costs ~80 lines.
-Typed events as contract. Producers and - consumers agree on dataclasses, not free-form dicts. - Adding a new dashboard scene means adding a new dataclass; - adding a new producer means importing it. Static checking - and editor autocomplete do most of the work that a - schema-validation library would do at runtime.
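The typed-event contract might look roughly like the following. The actual classes in `training/dashboard/events.py` are not visible in this diff, so `WindowVerdict`, its fields, and the `to_wire` / `from_wire` helpers are hypothetical stand-ins used only to illustrate the pattern.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WindowVerdict:
    """Hypothetical event: one per-window phase verdict from a producer."""
    host: str
    window_start: float
    phase: str
    confidence: float


# Consumers dispatch on the event type name; adding a new event means
# adding a dataclass and one entry here.
EVENT_TYPES = {"WindowVerdict": WindowVerdict}


def to_wire(event) -> str:
    """Serialize an event to JSON, tagging it with its class name."""
    return json.dumps({"type": type(event).__name__, "payload": asdict(event)})


def from_wire(raw: str):
    """Rebuild the typed event; unknown types or fields raise immediately."""
    msg = json.loads(raw)
    return EVENT_TYPES[msg["type"]](**msg["payload"])


event = WindowVerdict(host="host-a", window_start=120.0, phase="armed", confidence=0.9)
restored = from_wire(to_wire(event))
```

Because the payload is unpacked into the dataclass constructor, a producer that sends a misspelled or extra field fails with a `TypeError` at the boundary rather than propagating a malformed dict.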
-Two-agent path ownership. Dashboard work
- and model work live in two parallel sessions with a
- documented path-ownership boundary
- (training/dashboard/ vs everywhere else).
- Merges go through git with explicit rebases instead of a
- shared workspace — slow up front, fewer subtle stomps
- over time.
What this project cannot honestly claim — and why each - line on the left matters for how the results should be read.
-Two-host fleet. Cross-host generalization - is reported between exactly two machines; it's the right - shape of evaluation but not a population claim. - More hosts on the WireGuard mesh would let us report - distributional bounds rather than single point comparisons.
-Synthetic attack profiles. Our six - profiles cover the main behavioural envelopes - (cpu-saturate, ransomware-lite, bursty-c2, fork-bomb, - crypto-miner, distccd-exec) but real-world malware can - sit between or outside these envelopes. Generalization to - unseen profiles is reported via held-out-by-sample, but - in-the-wild distribution shift is unknown.
-10 Hz sampling floor. Sub-100ms
- behaviours fall inside a single sample. Detection of
- millisecond-scale privilege checks would need faster
- telemetry than /proc provides.
KNN val ↔ test gap. KNN scores val
- macro-F1 ≈ 0.74 on samples it saw, but only ≈ 0.13 on
- held-out sample_names. Instance-based
- memorization of the specific training samples — informative
- as a baseline, not a deployment candidate.
A per-host classifier trained on /proc-only
- telemetry can identify workload phases at multi-class
- macro-F1 well above chance and slot into a wider
- trust / containment / recovery loop. The recurrent family
- (LSTM/GRU) and Transformer sit on the upper-left of the
- accuracy-vs-cost frontier; KNN and GBT are honest baselines.
- Held-out-by-host evaluation is the right generalization
- axis — held-out-by-sample overstates real fleet
- performance by 0.3+ F1.
Unsupervised next steps. The natural - extensions are unsupervised:
-• Clustering the unlabeled tail of new - fleet data (KMeans / HDBSCAN) to surface novel workload - shapes the supervised model has no class for — a - self-training feedback loop that enrolls new phases as - the fleet grows.
-• Anomaly detection on the last-layer - embedding (one-class SVM, isolation forest) so a "none of - the five known phases" verdict is available alongside the - classifier output.
-• Self-supervised pretraining on the much - larger pool of unlabeled telemetry from operational hosts; - supervised fine-tune on the smaller orchestrated dataset.
-• Embedding visualisation via UMAP / - t-SNE for human-in-the-loop labelling — already prototyped - in the KNN scene's interactive 3-D scatter.
-