cis490 · live fleet telemetry
behavioral malware detection
what detection unlocks
network-level trust scoring
A noisy on-device classifier becomes useful when its verdict feeds a fleet-wide trust score — peers, gateways, and traffic patterns vote together. A single host's signal is fragile; combined network behaviour is much harder to spoof.
containment before pivot
"Infected" is actionable: quarantine the device's credentials, drop its traffic at the gateway, stop lateral movement before the attacker pivots to a neighbor. Detection latency directly bounds blast radius.
fast post-attack reset
With a known infection time you can roll a device back to a snapshot taken before the compromise — no forensic dwell time, no guessing how far back to roll. Recovery becomes a one-button operation instead of a week of cleanup.
the problem · single sentence + numbers
Classify each ten-second window of fleet /proc telemetry into one of five workload phases — accurately enough to drive automated containment.
5 phase classes · clean / armed / infecting / infected_running / dormant
12 /proc channels · no syscalls, no kernel hooks
10 s classification window · 100 samples × 12 channels
task type: multi-class classification — five mutually-exclusive phase labels, balanced via class-weighted cross-entropy. Not regression (no continuous target), not ranking (downstream policy is a categorical containment decision).
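A minimal sketch of class-weighted cross-entropy, assuming inverse-frequency weights (the deck does not say how the weights are derived) and a hypothetical label mix. The loss is normalised by the summed weights of the targets, which is also how PyTorch's weighted CrossEntropyLoss averages.

```python
import math
from collections import Counter

PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]

def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rare phases weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(PHASES)
    return {p: n / (k * counts[p]) for p in PHASES if counts[p]}

def weighted_cross_entropy(probs, labels, weights):
    """Sum of -w[y] * log p(y), divided by the sum of the target weights."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Hypothetical label mix: half infected_running, 5 % armed.
labels = (["infected_running"] * 50 + ["clean"] * 20 + ["dormant"] * 15
          + ["infecting"] * 10 + ["armed"] * 5)
weights = inverse_frequency_weights(labels)
```

With this mix a mistake on an armed window costs ten times as much as one on infected_running (weight 4.0 against 0.4).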
literature gaps · positioning the work
what prior work covers
  • LSTM on syscall traces in VMs — deeper telemetry than /proc
  • Transformer on per-process resource metrics — related signal, single-host eval
  • BERT on system logs (LogBERT) — text-form telemetry, not numeric channels
  • Insider-threat LSTM on event logs (DANTE) — categorical events, not continuous
  • Network-behaviour trust establishment (IEEE 9881803) — cross-device aggregation, not per-host classifier
what's missing
  • /proc-only signal — most work assumes syscalls or kernel hooks
  • Cross-host generalization — eval splits often hide it (held-out by sample, not host)
  • Real-time per-window classification for containment, not post-hoc batch labelling
  • Side-by-side cell-choice comparison (RNN/GRU/LSTM/CNN/Transformer) on one dataset
  • Direct integration with a fleet-wide trust score, not standalone output
pipeline · what each stage produces
  • fleet hosts: /proc · 10 Hz
  • receiver (Pi): bearer auth
  • episode store: zstd · tar
  • windowing + features: 10 s · 100 samples × 12 ch
  • model zoo: KNN · GBT · MLP · CNN · RNN · GRU · LSTM · Transformer, trained per (model × split-recipe)
  • cross-host eval: class-weighted CE · early stop on val macro-F1
  • per-window phase: 5-class softmax
  • trust score: + network signals (9881803)
  • containment + reset: snapshot rollback
the stack behind the live data on the right
pyproject.toml

              
receiver/app.py · file header

              
episodes ingested
0
0.0 / sec · last 60 s · total bytes on disk: 0 B
per-host shipping
awaiting snapshot…
episode database · last 200 records
0 of 0
host episode_id received size
phase mix · sampling dataset…
computing the phase distribution across a random sample of episodes on disk. A clean fleet sits mostly in clean; skew toward infecting / infected_running reflects time spent under attack workloads.
attack envelopes · /proc signature per profile
10-second windows · model input shape
each window: 100 samples (10 Hz × 10 s), labeled by the phase that occupies its center.
evaluation setup · how the numbers get made
split recipe
train ∪ val: elliott-thinkpad
test: k-gamingcom
held-out by host so the test set measures cross-device generalization, not in-distribution self-prediction. A 90 % accuracy that comes from recognising the host's idle profile is worthless for a fleet detector.
primary metric
macro-F1 averaged across the five phases
accuracy lies under class imbalance — ~50 % infected_running, ~5 % armed. A constant majority predictor hits 0.5 accuracy. macro-F1 averages per-class F1, so rare phases actually count toward the score.
baselines compared
KNN — non-parametric, instance-based
GBT (XGBoost) — tabular non-NN
MLP — feedforward ablation
CNN — local-pattern ablation
RNN / GRU / LSTM — recurrent family
Transformer — attention
reported alongside accuracy
μs / window — inference cost at batch=64
cross-host gap — val − test macro-F1
latency translates to containment lag; the gap is the honest measure of generalization. Both are plotted on the perf scene.
sequence models · accuracy on held-out samples
how we trained the sequence models
training/models/lstm.py

              
training/trainer/_loop.py · train_nn

              
window features · 3-D projection · drag to rotate
references · papers, notes, prior work
accuracy vs inference cost
x: μs / window (lower is better) · y: held-out accuracy (higher is better).
live detections 0 hosts 0 / sec model: — hit-rate: —
awaiting live_detection events from the inference loop
theoretical contributions · what's new methodologically
window-centre labelling
A 10-second classification window is labelled by the phase that occupies its centre, not by majority vote across the window. Cleaner training signal at phase boundaries, and avoids the spurious "ambiguous" class.
schema-hashed checkpoints
Each checkpoint embeds a hash of the feature schema; loading a model against the wrong schema fails fast instead of silently scoring on misaligned columns. Makes retroactive comparison reproducible.
cross-host as the eval axis
Held-out-by-host is reported as a first-class number alongside held-out-by-sample. The two often disagree by ~0.4 macro-F1, and only the cross-host number predicts fleet behaviour.
practical contributions · what others can use
/proc-only deployment
No syscall hooks, no eBPF, no kernel module — runs on hosts that don't permit deep instrumentation. The detector is one Python service plus a model file.
producer-agnostic dashboard
The deck consumes typed events; the inference loop runs anywhere (Pi, A100, cloud) and just POSTs back. Same UI for a lab demo and an operational console.
labelled dataset on disk
78,000+ episodes, five phases, two hosts, six attack profiles — archived in zstd-compressed tarballs with a schema-versioned format. Ready for downstream work without re-running the orchestrator.
design principles · patterns that emerged
one loop, many models
Every NN architecture plugs into the same training loop — class weights, AMP, cosine LR, early stop. Architecture changes don't ripple into orchestration.
typed events as contract
Producers and consumers agree on dataclasses (events.py), not free-form dicts. Adding a new scene means adding a new dataclass; adding a new producer means importing it.
two-agent path ownership
Dashboard work and model work live in two parallel sessions with a documented path-ownership boundary. Merges go through git with explicit rebases instead of a shared workspace.
limitations · the honest list
two-host fleet
Cross-host generalization is reported between exactly two machines (elliott-thinkpad → k-gamingcom). N-host claims need more hosts on the WireGuard mesh.
synthetic attack profiles
Six profiles cover the main shapes (cpu-saturate, ransomware-lite, bursty-c2, fork-bomb, crypto-miner, distccd-exec) but real-world malware can sit between or outside these envelopes.
10 Hz sampling floor
Sub-100ms attack behaviours fall inside a single sample. Detection of extremely short-lived attacks (millisecond-scale privilege checks) requires faster sampling than /proc currently provides.
KNN cross-host gap
KNN scores val macro-F1 ≈ 0.74 on elliott-thinkpad but only 0.13 on the held-out k-gamingcom. Instance-based memorization of the training host's feature space — informative as a baseline, but not a deployment candidate.
conclusion + future work
what we showed
  • A per-host detector trained on /proc-only telemetry can classify workload phases at multi-class macro-F1 well above chance.
  • Held-out-by-host evaluation is the right generalization axis; held-out-by-sample overstates real fleet performance by 0.3+ F1.
  • The recurrent family (LSTM/GRU) and Transformer sit on the upper-left of the accuracy-vs-cost frontier; KNN and GBT round out the comparison as honest baselines.
  • The detector slots into a wider trust / containment / recovery loop — the per-host verdict isn't the final answer, it's one input.
next steps · unsupervised
  • Clustering the unlabeled tail of new fleet data (KMeans / HDBSCAN) to surface novel workload shapes the supervised model has no class for — a self-training feedback loop.
  • Anomaly detection on the last-layer embedding (one-class SVM, isolation forest) so a "none of the five known phases" verdict is available alongside the classifier output.
  • Self-supervised pretraining on the much larger pool of unlabeled telemetry from operational hosts; supervised fine-tune on the smaller orchestrated dataset.
  • Embedding visualisation via UMAP / t-SNE for human-in-the-loop labelling of the unlabeled tail (already prototyped in scene 12).

Most malware doesn't look like malware in a database — it looks like a process behaving badly.

An intrusion detection system spots the bad behavior; an intrusion prevention system stops it. Both depend on knowing what bad behavior looks like at the level of telemetry the device can actually see.

This deck is the live face of the dataset we're building to teach a model that distinction — every panel on the left is a slice of real data shipping in right now.


Why detect at all?

Knowing a device is compromised is the precondition for everything else. A classifier that says "this host is infected right now" turns into three concrete operational capabilities — and each one rewards a faster, more confident detector.

Trust scoring across the network. Recent work on per-device trust establishment (IEEE 9881803) argues that on-device metrics alone aren't enough — a fleet has to combine local classifier verdicts with network-behaviour signals (peer observations, gateway traffic patterns, inter-host relationships) to score trust reliably. Our per-host detector is one input to that broader signal.

Containment. Once a host is flagged, the gateway can drop its traffic and the IAM layer can revoke credentials before lateral movement begins. Detection latency translates directly into how much of the network an attacker reaches.

Quick recovery. A confirmed infection time lets you restore from a snapshot taken just before the compromise — no forensic dwell time, no guessing how far back to roll. The recovery path becomes a one-button operation instead of a week of cleanup.

Problem statement

Today's behaviour-based intrusion detection systems rely on syscall traces, kernel hooks, or rich endpoint agents that can't ship to constrained or untrusted hosts. We want a detector that runs on the only telemetry every modern Linux host already exports, /proc, and labels each ten-second window of activity with the phase the workload is in.

Research question. Can a sequence model trained on twelve channels of /proc telemetry classify five workload phases (clean / armed / infecting / infected_running / dormant) accurately enough to drive automated containment, and generalize across hosts and malware profiles it has never seen during training?

The task is multi-class classification: the target is one of five mutually-exclusive phase labels. Not regression (no continuous target), not ranking (downstream policy is a categorical containment decision). We deliberately chose 10-second windows so detection latency stays bounded for a real fleet.

Research gaps + questions

Literature on behaviour-based malware detection is rich but uneven. Most published results either (a) use richer telemetry than what a constrained host actually exports, or (b) frame evaluation in ways that hide the cross-host generalization problem. The card on the left summarises the gap.

This project asks three concrete questions:

RQ1. How well can a per-window classifier identify workload phases from /proc alone, with no syscall traces and no kernel hooks?

RQ2. Does the model still work when test episodes come from a host the training set never saw?

RQ3. Of the standard sequence-model families (RNN, GRU, LSTM, CNN, Transformer) plus a non-parametric baseline (KNN) and a tabular baseline (gradient-boosted trees), which trade off accuracy and inference cost best for a deployment that has to run on a constrained host?

Proposed solution

A single end-to-end pipeline turns raw /proc telemetry on a fleet host into a per-window phase verdict in under a second. Each stage of the diagram on the left is a thin, independently-deployable component — the receiver doesn't know what model is running; the model doesn't know where the episode came from.

The model zoo is the key abstraction: every model class registers itself by name, declares its input kind (summary features or window tensors), and plugs into one shared training loop. KNN, GBT, MLP, CNN, RNN, GRU, LSTM, and Transformer all reuse the same standardization, schema-hashed checkpoint format, class-weighted CE loss, and held-out-by-host evaluation — so the comparison is genuinely apples-to-apples.
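
The registration idea can be sketched as a decorator over a name-keyed dictionary. This is an illustration only: ModelSpec, REGISTRY, register, and the two input-kind strings are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ModelSpec:
    name: str
    input_kind: str              # "summary" (per-window features) or "window" (100 x 12)
    build: Callable[[], object]  # zero-argument constructor

REGISTRY: Dict[str, ModelSpec] = {}

def register(name: str, input_kind: str):
    """Class decorator that files a model class under `name` in REGISTRY."""
    if input_kind not in ("summary", "window"):
        raise ValueError(f"unknown input kind: {input_kind!r}")

    def wrap(cls):
        if name in REGISTRY:
            raise KeyError(f"duplicate model name: {name!r}")
        REGISTRY[name] = ModelSpec(name, input_kind, cls)
        return cls

    return wrap

@register("knn", input_kind="summary")
class KNN:  # placeholder body; a real model would implement fit/predict
    pass

@register("lstm", input_kind="window")
class LSTM:  # placeholder body
    pass
```

A shared training loop can then look a model up by name and pick the matching data loader from its input_kind, without importing any model module directly.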

The detector's per-window verdict feeds two downstream loops: a fleet-wide trust score that combines local classification with network-behaviour signals (per IEEE 9881803), and a fast-recovery snapshot rollback when an infection time is known.

Live, not staged

Every panel from here on is real data from real devices — counters, bars, the episode database, all driven by the cis490-receiver service running on this Pi as you scroll.

The code on the left is how it gets here. Four runtime deps: starlette + uvicorn for the async HTTP and WebSocket surface, msgpack to talk to Metasploit's RPC, and pycdlib to build the lab-VM cidata ISOs. Everything else is the standard library, and every dep is annotated with a one-line reason it's there.

Collecting the dataset

Each lab host on the WireGuard mesh boots a real Alpine VM, runs a profile-driven workload inside it, and samples /proc/<qemu_pid> at 10 Hz. Every ~30 seconds the labeled tarball is shipped to this Pi over mTLS.
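
For a sense of what one sample involves, here is a sketch that parses a single /proc/<pid>/stat line. The deck does not list its twelve channels, so the four fields picked here are illustrative assumptions; the field numbers follow the proc(5) layout, and the sample line is made up.

```python
def parse_proc_stat(line: str) -> dict:
    """Extract a few numeric fields from one /proc/<pid>/stat line.

    The comm field (2) is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' before tokenising the rest.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state)

    def field(n: int) -> int:
        """Return 1-based stat field `n` (n >= 4) as an integer."""
        return int(rest[n - 3])

    return {
        "pid": int(pid_str),
        "comm": comm,
        "utime_ticks": field(14),   # user-mode CPU time, clock ticks
        "stime_ticks": field(15),   # kernel-mode CPU time, clock ticks
        "num_threads": field(20),
        "rss_pages": field(24),     # resident set size, pages
    }

# Made-up sample line, trimmed to the first 25 fields.
SAMPLE = ("4242 (qemu-system-x86) S 1 4242 4242 0 -1 4194560 "
          "1500 0 3 0 820 310 0 0 20 0 6 0 123456 2147483648 51200 "
          "18446744073709551615")
sample = parse_proc_stat(SAMPLE)
```

The CPU counters are cumulative, so a 10 Hz sampler would difference consecutive readings to get a per-interval rate.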

The counter on the left is the running total, sourced from the receiver's index.jsonl on disk. The sparkline is the arrival rate over the last sixty seconds.

A multi-host fleet

Running the same orchestrator on multiple hosts gives novel, non-overlapping data per host — no central coordinator. Each host pulls a different slice of the manifest, so the dataset grows in parallel.

The numbers below are absolute episode counts on disk, refreshed from /var/lib/cis490/episodes/<host>/ every thirty seconds.

The dataset, browsable

Every row is one labeled episode tarball stored at /var/lib/cis490/episodes/<host>/<id>.tar.zst after the receiver verifies its SHA-256 and writes it through.
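
A minimal sketch of the verify-then-write step, assuming the sender supplies the digest as a hex string. verify_and_store and the ".part" temporary suffix are hypothetical; the receiver's real code may differ.

```python
import hashlib
from pathlib import Path

def verify_and_store(payload: bytes, claimed_sha256: str, dest: Path) -> Path:
    """Write `payload` to `dest` only if its SHA-256 matches the claimed digest."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != claimed_sha256.lower():
        raise ValueError(f"sha256 mismatch: expected {claimed_sha256}, got {actual}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(payload)
    tmp.replace(dest)  # rename within one directory; readers never see a partial file
    return dest
```

Writing to a temporary name and renaming means a crash mid-upload cannot leave a truncated tarball under the final path.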

Filter by host with the tabs, or grep by host / episode id / sha with the search box. Click a row for the full index.jsonl record. The view holds the most recent two hundred records — older history is on disk, indexable from the receiver.

A baseline of normal

Before we can detect a deviation, we have to know what the fleet looks like across a wide slice of its life. The stacked bar aggregates ground-truth phase labels across hundreds of randomly sampled episodes from the dataset on disk — weighted by the time the workload actually spent in each phase, not just the count of transitions.
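
Time-weighting can be sketched as summing segment durations per phase. The episode layout below, a list of (phase, start_s, end_s) segments, is a hypothetical stand-in for whatever the label files actually contain.

```python
from collections import defaultdict

def phase_mix(episodes):
    """Return each phase's share of total labelled time across episodes."""
    seconds = defaultdict(float)
    for segments in episodes:
        for phase, start_s, end_s in segments:
            seconds[phase] += end_s - start_s
    total = sum(seconds.values())
    return {phase: t / total for phase, t in seconds.items()}

# Toy data, not taken from the dataset.
episodes = [
    [("clean", 0, 5), ("armed", 5, 7), ("infected_running", 7, 20)],
    [("clean", 0, 10), ("infected_running", 10, 20)],
]
mix = phase_mix(episodes)
```

Here armed is one of five segments but only 2 of 40 seconds, so it gets a 5 % share rather than 20 %.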

If the model only ever sees clean, it overfits to "everything is fine." The phase schedule fixes that by forcing every run to walk through every phase, which is why infected_running dominates the mix — that's where the labelled attack workload sits.

Linking attack to telemetry

The same six profiles run across every host, and each one produces a different envelope in /proc. A cryptominer pegs one core for minutes. A bursty C2 channel sits idle, then exhales three packets. Ransomware walks the filesystem and saturates I/O.

The thumbnails on the left are the canonical envelopes the model has to learn to recognize — same axes, different shapes. That shape difference is what makes detection tractable.

Ten-second windows

Models eat fixed-size inputs. We chop each episode into 10-second windows — 100 samples per window at 10 Hz — and label each window with the phase that occupies its center.
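
The chopping and labelling step can be sketched as follows. Non-overlapping windows (stride equal to the window length) are an assumption, and the toy episode is invented for the example.

```python
WINDOW = 100  # samples per window: 10 Hz x 10 s

def centre_labelled_windows(samples, phases, window=WINDOW, stride=WINDOW):
    """Yield (window_samples, label) pairs from one episode.

    `samples[i]` is one 12-channel reading and `phases[i]` its ground-truth
    phase. The label is the phase at the window's centre sample, not a
    majority vote. A trailing partial window is dropped.
    """
    for start in range(0, len(samples) - window + 1, stride):
        centre = start + window // 2
        yield samples[start:start + window], phases[centre]

# Toy episode: 250 samples, phase flips from clean to armed at sample 140.
samples = [[0.0] * 12 for _ in range(250)]
phases = ["clean"] * 140 + ["armed"] * 110
windows = list(centre_labelled_windows(samples, phases))
labels = [label for _, label in windows]
```

The second window spans the transition and still gets a single label, taken from sample 150.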

Window size is a knob. Too short and the model can't see slow envelopes (low-and-slow malware, idle C2). Too long and you can't react fast enough to be a useful prevention signal. Ten seconds is the starting point we tune around.

Evaluation setup

Three choices anchor every result on the next slides — the split recipe, the primary metric, and what we measure next to accuracy. The temptation is to report a single big number; we report a number you can argue with.

Held-out by host. Train and validate on one machine; test on a different machine. A model that wins by memorising the train host's idle profile loses here, which is what you want — a fleet detector has to generalize across hosts it never saw at training time.
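
A sketch of the split, assuming each episode record carries a host field. The host names come from the split recipe; the episode ids are placeholders.

```python
def split_by_host(episodes, test_hosts):
    """Partition episodes so no test host contributes to train or val."""
    train_val = [e for e in episodes if e["host"] not in test_hosts]
    test = [e for e in episodes if e["host"] in test_hosts]
    if not train_val or not test:
        raise ValueError("split left one side empty")
    return train_val, test

episodes = [
    {"host": "elliott-thinkpad", "episode_id": "a1"},
    {"host": "elliott-thinkpad", "episode_id": "a2"},
    {"host": "k-gamingcom", "episode_id": "b1"},
]
train_val, test = split_by_host(episodes, test_hosts={"k-gamingcom"})
```

Splitting at the episode level by host, before any windowing, keeps windows from the test machine out of training entirely.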

Macro-F1, not accuracy. The dataset is heavily skewed: roughly half the labelled time is infected_running and only ~5 % is armed. A "predict the majority class" baseline already hits 0.5 accuracy. Macro-F1 averages F1 across all five phases so rare classes count.
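
The gap between the two metrics is easy to reproduce. The label mix below is hypothetical apart from the roughly 50 % infected_running and 5 % armed shares quoted above.

```python
from collections import Counter

PHASES = ["clean", "armed", "infecting", "infected_running", "dormant"]

def macro_f1(y_true, y_pred, classes=PHASES):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    pairs = Counter(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = (["infected_running"] * 50 + ["clean"] * 20 + ["dormant"] * 15
          + ["infecting"] * 10 + ["armed"] * 5)
y_pred = ["infected_running"] * len(y_true)   # constant majority predictor
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

The constant predictor reaches 0.5 accuracy, but its macro-F1 is 2/15, about 0.13: one class scores 2/3 and the other four score zero.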

Latency reported with accuracy. A model that's one F1 point better but ten milliseconds slower may still be the wrong choice for an on-host detector. The perf scene plots both axes so the trade-off is visible.

Sequence models

RNN, GRU, LSTM — recurrent models that read the window one timestep at a time and carry state forward. Cheap, mature, easy to interpret.

BERT-style transformer — the window becomes a sequence of "tokens"; attention captures cross-position context instead of accumulating it through a hidden state. More parameters, more compute, more room to overfit a small dataset.

Same input, same labels, four different inductive biases. The comparison on the left is the punchline of the whole project.

How we trained them

One trainer per model — load the windowed dataset, define the network, train, evaluate. Same shape for RNN, GRU, LSTM, BERT, so you can read all four side-by-side and the only differences are the architecture itself.

The code on the left is the LSTM trainer. PyTorch's DataLoader handles windowing, nn.LSTM is one line, the loop is six. No custom loss, no learning-rate schedule, no manual batching; anything fancier has to earn its place by beating the simple version on held-out samples.

Nearest-neighbor as a sanity check

Before anything fancy: engineer summary features per window (mean, std, p95, slope, zero-bucket counts per channel) and run KNN in that feature space.
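
A standard-library sketch of those per-channel features. Two details are assumptions: p95 uses the nearest-rank definition, and "zero-bucket count" is read as the number of samples equal to zero.

```python
import statistics

def channel_features(values):
    """Summary statistics for one channel of one window."""
    n = len(values)
    mean = statistics.fmean(values)
    ordered = sorted(values)
    rank = -(-95 * n // 100)          # ceil(0.95 * n) in integer arithmetic
    # Least-squares slope of value against sample index.
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (v - mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "mean": mean,
        "std": statistics.pstdev(values),
        "p95": ordered[rank - 1],
        "slope": num / den if den else 0.0,
        "zeros": sum(1 for v in values if v == 0),
    }

def window_features(window):
    """Flatten per-channel features for a window given as rows of channels."""
    feats = []
    for channel in zip(*window):      # transpose: samples x ch -> ch x samples
        feats.extend(channel_features(list(channel)).values())
    return feats
```

A 100 × 12 window becomes a 60-element vector (five features per channel), which is the space a nearest-neighbour search would run in.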

If the phase clusters separate visibly in two dimensions, KNN already does most of the work and a deep model is only buying marginal improvement. If they don't separate, you've learned something about the feature engineering before training a single epoch.

Accuracy vs complexity

Bigger models earn better numbers on the validation set, but they also need more parameters, more inference time, and more memory at the edge. The deployed model has to fit on the device it's protecting.

The scatter on the left is the usable trade-off curve: every point above and to the left of where you currently sit is a reachable upgrade. The point in the bottom-right is a model you'd never ship.
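
Which points are worth shipping can be computed as a Pareto front over (cost, score). The three entries below are placeholder numbers for illustration only, not measurements from this project.

```python
def pareto_front(models):
    """Return names of models not dominated on (lower cost, higher score)."""
    front = []
    for name, cost, score in models:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, c, s in models
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# (name, microseconds per window, score): placeholder values.
models = [
    ("model_a", 5.0, 0.60),    # cheap, modest score
    ("model_b", 40.0, 0.75),   # costlier, better
    ("model_c", 60.0, 0.70),   # costlier and worse than model_b: dominated
]
front = pareto_front(models)
```

Only the non-dominated models are real candidates; model_c sits below and to the right of model_b, so it drops out.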

Catching attacks live

Real episodes arrive from the fleet, get chunked into ten-second windows, and a deployed model labels each window in flight. The heavy models can offload inference to an A100 so the receiver never blocks on a forward pass — predictions stream back as they finish.

Each row on the stage is a host; each cell is one ten-second window painted by the model's predicted phase. A clean run cruises blue; an attack profile pushes the lane through armed → infecting → infected_running. When ground truth catches up, mismatched cells get a hatched overlay so you can spot where the model disagrees with the orchestrator. The callout below holds the most recent prediction with model name, confidence, and round-trip latency.

Theoretical contributions

Three methodological claims this project makes — small in isolation, but together they change how the comparison is run. Each shows up explicitly in the codebase.

Window-centre labelling. Instead of majority-voting phase labels across each 10-second window (which creates noisy boundaries), we label each window by the phase that occupies its centre. Cleaner training signal at transitions, no spurious "ambiguous" class.

Schema-hashed checkpoints. Every checkpoint embeds a hash of the feature schema it was trained on. Loading a model against a different schema fails fast. Without this, retroactive comparison silently scores models on misaligned columns and reports nonsense.
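
One way to get this fail-fast behaviour is to hash the ordered column list. The sketch below assumes a dictionary checkpoint and hypothetical channel names; the project's own checkpoint format is not shown in the deck.

```python
import hashlib
import json

def schema_hash(columns):
    """Stable digest of the ordered feature-column list."""
    blob = json.dumps(list(columns), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def save_checkpoint(weights, columns):
    """Bundle the weights with the digest of the schema they were trained on."""
    return {"schema_sha256": schema_hash(columns), "weights": weights}

def load_checkpoint(ckpt, columns):
    """Return weights, or raise if the checkpoint was trained on another schema."""
    expected, actual = ckpt["schema_sha256"], schema_hash(columns)
    if expected != actual:
        raise ValueError(f"feature schema mismatch: {expected[:12]} != {actual[:12]}")
    return ckpt["weights"]

columns = ["cpu_user", "cpu_sys", "rss"]   # hypothetical channel names
ckpt = save_checkpoint({"w": [0.1, 0.2, 0.3]}, columns)
```

Because the list order is part of the hash, even a reordering of the same columns is rejected, which is exactly the misalignment case that would otherwise score silently.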

Cross-host as the eval axis. Held-out-by-host is reported as a first-class number alongside held-out-by-sample — the two often disagree by ~0.4 macro-F1, and only the cross-host number predicts real fleet behaviour.

Practical contributions

What others can pick up and use from this project — beyond the published numbers.

/proc-only deployment. The detector needs no syscall hooks, no eBPF, no kernel module. It runs on hosts that don't permit deeper instrumentation — a small VM, a container with limited capabilities, an embedded device. One Python service plus a model file.

Producer-agnostic dashboard. The deck consumes typed events (training/dashboard/events.py); the inference loop runs anywhere — Pi, A100, cloud — and just POSTs back. Same UI for a lab demo and an operational console.

Labelled dataset on disk. 78 000+ episodes across two hosts and six attack profiles, archived in zstd-compressed tarballs with a schema-versioned format. Anyone reproducing or extending this work can start from the dataset directly without re-running the orchestrator.

Design principles

Three patterns that emerged during the project and earned their keep enough that we'd repeat them.

One loop, many models. Every NN architecture plugs into the same training loop — class weights, AMP autocast, cosine LR with warmup, gradient clipping, early stop on val macro-F1. Architecture changes don't ripple into orchestration, and adding a new model class costs ~80 lines.
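
Early stopping on validation macro-F1 is the piece of that loop that is simplest to show in isolation. The patience value and the score curve below are made up for the example.

```python
class EarlyStopper:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's score; return True if training should stop."""
        if score > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation macro-F1 curve, for illustration only.
val_f1 = [0.40, 0.55, 0.61, 0.60, 0.59, 0.61, 0.58]
stopper = EarlyStopper(patience=3)
stopped_at = next(e for e, s in enumerate(val_f1) if stopper.step(e, s))
```

Training halts at epoch 5 and the checkpoint to keep is the one from epoch 2, where the score peaked.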

Typed events as contract. Producers and consumers agree on dataclasses, not free-form dicts. Adding a new dashboard scene means adding a new dataclass; adding a new producer means importing it. Static checking and editor autocomplete do most of the work that a schema-validation library would do at runtime.
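
A sketch of what one such event might look like. LiveDetection and its fields are hypothetical, loosely modelled on what the live-detection callout displays; the real definitions live in events.py and may differ.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class LiveDetection:
    host: str
    window_start_s: float
    predicted_phase: str
    confidence: float
    model: str
    latency_ms: float

def to_wire(event) -> str:
    """Serialise an event with a type tag so consumers can dispatch on it."""
    return json.dumps({"type": type(event).__name__, **asdict(event)})

def from_wire(raw: str) -> LiveDetection:
    """Rebuild the event; missing or extra fields raise TypeError."""
    data = json.loads(raw)
    if data.pop("type") != "LiveDetection":
        raise ValueError("unexpected event type")
    return LiveDetection(**data)

event = LiveDetection("k-gamingcom", 120.0, "armed", 0.87, "lstm", 14.2)
round_tripped = from_wire(to_wire(event))
```

A producer that drops or misspells a field fails at construction time instead of sending a dict the dashboard cannot render.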

Two-agent path ownership. Dashboard work and model work live in two parallel sessions with a documented path-ownership boundary (training/dashboard/ vs everywhere else). Merges go through git with explicit rebases instead of a shared workspace — slow up front, fewer subtle stomps over time.

Limitations

What this project cannot honestly claim — and why each line on the left matters for how the results should be read.

Two-host fleet. Cross-host generalization is reported between exactly two machines; it's the right shape of evaluation but not a population claim. More hosts on the WireGuard mesh would let us report distributional bounds rather than single point comparisons.

Synthetic attack profiles. Our six profiles cover the main behavioural envelopes (cpu-saturate, ransomware-lite, bursty-c2, fork-bomb, crypto-miner, distccd-exec) but real-world malware can sit between or outside these envelopes. Generalization to unseen profiles is reported via held-out-by-sample, but in-the-wild distribution shift is unknown.

10 Hz sampling floor. Sub-100ms behaviours fall inside a single sample. Detection of millisecond-scale privilege checks would need faster telemetry than /proc provides.

KNN cross-host gap. KNN scores val macro-F1 ≈ 0.74 on the train host but only ≈ 0.13 on the held-out one. Instance-based memorization of the training host's feature space — informative as a baseline, not a deployment candidate.

Conclusion + future work

A per-host classifier trained on /proc-only telemetry can identify workload phases at multi-class macro-F1 well above chance and slot into a wider trust / containment / recovery loop. The recurrent family (LSTM/GRU) and Transformer sit on the upper-left of the accuracy-vs-cost frontier; KNN and GBT are honest baselines. Held-out-by-host evaluation is the right generalization axis — held-out-by-sample overstates real fleet performance by 0.3+ F1.

Unsupervised next steps. The natural extensions are unsupervised:

Clustering the unlabeled tail of new fleet data (KMeans / HDBSCAN) to surface novel workload shapes the supervised model has no class for — a self-training feedback loop that enrolls new phases as the fleet grows.

Anomaly detection on the last-layer embedding (one-class SVM, isolation forest) so a "none of the five known phases" verdict is available alongside the classifier output.

Self-supervised pretraining on the much larger pool of unlabeled telemetry from operational hosts; supervised fine-tune on the smaller orchestrated dataset.

Embedding visualisation via UMAP / t-SNE for human-in-the-loop labelling — already prototyped in the KNN scene's interactive 3-D scatter.

References

The papers, notes, and prior work this project leans on. Pick a tab on the left to load the document; the viewer takes the bulk of the stage so you can scroll through without leaving the deck.
