cis490 · live fleet telemetry
behavioral
malware
detection
episodes ingested
0
0.0 / sec · last 60 s · total bytes on disk: 0 B
what detection unlocks
network-level trust scoring
A noisy on-device classifier becomes useful when its verdict feeds a fleet-wide trust score in which peers, gateways, and traffic patterns vote together. A single host's signal is fragile; combined network behavior is much harder to spoof.
containment before pivot
"Infected" is actionable: quarantine the device's credentials, drop its traffic at the gateway, stop lateral movement before the attacker pivots to a neighbor. Detection latency directly bounds blast radius.
fast post-attack reset
With a known infection time you can roll a device back to a snapshot taken before the compromise — no forensic dwell time, no guessing how far back to roll. Recovery becomes a one-button operation instead of a week of cleanup.
the stack behind the live data on the right
pyproject.toml

              
receiver/app.py · file header

              
per-host shipping
awaiting snapshot…
episode database · last 200 records
0 of 0
host episode_id received size
phase mix · sampling dataset…
computing the phase distribution across a random sample of episodes on disk. A clean fleet sits mostly in clean; skew toward infecting / infected_running reflects time spent under attack workloads.
attack envelopes · /proc signature per profile
10-second windows · model input shape
each window: 100 samples (10 Hz × 10 s), labeled by the phase that occupies its center.
how we trained the sequence models
training/models/lstm.py

              
training/trainer/_loop.py · train_nn

              
sequence models · accuracy on held-out samples
window features · 3-D projection · drag to rotate
references · papers, notes, prior work
accuracy vs inference cost
x: μs / window (lower is better) · y: held-out accuracy (higher is better).
A100 inference · live 0 models 0 infer / sec last window: — hit-rate: —
awaiting live_detection events from the A100 inference loop

Most malware doesn't look like malware in a database — it looks like a process behaving badly.

An intrusion detection system spots the bad behavior; an intrusion prevention system stops it. Both depend on knowing what bad behavior looks like at the level of telemetry the device can actually see.

This deck is the live face of the dataset we're building to teach a model that distinction — every panel on the left is a slice of real data shipping in right now.

Collecting the dataset

Each lab host on the WireGuard mesh boots a real Alpine VM, runs a profile-driven workload inside it, and samples /proc/<qemu_pid> at 10 Hz. Every ~30 seconds the labeled tarball is shipped to this Pi over mTLS.
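
A minimal sketch of what one 10 Hz sample could contain, assuming the sampler reads /proc/<qemu_pid>/stat. The deck does not say which /proc files the orchestrator actually reads, and ProcSample and parse_stat are hypothetical names. Field positions follow the proc(5) man page.

```python
from dataclasses import dataclass


@dataclass
class ProcSample:
    utime_ticks: int   # field 14: user-mode CPU time, in clock ticks
    stime_ticks: int   # field 15: kernel-mode CPU time, in clock ticks
    num_threads: int   # field 20
    rss_pages: int     # field 24: resident set size, in pages


def parse_stat(line: str) -> ProcSample:
    """Parse one line of /proc/<pid>/stat (layout per proc(5))."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' rather than on whitespace.
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return ProcSample(
        utime_ticks=int(rest[14 - 3]),
        stime_ticks=int(rest[15 - 3]),
        num_threads=int(rest[20 - 3]),
        rss_pages=int(rest[24 - 3]),
    )
```

The counters are cumulative, so a sampler would difference consecutive reads to get per-tick CPU usage.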

The counter on the left is the running total, sourced from the receiver's index.jsonl on disk. The sparkline is the arrival rate over the last sixty seconds — proof that the deck is reading live data, not a fixed slide.
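
The sixty-second arrival rate can be sketched as a trailing-window counter. This is an illustration only: ArrivalRate is a hypothetical name, and the receiver's real bookkeeping is not shown in the deck.

```python
from collections import deque


class ArrivalRate:
    """Episodes per second over a trailing window (default 60 s)."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._arrivals = deque()

    def record(self, ts: float) -> None:
        """Note one episode arrival at time ts (seconds, non-decreasing)."""
        self._arrivals.append(ts)

    def rate(self, now: float) -> float:
        # Drop arrivals that have aged out of the trailing window.
        while self._arrivals and self._arrivals[0] <= now - self.window_s:
            self._arrivals.popleft()
        return len(self._arrivals) / self.window_s
```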

Why detect at all?

Knowing a device is compromised is the precondition for everything else. A classifier that says "this host is infected right now" turns into three concrete operational capabilities — and each one rewards a faster, more confident detector.

Trust scoring across the network. Recent work on per-device trust establishment (IEEE 9881803) argues that on-device metrics alone aren't enough: a fleet has to combine local classifier verdicts with network-behavior signals (peer observations, gateway traffic patterns, inter-host relationships) to score trust reliably. Our per-host detector is one input to that broader signal.

Containment. Once a host is flagged, the gateway can drop its traffic and the IAM layer can revoke credentials before lateral movement begins. Detection latency translates directly into how much of the network an attacker reaches.

Quick recovery. A confirmed infection time lets you restore from a snapshot taken just before the compromise — no forensic dwell time, no guessing how far back to roll. The recovery path becomes a one-button operation instead of a week of cleanup.

Live, not staged

Every panel from here on is real data from real devices — counters, bars, the episode database, all driven by the cis490-receiver service running on this Pi as you scroll.

The code on the left is how it gets here. There are four runtime dependencies: starlette and uvicorn provide the async HTTP and WebSocket surface, msgpack speaks Metasploit's RPC protocol, and pycdlib builds the lab-VM cidata ISOs. Everything else is the standard library, and every dependency is annotated with a one-line reason it's there.

A multi-host fleet

Running the same orchestrator on multiple hosts yields new, non-overlapping data from each host without a central coordinator. Each host pulls a different slice of the manifest, so the dataset grows in parallel.

The numbers below are absolute episode counts on disk, refreshed from /var/lib/cis490/episodes/<host>/ every thirty seconds.

The dataset, browsable

Every row is one labeled episode tarball stored at /var/lib/cis490/episodes/<host>/<id>.tar.zst after the receiver verifies its SHA-256 and writes it through.
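
A hedged sketch of the verify-then-write step, assuming the sender supplies a hex SHA-256 alongside the tarball bytes. store_verified is a hypothetical helper, not the receiver's actual function, and the write-to-temp-then-rename pattern is one common way to avoid exposing half-written files.

```python
import hashlib
import os
from pathlib import Path


def store_verified(payload: bytes, claimed_sha256: str, dest: Path) -> bool:
    """Write payload to dest only if its SHA-256 matches the claimed digest."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != claimed_sha256.lower():
        return False  # reject; nothing is written
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before publishing the name
    os.replace(tmp, dest)      # atomic rename on the same filesystem
    return True
```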

Filter by host with the tabs, or grep by host / episode id / sha with the search box. Click a row for the full index.jsonl record. The view holds the most recent two hundred records — older history is on disk, indexable from the receiver.

A baseline of normal

Before we can detect a deviation, we have to know what the fleet looks like across a wide slice of its life. The stacked bar aggregates ground-truth phase labels across hundreds of randomly sampled episodes from the dataset on disk — weighted by the time the workload actually spent in each phase, not just the count of transitions.
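
Time weighting, as opposed to counting transitions, can be sketched as follows. The input format here is an assumption: a sorted list of (start_time_s, phase) pairs plus the episode end time; the real label schema is not shown in the deck, and phase_mix is a hypothetical name.

```python
from collections import defaultdict


def phase_mix(transitions, episode_end):
    """Fraction of wall-clock time spent in each phase.

    transitions: list of (start_time_s, phase), sorted by time.
    episode_end: time at which the last phase stops.
    """
    seconds = defaultdict(float)
    for i, (start, phase) in enumerate(transitions):
        # A phase lasts until the next transition, or until the episode ends.
        end = transitions[i + 1][0] if i + 1 < len(transitions) else episode_end
        seconds[phase] += end - start
    total = sum(seconds.values())
    return {p: s / total for p, s in seconds.items()} if total else {}
```

Summing the per-phase seconds across sampled episodes before normalizing gives the fleet-wide bar.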

If the model only ever sees clean, it overfits to "everything is fine." The phase schedule prevents that by forcing every run to walk through every phase, which is why infected_running dominates the mix: that is where the labeled attack workload sits.

Linking attack to telemetry

The same six profiles run across every host, and each one produces a different envelope in /proc. A cryptominer pegs one core for minutes. A bursty C2 channel sits idle, then exhales three packets. Ransomware walks the filesystem and saturates I/O.

The thumbnails on the left are the canonical envelopes the model has to learn to recognize — same axes, different shapes. That shape difference is what makes detection tractable.

Ten-second windows

Models eat fixed-size inputs. We chop each episode into 10-second windows — 100 samples per window at 10 Hz — and label each window with the phase that occupies its center.
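
The windowing rule above can be sketched in a few lines. Two assumptions are made that the deck does not state: windows do not overlap, and a trailing partial window is dropped. make_windows is a hypothetical name.

```python
def make_windows(samples, phases, rate_hz=10, window_s=10):
    """Split one episode into non-overlapping fixed-size windows.

    samples: per-tick feature rows; phases: the phase label of each tick.
    Returns a list of (window_rows, label) pairs.
    """
    size = rate_hz * window_s          # 100 ticks for 10 Hz x 10 s
    out = []
    for start in range(0, len(samples) - size + 1, size):
        center = start + size // 2     # tick 50 within a 100-tick window
        out.append((samples[start:start + size], phases[center]))
    return out
```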

Window size is a knob. Too short and the model can't see slow envelopes (low-and-slow malware, idle C2). Too long and you can't react fast enough to be a useful prevention signal. Ten seconds is the starting point we tune around.

How we trained them

One trainer per model: load the windowed dataset, define the network, train, evaluate. The RNN, GRU, LSTM, and BERT trainers share that shape, so you can read all four side by side and the only differences are the architectures themselves.

The code on the left is the LSTM trainer. PyTorch's DataLoader batches the pre-cut windows, nn.LSTM is one line, and the training loop is six. No custom loss, no learning-rate schedule, no manual batching; anything fancier has to earn its place by beating the simple version on held-out samples.
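
For readers without the panel in front of them, here is a minimal sketch of a trainer in that spirit. It is not the project's code: the name train_nn is borrowed from the panel label, but the body, the LSTMClassifier class, the hidden size, the Adam optimizer, and the channel and phase counts are all assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class LSTMClassifier(nn.Module):
    def __init__(self, n_channels: int, n_phases: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, x):                  # x: (batch, 100, n_channels)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden)
        return self.head(h_n[-1])          # logits: (batch, n_phases)


def train_nn(model, loader, epochs: int = 1, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()        # stock loss, no schedule
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return model


# Usage on random placeholder data (4 channels and 3 phases are assumptions).
x = torch.randn(8, 100, 4)
y = torch.randint(0, 3, (8,))
loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)
model = train_nn(LSTMClassifier(n_channels=4, n_phases=3), loader)
```

Swapping nn.LSTM for nn.GRU or nn.RNN changes only the recurrent layer and how its returned state is unpacked, which is the side-by-side property described above.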

Sequence models

RNN, GRU, LSTM — recurrent models that read the window one timestep at a time and carry state forward. Cheap, mature, easy to interpret.

BERT-style transformer — the window becomes a sequence of "tokens"; attention captures cross-position context instead of accumulating it through a hidden state. More parameters, more compute, more room to overfit a small dataset.

Same input, same labels, four different inductive biases. The comparison on the left is the punchline of the whole project.

Nearest-neighbor as a sanity check

Before anything fancy: engineer summary features per window (mean, std, p95, slope, zero-bucket counts per channel) and run KNN in that feature space.

If the phase clusters separate visibly in the projection, KNN already does most of the work and a deep model buys only marginal improvement. If they don't separate, you've learned something about the feature engineering before training a single epoch.
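
A standard-library sketch of that baseline, under stated assumptions: p95 uses the nearest-rank method, "zero-bucket count" is read as the number of zero-valued samples in the channel, and distance is plain Euclidean. channel_features and knn_predict are hypothetical names, not the project's.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def channel_features(values):
    """Summary features for one channel of one window."""
    n = len(values)
    ordered = sorted(values)
    rank = (95 * n + 99) // 100            # ceil(0.95 * n), integer math
    p95 = ordered[max(0, rank - 1)]        # nearest-rank 95th percentile
    # Least-squares slope of value against tick index.
    t_mean, v_mean = (n - 1) / 2, fmean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den if den else 0.0
    zeros = sum(1 for v in values if v == 0)
    return [v_mean, pstdev(values), p95, slope, zeros]


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label). Majority vote of k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because the features live on very different scales, standardizing each one before measuring distance is usually worthwhile; that step is omitted here for brevity.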

Accuracy vs complexity

Bigger models tend to earn better numbers on the validation set, but they also carry more parameters, more inference time, and more memory at the edge. The deployed model has to fit on the device it's protecting.

The scatter on the left is the usable trade-off curve: every point above and to the left of where you currently sit is a reachable upgrade. The point in the bottom-right is a model you'd never ship.
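
"Above and to the left" is a dominance test on the two axes. A small sketch, with ModelPoint and upgrades as hypothetical names:

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    us_per_window: float   # x axis: lower is better
    accuracy: float        # y axis: higher is better


def upgrades(current, candidates):
    """Candidates no slower and no less accurate than current,
    and strictly better on at least one axis."""
    return [
        c for c in candidates
        if c.us_per_window <= current.us_per_window
        and c.accuracy >= current.accuracy
        and (c.us_per_window < current.us_per_window
             or c.accuracy > current.accuracy)
    ]
```

A candidate that is more accurate but slower is not returned; it is a trade, not a strict upgrade, and whether it is worth taking depends on the device's latency budget.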

Catching attacks live

The A100 runs inference against incoming ten-second windows from the fleet. Each row on the stage is one trained model doing live prediction; each cell is its phase verdict on a freshly arrived window, colored by the predicted phase.

Read the lanes side by side as a model-agreement check: when the recurrent models (RNN / GRU / LSTM) all flip to infecting at the same time, that's strong evidence the host really is being infected. When ground truth from labels.jsonl catches up, mismatched cells get a hatched overlay and the running hit-rate updates. The callout below holds the most recent prediction with model name, A100 round-trip latency, and confidence.
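
The agreement check and the running hit-rate can be sketched as below. The model keys ("rnn", "gru", "lstm") and both names are hypothetical; the deck does not show how the stage computes either value.

```python
def recurrent_agreement(verdicts, family=("rnn", "gru", "lstm")):
    """Return the shared phase if every model in family agrees, else None.

    verdicts: dict mapping model name -> predicted phase for one window.
    """
    if not all(m in verdicts for m in family):
        return None                        # a lane is missing; no verdict
    phases = {verdicts[m] for m in family}
    return phases.pop() if len(phases) == 1 else None


class HitRate:
    """Running fraction of predictions that matched ground truth."""

    def __init__(self):
        self.hits = 0
        self.total = 0

    def update(self, predicted, truth):
        self.total += 1
        self.hits += predicted == truth    # True counts as 1
        return self.hits / self.total
```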

References

The papers, notes, and prior work this project leans on. Pick a tab on the left to load the document; the viewer takes the bulk of the stage so you can scroll through without leaving the deck.
