Most malware doesn't look like malware in a database — it looks like a process behaving badly.
An intrusion detection system spots the bad behavior; an intrusion prevention system stops it. Both depend on knowing what bad behavior looks like at the level of telemetry the device can actually see.
This deck is the live face of the dataset we're building to teach a model that distinction — every panel on the left is a slice of real data shipping in right now.
Collecting the dataset
Each lab host on the WireGuard mesh boots a real Alpine VM, runs
a profile-driven workload inside it, and samples
/proc/<qemu_pid> at 10 Hz. Every ~30 seconds
the labeled tarball is shipped to this Pi over mTLS.
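The sampling step can be sketched roughly as follows. This is an illustration, not the project's collector: the helper names parse_stat and sample, and the choice of fields (CPU ticks, thread count, resident pages), are assumptions. Only the field numbering of /proc/<pid>/stat comes from the Linux proc(5) manual.

```python
import time
from pathlib import Path


def parse_stat(line: str) -> dict:
    """Parse one /proc/<pid>/stat line into a few numeric fields.

    The comm field (field 2) is wrapped in parentheses and may contain
    spaces, so split on the last ')' before tokenizing the rest.
    """
    _, _, rest = line.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state) in proc(5) numbering,
    # so proc(5) field N lives at fields[N - 3].
    return {
        "utime_ticks": int(fields[14 - 3]),
        "stime_ticks": int(fields[15 - 3]),
        "num_threads": int(fields[20 - 3]),
        "rss_pages": int(fields[24 - 3]),
    }


def sample(pid: int, hz: float = 10.0, seconds: float = 1.0) -> list[dict]:
    """Sample /proc/<pid>/stat at a fixed rate (hypothetical helper)."""
    period = 1.0 / hz
    stat_path = Path(f"/proc/{pid}/stat")
    rows = []
    next_tick = time.monotonic()
    for _ in range(int(hz * seconds)):
        row = parse_stat(stat_path.read_text())
        row["t"] = time.time()
        rows.append(row)
        # Sleep until a monotonic deadline so timing error does not accumulate.
        next_tick += period
        time.sleep(max(0.0, next_tick - time.monotonic()))
    return rows
```

Calling sample(qemu_pid, hz=10.0, seconds=30.0) would yield about 300 rows, one batch per shipping interval under these assumptions.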
The counter on the left is the running total, sourced from the
receiver's index.jsonl on disk. The sparkline is the
arrival rate over the last sixty seconds — proof that the deck
is reading live data, not a fixed slide.
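A minimal sketch of how a running total and a sixty-second arrival rate could be derived from a JSONL index. The record layout is not given in the deck, so the received_at field (a Unix timestamp) and the function name summarize_index are assumptions.

```python
import json
from typing import Iterable


def summarize_index(lines: Iterable[str], now: float, window_s: int = 60) -> tuple[int, list[int]]:
    """Return (total records, per-second arrival counts over the last window_s seconds)."""
    total = 0
    buckets = [0] * window_s  # buckets[-1] is the most recent second
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        age = now - float(record["received_at"])  # hypothetical field name
        if 0 <= age < window_s:
            buckets[window_s - 1 - int(age)] += 1
    return total, buckets
```

Passing an open file handle as lines and time.time() as now gives the counter value and the sparkline series in one pass.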
Why detect at all?
Knowing a device is compromised is the precondition for everything else. A classifier that says "this host is infected right now" turns into three concrete operational capabilities — and each one rewards a faster, more confident detector.
Trust scoring across the network. Recent work on per-device trust establishment (IEEE 9881803) argues that on-device metrics alone aren't enough — a fleet has to combine local classifier verdicts with network-behaviour signals (peer observations, gateway traffic patterns, inter-host relationships) to score trust reliably. Our per-host detector is one input to that broader signal.
Containment. Once a host is flagged, the gateway can drop its traffic and the IAM layer can revoke credentials before lateral movement begins. Detection latency translates directly into how much of the network an attacker reaches.
Quick recovery. A confirmed infection time lets you restore from a snapshot taken just before the compromise — no forensic dwell time, no guessing how far back to roll. The recovery path becomes a one-button operation instead of a week of cleanup.
Live, not staged
Every panel from here on is real data from real devices —
counters, bars, the episode database, all driven by the
cis490-receiver service running on this Pi as
you scroll.
The code on the left is how it gets here. Four runtime deps: starlette + uvicorn for the async HTTP and WebSocket surface, msgpack to talk to Metasploit's RPC, and pycdlib to build the lab-VM cidata ISOs. Everything else is the standard library, and every dep is annotated with a one-line reason it's there.
A multi-host fleet
Running the same orchestrator on multiple hosts gives novel, non-overlapping data per host — no central coordinator. Each host pulls a different slice of the manifest, so the dataset grows in parallel.
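One way hosts can take disjoint slices without talking to each other is to hash each manifest entry and keep only the entries that land in the host's own bucket. The deck does not say how slices are assigned, so the scheme below and the names manifest_slice, host_index, and n_hosts are assumptions for illustration.

```python
import hashlib


def manifest_slice(entries: list[str], host_index: int, n_hosts: int) -> list[str]:
    """Keep the entries whose stable hash maps to this host's bucket."""
    if not 0 <= host_index < n_hosts:
        raise ValueError("host_index must be in [0, n_hosts)")
    mine = []
    for entry in entries:
        # sha256 is stable across processes, unlike Python's salted hash().
        digest = hashlib.sha256(entry.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_hosts
        if bucket == host_index:
            mine.append(entry)
    return mine
```

Each host only needs its own index and the fleet size, both of which can be static configuration rather than a runtime coordinator.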
The numbers below are absolute episode counts on disk, refreshed
from /var/lib/cis490/episodes/<host>/ every
thirty seconds.
The dataset, browsable
Every row is one labeled episode tarball stored at
/var/lib/cis490/episodes/<host>/<id>.tar.zst
after the receiver verifies its SHA-256 and writes it through.
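The verify-then-store step might look like the sketch below. It assumes the sender supplies the expected digest alongside the payload and that "writes it through" means an atomic write to disk; the function name store_episode is hypothetical and this is not the receiver's actual code.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_episode(payload: bytes, expected_sha256: str, dest: Path) -> bool:
    """Write payload to dest only if its SHA-256 matches; return True on success."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256.lower():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename, so readers
    # never observe a partially written tarball.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, dest)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```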
Filter by host with the tabs, or grep by host / episode id /
sha with the search box. Click a row for the full
index.jsonl record. The view holds the most recent
two hundred records — older history is on disk, indexable
from the receiver.
A baseline of normal
Before we can detect a deviation, we have to know what the fleet looks like across a wide slice of its life. The stacked bar aggregates ground-truth phase labels across hundreds of randomly sampled episodes from the dataset on disk — weighted by the time the workload actually spent in each phase, not just the count of transitions.
If the model only ever sees clean data, it overfits to
"everything is fine." The phase schedule fixes that by forcing
every run to walk through every phase, which is why
infected_running dominates the mix — that's where
the labelled attack workload sits.
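The difference between weighting by time and counting transitions can be made concrete with a small sketch. It assumes each episode's labels are a sorted list of (start_time, phase) transitions plus an end time; that layout and the function names are assumptions, not the dataset's actual schema.

```python
from collections import defaultdict


def phase_durations(transitions: list[tuple[float, str]], end_time: float) -> dict[str, float]:
    """Sum seconds spent in each phase for one episode.

    Each phase lasts until the next transition; the last one lasts until end_time.
    """
    totals: dict[str, float] = defaultdict(float)
    for i, (start, phase) in enumerate(transitions):
        stop = transitions[i + 1][0] if i + 1 < len(transitions) else end_time
        totals[phase] += max(0.0, stop - start)
    return dict(totals)


def phase_mix(episodes: list[tuple[list[tuple[float, str]], float]]) -> dict[str, float]:
    """Aggregate time-weighted phase fractions across episodes."""
    totals: dict[str, float] = defaultdict(float)
    for transitions, end_time in episodes:
        for phase, seconds in phase_durations(transitions, end_time).items():
            totals[phase] += seconds
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {phase: seconds / grand_total for phase, seconds in totals.items()}
```

For an episode that spends 10 s in clean, 5 s in infecting, and 45 s in infected_running, counting transitions would give each phase one third, while the time-weighted mix gives infected_running three quarters.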
Linking attack to telemetry
The same six profiles run across every host, and each one
produces a different envelope in /proc. A
cryptominer pegs one core for minutes. A bursty C2 channel sits
idle, then exhales three packets. Ransomware walks the
filesystem and saturates I/O.
The thumbnails on the left are the canonical envelopes the model has to learn to recognize — same axes, different shapes. That shape difference is what makes detection tractable.
Ten-second windows
Models eat fixed-size inputs. We chop each episode into 10-second windows — 100 samples per window at 10 Hz — and label each window with the phase that occupies its center.
Window size is a knob. Too short and the model can't see slow envelopes (low-and-slow malware, idle C2). Too long and you can't react fast enough to be a useful prevention signal. Ten seconds is the starting point we tune around.
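A minimal sketch of the windowing rule: 100 samples per window, labeled by the phase at the window's center. The stride parameter (non-overlapping by default) and the choice to drop a trailing partial window are assumptions; the deck does not specify either.

```python
def make_windows(samples: list, phases: list[str], window: int = 100, stride: int = 100):
    """Split one episode into fixed-size windows labeled by the center sample's phase.

    samples and phases are parallel per-sample lists (at 10 Hz, 100 samples = 10 s).
    A trailing partial window is dropped.
    """
    if len(samples) != len(phases):
        raise ValueError("samples and phases must have the same length")
    out = []
    for start in range(0, len(samples) - window + 1, stride):
        center = start + window // 2
        out.append((samples[start:start + window], phases[center]))
    return out
```

Turning the knob is then a matter of changing window (and optionally stride) and retraining on the result.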
How we trained them
One trainer per model: load the windowed dataset, define the network, train, evaluate. The shape is the same for RNN, GRU, LSTM, and BERT, so you can read all four side by side and the only difference is the architecture itself.
The code on the left is the LSTM trainer.
PyTorch's DataLoader batches the windows,
nn.LSTM is one line, the loop is six.
No custom loss, no learning-rate schedule, no manual batching;
anything fancier has to earn its place by beating the simple
version on held-out samples.
Sequence models
RNN, GRU, LSTM — recurrent models that read the window one timestep at a time and carry state forward. Cheap, mature, easy to interpret.
BERT-style transformer — the window becomes a sequence of "tokens"; attention captures cross-position context instead of accumulating it through a hidden state. More parameters, more compute, more room to overfit a small dataset.
Same input, same labels, four different inductive biases. The comparison on the left is the punchline of the whole project.
Nearest-neighbor as a sanity check
Before anything fancy: engineer summary features per window (mean, std, p95, slope, zero-bucket counts per channel) and run KNN in that feature space.
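For a single channel, the summary features could be computed as in the sketch below. Two details are assumptions: "zero-bucket count" is read here as the number of samples equal to zero, and p95 uses the nearest-rank definition. The function name window_features is hypothetical.

```python
import math


def window_features(values: list[float]) -> dict[str, float]:
    """Summary features for one channel of one window."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    # Nearest-rank 95th percentile; integer math avoids float rounding in ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    p95 = sorted(values)[rank - 1]
    # Least-squares slope of value against sample index 0..n-1.
    x_mean = (n - 1) / 2
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - mean) for i, v in enumerate(values))
    slope = sxy / sxx
    zero_count = sum(1 for v in values if v == 0)
    return {"mean": mean, "std": std, "p95": p95, "slope": slope, "zero_count": zero_count}
```

Concatenating these per-channel dictionaries gives one feature vector per window.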
If the phase clusters separate visibly in two dimensions, KNN already does most of the work and a deep model is only buying marginal improvement. If they don't separate, you've learned something about the feature engineering before training a single epoch.
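The KNN step itself is small enough to write out. This is a plain majority-vote sketch over Euclidean distance, not the project's implementation; the function name knn_predict and the default k are assumptions. In practice the features would need scaling first so that no single channel dominates the distance.

```python
import math
from collections import Counter


def knn_predict(train: list[tuple[list[float], str]], query: list[float], k: int = 3) -> str:
    """Majority vote over the k nearest training points (Euclidean distance)."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, most_common keeps first-seen order, so the nearer label wins.
    return votes.most_common(1)[0][0]
```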
Accuracy vs complexity
Bigger models earn better numbers on the validation set, but they also cost more inference time and more memory at the edge. The deployed model has to fit on the device it's protecting.
The scatter on the left is the usable trade-off curve: every point above and to the left of where you currently sit is a reachable upgrade. The point in the bottom-right is a model you'd never ship.
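The "reachable upgrade" idea is a Pareto-dominance check: a model is worth considering only if no other model is both cheaper and at least as accurate. A minimal sketch follows; the tuple layout (name, cost, accuracy) and the function name pareto_front are assumptions, and cost stands in for whichever complexity measure the scatter uses.

```python
def pareto_front(models: list[tuple[str, float, float]]) -> list[str]:
    """Return the names of models not dominated by any other.

    Each model is (name, cost, accuracy). A dominates B when A costs no more,
    is at least as accurate, and is strictly better on one of the two.
    """
    front = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            front.append(name)
    return front
```

With placeholder values such as ("small", 1.0, 0.80), ("medium", 5.0, 0.90), ("large", 20.0, 0.93), and ("bloated", 25.0, 0.70), the last entry is dominated and drops out; it is the bottom-right point nobody would ship. These numbers are illustrative only, not results from the project.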
Catching attacks live
The A100 runs inference against incoming ten-second windows from the fleet. Each row on the stage is one trained model doing live prediction; each cell is its phase verdict on a freshly-arrived window, painted by the predicted phase.
Read the lanes side-by-side as a model-agreement check:
when the recurrent family (RNN / GRU / LSTM) all flip to
infecting at the same time, that's strong
evidence the host actually is. When ground truth from
labels.jsonl catches up, mismatched cells get
a hatched overlay and the running hit rate updates. The
callout below holds the most recent prediction with model
name, A100 round-trip latency, and confidence.
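The agreement check and the running hit rate can be sketched as below. The data shapes (a dict from model name to predicted phase per window) and the names agreement and HitRate are assumptions for illustration, not the deck's actual code.

```python
from collections import Counter


def agreement(verdicts: dict[str, str]) -> tuple[str, float]:
    """Return (most common phase, fraction of models that predicted it)."""
    counts = Counter(verdicts.values())
    phase, votes = counts.most_common(1)[0]
    return phase, votes / len(verdicts)


class HitRate:
    """Running per-model hit rate, updated when ground truth arrives."""

    def __init__(self) -> None:
        self.hits: Counter = Counter()
        self.seen: Counter = Counter()

    def update(self, verdicts: dict[str, str], truth: str) -> list[str]:
        """Score one window; return the models whose verdict mismatched."""
        mismatched = []
        for model, phase in verdicts.items():
            self.seen[model] += 1
            if phase == truth:
                self.hits[model] += 1
            else:
                mismatched.append(model)
        return mismatched

    def rate(self, model: str) -> float:
        """Fraction of scored windows this model got right (0.0 if none scored)."""
        return self.hits[model] / self.seen[model] if self.seen[model] else 0.0
```

The list returned by update corresponds to the cells that would receive the hatched overlay for that window.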
References
The papers, notes, and prior work this project leans on. Pick a tab on the left to load the document; the viewer takes the bulk of the stage so you can scroll through without leaving the deck.