End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.

Components:

  vm/launch_demo.sh            now boots Alpine 3.21 nocloud-cloudinit
                               (Cirros 0.6.x's cirros-init blocks on the
                               EC2 metadata service for ~17 min before
                               falling through to NoCloud — abandoned).
                               Mounts a cidata ISO as a second drive.
  tools/build_cidata.py        pure-Python NoCloud ISO builder (pycdlib).
                               Sets root password and ssh_pwauth via
                               runcmd so we don't depend on a specific
                               cloud-init version's plain_text_passwd
                               handling.
  tools/vm_serial.py           serial-console client (stdlib socket).
                               Idempotent login (detects already-in-shell
                               state), sentinel-bracketed run() that
                               distinguishes shell output from the TTY
                               echo of input by requiring a leading
                               \r\n boundary on the marker.
  tools/vm_load_controller.py  in-guest load controller. set_phase()
                               dispatches the per-phase shell command
                               over the serial connection.
  tools/run_real_vm_demo.py    ties it all together: boot VM, wait for
                               cloud-init runcmd, log in, run the
                               EpisodeRunner with on_phase=controller,
                               shut down VM.

Deps: paramiko, pycdlib added.

docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.

README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
275 lines
11 KiB
Markdown
# CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that
watches performance metrics on a real device, decides whether the device has
been breached, and triggers a hardware-level reset when confidence is high
enough. This repository covers the **dataset side** — we run public malware
samples against intentionally vulnerable Linux VMs and capture labeled
time-series telemetry that mirrors what the deployed model would see in the
field.

The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

---

## What an episode looks like

Each episode runs a target through a labeled phase schedule
(`clean → armed → infecting → infected_running → dormant → ...`) while
sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is
the set of timestamped phase transitions written to `labels.jsonl` —
sharing a monotonic clock with the metric rows so anything aligned in
time can be aligned in code.

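
Because the labels and the metric rows share one clock, labeling a telemetry row reduces to finding the last phase transition at or before its timestamp. The following is a minimal sketch of that lookup, not the project's loader: the field names `t_mono` and `phase` are assumptions made for illustration, and the real schema is defined in `docs/data-model.md`.

```python
import bisect
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def label_rows(rows, labels, ts_key="t_mono", phase_key="phase"):
    """Attach the phase that was active at each row's timestamp.

    ts_key and phase_key are assumed field names, not the project's schema.
    Rows that precede the first transition get phase None.
    """
    labels = sorted(labels, key=lambda lab: lab[ts_key])
    starts = [lab[ts_key] for lab in labels]
    labeled = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        labeled.append({**row, phase_key: phase})
    return labeled
```

A caller would pass the parsed contents of `labels.jsonl` and `telemetry-proc.jsonl` (for example via `load_jsonl`) and get back the same rows with a phase attached.
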
### Tier 2 — *real Alpine VM, real workload driven from inside the guest*

This is the closest we get to real-malware behaviour without yet running
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
guest, **and the load is generated inside the guest** by busybox
`yes` (CPU saturation) and `dd` (disk bursts), driven over the
serial console by `tools/vm_load_controller.py`. Every phase transition
in `labels.jsonl` corresponds to an actual command issued inside the
real VM.



The 100% CPU plateaux are `yes > /dev/null` running on the guest's
single vCPU; the IO spikes during *infecting* are `dd if=/dev/urandom`
producing the sample-drop shape; the *dormant* drops are the
controller killing the load process inside the VM. The
infected_running → dormant → infected_running re-entry is the textbook
envelope that justifies the whole project framing.

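
The per-phase dispatch described above can be pictured as a small lookup table from phase name to guest shell command. This is a hedged sketch rather than the contents of `tools/vm_load_controller.py`: the class name `SketchLoadController`, the exact command strings, the `dd` size, and the `/tmp/sample.bin` path are all assumptions chosen for illustration.

```python
# Hypothetical phase -> guest shell command table. The commands mirror the
# workload described in the text; flags, sizes, and paths are assumptions.
PHASE_COMMANDS = {
    "clean": "killall yes 2>/dev/null; true",
    "armed": "true",
    "infecting": "dd if=/dev/urandom of=/tmp/sample.bin bs=1M count=64",
    "infected_running": "yes > /dev/null &",
    "dormant": "killall yes 2>/dev/null; true",
}


class SketchLoadController:
    """Send one shell command to the guest per phase transition."""

    def __init__(self, run):
        # `run` is any callable that takes a shell command string, for
        # example a serial-console client's run method.
        self._run = run
        self.history = []

    def set_phase(self, phase):
        try:
            command = PHASE_COMMANDS[phase]
        except KeyError:
            raise ValueError(f"unknown phase: {phase!r}") from None
        self._run(command)
        self.history.append((phase, command))
        return command
```

Keeping the table as plain data makes it easy to check that every phase in a schedule has a command before the VM is booted.
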
Reproduce with:

```sh
uv run python tools/run_real_vm_demo.py --data-root data
```

### Tier 1 — *real Alpine VM, idle baseline*

Same pipeline, pointed at the real `qemu-system` process while the
guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer
interrupts; the single disk-write spike near t=3 s is the guest
finishing late-boot activity.



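
For readers unfamiliar with how a CPU percentage is derived from `/proc/<qemu_pid>`, the sketch below shows the usual arithmetic: read `utime + stime` from `/proc/<pid>/stat` twice and divide the tick delta by the elapsed time. It is an illustration under assumptions, not the project's collector; the function names are hypothetical, and the default of 100 ticks per second is only the common Linux value (the real one comes from `os.sysconf("SC_CLK_TCK")`).

```python
def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the
    # last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, elapsed_s, clk_tck=100):
    """CPU usage over the interval, where 100.0 means one full core."""
    return 100.0 * (ticks_after - ticks_before) / clk_tck / elapsed_s


def read_cpu_ticks(pid):
    """Read the current tick count for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat", encoding="ascii", errors="replace") as fh:
        return parse_cpu_ticks(fh.read())
```

Since the counters cover every thread of the process, the result can exceed 100 for a multi-threaded process such as QEMU even when the guest has a single vCPU.
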
### Pipeline-validation plot — *synthetic load, real telemetry*

This is **not real malware** and the load is **not** even running
inside a VM — it's a Python program on the host (`tools/load_mimic.py`)
that mimics an XMRig-style envelope. We used it to validate the
orchestrator + collector + labeling pipeline before plugging in a real
guest. Kept here because it shows the same shape the tier-2 plot
above produces from real KVM behaviour.



### What's still missing for the real-malware envelope

| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
| 2 — real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |

For an interactive view of any episode (zoom/pan/hover), run:

```sh
tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/
```

---

## Status

- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
- ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)

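
To make "sha256-verified, idempotent" concrete, here is a minimal sketch of what such an ingest step can look like on the storage side. It is not the code in `receiver/`: the function name `store_upload`, the `.tar` naming, and the return values are assumptions for illustration, and the sketch does not validate `episode_id`.

```python
import hashlib
from pathlib import Path


def store_upload(root, episode_id, body, claimed_sha256):
    """Store an uploaded tarball once; repeat uploads are a no-op.

    Returns "created" or "exists". Raises ValueError if the body does not
    match the claimed digest, or if different content is already stored
    under the same id.
    """
    actual = hashlib.sha256(body).hexdigest()
    if actual != claimed_sha256.lower():
        raise ValueError("sha256 mismatch")

    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    dest = root / f"{episode_id}.tar"
    if dest.exists():
        if hashlib.sha256(dest.read_bytes()).hexdigest() == actual:
            return "exists"  # same bytes already stored: idempotent success
        raise ValueError("episode id already stored with different content")

    # Write to a temporary name, then rename, so readers never see a
    # partially written tarball.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(body)
    tmp.replace(dest)
    return "created"
```

Treating a repeated upload of identical bytes as success lets a sender retry after a dropped connection without special handling.
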

> **Topology note:** in this project the **Pi5 is the WireGuard-side
> *collector*** that receives episode tarballs from one or more lab hosts.
> It is *not* the deployment target for the model. The deployment target is
> generic ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).

---

<details>
<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>

```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490

# One-time setup.
uv sync

# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data

# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/<episode_id>

# Or open an interactive plot in your browser:
tools/show_envelope.sh data/episodes/<episode_id>
```

The data lands in `data/episodes/<ulid>/`:

```
meta.json             episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl          orchestrator actions (snapshot_load, phase_transition, episode_end)
labels.jsonl          one row per phase transition — THIS is the envelope
telemetry-proc.jsonl  host /proc sampler at 10 Hz
done.marker           written last; the shipper only sees finished episodes
```

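
Since `done.marker` is written last, a consumer can treat its presence as the signal that an episode directory is complete. The sketch below illustrates that check; the function name `finished_episodes` is hypothetical and this is not the shipper's actual code.

```python
from pathlib import Path


def finished_episodes(data_root):
    """Return episode directories that contain a done.marker, sorted by name."""
    episodes_dir = Path(data_root) / "episodes"
    if not episodes_dir.is_dir():
        return []
    return sorted(
        path
        for path in episodes_dir.iterdir()
        if path.is_dir() and (path / "done.marker").exists()
    )
```

Because episode ids are ULIDs, sorting by directory name also orders the result by creation time.
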
</details>

<details>
<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>

The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
QMP/monitor sockets and pidfile. The orchestrator then samples the real
`qemu-system` process.

```sh
# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)

# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh

# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid $QPID --duration 20

# Plot:
tools/show_envelope.sh data/episodes/<episode_id>
```

The idle-VM envelope shape is distinct from the synthetic load: periodic
~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
late-boot disk write. That's a real KVM guest you're seeing.

</details>

<details>
<summary><b>Repository layout</b></summary>

| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| `tools/` | Demo runners, load mimic, plot scripts |
| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `training/` | Model training code (deferred — schema first) |
| `etc/` | systemd units and config templates installed by the deploy scripts |

</details>

<details>
<summary><b>Design decisions — why these choices</b></summary>

- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need
  to run real malware without compromising the host. KVM gives both at
  near-native speed; containers share the host kernel and many samples detect
  containerization and refuse to detonate. See
  [`docs/architecture.md`](docs/architecture.md).
- **Why KVM (not TCG/-icount)?** ML training data needs realistic noise to
  generalize from, and KVM execution keeps it. KVM is also ~15× faster than
  TCG, which directly multiplies dataset size per wall-clock hour. We pin
  1 vCPU + cap CPU% via cgroup to preserve the "constrained device" framing.
- **Why JSONL (not a DB yet)?** Schema-last. Collect first, decide storage
  shape after we see what's useful. JSONL is crash-safe, append-only, and
  reshapes trivially into Postgres/Timescale/Parquet.
- **Why two models — realistic vs. oracle?** Features that exist on a
  deployed device train the *realistic* model. Host-side QEMU telemetry
  (which doesn't exist in deployment) is *oracle*-only — used to assign
  honest labels at training time, never as a model input. The accuracy gap
  between the two measures how much detection power a privileged rootkit
  can take from us by lying to in-device tools. See
  [`docs/threat-model.md`](docs/threat-model.md).
- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.

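
The ULID properties listed above follow directly from the format: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. The sketch below generates one using only the standard library; it illustrates the format and is not necessarily how the orchestrator produces its ids (the function name `new_ulid` is hypothetical, and monotonicity within the same millisecond is not handled).

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U), in ascending ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string for the given (or current) time."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    if not 0 <= timestamp_ms < 1 << 48:
        raise ValueError("timestamp must fit in 48 bits")
    # 48-bit timestamp in the high bits, 80 random bits in the low bits.
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))
```

The timestamp occupies the first 10 characters, so plain string sorting orders ids by creation time, and each host can generate ids without coordinating with any other.
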
</details>

<details>
<summary><b>Deploying the receiver and lab-host roles</b></summary>

Two roles, one bootstrap command each. Detailed in
[`docs/deploy.md`](docs/deploy.md):

- `lab-host` — runs episodes, ships completed episodes to the receiver.
- `receiver` — accepts ship uploads, stores tarballs + appends to
  `index.jsonl`. Runs on the Pi5 in our setup.

```sh
# On a lab host:
./scripts/install-lab-host.sh    # (TODO — currently bring up by hand per docs/deploy.md)

# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh    # (TODO — same)
```

For now both bootstrap scripts are scaffolds; the units and configs they
install live in `etc/`. The receiver itself works today
(`uv run python -m receiver --config etc/receiver.toml.example` — modify
paths).

</details>

<details>
<summary><b>Threat model and feature-availability split</b></summary>

See [`docs/threat-model.md`](docs/threat-model.md) for the full argument.
The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host `/proc/<qemu_pid>` | outside guest | oracle (label only) |
| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
| `perf stat -p <qemu_pid>` | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the *features* go into the deployed
model; the oracles are used to label episodes with high confidence
(disagreement between in-guest and host-side data is itself a rootkit
signal).

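
As a rough illustration of the disagreement signal, the sketch below flags samples where the host-side view of the VM shows far more CPU use than the guest itself reports. The function names and the 25-point tolerance are assumptions picked for the example, not values from the project.

```python
def cpu_disagreement(host_cpu_pct, guest_cpu_pct, tolerance_pct=25.0):
    """True when the guest reports much less CPU use than the host observes."""
    return (host_cpu_pct - guest_cpu_pct) > tolerance_pct


def disagreement_ratio(pairs, tolerance_pct=25.0):
    """Fraction of (host_pct, guest_pct) samples flagged as disagreeing."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    flagged = sum(
        1 for host, guest in pairs if cpu_disagreement(host, guest, tolerance_pct)
    )
    return flagged / len(pairs)
```

A persistently high ratio over an episode would suggest that in-guest tools are being lied to, which is the case the oracle channels exist to catch at labeling time.
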
</details>

---

## Citing this work

A short course-project citation, until the dataset reaches a publishable
form:

> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.

See [`docs/sources.md`](docs/sources.md) for everything else this project
leans on.