CIS490/README.md
Maximus Gorog 7216ec09bd Tier 2: real Alpine VM, real workload, real envelope
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.

Components:

  vm/launch_demo.sh              now boots Alpine 3.21 nocloud-cloudinit
                                 (Cirros 0.6.x's cirros-init blocks on the
                                 EC2 metadata service for ~17 min before
                                 falling through to NoCloud — abandoned).
                                 Mounts a cidata ISO as a second drive.

  tools/build_cidata.py          pure-Python NoCloud ISO builder (pycdlib).
                                 Sets root password and ssh_pwauth via
                                 runcmd so we don't depend on a specific
                                 cloud-init version's plain_text_passwd
                                 handling.

  tools/vm_serial.py             serial-console client (stdlib socket).
                                 Idempotent login (detects already-in-shell
                                 state), sentinel-bracketed run() that
                                 distinguishes shell output from the TTY
                                 echo of input by requiring a leading
                                 \r\n boundary on the marker.

  tools/vm_load_controller.py    in-guest load controller. set_phase()
                                 dispatches the per-phase shell command
                                 over the serial connection.

  tools/run_real_vm_demo.py      ties it all together: boot VM, wait for
                                 cloud-init runcmd, log in, run the
                                 EpisodeRunner with on_phase=controller,
                                 shut down VM.

Deps: paramiko, pycdlib added.

docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.

README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:38:53 -06:00


# CIS490 — Behavioral Malware Detection Dataset & Model
Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
watches performance metrics on a real device, decides whether the device has
been breached, and triggers a hardware-level reset when confidence is high
enough. This repository covers the **dataset side** — we run public malware
samples against intentionally vulnerable Linux VMs and capture labeled
time-series telemetry that mirrors what the deployed model would see in the
field.
The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

---
## What an episode looks like
Each episode runs a target through a labeled phase schedule
(`clean → armed → infecting → infected_running → dormant → ...`) while
sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is
the set of timestamped phase transitions written to `labels.jsonl`. Those
timestamps share a monotonic clock with the metric rows, so every telemetry
sample can be matched to the phase that was active when it was taken.
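
Because labels and metric rows share a clock, attaching a phase to each telemetry row is a sorted lookup. A minimal sketch, assuming each metric row carries a timestamp field named `t_mono` and each label row also carries a `phase` field (both field names are hypothetical; see `docs/data-model.md` for the real schema):

```python
import bisect
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_rows(labels, rows, default="unlabeled"):
    """Attach to each telemetry row the phase active at its timestamp.

    `t_mono` and `phase` are assumed field names, not the project's schema.
    """
    labels = sorted(labels, key=lambda r: r["t_mono"])
    starts = [r["t_mono"] for r in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this sample.
        i = bisect.bisect_right(starts, row["t_mono"]) - 1
        phase = labels[i]["phase"] if i >= 0 else default
        out.append({**row, "phase": phase})
    return out


if __name__ == "__main__":
    labels = load_jsonl(
        '{"t_mono": 0.0, "phase": "clean"}\n{"t_mono": 1.0, "phase": "armed"}\n'
    )
    rows = load_jsonl('{"t_mono": 0.5, "cpu_pct": 2.0}\n{"t_mono": 1.2, "cpu_pct": 3.5}\n')
    for row in label_rows(labels, rows):
        print(row["t_mono"], row["phase"])
```

Samples taken before the first transition get the `default` label rather than being dropped, so gaps stay visible in the output.
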
### Tier 2 — *real Alpine VM, real workload driven from inside the guest*
This is the closest we get to real-malware behaviour without yet running
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
guest, **and the load is generated inside the guest** by busybox
``yes`` (CPU saturation) and ``dd`` (disk bursts), driven over the
serial console by `tools/vm_load_controller.py`. Every phase transition
in `labels.jsonl` corresponds to an actual command issued inside the
real VM.
![Real Alpine VM envelope](docs/images/real-vm-envelope.png)
The 100% CPU plateaux are `yes > /dev/null` running on the guest's
single vCPU; the IO spikes during *infecting* are `dd if=/dev/urandom`
imitating the disk-write burst of a sample being dropped; the *dormant*
dips are the controller killing the load process inside the VM. The
`infected_running → dormant → infected_running` re-entry is the textbook
envelope that justifies the whole project framing.
Reproduce with:
```sh
uv run python tools/run_real_vm_demo.py --data-root data
```
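
For orientation, the controller's job can be sketched as a phase-to-command table. This is an illustration, not the contents of `tools/vm_load_controller.py`: the `PHASE_COMMANDS` table, the `killall` cleanup, the `dd` arguments, and the `send` callable are all assumptions.

```python
# Hypothetical phase -> guest shell command table. The real mapping lives in
# tools/vm_load_controller.py and may differ in every detail.
STOP_LOAD = "killall yes dd 2>/dev/null; true"

PHASE_COMMANDS = {
    "clean": STOP_LOAD,
    "armed": STOP_LOAD,
    # Disk burst; output path and size are illustrative assumptions.
    "infecting": "dd if=/dev/urandom of=/tmp/drop.bin bs=1M count=64",
    # Sustained CPU saturation on the guest's single vCPU.
    "infected_running": "yes > /dev/null &",
    "dormant": STOP_LOAD,
}


class LoadController:
    """Dispatch one shell command per phase through a caller-supplied sender."""

    def __init__(self, send):
        self._send = send  # callable taking one shell command string
        self.history = []  # (phase, command) pairs, in dispatch order

    def set_phase(self, phase):
        cmd = PHASE_COMMANDS[phase]  # raises KeyError on an unknown phase
        self._send(cmd)
        self.history.append((phase, cmd))
        return cmd


if __name__ == "__main__":
    controller = LoadController(send=print)  # stand-in for a serial console
    for phase in ("clean", "infecting", "infected_running", "dormant"):
        controller.set_phase(phase)
```

Passing the sender in as a callable keeps the table testable without a VM; in the real pipeline that role is played by the serial-console client.
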
### Tier 1 — *real Alpine VM, idle baseline*
Same pipeline, pointed at the real `qemu-system` process while the
guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer
interrupts; the single disk-write spike near t=3 s is the guest
finishing late-boot activity.
![Real VM idle baseline](docs/images/real-vm-idle.png)
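
A minimal sketch of how a CPU percentage like the one plotted here can be derived from two reads of `/proc/<pid>/stat`. The function names are hypothetical and the project's collector may compute this differently; the field positions (`utime` is field 14, `stime` is field 15) are the standard Linux layout.

```python
import os
import time


def parse_proc_stat(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # after the last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]), int(rest[12])


def cpu_percent(prev, curr, dt_s, ticks_per_s):
    """CPU use over an interval, as a percentage of one core."""
    delta_ticks = (curr[0] + curr[1]) - (prev[0] + prev[1])
    return 100.0 * delta_ticks / (ticks_per_s * dt_s)


def read_ticks(pid):
    with open(f"/proc/{pid}/stat", encoding="ascii", errors="replace") as fh:
        return parse_proc_stat(fh.read())


if __name__ == "__main__" and os.path.exists("/proc/self/stat"):
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    t0, before = time.monotonic(), read_ticks("self")
    time.sleep(0.1)  # one 10 Hz sampling interval
    t1, after = time.monotonic(), read_ticks("self")
    print(f"{cpu_percent(before, after, t1 - t0, ticks_per_s):.1f}% CPU")
```

Clock ticks are coarse (typically 100 per second), so a single 100 ms interval resolves CPU use only to roughly 10% steps.
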
### Pipeline-validation plot — *synthetic load, real telemetry*
This is **not real malware** and the load is **not** even running
inside a VM — it's a Python program on the host (`tools/load_mimic.py`)
that mimics an XMRig-style envelope. We used it to validate the
orchestrator + collector + labeling pipeline before plugging in a real
guest. Kept here because it shows the same shape the tier-2 plot
above produces from real KVM behaviour.
![Synthetic envelope (host-side mimic)](docs/images/synthetic-envelope.png)
### What's still missing for the real-malware envelope
| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
| 2 — real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |
For an interactive view of any episode (zoom/pan/hover), run:
```sh
tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/
```
---
## Status
- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
- ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
> **Topology note:** in this project the **Pi5 is the WireGuard-side
> *collector*** that receives episode tarballs from one or more lab hosts.
> It is *not* the deployment target for the model. The deployment target is
> generic ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).
---
<details>
<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>
```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
# One-time setup.
uv sync
# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data
# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/<episode_id>
# Or open an interactive plot in your browser:
tools/show_envelope.sh data/episodes/<episode_id>
```
The data lands in `data/episodes/<ulid>/`:
```
meta.json             episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl          orchestrator actions (snapshot_load, phase_transition, episode_end)
labels.jsonl          one row per phase transition (THIS is the envelope)
telemetry-proc.jsonl  host /proc sampler at 10 Hz
done.marker           written last; the shipper only sees finished episodes
```
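
Since `done.marker` is written last, any consumer can tell finished episodes from in-progress ones with a simple existence check. A minimal sketch, assuming the `data/episodes/<ulid>/` layout above (the `finished_episodes` helper is hypothetical, not part of the repository):

```python
from pathlib import Path


def finished_episodes(data_root):
    """Return episode directories under <data_root>/episodes that have done.marker."""
    episodes_dir = Path(data_root) / "episodes"
    if not episodes_dir.is_dir():
        return []
    # Sorting by name also sorts by creation time, because ULIDs are
    # time-ordered when compared as strings.
    return sorted(
        p
        for p in episodes_dir.iterdir()
        if p.is_dir() and (p / "done.marker").exists()
    )


if __name__ == "__main__":
    for episode in finished_episodes("data"):
        print(episode.name)
```
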
</details>
<details>
<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>
The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
QMP/monitor sockets and pidfile. The orchestrator then samples the real
`qemu-system` process.
```sh
# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)
# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh
# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid $QPID --duration 20
# Plot:
tools/show_envelope.sh data/episodes/<episode_id>
```
The idle-VM envelope shape is distinct from the synthetic load: periodic
~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
late-boot disk write. That's a real KVM guest you're seeing.
</details>
<details>
<summary><b>Repository layout</b></summary>
| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| `tools/` | Demo runners, load mimic, plot scripts |
| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `training/` | Model training code (deferred — schema first) |
| `etc/` | systemd units and config templates installed by the deploy scripts |
</details>
<details>
<summary><b>Design decisions — why these choices</b></summary>
- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need
to run real malware without compromising the host. KVM gives both at
near-native speed; containers share the host kernel and many samples detect
containerization and refuse to detonate. See
[`docs/architecture.md`](docs/architecture.md).
- **Why KVM (not TCG/-icount)?** A model trained on fully deterministic
timing has no noise to generalize over, so we want the natural jitter of
hardware virtualization in the training data. KVM is also ~15× faster than
TCG, which directly multiplies dataset size per wall-clock hour. We pin
1 vCPU and cap CPU% via cgroup to preserve the "constrained device" framing.
- **Why JSONL (not a DB yet)?** Schema-last. Collect first, decide storage
shape after we see what's useful. JSONL is crash-safe, append-only,
reshapes trivially into Postgres/Timescale/Parquet.
- **Why two models — realistic vs. oracle?** Features that exist on a
deployed device train the *realistic* model. Host-side QEMU telemetry
(which doesn't exist in deployment) is *oracle*-only — used to assign
honest labels at training time, never as a model input. The accuracy gap
between the two measures how much detection power a privileged rootkit
can take from us by lying to in-device tools. See
[`docs/threat-model.md`](docs/threat-model.md).
- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.
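
To make the "time-sortable, no coordinator" point concrete, here is a minimal ULID generator following the public ULID layout (48-bit millisecond timestamp, 80 random bits, Crockford base32). It is a sketch for illustration; whether the orchestrator uses a library or its own code is not stated here, and `new_ulid` is a hypothetical name.

```python
import os
import time

# Crockford base32 alphabet: no I, L, O or U. It is in ascending ASCII
# order, so string comparison matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp does not fit in 48 bits")
    # The timestamp occupies the high bits, so it dominates sort order.
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, enough for 128
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(new_ulid())
```

Two ids generated in different milliseconds sort by creation time as plain strings; ids generated within the same millisecond are ordered by their random bits only.
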
</details>
<details>
<summary><b>Deploying the receiver and lab-host roles</b></summary>
Two roles, one bootstrap command each. Detailed in
[`docs/deploy.md`](docs/deploy.md):
- `lab-host` — runs episodes, ships completed episodes to the receiver.
- `receiver` — accepts shipper uploads, stores the tarballs, and appends to
`index.jsonl`. Runs on the Pi5 in our setup.
```sh
# On a lab host:
./scripts/install-lab-host.sh # (TODO — currently bring up by hand per docs/deploy.md)
# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh # (TODO — same)
```
For now both bootstrap scripts are scaffolds; the units and configs they
install live in `etc/`. The receiver itself works today: run
`uv run python -m receiver --config etc/receiver.toml.example` after
adjusting the paths in that config.
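
The "sha256-verified, idempotent" contract of the receiver can be sketched without any web framework. The following is an assumption-laden illustration, not the Starlette app in `receiver/`: the `ingest` function, its return values, and the digest-named file layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def ingest(store_dir, claimed_sha256, body):
    """Store an uploaded tarball; return 'created', 'exists' or 'mismatch'."""
    actual = hashlib.sha256(body).hexdigest()
    if actual != claimed_sha256.lower():
        return "mismatch"  # reject: the bytes are not what the sender claimed
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    final = store / f"{actual}.tar"
    if final.exists():
        return "exists"  # idempotent: replaying an upload changes nothing
    # Write to a temporary name, then rename, so a crash mid-write never
    # leaves a truncated file under the final name.
    tmp = store / f"{actual}.tar.part"
    tmp.write_bytes(body)
    tmp.replace(final)
    with open(store / "index.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"sha256": actual, "bytes": len(body)}) + "\n")
    return "created"
```

Naming the stored file after its digest is what makes retries safe here: a shipper that is unsure whether its last upload succeeded can simply send it again.
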
</details>
<details>
<summary><b>Threat model and feature-availability split</b></summary>
See [`docs/threat-model.md`](docs/threat-model.md) for the full argument.
The short version:
| Channel | Vantage | Role |
|---|---|---|
| Host `/proc/<qemu_pid>` | outside guest | oracle (label only) |
| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
| `perf stat -p <qemu_pid>` | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |
We collect everything in the lab. Only the *features* go into the deployed
model; the oracles are used to label episodes with high confidence
(disagreement between in-guest and host-side data is itself a rootkit
signal).
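
One way the disagreement signal could be computed is to look for sustained stretches where the host-side view shows far more CPU than the guest reports. This is purely a sketch: the `suspicious_spans` helper, the 30-point threshold, and the 5-sample minimum run are illustrative assumptions, not values the project has chosen.

```python
def suspicious_spans(host_cpu, guest_cpu, threshold_pct=30.0, min_run=5):
    """Return [start, end) index spans where host CPU% exceeds guest-reported
    CPU% by more than threshold_pct for at least min_run consecutive samples.

    Both inputs are equal-rate sequences of CPU percentages; the threshold
    and run length are hypothetical defaults.
    """
    spans = []
    start = None
    n = min(len(host_cpu), len(guest_cpu))
    for i in range(n):
        if host_cpu[i] - guest_cpu[i] > threshold_pct:
            if start is None:
                start = i  # a disagreement run begins here
        else:
            if start is not None and i - start >= min_run:
                spans.append((start, i))
            start = None
    # Close a run that lasts until the end of the data.
    if start is not None and n - start >= min_run:
        spans.append((start, n))
    return spans


if __name__ == "__main__":
    host = [5.0] * 3 + [100.0] * 6 + [5.0] * 2
    guest = [5.0] * 11  # the guest claims to be idle throughout
    print(suspicious_spans(host, guest))
```

Requiring a minimum run length keeps single-sample jitter, such as the KVM/timer spikes seen at idle, from being flagged.
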
</details>
---
## Citing this work
A short course-project citation, until the dataset reaches a publishable
form:
> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.
See [`docs/sources.md`](docs/sources.md) for everything else this project
leans on.