
# CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that
watches performance metrics on a real device, decides whether the device has
been breached, and triggers a hardware-level reset when confidence is high
enough. This repository covers the **dataset side** — we run public malware
samples (and behavior-matched mimics) against intentionally vulnerable Linux
VMs and capture labeled time-series telemetry that mirrors what the deployed
model would see in the field.

Concretely, every lab host on the WireGuard mesh detects how much capacity
it has, spins up that many concurrent VMs, gives each VM a *different*
malware profile from the manifest, and ships the resulting labeled episode
tarballs to the central receiver on the Pi over mTLS. Running the same
fleet on multiple hosts gives novel, largely non-overlapping data per host
with no coordinator; see [Multi-host fleet](#multi-host-fleet) below.

The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

---

## What an episode looks like

Each episode runs a target through a labeled phase schedule
(`clean → armed → infecting → infected_running → dormant → ...`) while
sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is
the set of timestamped phase transitions written to `labels.jsonl`. Labels
and metric rows share one monotonic clock, so each metric row can be
assigned a phase by timestamp alone.
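
Because labels and metric rows share one clock, tagging a telemetry row with its phase reduces to a sorted lookup. The following is a minimal sketch, not the project's loader: the field names `t` and `phase` are assumptions made for illustration, and the real schema lives in `docs/data-model.md`.

```python
import bisect

def label_rows(labels, rows, key="t"):
    """Tag each telemetry row with the phase active at its timestamp.

    `labels` is a list of phase-transition dicts sorted by `key`;
    `rows` is a list of telemetry dicts. Field names are assumptions.
    """
    starts = [lab[key] for lab in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[key]) - 1
        phase = labels[i]["phase"] if i >= 0 else None
        out.append({**row, "phase": phase})
    return out

labels = [{"t": 0.0, "phase": "clean"}, {"t": 5.0, "phase": "armed"}]
rows = [{"t": 0.1, "cpu": 2.0}, {"t": 5.0, "cpu": 3.0}, {"t": 7.3, "cpu": 99.0}]
tagged = label_rows(labels, rows)
```

Rows that precede the first transition get `phase=None` rather than a guessed label.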

### Tier 2 — *real Alpine VM, profile-driven workload inside the guest*

This is the closest we get to real-malware behaviour without yet running
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
guest plus three more sources running concurrently (QMP, bridge pcap,
in-guest agent — see *Telemetry sources* below). The *load* itself is
generated inside the guest by a profile-matched shell command from
[`exploits/workloads.py`](exploits/workloads.py), driven over the
serial console by [`tools/vm_load_controller.py`](tools/vm_load_controller.py).

Each sample's `profile` (from [`samples/manifest.toml`](samples/manifest.toml))
dispatches to a different in-session workload, so the envelope each
VM produces is observably different per family — exactly the variance
the ML model needs to learn:

| profile          | shape                                                  |
|------------------|--------------------------------------------------------|
| `cpu-saturate`   | sustained 1-vCPU saturation (XMRig)                    |
| `scan-and-dial`  | SYN-style probes across the bridge subnet + dial-home  |
| `io-walk`        | fs traversal + 4 KiB urandom writes (ransomware)       |
| `bursty-c2`      | long idle + periodic 3-packet egress burst (Dridex)    |
| `low-and-slow`   | minimal CPU + periodic memory churn (Kovter / fileless)|
| `shell-resident` | one long-lived TCP socket + periodic command ticks (RAT)|

Every phase transition in `labels.jsonl` corresponds to an actual
command issued inside the real VM, and `meta.json` records which
sample / profile / kind drove it.

![Real Alpine VM envelope](docs/images/real-vm-envelope.png)

The 100% CPU plateaux are `yes > /dev/null` running on the guest's
single vCPU; the IO spikes during *infecting* are `dd if=/dev/urandom`
producing the sample-drop shape; the *dormant* drops are the
controller killing the load process inside the VM. The
infected_running → dormant → infected_running re-entry is the textbook
envelope that justifies the whole project framing.

Reproduce one episode. The profile comes from `--sample` or the
`SAMPLE_NAME` environment variable; with neither set, the run defaults to
the v1 yes-loop:

```sh
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample xmrig-cryptominer
```

Or run the **fleet** — one wave of `max_concurrent` parallel episodes,
each slot pulling a different sample from the manifest:

```sh
uv run python tools/run_fleet.py --capacity            # see what the host can do
uv run python tools/run_fleet.py --waves 1 --data-root data
```

### Tier 1 — *real Alpine VM, idle baseline*

Same pipeline, pointed at the real `qemu-system` process while the
guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer
interrupts; the single disk-write spike near t=3 s is the guest
finishing late-boot activity.

![Real VM idle baseline](docs/images/real-vm-idle.png)

### Pipeline-validation plot — *synthetic load, real telemetry*

This is **not real malware** and the load is **not** even running
inside a VM — it's a Python program on the host (`tools/load_mimic.py`)
that mimics an XMRig-style envelope. We used it to validate the
orchestrator + collector + labeling pipeline before plugging in a real
guest. Kept here because it shows the same shape the tier-2 plot
above produces from real KVM behaviour.

![Synthetic envelope (host-side mimic)](docs/images/synthetic-envelope.png)

### Tier 3 — *real exploit fire, profile-matched workload (Driver v2)*

The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
msgpack-over-HTTPS msfrpc client + `MSFExploitDriver`. With a
[`Sample`](samples/manifest.py) supplied, the driver dispatches the
post-exploit `infected_running` workload through
[`exploits/workloads.py`](exploits/workloads.py) — same six profiles
as Tier 2, so a fleet wave produces matched envelopes whether or not
an exploit fires. Without a sample, the v1 yes-loop path is preserved
for smoke runs.

First canned module: `exploits/modules/vsftpd_234_backdoor.toml`
(Metasploitable2's CVE-2011-2523). [`scripts/install-msfrpcd.sh`](scripts/install-msfrpcd.sh)
sets up `msfrpcd` (loopback only) as a hardened systemd unit;
[`scripts/fetch-metasploitable2.sh`](scripts/fetch-metasploitable2.sh)
pulls + sha256-verifies a target image from an operator-supplied URL.

### Tier 4 — *real malware sample, fetched + uploaded + executed*

A manifest entry with a `sha256` flips its `Sample.kind` to `"real"`.
The driver then bypasses the mimic profile and runs the real-binary
path:

1. [`tools/fetch_sample.py <sha256>`](tools/fetch_sample.py) pulls the
   binary from MalwareBazaar (Auth-Key from
   `samples/.bazaar.token` or `MALWAREBAZAAR_API_KEY`), unzips with the
   standard `infected` password, sha-verifies it, and stores it at
   `samples/store/<sha256>` (gitignored).
2. At `infected_running`, the driver chunked-uploads the binary into
   the shell session as 8 KiB base64 segments
   (`exploits.workloads.chunked_real_binary_upload`). 256 KiB binaries
   upload without overflowing msfrpc's buffers.
3. The session decodes, sha-verifies *again on the guest side*, chmods,
   and execs only if the hash matches. Mismatch fail-stops the run.
4. `meta.sample.sha256` + per-step events
   (`real_binary_upload_begin`, `real_binary_verify`,
   `sample_executed{kind=real}`) record exactly which binary was run
   and when, so trainers can join cleanly.
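
The chunk, reassemble, and verify flow in steps 2 and 3 can be sketched in a few lines. This is an illustration only, not the code in `exploits.workloads`: the helper names are hypothetical, both sides are written in Python (the real guest side runs inside a shell session), and it assumes each segment carries 8 KiB of raw bytes before base64 encoding.

```python
import base64
import hashlib

CHUNK = 8 * 1024  # assumption: 8 KiB of raw bytes per segment

def to_segments(blob: bytes, chunk: int = CHUNK) -> list[str]:
    """Split a binary into base64 segments of at most `chunk` raw bytes."""
    return [
        base64.b64encode(blob[i:i + chunk]).decode("ascii")
        for i in range(0, len(blob), chunk)
    ]

def reassemble_and_verify(segments: list[str], expected_sha256: str) -> bytes:
    """Decode segments in order; raise if the digest does not match."""
    blob = b"".join(base64.b64decode(s) for s in segments)
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256:
        # Fail-stop: never hand back bytes that do not match the manifest hash.
        raise ValueError(f"sha256 mismatch: {actual} != {expected_sha256}")
    return blob

payload = bytes(range(256)) * 1024  # 256 KiB stand-in, not a real sample
segments = to_segments(payload)
restored = reassemble_and_verify(segments, hashlib.sha256(payload).hexdigest())
```

Verifying after reassembly, rather than per segment, catches dropped or reordered chunks as well as corrupted ones.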

### Tier maturity

| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collectors read real KVM behaviour | ✅ done |
| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | ✅ done |
| 3 — real VM, real exploit fire + profile workload | honest `armed → infecting` transitions, driver v2 dispatch | ✅ code; ⏳ awaiting Metasploitable2 image + msfrpcd on a lab host |
| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | ✅ code; ⏳ awaiting MalwareBazaar API key + sha256s in manifest |

### Telemetry sources (all five wire into one episode dir)

| # | Source                         | Vantage       | Role                |
|---|--------------------------------|---------------|---------------------|
| 1 | host `/proc/<qemu_pid>`        | outside       | oracle (label only) |
| 2 | QEMU QMP queries               | outside       | oracle (label only) |
| 3 | `perf stat -p <qemu_pid>`      | outside       | oracle (label only) |
| 4 | Bridge pcap → 100 ms netflow   | gateway-side  | feature (deployable)|
| 5 | In-guest agent (virtio-serial) | inside        | feature (deployable)|

All five are live. The deploy/oracle split follows
[`docs/threat-model.md`](docs/threat-model.md): only sources 4 + 5
are usable as model *features* in the field — sources 1, 2, 3 exist
as labeling oracles only.
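
For source 4, the 100 ms netflow aggregation amounts to grouping packets into fixed-width time buckets. A rough sketch follows, assuming packets arrive as `(timestamp_us, size_bytes)` pairs; the function name and output keys are hypothetical and do not reflect the actual `netflow.jsonl` schema.

```python
from collections import defaultdict

BUCKET_US = 100_000  # 100 ms expressed in microseconds

def bucketize(packets, bucket_us=BUCKET_US):
    """Aggregate (timestamp_us, size_bytes) pairs into fixed-width buckets.

    Integer microseconds avoid float rounding at bucket edges.
    Returns rows sorted by bucket start time; empty buckets are omitted.
    """
    acc = defaultdict(lambda: [0, 0])  # bucket index -> [packets, bytes]
    for ts_us, size in packets:
        slot = acc[ts_us // bucket_us]
        slot[0] += 1
        slot[1] += size
    return [
        {"t_us": idx * bucket_us, "packets": n, "bytes": b}
        for idx, (n, b) in sorted(acc.items())
    ]

packets = [(0, 60), (99_999, 1500), (100_000, 60), (350_000, 40)]
flows = bucketize(packets)
```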

For an interactive view of any episode (zoom/pan/hover), run:

```sh
tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/
```

---

## Status (106/106 tests passing as of `a88ac83`)

**Pipeline (lab-host → Pi → tarball stored)**
- ✅ Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
- ✅ `POST /v1/ping` smoke endpoint (writes nothing, exercises the full auth path)
- ✅ Shipper (`shipper/`) — tar+zstd, retry/backoff, `--ping` mode
- ✅ Caddy `collector.wg` block (in `spectral/caddy`)
- ✅ Lab-host install script + systemd units (`scripts/install-lab-host.sh`, `etc/cis490-{shipper,orchestrator}.service`)
- ✅ Receiver install script (`scripts/install-receiver.sh`)
- ✅ wg-pki client-CA bootstrap + per-host leaf issuance (in `spectral/wg-pki`)

**Telemetry**
- ✅ Source 1 — host `/proc/<qemu_pid>` @ 10 Hz
- ✅ Source 2 — QEMU QMP @ 1 Hz
- ✅ Source 3 — `perf stat -p <qemu_pid>` (opt-in via `enable_perf`; needs `CAP_SYS_ADMIN` / `CAP_PERFMON`)
- ✅ Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep), wired into `EpisodeRunner` via `bridge_iface`
- ✅ Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine

**Orchestrator + drivers**
- ✅ Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
- ✅ Snapshot/revert via QMP `loadvm` (`revert_at_start` / `revert_at_end`) for clean baselines between episodes
- ✅ Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
- ✅ Tier 3 driver v2 — `MSFExploitDriver` + msfrpc client + per-sample workload dispatch; first canned module `vsftpd_234_backdoor.toml`
- ✅ Tier 4 — `tools/fetch_sample.py` (MalwareBazaar by sha256) + chunked real-binary upload (`exploits.workloads.chunked_real_binary_upload`) + guest-side sha-verify-then-exec dispatch in `MSFExploitDriver`
- ⏳ Tier 3 integration — needs operator to drop a Metasploitable2 image + run `scripts/install-msfrpcd.sh` on a lab host
- ⏳ Tier 4 integration — needs operator's MalwareBazaar API key + at least one `sha256` entry in `samples/manifest.toml`

**Fleet (multi-VM, multi-host data generation)**
- ✅ Resource-aware capacity detector (cores / RAM / load) — `orchestrator/fleet.py`
- ✅ Concurrent slot runner — `tools/run_fleet.py`
- ✅ Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order

> **Topology note:** the **Pi5 is the WireGuard-side *collector*** that
> receives episode tarballs from one or more lab hosts. It is *not* the
> deployment target for the model. The deployment target is generic
> ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).

---

<details>
<summary><b>Quick start — fleet mode (the primary workflow)</b></summary>

```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync

# 1. Build the cidata ISO with the in-guest agent baked in.
uv run python tools/build_cidata.py vm/images/cidata.iso

# 2. See what this host is sized for.
uv run python tools/run_fleet.py --capacity
# cores: 4 (reserve 1)
# ram:   7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
# load:  1m=0.51
# caps:  by_cores=3, by_ram=13, by_load=3
# --> max_concurrent VMs: 3

# 3. Run one wave (= max_concurrent parallel episodes, each with a
#    different sample profile).
uv run python tools/run_fleet.py --waves 1 --data-root data

# 4. Plot any episode (matplotlib WebAgg).
tools/show_envelope.sh data/episodes/<episode_id>
```
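
The `--capacity` output in step 2 is consistent with taking the minimum of three per-resource caps. The sketch below reproduces those numbers under stated assumptions; it is not the code in `orchestrator/fleet.py`. The `by_cores` and `by_ram` arithmetic follows directly from the printed values, while the `by_load` rule (cores minus 1-minute load, rounded down) is a guess that happens to match this one example.

```python
def fleet_capacity(cores, avail_mib, load_1m, *,
                   reserve_cores=1, headroom_mib=1024, per_vm_mib=320):
    """Return per-resource VM caps and their minimum."""
    caps = {
        "by_cores": max(cores - reserve_cores, 0),
        "by_ram": max((avail_mib - headroom_mib) // per_vm_mib, 0),
        # Assumption: idle cores = cores minus 1-minute load, rounded down.
        "by_load": max(int(cores - load_1m), 0),
    }
    return caps, min(caps.values())

caps, max_concurrent = fleet_capacity(cores=4, avail_mib=5223, load_1m=0.51)
```

With the sample host above this yields caps of 3, 13, and 3, so the ceiling is 3 concurrent VMs.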

Each episode dir contains:

```
meta.json              episode metadata (image, sample, profile, fleet capacity)
events.jsonl           orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
labels.jsonl           one row per phase transition — THIS is the envelope
telemetry-proc.jsonl   source 1: host /proc sampler @ 10 Hz
telemetry-qmp.jsonl    source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
telemetry-guest.jsonl  source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
network.pcap           source 4: tcpdump on br-malware
netflow.jsonl          source 4: 100 ms-bucketed pcap aggregation
done.marker            written last; the shipper only sees finished episodes
```
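
Since `done.marker` is written last, a consumer can treat its presence as the signal that an episode directory is complete. A minimal sketch of that scan, with a hypothetical helper name and a throwaway directory tree for the demo (the real shipper in `shipper/` may work differently):

```python
from pathlib import Path
import tempfile

def finished_episodes(data_root):
    """Return episode dirs under <data_root>/episodes that contain done.marker."""
    root = Path(data_root) / "episodes"
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if (p / "done.marker").is_file())

# Demo against a temporary tree: ep-a is finished, ep-b is still running.
with tempfile.TemporaryDirectory() as tmp:
    eps = Path(tmp) / "episodes"
    (eps / "ep-a").mkdir(parents=True)
    (eps / "ep-b").mkdir()
    (eps / "ep-a" / "done.marker").touch()
    ready = [p.name for p in finished_episodes(tmp)]
    missing_root = finished_episodes(Path(tmp) / "nope")
```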

</details>

<details>
<summary><b>Quick start — single episode, no fleet</b></summary>

```sh
# Tier 2 (no exploit, profile-driven workload):
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample mirai-class-bot

# Tier 3 (real exploit fire via msfrpcd):
MSFRPC_PASSWORD="$(. /etc/cis490/msfrpc.env; echo "$MSFRPC_PASSWORD")" \
    uv run python tools/run_tier3_demo.py \
    --module vsftpd_234_backdoor \
    --sample ransomware-mimic \
    --data-root data
```

</details>

<details>
<summary><b>Multi-host fleet — how cross-host diversity works</b></summary>

Each lab host's `host_id` (set in `/etc/cis490/lab-host.toml`) seeds a
deterministic walk through the sample catalog:

```python
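# samples/manifest.py (simplified)
import hashlib

def select(self, *, host_id, slot, episode_index):
    # Hash the (host, slot, episode) triple so each host walks the catalog differently.
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    # First 8 digest bytes as an integer; the byte order here is illustrative.
    idx = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % len(self.samples)
    return self.samples[idx]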
```

So:
- `host=alice slot=0 ep=0` and `host=bob slot=0 ep=0` usually pick
  *different* samples (a test asserts < 25% collision over 20 trials).
- A single host tends to cover the entire catalog within roughly
  `len(manifest)` waves. Selection is hash-based rather than round-robin,
  so coverage is probabilistic, not guaranteed (a test confirms full
  coverage in 200 episodes).
- No coordinator is needed. Every host independently produces largely
  non-overlapping data, and `meta.fleet.host_id` + `meta.sample.name` make
  the join trivial at training time.
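
The collision and coverage properties above can be checked empirically with a standalone version of the selection rule. This sketch assumes a six-entry catalog and big-endian decoding of the digest prefix; both are illustrative choices, and `pick` is a hypothetical stand-in for the manifest method.

```python
import hashlib

def pick(n_samples, host_id, slot, episode_index):
    """Hash (host, slot, episode) to an index in [0, n_samples)."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % n_samples

N = 6  # assumption: one catalog entry per behavioural profile
walk_alice = [pick(N, "alice", 0, ep) for ep in range(200)]
walk_bob = [pick(N, "bob", 0, ep) for ep in range(200)]

# Fraction of episode indices where both hosts land on the same sample.
collision_rate = sum(a == b for a, b in zip(walk_alice, walk_bob)) / 200
# Set of catalog indices one host has visited after 200 episodes.
coverage = set(walk_alice)
```

For a uniform hash the expected collision rate is about `1 / N`, which is why larger catalogs give more cross-host diversity.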

The fleet runner shells out to the same `tools/run_real_vm_demo.py` per
slot, with `SLOT` / `RUN_DIR` / `SAMPLE_NAME` env passed through to the
launcher. Each VM gets its own QMP socket, agent socket, hostfwd port
range, and episode dir, so concurrency is collision-free up to the
capacity ceiling.

</details>

<details>
<summary><b>Repository layout</b></summary>

| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | Episode runner + `fleet.py` (capacity detection, concurrent slot driver) |
| `collectors/` | One module per telemetry source: `proc_qemu`, `qmp`, `pcap`, `guest_agent` |
| `receiver/` | Starlette app: PUT `/v1/episodes` + POST `/v1/ping`, sha256-verified, idempotent |
| `shipper/` | Lab-host-side: scan `data/episodes/`, tar+zstd, PUT over mTLS, retry/backoff |
| `vm/` | Launch scripts (`launch_demo.sh`, `launch_target.sh`), `setup_bridge.sh`, in-guest agent at `vm/guest-agent/cis490_agent.py`. qcow2 images and pcap captures gitignored. |
| `tools/` | `run_fleet.py`, `run_real_vm_demo.py`, `run_tier3_demo.py`, `build_cidata.py`, `plot_envelope.py`, `show_envelope.sh` |
| [`exploits/`](exploits/README.md) | MSF RPC client (`msfrpc.py`), `driver.py` (v2 with sample dispatch), `workloads.py` (six profile-matched in-session loops), per-module TOML configs |
| [`samples/`](samples/manifest.toml) | Sample manifest + loader. Binaries land at `samples/store/<sha256>` (gitignored). |
| `scripts/` | `install-{lab-host,receiver,msfrpcd}.sh`, `fetch-metasploitable2.sh` |
| `training/` | Model training code (deferred — schema first) |
| `etc/` | systemd units and config templates (`cis490-{receiver,shipper,orchestrator}.service`, `lab-host.toml.example`, `receiver.toml.example`) |
| [`AGENTS.md`](AGENTS.md) | Conventions for AI agents working on this and sibling spectral repos |

</details>

<details>
<summary><b>Design decisions — why these choices</b></summary>

- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need
  to run real malware without compromising the host. KVM gives both at
  near-native speed; containers share the host kernel and many samples detect
  containerization and refuse to detonate. See
  [`docs/architecture.md`](docs/architecture.md).
- **Why KVM (not TCG/-icount)?** ML training data benefits from realistic
  timing noise that the model must generalize across; deterministic
  `-icount` execution would remove it. KVM is also ~15× faster than TCG,
  which directly multiplies dataset size per wall-clock hour. We pin
  1 vCPU + cap CPU% via cgroup to preserve the "constrained device" framing.
- **Why JSONL (not a DB yet)?** Schema-last. Collect first, decide storage
  shape after we see what's useful. JSONL is crash-safe, append-only, and
  reshapes trivially into Postgres/Timescale/Parquet.
- **Why two models — realistic vs. oracle?** Features that exist on a
  deployed device train the *realistic* model. Host-side QEMU telemetry
  (which doesn't exist in deployment) is *oracle*-only — used to assign
  honest labels at training time, never as a model input. The accuracy gap
  between the two measures how much detection power a privileged rootkit
  can take from us by lying to in-device tools. See
  [`docs/threat-model.md`](docs/threat-model.md).
- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.
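
To make the ULID properties concrete, here is a minimal sketch of the standard ULID layout: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is not the generator this project uses, and the `ulid` function name and its parameters are chosen for illustration.

```python
import os
import time

# Crockford base32 alphabet: no I, L, O, or U; ASCII order matches value order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid(ts_ms=None, randomness=None):
    """Build a 26-character ULID from a ms timestamp and 10 random bytes."""
    ts_ms = int(time.time() * 1000) if ts_ms is None else ts_ms
    randomness = os.urandom(10) if randomness is None else randomness
    # Timestamp occupies the high 48 bits, randomness the low 80 bits.
    value = (ts_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # take 5 bits at a time
        value >>= 5
    return "".join(reversed(chars))

earlier = ulid(1_000, b"\xff" * 10)
later = ulid(1_001, b"\x00" * 10)
```

Because the timestamp sits in the most significant bits, plain string sorting orders ids by creation time, and each host can mint ids locally without coordination.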

</details>

<details>
<summary><b>Deploying the receiver and lab-host roles</b></summary>

Two roles, one bootstrap command each. Detailed in
[`docs/deploy.md`](docs/deploy.md):

- `lab-host`: runs episodes, ships completed episodes to the receiver.
- `receiver`: accepts shipper uploads, stores tarballs, and appends to
  `index.jsonl`. Runs on the Pi5 in our setup.

```sh
# On the Pi5 (or any always-on WG node):
sudo ./scripts/install-receiver.sh
# Add the collector.wg block to spectral/caddy (already merged), then:
sudo systemctl enable --now cis490-receiver

# One-time, on the Pi: bootstrap the CIS490 client CA.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh

# On each lab host: enroll via wg-enroll first, then:
sudo ./scripts/install-lab-host.sh
# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
```

The orchestrator service runs `tools/run_fleet.py --waves 1` per
invocation with `Restart=always`, giving a continuous stream of
fresh-sample episodes per host. The shipper picks them up as
`done.marker` files appear and PUTs them to `https://collector.wg`.

For mTLS leaf-cert minting: `spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>`.

</details>

<details>
<summary><b>Threat model and feature-availability split</b></summary>

See [`docs/threat-model.md`](docs/threat-model.md) for the full argument.
The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host `/proc/<qemu_pid>` | outside guest | oracle (label only) |
| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
| `perf stat -p <qemu_pid>` | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the *features* go into the deployed
model; the oracles are used to label episodes with high confidence
(disagreement between in-guest and host-side data is itself a rootkit
signal).

</details>

---

## Citing this work

A short course-project citation, until the dataset reaches a publishable
form:

> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.

See [`docs/sources.md`](docs/sources.md) for everything else this project
leans on.