# CIS490: Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough.

This repository covers the **dataset side**: we run public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the deployed model would see in the field.

The work is grounded in the trust-over-time scoring model from [IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

---

## What an episode looks like

Each episode runs a target through a labeled phase schedule (`clean → armed → infecting → infected_running → dormant → ...`) while sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is the set of timestamped phase transitions written to `labels.jsonl`. These transitions share a monotonic clock with the metric rows, so anything aligned in time can be aligned in code.

### Tier 2: *real Alpine VM, real workload driven from inside the guest*

This is the closest we get to real-malware behaviour without yet running real malware. Telemetry is real `/proc/` from outside the guest, **and the load is generated inside the guest** by busybox `yes` (CPU saturation) and `dd` (disk bursts), driven over the serial console by `tools/vm_load_controller.py`. Every phase transition in `labels.jsonl` corresponds to an actual command issued inside the real VM.

![Real Alpine VM envelope](docs/images/real-vm-envelope.png)

The 100% CPU plateaux are `yes > /dev/null` running on the guest's single vCPU; the IO spikes during *infecting* are `dd if=/dev/urandom` producing the sample-drop shape; the *dormant* drops are the controller killing the load process inside the VM. The `infected_running → dormant → infected_running` re-entry is the textbook envelope that justifies the whole project framing.

Reproduce with:

```sh
uv run python tools/run_real_vm_demo.py --data-root data
```

### Tier 1: *real Alpine VM, idle baseline*

Same pipeline, pointed at the real `qemu-system` process while the guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer interrupts; the single disk-write spike near t=3 s is the guest finishing late-boot activity.

![Real VM idle baseline](docs/images/real-vm-idle.png)

### Pipeline-validation plot: *synthetic load, real telemetry*

This is **not real malware**, and the load is **not** even running inside a VM: it is a Python program on the host (`tools/load_mimic.py`) that mimics an XMRig-style envelope. We used it to validate the orchestrator + collector + labeling pipeline before plugging in a real guest. It is kept here because it shows the same shape the tier-2 plot above produces from real KVM behaviour.

![Synthetic envelope (host-side mimic)](docs/images/synthetic-envelope.png)

### What's still missing for the real-malware envelope

| Tier | What it gives | Status |
|---|---|---|
| 1: real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
| 2: real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
| 3: real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
| 4: real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |

For an interactive view of any episode (zoom/pan/hover), run:

```sh
tools/show_envelope.sh data/episodes/
# then open http://127.0.0.1:8988/
```

---

## Status

- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent), tested with httpx + curl
- ✅ Orchestrator v0: single- and scheduled-phase modes, ULID episode ids
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
- ✅ Synthetic envelope demo: full 8-phase envelope produced end-to-end
- ✅ Real VM (Alpine 3.21 cloud-init under KVM): orchestrator collects against the real `qemu-system` pid
- ✅ **Tier 2, real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
- 🚧 Shipper (the third leg of the WG pipeline; receiver and orchestrator already verified)

> **Topology note:** in this project the **Pi5 is the WireGuard-side
> *collector*** that receives episode tarballs from one or more lab hosts.
> It is *not* the deployment target for the model. The deployment target is
> generic ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).

---

## Quick start: run the synthetic envelope demo (~90 s)

```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490

# One-time setup.
uv sync

# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data

# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/

# Or open an interactive plot in your browser:
tools/show_envelope.sh data/episodes/
```

The data lands in `data/episodes//`:

```
meta.json             episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl          orchestrator actions (snapshot_load, phase_transition, episode_end)
labels.jsonl          one row per phase transition; THIS is the envelope
telemetry-proc.jsonl  host /proc sampler at 10 Hz
done.marker           written last; the shipper only sees finished episodes
```
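
Because `labels.jsonl` and `telemetry-proc.jsonl` share a clock, each telemetry row can be tagged with the phase that was active when it was sampled. The following is a minimal sketch of that join, not the project's own loader. The field names `t_mono` and `phase` are assumptions for illustration; the actual schema is defined in `docs/data-model.md`.

```python
import bisect
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def label_rows(labels, rows, ts_key="t_mono", phase_key="phase"):
    """Attach to each telemetry row the phase active at its timestamp.

    ts_key and phase_key are hypothetical field names.
    Rows sampled before the first transition get phase None.
    """
    labels = sorted(labels, key=lambda r: r[ts_key])
    starts = [r[ts_key] for r in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        out.append({**row, phase_key: phase})
    return out
```

Usage would look like `label_rows(load_jsonl(ep / "labels.jsonl"), load_jsonl(ep / "telemetry-proc.jsonl"))`, where `ep` is the episode directory. A binary search per row keeps the join cheap even for long episodes.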

## Quick start: boot a real Linux VM (Cirros)

The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its QMP/monitor sockets and pidfile. The orchestrator then samples the real `qemu-system` process.

```sh
# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)

# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh

# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid "$QPID" --duration 20

# Plot:
tools/show_envelope.sh data/episodes/
```

The idle-VM envelope shape is distinct from the synthetic load: periodic ~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single late-boot disk write. That's a real KVM guest you're seeing.

## Repository layout

| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design: how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited: every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| `tools/` | Demo runners, load mimic, plot scripts |
| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `training/` | Model training code (deferred, schema first) |
| `etc/` | systemd units and config templates installed by the deploy scripts |

## Design decisions: why these choices

- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need to run real malware without compromising the host. KVM gives both at near-native speed; containers share the host kernel, and many samples detect containerization and refuse to detonate. See [`docs/architecture.md`](docs/architecture.md).
- **Why KVM (not TCG/-icount)?** ML training data needs realistic noise for the model to generalize over. KVM is also ~15× faster than TCG, which directly multiplies dataset size per wall-clock hour. We pin 1 vCPU and cap CPU% via cgroup to preserve the "constrained device" framing.
- **Why JSONL (not a DB yet)?** Schema-last. Collect first, then decide the storage shape after we see what's useful. JSONL is crash-safe, append-only, and reshapes trivially into Postgres/Timescale/Parquet.
- **Why two models, realistic vs. oracle?** Features that exist on a deployed device train the *realistic* model. Host-side QEMU telemetry (which doesn't exist in deployment) is *oracle*-only: it is used to assign honest labels at training time, never as a model input. The accuracy gap between the two measures how much detection power a privileged rootkit can take from us by lying to in-device tools. See [`docs/threat-model.md`](docs/threat-model.md).
- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.
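
To illustrate why ULIDs are time-sortable and URL-safe, here is a minimal sketch of the ULID layout (48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). It is an illustration only; the function name `new_ulid` is hypothetical and the orchestrator may well use a library instead.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); it is in ASCII order,
# so string comparison matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(ms=None, rand=None):
    """Return a 26-character ULID: 48-bit ms timestamp + 80 random bits.

    ms and rand can be supplied for reproducible examples; by default the
    current wall-clock time and os.urandom are used.
    """
    if ms is None:
        ms = time.time_ns() // 1_000_000
    if rand is None:
        rand = os.urandom(10)
    value = (ms << 80) | int.from_bytes(rand, "big")
    # 26 base32 characters cover 130 bits; the top 2 bits are always zero.
    return "".join(CROCKFORD[(value >> shift) & 0x1F] for shift in range(125, -1, -5))
```

Because the timestamp occupies the leading characters, sorting episode directories by name also sorts them by creation time, and each lab host can mint ids without talking to a coordinator.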

## Deploying the receiver and lab-host roles

Two roles, one bootstrap command each. Detailed in [`docs/deploy.md`](docs/deploy.md):

- `lab-host`: runs episodes and ships completed episodes to the receiver.
- `receiver`: accepts shipper uploads, stores tarballs, and appends to `index.jsonl`. Runs on the Pi5 in our setup.

```sh
# On a lab host:
./scripts/install-lab-host.sh   # (TODO: currently bring up by hand per docs/deploy.md)

# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh   # (TODO: same)
```

For now both bootstrap scripts are scaffolds; the units and configs they install live in `etc/`. The receiver itself works today: run `uv run python -m receiver --config etc/receiver.toml.example` after adjusting the paths in the config.
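
The receiver's "sha256-verified, idempotent" behaviour can be sketched as below. This is a simplified, file-system-only illustration under stated assumptions, not the Starlette app in `receiver/`: the function name `ingest`, the `.tar` naming, and the `index.jsonl` row fields are all hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path


def ingest(root, episode_id, body, claimed_sha256):
    """Store one uploaded tarball.

    Returns 'stored' on first upload and 'duplicate' on a repeat;
    raises ValueError when the body does not match the claimed digest.
    """
    digest = hashlib.sha256(body).hexdigest()
    if digest != claimed_sha256.lower():
        raise ValueError("sha256 mismatch")
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    final = root / f"{episode_id}.tar"
    if final.exists():
        # Same id already ingested: a repeated PUT changes nothing.
        return "duplicate"
    tmp = root / f"{episode_id}.tar.part"
    tmp.write_bytes(body)
    os.replace(tmp, final)  # atomic rename on the same filesystem
    with (root / "index.jsonl").open("a") as fh:
        fh.write(json.dumps({"episode_id": episode_id, "sha256": digest}) + "\n")
    return "stored"
```

Writing to a `.part` file and renaming means a crash mid-upload never leaves a half-written tarball under the final name, so a shipper can safely retry the same PUT.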

## Threat model and feature-availability split

See [`docs/threat-model.md`](docs/threat-model.md) for the full argument. The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host `/proc/` | outside guest | oracle (label only) |
| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
| `perf stat -p ` | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the *features* go into the deployed model; the oracles are used to label episodes with high confidence (disagreement between in-guest and host-side data is itself a rootkit signal).
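
The idea that disagreement between vantage points is itself a signal can be sketched as follows. This is a toy illustration, not the project's labeling logic: the function names, the use of CPU% as the compared metric, and the 25-point threshold are all assumptions.

```python
def disagreement_score(host_cpu, guest_cpu):
    """Mean absolute gap between paired host-side and in-guest CPU% samples."""
    if len(host_cpu) != len(guest_cpu) or not host_cpu:
        raise ValueError("need two non-empty series of equal length")
    return sum(abs(h - g) for h, g in zip(host_cpu, guest_cpu)) / len(host_cpu)


def looks_tampered(host_cpu, guest_cpu, threshold=25.0):
    """Flag a window when the two vantage points disagree strongly.

    threshold is a hypothetical value in CPU percentage points.
    """
    return disagreement_score(host_cpu, guest_cpu) > threshold
```

For example, a host-side series pinned near 100% alongside an in-guest series reporting about 5% would be flagged, which is the pattern one would expect if in-guest tools were being lied to.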

---

## Citing this work

A short course-project citation, until the dataset reaches a publishable form:

> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.

See [`docs/sources.md`](docs/sources.md) for everything else this project leans on.