vm/launch_demo.sh boots a Cirros qcow2 under KVM with QMP and a monitor socket exposed; snapshot=on routes guest writes to a temporary overlay so the on-disk image is never mutated (clean factory reset every boot).

End-to-end verified: vm/launch_demo.sh → orchestrator with --target-pid <qemu pid> → 201 telemetry rows over 20s against the real qemu-system process. The plotted envelope shows the expected idle-VM shape: periodic ~10% CPU spikes from KVM/timer interrupts, flat 230 MiB RSS, and a single late-boot disk write. Distinct from the synthetic load_mimic envelope, confirming the collector reads real KVM behavior.

docs/sources.md is the works-cited doc: every tool, library, sample source, paper, and standard the project leans on, grouped by category. README's nav table now points at it. README's status section also lists what's done vs. in progress so reviewers can see scope at a glance.

Note: vm/images/ stays gitignored. The Cirros 0.6.3 image is documented with its sha256 (7d6355852aeb...) in docs/sources.md so any team member can reproduce the bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CIS490 — Behavioral Malware Detection Dataset & Model
Course project for CIS490 (Cybersecurity). The end-goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough.
This repository covers the dataset side of that pipeline: we run real, public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the same model would see in deployment on an arbitrary target Linux device.
Note on the topology: in this project the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts; it is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.
The work is grounded in the trust-over-time scoring model from IEEE 9881803 and a related proprietary follow-on that pairs detection with blockchain-anchored hardware reset.
What lives where
| Path | What it holds |
|---|---|
| `docs/architecture.md` | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| `docs/threat-model.md` | Train/serve parity rule and the oracle-vs-deployable feature split |
| `docs/data-model.md` | On-disk JSONL schema, per-episode layout, phase enum |
| `docs/transport.md` | Sender/receiver design: how episodes get to the central collector over WG |
| `docs/deploy.md` | One-command install for the lab-host and receiver roles |
| `docs/lab-setup.md` | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| `docs/sources.md` | Works cited: every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
| `exploits/` | Metasploit resource scripts for repeatable exploitation |
| `samples/` | Sample manifest (sha256-pinned). Binaries never committed. |
| `training/` | Model training code (deferred; schema first) |
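
The orchestrator's boot → arm → detonate → observe → revert loop can be pictured as a small transition table. The sketch below is illustrative only: the `Phase` enum, `NEXT_PHASE`, and `run_episode` are hypothetical names, and the five phases are taken from the loop description in the table above. The project's real phase enum is defined in docs/data-model.md and may differ.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names, copied from the loop description in the table.
    BOOT = "boot"
    ARM = "arm"
    DETONATE = "detonate"
    OBSERVE = "observe"
    REVERT = "revert"


# Each phase has exactly one successor; REVERT wraps to BOOT for the next episode.
NEXT_PHASE = {
    Phase.BOOT: Phase.ARM,
    Phase.ARM: Phase.DETONATE,
    Phase.DETONATE: Phase.OBSERVE,
    Phase.OBSERVE: Phase.REVERT,
    Phase.REVERT: Phase.BOOT,
}


def run_episode(start=Phase.BOOT):
    """Return the ordered phases of one episode, stopping once REVERT is reached."""
    phases = [start]
    while phases[-1] is not Phase.REVERT:
        phases.append(NEXT_PHASE[phases[-1]])
    return phases
```

Keeping the transitions in a plain dictionary makes an illegal jump (for example, detonating before arming) impossible to express, which is one reasonable way to keep the snapshot/revert loop repeatable.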
Quick orientation
- Why VMs? We need a clean snapshot/revert loop and we need to run real malware without burning hardware. KVM gives us both at near-native speed.
- Why is the network isolated? A host-only bridge keeps malware off the internet and off the WG overlay. The Pi5 gateway is the lab-side observer, playing the same role it would play in a deployed setting.
- Why JSONL and not a database (yet)? Schema-last: collect first, decide storage shape after we see what's actually useful. JSONL is crash-safe, append-only, and reshapes trivially into Postgres/Timescale/Parquet later.
- Why two models? One is trained on features that exist on a real target device (deployable), the other on host-side QEMU-only features (oracle). The accuracy gap measures how much detection power a privileged rootkit can take from the deployed model. See docs/threat-model.md.
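
To make the "crash-safe, append-only" point about JSONL concrete, here is a minimal sketch of an append-and-read pair. It is not the project's collector code: `append_row`, `read_rows`, and the row fields used in the usage note are assumptions for illustration, and the actual schema lives in docs/data-model.md.

```python
import json
import os


def append_row(path, row):
    """Append one telemetry row as a single JSON line and force it to disk."""
    line = json.dumps(row, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to persist the row before returning


def read_rows(path):
    """Read rows back, skipping any line that fails to parse.

    A crash in the middle of a write leaves at most one torn line,
    so every row written before it is still recoverable.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # torn or empty line: drop it, keep the rest
    return rows
```

A call such as `append_row("telemetry.jsonl", {"t": 0.1, "cpu_pct": 9.8})` uses hypothetical field names. Because each row is an independent line, the same file can later be bulk-loaded into Postgres, Timescale, or Parquet without a migration step.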
Status
- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl.
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids.
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz.
- ✅ Synthetic envelope demo (`tools/run_envelope_demo.py`): full 8-phase XMRig-shaped envelope produced end-to-end.
- ✅ Phase 2, real VM: Cirros boots under KVM, orchestrator collects telemetry against the real `qemu-system` pid (`vm/launch_demo.sh` + the existing orchestrator).
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5).
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`.
- 🚧 Shipper (the third leg of the WG pipeline; receiver and orchestrator already verified).
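
The receiver is described above as sha256-verified and idempotent. One way those two properties can fit together is sketched below. This is an assumption-laden illustration, not the receiver's actual code: `store_episode`, the `<episode_id>.tar.gz` naming, and the returned status strings are all hypothetical.

```python
import hashlib
from pathlib import Path


def store_episode(root, episode_id, body, claimed_sha256):
    """Store an uploaded tarball; return 'created', 'exists', or 'rejected'."""
    actual = hashlib.sha256(body).hexdigest()
    if actual != claimed_sha256.lower():
        return "rejected"  # digest mismatch: never write unverified bytes

    dest = Path(root) / f"{episode_id}.tar.gz"
    if dest.exists() and hashlib.sha256(dest.read_bytes()).hexdigest() == actual:
        return "exists"  # repeating the same upload changes nothing

    # Write to a temporary name first, then rename, so a crash mid-write
    # never leaves a half-written file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(body)
    tmp.replace(dest)
    return "created"
```

With this shape, a sender that retries after a dropped connection can safely repeat the same PUT: the second attempt is recognized by digest and treated as a no-op.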