CIS490 coursework

Maximus Gorog 69c09f4404	2026-04-29 00:00:25 -06:00

Phase 2: real-VM episode (Cirros under KVM) + works-cited doc

vm/launch_demo.sh boots a Cirros qcow2 under KVM with QMP and a monitor socket exposed; snapshot=on routes guest writes to a temporary overlay so the on-disk image is never mutated (clean factory reset every boot).

End-to-end verified: vm/launch_demo.sh → orchestrator with --target-pid <qemu pid> → 201 telemetry rows over 20s against the real qemu-system process. The plotted envelope shows the expected idle-VM shape: periodic ~10% CPU spikes from KVM/timer interrupts, flat 230 MiB RSS, and a single late-boot disk write. Distinct from the synthetic load_mimic envelope, confirming the collector reads real KVM behavior.

docs/sources.md is the works-cited doc: every tool, library, sample source, paper, and standard the project leans on, grouped by category. README's nav table now points at it. README's status section also lists what's done vs. in progress so reviewers can see scope at a glance.

Note: vm/images/ stays gitignored. The Cirros 0.6.3 image is documented with its sha256 (7d6355852aeb...) in docs/sources.md so any team member can reproduce the bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

collectors	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
docs	Phase 2: real-VM episode (Cirros under KVM) + works-cited doc	2026-04-29 00:00:25 -06:00
etc	Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency	2026-04-28 23:34:04 -06:00
exploits	Scaffold project: docs, repo skeleton, transport + deploy design	2026-04-28 23:21:00 -06:00
orchestrator	Synthetic envelope demo: phase-driven load mimic + plotter	2026-04-28 23:53:20 -06:00
receiver	Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency	2026-04-28 23:34:04 -06:00
samples	Scaffold project: docs, repo skeleton, transport + deploy design	2026-04-28 23:21:00 -06:00
tests	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
tools	Synthetic envelope demo: phase-driven load mimic + plotter	2026-04-28 23:53:20 -06:00
training	Scaffold project: docs, repo skeleton, transport + deploy design	2026-04-28 23:21:00 -06:00
vm	Phase 2: real-VM episode (Cirros under KVM) + works-cited doc	2026-04-29 00:00:25 -06:00
.gitignore	Scaffold project: docs, repo skeleton, transport + deploy design	2026-04-28 23:21:00 -06:00
pyproject.toml	Synthetic envelope demo: phase-driven load mimic + plotter	2026-04-28 23:53:20 -06:00
README.md	Phase 2: real-VM episode (Cirros under KVM) + works-cited doc	2026-04-29 00:00:25 -06:00
uv.lock	Synthetic envelope demo: phase-driven load mimic + plotter	2026-04-28 23:53:20 -06:00

README.md

CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough.

This repository covers the dataset side of that pipeline: we run real, public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the same model would see in deployment on an arbitrary target Linux device.

Note on the topology: in this project the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts — it is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.

The work is grounded in the trust-over-time scoring model from IEEE 9881803 and a related proprietary follow-on that pairs detection with blockchain-anchored hardware reset.

What lives where

Path	What it holds
`docs/architecture.md`	Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning
`docs/threat-model.md`	Train/serve parity rule and the oracle-vs-deployable feature split
`docs/data-model.md`	On-disk JSONL schema, per-episode layout, phase enum
`docs/transport.md`	Sender/receiver design — how episodes get to the central collector over WG
`docs/deploy.md`	One-command install for the lab-host and receiver roles
`docs/lab-setup.md`	KVM prereqs, VM build, snapshot, virtio-serial wiring
`docs/sources.md`	Works cited — every tool, dep, sample source, paper, and standard
`orchestrator/`	State machine that drives the boot → arm → detonate → observe → revert loop
`collectors/`	One module per telemetry source (host /proc, QMP, perf, pcap, guest agent)
`vm/`	qcow2 images and snapshot scripts (binaries gitignored)
`exploits/`	Metasploit resource scripts for repeatable exploitation
`samples/`	Sample manifest (sha256-pinned). Binaries never committed.
`training/`	Model training code (deferred — schema first)
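
The boot → arm → detonate → observe → revert loop that `orchestrator/` drives can be pictured as a small state machine. The following is a minimal sketch of that idea, not the project's orchestrator code: the `Phase` members simply mirror the loop named in the table, and `NEXT_PHASE`, `EpisodeLoop`, and `run_one_episode` are hypothetical names. The real phase enum is defined in docs/data-model.md and may contain more phases than shown here.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical names mirroring the loop in the table above; the real
    # phase enum lives in docs/data-model.md and may differ.
    BOOT = "boot"
    ARM = "arm"
    DETONATE = "detonate"
    OBSERVE = "observe"
    REVERT = "revert"


# Each phase has exactly one successor; REVERT wraps back to BOOT so the
# next episode starts from a clean snapshot.
NEXT_PHASE = {
    Phase.BOOT: Phase.ARM,
    Phase.ARM: Phase.DETONATE,
    Phase.DETONATE: Phase.OBSERVE,
    Phase.OBSERVE: Phase.REVERT,
    Phase.REVERT: Phase.BOOT,
}


class EpisodeLoop:
    """Tracks the current phase and how many episodes have completed."""

    def __init__(self) -> None:
        self.phase = Phase.BOOT
        self.completed_episodes = 0

    def advance(self) -> Phase:
        """Move to the next phase; count an episode when leaving REVERT."""
        if self.phase is Phase.REVERT:
            self.completed_episodes += 1
        self.phase = NEXT_PHASE[self.phase]
        return self.phase


def run_one_episode(loop: EpisodeLoop) -> list[str]:
    """Walk one full boot-to-revert cycle and return the phases visited."""
    visited = [loop.phase.value]
    while loop.phase is not Phase.REVERT:
        visited.append(loop.advance().value)
    loop.advance()  # leave REVERT, back to BOOT for the next episode
    return visited
```

Keeping the transitions in a plain lookup table means an illegal jump (for example, detonating before arming) cannot be expressed by calling `advance()`, which is one reason a table-driven design might suit a loop that handles live malware.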

Quick orientation

Why VMs? We need a clean snapshot/revert loop and we need to run real malware without burning hardware. KVM gives us both at near-native speed.
Why is the network isolated? A host-only bridge keeps malware off the internet and off the WG overlay. The Pi5 gateway is the lab-side observer, playing the same role it would play in a deployed setting.
Why JSONL and not a database (yet)? We are working schema-last: collect first, then decide the storage shape once we see what is actually useful. JSONL is append-only, so a crash can damage at most the line being written, and it reshapes trivially into Postgres/Timescale/Parquet later.
Why two models? One is trained on features that exist on a real target device (deployable); the other is trained on host-side, QEMU-only features (oracle). The accuracy gap measures how much detection power a privileged rootkit can take away from the deployed model. See docs/threat-model.md.
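
To make the JSONL point concrete, here is a minimal sketch of an append-only writer and a tolerant reader. It is an illustration under stated assumptions rather than the project's collector code: `append_row` and `read_rows` are hypothetical helpers, and the row fields used in the usage note (`ts`, `phase`, `cpu_pct`) are placeholders. The actual schema is specified in docs/data-model.md.

```python
import json
import os
from pathlib import Path


def append_row(path: Path, row: dict) -> None:
    """Append one telemetry row as a single JSON line and force it to disk."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, separators=(",", ":")) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist the bytes before returning


def read_rows(path: Path) -> list[dict]:
    """Read rows back, skipping any line that fails to parse.

    A crash in the middle of a write can leave a torn final line; skipping
    unparseable lines keeps every complete row readable.
    """
    rows = []
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return rows
```

For example, calling `append_row(path, {"ts": 0.0, "phase": "boot", "cpu_pct": 1.5})` once per sample produces a file that `read_rows(path)` can load even if the last line was cut short. Calling `os.fsync` on every row trades throughput for durability; a real collector might batch rows instead.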

Status

✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl.
✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids.
✅ Host /proc oracle collector (source 1 of 5) at 10 Hz.
✅ Synthetic envelope demo (tools/run_envelope_demo.py) — full 8-phase XMRig-shaped envelope produced end-to-end.
✅ Phase 2 — real VM: Cirros boots under KVM, orchestrator collects telemetry against the real qemu-system pid (vm/launch_demo.sh + the existing orchestrator).
🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5).
🚧 Exploit driver (Metasploit RPC) for armed → infecting transitions on session_open.
🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified).
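
The receiver is described above as sha256-verified and idempotent. The sketch below shows one way those two properties could fit together, independent of any HTTP framework. It is an assumption-laden illustration, not the code in `receiver/`: `EpisodeStore`, `ingest`, `DigestMismatch`, and the `"created"` / `"exists"` return values are hypothetical, and an in-memory dict stands in for on-disk storage.

```python
import hashlib


class DigestMismatch(Exception):
    """Raised when uploaded bytes do not match the expected sha256."""


class EpisodeStore:
    """In-memory stand-in for the receiver's episode storage (hypothetical)."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}  # episode_id -> stored sha256 hex

    def ingest(self, episode_id: str, body: bytes, claimed_sha256: str) -> str:
        """Verify and record one upload.

        Returns "created" for a new episode and "exists" for an exact repeat,
        so a sender can safely retry the same PUT after a dropped connection.
        """
        actual = hashlib.sha256(body).hexdigest()
        if actual != claimed_sha256.lower():
            raise DigestMismatch(f"expected {claimed_sha256}, got {actual}")
        stored = self._digests.get(episode_id)
        if stored is None:
            self._digests[episode_id] = actual
            return "created"
        if stored == actual:
            return "exists"
        # Same id with different bytes is a conflict, not a retry.
        raise DigestMismatch(f"episode {episode_id} already stored as {stored}")
```

In an HTTP handler, the two return values would presumably map to different success status codes and `DigestMismatch` to a client error; which codes the actual receiver uses is not stated here.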