CIS490 coursework
Maximus Gorog 69c09f4404 Phase 2: real-VM episode (Cirros under KVM) + works-cited doc
vm/launch_demo.sh boots a Cirros qcow2 under KVM with QMP and a monitor
socket exposed; snapshot=on routes guest writes to a temporary overlay
so the on-disk image is never mutated (clean factory reset every boot).

End-to-end verified: vm/launch_demo.sh → orchestrator with --target-pid
<qemu pid> → 201 telemetry rows over 20s against the real qemu-system
process. The plotted envelope shows the expected idle-VM shape:
periodic ~10% CPU spikes from KVM/timer interrupts, flat 230 MiB RSS,
and a single late-boot disk write. Distinct from the synthetic
load_mimic envelope, confirming the collector reads real KVM behavior.

docs/sources.md is the works-cited doc — every tool, library, sample
source, paper, and standard the project leans on, grouped by category.
README's nav table now points at it. README's status section also lists
what's done vs. in progress so reviewers can see scope at a glance.

Note: vm/images/ stays gitignored. The Cirros 0.6.3 image is documented
with its sha256 (7d6355852aeb...) in docs/sources.md so any team member
can reproduce the bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:00:25 -06:00
collectors Add v0 orchestrator + first oracle collector (host /proc) 2026-04-28 23:40:25 -06:00
docs Phase 2: real-VM episode (Cirros under KVM) + works-cited doc 2026-04-29 00:00:25 -06:00
etc Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency 2026-04-28 23:34:04 -06:00
exploits Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
orchestrator Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
receiver Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency 2026-04-28 23:34:04 -06:00
samples Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
tests Add v0 orchestrator + first oracle collector (host /proc) 2026-04-28 23:40:25 -06:00
tools Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
training Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
vm Phase 2: real-VM episode (Cirros under KVM) + works-cited doc 2026-04-29 00:00:25 -06:00
.gitignore Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
pyproject.toml Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
README.md Phase 2: real-VM episode (Cirros under KVM) + works-cited doc 2026-04-29 00:00:25 -06:00
uv.lock Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00

CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough.

This repository covers the dataset side of that pipeline: we run real, public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the same model would see in deployment on an arbitrary target Linux device.

Note on the topology: in this project the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts — it is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.

The work is grounded in the trust-over-time scoring model from IEEE 9881803 and a related proprietary follow-on that pairs detection with blockchain-anchored hardware reset.

What lives where

Path                   What it holds
docs/architecture.md   Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning
docs/threat-model.md   Train/serve parity rule and the oracle-vs-deployable feature split
docs/data-model.md     On-disk JSONL schema, per-episode layout, phase enum
docs/transport.md      Sender/receiver design: how episodes get to the central collector over WG
docs/deploy.md         One-command install for the lab-host and receiver roles
docs/lab-setup.md      KVM prereqs, VM build, snapshot, virtio-serial wiring
docs/sources.md        Works cited: every tool, dep, sample source, paper, and standard
orchestrator/          State machine that drives the boot → arm → detonate → observe → revert loop
collectors/            One module per telemetry source (host /proc, QMP, perf, pcap, guest agent)
vm/                    qcow2 images and snapshot scripts (binaries gitignored)
exploits/              Metasploit resource scripts for repeatable exploitation
samples/               Sample manifest (sha256-pinned). Binaries never committed.
training/              Model training code (deferred; schema first)
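
The orchestrator loop named in the table (boot → arm → detonate → observe → revert) can be pictured as a small state machine. What follows is a minimal sketch, not the project's orchestrator: `Step`, `NEXT`, and `run_episode` are hypothetical names, the real phase enum is defined in docs/data-model.md, and the "always revert on failure" behavior is an assumption about what a clean snapshot loop would want.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Step(Enum):
    # Hypothetical step names, taken from the loop described in the table.
    BOOT = "boot"
    ARM = "arm"
    DETONATE = "detonate"
    OBSERVE = "observe"
    REVERT = "revert"


# Each step has exactly one successor; REVERT has none and ends the episode.
NEXT: Dict[Step, Step] = {
    Step.BOOT: Step.ARM,
    Step.ARM: Step.DETONATE,
    Step.DETONATE: Step.OBSERVE,
    Step.OBSERVE: Step.REVERT,
}


def _noop() -> None:
    pass


def run_episode(handlers: Dict[Step, Callable[[], None]]) -> List[Step]:
    """Run one episode and return the steps visited, in order."""
    visited: List[Step] = []
    step: Optional[Step] = Step.BOOT
    while step is not None:
        visited.append(step)
        try:
            handlers.get(step, _noop)()
        except Exception:
            if step is Step.REVERT:
                raise  # nothing safer to fall back to
            step = Step.REVERT  # assumption: any failure skips ahead to revert
            continue
        step = NEXT.get(step)
    return visited
```

Calling `run_episode({})` walks all five steps with no-op handlers. A handler that raises during `DETONATE` sends the loop straight to `REVERT`, so a failed episode still ends with the VM rolled back.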

Quick orientation

  1. Why VMs? We need a clean snapshot/revert loop and we need to run real malware without burning hardware. KVM gives us both at near-native speed.
  2. Why is the network isolated? A host-only bridge keeps malware off the internet and off the WG overlay. The Pi5 gateway is the lab-side observer, playing the same role it would play in a deployed setting.
  3. Why JSONL and not a database (yet)? Schema-last: collect first, decide storage shape after we see what's actually useful. JSONL is crash-safe, append-only, and reshapes trivially into Postgres/Timescale/Parquet later.
  4. Why two models? One is trained on features that exist on a real target device (deployable); the other is trained on host-side, QEMU-only features (oracle). The accuracy gap between them measures how much detection power a privileged rootkit can take from the deployed model. See docs/threat-model.md.
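
Point 3 describes JSONL as append-only and crash-safe. The sketch below illustrates that property, assuming one JSON object per line. `append_row` and `read_rows` are hypothetical helpers rather than code from this repository, and the real row schema is the one in docs/data-model.md.

```python
import json
import os
from typing import Any, Dict, Iterator


def append_row(path: str, row: Dict[str, Any]) -> None:
    """Append one JSON object as a single line and push it to disk."""
    line = json.dumps(row, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()              # drain Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to commit the bytes to storage


def read_rows(path: str) -> Iterator[Dict[str, Any]]:
    """Yield complete rows; stop at a torn final line left by a crash."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break  # partial write: the last line never got its newline
            yield json.loads(line)
```

Because every row is terminated by a newline, a crash in the middle of a write can damage at most the final line, and a reader can drop that line without losing earlier rows. Calling `fsync` on every row trades throughput for durability; batching several rows per sync would be a reasonable alternative at higher sample rates.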

Status

  • Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl.
  • Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids.
  • Host /proc oracle collector (source 1 of 5) at 10 Hz.
  • Synthetic envelope demo (tools/run_envelope_demo.py) — full 8-phase XMRig-shaped envelope produced end-to-end.
  • Phase 2 — real VM: Cirros boots under KVM, orchestrator collects telemetry against the real qemu-system pid (vm/launch_demo.sh + the existing orchestrator).
  • 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5).
  • 🚧 Exploit driver (Metasploit RPC) for armed → infecting transitions on session_open.
  • 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified).
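
The host /proc collector listed above samples a target pid at 10 Hz. Here is a sketch of how one such sample could be parsed, not the project's collector: `parse_stat` and `sample` are hypothetical helpers, the field positions follow the proc(5) manual page, and the choice of fields (utime, stime, rss) is an assumption about what the collector records.

```python
import os
from typing import Dict


def parse_stat(text: str) -> Dict[str, int]:
    """Parse the contents of /proc/<pid>/stat into a few numeric fields."""
    # comm (field 2) is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before tokenising the remainder.
    rest = text[text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so proc(5) field N sits at rest[N - 3].
    return {
        "utime_ticks": int(rest[14 - 3]),  # user-mode CPU time, clock ticks
        "stime_ticks": int(rest[15 - 3]),  # kernel-mode CPU time, clock ticks
        "rss_pages": int(rest[24 - 3]),    # resident set size, pages
    }


def sample(pid: int) -> Dict[str, float]:
    """Read one sample for pid (Linux only), converted to seconds and MiB."""
    with open(f"/proc/{pid}/stat", "r", encoding="utf-8") as f:
        raw = parse_stat(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    page_size = os.sysconf("SC_PAGE_SIZE")
    return {
        "cpu_seconds": (raw["utime_ticks"] + raw["stime_ticks"]) / ticks_per_second,
        "rss_mib": raw["rss_pages"] * page_size / (1024 * 1024),
    }
```

CPU utilisation over one 10 Hz interval would then be the difference in `cpu_seconds` between two consecutive samples divided by the 0.1 s between them. Keeping `parse_stat` separate from the file read lets the parsing be unit-tested on any platform with a captured stat line.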