CIS490 coursework
Maximus Gorog 32ae161ef2 README: embed demo plots, mark synthetic vs real clearly, add collapsibles
The README now leads with a 'What an episode looks like' section that
shows both:

  * docs/images/synthetic-envelope.png — pipeline-validation plot. Real
    telemetry of a real process whose load is shaped by tools/load_mimic.py
    (Python). Explicitly labelled NOT REAL MALWARE in the caption — the
    earlier wording was unclear.

  * docs/images/real-vm-idle.png — real Cirros 0.6.3 booted under KVM,
    same orchestrator + /proc collector pointed at the qemu-system pid.
    Idle baseline; no exploit, no payload yet.

A 'What's still missing for the real-malware envelope' table makes the
tier path explicit (real VM idle → real workload in-guest → real exploit
fire → real sample).

Repository nav, deploy steps, design rationale, and threat model are
moved into <details>...</details> blocks so first-time visitors see the
demo plots and the status list without scrolling past wall-of-text.
Stale Pi-as-deployment-target wording in the design-rationale section
is fixed alongside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:11:54 -06:00

CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough. This repository covers the dataset side: we run public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the deployed model would see in the field.

The work is grounded in the trust-over-time scoring model from IEEE 9881803.


What an episode looks like

Each episode runs a target through a labeled phase schedule (clean → armed → infecting → infected_running → dormant → ...) while sampling host-side /proc telemetry at 10 Hz. The dataset's "envelope" is the set of timestamped phase transitions written to labels.jsonl. Those rows share a monotonic clock with the metric rows, so any telemetry sample can be assigned its phase from its timestamp alone.
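
Because labels and metrics share one clock, the phase of a telemetry row is simply the most recent transition at or before its timestamp. A minimal sketch of that lookup follows. The field names `t_mono` and `phase` are assumptions made for illustration; the real schema lives in docs/data-model.md.

```python
import bisect
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def phase_at(labels, t):
    """Return the phase active at time t, or None before the first transition.

    `labels` must be sorted by the (hypothetical) `t_mono` field.
    """
    times = [row["t_mono"] for row in labels]
    # Index of the last transition whose timestamp is <= t.
    i = bisect.bisect_right(times, t) - 1
    return labels[i]["phase"] if i >= 0 else None


# Illustrative rows only, not real dataset output.
labels = load_jsonl(
    '{"t_mono": 0.0, "phase": "clean"}\n'
    '{"t_mono": 10.0, "phase": "armed"}\n'
    '{"t_mono": 15.0, "phase": "infecting"}\n'
)
telemetry = [{"t_mono": 9.9}, {"t_mono": 10.0}, {"t_mono": 17.3}]

# Attach a phase to every telemetry row.
labelled = [dict(row, phase=phase_at(labels, row["t_mono"])) for row in telemetry]
```

A binary search keeps the lookup cheap even for long episodes, and a transition is treated as taking effect at its own timestamp.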

Pipeline-validation plot — synthetic load, real telemetry

This is not real malware. The CPU/RSS/IO numbers are real /proc/<pid> reads of a real process; the workload shape is a Python program (tools/load_mimic.py) that mimics an XMRig-style envelope so we can validate the orchestrator + collector + labeling pipeline before plugging in a real exploit and a real sample. Coloured bands are phase labels straight out of labels.jsonl.

[Figure: Synthetic envelope demo (pipeline validation only), docs/images/synthetic-envelope.png]

Real-VM idle baseline — real Cirros guest under KVM, no malware yet

Same pipeline, pointed at the real qemu-system process running a fresh Cirros 0.6.3 guest with nothing happening inside it. Periodic ~10% CPU spikes are KVM/timer interrupts; the single ~1 MiB write near t=3 s is the guest finishing its late-boot disk activity. No phase transitions — just labelled clean for the whole window.

[Figure: Real Cirros VM idle, docs/images/real-vm-idle.png]

What's still missing for the real-malware envelope

| Tier | What it gives | Status |
|---|---|---|
| 1: real VM, idle | confidence the collector reads real KVM behaviour | done |
| 2: real VM, real workload from inside the guest | first real-load envelope shape | 🚧 next |
| 3: real VM, real exploit fire (Metasploitable + msfrpc) | honest armed → infecting transitions | 🚧 |
| 4: real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |

For an interactive view of any episode (zoom/pan/hover), run:

tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/

Status

  • Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
  • Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
  • Host /proc oracle collector (source 1 of 5) at 10 Hz
  • Synthetic envelope demo — full 8-phase envelope produced end-to-end
  • Real VM (Cirros under KVM) — orchestrator collects against the real qemu-system pid
  • 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
  • 🚧 Exploit driver (Metasploit RPC) for armed → infecting transitions on session_open
  • 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
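
For readers who have not written a /proc sampler before, here is a rough sketch of what one 10 Hz tick has to do: parse a /proc/<pid>/stat line and turn two consecutive samples into a CPU percentage. The function names and returned keys are illustrative and are not the repository's collector. Field positions follow proc(5) (utime is field 14, stime field 15, rss field 24).

```python
def parse_proc_stat(line):
    """Extract pid, comm, CPU ticks and resident pages from one stat line."""
    # comm may itself contain spaces and parentheses, so split on the LAST ')'.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition(" (")
    fields = tail.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": int(pid_str),
        "comm": comm,
        "utime": int(fields[11]),      # field 14, user-mode clock ticks
        "stime": int(fields[12]),      # field 15, kernel-mode clock ticks
        "rss_pages": int(fields[21]),  # field 24, resident set size in pages
    }


def cpu_percent(prev, cur, dt_s, clk_tck=100):
    """CPU usage between two samples taken dt_s seconds apart.

    clk_tck is ticks per second; on Linux read it with os.sysconf("SC_CLK_TCK").
    """
    ticks = (cur["utime"] + cur["stime"]) - (prev["utime"] + prev["stime"])
    return 100.0 * ticks / clk_tck / dt_s


# Made-up sample lines, shaped like real ones but not captured from a VM.
LINE_A = "4242 (qemu-system-x86 (x)) S 1 4242 4242 0 -1 4194560 100 0 0 0 250 50 0 0 20 0 3 0 12345 1000000 58880"
LINE_B = "4242 (qemu-system-x86 (x)) S 1 4242 4242 0 -1 4194560 100 0 0 0 251 50 0 0 20 0 3 0 12345 1000000 58880"
```

Note that at 10 Hz with a 100 Hz tick clock, a single sample can only resolve CPU usage in steps of 10 percentage points.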

Topology note: in this project the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts. It is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.


Quick start — run the synthetic envelope demo (~90 s)
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490

# One-time setup.
uv sync

# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data

# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/<episode_id>

# Or open an interactive plot in your browser:
tools/show_envelope.sh data/episodes/<episode_id>

The data lands in data/episodes/<ulid>/:

meta.json              episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl           orchestrator actions (snapshot_load, phase_transition, episode_end)
labels.jsonl           one row per phase transition — THIS is the envelope
telemetry-proc.jsonl   host /proc sampler at 10 Hz
done.marker            written last; the shipper only sees finished episodes
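
Because done.marker is written last, anything that consumes episodes can treat its presence as the signal that a directory is complete. A small sketch of that check follows; the helper name `finished_episodes` is hypothetical, and the shipper itself is not implemented yet.

```python
from pathlib import Path


def finished_episodes(data_root):
    """Return episode directories under <data_root>/episodes that are complete.

    A directory counts as complete only if it contains done.marker.
    ULID directory names sort by creation time, so the result is oldest first.
    """
    episodes_dir = Path(data_root) / "episodes"
    if not episodes_dir.is_dir():
        return []
    return sorted(
        p for p in episodes_dir.iterdir()
        if p.is_dir() and (p / "done.marker").exists()
    )
```

Calling `finished_episodes("data")` after the demo run should list the one episode it produced.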
Quick start — boot a real Linux VM (Cirros)

The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its QMP/monitor sockets and pidfile. The orchestrator then samples the real qemu-system process.

# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)

# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh

# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid "$QPID" --duration 20

# Plot:
tools/show_envelope.sh data/episodes/<episode_id>

The idle-VM envelope shape is distinct from the synthetic load: periodic ~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single late-boot disk write. That's a real KVM guest you're seeing.
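
The QMP collector is still on the to-do list, but the QMP socket the launcher exposes can already be queried by hand. Below is a minimal sketch of the QMP handshake (read the greeting, send qmp_capabilities, then run one command), assuming a UNIX stream socket. The helper names are made up for this example, and the socket path in the trailing comment is a guess; check vm/launch_demo.sh for the real one.

```python
import json
import socket


def connect_qmp(path):
    """Open a UNIX stream socket to a QMP endpoint."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def qmp_query(sock, command):
    """Negotiate capabilities, run one QMP command, and return its result."""
    stream = sock.makefile("rwb")

    def send(obj):
        stream.write(json.dumps(obj).encode() + b"\n")
        stream.flush()

    def read_msg():
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        return json.loads(line)

    def read_reply():
        # Asynchronous events can arrive at any time; skip them.
        while True:
            msg = read_msg()
            if "event" not in msg:
                return msg

    if "QMP" not in read_msg():
        raise RuntimeError("expected QMP greeting")
    send({"execute": "qmp_capabilities"})  # leaves negotiation mode
    read_reply()
    send({"execute": command})
    reply = read_reply()
    if "error" in reply:
        raise RuntimeError(reply["error"].get("desc", "QMP error"))
    return reply["return"]


# Usage (hypothetical socket path):
#   sock = connect_qmp("/tmp/cis490-vm/qmp.sock")
#   print(qmp_query(sock, "query-status"))
```

A real collector would keep the connection open and poll repeatedly instead of renegotiating for every command.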

Repository layout

| Path | What it holds |
|---|---|
| docs/architecture.md | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| docs/threat-model.md | Train/serve parity rule and the oracle-vs-deployable feature split |
| docs/data-model.md | On-disk JSONL schema, per-episode layout, phase enum |
| docs/transport.md | Sender/receiver design: how episodes get to the central collector over WG |
| docs/deploy.md | One-command install for the lab-host and receiver roles |
| docs/lab-setup.md | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| docs/sources.md | Works cited: every tool, dep, sample source, paper, and standard |
| orchestrator/ | State machine that drives the boot → arm → detonate → observe → revert loop |
| collectors/ | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| receiver/ | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| vm/ | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| tools/ | Demo runners, load mimic, plot scripts |
| exploits/ | Metasploit resource scripts for repeatable exploitation (TODO) |
| samples/ | Sample manifest (sha256-pinned). Binaries never committed. |
| training/ | Model training code (deferred; schema first) |
| etc/ | systemd units and config templates installed by the deploy scripts |

Design decisions — why these choices
  • Why VMs (not Docker)? We need a clean snapshot/revert loop and we need to run real malware without compromising the host. KVM gives both at near-native speed; containers share the host kernel and many samples detect containerization and refuse to detonate. See docs/architecture.md.
  • Why KVM (not TCG/-icount)? Deterministic emulation would strip out the timing noise of real hardware, and a model needs that noise in its training data to generalize. KVM is also ~15× faster than TCG, which directly multiplies dataset size per wall-clock hour. We pin 1 vCPU and cap CPU% via cgroup to preserve the "constrained device" framing.
  • Why JSONL (not a DB yet)? Schema-last. Collect first, decide storage shape after we see what's useful. JSONL is crash-safe, append-only, reshapes trivially into Postgres/Timescale/Parquet.
  • Why two models — realistic vs. oracle? Features that exist on a deployed device train the realistic model. Host-side QEMU telemetry (which doesn't exist in deployment) is oracle-only — used to assign honest labels at training time, never as a model input. The accuracy gap between the two measures how much detection power a privileged rootkit can take from us by lying to in-device tools. See docs/threat-model.md.
  • Why ULIDs for episode ids? Time-sortable, no coordinator, URL-safe.
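
To make the "time-sortable, no coordinator" point concrete, here is a sketch of the ULID layout as described in the public ULID spec: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is an illustration only. The orchestrator may well use a library for this, and `make_ulid` is a hypothetical name.

```python
import os
import time

# Crockford base32 alphabet: no I, L, O or U, and it sorts in ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def make_ulid(timestamp_ms=None, randomness=None):
    """Build a 26-character ULID from a ms timestamp and 10 random bytes."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)  # 80 bits, no coordinator needed
    if len(randomness) != 10 or not 0 <= timestamp_ms < 2**48:
        raise ValueError("need a 48-bit timestamp and exactly 10 random bytes")
    # Timestamp in the high 48 bits, randomness in the low 80 bits.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits; the top 2 bits are always zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Since the timestamp fills the leading characters, plain string sorting orders ids by creation millisecond. Ids minted within the same millisecond are ordered by their random part, so this sketch does not guarantee monotonic ids inside one millisecond.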
Deploying the receiver and lab-host roles

Two roles, one bootstrap command each. Detailed in docs/deploy.md:

  • lab-host — runs episodes, ships completed episodes to the receiver.
  • receiver — accepts ship uploads, stores tarballs + appends to index.jsonl. Runs on the Pi5 in our setup.

# On a lab host:
./scripts/install-lab-host.sh   # (TODO: currently brought up by hand per docs/deploy.md)

# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh   # (TODO: same)

For now both bootstrap scripts are scaffolds; the units and configs they install live in etc/. The receiver itself works today: adjust the paths in etc/receiver.toml.example, then run uv run python -m receiver --config etc/receiver.toml.example.
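
The two receiver properties listed under Status, sha256 verification and idempotency, can be illustrated without any HTTP at all. The sketch below uses an in-memory stand-in for the tarball store. The class name and the status codes returned for each case are assumptions made for this example, not the receiver's documented behaviour.

```python
import hashlib


class EpisodeStore:
    """In-memory stand-in for the receiver's tarball store (hypothetical)."""

    def __init__(self):
        self._digests = {}  # episode_id -> sha256 hex digest of stored body

    def put(self, episode_id, body, claimed_sha256):
        """Handle one upload attempt and return an HTTP-style status code."""
        actual = hashlib.sha256(body).hexdigest()
        if actual != claimed_sha256.lower():
            return 400  # body does not match the digest the sender claimed
        existing = self._digests.get(episode_id)
        if existing is None:
            self._digests[episode_id] = actual
            return 201  # first time we see this episode
        if existing == actual:
            return 200  # replay of an identical upload: accept, change nothing
        return 409      # same id, different content: refuse to overwrite
```

Idempotency matters here because a shipper that loses its connection mid-upload cannot know whether the receiver stored the episode, so it has to be safe to send the same tarball again.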

Threat model and feature-availability split

See docs/threat-model.md for the full argument. The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host /proc/<qemu_pid> | outside guest | oracle (label only) |
| QEMU QMP query-stats etc. | outside guest | oracle (label only) |
| perf stat -p <qemu_pid> | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest /proc, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the features go into the deployed model; the oracles are used to label episodes with high confidence (disagreement between in-guest and host-side data is itself a rootkit signal).
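
A rough sketch of how that split could look in code: one filter that drops oracle channels before anything reaches the model, and one check that treats a host-versus-guest gap as a signal. The channel keys, function names, and the 15-point tolerance are all placeholders invented for this example; only the oracle/feature roles come from the table above.

```python
# Roles copied from the table above. The dictionary keys are hypothetical
# identifiers, not names used by the repository.
CHANNEL_ROLE = {
    "host_proc": "oracle",
    "qmp": "oracle",
    "host_perf": "oracle",
    "bridge_pcap": "feature",
    "guest_agent": "feature",
}


def model_inputs(sample):
    """Keep only channels a deployed device could observe.

    Unknown channels are dropped too, so a new oracle source cannot leak
    into the model by accident.
    """
    return {ch: v for ch, v in sample.items() if CHANNEL_ROLE.get(ch) == "feature"}


def oracle_disagrees(host_cpu_pct, guest_cpu_pct, tolerance_pct=15.0):
    """True when the host sees far more CPU than the guest reports.

    The tolerance is an arbitrary placeholder, not a tuned value.
    """
    return host_cpu_pct - guest_cpu_pct > tolerance_pct
```

For example, a host reading of 90% CPU against an in-guest reading of 12% would be flagged, while 10% against 8% would not.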


Citing this work

A short course-project citation, until the dataset reaches a publishable form:

Gorog, M. CIS490 Behavioral Malware Detection Dataset (in progress). Spectral lab, 2026.

See docs/sources.md for everything else this project leans on.