CIS490 coursework
max 1b6c7b2f4a Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts
This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py          — source 2 (oracle). Tiny synchronous QMP
                               client + row builder + run loop. Tolerates
                               older qemu without query-stats.
  collectors/guest_agent.py  — source 5 (deployable). Reads the
                               virtio-serial host-side socket, parses
                               agent JSON-lines, re-stamps to the host
                               monotonic clock, persists.
  collectors/pcap.py         — source 4 (deployable). tcpdump capture
                               + pure-Python pcap reader + 100 ms
                               netflow.jsonl bucketizer. Decodes
                               Ethernet/IPv4/TCP/UDP enough for the
                               schema in docs/data-model.md.
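The QMP handshake that a tiny synchronous client has to perform can be sketched as follows. This is a minimal illustration and not the code in collectors/qmp.py: the class name `QMPClient` and its methods are hypothetical. Only the wire protocol (greeting, `qmp_capabilities`, newline-delimited JSON replies, interleaved async events) comes from QEMU itself.

```python
import json
import socket


class QMPClient:
    """Hypothetical minimal synchronous QMP client (illustration only)."""

    def __init__(self, sock):
        self._sock = sock
        self._rfile = sock.makefile("rb")
        self.events = []  # async events seen while waiting for replies

    def _read_msg(self):
        line = self._rfile.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        return json.loads(line)

    def handshake(self):
        # QEMU speaks first with a greeting; commands are refused until
        # the client has sent qmp_capabilities.
        greeting = self._read_msg()
        if "QMP" not in greeting:
            raise RuntimeError("unexpected QMP greeting")
        self.execute("qmp_capabilities")
        return greeting["QMP"]

    def execute(self, command, arguments=None):
        msg = {"execute": command}
        if arguments:
            msg["arguments"] = arguments
        self._sock.sendall(json.dumps(msg).encode("utf-8") + b"\n")
        while True:
            reply = self._read_msg()
            if "event" in reply:
                # Events can arrive between a command and its reply.
                self.events.append(reply)
                continue
            if "error" in reply:
                raise RuntimeError(reply["error"].get("desc", "QMP error"))
            return reply["return"]


def connect(path):
    """Open the QMP unix socket at `path` and complete the handshake."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    client = QMPClient(sock)
    client.handshake()
    return client
```

With a client of this shape, a collector could tolerate an older qemu by catching the `RuntimeError` raised for an unknown command such as query-stats and skipping that column.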

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.
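The capacity math described above (cores, RAM, and load headroom) could look roughly like this. It is a sketch under stated assumptions, not the contents of orchestrator/fleet.py: the `Capacity` dataclass, the reserve values, and the 512 MiB-per-VM figure are all hypothetical.

```python
import json
import math
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Capacity:
    """Hypothetical stand-in for the FleetCapacity dataclass."""
    cores: int
    mem_available_mib: int
    load_1m: float
    max_concurrent: int


def max_concurrent(cores, mem_available_mib, load_1m,
                   reserve_cores=1, reserve_mib=1024, mib_per_vm=512):
    """Return how many VMs fit after reserving host headroom.

    All three reserve/size defaults are assumptions chosen for illustration.
    """
    by_cpu = math.floor(cores - reserve_cores - load_1m)
    by_ram = (mem_available_mib - reserve_mib) // mib_per_vm
    return max(0, min(by_cpu, by_ram))


def detect():
    """Read live host figures (Linux only: /proc/meminfo and getloadavg)."""
    cores = os.cpu_count() or 1
    with open("/proc/meminfo", encoding="ascii") as fh:
        fields = dict(line.split(":", 1) for line in fh)
    mem_mib = int(fields["MemAvailable"].split()[0]) // 1024  # value is in kB
    load_1m = os.getloadavg()[0]
    return Capacity(cores, mem_mib, load_1m,
                    max_concurrent(cores, mem_mib, load_1m))


def to_meta(capacity):
    """Serialize the capacity so it can be embedded in meta.json."""
    return json.dumps(asdict(capacity))
```

Because the result is a plain dataclass, `asdict` gives a JSON-ready dictionary, which is one way the concurrency level of an episode can stay auditable afterwards.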

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py       — fake QMP server, capability handshake,
                            blockstats, async-event interleaving,
                            5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
                              re-stamp, malformed-line tolerance
  tests/test_pcap.py      — synthetic pcap with TCP/UDP/ARP frames,
                            bucketize correctness across windows
  tests/test_fleet.py     — capacity math (8-core idle / low-RAM /
                            high-load / Pi5 / 1-core box), manifest
                            selection determinism + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:02:27 -05:00
collectors Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
docs Tier 3: msfrpc-driven exploit driver + first module config 2026-04-29 23:11:52 -05:00
etc Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
exploits Tier 3: msfrpc-driven exploit driver + first module config 2026-04-29 23:11:52 -05:00
orchestrator Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
receiver Lab-host shipper + receiver /v1/ping + install scripts 2026-04-29 23:41:32 -05:00
samples Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
scripts Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
shipper Lab-host shipper + receiver /v1/ping + install scripts 2026-04-29 23:41:32 -05:00
tests Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
tools Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
training Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
vm Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
.gitignore Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00
AGENTS.md Lab-host shipper + receiver /v1/ping + install scripts 2026-04-29 23:41:32 -05:00
pyproject.toml Tier 3: msfrpc-driven exploit driver + first module config 2026-04-29 23:11:52 -05:00
README.md Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts 2026-04-30 00:02:27 -05:00
uv.lock Tier 2: real Alpine VM, real workload, real envelope 2026-04-29 08:38:53 -06:00

CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough. This repository covers the dataset side: we run public malware samples against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the deployed model would see in the field.

The work is grounded in the trust-over-time scoring model from IEEE 9881803.


What an episode looks like

Each episode runs a target through a labeled phase schedule (clean → armed → infecting → infected_running → dormant → ...) while sampling host-side /proc telemetry at 10 Hz. The dataset's "envelope" is the set of timestamped phase transitions written to labels.jsonl. Labels and metric rows share one monotonic clock, so any metric row can be assigned its phase by comparing timestamps.
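As an illustration of that alignment, the sketch below attaches a phase to each telemetry row by bisecting the sorted label timestamps. The field names `t_mono` and `phase` are assumptions made for this example; the authoritative schema is in docs/data-model.md.

```python
import bisect
import json


def load_jsonl(path):
    """Read a JSON-lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def label_rows(rows, labels, ts_key="t_mono", phase_key="phase"):
    """Return copies of `rows` annotated with the phase in force at each row.

    `ts_key` and `phase_key` are hypothetical field names. Rows that predate
    the first label get a phase of None.
    """
    labels = sorted(labels, key=lambda lab: lab[ts_key])
    starts = [lab[ts_key] for lab in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        out.append({**row, phase_key: phase})
    return out
```

For a real episode, the two inputs would come from `load_jsonl` on telemetry-proc.jsonl and labels.jsonl in the same episode directory.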

Tier 2 — real Alpine VM, real workload driven from inside the guest

This is the closest we get to real-malware behaviour without yet running real malware. Telemetry is real /proc/<qemu_pid> from outside the guest, and the load is generated inside the guest by busybox yes (CPU saturation) and dd (disk bursts), driven over the serial console by tools/vm_load_controller.py. Every phase transition in labels.jsonl corresponds to an actual command issued inside the real VM.
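Reading /proc/<qemu_pid> from outside the guest comes down to parsing the process's stat file and differencing CPU time between samples. The following is a minimal sketch, not the project's collector: the function names and the returned keys are hypothetical, and the field positions follow the Linux proc(5) layout (utime and stime are fields 14 and 15, rss is field 24).

```python
import os
import time


def parse_proc_stat(text, clk_tck=100, page_size=4096):
    """Extract cumulative CPU seconds and RSS bytes from /proc/<pid>/stat."""
    # The comm field may contain spaces or ')', so split after the LAST ')'.
    rest = text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    utime, stime = int(rest[11]), int(rest[12])
    rss_pages = int(rest[21])
    return {"cpu_s": (utime + stime) / clk_tck,
            "rss_bytes": rss_pages * page_size}


def cpu_percent(prev, cur, dt_s):
    """CPU usage over the interval, where 100.0 means one full core."""
    return 100.0 * (cur["cpu_s"] - prev["cpu_s"]) / dt_s


def sample(pid, hz=10.0, count=10):
    """Yield `count` rows at roughly `hz` samples per second (Linux only)."""
    clk = os.sysconf("SC_CLK_TCK")
    page = os.sysconf("SC_PAGE_SIZE")
    prev, prev_t = None, None
    for _ in range(count):
        with open(f"/proc/{pid}/stat", encoding="ascii", errors="replace") as fh:
            cur = parse_proc_stat(fh.read(), clk, page)
        now = time.monotonic()
        if prev is not None:
            yield {"t_mono": now, "rss_bytes": cur["rss_bytes"],
                   "cpu_pct": cpu_percent(prev, cur, now - prev_t)}
        prev, prev_t = cur, now
        time.sleep(1.0 / hz)
```

On a saturated single-vCPU guest, rows produced this way would be expected to sit near 100%, matching the plateau shape described for the yes workload.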

[Figure: Real Alpine VM envelope]

The 100% CPU plateaux are yes > /dev/null running on the guest's single vCPU; the IO spikes during infecting are dd if=/dev/urandom producing the sample-drop shape; the dormant drops are the controller killing the load process inside the VM. The infected_running → dormant → infected_running re-entry is the textbook envelope that justifies the whole project framing.

Reproduce with:

uv run python tools/run_real_vm_demo.py --data-root data

Tier 1 — real Alpine VM, idle baseline

Same pipeline, pointed at the real qemu-system process while the guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer interrupts; the single disk-write spike near t=3 s is the guest finishing late-boot activity.

[Figure: Real VM idle baseline]

Pipeline-validation plot — synthetic load, real telemetry

This is not real malware and the load is not even running inside a VM — it's a Python program on the host (tools/load_mimic.py) that mimics an XMRig-style envelope. We used it to validate the orchestrator + collector + labeling pipeline before plugging in a real guest. Kept here because it shows the same shape the tier-2 plot above produces from real KVM behaviour.

[Figure: Synthetic envelope (host-side mimic)]

What's still missing for the real-malware envelope

| Tier | What it gives | Status |
|---|---|---|
| 1: real VM, idle | confidence the collector reads real KVM behaviour | done |
| 2: real VM, real workload from inside the guest | first real-load envelope shape | done |
| 3: real VM, real exploit fire (Metasploitable + msfrpc) | honest armed → infecting transitions | 🟡 driver landed, integration pending |
| 4: real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |

The Tier-3 driver lives in exploits/ — a tiny msgpack-over-HTTPS msfrpc client plus an MSFExploitDriver plugged into the orchestrator as the on_phase callback. First canned module: exploits/modules/vsftpd_234_backdoor.toml (Metasploitable2's CVE-2011-2523). End-to-end integration needs msfrpcd running and a Metasploitable2 image at vm/images/, which is the next bring-up step.

For an interactive view of any episode (zoom/pan/hover), run:

tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/

Status

  • Receiver (HTTPS PUT, sha256-verified, idempotent) — running on Pi5 via Caddy + mTLS (wg-pki client CA)
  • Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
  • Host /proc oracle collector (source 1) @ 10 Hz
  • QMP collector (source 2) — query-status / query-blockstats / query-stats, 1 Hz
  • Bridge pcap (source 4) — pure-Python pcap parser + 100 ms-bucketed netflow.jsonl
  • In-guest agent (source 5) — virtio-serial; cidata-embedded for first-boot install on Alpine; host-side reader re-stamps to host clock
  • Synthetic envelope demo — full 8-phase envelope produced end-to-end
  • Real VM (Alpine 3.21 cloud-init under KVM)
  • Tier 2 — real VM, real workload: serial-console-driven load controller fires yes/dd inside the guest at every phase transition
  • 🟡 Tier 3 — exploit driver: MSFExploitDriver + msfrpc client + first module config landed; scripts/install-msfrpcd.sh automates msfrpcd setup; scripts/fetch-metasploitable2.sh pulls + verifies the target image (URL+sha256 from operator). Driver v2 (sample-profile-driven workloads) is the next step for ML diversity.
  • Shipper — lab-host ↔ Pi receiver via tar+zstd PUT over WG with mTLS; --ping smoke mode
  • Fleet runner — host-capacity-aware concurrency (tools/run_fleet.py); resource detector reserves cores + RAM headroom; sample manifest with deterministic per-(host, slot, episode) selection so every host on the network produces novel, varied, labeled data
  • Sample manifest — six initial profiles (cryptominer / botnet / ransomware / banking-trojan / fileless / RAT). Real-malware fetch from MalwareBazaar is the Tier-4 follow-up.
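The coordinator-free, deterministic per-(host, slot, episode) selection mentioned in the fleet-runner item can be illustrated with a hash-derived ordering. This is a sketch of one possible scheme and not the logic in samples/manifest.py: the function body and the per-host permutation approach are assumptions. The profile names are taken from the list above.

```python
import hashlib

PROFILES = ["cryptominer", "botnet", "ransomware",
            "banking-trojan", "fileless", "RAT"]


def host_order(profiles, host_id):
    """Return a permutation of `profiles` that depends only on `host_id`."""
    seed = hashlib.sha256(host_id.encode("utf-8")).digest()
    return sorted(
        profiles,
        key=lambda name: hashlib.sha256(seed + name.encode("utf-8")).digest(),
    )


def select(profiles, host_id, slot, episode_index):
    """Pick one profile; identical inputs always give the identical answer.

    Hypothetical scheme: each host walks its own permutation, and slots in
    the same wave are offset so they never collide while there are fewer
    slots than profiles.
    """
    order = host_order(profiles, host_id)
    return order[(episode_index + slot) % len(order)]
```

Because the ordering depends only on the host identifier, every host can compute its own schedule locally, and rerunning an episode index reproduces the same sample choice.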

Topology note: in this project the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts. It is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.


Quick start — run the synthetic envelope demo (~90 s)

git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490

# One-time setup.
uv sync

# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
uv run python tools/run_envelope_demo.py --data-root data

# Render a static PNG envelope of that episode.
uv run python tools/plot_envelope.py data/episodes/<episode_id>

# Or open an interactive plot in your browser:
tools/show_envelope.sh data/episodes/<episode_id>

The data lands in data/episodes/<ulid>/:

meta.json              episode metadata (image, snapshot, schedule, host fingerprint)
events.jsonl           orchestrator actions (snapshot_load, phase_transition, episode_end)
labels.jsonl           one row per phase transition — THIS is the envelope
telemetry-proc.jsonl   host /proc sampler at 10 Hz
done.marker            written last; the shipper only sees finished episodes
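
Since done.marker is written last, a consumer can treat its presence as the signal that an episode directory is complete. The sketch below shows that check; the helper names are hypothetical and nothing here is taken from the shipper's actual code.

```python
import json
from pathlib import Path


def finished_episodes(data_root):
    """Return episode directories under <data_root>/episodes that are complete.

    An episode counts as complete only when done.marker exists in it.
    """
    root = Path(data_root) / "episodes"
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / "done.marker").exists()
    )


def read_jsonl(path):
    """Load one of the episode's JSON-lines files as a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Sorting by directory name also sorts by creation time, because the directory names are ULIDs.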

Quick start — boot a real Linux VM (Cirros)

The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its QMP/monitor sockets and pidfile. The orchestrator then samples the real qemu-system process.

# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
# (See docs/sources.md for the Cirros sha256.)

# Boot in one terminal:
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh

# In another terminal, point the orchestrator at the VM's pid:
QPID=$(cat /tmp/cis490-vm/qemu.pid)
uv run python -m orchestrator --target-pid $QPID --duration 20

# Plot:
tools/show_envelope.sh data/episodes/<episode_id>

The idle-VM envelope shape is distinct from the synthetic load: periodic ~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single late-boot disk write. That's a real KVM guest you're seeing.

Repository layout
| Path | What it holds |
|---|---|
| docs/architecture.md | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| docs/threat-model.md | Train/serve parity rule and the oracle-vs-deployable feature split |
| docs/data-model.md | On-disk JSONL schema, per-episode layout, phase enum |
| docs/transport.md | Sender/receiver design: how episodes get to the central collector over WG |
| docs/deploy.md | One-command install for the lab-host and receiver roles |
| docs/lab-setup.md | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| docs/sources.md | Works cited: every tool, dep, sample source, paper, and standard |
| orchestrator/ | State machine that drives the boot → arm → detonate → observe → revert loop |
| collectors/ | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| receiver/ | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
| vm/ | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
| tools/ | Demo runners, load mimic, plot scripts |
| exploits/ | MSF RPC client + driver + per-module TOML configs (Tier 3) |
| samples/ | Sample manifest (sha256-pinned). Binaries never committed. |
| training/ | Model training code (deferred until the schema settles) |
| etc/ | systemd units and config templates installed by the deploy scripts |

Design decisions — why these choices
  • Why VMs (not Docker)? We need a clean snapshot/revert loop and we need to run real malware without compromising the host. KVM gives both at near-native speed; containers share the host kernel and many samples detect containerization and refuse to detonate. See docs/architecture.md.
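  • Why KVM (not TCG/-icount)? The model has to generalize across real timing noise, so the training data should contain that noise rather than the deterministic timing -icount is designed to give. KVM is also ~15× faster than TCG, which directly multiplies dataset size per wall-clock hour. We pin 1 vCPU and cap CPU% via cgroup to preserve the "constrained device" framing.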
  • Why JSONL (not a DB yet)? Schema-last. Collect first, decide storage shape after we see what's useful. JSONL is crash-safe, append-only, reshapes trivially into Postgres/Timescale/Parquet.
  • Why two models — realistic vs. oracle? Features that exist on a deployed device train the realistic model. Host-side QEMU telemetry (which doesn't exist in deployment) is oracle-only — used to assign honest labels at training time, never as a model input. The accuracy gap between the two measures how much detection power a privileged rootkit can take from us by lying to in-device tools. See docs/threat-model.md.
  • Why ULIDs for episode ids? Time-sortable, no coordinator, URL-safe.
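
To make the "time-sortable, no coordinator" property of ULIDs concrete, here is a minimal generator following the public ULID layout (48-bit millisecond timestamp, 80 random bits, 26 Crockford base32 characters). It is an illustration only; the function name is hypothetical and the project may well use a library instead.

```python
import os
import time

# Crockford base32 alphabet: no I, L, O, or U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(ts_ms=None, rand=None):
    """Build a 26-character ULID string.

    `ts_ms` and `rand` can be injected for testing; by default they come
    from the wall clock and os.urandom.
    """
    ts_ms = int(time.time() * 1000) if ts_ms is None else ts_ms
    rand = os.urandom(10) if rand is None else rand
    if not 0 <= ts_ms < 2 ** 48 or len(rand) != 10:
        raise ValueError("timestamp must fit 48 bits and rand must be 10 bytes")
    value = (ts_ms << 80) | int.from_bytes(rand, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, the top 2 are always zero
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

The timestamp occupies the most significant bits and the alphabet is in ascending ASCII order, so plain string sorting of the ids also sorts them by creation time.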

Deploying the receiver and lab-host roles

Two roles, one bootstrap command each. Detailed in docs/deploy.md:

  • lab-host — runs episodes, ships completed episodes to the receiver.
  • receiver — accepts ship uploads, stores tarballs + appends to index.jsonl. Runs on the Pi5 in our setup.

# On a lab host:
./scripts/install-lab-host.sh   # (TODO — currently bring up by hand per docs/deploy.md)

# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh   # (TODO — same)

For now both bootstrap scripts are scaffolds; the units and configs they install live in etc/. The receiver itself works today: run uv run python -m receiver --config etc/receiver.toml.example after adjusting the paths in that file.
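
The receiver's sha256-verified, idempotent ingest can be pictured as the storage routine below. It is a sketch under assumptions, not the Starlette handler in receiver/: the function name, the .tar.zst file naming, and the index row fields are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def store_episode(store_dir, episode_id, body, claimed_sha256):
    """Verify and store one uploaded tarball.

    Returns "stored" for a new upload and "duplicate" when the identical
    tarball is already present, so a retried PUT is harmless.
    """
    digest = hashlib.sha256(body).hexdigest()
    if digest != claimed_sha256.lower():
        raise ValueError("sha256 mismatch: upload rejected")

    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    dest = store / f"{episode_id}.tar.zst"
    if dest.exists():
        if hashlib.sha256(dest.read_bytes()).hexdigest() == digest:
            return "duplicate"
        raise FileExistsError(f"{episode_id} already stored with other content")

    # Write to a temporary name, then rename, so readers never see a
    # half-written tarball.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(body)
    tmp.replace(dest)

    with (store / "index.jsonl").open("a", encoding="utf-8") as fh:
        row = {"episode_id": episode_id, "sha256": digest, "bytes": len(body)}
        fh.write(json.dumps(row) + "\n")
    return "stored"
```

Only the first successful upload appends to index.jsonl, which keeps the index at one row per episode even when the shipper retries.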

Threat model and feature-availability split

See docs/threat-model.md for the full argument. The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host /proc/<qemu_pid> | outside guest | oracle (label only) |
| QEMU QMP query-stats etc. | outside guest | oracle (label only) |
| perf stat -p <qemu_pid> | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest /proc, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the features go into the deployed model; the oracles are used to label episodes with high confidence (disagreement between in-guest and host-side data is itself a rootkit signal).
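
The split can be expressed in code as a role lookup plus a disagreement check. The sketch below is illustrative only: the channel keys, the function names, and the 15-point tolerance are assumptions, and the roles are copied from the channel table.

```python
# Roles copied from the channel table; the dictionary keys are hypothetical.
CHANNEL_ROLE = {
    "host_proc": "oracle",
    "qmp": "oracle",
    "perf_host": "oracle",
    "bridge_pcap": "feature",
    "guest_agent": "feature",
}


def model_inputs(sample):
    """Keep only the channels a deployed device could actually observe."""
    return {ch: v for ch, v in sample.items()
            if CHANNEL_ROLE.get(ch) == "feature"}


def cpu_disagreement(host_cpu_pct, guest_cpu_pct, tolerance_pct=15.0):
    """Flag when the host sees much more CPU use than the guest reports.

    The tolerance is an arbitrary illustrative value, not a tuned threshold.
    """
    return (host_cpu_pct - guest_cpu_pct) > tolerance_pct
```

A guest that reports near-idle CPU while the host-side oracle shows a saturated vCPU would trip this check, which is the kind of mismatch a rootkit hiding a process would produce.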


Citing this work

A short course-project citation, until the dataset reaches a publishable form:

Gorog, M. CIS490 Behavioral Malware Detection Dataset (in progress). Spectral lab, 2026.

See docs/sources.md for everything else this project leans on.