CIS490 coursework
max bdcd2ecbef Close out the open issues: bridge pcap wiring, perf collector, Tier-4
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.

Closes spectral/CIS490#4, #5, #6.

== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
  bridge_iface: str | None        # e.g. "br-malware"
  bridge_ip:    str = "10.200.0.1"
  pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.

vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.

run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.

== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.

EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.

== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.

samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.

exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.

exploits/driver.py: dispatch order is now:
  1) sample.kind == "real" + binary staged at sample_store_root
     → real_binary_workload (Tier 4)
  2) profile mimic from workloads.workload_for() (Tier 3 v2)
  3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.

samples/store/.gitkeep added (binaries themselves are gitignored).

Tests: 102 pass (was 86). New suites:
  tests/test_perf_qemu.py — parser + builder + perf-missing fallback
  tests/test_tier4.py     — real_binary_workload base64 round-trip,
                            stop-cmd kills pidfile, per-profile path
                            isolation, driver dispatch chooses real vs
                            mimic correctly, fetcher input validation
                            and cached-fast-path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:17:49 -05:00
| Path | Last commit | Date |
|---|---|---|
| collectors | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| docs | Tier 3: msfrpc-driven exploit driver + first module config | 2026-04-29 23:11:52 -05:00 |
| etc | Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts | 2026-04-30 00:02:27 -05:00 |
| exploits | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| orchestrator | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| receiver | Lab-host shipper + receiver /v1/ping + install scripts | 2026-04-29 23:41:32 -05:00 |
| samples | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| scripts | Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts | 2026-04-30 00:02:27 -05:00 |
| shipper | Lab-host shipper + receiver /v1/ping + install scripts | 2026-04-29 23:41:32 -05:00 |
| tests | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| tools | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| training | Scaffold project: docs, repo skeleton, transport + deploy design | 2026-04-28 23:21:00 -06:00 |
| vm | Close out the open issues: bridge pcap wiring, perf collector, Tier-4 | 2026-04-30 00:17:49 -05:00 |
| .gitignore | Scaffold project: docs, repo skeleton, transport + deploy design | 2026-04-28 23:21:00 -06:00 |
| AGENTS.md | README + AGENTS.md: reflect fleet, driver v2, all 4 collectors | 2026-04-30 00:11:35 -05:00 |
| pyproject.toml | Tier 3: msfrpc-driven exploit driver + first module config | 2026-04-29 23:11:52 -05:00 |
| README.md | README + AGENTS.md: reflect fleet, driver v2, all 4 collectors | 2026-04-30 00:11:35 -05:00 |
| uv.lock | Tier 2: real Alpine VM, real workload, real envelope | 2026-04-29 08:38:53 -06:00 |

CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end goal is an ML model that watches performance metrics on a real device, decides whether the device has been breached, and triggers a hardware-level reset when confidence is high enough. This repository covers the dataset side: we run public malware samples (and behavior-matched mimics) against intentionally vulnerable Linux VMs and capture labeled time-series telemetry that mirrors what the deployed model would see in the field.

Concretely, every lab host on the WireGuard mesh detects how much capacity it has, spins up that many concurrent VMs, gives each VM a different malware profile from the manifest, and ships the resulting labeled episode tarballs to the central receiver on the Pi over mTLS. Running the same fleet on multiple hosts gives novel, non-overlapping data per host with no coordinator — see Multi-host fleet below.

The work is grounded in the trust-over-time scoring model from IEEE 9881803.


What an episode looks like

Each episode runs a target through a labeled phase schedule (clean → armed → infecting → infected_running → dormant → ...) while sampling host-side /proc telemetry at 10 Hz. The dataset's "envelope" is the set of timestamped phase transitions written to labels.jsonl — sharing a monotonic clock with the metric rows so anything aligned in time can be aligned in code.
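
Because the labels and the metric rows share one clock, labeling a metric row reduces to finding the latest phase transition at or before its timestamp. The sketch below illustrates that join; the field names `t_mono` and `phase` are assumptions made for the example (the actual schema is defined in docs/data-model.md), and `label_rows` is a hypothetical helper, not part of the repository.

```python
import bisect

def label_rows(labels, rows, default="clean"):
    """Attach the phase in force at each metric row's timestamp.

    `labels` must be sorted by t_mono (one entry per phase transition).
    `rows` may arrive in any order.
    """
    starts = [lab["t_mono"] for lab in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row["t_mono"]) - 1
        phase = labels[i]["phase"] if i >= 0 else default
        out.append({**row, "phase": phase})
    return out

# Hypothetical rows; field names are assumptions for illustration only.
labels = [
    {"t_mono": 0.0, "phase": "clean"},
    {"t_mono": 5.0, "phase": "armed"},
    {"t_mono": 8.0, "phase": "infecting"},
]
rows = [
    {"t_mono": 0.1, "cpu": 2.0},
    {"t_mono": 5.0, "cpu": 3.0},
    {"t_mono": 9.9, "cpu": 97.0},
]
labeled = label_rows(labels, rows)
```

A row stamped exactly at a transition takes the new phase, which is one reasonable convention; the opposite choice would use `bisect_left`.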

Tier 2 — real Alpine VM, profile-driven workload inside the guest

This is the closest we get to real-malware behaviour without yet running real malware. Telemetry is real /proc/<qemu_pid> from outside the guest plus three more sources running concurrently (QMP, bridge pcap, in-guest agent — see Telemetry sources below). The load itself is generated inside the guest by a profile-matched shell command from exploits/workloads.py, driven over the serial console by tools/vm_load_controller.py.

Each sample's profile (from samples/manifest.toml) dispatches to a different in-session workload, so the envelope each VM produces is observably different per family — exactly the variance the ML model needs to learn:

| profile | shape |
|---|---|
| cpu-saturate | sustained 1-vCPU saturation (XMRig) |
| scan-and-dial | SYN-style probes across the bridge subnet + dial-home |
| io-walk | fs traversal + 4 KiB urandom writes (ransomware) |
| bursty-c2 | long idle + periodic 3-packet egress burst (Dridex) |
| low-and-slow | minimal CPU + periodic memory churn (Kovter / fileless) |
| shell-resident | one long-lived TCP socket + periodic command ticks (RAT) |

Every phase transition in labels.jsonl corresponds to an actual command issued inside the real VM, and meta.json records which sample / profile / kind drove it.

[Figure: Real Alpine VM envelope]

The 100% CPU plateaux are yes > /dev/null running on the guest's single vCPU; the IO spikes during infecting are dd if=/dev/urandom producing the sample-drop shape; the dormant drops are the controller killing the load process inside the VM. The infected_running → dormant → infected_running re-entry is the textbook envelope that justifies the whole project framing.

Reproduce one episode (profile-driven via --sample or SAMPLE_NAME env, defaults to the v1 yes-loop without one):

uv run python tools/run_real_vm_demo.py --data-root data \
    --sample xmrig-cryptominer

Or run the fleet — one wave of max_concurrent parallel episodes, each slot pulling a different sample from the manifest:

uv run python tools/run_fleet.py --capacity            # see what the host can do
uv run python tools/run_fleet.py --waves 1 --data-root data

Tier 1 — real Alpine VM, idle baseline

Same pipeline, pointed at the real qemu-system process while the guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer interrupts; the single disk-write spike near t=3 s is the guest finishing late-boot activity.

[Figure: Real VM idle baseline]

Pipeline-validation plot — synthetic load, real telemetry

This is not real malware and the load is not even running inside a VM — it's a Python program on the host (tools/load_mimic.py) that mimics an XMRig-style envelope. We used it to validate the orchestrator + collector + labeling pipeline before plugging in a real guest. Kept here because it shows the same shape the tier-2 plot above produces from real KVM behaviour.

[Figure: Synthetic envelope (host-side mimic)]

Tier 3 — real exploit fire, profile-matched workload (Driver v2)

The Tier-3 driver lives in exploits/ — a tiny msgpack-over-HTTPS msfrpc client + MSFExploitDriver. With a Sample supplied, the driver dispatches the post-exploit infected_running workload through exploits/workloads.py — same six profiles as Tier 2, so a fleet wave produces matched envelopes whether or not an exploit fires. Without a sample, the v1 yes-loop path is preserved for smoke runs.
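
The dispatch rule described above (profile workload when a sample is supplied, v1 yes-loop otherwise) can be sketched as a small lookup with a fallback. Everything named here is hypothetical: `PROFILE_COMMANDS`, `infected_running_command`, and the placeholder command strings are illustrative stand-ins, not the contents of exploits/workloads.py or exploits/driver.py.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder commands only; the real per-profile loops live in
# exploits/workloads.py and are not reproduced here.
PROFILE_COMMANDS = {
    "cpu-saturate": "echo 'cpu-saturate placeholder'",
    "io-walk": "echo 'io-walk placeholder'",
}

# The v1 smoke-run fallback named in the README is a yes-loop.
V1_FALLBACK = "yes > /dev/null &"

@dataclass
class Sample:
    name: str
    profile: str

def infected_running_command(sample: Optional[Sample]) -> str:
    """Pick the in-session command for the infected_running phase."""
    if sample is None:
        return V1_FALLBACK  # no sample supplied: smoke-run path
    # Unknown profiles also fall back, so a manifest typo cannot crash a wave.
    return PROFILE_COMMANDS.get(sample.profile, V1_FALLBACK)

miner = Sample(name="xmrig-cryptominer", profile="cpu-saturate")
chosen = infected_running_command(miner)
```

Falling back on an unknown profile is a design choice made for this sketch; a stricter driver might raise instead so that manifest errors surface early.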

First canned module: exploits/modules/vsftpd_234_backdoor.toml (Metasploitable2's CVE-2011-2523). scripts/install-msfrpcd.sh sets up msfrpcd (loopback only) as a hardened systemd unit; scripts/fetch-metasploitable2.sh pulls and sha256-verifies a target image from an operator-supplied URL.

Tier maturity

| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collectors read real KVM behaviour | done |
| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | done |
| 3 — real VM, real exploit fire + profile workload | honest armed → infecting transitions, driver v2 dispatch | code done; awaiting Metasploitable2 image + msfrpcd on a lab host |
| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | 🚧 manifest schema ready (sample.sha256, kind=real); fetcher TBD |

Telemetry sources (all four wire into one episode dir)

| # | Source | Vantage | Role |
|---|---|---|---|
| 1 | host /proc/<qemu_pid> | outside | oracle (label only) |
| 2 | QEMU QMP queries | outside | oracle (label only) |
| 3 | perf stat -p <qemu_pid> | outside | oracle (planned) |
| 4 | Bridge pcap → 100 ms netflow | gateway-side | feature (deployable) |
| 5 | In-guest agent (virtio-serial) | inside | feature (deployable) |

Sources 1, 2, 4, 5 are live as of this commit. The deploy/oracle split follows docs/threat-model.md: only sources 4 + 5 are usable as model features in the field — sources 1, 2, 3 exist as labeling oracles only.
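
One way to enforce the deploy/oracle split at training time is to route every column by its source before anything reaches the model. The sketch below assumes merged rows whose keys carry a `source.` prefix; that key layout, the `SOURCE_ROLES` table, and `split_columns` are all assumptions made for illustration, not the repository's schema.

```python
# Role of each telemetry source, following the table above.
# The short source names used as key prefixes are hypothetical.
SOURCE_ROLES = {
    "proc": "oracle",      # source 1
    "qmp": "oracle",       # source 2
    "perf": "oracle",      # source 3
    "netflow": "feature",  # source 4
    "guest": "feature",    # source 5
}

def split_columns(row):
    """Split one merged telemetry row into (features, oracles)."""
    features, oracles = {}, {}
    for key, value in row.items():
        source = key.split(".", 1)[0]
        # Anything not explicitly marked deployable stays out of the features.
        target = features if SOURCE_ROLES.get(source) == "feature" else oracles
        target[key] = value
    return features, oracles

row = {
    "proc.cpu_pct": 97.0,
    "qmp.blk_wr_bytes": 4096,
    "netflow.bytes_out": 180,
    "guest.mem_used_kib": 51200,
    "mystery.x": 1,
}
features, oracles = split_columns(row)
```

Unknown sources default to the oracle side, so a newly added collector cannot leak into model inputs until someone classifies it on purpose.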

For an interactive view of any episode (zoom/pan/hover), run:

tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/

Status (86/86 tests passing as of b80986d)

Pipeline (lab-host → Pi → tarball stored)

  • Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
  • POST /v1/ping smoke endpoint (writes nothing, exercises the full auth path)
  • Shipper (shipper/) — tar+zstd, retry/backoff, --ping mode
  • Caddy collector.wg block (in spectral/caddy)
  • Lab-host install script + systemd units (scripts/install-lab-host.sh, etc/cis490-{shipper,orchestrator}.service)
  • Receiver install script (scripts/install-receiver.sh)
  • wg-pki client-CA bootstrap + per-host leaf issuance (in spectral/wg-pki)

Telemetry

  • Source 1 — host /proc/<qemu_pid> @ 10 Hz
  • Source 2 — QEMU QMP @ 1 Hz
  • Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep). Per-episode wiring in EpisodeRunner is tracked in #6.
  • Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine
  • 🚧 Source 3 — perf stat -p <qemu_pid> (#5)
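
The 100 ms netflow aggregation for source 4 amounts to integer bucketing of packet timestamps. The sketch below works on `(timestamp_us, length)` pairs (pcap record headers carry integer seconds and microseconds, so integer arithmetic avoids float rounding at bucket edges). The function name `bucketize_sketch` and the output field names are hypothetical; the real row schema is the one in docs/data-model.md.

```python
from collections import defaultdict

BUCKET_US = 100_000  # 100 ms expressed in microseconds

def bucketize_sketch(packets, bucket_us=BUCKET_US):
    """Aggregate (timestamp_us, length_bytes) pairs into fixed-width buckets.

    Returns one row per non-empty bucket, ordered by time.
    """
    buckets = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts_us, length in packets:
        idx = ts_us // bucket_us          # integer bucket index
        buckets[idx]["packets"] += 1
        buckets[idx]["bytes"] += length
    return [
        {"t_start_us": idx * bucket_us, **buckets[idx]}
        for idx in sorted(buckets)
    ]

packets = [(0, 60), (99_999, 60), (100_000, 1500), (350_000, 40)]
flow_rows = bucketize_sketch(packets)
```

This version omits empty buckets; a consumer that needs a dense 10 Hz grid would fill the gaps with zero rows when loading.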

Orchestrator + drivers

  • Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
  • Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
  • Tier 3 driver v2 — MSFExploitDriver + msfrpc client + per-sample workload dispatch; first canned module vsftpd_234_backdoor.toml
  • Tier 3 integration — needs operator to drop a Metasploitable2 image + run scripts/install-msfrpcd.sh on a lab host
  • 🚧 Tier 4 — MalwareBazaar fetch by sha256 (manifest schema is ready; tracked in #4)

Fleet (multi-VM, multi-host data generation)

  • Resource-aware capacity detector (cores / RAM / load) — orchestrator/fleet.py
  • Concurrent slot runner — tools/run_fleet.py
  • Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order

Topology note: the Pi5 is the WireGuard-side collector that receives episode tarballs from one or more lab hosts. It is not the deployment target for the model. The deployment target is generic ("any constrained Linux device"). See docs/architecture.md.


Quick start — fleet mode (the primary workflow)

git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync

# 1. Build the cidata ISO with the in-guest agent baked in.
uv run python tools/build_cidata.py vm/images/cidata.iso

# 2. See what this host is sized for.
uv run python tools/run_fleet.py --capacity
# cores: 4 (reserve 1)
# ram:   7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
# load:  1m=0.51
# caps:  by_cores=3, by_ram=13, by_load=3
# --> max_concurrent VMs: 3

# 3. Run one wave (= max_concurrent parallel episodes, each with a
#    different sample profile).
uv run python tools/run_fleet.py --waves 1 --data-root data

# 4. Plot any episode (matplotlib WebAgg).
tools/show_envelope.sh data/episodes/<episode_id>

Each episode dir contains:

meta.json              episode metadata (image, sample, profile, fleet capacity)
events.jsonl           orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
labels.jsonl           one row per phase transition — THIS is the envelope
telemetry-proc.jsonl   source 1: host /proc sampler @ 10 Hz
telemetry-qmp.jsonl    source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
telemetry-guest.jsonl  source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
network.pcap           source 4: tcpdump on br-malware
netflow.jsonl          source 4: 100 ms-bucketed pcap aggregation
done.marker            written last; the shipper only sees finished episodes
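
Writing done.marker last gives the shipper a simple readiness test: an episode directory is complete if and only if the marker exists. The sketch below shows both sides of that handshake; `write_episode` and `finished_episodes` are hypothetical helpers written for illustration, not the orchestrator's or shipper's actual code.

```python
import tempfile
from pathlib import Path

def write_episode(ep_dir: Path, finished: bool) -> None:
    """Write a toy episode dir; the marker is created only after the data."""
    ep_dir.mkdir(parents=True)
    (ep_dir / "labels.jsonl").write_text('{"phase": "clean"}\n')
    if finished:
        (ep_dir / "done.marker").touch()   # always the last file written

def finished_episodes(episodes_root: Path):
    """Return episode dirs that contain done.marker, sorted by name."""
    return sorted(p.parent for p in episodes_root.glob("*/done.marker"))

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    write_episode(root / "ep-a", finished=True)
    write_episode(root / "ep-b", finished=False)   # still running
    ready = [p.name for p in finished_episodes(root)]
```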

Quick start — single episode, no fleet

# Tier 2 (no exploit, profile-driven workload):
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample mirai-class-bot

# Tier 3 (real exploit fire via msfrpcd):
MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env; echo $MSFRPC_PASSWORD) \
    uv run python tools/run_tier3_demo.py \
    --module vsftpd_234_backdoor \
    --sample ransomware-mimic \
    --data-root data

Multi-host fleet — how cross-host diversity works

Each lab host's host_id (set in /etc/cis490/lab-host.toml) seeds a deterministic walk through the sample catalog:

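# samples/manifest.py
import hashlib

def select(self, *, host_id, slot, episode_index):
    # Deterministic per-(host, slot, episode) pick: hash the seed and reduce
    # the first 8 digest bytes modulo the catalog size.
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    idx = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % len(self.samples)
    return self.samples[idx]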

So:

  • host=alice slot=0 ep=0 and host=bob slot=0 ep=0 almost certainly pick different samples (test asserts < 25% collision over 20 trials).
  • A single host walks the entire catalog within ~len(manifest) waves (test confirms full coverage in 200 episodes).
  • No coordinator needed — every host independently produces non-overlapping data, and meta.fleet.host_id + meta.sample.name make the join trivial at training time.
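
The properties in the list above can be checked with a standalone version of the selection rule. This sketch assumes the first 8 digest bytes are read as a big-endian integer and that the catalog holds six samples; both are assumptions made for the example, and `pick_index` is a hypothetical free function rather than the manifest's method.

```python
import hashlib

def pick_index(host_id, slot, episode_index, n_samples):
    """Deterministically map (host, slot, episode) to a catalog index."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % n_samples

N_SAMPLES = 6  # assumed catalog size for this illustration

alice = [pick_index("alice", 0, ep, N_SAMPLES) for ep in range(200)]
bob = [pick_index("bob", 0, ep, N_SAMPLES) for ep in range(200)]

coverage = set(alice)                              # indices alice visited
same_pick = sum(a == b for a, b in zip(alice, bob))
print(f"alice covered {len(coverage)}/{N_SAMPLES} samples; "
      f"alice and bob agreed on {same_pick}/200 episodes")
```

With a uniform hash, two hosts should agree on roughly one episode in `N_SAMPLES`, and re-running the script always prints the same numbers because nothing depends on process state.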

The fleet runner shells out to the same tools/run_real_vm_demo.py per slot, with SLOT / RUN_DIR / SAMPLE_NAME env passed through to the launcher. Each VM gets its own QMP socket, agent socket, hostfwd port range, and episode dir, so concurrency is collision-free up to the capacity ceiling.

Repository layout

| Path | What it holds |
|---|---|
| docs/architecture.md | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| docs/threat-model.md | Train/serve parity rule and the oracle-vs-deployable feature split |
| docs/data-model.md | On-disk JSONL schema, per-episode layout, phase enum |
| docs/transport.md | Sender/receiver design — how episodes get to the central collector over WG |
| docs/deploy.md | One-command install for the lab-host and receiver roles |
| docs/lab-setup.md | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| docs/sources.md | Works cited — every tool, dep, sample source, paper, and standard |
| orchestrator/ | Episode runner + fleet.py (capacity detection, concurrent slot driver) |
| collectors/ | One module per telemetry source: proc_qemu, qmp, pcap, guest_agent |
| receiver/ | Starlette app: PUT /v1/episodes + POST /v1/ping, sha256-verified, idempotent |
| shipper/ | Lab-host-side: scan data/episodes/, tar+zstd, PUT over mTLS, retry/backoff |
| vm/ | Launch scripts (launch_demo.sh, launch_target.sh), setup_bridge.sh, in-guest agent at vm/guest-agent/cis490_agent.py. qcow2 images and pcap captures gitignored. |
| tools/ | run_fleet.py, run_real_vm_demo.py, run_tier3_demo.py, build_cidata.py, plot_envelope.py, show_envelope.sh |
| exploits/ | MSF RPC client (msfrpc.py), driver.py (v2 with sample dispatch), workloads.py (six profile-matched in-session loops), per-module TOML configs |
| samples/ | Sample manifest + loader. Binaries land at samples/store/<sha256> (gitignored). |
| scripts/ | install-{lab-host,receiver,msfrpcd}.sh, fetch-metasploitable2.sh |
| training/ | Model training code (deferred — schema first) |
| etc/ | systemd units and config templates (cis490-{receiver,shipper,orchestrator}.service, lab-host.toml.example, receiver.toml.example) |
| AGENTS.md | Conventions for AI agents working on this and sibling spectral repos |

Design decisions — why these choices

  • Why VMs (not Docker)? We need a clean snapshot/revert loop and we need to run real malware without compromising the host. KVM gives both at near-native speed; containers share the host kernel and many samples detect containerization and refuse to detonate. See docs/architecture.md.
  • Why KVM (not TCG/-icount)? The model has to generalize across realistic timing noise, which KVM preserves and deterministic -icount execution would remove. KVM is also ~15× faster than TCG, which directly multiplies dataset size per wall-clock hour. We pin 1 vCPU + cap CPU% via cgroup to preserve the "constrained device" framing.
  • Why JSONL (not a DB yet)? Schema-last. Collect first, decide storage shape after we see what's useful. JSONL is crash-safe, append-only, reshapes trivially into Postgres/Timescale/Parquet.
  • Why two models — realistic vs. oracle? Features that exist on a deployed device train the realistic model. Host-side QEMU telemetry (which doesn't exist in deployment) is oracle-only — used to assign honest labels at training time, never as a model input. The accuracy gap between the two measures how much detection power a privileged rootkit can take from us by lying to in-device tools. See docs/threat-model.md.
  • Why ULIDs for episode ids? Time-sortable, no coordinator, URL-safe.
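
The "crash-safe, append-only" claim for JSONL rests on a simple property: a crash can damage at most the final, unterminated line. The sketch below shows an appender and a reader that tolerates such a torn tail; `append_row` and `read_rows` are hypothetical helpers for illustration, not the collectors' actual writers.

```python
import json
import os
import tempfile

def append_row(path, row):
    """Append one JSON object as a single newline-terminated line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, separators=(",", ":")) + "\n")
        f.flush()
        os.fsync(f.fileno())      # push the line to disk before returning

def read_rows(path):
    """Read rows, ignoring a torn final line left by a crash mid-write."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break             # unterminated tail: the write never finished
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                continue          # skip blank or corrupt lines
    return rows

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "telemetry-proc.jsonl")
    append_row(path, {"t_mono": 0.1, "cpu": 2.0})
    append_row(path, {"t_mono": 0.2, "cpu": 3.5})
    with open(path, "a", encoding="utf-8") as f:
        f.write('{"t_mono": 0.3, "cp')   # simulate a crash mid-write
    recovered = read_rows(path)
```

Calling `fsync` on every row trades throughput for durability; at 10 Hz that cost is small, but a higher-rate collector might batch rows between syncs.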

Deploying the receiver and lab-host roles

Two roles, one bootstrap command each. Detailed in docs/deploy.md:

  • lab-host — runs episodes, ships completed episodes to the receiver.
  • receiver — accepts ship uploads, stores tarballs + appends to index.jsonl. Runs on the Pi5 in our setup.

# On the Pi5 (or any always-on WG node):
sudo ./scripts/install-receiver.sh
# Add the collector.wg block to spectral/caddy (already merged), then:
sudo systemctl enable --now cis490-receiver

# One-time, on the Pi: bootstrap the CIS490 client CA.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh

# On each lab host: enroll via wg-enroll first, then:
sudo ./scripts/install-lab-host.sh
# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
sudo systemctl enable --now cis490-shipper cis490-orchestrator

The orchestrator service runs tools/run_fleet.py --waves 1 per invocation with Restart=always, giving a continuous stream of fresh-sample episodes per host. The shipper picks them up as done.marker files appear and PUTs them to https://collector.wg.

For mTLS leaf-cert minting: spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>.

Threat model and feature-availability split

See docs/threat-model.md for the full argument. The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host /proc/<qemu_pid> | outside guest | oracle (label only) |
| QEMU QMP query-stats etc. | outside guest | oracle (label only) |
| perf stat -p <qemu_pid> | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest /proc, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the features go into the deployed model; the oracles are used to label episodes with high confidence (disagreement between in-guest and host-side data is itself a rootkit signal).


Citing this work

A short course-project citation, until the dataset reaches a publishable form:

Gorog, M. CIS490 Behavioral Malware Detection Dataset (in progress). Spectral lab, 2026.

See docs/sources.md for everything else this project leans on.