Scaffold project: docs, repo skeleton, transport + deploy design

Lays down the design surface for the CIS490 behavioral-malware-detection
dataset and model. No code yet — schema and topology are decided first so
collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine,
  deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature
  split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/,
training/ (each with a short README explaining purpose).
Extended .gitignore to exclude qcow2 images, pcaps, sample binaries,
secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maximus Gorog 2026-04-28 23:21:00 -06:00
parent 7a0fefc02e
commit fa1574a0a6
14 changed files with 1080 additions and 0 deletions

47
.gitignore vendored

@ -1 +1,48 @@
# Disk images and snapshots
*.iso
*.img
*.qcow2
*.qcow2.*
*.vmdk
*.vdi
*.raw
vm/images/
vm/snapshots/
# Telemetry output
data/episodes/
*.pcap
*.pcapng
# Malware samples — NEVER commit binaries
samples/store/
*.bin
*.elf
*.exe
*.dll
*.so.malware
# Python
__pycache__/
*.py[cod]
.venv/
venv/
.pytest_cache/
.mypy_cache/
.ruff_cache/
*.egg-info/
dist/
build/
# Editor
.vscode/
.idea/
*.swp
.DS_Store
# Local secrets (never commit)
.env
.env.local
secrets.toml
*.pat
*.token

51
README.md Normal file

@ -0,0 +1,51 @@
# CIS490 — Behavioral Malware Detection Dataset & Model
Course project for CIS490 (Cybersecurity). The end-goal is an ML model that watches
performance metrics on a real device, decides whether the device has been breached,
and triggers a hardware-level reset when confidence is high enough.
This repository covers the **dataset side** of that pipeline: we run real, public
malware samples against intentionally vulnerable Linux VMs and capture labeled
time-series telemetry that mirrors what the same model would see in deployment on
a Raspberry Pi or similarly-constrained target.
The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803) and a related
proprietary follow-on that pairs detection with blockchain-anchored hardware reset.
## What lives where
| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
| `exploits/` | Metasploit resource scripts for repeatable exploitation |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `training/` | Model training code (deferred — schema first) |
## Quick orientation
1. **Why VMs?** We need a clean snapshot/revert loop and we need to run real malware
without burning hardware. KVM gives us both at near-native speed.
2. **Why is the network isolated?** A host-only bridge keeps malware off the
internet and off the WG overlay. The Pi5 gateway is the **lab-side observer**,
playing the same role it would play in a deployed setting.
3. **Why JSONL and not a database (yet)?** Schema-last: collect first, decide
storage shape after we see what's actually useful. JSONL is crash-safe,
append-only, and reshapes trivially into Postgres/Timescale/Parquet later.
4. **Why two models?** One trained on features that exist on a real Pi
(*deployable*), one trained on host-side QEMU-only features (*oracle*). The
accuracy gap measures how much detection power a privileged rootkit can take
from the deployed model. See [docs/threat-model.md](docs/threat-model.md).
## Status
Project bootstrap. Skeleton, documentation, and design decisions in place;
collection and orchestration code in progress.

23
collectors/README.md Normal file

@ -0,0 +1,23 @@
# collectors/
One module per telemetry source. All collectors:
- Receive an `episode_id`, an output directory, and a shared `t_mono_origin_ns`.
- Write JSONL into `data/episodes/<episode_id>/telemetry-<name>.jsonl`.
- Stamp every row with the same `t_mono_ns` / `t_wall_ns` clock pair.
- Stamp every row with `source` and `available_in_deployment` (true/false).
- Exit cleanly on `SIGTERM` from the orchestrator.
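
The stamping contract above could be met with a small shared helper. This is a minimal sketch, not the project's collector code: `make_row_writer` and its parameters are hypothetical names, and it assumes Python's `time.monotonic_ns()` is an acceptable stand-in for the host monotonic clock.

```python
import io
import json
import time


def make_row_writer(out, source, available_in_deployment, t_mono_origin_ns):
    """Return a function that stamps and appends one JSONL row to `out`."""

    def write_row(fields):
        row = {
            # Episode-relative monotonic time: the shared origin is subtracted.
            "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
            "t_wall_ns": time.time_ns(),
            "source": source,
            "available_in_deployment": available_in_deployment,
        }
        row.update(fields)
        out.write(json.dumps(row) + "\n")
        out.flush()  # keep the file usable if the collector is killed
        return row

    return write_row


# Demo: an in-memory buffer stands in for telemetry-proc.jsonl.
origin = time.monotonic_ns()
buf = io.StringIO()
write_row = make_row_writer(buf, "host_proc", False, origin)
row = write_row({"rss_bytes": 542113792})
```

Passing the origin in, rather than reading it inside each collector, is what lets rows from different files share one timeline.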
| Module | Source | Vantage | Role |
|---|---|---|---|
| `proc_qemu.py` | host `/proc/<qemu_pid>/{stat,io,status,schedstat}` | outside guest | oracle |
| `qmp.py` | QEMU QMP `query-stats`, `query-blockstats`, netdev | outside guest | oracle |
| `perf_qemu.py` | `perf stat -p <qemu_pid>` | outside guest | oracle |
| `pcap.py` | `tcpdump -i br-malware`, bucketed | gateway-side | feature |
| `guest_agent.py` | virtio-serial reader, parses agent JSONL | inside guest | feature |
The in-guest agent itself (a small Python+psutil program that runs on the
guest and writes to `/dev/virtio-ports/cis490.guest.agent`) lives under
`vm/guest-agent/` because it is shipped *into* the guest at image-build time.
See [`docs/data-model.md`](../docs/data-model.md) for row schemas.

107
docs/architecture.md Normal file

@ -0,0 +1,107 @@
# Architecture
## One-paragraph summary
A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
guest. Each episode boots from a clean snapshot, captures a baseline, fires a
known exploit, drops a public malware sample, observes the infection envelope,
and reverts the snapshot. Telemetry is captured from five channels
simultaneously, all stamped with the host monotonic clock so rows align. The
output of an episode is a self-contained directory of JSONL files plus a pcap.
## Lab topology
```
+---------------------------------------------------------------+
|  VM HOST (this machine, /home/maximus/.env/qemu)              |
|                                                               |
|  +-----------------------+      +------------------------+    |
|  |  KVM guest            |      |  orchestrator (host)   |    |
|  |  (Metasploitable2,    |      |   - snapshot loop      |    |
|  |   1 vCPU, capped)     |      |   - exploit driver     |    |
|  |                       |<====>|   - phase labeler      |    |
|  |  in-guest agent ------|virtio|                        |    |
|  |                       |serial|  collectors:           |    |
|  |  vNIC ----------------|      |   * host /proc/qemu_pid|    |
|  |        |              |      |   * QMP query-stats    |    |
|  +--------|--------------+      |   * perf -p qemu_pid   |    |
|           |                     |   * tcpdump on br      |    |
|           v                     |   * guest agent rx     |    |
|  br-malware (host-only, NO NAT) |                        |    |
|           |                     +-----------|------------+    |
|           +--- isolated, no internet        |                 |
|                                             v                 |
|                                             data/episodes/    |
+----------------------------------------------------------|----+
                                                           | (later)
                                                           v
                                            WG overlay -> Pi5 (DB + ingest)
```
The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG
overlay, no DNS. The orchestrator also blocks egress with nftables on the host
as a belt-and-suspenders measure.
## Why KVM, not TCG and not Docker
| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill — ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel — unsafe for malware | ruled out |
KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
multiplies dataset size for a fixed wall-clock budget. The "constrained
single-threaded device" framing from the project goal is preserved by pinning to
1 vCPU and applying a host cgroup CPU cap.
## The episode state machine
```
snapshot_load(baseline)
|
v
[clean] ---- record T_baseline seconds of idle telemetry ----+
| |
v |
[armed] ---- exploit module fires; session opens ------------+
| |
v |
[infecting] ---- sample uploaded + executed -----------------+
| |
v |
[infected_running] ---- observe T_active seconds ------------+
| |
v |
[dormant] ---- (optional) wait for sample's idle window ----+
| |
v |
[reverting] ---- snapshot_load(baseline); episode ends -----+
|
v
write meta.json + close jsonl
```
Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
moment the orchestrator takes the action*, not inferred from metrics afterward.
This is what makes the dataset honestly labeled.
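
A labeler that only accepts the transitions drawn above could look like the following. This is a sketch under assumptions: `PhaseLabeler` and `ALLOWED` are hypothetical names, the transition table is read off the diagram with `dormant` treated as optional, and failure paths (for example, an exploit that never opens a session) are not modeled.

```python
import io
import json

# Allowed next phases, read off the state machine diagram.
ALLOWED = {
    None: {"clean"},
    "clean": {"armed"},
    "armed": {"infecting"},
    "infecting": {"infected_running"},
    "infected_running": {"dormant", "reverting"},  # dormant is optional
    "dormant": {"reverting"},
    "reverting": set(),
}


class PhaseLabeler:
    """Writes one labels.jsonl row per transition, at the moment it is requested."""

    def __init__(self, out):
        self.out = out
        self.phase = None

    def transition(self, phase, reason, t_mono_ns):
        if phase not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase!r} -> {phase!r}")
        row = {"t_mono_ns": t_mono_ns, "phase": phase,
               "prev": self.phase, "reason": reason}
        self.out.write(json.dumps(row) + "\n")
        self.phase = phase
        return row


buf = io.StringIO()
labeler = PhaseLabeler(buf)
labeler.transition("clean", "snapshot_loaded", 0)
row = labeler.transition("armed", "exploit_module_running", 30100000000)
```

Rejecting out-of-order transitions at write time keeps a buggy orchestrator run from silently producing mislabeled episodes.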
## Why the lab topology mirrors deployment
In the field, the ML model runs on a real Pi or constrained device. Whatever
sees the device's network traffic from outside (router, gateway, hypervisor) is
the **gateway observer**. In our lab, the host-only bridge plays exactly that
role — bridge-side pcap features at training time map 1:1 to gateway-side
NetFlow/pcap features at inference time. This is what makes
*train/serve parity* possible for the network channel even though we'll later
run on bare metal.
See [`threat-model.md`](threat-model.md) for the rest of the parity story
(host-side QEMU features must NOT be used as model inputs — they are labeling
oracles only).
## Out of scope for this repo
- Authoring novel malware or zero-day exploits.
- Detection-evasion research targeting other vendors' AV.
- Production deployment of the trained model — that lives in a separate repo.

205
docs/data-model.md Normal file

@ -0,0 +1,205 @@
# Data Model
JSONL only, no database, schema-last. Each episode is a self-contained directory.
## Per-episode layout
```
data/episodes/<episode_id>/
meta.json # one-time, written at start; updated at end with summary
events.jsonl # orchestrator actions, one row per event
labels.jsonl # phase transitions, one row per transition
telemetry-proc.jsonl # source 1 (oracle) host /proc/<qemu_pid>
telemetry-qmp.jsonl # source 2 (oracle) QEMU QMP queries
telemetry-perf.jsonl # source 3 (oracle) perf stat -p <qemu_pid>
telemetry-guest.jsonl # source 5 (feature) in-guest agent over virtio-serial
network.pcap # source 4 raw tcpdump -i br-malware
netflow.jsonl # source 4 bucketed 100ms aggregations of pcap
stderr.log # raw qemu + agent logs
```
`<episode_id>` is a [ULID](https://github.com/ulid/spec) — sortable by time,
unique without coordination, URL-safe.
## Common fields on every telemetry row
| Field | Type | Notes |
|---|---|---|
| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, episode-relative origin |
| `t_wall_ns` | int | host wall clock, ns since epoch |
| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
| `available_in_deployment` | bool | **true = feature, false = oracle** |
The `available_in_deployment` flag is denormalized onto every row so downstream
loaders don't have to look up a separate manifest to filter for the realistic
model.
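
For illustration, a loader for the realistic model might filter on that flag as follows. This is a sketch; `deployable_rows` is a hypothetical helper, not an existing function in the repo.

```python
import json


def deployable_rows(lines):
    """Yield only the rows the realistic model may use as input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Strict check: a missing flag is treated as "not deployable".
        if row.get("available_in_deployment") is True:
            yield row


sample = [
    '{"t_mono_ns": 100000000, "source": "host_proc", "available_in_deployment": false}',
    '{"t_mono_ns": 100000000, "source": "guest_agent", "available_in_deployment": true}',
]
kept = list(deployable_rows(sample))
```

Defaulting to exclusion when the flag is absent errs on the side of keeping oracle data out of the realistic model.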
## meta.json schema
```json
{
"episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
"schema_version": 1,
"started_at_wall": "2026-04-28T22:30:00Z",
"ended_at_wall": "2026-04-28T22:31:42Z",
"git_commit": "<sha>",
"host_fingerprint": {
"kernel": "6.18.8",
"qemu_version": "9.0.0",
"cpu_model": "...",
"smt_off": true
},
"vm": {
"image_name": "metasploitable2",
"image_sha256": "...",
"vcpus": 1,
"ram_mib": 512,
"cgroup_cpu_cap": "800ms/1s",
"snapshot_name": "baseline-v1"
},
"exploit": {
"framework": "metasploit",
"module": "exploit/multi/samba/usermap_script",
"rport": 445,
"rhost": "10.200.0.10"
},
"sample": {
"name": "linux.miner.xmrig.elf",
"sha256": "...",
"source": "MalwareBazaar",
"first_seen": "2024-...",
"category": "miner"
},
"schedule": {
"baseline_seconds": 30,
"infected_seconds": 90,
"dormant_seconds": 60
},
"result": {
"phases_observed": ["clean","armed","infecting","infected_running","dormant"],
"exploit_succeeded": true,
"sample_executed": true,
"snapshot_revert_ok": true
}
}
```
## events.jsonl
One row per orchestrator action. Tells you exactly what happened and when.
```json
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000,"t_wall_ns": 1745875230100000000,"event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000,"t_wall_ns": 1745875231250000000,"event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000,"t_wall_ns": 1745875231300000000,"event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000,"t_wall_ns": 1745875231400000000,"event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000,"t_wall_ns": 1745875321400000000,"event": "snapshot_revert", "snapshot": "baseline-v1"}
```
## labels.jsonl
```json
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running","prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
```
### Phase enum (closed)
```
clean — known-good, post-snapshot-load, pre-exploit
armed — exploit module is running but no session yet
infecting — session opened, sample landing/starting
infected_running — sample is actively producing observable behavior
dormant — sample is present but idle (sleep timer, beacon interval)
reverting — snapshot_load triggered, episode ending
```
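
At training time each telemetry row needs the phase that was in effect at its `t_mono_ns`. A minimal sketch of that join, assuming the label rows are already sorted by time (`phase_at` is a hypothetical helper):

```python
from bisect import bisect_right


def phase_at(labels, t_mono_ns):
    """Return the phase in effect at t_mono_ns, or None before the first label."""
    times = [row["t_mono_ns"] for row in labels]
    i = bisect_right(times, t_mono_ns) - 1
    return labels[i]["phase"] if i >= 0 else None


# Timestamps copied from the labels.jsonl example above.
labels = [
    {"t_mono_ns": 0, "phase": "clean"},
    {"t_mono_ns": 30100000000, "phase": "armed"},
    {"t_mono_ns": 31250000000, "phase": "infecting"},
    {"t_mono_ns": 31400000000, "phase": "infected_running"},
    {"t_mono_ns": 91400000000, "phase": "dormant"},
]
```

A row stamped exactly at a transition time takes the new phase, because `bisect_right` places equal timestamps after the label row.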
## telemetry-proc.jsonl (source 1, oracle)
```json
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "host_proc", "available_in_deployment": false,
"cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
"rss_bytes": 542113792, "vsize_bytes": 1842933760,
"io_read_bytes": 0, "io_write_bytes": 4096,
"voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
"minor_faults": 412, "major_faults": 0
}
```
## telemetry-qmp.jsonl (source 2, oracle)
```json
{
"t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
"source": "host_qmp", "available_in_deployment": false,
"blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
"kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
"netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
```
## telemetry-perf.jsonl (source 3, oracle)
```json
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "host_perf", "available_in_deployment": false,
"cycles": 184213104, "instructions": 121987001,
"cache_references": 1041213, "cache_misses": 38104,
"branches": 24198421, "branch_misses": 412004,
"page_faults": 12, "context_switches": 18,
"ipc": 0.66, "cache_miss_rate": 0.0366
}
```
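
The two derived fields follow directly from the raw counters in the same row. A small sketch of the arithmetic; the function name and the rounding precision are assumptions chosen to match the example row:

```python
def derived_perf_fields(cycles, instructions, cache_references, cache_misses):
    """Compute instructions-per-cycle and cache miss rate, guarding zero divisors."""
    ipc = instructions / cycles if cycles else 0.0
    miss_rate = cache_misses / cache_references if cache_references else 0.0
    return {"ipc": round(ipc, 2), "cache_miss_rate": round(miss_rate, 4)}


# Counter values taken from the example row above.
fields = derived_perf_fields(184213104, 121987001, 1041213, 38104)
```

Storing the raw counters alongside the ratios means the ratios can always be recomputed if the rounding choice changes.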
## netflow.jsonl (source 4, feature)
Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.
```json
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "bridge_pcap", "available_in_deployment": true,
"bucket_ms": 100,
"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
"unique_dst_ips": 0, "unique_dst_ports": 0,
"syn_count": 0, "fin_count": 0, "rst_count": 0,
"dns_query_count": 0, "tcp_new_flows": 0
}
```
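
The bucketing step could be sketched as below. Assumptions, none of which come from the schema itself: packets arrive as already-parsed tuples, `direction` is `"in"` or `"out"` relative to the guest, each row's `t_mono_ns` is the bucket start, and only a subset of the fields is computed. Unlike the all-zero example row above, this sketch does not emit empty buckets.

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms, matching "bucket_ms": 100


def bucket_packets(packets, bucket_ns=BUCKET_NS):
    """Aggregate (t_mono_ns, direction, length, dst_ip, dst_port) tuples."""
    buckets = defaultdict(lambda: {
        "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
        "dst_ips": set(), "dst_ports": set(),
    })
    for t_mono_ns, direction, length, dst_ip, dst_port in packets:
        start = (t_mono_ns // bucket_ns) * bucket_ns  # floor to bucket start
        b = buckets[start]
        b["pkts_" + direction] += 1
        b["bytes_" + direction] += length
        b["dst_ips"].add(dst_ip)
        b["dst_ports"].add(dst_port)
    rows = []
    for start in sorted(buckets):
        b = buckets[start]
        rows.append({
            "t_mono_ns": start,
            "bucket_ms": bucket_ns // 1_000_000,
            "pkts_in": b["pkts_in"], "pkts_out": b["pkts_out"],
            "bytes_in": b["bytes_in"], "bytes_out": b["bytes_out"],
            "unique_dst_ips": len(b["dst_ips"]),
            "unique_dst_ports": len(b["dst_ports"]),
        })
    return rows


packets = [
    (10_000_000, "out", 64, "10.200.0.1", 53),
    (50_000_000, "out", 100, "10.200.0.1", 80),
    (120_000_000, "in", 60, "10.200.0.10", 4444),
]
rows = bucket_packets(packets)
```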
## telemetry-guest.jsonl (source 5, feature)
```json
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "guest_agent", "available_in_deployment": true,
"cpu_pct_total": 12.4, "load_1m": 0.41,
"mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
"thermal_milli_c": 47200,
"net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
"top_procs": [
{"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
{"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
],
"listen_ports": [22, 80, 445]
}
```
## Versioning
`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep
old episodes untouched; loaders dispatch on version.
## Ingest later
When we move to a database (Timescale most likely), each `telemetry-*.jsonl`
becomes one hypertable, partitioned by `t_wall_ns`, indexed on
`(episode_id, source)`. The deployment-tag flag becomes a column we filter on
when materializing the realistic-model training view.

138
docs/deploy.md Normal file

@ -0,0 +1,138 @@
# Deploy
Two roles. One install command each.
## Roles
| Role | Where it runs | What it does |
|---|---|---|
| `lab-host` | any KVM-capable Linux box on WG | runs episodes, ships completed episodes to the receiver |
| `receiver` | Pi5 (or any always-on WG node) | accepts ship uploads, stores tarballs + `index.jsonl` |
## Lab host install
```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
./scripts/install-lab-host.sh
```
The installer:
1. Verifies KVM (`/dev/kvm` exists, user in `kvm` group).
2. Installs system deps via the host package manager (qemu, tcpdump,
linux-tools/perf, zstd, python ≥ 3.11).
3. Bootstraps a [`uv`](https://github.com/astral-sh/uv)-managed venv at
`.venv/` and installs the pinned Python deps from `uv.lock`.
4. Drops two systemd units into `/etc/systemd/system/`:
- `cis490-orchestrator.service` — runs the episode loop on a queue
- `cis490-shipper.service` — watches `data/episodes/` and ships completed
episodes
5. Writes a config template to `/etc/cis490/lab-host.toml` (idempotent — only
on first install).
You finish by editing `/etc/cis490/lab-host.toml` to point at your receiver
and to enroll your lab host's WG-issued client cert, then:
```sh
sudo systemctl enable --now cis490-orchestrator cis490-shipper
```
### `lab-host.toml`
```toml
host_id = "lab-host-1"
[paths]
data_root = "/var/lib/cis490/data"
samples_store = "/var/lib/cis490/samples/store"
qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"
[receiver]
url = "https://collector.wg"
client_cert = "/etc/cis490/certs/lab-host-1.pem"
client_key = "/etc/cis490/certs/lab-host-1.key"
ca_bundle = "/etc/cis490/certs/wg-ca.pem"
[episode]
baseline_seconds = 30
infected_seconds = 90
dormant_seconds = 60
[retention]
keep_local_for_days = 7
prune_at_disk_pct = 80
```
## Receiver install
On the Pi5 (or designated central node):
```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
./scripts/install-receiver.sh
```
The installer:
1. Installs Python ≥ 3.11 + zstd + a tiny ASGI runner (uvicorn).
2. Bootstraps the same `uv`-managed venv.
3. Drops `cis490-receiver.service` listening on `127.0.0.1:8443` (TLS
terminated by the existing Caddy in `spectral/caddy`, which already binds
`*.wg`).
4. Writes a config template to `/etc/cis490/receiver.toml`.
Caddy block (added to your `spectral/caddy` config) for the receiver:
```caddy
collector.wg {
tls internal
reverse_proxy 127.0.0.1:8443 {
transport http {
tls
tls_client_auth /etc/cis490/certs/wg-ca.pem
}
}
}
```
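(The intent is for mTLS to terminate at the receiver, not Caddy, so the
receiver sees the client cert and can enforce per-host policies later. The
block above does not achieve that yet: `reverse_proxy` ends the lab host's TLS
session at Caddy and opens a new one to the backend, and `tls_client_auth`
inside `transport http` names the certificate Caddy itself presents upstream,
which takes a cert/key pair rather than a CA bundle.)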
### `receiver.toml`
```toml
listen_addr = "127.0.0.1:8443"
store_root = "/var/lib/cis490/episodes"
incoming_root = "/var/lib/cis490/incoming"
index_path = "/var/lib/cis490/index.jsonl"
ca_bundle = "/etc/cis490/certs/wg-ca.pem"
[limits]
max_episode_bytes = 268_435_456 # 256 MiB
```
## Day-2 operations
```sh
# How many episodes have been shipped?
ssh collector.wg 'wc -l /var/lib/cis490/index.jsonl'
# What's in the outbox on a lab host? (failed/pending shipments)
ls /var/lib/cis490/data/outbox/
# Tail the orchestrator log
journalctl -u cis490-orchestrator -f
# Tail the shipper log
journalctl -u cis490-shipper -f
```
## Updating
```sh
git pull
./scripts/install-lab-host.sh # idempotent; re-syncs deps and units
sudo systemctl restart cis490-orchestrator cis490-shipper
```

145
docs/lab-setup.md Normal file

@ -0,0 +1,145 @@
# Lab Setup
How to bring up the host, build the guest, and verify the snapshot loop.
## Host prerequisites
```
qemu-system-x86_64 >= 8.0
qemu-img >= 8.0
bridge-utils
tcpdump / tshark
linux-tools-common (for `perf`)
zstd
python >= 3.11
uv (https://github.com/astral-sh/uv)
```
`scripts/install-lab-host.sh` installs all of these and wires up systemd —
see [`deploy.md`](deploy.md).
KVM must be enabled in the kernel and the user must be in the `kvm` group:
```
ls /dev/kvm # must exist
groups # must include kvm
```
## Network: host-only malware bridge
`br-malware` (10.200.0.1/24) is the only network the guest sees, and it is
host-only — no NAT, no upstream route. The host's WG interface is on a
*separate* link (`wg0`) used only for shipping completed episodes to the
collector; the bridge and WG never touch.
| Interface | Purpose |
|---|---|
| `br-malware` (10.200.0.1/24) | host-only bridge, only NIC attached to the guest |
| guest `eth0` | DHCP from a dnsmasq bound only to `br-malware` |
| host WG (`wg0`) | shipping channel to the collector — not connected to the bridge |
> Detailed firewall rules and the egress-drop safety net are out of scope for
> this document and live in the deploy script. The relevant invariant for
> readers is: **the guest cannot route off `br-malware`, period.**
## Guest: Metasploitable 2
1. Download from the [Rapid7 mirror](https://information.rapid7.com/download-metasploitable-2017.html)
(verify sha256 against the published value before use).
2. Convert VMware → qcow2:
```
qemu-img convert -O qcow2 -p Metasploitable.vmdk metasploitable2.qcow2
```
3. First boot (no snapshot yet) — let it come up, log in (msfadmin/msfadmin),
confirm services are listening on the expected ports, shut down cleanly.
4. Take the baseline snapshot:
```
qemu-img snapshot -c baseline-v1 metasploitable2.qcow2
```
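Reverting to an internal qcow2 snapshot takes well under a second, which makes
it the "factory reset" mechanism for every episode. A snapshot created with
`qemu-img snapshot -c` on a powered-off image holds disk state only, so the
guest still boots from that disk after each revert.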
## Single-vCPU constrained-device emulation
```
-cpu host -smp 1,sockets=1,cores=1,threads=1
-m 512
-machine type=q35,accel=kvm
```
Plus a host-side cgroup CPU cap on the QEMU process (e.g. 80% of one core) so
the guest behaves like a small, constrained device under load.
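
As an illustration of the cap, the `800ms/1s` value recorded in `meta.json` could be translated into a cgroup v2 `cpu.max` line as follows. This sketch assumes cgroup v2 (the docs do not say which cgroup version is used) and that the kernel accepts periods between 1 ms and 1 s; `cpu_max_value` is a hypothetical helper, and where the value gets written is left to the install script.

```python
def cpu_max_value(quota_ms, period_ms=1000):
    """Format a cgroup v2 cpu.max value: "<quota_us> <period_us>"."""
    if not 1 <= period_ms <= 1000:
        raise ValueError("period must be between 1 ms and 1 s")
    if quota_ms < 1:
        raise ValueError("quota must be at least 1 ms")
    # cpu.max is expressed in microseconds.
    return f"{quota_ms * 1000} {period_ms * 1000}"


value = cpu_max_value(800, 1000)  # 80% of one core
```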
## Telemetry channels
### virtio-serial for the in-guest agent
```
-device virtio-serial-pci
-chardev socket,path=/run/qemu/guest-agent.sock,server=on,wait=off,id=ga
-device virtserialport,chardev=ga,name=cis490.guest.agent
```
The in-guest agent opens `/dev/virtio-ports/cis490.guest.agent` and writes
JSONL to it. Host side, the orchestrator reads from the unix socket. No network
involvement = the malware cannot interfere with this channel.
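
On the host side, reading that channel amounts to splitting a byte stream into lines and parsing each one. A minimal sketch, with a `socketpair` standing in for the QEMU chardev socket; `iter_jsonl` is a hypothetical helper, and dropping unparseable lines is a design choice of this sketch rather than documented behavior.

```python
import json
import socket


def iter_jsonl(sock, bufsize=4096):
    """Yield one parsed object per newline-terminated line read from sock."""
    pending = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        pending += chunk
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            if not line.strip():
                continue
            try:
                yield json.loads(line)
            except ValueError:
                # A garbled line from the guest must not kill the collector.
                continue


# Demo: the second row is deliberately split across two writes.
host_end, guest_end = socket.socketpair()
guest_end.sendall(b'{"cpu_pct_total": 12.4}\n{"cpu_pct')
guest_end.sendall(b'_total": 13.0}\nnot json\n')
guest_end.close()
rows = list(iter_jsonl(host_end))
host_end.close()
```

In the lab the socket would instead be an `AF_UNIX` stream socket connected to `/run/qemu/guest-agent.sock`.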
### QMP for live oracle queries
```
-qmp unix:/run/qemu/qmp.sock,server=on,wait=off
```
The orchestrator polls `query-stats`, `query-blockstats`, and netdev stats over
this socket.
### perf stat on the QEMU process
```
perf stat -p <qemu_pid> -I 100 \
-e cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,context-switches \
-x , -o telemetry-perf.csv
```
The collector tails the CSV, parses, and emits JSONL.
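
A sketch of the parsing step follows. It assumes the column order perf documents for `-x` with `-I` and no per-CPU breakdown (timestamp, value, unit, event, run time, percent running, then optional metric columns). The sample lines are hand-written for illustration, not captured output, and `parse_perf_csv` is a hypothetical name.

```python
import csv


def parse_perf_csv(lines):
    """Group `perf stat -I ... -x ,` lines into {timestamp: {event: count}}."""
    intervals = {}
    for rec in csv.reader(lines):
        # Skip comment/header lines and anything too short to hold an event.
        if len(rec) < 4 or rec[0].lstrip().startswith("#"):
            continue
        timestamp, value, event = rec[0].strip(), rec[1].strip(), rec[3].strip()
        if not value.isdigit():
            continue  # e.g. "<not counted>" or "<not supported>"
        field = event.replace("-", "_")  # cache-misses -> cache_misses
        intervals.setdefault(timestamp, {})[field] = int(value)
    return intervals


sample = [
    "# started on (timestamp elided)",
    "     0.100185520,184213104,,cycles,100134567,100.00,,",
    "     0.100185520,121987001,,instructions,100134567,100.00,0.66,insn per cycle",
    "     0.100185520,<not counted>,,cache-misses,0,100.00,,",
]
parsed = parse_perf_csv(sample)
```

Each interval's dict can then be stamped with the common fields and written as one `telemetry-perf.jsonl` row.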
### tcpdump on `br-malware`
```
tcpdump -i br-malware -w network.pcap -B 4096 -s 200
```
Post-process to `netflow.jsonl` with 100ms buckets.
## Snapshot loop sanity check
A green light before any data collection:
1. `qemu-img snapshot -l metasploitable2.qcow2` shows `baseline-v1`.
2. Boot the VM with the qcow2.
3. Touch a file in the guest. Shut down.
4. `qemu-img snapshot -a baseline-v1 metasploitable2.qcow2`.
5. Boot again. The file is gone. ✅
## Safety checks before running real samples
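- `ip route show table all | grep br-malware` shows nothing beyond the bridge's own `10.200.0.0/24` entries: no default route and no gateway via the bridge.
- `dig @10.200.0.1 example.com` (the host's bridge address) from the guest fails (no DNS for malware).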
- The host's WG interface is **not** bridged to `br-malware`.
(See `scripts/install-lab-host.sh` for the firewall plumbing — it isn't the
focus of this project.)
## Where to put VMs and snapshots
```
vm/images/ # qcow2 disk images (gitignored)
vm/snapshots/ # named snapshot exports if we ever externalize them
```
Both directories are gitignored. The repo only carries the *recipes* for
reproducing them.

94
docs/threat-model.md Normal file

@ -0,0 +1,94 @@
# Threat Model & Train/Serve Parity
The single most important design rule in this project:
> **A feature used by the deployed model must exist on the deployed device.**
Violating this rule produces a model that scores 99% in the lab and is useless in
the field. This document spells out which features fall on which side of that
line, and why we still bother capturing both.
## The setting
The deployed model runs on a real, non-virtualized device — a Raspberry Pi, an
IoT endpoint, or similar. It tries to detect the moment that device gets
breached. Two adversarial facts shape the design:
1. **Malware can lie to in-device tools.** A sufficiently-privileged rootkit can
hook `/proc`, intercept `perf_event_open`, and hide its own processes.
2. **There is no host-side QEMU view.** The deployed device is the actual
machine. Nothing is watching it from outside *the OS itself*.
So the model has two trustworthy floors:
- **In-device features that survive most malware** (perf counters via the syscall
interface, thermals, gross resource counters): possible to falsify in principle,
but in practice most commodity malware doesn't bother.
- **Off-device features at the gateway** (network telemetry observed by an
upstream router/gateway) — physics-bound, the malware cannot prevent bytes
from leaving the NIC.
## Two roles: features vs. oracles
Every measurement we capture in the lab gets one of two roles:
| Role | What it's used for | Available in deployment? |
|---|---|---|
| **Feature** | Input to the trained model | **Must be yes** |
| **Oracle** | Ground-truth labeling during training only | No — but we have it in the lab |
The oracle channels (host `/proc/<qemu_pid>`, QMP `query-stats`,
`perf -p qemu_pid`) are how we know with certainty what the guest is *actually*
doing — not what it claims to be doing. We use that certainty to assign correct
labels in `labels.jsonl`. The deployable model never sees them as inputs; only
the oracle model described below trains on them.
## Channel taxonomy
| # | Channel | Vantage | Role | Why |
|---|---|---|---|---|
| 1 | Host `/proc/<qemu_pid>` | outside guest | oracle | doesn't exist on real device |
| 2 | QEMU QMP `query-stats`, `query-blockstats` | outside guest | oracle | same |
| 3 | `perf stat -p <qemu_pid>` | outside guest | oracle | same |
| 4 | Bridge-side pcap (`tcpdump -i br-malware`) | gateway | **feature** | matches Pi5 gateway in field |
| 5 | In-guest `/proc/*`, `perf_event_open`, `/sys/class/thermal/*` | inside guest | **feature** | same exact source on real device |
Note: in-guest features (5) are the same syscall surfaces we'd read on a real
Pi. The data we capture from them in the lab and the data we capture from them
on a Pi at deployment are pulled from identical kernel APIs — that is what makes
parity hold.
## The two-model evaluation strategy
We will train two classifiers from the same dataset:
1. **Realistic model** — trained only on features (channels 4 + 5).
*This is what would be deployed.*
2. **Oracle model** — trained on everything (channels 1-5).
*This is the upper bound on what was learnable from this dataset.*
The interesting numbers are:
- The realistic model's accuracy → **how well the deployable system actually works**.
- The gap between realistic and oracle → **how much detection power a privileged
rootkit can take away from us by lying** (the cost of in-device measurement
being adversarial).
- Realistic-without-pcap vs. realistic-with-pcap → **how much the gateway
observer is worth** (the value of off-device telemetry).
Reporting all three is the paper-worthy result — not just one accuracy number.
## Using oracle disagreement as a label
When the in-guest agent reports CPU at 8% and the host says the QEMU process is
burning 70%, that *is* the rootkit-active signal — and it's more reliable than
any single in-guest measurement. Episodes where this divergence appears get a
high-confidence `infected_running` label even if the in-guest data alone looks
ambiguous. This is the practical payoff of capturing both sides.
## What we are not claiming
- We are not claiming to detect kernel rootkits robustly from in-guest data alone.
The oracle/feature gap will quantify the limit.
- We are not claiming the trained model is safe to deploy without the gateway
observer in production — for the strongest threat model, gateway-side fusion
is required.

164
docs/transport.md Normal file

@ -0,0 +1,164 @@
# Transport — Centralized Episode Collection over WG
The dataset lives wherever it is convenient to train from. In our setup that is
the Pi5 (or whatever the team designates as the central collector), reachable
over the WG overlay at `<receiver-host>.wg`. This document describes how
episodes get from a lab host to the central collector.
## Design goals
1. **Easy to deploy.** One config file, one systemd unit per side. No DB
required to start collecting.
2. **WG-native.** Sender and receiver both live on the WG overlay; transport is
just HTTPS over WG. We use the existing wg-pki CA for mTLS.
3. **Idempotent.** Re-shipping the same episode is safe and cheap; the
receiver responds 200 if the bytes already match.
4. **Crash-safe.** Lab host crash mid-episode does not corrupt the central
store. Receiver crash mid-upload leaves no partial visible.
5. **Schema-free.** The receiver does not parse JSONL; it stores tarballs and
an append-only index. The schema lives only at training time.
## What gets shipped
A complete episode directory is tarred and zstd-compressed:
```
data/episodes/<episode_id>/ → <episode_id>.tar.zst
```
The orchestrator marks an episode complete by writing a `done.marker` file into
the directory as its last step, after `meta.json` is finalized. The shipper only
considers directories that contain `done.marker` — partially-written episodes
are invisible to it.
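
The marker check reduces to a directory scan. A minimal sketch, where `complete_episodes` is a hypothetical helper and the episode ids are placeholders:

```python
import tempfile
from pathlib import Path


def complete_episodes(episodes_root):
    """Return episode directories that contain done.marker, sorted by name."""
    root = Path(episodes_root)
    return sorted(marker.parent for marker in root.glob("*/done.marker"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "01AAA").mkdir()
    (root / "01AAA" / "done.marker").touch()
    (root / "01BBB").mkdir()  # still being written: no marker yet
    found = [p.name for p in complete_episodes(root)]
```

Because ULIDs sort lexicographically by creation time, sorting by directory name ships the oldest episodes first.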
## Wire protocol
```
PUT https://<receiver-host>.wg/v1/episodes/<host_id>/<episode_id>.tar.zst
Content-Type: application/zstd
Content-Length: <bytes>
X-Content-SHA256: <sha256-of-body>
X-Schema-Version: 1
X-Lab-Host: <host_id>
X-Episode-Id: <episode_id>
body: <the tar.zst bytes>
```
Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver
trusts only certs issued by that CA.
Responses:
| Status | Meaning |
|---|---|
| 201 | Stored; new |
| 200 | Already present with matching sha256; nothing to do |
| 409 | Already present with **different** sha256; receiver refuses to overwrite |
| 4xx | Bad request (missing header, malformed id, etc.) |
| 5xx | Server error; sender retries with backoff |
There is no DELETE. Episodes are immutable once shipped.
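
To make the request shape concrete, here is a sketch that builds the URL and headers and maps a response status to the sender's next step. `build_put` and `action_for` are hypothetical names, hex encoding of the digest is an assumption (the protocol above does not name an encoding), and the handling of other 4xx codes is a guess since the sender state machine does not cover them.

```python
import hashlib


def build_put(receiver_host, host_id, episode_id, body, schema_version=1):
    """Return (url, headers) for one episode upload."""
    url = f"https://{receiver_host}/v1/episodes/{host_id}/{episode_id}.tar.zst"
    headers = {
        "Content-Type": "application/zstd",
        "Content-Length": str(len(body)),
        "X-Content-SHA256": hashlib.sha256(body).hexdigest(),
        "X-Schema-Version": str(schema_version),
        "X-Lab-Host": host_id,
        "X-Episode-Id": episode_id,
    }
    return url, headers


def action_for(status):
    """Map a response status to what the shipper does next."""
    if status in (200, 201):
        return "mark_shipped"
    if status == 409:
        return "alert"       # different bytes already stored; manual triage
    if 500 <= status < 600:
        return "retry"       # with backoff
    return "reject"          # other 4xx: assumed non-retryable


url, headers = build_put("collector.wg", "lab-host-1",
                         "01HW9GZJ7K8QF5W3X2Y6N1A4B0", b"abc")
```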
## Sender (`shipper`) state machine
```
scan data/episodes/
|
v
for each <id>/done.marker:
|
v
tar+zstd → data/outbox/<id>.tar.zst.partial
|
v
rename → data/outbox/<id>.tar.zst (atomic; visible to retry loop)
|
v
PUT to receiver
|
+-- 200/201 → mv data/episodes/<id> data/shipped/<id>;
| rm data/outbox/<id>.tar.zst
|
+-- 409 → log mismatch, leave files in place, alert (manual triage)
|
+-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry
```
The shipper does the same scan on every wake-up, so a crash mid-tar or
mid-PUT is harmless — the next pass picks up wherever it left off.
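
The retry schedule in the diagram (1s, 2s, 4s, 8s, capped at 5 minutes) can be sketched as a generator. `backoff_delays` is a hypothetical name, and jitter is omitted because the design above does not mention it.

```python
from itertools import islice


def backoff_delays(base_s=1, cap_s=300):
    """Yield retry delays in seconds, doubling each time up to cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)


first = list(islice(backoff_delays(), 11))
```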
## Receiver state machine
```
PUT body received
   |
   v
stream into /var/lib/cis490/incoming/<host>/<id>.tar.zst.partial
   |
   v
compute sha256 while streaming
   |
   +-- mismatch with header → 400, delete partial
   |
   +-- match:
         |
         v
       if final path exists:
         |
         +-- existing sha256 == new sha256 → 200, delete partial
         |
         +-- existing sha256 != new sha256 → 409, delete partial
       else:
         |
         v
       atomic rename → /var/lib/cis490/episodes/<host>/<id>.tar.zst
         |
         v
       append index.jsonl row
         |
         v
       201
```
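The branching in the diagram above reduces to a pure function of three digests, which makes it easy to test apart from any web framework. The name `receiver_status` is hypothetical, and the case-insensitive comparison is an assumption about how hex digests are exchanged.

```python
from typing import Optional


def receiver_status(header_sha256: str, computed_sha256: str,
                    existing_sha256: Optional[str]) -> int:
    """Return the HTTP status for a fully received upload.

    existing_sha256 is None when no file exists at the final path.
    """
    header = header_sha256.lower()
    computed = computed_sha256.lower()
    if computed != header:
        return 400  # body does not match X-Content-SHA256
    if existing_sha256 is None:
        return 201  # caller renames the partial into place and appends to the index
    if existing_sha256.lower() == computed:
        return 200  # idempotent re-upload
    return 409      # same id, different bytes: refuse to overwrite
```

In every branch except 201 the caller deletes the `.partial` file, as the diagram shows.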
`index.jsonl` row:
```json
{
  "received_at_wall": "2026-04-28T22:31:43Z",
  "host_id": "lab-host-1",
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "sha256": "...",
  "size_bytes": 8412331,
  "schema_version": 1
}
```
That index is the closest thing to a database we have until we decide on one.
A trainer can stream it to know what episodes exist, then untar on demand.
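A sketch of how a trainer might stream the index, assuming one JSON object per line with the fields shown above. The helper names and the choice to skip rows with a different `schema_version` are assumptions; the location of `index.jsonl` is not specified here, so it is passed in.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_index(index_path: Path, schema_version: int = 1) -> Iterator[dict]:
    """Yield index rows one at a time, skipping blank lines and other schema versions."""
    with index_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if row.get("schema_version") == schema_version:
                yield row


def tarball_path(store_root: Path, row: dict) -> Path:
    """Where the receiver layout above puts the tarball for one index row."""
    return store_root / row["host_id"] / f'{row["episode_id"]}.tar.zst'
```

With `store_root` set to `/var/lib/cis490/episodes`, `tarball_path` gives the file to untar for a given row.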
## Why not just rsync?
`rsync` works, but:
- No schema-version tagging at the protocol layer.
- No clean way to enforce "immutable once written".
- mTLS via WG-issued certs is more uniform with the rest of the overlay than
ssh-key juggling.
- A tiny FastAPI receiver is also a natural place to add ingest-time hooks
later (e.g. emit a Matrix notification on successful receipt, kick off a
training run when N new episodes arrive).
We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a
class project that is unlikely.
## Operational notes
- **Disk on lab host.** The shipper keeps episodes locally in
`data/shipped/<id>/` until a retention pass prunes them. Default retention:
7 days *or* 80% disk usage, whichever comes first.
- **Disk on receiver.** No retention enforced by default — the central store
is the dataset.
- **Backpressure.** If the receiver is unreachable (WG down, Pi rebooting),
the shipper accumulates tarballs in `data/outbox/` and keeps retrying. No data
is lost as long as the lab host has disk space for the backlog.
- **Multiple lab hosts.** Each writes under its own `<host_id>/` prefix. No
coordination needed; episode ids are globally unique (ULID).
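The lab-host retention rule can be sketched as below. Several details are assumptions, since the notes above only state the two triggers: directory mtime stands in for "time shipped", the names are hypothetical, and when the disk trigger fires every shipped episode becomes a candidate (oldest first), with the caller expected to stop deleting once usage drops below the threshold.

```python
import shutil
from pathlib import Path

MAX_AGE_SECONDS = 7 * 24 * 3600   # 7 days
MAX_DISK_FRACTION = 0.80          # 80% disk usage


def should_prune(age_seconds: float, disk_fraction: float) -> bool:
    """True when either retention trigger has fired."""
    return age_seconds >= MAX_AGE_SECONDS or disk_fraction >= MAX_DISK_FRACTION


def disk_used_fraction(path: Path) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def prune_candidates(shipped_root: Path, now: float, disk_fraction: float) -> list[Path]:
    """Shipped episode directories eligible for deletion, oldest first."""
    dirs = sorted((p for p in shipped_root.iterdir() if p.is_dir()),
                  key=lambda p: p.stat().st_mtime)
    return [p for p in dirs if should_prune(now - p.stat().st_mtime, disk_fraction)]
```

Only `data/shipped/` is ever pruned in this sketch; `data/outbox/` is left alone so unshipped tarballs survive a full disk.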

12
exploits/README.md Normal file

@ -0,0 +1,12 @@
# exploits/
Metasploit resource scripts (`*.rc`) that drive specific exploit modules
deterministically — same inputs, same module options, every time.
Each script:
- Sets `RHOSTS` to the guest's bridge IP.
- Sets a payload that opens a session usable for sample upload + execute.
- Avoids any options that introduce randomness in the exploit fire timing
(so that the `armed → infecting` transition lands at a predictable offset).
These scripts pair with public Metasploit modules. We do not author exploits.

21
orchestrator/README.md Normal file

@ -0,0 +1,21 @@
# orchestrator/
The state machine that drives a single **episode**:
```
snapshot_load → clean → armed → infecting → infected_running → dormant → reverting
```
Responsibilities:
- Bring up the host-only bridge and verify isolation before the guest starts.
- Boot the guest from a named snapshot.
- Spawn the five telemetry collectors (`collectors/`) with a shared episode id
and shared monotonic clock origin.
- Drive the Metasploit Framework over RPC to fire the configured exploit module.
- Upload + execute the configured malware sample once a session is open.
- Emit phase transitions to `labels.jsonl` *at the moment the action is taken*.
- Revert the snapshot at episode end.
- Write `meta.json` with the result summary.
Implementation lives in this directory and is imported as `orchestrator.*`.
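A minimal sketch of the phase sequence and of emitting a label at the moment a transition happens. The row fields (`episode_id`, `phase`, `t_mono`) are hypothetical placeholders, not the schema from the data-model doc, and the strictly linear `next_phase` ignores failure paths such as an exploit that never opens a session.

```python
import json
import time
from enum import Enum
from typing import Optional, TextIO


class Phase(str, Enum):
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


ORDER = list(Phase)  # Enum iteration follows definition order


def next_phase(current: Phase) -> Optional[Phase]:
    """The phase that follows `current`, or None after the last one."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


def emit_transition(out: TextIO, episode_id: str, phase: Phase, t0_mono: float) -> dict:
    """Write one label row immediately; field names are hypothetical."""
    row = {
        "episode_id": episode_id,
        "phase": phase.value,
        "t_mono": time.monotonic() - t0_mono,  # offset from the shared clock origin
    }
    out.write(json.dumps(row) + "\n")
    out.flush()  # do not let buffering delay the label
    return row
```

Calling `emit_transition` on the line directly before or after the action it labels keeps the recorded offset close to the real transition time.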

33
samples/README.md Normal file

@ -0,0 +1,33 @@
# samples/
**Sample binaries are NEVER committed to this repo.** This directory holds:
- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata
(source, category, expected behavior, target CVE).
- `fetch.py` — script that pulls samples from configured sources
(MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them
under `samples/store/` (gitignored).
- Per-sample notes in markdown describing observed behavior in our lab.
`samples/store/` lives only on the lab host. It is gitignored *and* should
sit on a disk that is not auto-mounted on developer workstations.
## Manifest entry shape (placeholder)
```yaml
samples:
  - name: linux.miner.xmrig.elf
    sha256: "..."          # pinned
    source: MalwareBazaar
    category: miner
    target_cve: null       # cryptominers are usually post-exploit payloads
    behavior: "high CPU, periodic stratum protocol traffic"
    pairs_with_exploit: exploit/multi/samba/usermap_script
```
## Safety rules
- Only download to the lab host, never to a developer workstation.
- Verify sha256 immediately, before any other read.
- Keep the directory on a path that is *not* on the WG overlay.
- Re-verify sha256 before each detonation; refuse to run on mismatch.
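The verify-then-refuse rule can be sketched with the standard library alone. The function names are hypothetical; `fetch.py` does not exist yet, so this is not its actual implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex sha256 of a file, read in chunks so large samples stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_or_refuse(path: Path, pinned_sha256: str) -> None:
    """Raise before any detonation if the file does not match its manifest pin."""
    actual = sha256_of(path)
    if actual != pinned_sha256.lower():
        raise RuntimeError(f"sha256 mismatch for {path.name}: refusing to run")
```

The same check serves both rules: run it right after download and again before each detonation.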

23
training/README.md Normal file

@ -0,0 +1,23 @@
# training/
Deferred until the dataset has substance. The plan, recorded so we don't lose
it:
1. Two models will be trained from the same episodes:
   - **Realistic**: only features flagged `available_in_deployment: true`.
   - **Oracle**: all rows, regardless of the deployment flag.
2. Baseline architecture: a rolling-window feature builder + a gradient-boosted
   trees classifier (XGBoost or LightGBM). Cheap, strong, interpretable.
3. Window: 15-second sliding windows with per-channel summary stats
   (mean, std, p95, slope, count of zero buckets).
4. Target: the phase enum from `labels.jsonl`, projected onto each window's
   center timestamp.
5. Evaluation:
   - Held-out *samples* (not just held-out time slices), because generalization
     to unseen malware matters more than within-sample accuracy.
   - Confusion matrix + per-phase precision/recall.
   - Realistic vs. oracle gap, reported.
6. Stretch: trust-over-time scoring per the IEEE 9881803 paper, with a reset
   threshold tuned for low false-positive cost.
See [`docs/threat-model.md`](../docs/threat-model.md) for why this split exists.
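Items 3 and 4 of the plan can be sketched with the standard library. Everything here is an assumption about details the plan leaves open: the names are hypothetical, each window is taken to be a list of equally spaced buckets for one channel, `std` is the population standard deviation, `p95` uses the nearest-rank convention, and `slope` is a least-squares fit against bucket index.

```python
import bisect
import statistics
from typing import Sequence


def window_stats(values: Sequence[float]) -> dict:
    """Summary stats for one channel over one window of equally spaced buckets."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two buckets per window")
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    # Least-squares slope of value against bucket index.
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
             / sum((x - x_mean) ** 2 for x in range(n)))
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return {
        "mean": y_mean,
        "std": statistics.pstdev(values),
        "p95": sorted(values)[rank - 1],
        "slope": slope,
        "zero_buckets": sum(1 for v in values if v == 0),
    }


def phase_at(t_center: float, transitions: Sequence[tuple[float, str]]) -> str:
    """Phase in effect at t_center, given (timestamp, phase) pairs sorted by time."""
    times = [t for t, _ in transitions]
    i = bisect.bisect_right(times, t_center) - 1
    if i < 0:
        raise ValueError("window center precedes the first transition")
    return transitions[i][1]
```

Labeling by the window center means a window that straddles a transition takes the phase that covers its midpoint, which is one reasonable reading of "projected onto each window's center timestamp".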

17
vm/README.md Normal file

@ -0,0 +1,17 @@
# vm/
Recipes and helpers for building and snapshotting guest VMs. Disk images and
snapshots themselves are gitignored — this directory carries the *how*, not
the bytes.
```
vm/
  images/              # qcow2 staging (gitignored)
  snapshots/           # exported snapshots if needed (gitignored)
  guest-agent/         # in-guest telemetry agent (shipped into the guest)
  metasploitable2.md   # download/convert/snapshot procedure (TODO)
  custom-debian/       # cloud-init for our own vulnerable Debian (TODO)
```
See [`docs/lab-setup.md`](../docs/lab-setup.md) for the full host + guest
bring-up procedure.