From fa1574a0a69323816db1b500f043e5d6faf7c321 Mon Sep 17 00:00:00 2001 From: Maximus Gorog Date: Tue, 28 Apr 2026 23:21:00 -0600 Subject: [PATCH] Scaffold project: docs, repo skeleton, transport + deploy design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet — schema and topology are decided first so collection can start without rework. Docs: - README: project goal, navigation - architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning - threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy - data-model: per-episode JSONL layout, row schemas, phase enum - transport: WG-native shipper/receiver design, idempotent uploads - deploy: one-command install for lab-host and receiver roles - lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose). Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .gitignore | 47 ++++++++++ README.md | 51 ++++++++++ collectors/README.md | 23 +++++ docs/architecture.md | 107 +++++++++++++++++++++ docs/data-model.md | 205 +++++++++++++++++++++++++++++++++++++++++ docs/deploy.md | 138 +++++++++++++++++++++++++++ docs/lab-setup.md | 145 +++++++++++++++++++++++++++++ docs/threat-model.md | 94 +++++++++++++++++++ docs/transport.md | 164 +++++++++++++++++++++++++++++++++ exploits/README.md | 12 +++ orchestrator/README.md | 21 +++++ samples/README.md | 33 +++++++ training/README.md | 23 +++++ vm/README.md | 17 ++++ 14 files changed, 1080 insertions(+) create mode 100644 README.md create mode 100644 collectors/README.md create mode 100644 docs/architecture.md create mode 100644 docs/data-model.md create mode 100644 docs/deploy.md create mode 100644 docs/lab-setup.md create mode 100644 docs/threat-model.md create mode 100644 docs/transport.md create mode 100644 exploits/README.md create mode 100644 orchestrator/README.md create mode 100644 samples/README.md create mode 100644 training/README.md create mode 100644 vm/README.md diff --git a/.gitignore b/.gitignore index 6267c43..432eda9 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,48 @@ +# Disk images and snapshots *.iso +*.img +*.qcow2 +*.qcow2.* +*.vmdk +*.vdi +*.raw +vm/images/ +vm/snapshots/ + +# Telemetry output +data/episodes/ +*.pcap +*.pcapng + +# Malware samples — NEVER commit binaries +samples/store/ +*.bin +*.elf +*.exe +*.dll +*.so.malware + +# Python +__pycache__/ +*.py[cod] +.venv/ +venv/ +.pytest_cache/ +.mypy_cache/ +.ruff_cache/ +*.egg-info/ +dist/ +build/ + +# Editor +.vscode/ +.idea/ +*.swp +.DS_Store + +# Local secrets (never commit) +.env +.env.local +secrets.toml +*.pat +*.token diff --git a/README.md b/README.md new file mode 100644 index 0000000..a1882f3 --- /dev/null +++ b/README.md @@ -0,0 +1,51 @@ +# CIS490 — Behavioral Malware Detection Dataset & Model + +Course project for CIS490 (Cybersecurity). 
The end-goal is an ML model that watches +performance metrics on a real device, decides whether the device has been breached, +and triggers a hardware-level reset when confidence is high enough. + +This repository covers the **dataset side** of that pipeline: we run real, public +malware samples against intentionally vulnerable Linux VMs and capture labeled +time-series telemetry that mirrors what the same model would see in deployment on +a Raspberry Pi or similarly-constrained target. + +The work is grounded in the trust-over-time scoring model from +[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803) and a related +proprietary follow-on that pairs detection with blockchain-anchored hardware reset. + +## What lives where + +| Path | What it holds | +|---|---| +| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning | +| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split | +| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum | +| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG | +| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles | +| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring | +| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop | +| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) | +| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) | +| `exploits/` | Metasploit resource scripts for repeatable exploitation | +| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** | +| `training/` | Model training code (deferred — schema first) | + +## Quick orientation + +1. 
**Why VMs?** We need a clean snapshot/revert loop and we need to run real malware + without burning hardware. KVM gives us both at near-native speed. +2. **Why is the network isolated?** A host-only bridge keeps malware off the + internet and off the WG overlay. The Pi5 gateway is the **lab-side observer**, + playing the same role it would play in a deployed setting. +3. **Why JSONL and not a database (yet)?** Schema-last: collect first, decide + storage shape after we see what's actually useful. JSONL is crash-safe, + append-only, and reshapes trivially into Postgres/Timescale/Parquet later. +4. **Why two models?** One trained on features that exist on a real Pi + (*deployable*), one trained on host-side QEMU-only features (*oracle*). The + accuracy gap measures how much detection power a privileged rootkit can take + from the deployed model. See [docs/threat-model.md](docs/threat-model.md). + +## Status + +Project bootstrap. Skeleton, documentation, and design decisions in place; +collection and orchestration code in progress. diff --git a/collectors/README.md b/collectors/README.md new file mode 100644 index 0000000..0032d67 --- /dev/null +++ b/collectors/README.md @@ -0,0 +1,23 @@ +# collectors/ + +One module per telemetry source. All collectors: + +- Receive an `episode_id`, an output directory, and a shared `t_mono_origin_ns`. +- Write JSONL into `data/episodes//telemetry-.jsonl`. +- Stamp every row with the same `t_mono_ns` / `t_wall_ns` clock pair. +- Stamp every row with `source` and `available_in_deployment` (true/false). +- Exit cleanly on `SIGTERM` from the orchestrator. 
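A minimal sketch of the row-stamping contract in the list above. The helper names `stamp_row` and `append_jsonl` are hypothetical and do not come from this repository. It assumes `t_mono_ns` is stored relative to the shared `t_mono_origin_ns`, as `docs/data-model.md` describes.

```python
import json
import time


def stamp_row(payload, source, available_in_deployment, t_mono_origin_ns):
    """Return payload with the common clock pair and role fields added."""
    row = {
        # Monotonic time relative to the shared episode origin (assumption).
        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
        # Wall clock in nanoseconds since the Unix epoch.
        "t_wall_ns": time.time_ns(),
        "source": source,
        "available_in_deployment": available_in_deployment,
    }
    row.update(payload)
    return row


def append_jsonl(path, row):
    """Append one row as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
```

Writing one complete line per append keeps a partially written file readable up to the last newline.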
+ +| Module | Source | Vantage | Role | +|---|---|---|---| +| `proc_qemu.py` | host `/proc//{stat,io,status,schedstat}` | outside guest | oracle | +| `qmp.py` | QEMU QMP `query-stats`, `query-blockstats`, netdev | outside guest | oracle | +| `perf_qemu.py` | `perf stat -p ` | outside guest | oracle | +| `pcap.py` | `tcpdump -i br-malware`, bucketed | gateway-side | feature | +| `guest_agent.py` | virtio-serial reader, parses agent JSONL | inside guest | feature | + +The in-guest agent itself (a small Python+psutil program that runs on the +guest and writes to `/dev/virtio-ports/cis490.guest.agent`) lives under +`vm/guest-agent/` because it is shipped *into* the guest at image-build time. + +See [`docs/data-model.md`](../docs/data-model.md) for row schemas. diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..41e4126 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,107 @@ +# Architecture + +## One-paragraph summary + +A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux +guest. Each episode boots from a clean snapshot, captures a baseline, fires a +known exploit, drops a public malware sample, observes the infection envelope, +and reverts the snapshot. Telemetry is captured from five vantage points +simultaneously, all stamped with the host monotonic clock so rows align. The +output of an episode is a self-contained directory of JSONL files plus a pcap. 
+ +## Lab topology + +``` ++---------------------------------------------------------------+ +| VM HOST (this machine, /home/maximus/.env/qemu) | +| | +| +-----------------------+ +------------------------+ | +| | KVM guest | | orchestrator (host) | | +| | (Metasploitable2, | | - snapshot loop | | +| | 1 vCPU, capped) | | - exploit driver | | +| | |<====>| - phase labeler | | +| | in-guest agent -----|virtio| | | +| | |serial| collectors: | | +| | vNIC ----------------| | * host /proc/qemu_pid| | +| | | | | * QMP query-stats | | +| +--------|--------------+ | * perf -p qemu_pid | | +| | | * tcpdump on br | | +| v | * guest agent rx | | +| br-malware (host-only, NO NAT) | | | +| | +-----------|------------+ | +| +--- isolated, no internet | | +| v | +| data/episodes/ ++----------------------------------------------------------|----+ + | (later) + v + WG overlay -> Pi5 (DB + ingest) +``` + +The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG +overlay, no DNS. The orchestrator also blocks egress with nftables on the host +as a belt-and-suspenders measure. + +## Why KVM, not TCG and not Docker + +| Option | Speed | Determinism | Real OS isolation | Verdict | +|---|---|---|---|---| +| TCG `-icount` | slow | bit-exact replay | yes | overkill — ML wants noise | +| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** | +| Docker | fastest | low | shares host kernel — unsafe for malware | ruled out | + +KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly +multiplies dataset size for a fixed wall-clock budget. The "constrained +single-threaded device" framing from the project goal is preserved by pinning to +1 vCPU and applying a host cgroup CPU cap. 
+ +## The episode state machine + +``` + snapshot_load(baseline) + | + v + [clean] ---- record T_baseline seconds of idle telemetry ----+ + | | + v | + [armed] ---- exploit module fires; session opens ------------+ + | | + v | + [infecting] ---- sample uploaded + executed -----------------+ + | | + v | + [infected_running] ---- observe T_active seconds ------------+ + | | + v | + [dormant] ---- (optional) wait for sample's idle window ----+ + | | + v | + [reverting] ---- snapshot_load(baseline); episode ends -----+ + | + v + write meta.json + close jsonl +``` + +Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the +moment the orchestrator takes the action*, not inferred from metrics afterward. +This is what makes the dataset honestly labeled. + +## Why the lab topology mirrors deployment + +In the field, the ML model runs on a real Pi or constrained device. Whatever +sees the device's network traffic from outside (router, gateway, hypervisor) is +the **gateway observer**. In our lab, the host-only bridge plays exactly that +role — bridge-side pcap features at training time map 1:1 to gateway-side +NetFlow/pcap features at inference time. This is what makes +*train/serve parity* possible for the network channel even though we'll later +run on bare metal. + +See [`threat-model.md`](threat-model.md) for the rest of the parity story +(host-side QEMU features must NOT be used as model inputs — they are labeling +oracles only). + +## Out of scope for this repo + +- Authoring novel malware or zero-day exploits. +- Detection-evasion research targeting other vendors' AV. +- Production deployment of the trained model — that lives in a separate repo. diff --git a/docs/data-model.md b/docs/data-model.md new file mode 100644 index 0000000..4540e2c --- /dev/null +++ b/docs/data-model.md @@ -0,0 +1,205 @@ +# Data Model + +JSONL only, no database, schema-last. Each episode is a self-contained directory. 
+ +## Per-episode layout + +``` +data/episodes// + meta.json # one-time, written at start; updated at end with summary + events.jsonl # orchestrator actions, one row per event + labels.jsonl # phase transitions, one row per transition + telemetry-proc.jsonl # source 1 (oracle) host /proc/ + telemetry-qmp.jsonl # source 2 (oracle) QEMU QMP queries + telemetry-perf.jsonl # source 3 (oracle) perf stat -p + telemetry-guest.jsonl # source 5 (feature) in-guest agent over virtio-serial + network.pcap # source 4 raw tcpdump -i br-malware + netflow.jsonl # source 4 bucketed 100ms aggregations of pcap + stderr.log # raw qemu + agent logs +``` + +`` is a [ULID](https://github.com/ulid/spec) — sortable by time, +unique without coordination, URL-safe. + +## Common fields on every telemetry row + +| Field | Type | Notes | +|---|---|---| +| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, episode-relative origin | +| `t_wall_ns` | int | host wall clock, ns since epoch | +| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` | +| `available_in_deployment` | bool | **true = feature, false = oracle** | + +The `available_in_deployment` flag is denormalized onto every row so downstream +loaders don't have to look up a separate manifest to filter for the realistic +model. 
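As an illustration of why the flag is denormalized, a loader can filter for the realistic model with no manifest lookup. This is a sketch only: `load_rows` is a hypothetical name and the two sample rows are made up for the example.

```python
import json


def load_rows(lines, deployable_only=True):
    """Parse JSONL lines; optionally keep only deployment-available rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if deployable_only and not row.get("available_in_deployment", False):
            continue  # oracle row: excluded from the realistic model
        rows.append(row)
    return rows


# Illustrative rows, not real telemetry.
sample = [
    '{"t_mono_ns": 100000000, "source": "host_proc", "available_in_deployment": false}',
    '{"t_mono_ns": 100000000, "source": "guest_agent", "available_in_deployment": true}',
]
features = load_rows(sample)
```

Rows that lack the flag are treated as oracle rows here, which is a conservative assumption rather than something the schema specifies.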
+ +## meta.json schema + +```json +{ + "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0", + "schema_version": 1, + "started_at_wall": "2026-04-28T22:30:00Z", + "ended_at_wall": "2026-04-28T22:31:42Z", + "git_commit": "", + "host_fingerprint": { + "kernel": "6.18.8", + "qemu_version": "9.0.0", + "cpu_model": "...", + "smt_off": true + }, + "vm": { + "image_name": "metasploitable2", + "image_sha256": "...", + "vcpus": 1, + "ram_mib": 512, + "cgroup_cpu_cap": "800ms/1s", + "snapshot_name": "baseline-v1" + }, + "exploit": { + "framework": "metasploit", + "module": "exploit/multi/samba/usermap_script", + "rport": 445, + "rhost": "10.200.0.10" + }, + "sample": { + "name": "linux.miner.xmrig.elf", + "sha256": "...", + "source": "MalwareBazaar", + "first_seen": "2024-...", + "category": "miner" + }, + "schedule": { + "baseline_seconds": 30, + "infected_seconds": 90, + "dormant_seconds": 60 + }, + "result": { + "phases_observed": ["clean","armed","infecting","infected_running","dormant"], + "exploit_succeeded": true, + "sample_executed": true, + "snapshot_revert_ok": true + } +} +``` + +## events.jsonl + +One row per orchestrator action. Tells you exactly what happened and when. 
+ +```json +{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"} +{"t_mono_ns": 30100000000,"t_wall_ns": 1745875230100000000,"event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"} +{"t_mono_ns": 31250000000,"t_wall_ns": 1745875231250000000,"event": "session_open", "session_id": 1} +{"t_mono_ns": 31300000000,"t_wall_ns": 1745875231300000000,"event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."} +{"t_mono_ns": 31400000000,"t_wall_ns": 1745875231400000000,"event": "sample_executed", "pid_reported_by_guest": 1042} +{"t_mono_ns": 121400000000,"t_wall_ns": 1745875321400000000,"event": "snapshot_revert", "snapshot": "baseline-v1"} +``` + +## labels.jsonl + +```json +{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"} +{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"} +{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"} +{"t_mono_ns": 31400000000, "phase": "infected_running","prev": "infecting", "reason": "sample_executed"} +{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"} +``` + +### Phase enum (closed) + +``` +clean — known-good, post-snapshot-load, pre-exploit +armed — exploit module is running but no session yet +infecting — session opened, sample landing/starting +infected_running — sample is actively producing observable behavior +dormant — sample is present but idle (sleep timer, beacon interval) +reverting — snapshot_load triggered, episode ending +``` + +## telemetry-proc.jsonl (source 1, oracle) + +```json +{ + "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000, + "source": "host_proc", "available_in_deployment": false, + "cpu_user_jiffies": 142, "cpu_sys_jiffies": 38, + "rss_bytes": 542113792, "vsize_bytes": 1842933760, + "io_read_bytes": 0, "io_write_bytes": 4096, + "voluntary_ctxsw": 12, 
"involuntary_ctxsw": 3, + "minor_faults": 412, "major_faults": 0 +} +``` + +## telemetry-qmp.jsonl (source 2, oracle) + +```json +{ + "t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000, + "source": "host_qmp", "available_in_deployment": false, + "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}}, + "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110}, + "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}} +} +``` + +## telemetry-perf.jsonl (source 3, oracle) + +```json +{ + "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000, + "source": "host_perf", "available_in_deployment": false, + "cycles": 184213104, "instructions": 121987001, + "cache_references": 1041213, "cache_misses": 38104, + "branches": 24198421, "branch_misses": 412004, + "page_faults": 12, "context_switches": 18, + "ipc": 0.66, "cache_miss_rate": 0.0366 +} +``` + +## netflow.jsonl (source 4, feature) + +Bucketed from the pcap. The pcap stays raw on disk for re-derivation later. 
+ +```json +{ + "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000, + "source": "bridge_pcap", "available_in_deployment": true, + "bucket_ms": 100, + "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0, + "unique_dst_ips": 0, "unique_dst_ports": 0, + "syn_count": 0, "fin_count": 0, "rst_count": 0, + "dns_query_count": 0, "tcp_new_flows": 0 +} +``` + +## telemetry-guest.jsonl (source 5, feature) + +```json +{ + "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000, + "source": "guest_agent", "available_in_deployment": true, + "cpu_pct_total": 12.4, "load_1m": 0.41, + "mem_used_bytes": 184213504, "mem_available_bytes": 354127872, + "thermal_milli_c": 47200, + "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}}, + "top_procs": [ + {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576}, + {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304} + ], + "listen_ports": [22, 80, 445] +} +``` + +## Versioning + +`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep +old episodes untouched; loaders dispatch on version. + +## Ingest later + +When we move to a database (Timescale most likely), each `telemetry-*.jsonl` +becomes one hypertable, partitioned by `t_wall_ns`, indexed on +`(episode_id, source)`. The deployment-tag flag becomes a column we filter on +when materializing the realistic-model training view. diff --git a/docs/deploy.md b/docs/deploy.md new file mode 100644 index 0000000..7228764 --- /dev/null +++ b/docs/deploy.md @@ -0,0 +1,138 @@ +# Deploy + +Two roles. One install command each. 
+ +## Roles + +| Role | Where it runs | What it does | +|---|---|---| +| `lab-host` | any KVM-capable Linux box on WG | runs episodes, ships completed episodes to the receiver | +| `receiver` | Pi5 (or any always-on WG node) | accepts ship uploads, stores tarballs + `index.jsonl` | + +## Lab host install + +```sh +git clone https://maxgit.wg/spectral/CIS490.git +cd CIS490 +./scripts/install-lab-host.sh +``` + +The installer: + +1. Verifies KVM (`/dev/kvm` exists, user in `kvm` group). +2. Installs system deps via the host package manager (qemu, tcpdump, + linux-tools/perf, zstd, python ≥ 3.11). +3. Bootstraps a [`uv`](https://github.com/astral-sh/uv)-managed venv at + `.venv/` and installs the pinned Python deps from `uv.lock`. +4. Drops two systemd units into `/etc/systemd/system/`: + - `cis490-orchestrator.service` — runs the episode loop on a queue + - `cis490-shipper.service` — watches `data/episodes/` and ships completed + episodes +5. Writes a config template to `/etc/cis490/lab-host.toml` (idempotent — only + on first install). 
+ +You finish by editing `/etc/cis490/lab-host.toml` to point at your receiver +and to enroll your lab host's WG-issued client cert, then: + +```sh +sudo systemctl enable --now cis490-orchestrator cis490-shipper +``` + +### `lab-host.toml` + +```toml +host_id = "lab-host-1" + +[paths] +data_root = "/var/lib/cis490/data" +samples_store = "/var/lib/cis490/samples/store" +qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2" + +[receiver] +url = "https://collector.wg" +client_cert = "/etc/cis490/certs/lab-host-1.pem" +client_key = "/etc/cis490/certs/lab-host-1.key" +ca_bundle = "/etc/cis490/certs/wg-ca.pem" + +[episode] +baseline_seconds = 30 +infected_seconds = 90 +dormant_seconds = 60 + +[retention] +keep_local_for_days = 7 +prune_at_disk_pct = 80 +``` + +## Receiver install + +On the Pi5 (or designated central node): + +```sh +git clone https://maxgit.wg/spectral/CIS490.git +cd CIS490 +./scripts/install-receiver.sh +``` + +The installer: + +1. Installs Python ≥ 3.11 + zstd + a tiny ASGI server (uvicorn). +2. Bootstraps the same `uv`-managed venv. +3. Drops `cis490-receiver.service` listening on `127.0.0.1:8443` (TLS + terminated by the existing Caddy in `spectral/caddy`, which already binds + `*.wg`). +4. Writes a config template to `/etc/cis490/receiver.toml`. + +Caddy block (added to your `spectral/caddy` config) for the receiver: + +```caddy +collector.wg { + tls internal + reverse_proxy 127.0.0.1:8443 { + transport http { + tls + tls_client_auth /etc/cis490/certs/wg-ca.pem + } + } +} +``` + +(mTLS terminates at the receiver, not Caddy, so the receiver sees the +client cert and can enforce per-host policies later.) 
+ +### `receiver.toml` + +```toml +listen_addr = "127.0.0.1:8443" +store_root = "/var/lib/cis490/episodes" +incoming_root = "/var/lib/cis490/incoming" +index_path = "/var/lib/cis490/index.jsonl" +ca_bundle = "/etc/cis490/certs/wg-ca.pem" + +[limits] +max_episode_bytes = 268_435_456 # 256 MiB +``` + +## Day-2 operations + +```sh +# How many episodes have been shipped? +ssh collector.wg 'wc -l /var/lib/cis490/index.jsonl' + +# What's in the outbox on a lab host? (failed/pending shipments) +ls /var/lib/cis490/data/outbox/ + +# Tail the orchestrator log +journalctl -u cis490-orchestrator -f + +# Tail the shipper log +journalctl -u cis490-shipper -f +``` + +## Updating + +```sh +git pull +./scripts/install-lab-host.sh # idempotent; re-syncs deps and units +sudo systemctl restart cis490-orchestrator cis490-shipper +``` diff --git a/docs/lab-setup.md b/docs/lab-setup.md new file mode 100644 index 0000000..d4a82d4 --- /dev/null +++ b/docs/lab-setup.md @@ -0,0 +1,145 @@ +# Lab Setup + +How to bring up the host, build the guest, and verify the snapshot loop. + +## Host prerequisites + +``` +qemu-system-x86_64 >= 8.0 +qemu-img >= 8.0 +bridge-utils +tcpdump / tshark +linux-tools-common (for `perf`) +zstd +python >= 3.11 +uv (https://github.com/astral-sh/uv) +``` + +`scripts/install-lab-host.sh` installs all of these and wires up systemd — +see [`deploy.md`](deploy.md). + +KVM must be enabled in the kernel and the user must be in the `kvm` group: + +``` +ls /dev/kvm # must exist +groups # must include kvm +``` + +## Network: host-only malware bridge + +`br-malware` (10.200.0.1/24) is the only network the guest sees, and it is +host-only — no NAT, no upstream route. The host's WG interface is on a +*separate* link (`wg0`) used only for shipping completed episodes to the +collector; the bridge and WG never touch. 
+ +| Interface | Purpose | +|---|---| +| `br-malware` (10.200.0.1/24) | host-only bridge, only NIC attached to the guest | +| guest `eth0` | DHCP from a dnsmasq bound only to `br-malware` | +| host WG (`wg0`) | shipping channel to the collector — not connected to the bridge | + +> Detailed firewall rules and the egress-drop safety net are out of scope for +> this document and live in the deploy script. The relevant invariant for +> readers is: **the guest cannot route off `br-malware`, period.** + +## Guest: Metasploitable 2 + +1. Download from the [Rapid7 mirror](https://information.rapid7.com/download-metasploitable-2017.html) + (verify sha256 against the published value before use). +2. Convert VMware → qcow2: + + ``` + qemu-img convert -O qcow2 -p Metasploitable.vmdk metasploitable2.qcow2 + ``` + +3. First boot (no snapshot yet) — let it come up, log in (msfadmin/msfadmin), + confirm services are listening on the expected ports, shut down cleanly. +4. Take the baseline snapshot: + + ``` + qemu-img snapshot -c baseline-v1 metasploitable2.qcow2 + ``` + + Internal qcow2 snapshots load in well under a second — this is the + "factory reset" mechanism for every episode. + +## Single-vCPU constrained-device emulation + +``` +-cpu host -smp 1,sockets=1,cores=1,threads=1 +-m 512 +-machine type=q35,accel=kvm +``` + +Plus a host-side cgroup CPU cap on the QEMU process (e.g. 80% of one core) so +the guest behaves like a small, constrained device under load. + +## Telemetry channels + +### virtio-serial for the in-guest agent + +``` +-device virtio-serial-pci +-chardev socket,path=/run/qemu/guest-agent.sock,server=on,wait=off,id=ga +-device virtserialport,chardev=ga,name=cis490.guest.agent +``` + +The in-guest agent opens `/dev/virtio-ports/cis490.guest.agent` and writes +JSONL to it. Host side, the orchestrator reads from the unix socket. No network +involvement = the malware cannot interfere with this channel. 
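Reads from the host-side unix socket can arrive in arbitrary chunks, so a reader has to reassemble complete lines before parsing them. The sketch below shows one way to do that. `JsonlReassembler` is a hypothetical name, and this is not necessarily how the orchestrator does it.

```python
import json


class JsonlReassembler:
    """Buffer byte chunks and emit one parsed row per complete line."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk):
        """Add a chunk; return the rows completed by it (possibly none)."""
        self._buf += chunk
        # Everything before the last newline is complete; the tail is kept.
        *lines, self._buf = self._buf.split(b"\n")
        return [json.loads(line) for line in lines if line.strip()]
```

A caller would pass each `recv()` result from the socket to `feed()` and stamp the returned rows with host clocks before writing them out.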
+ +### QMP for live oracle queries + +``` +-qmp unix:/run/qemu/qmp.sock,server=on,wait=off +``` + +The orchestrator polls `query-stats`, `query-blockstats`, and netdev stats over +this socket. + +### perf stat on the QEMU process + +``` +perf stat -p -I 100 \ + -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,context-switches \ + -x , -o telemetry-perf.csv +``` + +The collector tails the CSV, parses, and emits JSONL. + +### tcpdump on `br-malware` + +``` +tcpdump -i br-malware -w network.pcap -B 4096 -s 200 +``` + +Post-process to `netflow.jsonl` with 100ms buckets. + +## Snapshot loop sanity check + +A green light before any data collection: + +1. `qemu-img snapshot -l metasploitable2.qcow2` shows `baseline-v1`. +2. Boot the VM with the qcow2. +3. Touch a file in the guest. Shut down. +4. `qemu-img snapshot -a baseline-v1 metasploitable2.qcow2`. +5. Boot again. The file is gone. ✅ + +## Safety checks before running real samples + +- `ip route show table all | grep br-malware` shows no route off the bridge. +- `dig @host example.com` from a guest fails (no DNS for malware). +- The host's WG interface is **not** bridged to `br-malware`. + +(See `scripts/install-lab-host.sh` for the firewall plumbing — it isn't the +focus of this project.) + +## Where to put VMs and snapshots + +``` +vm/images/ # qcow2 disk images (gitignored) +vm/snapshots/ # named snapshot exports if we ever externalize them +``` + +Both directories are gitignored. The repo only carries the *recipes* for +reproducing them. diff --git a/docs/threat-model.md b/docs/threat-model.md new file mode 100644 index 0000000..0bd8463 --- /dev/null +++ b/docs/threat-model.md @@ -0,0 +1,94 @@ +# Threat Model & Train/Serve Parity + +The single most important design rule in this project: + +> **A feature used by the deployed model must exist on the deployed device.** + +Violating this rule produces a model that scores 99% in the lab and is useless in +the field. 
This document spells out which features fall on which side of that +line, and why we still bother capturing both. + +## The setting + +The deployed model runs on a real, non-virtualized device: a Raspberry Pi, an +IoT endpoint, or similar. It tries to detect the moment that device gets +breached. Two adversarial facts shape the design: + +1. **Malware can lie to in-device tools.** A sufficiently privileged rootkit can + hook `/proc`, intercept `perf_event_open`, and hide its own processes. +2. **There is no host-side QEMU view.** The deployed device is the actual + machine. Nothing is watching it from outside *the OS itself*. + +So the model has two trustworthy floors: + +- **In-device features that survive most malware** (perf counters via the syscall + interface, thermals, gross resource counters): easy to falsify in principle, + but in practice most commodity malware doesn't bother. +- **Off-device features at the gateway** (network telemetry observed by an + upstream router/gateway): physics-bound, because malware cannot hide bytes + that have left the NIC from an upstream observer. + +## Two roles: features vs. oracles + +Every measurement we capture in the lab gets one of two roles: + +| Role | What it's used for | Available in deployment? | +|---|---|---| +| **Feature** | Input to the trained model | **Must be yes** | +| **Oracle** | Ground-truth labeling during training only | No, but we have it in the lab | + +The oracle channels (host `/proc/`, QMP `query-stats`, +`perf -p qemu_pid`) are how we know with certainty what the guest is *actually* +doing, not what it claims to be doing. We use that certainty to assign correct +labels in `labels.jsonl`. The deployable model never takes them as inputs. 
+ +## Channel taxonomy + +| # | Channel | Vantage | Role | Why | +|---|---|---|---|---| +| 1 | Host `/proc/` | outside guest | oracle | doesn't exist on real device | +| 2 | QEMU QMP `query-stats`, `query-blockstats` | outside guest | oracle | same | +| 3 | `perf stat -p ` | outside guest | oracle | same | +| 4 | Bridge-side pcap (`tcpdump -i br-malware`) | gateway | **feature** | matches Pi5 gateway in field | +| 5 | In-guest `/proc/*`, `perf_event_open`, `/sys/class/thermal/*` | inside guest | **feature** | same exact source on real device | + +Note: in-guest features (5) are the same syscall surfaces we'd read on a real +Pi. The data we capture from them in the lab and the data we capture from them +on a Pi at deployment are pulled from identical kernel APIs — that is what makes +parity hold. + +## The two-model evaluation strategy + +We will train two classifiers from the same dataset: + +1. **Realistic model** — trained only on features (channels 4 + 5). + *This is what would be deployed.* +2. **Oracle model** — trained on everything (channels 1–5). + *This is the upper bound on what was learnable from this dataset.* + +The interesting numbers are: + +- The realistic model's accuracy → **how well the deployable system actually works**. +- The gap between realistic and oracle → **how much detection power a privileged + rootkit can take away from us by lying** (the cost of in-device measurement + being adversarial). +- Realistic-without-pcap vs. realistic-with-pcap → **how much the gateway + observer is worth** (the value of off-device telemetry). + +Reporting all three is the paper-worthy result — not just one accuracy number. + +## Using oracle disagreement as a label + +When the in-guest agent reports CPU at 8% and the host says the QEMU process is +burning 70%, that *is* the rootkit-active signal — and it's more reliable than +any single in-guest measurement. 
Episodes where this divergence appears get a +high-confidence `infected_running` label even if the in-guest data alone looks +ambiguous. This is the practical payoff of capturing both sides. + +## What we are not claiming + +- We are not claiming to detect kernel rootkits robustly from in-guest data alone. + The oracle/feature gap will quantify the limit. +- We are not claiming the trained model is safe to deploy without the gateway + observer in production — for the strongest threat model, gateway-side fusion + is required. diff --git a/docs/transport.md b/docs/transport.md new file mode 100644 index 0000000..f314008 --- /dev/null +++ b/docs/transport.md @@ -0,0 +1,164 @@ +# Transport — Centralized Episode Collection over WG + +The dataset lives wherever it is convenient to train from. In our setup that is +the Pi5 (or whatever the team designates as the central collector), reachable +over the WG overlay at `.wg`. This document describes how +episodes get from a lab host to the central collector. + +## Design goals + +1. **Easy to deploy.** One config file, one systemd unit per side. No DB + required to start collecting. +2. **WG-native.** Sender and receiver both live on the WG overlay; transport is + just HTTPS over WG. We use the existing wg-pki CA for mTLS. +3. **Idempotent.** Re-shipping the same episode is safe and cheap; the + receiver responds 200 if the bytes already match. +4. **Crash-safe.** Lab host crash mid-episode does not corrupt the central + store. Receiver crash mid-upload leaves no partial visible. +5. **Schema-free.** The receiver does not parse JSONL; it stores tarballs and + an append-only index. The schema lives only at training time. + +## What gets shipped + +A complete episode directory is tarred and zstd-compressed: + +``` +data/episodes// → .tar.zst +``` + +The orchestrator marks an episode complete by writing a `done.marker` file at +the *end* of the directory after `meta.json` is finalized. 
The shipper only +considers directories that contain `done.marker` — partially-written episodes +are invisible to it. + +## Wire protocol + +``` +PUT https://.wg/v1/episodes//.tar.zst + Content-Type: application/zstd + Content-Length: + X-Content-SHA256: + X-Schema-Version: 1 + X-Lab-Host: + X-Episode-Id: + body: +``` + +Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver +trusts only certs issued by that CA. + +Responses: + +| Status | Meaning | +|---|---| +| 201 | Stored; new | +| 200 | Already present with matching sha256; nothing to do | +| 409 | Already present with **different** sha256; receiver refuses to overwrite | +| 4xx | Bad request (missing header, malformed id, etc.) | +| 5xx | Server error; sender retries with backoff | + +There is no DELETE. Episodes are immutable once shipped. + +## Sender (`shipper`) state machine + +``` + scan data/episodes/ + | + v + for each /done.marker: + | + v + tar+zstd → data/outbox/.tar.zst.partial + | + v + rename → data/outbox/.tar.zst (atomic; visible to retry loop) + | + v + PUT to receiver + | + +-- 200/201 → mv data/episodes/ data/shipped/; + | rm data/outbox/.tar.zst + | + +-- 409 → log mismatch, leave files in place, alert (manual triage) + | + +-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry +``` + +The shipper does the same scan on every wake-up, so a crash mid-tar or +mid-PUT is harmless — the next pass picks up wherever it left off. 
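
The sender's response handling and retry schedule can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's shipper: the names `Action`, `action_for_status`, and `backoff_delays` are hypothetical, and sending any unlisted non-2xx status to manual triage is an assumption, since the state machine only spells out 200/201, 409, and 5xx/network.

```python
from enum import Enum
from itertools import islice
from typing import Iterator, Optional


class Action(Enum):
    MARK_SHIPPED = "mark_shipped"  # move episode to data/shipped/, drop the outbox tarball
    TRIAGE = "triage"              # leave files in place and alert a human
    RETRY = "retry"                # back off, then PUT again


def action_for_status(status: Optional[int]) -> Action:
    """Map a receiver response (None = network error) to the shipper's next step."""
    if status in (200, 201):
        return Action.MARK_SHIPPED
    if status is None or status >= 500:
        return Action.RETRY
    # 409 is the documented mismatch case; other statuses land here by assumption.
    return Action.TRIAGE


def backoff_delays(base: float = 1.0, cap: float = 300.0) -> Iterator[float]:
    """Yield 1s, 2s, 4s, 8s, ... and stay at the 5-minute cap once reached."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


# The generator is infinite, so take only as many delays as needed.
first_delays = list(islice(backoff_delays(), 5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Keeping the decision a pure function of the status code leaves the retry loop itself trivial, and the rescan on every wake-up still covers a crash between any two steps.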
+ +## Receiver state machine + +``` + PUT body received + | + v + stream into /var/lib/cis490/incoming//.tar.zst.partial + | + v + compute sha256 while streaming + | + +-- mismatch with header → 400, delete partial + | + +-- match: + | + v + if final path exists: + | + +-- existing sha256 == new sha256 → 200, delete partial + | + +-- existing sha256 != new sha256 → 409, delete partial + else: + | + v + atomic rename → /var/lib/cis490/episodes//.tar.zst + | + v + append index.jsonl row + | + v + 201 +``` + +`index.jsonl` row: + +```json +{ + "received_at_wall": "2026-04-28T22:31:43Z", + "host_id": "lab-host-1", + "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0", + "sha256": "...", + "size_bytes": 8412331, + "schema_version": 1 +} +``` + +That index is the closest thing to a database we have until we decide on one. +A trainer can stream it to know what episodes exist, then untar on demand. + +## Why not just rsync? + +`rsync` works, but: + +- No schema-version tagging at the protocol layer. +- No clean way to enforce "immutable once written". +- mTLS via WG-issued certs is more uniform with the rest of the overlay than + ssh-key juggling. +- A tiny FastAPI receiver is also a natural place to add ingest-time hooks + later (e.g. emit a Matrix notification on successful receipt, kick off a + training run when N new episodes arrive). + +We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a +class project that is unlikely. + +## Operational notes + +- **Disk on lab host.** The shipper keeps episodes locally in + `data/shipped//` until a retention pass prunes them. Default retention: + 7 days *or* 80% disk usage, whichever comes first. +- **Disk on receiver.** No retention enforced by default — the central store + is the dataset. +- **Backpressure.** If the receiver is unreachable (WG down, Pi rebooting), + the shipper accumulates tarballs in `data/outbox/`. No data is lost. +- **Multiple lab hosts.** Each writes under its own `/` prefix. 
No + coordination needed; episode ids are globally unique (ULID). diff --git a/exploits/README.md b/exploits/README.md new file mode 100644 index 0000000..0d433e8 --- /dev/null +++ b/exploits/README.md @@ -0,0 +1,12 @@ +# exploits/ + +Metasploit resource scripts (`*.rc`) that drive specific exploit modules +deterministically — same inputs, same module options, every time. + +Each script: +- Sets `RHOSTS` to the guest's bridge IP. +- Sets a payload that opens a session usable for sample upload + execute. +- Avoids any options that introduce randomness in the exploit fire timing + (so that the `armed → infecting` transition lands at a predictable offset). + +These scripts pair with public Metasploit modules. We do not author exploits. diff --git a/orchestrator/README.md b/orchestrator/README.md new file mode 100644 index 0000000..5bd60ea --- /dev/null +++ b/orchestrator/README.md @@ -0,0 +1,21 @@ +# orchestrator/ + +The state machine that drives a single **episode**: + +``` +snapshot_load → clean → armed → infecting → infected_running → dormant → reverting +``` + +Responsibilities: + +- Bring up the host-only bridge and verify isolation before the guest starts. +- Boot the guest from a named snapshot. +- Spawn the five telemetry collectors (`collectors/`) with a shared episode id + and shared monotonic clock origin. +- Drive the Metasploit Framework over RPC to fire the configured exploit module. +- Upload + execute the configured malware sample once a session is open. +- Emit phase transitions to `labels.jsonl` *at the moment the action is taken*. +- Revert the snapshot at episode end. +- Write `meta.json` with the result summary. + +Implementation lives in this directory and is imported as `orchestrator.*`. 
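
The phase chain in orchestrator/README.md could be modeled as an enum with a strictly linear successor function. This is only a sketch: `Phase`, `next_phase`, and `label_row` are hypothetical names, the row fields (`episode_id`, `phase`, `t_mono_ns`) are placeholders rather than the schema from `docs/data-model.md`, and a real orchestrator would probably also need error paths that jump straight to `reverting`.

```python
import json
import time
from enum import Enum
from typing import Optional


class Phase(Enum):
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


ORDER = list(Phase)  # Enum iteration preserves definition order


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after `reverting`."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


def label_row(episode_id: str, phase: Phase, origin_ns: int) -> str:
    """Build one JSONL line, stamped when called (field names are placeholders)."""
    row = {
        "episode_id": episode_id,
        "phase": phase.value,
        # Offset from the shared monotonic clock origin, in nanoseconds.
        "t_mono_ns": time.monotonic_ns() - origin_ns,
    }
    return json.dumps(row)
```

Stamping the row inside `label_row` rather than passing a timestamp in reflects the rule that transitions are recorded at the moment the action is taken.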
diff --git a/samples/README.md b/samples/README.md new file mode 100644 index 0000000..eccdf3a --- /dev/null +++ b/samples/README.md @@ -0,0 +1,33 @@ +# samples/ + +**Sample binaries are NEVER committed to this repo.** This directory holds: + +- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata + (source, category, expected behavior, target CVE). +- `fetch.py` — script that pulls samples from configured sources + (MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them + under `samples/store/` (gitignored). +- Per-sample notes in markdown describing observed behavior in our lab. + +`samples/store/` lives only on the lab host. It is gitignored *and* should +sit on a disk that is not auto-mounted on developer workstations. + +## Manifest entry shape (placeholder) + +```yaml +samples: + - name: linux.miner.xmrig.elf + sha256: "..." # pinned + source: MalwareBazaar + category: miner + target_cve: null # cryptominers are usually post-exploit payloads + behavior: "high CPU, periodic stratum protocol traffic" + pairs_with_exploit: exploit/multi/samba/usermap_script +``` + +## Safety rules + +- Only download to the lab host, never to a developer workstation. +- Verify sha256 immediately, before any other read. +- Keep the directory on a path that is *not* on the WG overlay. +- Re-verify sha256 before each detonation; refuse to run on mismatch. diff --git a/training/README.md b/training/README.md new file mode 100644 index 0000000..32116d8 --- /dev/null +++ b/training/README.md @@ -0,0 +1,23 @@ +# training/ + +Deferred until the dataset has substance. The plan, recorded so we don't lose +it: + +1. Two models will be trained from the same episodes: + - **Realistic** — features only (`available_in_deployment: true`). + - **Oracle** — all rows, regardless of the deployment flag. +2. Baseline architecture: a rolling-window feature builder + a gradient-boosted + trees classifier (XGBoost or LightGBM). Cheap, strong, interpretable. +3. 
Window: 1–5 second sliding windows with per-channel summary stats + (mean, std, p95, slope, count of zero buckets). +4. Target: the phase enum from `labels.jsonl`, projected onto each window's + center timestamp. +5. Evaluation: + - Held-out *samples* (not just held-out time slices) — generalization to + unseen malware matters more than within-sample accuracy. + - Confusion matrix + per-phase precision/recall. + - Realistic vs. oracle gap, reported. +6. Stretch: trust-over-time scoring per the IEEE 9881803 paper, with a reset + threshold tuned for low false-positive cost. + +See [`docs/threat-model.md`](../docs/threat-model.md) for why this split exists. diff --git a/vm/README.md b/vm/README.md new file mode 100644 index 0000000..29788a8 --- /dev/null +++ b/vm/README.md @@ -0,0 +1,17 @@ +# vm/ + +Recipes and helpers for building and snapshotting guest VMs. Disk images and +snapshots themselves are gitignored — this directory carries the *how*, not +the bytes. + +``` +vm/ + images/ # qcow2 staging (gitignored) + snapshots/ # exported snapshots if needed (gitignored) + guest-agent/ # in-guest telemetry agent (shipped into the guest) + metasploitable2.md # download/convert/snapshot procedure (TODO) + custom-debian/ # cloud-init for our own vulnerable Debian (TODO) +``` + +See [`docs/lab-setup.md`](../docs/lab-setup.md) for the full host + guest +bring-up procedure.
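
For the per-channel window summary planned in training/README.md (mean, std, p95, slope, count of zero buckets), a standard-library sketch might look like the following. Several choices here are assumptions rather than decisions recorded in the plan: the function name `window_stats`, the nearest-rank definition of p95, the population standard deviation, and a least-squares slope taken against the bucket index.

```python
import math
from statistics import fmean, pstdev
from typing import Dict, Sequence


def window_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summary stats for one channel over one window of equally spaced buckets."""
    n = len(values)
    if n == 0:
        raise ValueError("window must contain at least one bucket")
    ordered = sorted(values)
    # Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it.
    p95 = ordered[math.ceil(0.95 * n) - 1]
    # Least-squares slope against the bucket index 0..n-1 (0.0 for a single bucket).
    x_mean = (n - 1) / 2
    y_mean = fmean(values)
    denom = sum((i - x_mean) ** 2 for i in range(n))
    slope = (
        sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values)) / denom
        if denom
        else 0.0
    )
    return {
        "mean": y_mean,
        "std": pstdev(values),
        "p95": p95,
        "slope": slope,
        "zero_buckets": float(sum(1 for v in values if v == 0)),
    }
```

A real feature builder would likely vectorize this with NumPy or pandas; the pure-Python form is only meant to pin down what each statistic means.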