README: embed demo plots, mark synthetic vs real clearly, add collapsibles

The README now leads with a 'What an episode looks like' section that
shows both:

  * docs/images/synthetic-envelope.png: pipeline-validation plot. Real
    telemetry of a real process whose load is shaped by
    tools/load_mimic.py (Python). Explicitly labelled NOT REAL MALWARE in
    the caption; the earlier wording was unclear.
  * docs/images/real-vm-idle.png: real Cirros 0.6.3 booted under KVM,
    same orchestrator + /proc collector pointed at the qemu-system pid.
    Idle baseline; no exploit, no payload yet.

A 'What's still missing for the real-malware envelope' table makes the
tier path explicit (real VM idle → real workload in-guest → real exploit
fire → real sample).

Repository nav, deploy steps, design rationale, and threat model are
moved into <details>...</details> blocks so first-time visitors see the
demo plots and the status list without scrolling past wall-of-text.

Stale Pi-as-deployment-target wording in the design-rationale section is
fixed alongside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:11:54 -06:00
commit 32ae161ef2
parent cc37fc6c4d
3 changed files with 227 additions and 40 deletions
--- a/README.md
+++ b/README.md
@@ -1,24 +1,147 @@
 # CIS490 — Behavioral Malware Detection Dataset & Model

-Course project for CIS490 (Cybersecurity). The end-goal is an ML model that watches
-performance metrics on a real device, decides whether the device has been breached,
-and triggers a hardware-level reset when confidence is high enough.
-
-This repository covers the **dataset side** of that pipeline: we run real, public
-malware samples against intentionally vulnerable Linux VMs and capture labeled
-time-series telemetry that mirrors what the same model would see in deployment on
-an arbitrary target Linux device.
-
-> **Note on the topology:** in this project the **Pi5 is the WireGuard-side
-> collector** that receives episode tarballs from one or more lab hosts — it is
-> *not* the deployment target for the model. The deployment target is generic
-> ("any constrained Linux device"). See [`docs/architecture.md`](docs/architecture.md).
+Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
+watches performance metrics on a real device, decides whether the device has
+been breached, and triggers a hardware-level reset when confidence is high
+enough. This repository covers the **dataset side** — we run public malware
+samples against intentionally vulnerable Linux VMs and capture labeled
+time-series telemetry that mirrors what the deployed model would see in the
+field.

 The work is grounded in the trust-over-time scoring model from
-[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803) and a related
-proprietary follow-on that pairs detection with blockchain-anchored hardware reset.
+[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

-## What lives where
+---
+
+## What an episode looks like
+
+Each episode runs a target through a labeled phase schedule
+(`clean → armed → infecting → infected_running → dormant → ...`) while
+sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is
+the set of timestamped phase transitions written to `labels.jsonl`. Labels
+and metric rows share one monotonic clock, so each telemetry row can be
+assigned its phase by timestamp alone.
+
+### Pipeline-validation plot — *synthetic load, real telemetry*
+
+This is **not real malware**. The CPU/RSS/IO numbers are real
+`/proc/<pid>` reads of a real process; the *workload shape* is a Python
+program (`tools/load_mimic.py`) that mimics an XMRig-style envelope so we
+can validate the orchestrator + collector + labeling pipeline before
+plugging in a real exploit and a real sample. Coloured bands are phase
+labels straight out of `labels.jsonl`.
+
+![Synthetic envelope demo (pipeline validation only)](docs/images/synthetic-envelope.png)
+
+### Real-VM idle baseline — *real Cirros guest under KVM, no malware yet*
+
+Same pipeline, pointed at the real `qemu-system` process running a fresh
+Cirros 0.6.3 guest with nothing happening inside it. Periodic ~10% CPU
+spikes are KVM/timer interrupts; the single ~1 MiB write near t=3 s is
+the guest finishing its late-boot disk activity. No phase transitions —
+just labelled `clean` for the whole window.
+
+![Real Cirros VM idle](docs/images/real-vm-idle.png)
+
+### What's still missing for the real-malware envelope
+
+| Tier | What it gives | Status |
+|---|---|---|
+| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
+| 2 — real VM, real workload from inside the guest | first real-load envelope shape | 🚧 next |
+| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
+| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |
+
+For an interactive view of any episode (zoom/pan/hover), run:
+
+```sh
+tools/show_envelope.sh data/episodes/<episode_id>
+# then open http://127.0.0.1:8988/
+```
+
+---
+
+## Status
+
+- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
+- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
+- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
+- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
+- ✅ Real VM (Cirros under KVM) — orchestrator collects against the real `qemu-system` pid
+- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
+- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
+- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
+
+> **Topology note:** in this project the **Pi5 is the WireGuard-side
+> *collector*** that receives episode tarballs from one or more lab hosts.
+> It is *not* the deployment target for the model. The deployment target is
+> generic ("any constrained Linux device"). See
+> [`docs/architecture.md`](docs/architecture.md).
+
+---
+
+<details>
+<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>
+
+```sh
+git clone https://maxgit.wg/spectral/CIS490.git
+cd CIS490
+
+# One-time setup.
+uv sync
+
+# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
+uv run python tools/run_envelope_demo.py --data-root data
+
+# Render a static PNG envelope of that episode.
+uv run python tools/plot_envelope.py data/episodes/<episode_id>
+
+# Or open an interactive plot in your browser:
+tools/show_envelope.sh data/episodes/<episode_id>
+```
+
+The data lands in `data/episodes/<ulid>/`:
+
+```
+meta.json              episode metadata (image, snapshot, schedule, host fingerprint)
+events.jsonl           orchestrator actions (snapshot_load, phase_transition, episode_end)
+labels.jsonl           one row per phase transition — THIS is the envelope
+telemetry-proc.jsonl   host /proc sampler at 10 Hz
+done.marker            written last; the shipper only sees finished episodes
+```
+
+</details>
+
+<details>
+<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>
+
+The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
+QMP/monitor sockets and pidfile. The orchestrator then samples the real
+`qemu-system` process.
+
+```sh
+# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
+# (See docs/sources.md for the Cirros sha256.)
+
+# Boot in one terminal:
+RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh
+
+# In another terminal, point the orchestrator at the VM's pid:
+QPID=$(cat /tmp/cis490-vm/qemu.pid)
+uv run python -m orchestrator --target-pid $QPID --duration 20
+
+# Plot:
+tools/show_envelope.sh data/episodes/<episode_id>
+```
+
+The idle-VM envelope shape is distinct from the synthetic load: periodic
+~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
+late-boot disk write. That's a real KVM guest you're seeing.
+
+</details>
+
+<details>
+<summary><b>Repository layout</b></summary>

 | Path | What it holds |
 |---|---|
@@ -31,33 +154,97 @@ proprietary follow-on that pairs detection with blockchain-anchored hardware res
 | [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
 | `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
 | `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
-| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
-| `exploits/` | Metasploit resource scripts for repeatable exploitation |
+| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
+| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
+| `tools/` | Demo runners, load mimic, plot scripts |
+| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
 | `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
 | `training/` | Model training code (deferred — schema first) |
+| `etc/` | systemd units and config templates installed by the deploy scripts |

-## Quick orientation
+</details>

-1. **Why VMs?** We need a clean snapshot/revert loop and we need to run real malware
-   without burning hardware. KVM gives us both at near-native speed.
-2. **Why is the network isolated?** A host-only bridge keeps malware off the
-   internet and off the WG overlay. The Pi5 gateway is the **lab-side observer**,
-   playing the same role it would play in a deployed setting.
-3. **Why JSONL and not a database (yet)?** Schema-last: collect first, decide
-   storage shape after we see what's actually useful. JSONL is crash-safe,
-   append-only, and reshapes trivially into Postgres/Timescale/Parquet later.
-4. **Why two models?** One trained on features that exist on a real Pi
-   (*deployable*), one trained on host-side QEMU-only features (*oracle*). The
-   accuracy gap measures how much detection power a privileged rootkit can take
-   from the deployed model. See [docs/threat-model.md](docs/threat-model.md).
+<details>
+<summary><b>Design decisions — why these choices</b></summary>

-## Status
+- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need
+  to run real malware without compromising the host. KVM gives both at
+  near-native speed; containers share the host kernel and many samples detect
+  containerization and refuse to detonate. See
+  [`docs/architecture.md`](docs/architecture.md).
+- **Why KVM (not TCG/-icount)?** ML training data needs realistic noise for
+  the model to generalize over. KVM is also ~15× faster than TCG, which
+  directly multiplies dataset size per wall-clock hour. We pin 1 vCPU + cap
+  CPU% via cgroup to preserve the "constrained device" framing.
+- **Why JSONL (not a DB yet)?** Schema-last. Collect first, decide storage
+  shape after we see what's useful. JSONL is crash-safe, append-only,
+  reshapes trivially into Postgres/Timescale/Parquet.
+- **Why two models — realistic vs. oracle?** Features that exist on a
+  deployed device train the *realistic* model. Host-side QEMU telemetry
+  (which doesn't exist in deployment) is *oracle*-only — used to assign
+  honest labels at training time, never as a model input. The accuracy gap
+  between the two measures how much detection power a privileged rootkit
+  can take from us by lying to in-device tools. See
+  [`docs/threat-model.md`](docs/threat-model.md).
+- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.

-- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl.
-- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids.
-- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz.
-- ✅ Synthetic envelope demo (`tools/run_envelope_demo.py`) — full 8-phase XMRig-shaped envelope produced end-to-end.
-- ✅ **Phase 2 — real VM:** Cirros boots under KVM, orchestrator collects telemetry against the real `qemu-system` pid (`vm/launch_demo.sh` + the existing orchestrator).
-- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5).
-- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`.
-- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified).
+</details>
+
+<details>
+<summary><b>Deploying the receiver and lab-host roles</b></summary>
+
+Two roles, one bootstrap command each. Detailed in
+[`docs/deploy.md`](docs/deploy.md):
+
+- `lab-host` — runs episodes, ships completed episodes to the receiver.
+- `receiver` — accepts ship uploads, stores tarballs + appends to
+  `index.jsonl`. Runs on the Pi5 in our setup.
+
+```sh
+# On a lab host:
+./scripts/install-lab-host.sh   # (TODO — currently bring up by hand per docs/deploy.md)
+
+# On the Pi5 (or any always-on WG node):
+./scripts/install-receiver.sh   # (TODO — same)
+```
+
+For now both bootstrap scripts are scaffolds; the units and configs they
+install live in `etc/`. The receiver itself works today: run
+`uv run python -m receiver --config etc/receiver.toml.example` after
+adjusting the paths in that config.
+
+</details>
+
+<details>
+<summary><b>Threat model and feature-availability split</b></summary>
+
+See [`docs/threat-model.md`](docs/threat-model.md) for the full argument.
+The short version:
+
+| Channel | Vantage | Role |
+|---|---|---|
+| Host `/proc/<qemu_pid>` | outside guest | oracle (label only) |
+| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
+| `perf stat -p <qemu_pid>` | outside guest | oracle (label only) |
+| Bridge-side pcap | gateway-style | feature (deployable) |
+| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |
+
+We collect everything in the lab. Only the *features* go into the deployed
+model; the oracles are used to label episodes with high confidence
+(disagreement between in-guest and host-side data is itself a rootkit
+signal).
+
+</details>
+
+---
+
+## Citing this work
+
+A short course-project citation, until the dataset reaches a publishable
+form:
+
+> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
+> Spectral lab, 2026.
+
+See [`docs/sources.md`](docs/sources.md) for everything else this project
+leans on.
--- a/docs/images/real-vm-idle.png
+++ b/docs/images/real-vm-idle.png
--- a/docs/images/synthetic-envelope.png
+++ b/docs/images/synthetic-envelope.png
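
Note on the 'What an episode looks like' section: the README says `labels.jsonl` shares a monotonic clock with the metric rows in `telemetry-proc.jsonl`. A minimal sketch of what that alignment could look like is below. The field names `t_mono`, `phase`, and `cpu_pct` are assumptions made for illustration; the diff does not show the real row schema, so adjust the keys to whatever the collector actually writes.

```python
import bisect
import json


def load_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_rows(labels, rows, ts_key="t_mono", phase_key="phase"):
    """Attach to each telemetry row the phase in force at its timestamp.

    ts_key and phase_key are hypothetical field names, not the confirmed schema.
    Rows that precede the first transition get phase None.
    """
    labels = sorted(labels, key=lambda r: r[ts_key])
    starts = [r[ts_key] for r in labels]
    out = []
    for row in rows:
        # Index of the last transition at or before this row's timestamp.
        i = bisect.bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        out.append({**row, phase_key: phase})
    return out


# Tiny in-memory stand-ins for labels.jsonl and telemetry-proc.jsonl.
labels_text = (
    '{"t_mono": 0.0, "phase": "clean"}\n'
    '{"t_mono": 1.0, "phase": "armed"}\n'
)
rows_text = (
    '{"t_mono": 0.5, "cpu_pct": 1.0}\n'
    '{"t_mono": 1.0, "cpu_pct": 9.5}\n'
    '{"t_mono": 1.7, "cpu_pct": 40.0}\n'
)

labeled = label_rows(load_jsonl(labels_text), load_jsonl(rows_text))
for row in labeled:
    print(row["t_mono"], row["phase"])
```

A row whose timestamp equals a transition time is assigned the new phase here; whether the real pipeline treats that boundary the same way is not stated in the README.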