Phase 2: real-VM episode (Cirros under KVM) + works-cited doc

vm/launch_demo.sh boots a Cirros qcow2 under KVM with QMP and a monitor
socket exposed; snapshot=on routes guest writes to a temporary overlay
so the on-disk image is never mutated (clean factory reset every boot).

End-to-end verified: vm/launch_demo.sh → orchestrator with --target-pid
<qemu pid> → 201 telemetry rows over 20s against the real qemu-system
process. The plotted envelope shows the expected idle-VM shape:
periodic ~10% CPU spikes from KVM/timer interrupts, flat 230 MiB RSS,
and a single late-boot disk write. This differs from the synthetic
load_mimic envelope, which indicates the collector is reading real KVM
behavior.

docs/sources.md is the works-cited doc — every tool, library, sample
source, paper, and standard the project leans on, grouped by category.
README's nav table now points at it. README's status section also lists
what's done vs. in progress so reviewers can see scope at a glance.

Note: vm/images/ stays gitignored. The Cirros 0.6.3 image is documented
with its sha256 (7d6355852aeb...) in docs/sources.md so any team member
can reproduce the bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maximus Gorog 2026-04-29 00:00:25 -06:00
parent 970698af83
commit 69c09f4404
3 changed files with 245 additions and 2 deletions


@@ -28,6 +28,7 @@ proprietary follow-on that pairs detection with blockchain-anchored hardware res
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
@@ -52,5 +53,11 @@ proprietary follow-on that pairs detection with blockchain-anchored hardware res
## Status
Project bootstrap. Skeleton, documentation, and design decisions in place;
collection and orchestration code in progress.
- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl.
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids.
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz.
- ✅ Synthetic envelope demo (`tools/run_envelope_demo.py`) — full 8-phase XMRig-shaped envelope produced end-to-end.
- ✅ **Phase 2 — real VM:** Cirros boots under KVM, orchestrator collects telemetry against the real `qemu-system` pid (`vm/launch_demo.sh` + the existing orchestrator).
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5).
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`.
- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified).

docs/sources.md (new file, 190 lines added)

@@ -0,0 +1,190 @@
# Sources & Works Cited
Every external thing this project depends on, leans on for design, or pulls
samples from. Grouped by category. Where relevant, we note the role each
thing plays in our pipeline.
---
## Prior work / academic
- **A Trust Model for Resource-Constrained IoT Devices Based on Performance
Metrics.** IEEE Document 9881803.
https://ieeexplore.ieee.org/document/9881803
*Role:* prerequisite paper for this project. Frames detection as a
trust-over-time score rather than a single-snapshot classifier.
- **Understanding the Mirai Botnet** (Antonakakis et al., USENIX Security
2017). Study of Mirai, the Linux/IoT botnet that spread through weak
Telnet/SSH credentials.
https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/antonakakis
*Role:* the canonical real-world Linux compromise pattern that motivates
our chosen attack vector ("SSH weak creds → drop payload"). The behavioral
envelope our model targets is shaped by Mirai-class workloads.
- **Linux man pages, `proc(5)`** — kernel ABI for /proc.
https://man7.org/linux/man-pages/man5/proc.5.html
*Role:* canonical reference for `/proc/<pid>/{stat,io,status,schedstat}`
field layout used by `collectors/proc_qemu.py`.
- **Linux `perf_event_open(2)` man page.**
https://man7.org/linux/man-pages/man2/perf_event_open.2.html
*Role:* the syscall that backs `perf stat` and any in-process hardware-
counter reads. Both the planned host-side `perf_qemu` collector and the
in-guest agent will read from this surface.
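
As an illustration of the `proc(5)` field layout referenced above, here is a minimal sketch of reading CPU ticks and RSS from `/proc/<pid>/stat`. It is not the code in `collectors/proc_qemu.py`; the names `ProcSample`, `parse_stat`, and `read_stat` are hypothetical, and the field numbers (14 = `utime`, 15 = `stime`, 24 = `rss`) are taken from the man page.

```python
import os
from typing import NamedTuple


class ProcSample(NamedTuple):
    """Hypothetical container for the three values this sketch extracts."""
    utime_ticks: int  # field 14: user-mode CPU time, in clock ticks
    stime_ticks: int  # field 15: kernel-mode CPU time, in clock ticks
    rss_bytes: int    # field 24 (resident pages) converted to bytes


def parse_stat(text: str, page_size: int = 4096) -> ProcSample:
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so cut at the LAST ')' before splitting.
    rest = text[text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return ProcSample(
        utime_ticks=int(rest[14 - 3]),
        stime_ticks=int(rest[15 - 3]),
        rss_bytes=int(rest[24 - 3]) * page_size,
    )


def read_stat(pid: int) -> ProcSample:
    # Reads the live kernel view; page size is queried rather than assumed.
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read(), os.sysconf("SC_PAGE_SIZE"))
```

Sampling `read_stat(pid)` twice and dividing the tick delta by `os.sysconf("SC_CLK_TCK")` and the elapsed wall time gives a CPU fraction for that interval.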
---
## Virtualization & operating system
- **QEMU** (10.2.0 in our lab). https://www.qemu.org
*License:* GPL-2.0-or-later. *Role:* the hypervisor running guest VMs;
we drive it via QMP for oracle telemetry.
- **KVM** (Linux kernel module). https://www.linux-kvm.org
*License:* GPL-2.0. *Role:* hardware-accelerated virtualization backend
for QEMU.
- **Linux kernel** (6.18.x lab host). https://www.kernel.org
*License:* GPL-2.0. *Role:* host kernel; supplies /proc, /sys, perf, KVM,
cgroups, virtio-serial.
- **systemd** — service supervisor for the receiver and the (planned)
orchestrator and shipper daemons. https://systemd.io
*License:* LGPL-2.1-or-later.
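
To make the "we drive it via QMP" role concrete, the following is a rough sketch of a QMP exchange over a Unix socket: read the greeting, negotiate capabilities, then issue `query-status`. The helper names (`qmp_execute`, `qmp_session`, `query_status`) are hypothetical and this is not the planned QMP collector. The default socket path assumes the `RUN_DIR` default from `vm/launch_demo.sh`, and the sketch assumes QEMU's default non-pretty output, where each JSON message arrives on its own line.

```python
import json
import socket
from typing import Optional


def qmp_execute(stream, command: str, arguments: Optional[dict] = None):
    """Send one QMP command and return the value of its "return" key."""
    request = {"execute": command}
    if arguments:
        request["arguments"] = arguments
    stream.write(json.dumps(request).encode() + b"\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        reply = json.loads(line)
        if "event" in reply:
            # Asynchronous events can interleave with replies; skip them.
            continue
        if "error" in reply:
            raise RuntimeError(reply["error"].get("desc", "QMP error"))
        return reply["return"]


def qmp_session(sock: socket.socket):
    """Consume the greeting, leave capabilities negotiation mode, return the stream."""
    stream = sock.makefile("rwb")
    greeting = json.loads(stream.readline())
    if "QMP" not in greeting:
        raise RuntimeError("peer did not send a QMP greeting")
    qmp_execute(stream, "qmp_capabilities")
    return stream


def query_status(path: str = "/tmp/cis490-vm/qmp.sock") -> dict:
    # Path is an assumption: the launcher's default RUN_DIR plus qmp.sock.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        return qmp_execute(qmp_session(sock), "query-status")
```

A production collector would also want a read timeout and a way to surface the skipped events, since some of them (for example guest shutdown) are telemetry in their own right.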
---
## VMs and intentionally-vulnerable images
- **Cirros 0.6.3** (current phase-2 demo image, x86_64).
https://download.cirros-cloud.net/0.6.3/
SHA-256 of `cirros-0.6.3-x86_64-disk.img`:
`7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b`
*License:* GPL. *Role:* tiny (~21 MiB) Linux image used in OpenStack/QEMU
testing; we use it as the "real but boring" guest while the rest of the
pipeline is wired up. **No vulnerabilities baked in** — it's a clean
baseline.
- **Metasploitable 2** (planned for the exploit phase, Rapid7).
https://information.rapid7.com/download-metasploitable-2017.html
*Role:* purposely vulnerable Linux VM whose services have stable
Metasploit modules (vsftpd 2.3.4 backdoor, distccd RCE,
Samba `usermap_script`, PHP CGI arg injection, etc.) — gives us
reproducible exploit fire for the `armed → infecting` transition.
- **Metasploitable 3** (Rapid7, optional later, Vagrant-built).
https://github.com/rapid7/metasploitable3
*Role:* heavier, Win + Linux variants; reserved for adding diversity to
the dataset if time allows.
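
Because images are hash-pinned rather than committed, a download can be checked against the SHA-256 listed for Cirros above. The sketch below is one way to do that; `sha256_file` and `verify_image` are hypothetical helper names, not existing project code. The pinned digest applies to the downloaded `cirros-0.6.3-x86_64-disk.img`; whether a renamed or converted local copy still matches it is an assumption to check.

```python
import hashlib
from pathlib import Path

# Digest of cirros-0.6.3-x86_64-disk.img as recorded in this document.
CIRROS_SHA256 = "7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large images never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path: Path, expected: str = CIRROS_SHA256) -> bool:
    """Return True when the file's digest equals the pinned value."""
    return sha256_file(path) == expected.lower()
```

A caller would typically refuse to boot the VM when `verify_image(...)` returns `False`.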
---
## Exploitation framework
- **Metasploit Framework** (Rapid7).
https://github.com/rapid7/metasploit-framework
*License:* BSD-3-Clause. *Role:* drives the exploit fire step
programmatically via `msfrpc`, so episodes label `armed → infecting`
transitions on `session_open` rather than guessing from metrics.
- **Exploit-DB** (Offensive Security).
https://www.exploit-db.com
*Role:* cross-reference for CVE → public PoC, where Metasploit doesn't
cover a vulnerability we want.
---
## Public malware sample sources
> **All samples used in this project are pre-existing, public, and
> hash-pinned.** We do not author novel malware or exploits.
> Sample binaries are NEVER committed to the repo — see
> [`samples/README.md`](../samples/README.md) for safety rules.
- **MalwareBazaar** (abuse.ch). https://bazaar.abuse.ch
*Role:* primary sample fetch source. Provides API + sha256 lookup. Used
for cryptominers (XMRig variants), webshells, and Linux ELF samples.
- **theZoo** (a public live-malware repository). https://thezoo.morirt.com
https://github.com/ytisf/theZoo
*Role:* secondary source for older/rarer samples. Categorized by family.
- **vx-underground** (collection of malware research artifacts).
https://vx-underground.org
*Role:* tertiary source; useful for academic context and
family-attribution metadata.
---
## Standards & specifications
- **ULID — Universally Unique Lexicographically Sortable Identifier.**
https://github.com/ulid/spec
*Role:* episode IDs. 26-char Crockford base32, time-sortable, no
coordinator. Implemented in `orchestrator/ulid.py`.
- **JSON Lines.** https://jsonlines.org
*Role:* on-disk telemetry, label, and event format. Append-only,
crash-safe, trivially loadable as a DataFrame.
- **PEP 735 — dependency groups.**
https://peps.python.org/pep-0735/
*Role:* `pyproject.toml` dependency grouping (the `dev` group).
- **Crockford base32.** https://www.crockford.com/base32.html
*Role:* alphabet for ULIDs.
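
For readers unfamiliar with the ULID layout, here is a minimal sketch of the encoding described by the spec: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. It is not the contents of `orchestrator/ulid.py`; `encode_ulid` and `new_ulid` are hypothetical names, and the sketch omits the spec's optional monotonic handling for IDs generated within the same millisecond.

```python
import os
import time

# Crockford base32 alphabet: digits plus letters, excluding I, L, O, and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    # Timestamp occupies the high bits, so string order follows time order.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 chars * 5 bits = 130 bits; the top 2 are zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))
```

Because the timestamp is the most significant part, sorting episode directories by name also sorts them by creation time, with no coordinator involved.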
---
## Python runtime & libraries
- **Python 3.11+** — runtime requirement. https://www.python.org
- **uv** (Astral) — Python project + dependency manager.
https://github.com/astral-sh/uv
- **Starlette** — ASGI framework for the receiver.
https://www.starlette.io
- **Uvicorn** — ASGI server. https://www.uvicorn.org
- **httptools, websockets, watchfiles, python-dotenv, pyyaml** — Uvicorn
`[standard]` extras.
- **pytest** — test runner. https://docs.pytest.org
- **pytest-asyncio** — async test support.
https://github.com/pytest-dev/pytest-asyncio
- **httpx** — async HTTP client used for receiver tests via ASGITransport.
https://www.python-httpx.org
- **matplotlib** + **numpy** — plotting (envelope visualization only).
https://matplotlib.org / https://numpy.org
---
## Lab infrastructure (spectral org, .wg overlay)
These are not part of this repo's code, but they are the platform the
pipeline runs on. See [`reference_wg_infra` memory] for context.
- **WireGuard** — VPN tunnel for the `.wg` overlay.
https://www.wireguard.com *License:* GPL-2.0.
- **Caddy** — reverse proxy in front of the receiver, terminates internal
TLS via `tls internal`. https://caddyserver.com
*License:* Apache-2.0.
- **Forgejo** — self-hosted git host at `maxgit.wg`.
https://forgejo.org *License:* GPL-3.0+.
- **Raspberry Pi 5** — central WG-side collector hardware (the receiver
+ dataset store run here). NOT the deployment target for the model.
---
## How to cite this dataset (placeholder)
When the dataset reaches a publishable form, the canonical citation will be
added here. Until then, a short course-project citation is fine:
> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.
---
## Maintenance
When you add a new dependency, sample source, or external tool, add it
here in the same session. A "works cited" file with stale citations is
worse than none.

vm/launch_demo.sh (new executable file, 46 lines added)

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Boot the Cirros qcow2 under KVM with QMP and a monitor socket exposed.
#
# This is the v0 VM launcher for phase 2: validate that the orchestrator
# and host /proc collector work against a real qemu-system process. No
# host-only bridge yet, no exploit driver, no payload — just boot and
# idle. We add the bridge and exploit machinery in later phases.
#
# The run dir holds the files the orchestrator reads (qemu pid, sockets):
#   $RUN_DIR/qemu.pid
#   $RUN_DIR/qmp.sock
#   $RUN_DIR/monitor.sock
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
IMAGE="${IMAGE:-$REPO_ROOT/vm/images/cirros-baseline.qcow2}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
mkdir -p "$RUN_DIR"
QMP_SOCK="$RUN_DIR/qmp.sock"
MON_SOCK="$RUN_DIR/monitor.sock"
PID_FILE="$RUN_DIR/qemu.pid"
if [[ ! -f "$IMAGE" ]]; then
  echo "no image at $IMAGE" >&2
  exit 1
fi
# snapshot=on routes guest writes through a temporary overlay so the qcow2
# on disk is never mutated — every boot starts from the same bytes.
exec qemu-system-x86_64 \
  -name cis490-vm \
  -machine q35,accel=kvm \
  -cpu host \
  -smp 1,sockets=1,cores=1,threads=1 \
  -m 256 \
  -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
  -netdev user,id=n0 \
  -device virtio-net-pci,netdev=n0 \
  -nographic \
  -serial null \
  -monitor unix:"$MON_SOCK",server=on,wait=off \
  -qmp unix:"$QMP_SOCK",server=on,wait=off \
  -pidfile "$PID_FILE" \
  -display none