CIS490/docs/lab-setup.md
Maximus Gorog fa1574a0a6 Scaffold project: docs, repo skeleton, transport + deploy design
Lays down the design surface for the CIS490 behavioral-malware-detection
dataset and model. No code yet — schema and topology are decided first so
collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine,
  deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature
  split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/,
training/ (each with a short README explaining purpose).
Extended .gitignore to exclude qcow2 images, pcaps, sample binaries,
secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00


# Lab Setup
How to bring up the host, build the guest, and verify the snapshot loop.
## Host prerequisites
```
qemu-system-x86_64 >= 8.0
qemu-img >= 8.0
bridge-utils
tcpdump / tshark
linux-tools-common (for perf)
zstd
python >= 3.11
uv (https://github.com/astral-sh/uv)
```
`scripts/install-lab-host.sh` installs all of these and wires up systemd —
see [`deploy.md`](deploy.md).
KVM must be enabled in the kernel and the user must be in the `kvm` group:
```
ls /dev/kvm # must exist
groups # must include kvm
```
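The same two checks can be scripted so the orchestrator refuses to start on a misconfigured host. This is a minimal sketch, not part of the repo: the function name `kvm_problems` and its messages are hypothetical.

```python
import grp
import os
from pathlib import Path


def kvm_problems(dev: Path = Path("/dev/kvm"), group: str = "kvm") -> list[str]:
    """Return human-readable problems; an empty list means the host looks ready."""
    problems = []
    if not dev.exists():
        problems.append(f"{dev} is missing (KVM not enabled or module not loaded)")
    elif not os.access(dev, os.R_OK | os.W_OK):
        problems.append(f"{dev} exists but is not readable/writable by this user")
    try:
        gid = grp.getgrnam(group).gr_gid
    except KeyError:
        problems.append(f"group {group!r} does not exist")
    else:
        # Like `groups`, this reflects the current session: a freshly added
        # membership only shows up after logging in again.
        if gid not in os.getgroups() and gid != os.getegid():
            problems.append(f"current user is not in group {group!r}")
    return problems
```

Calling `kvm_problems()` with no arguments checks the real device and group; any returned string is a reason to stop before booting a guest.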
## Network: host-only malware bridge
`br-malware` (10.200.0.1/24) is the only network the guest sees, and it is
host-only — no NAT, no upstream route. The host's WG interface is on a
*separate* link (`wg0`) used only for shipping completed episodes to the
collector; the bridge and WG never touch.
| Interface | Purpose |
|---|---|
| `br-malware` (10.200.0.1/24) | host-only bridge, only NIC attached to the guest |
| guest `eth0` | DHCP from a dnsmasq bound only to `br-malware` |
| host WG (`wg0`) | shipping channel to the collector — not connected to the bridge |
> Detailed firewall rules and the egress-drop safety net are out of scope for
> this document and live in the deploy script. The relevant invariant for
> readers is: **the guest cannot route off `br-malware`, period.**
## Guest: Metasploitable 2
1. Download from the [Rapid7 mirror](https://information.rapid7.com/download-metasploitable-2017.html)
   (verify sha256 against the published value before use).
2. Convert VMware → qcow2:
   ```
   qemu-img convert -O qcow2 -p Metasploitable.vmdk metasploitable2.qcow2
   ```
3. First boot (no snapshot yet): let it come up, log in (msfadmin/msfadmin),
   confirm services are listening on the expected ports, shut down cleanly.
4. Take the baseline snapshot:
   ```
   qemu-img snapshot -c baseline-v1 metasploitable2.qcow2
   ```
   Because the guest is shut down at this point, this is a disk-only internal
   snapshot. Internal qcow2 snapshots load in well under a second, which makes
   this the "factory reset" mechanism for every episode.
## Single-vCPU constrained-device emulation
```
-cpu host -smp 1,sockets=1,cores=1,threads=1
-m 512
-machine type=q35,accel=kvm
```
These flags are paired with a host-side cgroup CPU cap on the QEMU process
(e.g. 80% of one core), so the guest behaves like a small, constrained device
under load.
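One way to apply such a cap is to launch QEMU inside a transient systemd scope with `CPUQuota=`. The sketch below only assembles the command line. Using `systemd-run` is an assumption (the doc only says "cgroup CPU cap"), and `capped_qemu_argv` and `extra_args` are hypothetical names; the disk, serial, and QMP flags would be passed in through `extra_args`.

```python
def capped_qemu_argv(extra_args: list[str], cpu_quota_pct: int = 80) -> list[str]:
    """Build an argv that runs QEMU under a cgroup CPU cap via systemd-run."""
    qemu = [
        "qemu-system-x86_64",
        "-cpu", "host",
        "-smp", "1,sockets=1,cores=1,threads=1",
        "-m", "512",
        "-machine", "type=q35,accel=kvm",
        *extra_args,  # disk, virtio-serial, QMP flags supplied by the caller
    ]
    # CPUQuota=80% limits the scope to 80% of one CPU's time.
    return [
        "systemd-run", "--scope",
        "-p", f"CPUQuota={cpu_quota_pct}%",
        "--", *qemu,
    ]
```

The returned list can be handed to `subprocess.Popen` unchanged; keeping it a pure function makes the exact flags easy to log alongside each episode.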
## Telemetry channels
### virtio-serial for the in-guest agent
```
-device virtio-serial-pci
-chardev socket,path=/run/qemu/guest-agent.sock,server=on,wait=off,id=ga
-device virtserialport,chardev=ga,name=cis490.guest.agent
```
The in-guest agent opens `/dev/virtio-ports/cis490.guest.agent` and writes
JSONL to it. On the host side, the orchestrator reads from the unix socket. The
channel never touches the network, so nothing the malware does on `br-malware`
can interfere with it.
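A host-side reader for this channel could look like the sketch below. It assumes only what is stated above (a unix stream socket carrying one JSON object per line); the function names are hypothetical and the row contents are whatever the agent emits.

```python
import json
import socket
from collections.abc import Iterator


def connect_agent(path: str = "/run/qemu/guest-agent.sock") -> socket.socket:
    """Connect to the chardev socket that QEMU listens on."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def iter_agent_rows(sock: socket.socket) -> Iterator[dict]:
    """Yield one dict per JSONL line received on an already-connected socket."""
    with sock.makefile("rb") as stream:
        for raw in stream:
            line = raw.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                # The guest is untrusted: skip garbage instead of crashing.
                continue
            if isinstance(row, dict):
                yield row
```

Iteration ends when QEMU closes the socket, which gives the orchestrator a natural end-of-episode signal for this stream.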
### QMP for live oracle queries
```
-qmp unix:/run/qemu/qmp.sock,server=on,wait=off
```
The orchestrator polls `query-stats`, `query-blockstats`, and netdev stats over
this socket.
### perf stat on the QEMU process
```
perf stat -p <qemu_pid> -I 100 \
  -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,context-switches \
  -x , -o telemetry-perf.csv
```
The collector tails the CSV, parses, and emits JSONL.
### tcpdump on `br-malware`
```
tcpdump -i br-malware -w network.pcap -B 4096 -s 200
```
`-s 200` keeps only the first 200 bytes of each packet. The capture is
post-processed into `netflow.jsonl`, aggregated into 100 ms buckets.
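The bucketing itself is simple integer arithmetic on packet timestamps. The sketch below assumes the pcap has already been decoded into `(timestamp_us, wire_length)` pairs by some reader; the function name and the row keys (`t_ms`, `packets`, `bytes`) are hypothetical, not the project's schema.

```python
from collections import defaultdict
from collections.abc import Iterable


def bucket_packets(
    packets: Iterable[tuple[int, int]], bucket_ms: int = 100
) -> list[dict]:
    """Aggregate (timestamp in microseconds, length in bytes) into fixed buckets.

    Use the original on-wire length from the pcap record header, not the
    captured length, because the capture truncates packets with -s 200.
    """
    bucket_us = bucket_ms * 1000
    totals: dict[int, list[int]] = defaultdict(lambda: [0, 0])
    for ts_us, length in packets:
        slot = totals[ts_us // bucket_us]  # integer math avoids float drift
        slot[0] += 1
        slot[1] += length
    return [
        {"t_ms": idx * bucket_ms, "packets": count, "bytes": size}
        for idx, (count, size) in sorted(totals.items())
    ]
```

Empty buckets are omitted here; whether idle intervals should be emitted as explicit zero rows is a schema decision left to `data-model.md`.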
## Snapshot loop sanity check
A green light before any data collection:
1. `qemu-img snapshot -l metasploitable2.qcow2` shows `baseline-v1`.
2. Boot the VM with the qcow2.
3. Touch a file in the guest. Shut down.
4. `qemu-img snapshot -a baseline-v1 metasploitable2.qcow2`.
5. Boot again. The file is gone. ✅
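Steps 1 and 4 can be automated so that every episode starts from a verified baseline. This sketch reads the snapshot list from `qemu-img info --output=json` (whose `snapshots` array carries each snapshot's `name`) rather than scraping the table printed by `snapshot -l`; the helper names are hypothetical.

```python
import json
import subprocess


def snapshot_tags(info: dict) -> list[str]:
    """Extract snapshot names from parsed `qemu-img info --output=json`."""
    return [snap["name"] for snap in info.get("snapshots", [])]


def image_info(image: str) -> dict:
    """Run qemu-img info on an image that is not in use by a running VM."""
    result = subprocess.run(
        ["qemu-img", "info", "--output=json", image],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def revert_to(image: str, tag: str = "baseline-v1") -> None:
    """Revert the disk to `tag`, refusing to run if the snapshot is missing."""
    if tag not in snapshot_tags(image_info(image)):
        raise RuntimeError(f"{image} has no snapshot named {tag!r}")
    subprocess.run(["qemu-img", "snapshot", "-a", tag, image], check=True)
```

`revert_to("metasploitable2.qcow2")` must only be called while the guest is shut down, matching the order of the manual steps.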
## Safety checks before running real samples
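- `ip route show table all | grep br-malware` lists only connected and local
  entries for the bridge subnet, with no `default` and no `via` route.
- `dig @host example.com` from the guest, with `host` set to the bridge address
  `10.200.0.1`, fails (no DNS for malware).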
- The host's WG interface is **not** bridged to `br-malware`.
(See `scripts/install-lab-host.sh` for the firewall plumbing — it isn't the
focus of this project.)
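The routing check is easy to turn into a pre-flight assertion. The sketch below scans the text output of `ip route show table all` and flags any bridge entry that is a default route or goes through a gateway. It mirrors only the first bullet; it does not validate the firewall rules, and `offbridge_routes` is a hypothetical name.

```python
def offbridge_routes(ip_route_output: str, bridge: str = "br-malware") -> list[str]:
    """Return routes on `bridge` that would let traffic leave the bridge subnet."""
    suspicious = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if bridge not in tokens:
            continue  # route belongs to another interface
        # Connected and local entries have neither a gateway nor "default".
        if tokens[0] == "default" or "via" in tokens:
            suspicious.append(line.strip())
    return suspicious
```

An empty result corresponds to the "no route off the bridge" condition; anything else should abort the run before a sample is executed.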
## Where to put VMs and snapshots
```
vm/images/ # qcow2 disk images (gitignored)
vm/snapshots/ # named snapshot exports if we ever externalize them
```
Both directories are gitignored. The repo only carries the *recipes* for
reproducing them.