Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts

This is the chunk that makes "real data" actually flow on multiple hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now the lab-host side has the diversity + concurrency it needs.

Collectors landed:
- collectors/qmp.py: source 2 (oracle). Tiny synchronous QMP client + row builder + run loop. Tolerates older qemu without query-stats.
- collectors/guest_agent.py: source 5 (deployable). Reads the virtio-serial host-side socket, parses agent JSON-lines, re-stamps to the host monotonic clock, persists.
- collectors/pcap.py: source 4 (deployable). tcpdump capture + pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md.

In-guest agent:
- vm/guest-agent/cis490_agent.py: stdlib-only Python agent. Reads /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs, thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
- tools/build_cidata.py: embeds the agent + an OpenRC service into user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
- vm/launch_demo.sh / launch_target.sh: second virtio-serial port for the agent socket; SLOT env support so multiple VMs run without socket / port collisions; PORT_BASE on launch_target so multiple target VMs hostfwd different host ports.
- vm/setup_bridge.sh: creates host-only br-malware (10.200.0.1/24, no NAT). Idempotent.

Fleet:
- orchestrator/fleet.py: capacity detector (cores / RAM / load headroom) + concurrent-slot runner. Per-slot ENV selects the sample. FleetCapacity dataclass round-trips into meta.json so "this episode ran with 6 concurrent VMs" is auditable post-hoc.
- tools/run_fleet.py: CLI: --capacity report; --waves N runs N waves of (max_concurrent) episodes each, every slot with a different sample.
- etc/cis490-orchestrator.service: now drives the fleet runner with Restart=always so each invocation runs one wave and respawns, giving a continuous stream.

Samples:
- samples/manifest.toml: six profiles spanning the five major behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
- samples/manifest.py: strict TOML loader (rejects dups, unknown categories) + deterministic select(host_id, slot, episode_index) so different hosts on the network walk the catalog in different orders without any coordinator.

EpisodeRunner:
- orchestrator/episode.py: optional qmp_socket + guest_agent_socket fields on EpisodeConfig; when set, additional collector threads run alongside proc_qemu. EpisodeResult now carries rows_qmp + rows_guest counters.

Tier-3 setup automation:
- scripts/install-msfrpcd.sh: installs metasploit-framework where the package manager has it, generates a strong password into /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to 127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch once MSFRPC_PASSWORD is sourced.
- scripts/fetch-metasploitable2.sh: accepts IMAGE_URL + IMAGE_SHA256 from the operator (Rapid7 download is registration-walled), pulls, verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
- tests/test_qmp.py: fake QMP server, capability handshake, blockstats, async-event interleaving, 5-failure backoff
- tests/test_guest_agent.py: fake virtio socket, JSON-lines read + re-stamp, malformed-line tolerance
- tests/test_pcap.py: synthetic pcap with TCP/UDP/ARP frames, bucketize correctness across windows
- tests/test_fleet.py: capacity math (8-core idle / low-RAM / high-load / Pi5 / 1-core box), manifest selection determinism + diversity

What's queued for the next commit (already discussed in convo):
- MSFExploitDriver v2: map sample.profile → distinct in-session workload so Tier-3 episodes don't all produce the same yes-loop envelope. Critical for ML to learn varied malware shapes.
- Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:02:27 -05:00
commit 1b6c7b2f4a
parent 2579683efb
22 changed files with 2825 additions and 40 deletions
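
The commit message describes a deterministic `select(host_id, slot, episode_index)` in samples/manifest.py, which is not part of the hunks shown here. As a rough sketch of how coordinator-free selection of that kind could work (the function name `select_index`, the SHA-256 keying, and the offset-by-slot scheme are all assumptions, not the committed implementation):

```python
import hashlib


def select_index(host_id: str, slot: int, episode_index: int, n_samples: int) -> int:
    """Hypothetical helper: map (host, slot, episode) to a catalog index.

    The base offset depends only on host and episode, so each host walks
    the catalog in its own order; adding the slot number spreads the
    concurrent slots of one wave across distinct samples (as long as the
    wave has no more slots than the catalog has entries).
    """
    if n_samples <= 0:
        raise ValueError("manifest is empty")
    key = f"{host_id}\x00{episode_index}".encode("utf-8")
    base = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return (base + slot) % n_samples


# Six slots against a six-entry catalog: every slot gets a different sample.
picks = [select_index("lab-a", slot, 0, 6) for slot in range(6)]
```

Because the result is a pure function of its arguments, any host can recompute which sample an episode used without consulting shared state.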
--- a/README.md
+++ b/README.md
@@ -94,15 +94,19 @@ tools/show_envelope.sh data/episodes/<episode_id>

 ## Status

-- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
+- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — running on Pi5 via Caddy + mTLS (wg-pki client CA)
 - ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
-- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
+- ✅ Host /proc oracle collector (source 1) @ 10 Hz
+- ✅ **QMP collector** (source 2) — query-status / query-blockstats / query-stats, 1 Hz
+- ✅ **Bridge pcap** (source 4) — pure-Python pcap parser + 100 ms-bucketed netflow.jsonl
+- ✅ **In-guest agent** (source 5) — virtio-serial; cidata-embedded for first-boot install on Alpine; host-side reader re-stamps to host clock
 - ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
-- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
+- ✅ Real VM (Alpine 3.21 cloud-init under KVM)
 - ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
-- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed (`exploits/`); end-to-end run against a live `msfrpcd` + Metasploitable2 image still pending.
-- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
-- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
+- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed; `scripts/install-msfrpcd.sh` automates msfrpcd setup; `scripts/fetch-metasploitable2.sh` pulls + verifies the target image (URL+sha256 from operator). Driver v2 (sample-profile-driven workloads) is the next step for ML diversity.
+- ✅ **Shipper** — lab-host ↔ Pi receiver via tar+zstd PUT over WG with mTLS; `--ping` smoke mode
+- ✅ **Fleet runner** — host-capacity-aware concurrency (`tools/run_fleet.py`); resource detector reserves cores + RAM headroom; sample manifest with deterministic per-(host, slot, episode) selection so every host on the network produces *novel, varied, labeled* data
+- ✅ **Sample manifest** — six initial profiles (cryptominer / botnet / ransomware / banking-trojan / fileless / RAT). Real-malware fetch from MalwareBazaar is the Tier-4 follow-up.

 > **Topology note:** in this project the **Pi5 is the WireGuard-side
 > *collector*** that receives episode tarballs from one or more lab hosts.
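
The fleet-runner entry in the README hunk above mentions a resource detector that reserves cores and RAM headroom; orchestrator/fleet.py itself is not shown in this part of the diff. The sketch below illustrates the general idea only. The `Capacity` dataclass, the `detect` function, and every reserve value are assumptions made for illustration, not the committed `FleetCapacity` logic:

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Capacity:
    """Hypothetical stand-in for a capacity record written to meta.json."""
    cores: int
    ram_mb: int
    load_1m: float
    max_concurrent: int


def detect(cores: int, ram_mb: int, load_1m: float, *,
           reserve_cores: int = 1, reserve_ram_mb: int = 1024,
           ram_per_vm_mb: int = 512) -> Capacity:
    """Pick a slot count bounded by both CPU and RAM headroom.

    All reserve values are illustrative assumptions. The result is
    floored at one slot so that a small box still produces data.
    """
    by_cpu = max(0, cores - reserve_cores - math.ceil(load_1m))
    by_ram = max(0, (ram_mb - reserve_ram_mb) // ram_per_vm_mb)
    return Capacity(cores, ram_mb, load_1m, max(1, min(by_cpu, by_ram)))


idle_8core = detect(8, 16384, 0.0)
meta_fragment = asdict(idle_8core)  # plain dict, ready for json.dumps
```

Keeping the record a frozen dataclass makes the dict round trip lossless, which is what allows the concurrency level of an episode to be audited afterwards.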
--- a/collectors/guest_agent.py
+++ b/collectors/guest_agent.py
@@ -0,0 +1,119 @@
+"""Source 5 (feature, deployable): in-guest agent reader.
+
+QEMU exposes a virtio-serial channel two ways:
+  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
+  - on the host:      a unix socket at ``$RUN_DIR/agent.sock``
+
+The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
+JSON-lines row per tick into the guest-side device. Bytes traverse the
+virtio bus and surface on the host socket. This collector reads them,
+re-stamps with the host's monotonic clock (so rows align with all
+other telemetry on a single timeline), and persists to
+``telemetry-guest.jsonl``.
+
+Why re-stamp? The agent's clock is the *guest* clock, which can drift
+from the host (rare in KVM, but happens during live-migration tests
+and on heavy host load). The original guest timestamps stay in the row
+under ``t_guest_*`` so analysts can quantify drift if they care.
+
+This source is the **deployable** side: every row is tagged
+``available_in_deployment: true``. See docs/threat-model.md.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import socket
+import threading
+import time
+from pathlib import Path
+
+
+log = logging.getLogger("cis490.collectors.guest_agent")
+
+SOURCE = "guest_agent"
+AVAILABLE_IN_DEPLOYMENT = True
+
+
+def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
+    deadline = time.monotonic() + timeout_s
+    last_err: OSError | None = None
+    while time.monotonic() < deadline:
+        try:
+            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+            s.settimeout(2.0)
+            s.connect(str(socket_path))
+            return s
+        except OSError as e:
+            last_err = e
+            time.sleep(0.5)
+    if last_err is not None:
+        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
+    return None
+
+
+def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
+    """Replace the agent's wall-only timestamps with host-clock ones,
+    keeping the originals under ``t_guest_*`` for drift analysis."""
+    out = dict(row)
+    out.setdefault("t_guest_mono_ns", row.get("t_guest_mono_ns"))
+    out.setdefault("t_guest_wall_ns", row.get("t_guest_wall_ns"))
+    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
+    out["t_wall_ns"] = time.time_ns()
+    out.setdefault("source", SOURCE)
+    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
+    return out
+
+
+def run_loop(
+    socket_path: str | Path,
+    output_path: Path,
+    t_mono_origin_ns: int,
+    stop_event: threading.Event,
+    *,
+    connect_timeout_s: float = 30.0,
+) -> int:
+    """Read agent JSON-lines from the host-side virtio-serial unix
+    socket. Re-stamp each row with the host clock and persist."""
+    sock_path = Path(socket_path)
+    sock = _connect(sock_path, connect_timeout_s)
+    if sock is None:
+        return 0
+
+    rows = 0
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    buf = b""
+    try:
+        with output_path.open("a", buffering=1) as f:
+            while not stop_event.is_set():
+                try:
+                    sock.settimeout(0.5)
+                    chunk = sock.recv(8192)
+                except socket.timeout:
+                    continue
+                except OSError as e:
+                    log.warning("guest-agent recv failed: %s", e)
+                    break
+                if not chunk:
+                    log.info("guest-agent socket closed")
+                    break
+                buf += chunk
+                while b"\n" in buf:
+                    line, _, buf = buf.partition(b"\n")
+                    line = line.strip()
+                    if not line:
+                        continue
+                    try:
+                        row = json.loads(line)
+                    except ValueError as e:  # JSONDecodeError, or UnicodeDecodeError on non-UTF-8 bytes
+                        log.warning("dropping malformed guest-agent line: %s", e)
+                        continue
+                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
+                    rows += 1
+    finally:
+        try:
+            sock.close()
+        except OSError:
+            pass
+    return rows
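
The newline framing used by `run_loop` above can be exercised without a VM. The following is a minimal sketch, assuming a `socket.socketpair()` as a stand-in for the virtio-serial unix socket; `read_json_lines` is a hypothetical helper written for this illustration and is not part of the commit. It catches `ValueError` so that both malformed JSON and undecodable bytes are dropped rather than raised:

```python
import json
import socket


def read_json_lines(sock: socket.socket) -> tuple[list, int]:
    """Read newline-delimited JSON until EOF; return (rows, dropped)."""
    buf = b""
    rows: list = []
    dropped = 0
    while True:
        chunk = sock.recv(8192)
        if not chunk:  # peer closed the stream
            break
        buf += chunk
        # A chunk may end mid-line, so only complete lines are parsed;
        # the remainder stays in buf until the next recv.
        while b"\n" in buf:
            line, _, buf = buf.partition(b"\n")
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
                dropped += 1
    return rows, dropped


host_side, guest_side = socket.socketpair()
guest_side.sendall(b'{"cpu": 1}\n{"cp')                      # second row split mid-line
guest_side.sendall(b'u": 2}\nnot json\n{"cpu": 3}\n')
guest_side.close()
rows, dropped = read_json_lines(host_side)
host_side.close()
```

Here the second row arrives split across two writes and is still reassembled, while the `not json` line is counted and skipped.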
--- a/collectors/pcap.py
+++ b/collectors/pcap.py
@@ -0,0 +1,288 @@
+"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.
+
+Captures packets on the host-only ``br-malware`` bridge during an
+episode, writes the raw pcap, and produces a bucketed JSONL file the
+trainer can consume directly.
+
+The capture is **gateway-side** — the orchestrator sees the same
+packets a real upstream router/gateway would see in deployment, so
+features derived here transfer 1:1 to the deployment-time gateway
+observer.
+
+Implementation:
+
+  - ``run_capture()`` spawns ``tcpdump -i <bridge> -U -w <out.pcap>``
+    as a subprocess for the episode duration. ``-U`` flushes per
+    packet so the file is consumable mid-flight.
+
+  - ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
+    rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
+    dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
+    the schema in docs/data-model.md.
+
+The pure-Python parser is intentionally minimal — it does NOT do
+fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
+the cases that occur on a host-only bridge for malware behaviour:
+plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
+the byte/packet level but skipped for protocol-specific stats.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import struct
+import subprocess
+import threading
+import time
+from collections import defaultdict
+from dataclasses import dataclass
+from pathlib import Path
+
+
+log = logging.getLogger("cis490.collectors.pcap")
+
+SOURCE = "bridge_pcap"
+AVAILABLE_IN_DEPLOYMENT = True
+
+# Pcap file-level header
+_PCAP_GLOBAL_HDR = "<IHHiIII"
+_PCAP_GLOBAL_HDR_SIZE = 24
+_PCAP_REC_HDR = "<IIII"
+_PCAP_REC_HDR_SIZE = 16
+_PCAP_MAGIC_USEC = 0xa1b2c3d4
+_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant
+
+
+# ---------------------------------------------------------------------------
+# Capture
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class CaptureHandle:
+    proc: subprocess.Popen
+    pcap_path: Path
+    bridge: str
+    started_mono_ns: int
+
+
+def run_capture(
+    *,
+    bridge: str,
+    pcap_path: Path,
+    snaplen: int = 256,
+    bpf: str | None = None,
+) -> CaptureHandle:
+    """Start a tcpdump capture on ``bridge``. Returns a handle the
+    caller stops via ``stop_capture()``."""
+    pcap_path.parent.mkdir(parents=True, exist_ok=True)
+    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
+    if bpf:
+        args.append(bpf)
+    log.info("starting pcap: %s", " ".join(args))
+    proc = subprocess.Popen(
+        args,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
+    )
+    return CaptureHandle(
+        proc=proc, pcap_path=pcap_path, bridge=bridge,
+        started_mono_ns=time.monotonic_ns(),
+    )
+
+
+def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
+    """SIGINT tcpdump (the Right Signal — flushes buffers + exits 0).
+    Returns the process exit code."""
+    proc = handle.proc
+    if proc.poll() is None:
+        proc.send_signal(2)  # SIGINT
+        try:
+            proc.wait(timeout=timeout_s)
+        except subprocess.TimeoutExpired:
+            proc.kill()
+            proc.wait(timeout=timeout_s)
+    return proc.returncode
+
+
+# ---------------------------------------------------------------------------
+# Pure-Python pcap parser
+# ---------------------------------------------------------------------------
+
+
+def _iter_pcap(path: Path):
+    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
+    file. Tolerates either microsecond or nanosecond magics."""
+    with path.open("rb") as f:
+        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
+        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
+            return
+        magic = struct.unpack("<I", hdr[:4])[0]
+        if magic == _PCAP_MAGIC_USEC:
+            sub_mult = 1000  # us → ns
+        elif magic == _PCAP_MAGIC_NSEC:
+            sub_mult = 1
+        else:
+            log.warning("unknown pcap magic %#x in %s", magic, path)
+            return
+        while True:
+            rec = f.read(_PCAP_REC_HDR_SIZE)
+            if len(rec) < _PCAP_REC_HDR_SIZE:
+                return
+            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
+            data = f.read(caplen)
+            if len(data) < caplen:
+                return
+            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
+            yield t_ns, data
+
+
+def _decode(frame: bytes) -> dict:
+    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
+    protocols return only the ethertype + lengths."""
+    out: dict = {"size": len(frame)}
+    if len(frame) < 14:
+        return out
+    ethertype = struct.unpack(">H", frame[12:14])[0]
+    out["ethertype"] = ethertype
+    if ethertype != 0x0800:  # not IPv4 — count, don't decode further
+        return out
+    ip = frame[14:]
+    if len(ip) < 20:
+        return out
+    ihl = (ip[0] & 0x0F) * 4
+    if ihl < 20 or len(ip) < ihl:
+        return out
+    proto = ip[9]
+    src = ip[12:16]
+    dst = ip[16:20]
+    out["ip_proto"] = proto
+    out["src_ip"] = ".".join(str(b) for b in src)
+    out["dst_ip"] = ".".join(str(b) for b in dst)
+    payload = ip[ihl:]
+    if proto == 6 and len(payload) >= 20:  # TCP
+        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
+        flags = off_flags & 0x003F
+        out["src_port"] = sport
+        out["dst_port"] = dport
+        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
+    elif proto == 17 and len(payload) >= 8:  # UDP
+        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
+        out["src_port"] = sport
+        out["dst_port"] = dport
+    return out
+
+
+def bucketize(
+    pcap_path: Path,
+    netflow_path: Path,
+    *,
+    bucket_ms: int = 100,
+    t_mono_origin_ns: int = 0,
+    bridge_ip: str = "10.200.0.1",
+) -> int:
+    """Read a pcap and emit one row per ``bucket_ms`` window into
+    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
+    perspective (host = ``bridge_ip``):
+
+      out = packet whose src is the host-side address (host → guest)
+      in  = anything else seen on the bridge (guest → host or
+            guest-to-guest)
+
+    Returns the number of rows written."""
+    if not pcap_path.exists():
+        return 0
+    bucket_ns = bucket_ms * 1_000_000
+    netflow_path.parent.mkdir(parents=True, exist_ok=True)
+
+    rows = 0
+    bucket_start: int | None = None
+    agg: dict = _empty_bucket()
+    with netflow_path.open("a", buffering=1) as out:
+        for t_pkt_ns, frame in _iter_pcap(pcap_path):
+            d = _decode(frame)
+            # Establish first bucket origin on first packet.
+            if bucket_start is None:
+                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
+            while t_pkt_ns >= bucket_start + bucket_ns:
+                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
+                rows += 1
+                agg = _empty_bucket()
+                bucket_start += bucket_ns
+            _accumulate(agg, d, bridge_ip)
+        if bucket_start is not None and any(v for v in agg.values() if v):
+            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
+            rows += 1
+    return rows
+
+
+def _empty_bucket() -> dict:
+    return {
+        "pkts_in": 0, "pkts_out": 0,
+        "bytes_in": 0, "bytes_out": 0,
+        "syn_count": 0, "fin_count": 0, "rst_count": 0,
+        "udp_count": 0, "tcp_count": 0,
+        "dns_query_count": 0,
+        "dst_ips": set(), "dst_ports": set(),
+        "tcp_new_flows": 0,
+    }
+
+
+def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
+    sz = d.get("size", 0)
+    is_out = d.get("src_ip") == bridge_ip
+    if is_out:
+        agg["pkts_out"] += 1
+        agg["bytes_out"] += sz
+    else:
+        agg["pkts_in"] += 1
+        agg["bytes_in"] += sz
+
+    proto = d.get("ip_proto")
+    if proto == 6:
+        agg["tcp_count"] += 1
+        flags = d.get("tcp_flags", 0)
+        if flags & 0x02:  # SYN
+            agg["syn_count"] += 1
+            if not (flags & 0x10):  # SYN without ACK = new flow
+                agg["tcp_new_flows"] += 1
+        if flags & 0x01:
+            agg["fin_count"] += 1
+        if flags & 0x04:
+            agg["rst_count"] += 1
+    elif proto == 17:
+        agg["udp_count"] += 1
+        if d.get("dst_port") == 53:
+            agg["dns_query_count"] += 1
+
+    dst = d.get("dst_ip")
+    if dst:
+        agg["dst_ips"].add(dst)
+    dport = d.get("dst_port")
+    if dport is not None:
+        agg["dst_ports"].add(dport)
+
+
+def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
+    row = {
+        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
+        "t_wall_ns": bucket_start_ns,
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "bucket_ms": bucket_ns // 1_000_000,
+        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
+        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
+        "syn_count": agg["syn_count"],
+        "fin_count": agg["fin_count"],
+        "rst_count": agg["rst_count"],
+        "udp_count": agg["udp_count"],
+        "tcp_count": agg["tcp_count"],
+        "dns_query_count": agg["dns_query_count"],
+        "unique_dst_ips": len(agg["dst_ips"]),
+        "unique_dst_ports": len(agg["dst_ports"]),
+        "tcp_new_flows": agg["tcp_new_flows"],
+    }
+    out.write(json.dumps(row) + "\n")
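
The record layout and bucket arithmetic used by `_iter_pcap` and `bucketize` above can be checked against an in-memory capture. This is a small sketch under stated assumptions: `build_pcap` and `bucket_counts` are hypothetical helpers for illustration, only the little-endian microsecond format is handled, and only non-empty buckets are counted (the committed `bucketize` also flushes empty buckets between packets):

```python
import io
import struct

GLOBAL_HDR = "<IHHiIII"  # magic, ver major, ver minor, thiszone, sigfigs, snaplen, linktype
REC_HDR = "<IIII"        # ts_sec, ts_usec, caplen, origlen


def build_pcap(packets) -> bytes:
    """Serialize (ts_sec, ts_usec, frame) tuples as a microsecond pcap."""
    out = io.BytesIO()
    out.write(struct.pack(GLOBAL_HDR, 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1))
    for ts_sec, ts_usec, frame in packets:
        out.write(struct.pack(REC_HDR, ts_sec, ts_usec, len(frame), len(frame)))
        out.write(frame)
    return out.getvalue()


def bucket_counts(blob: bytes, bucket_ms: int = 100) -> dict[int, int]:
    """Count packets per bucket, keyed by bucket start time in ns."""
    bucket_ns = bucket_ms * 1_000_000
    counts: dict[int, int] = {}
    pos = struct.calcsize(GLOBAL_HDR)
    rec_size = struct.calcsize(REC_HDR)
    while pos + rec_size <= len(blob):
        ts_sec, ts_usec, caplen, _ = struct.unpack_from(REC_HDR, blob, pos)
        pos += rec_size + caplen
        if pos > len(blob):  # truncated final record
            break
        t_ns = ts_sec * 1_000_000_000 + ts_usec * 1000
        start = t_ns - (t_ns % bucket_ns)  # floor to the bucket boundary
        counts[start] = counts.get(start, 0) + 1
    return counts


frame = bytes(60)  # 60 zero bytes standing in for an Ethernet frame
blob = build_pcap([(10, 0, frame), (10, 99_999, frame),
                   (10, 100_000, frame), (10, 350_000, frame)])
counts = bucket_counts(blob)
```

The first two packets land in the same 100 ms window, the packet at exactly 10.100000 s opens the next window, and the packet at 10.35 s floors to the window starting at 10.3 s.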
--- a/collectors/qmp.py
+++ b/collectors/qmp.py
@@ -0,0 +1,244 @@
+"""Source 2 (oracle): QEMU QMP sampler.
+
+Connects to the QEMU monitor protocol socket exposed by the launcher
+($RUN_DIR/qmp.sock) and periodically queries the hypervisor for
+per-VM stats that don't show up in /proc/<qemu_pid>:
+
+  - per-disk block I/O (rd_bytes, wr_bytes, rd_ops, wr_ops)
+  - VM run state (running / paused / shutdown)
+  - per-netdev tx/rx counters (when available)
+  - KVM stat counters (when available; introspection differs by qemu
+    version, so anything we can't read is skipped silently)
+
+This source is **oracle-only** — it does not exist on a deployed
+device. Every row carries ``available_in_deployment: false``.
+
+Wire format: QMP is line-delimited JSON. The handshake is fixed:
+
+    server  → {"QMP": {capabilities: [...], version: ...}}
+    client  → {"execute": "qmp_capabilities"}
+    server  → {"return": {}}
+    (client may now issue commands)
+
+We use a dedicated synchronous client because QMP is request/response
+and we don't need pipelining; one query batch per tick keeps the
+on-disk schema simple.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import socket
+import threading
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+
+log = logging.getLogger("cis490.collectors.qmp")
+
+SOURCE = "host_qmp"
+AVAILABLE_IN_DEPLOYMENT = False
+
+
+class QMPError(RuntimeError):
+    pass
+
+
+@dataclass
+class _SockReader:
+    sock: socket.socket
+    buf: bytes = b""
+
+    def read_line(self, timeout_s: float = 5.0) -> str:
+        deadline = time.monotonic() + timeout_s
+        while b"\n" not in self.buf:
+            self.sock.settimeout(max(0.1, deadline - time.monotonic()))
+            try:
+                chunk = self.sock.recv(8192)
+            except socket.timeout as e:
+                raise QMPError(f"QMP read timed out: {e}") from e
+            if not chunk:
+                raise QMPError("QMP connection closed by peer")
+            self.buf += chunk
+        line, _, rest = self.buf.partition(b"\n")
+        self.buf = rest
+        return line.decode("utf-8", errors="replace")
+
+
+class QMPClient:
+    """Tiny synchronous QMP client over a unix socket."""
+
+    def __init__(self, socket_path: str | Path) -> None:
+        self.path = str(socket_path)
+        self._sock: socket.socket | None = None
+        self._reader: _SockReader | None = None
+
+    def connect(self, timeout_s: float = 5.0) -> dict[str, Any]:
+        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        s.settimeout(timeout_s)
+        s.connect(self.path)
+        self._sock = s
+        self._reader = _SockReader(s)
+        # Read greeting.
+        greeting = json.loads(self._reader.read_line(timeout_s=timeout_s))
+        if "QMP" not in greeting:
+            raise QMPError(f"unexpected QMP greeting: {greeting!r}")
+        # Negotiate capabilities (no flags requested).
+        self.execute("qmp_capabilities")
+        return greeting["QMP"]
+
+    def execute(self, command: str, **arguments: Any) -> Any:
+        if self._sock is None or self._reader is None:
+            raise QMPError("not connected")
+        msg: dict[str, Any] = {"execute": command}
+        if arguments:
+            msg["arguments"] = arguments
+        body = (json.dumps(msg) + "\n").encode("utf-8")
+        self._sock.sendall(body)
+        # QMP can interleave async events with the response — drain
+        # until we see the matching {"return": ...} or {"error": ...}.
+        for _ in range(64):  # bounded to avoid an infinite loop on bugs
+            line = self._reader.read_line()
+            if not line.strip():
+                continue
+            resp = json.loads(line)
+            if "return" in resp:
+                return resp["return"]
+            if "error" in resp:
+                raise QMPError(f"{command}: {resp['error']}")
+            # Otherwise it's an async event; ignore and keep reading.
+        raise QMPError(f"{command}: too many async events without a response")
+
+    def close(self) -> None:
+        if self._sock is not None:
+            try:
+                self._sock.close()
+            except OSError:
+                pass
+            self._sock = None
+            self._reader = None
+
+
+# ---- row builders ----------------------------------------------------------
+
+
+def _flatten_blockstats(blockstats: list[dict] | None) -> dict[str, dict[str, int]]:
+    """Compact ``query-blockstats`` to ``{device: {rd_ops, wr_ops, ...}}``."""
+    out: dict[str, dict[str, int]] = {}
+    for entry in blockstats or []:
+        name = entry.get("device") or entry.get("qdev") or "unknown"
+        s = entry.get("stats") or {}
+        out[name] = {
+            "rd_ops": int(s.get("rd_operations", 0)),
+            "wr_ops": int(s.get("wr_operations", 0)),
+            "rd_bytes": int(s.get("rd_bytes", 0)),
+            "wr_bytes": int(s.get("wr_bytes", 0)),
+            "flush_ops": int(s.get("flush_operations", 0)),
+        }
+    return out
+
+
+def collect_once(client: QMPClient, t_mono_origin_ns: int) -> dict[str, Any]:
+    row: dict[str, Any] = {
+        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
+        "t_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+    }
+
+    # query-status is dirt cheap and tells us whether the guest is
+    # paused (rare) or running.
+    try:
+        status = client.execute("query-status")
+        row["vm_status"] = status.get("status")
+        row["vm_running"] = bool(status.get("running"))
+    except QMPError as e:
+        log.debug("query-status failed: %s", e)
+
+    try:
+        bs = client.execute("query-blockstats")
+        row["blockstats"] = _flatten_blockstats(bs)
+    except QMPError as e:
+        log.debug("query-blockstats failed: %s", e)
+
+    # query-stats is QEMU 7.1+ and the schema varies across versions.
+    # We only ask for KVM stats and tolerate any subset of fields.
+    try:
+        stats = client.execute("query-stats", target="vm")
+        row["kvm_stats"] = _summarize_query_stats(stats)
+    except QMPError as e:
+        log.debug("query-stats not supported: %s", e)
+
+    return row
+
+
+def _summarize_query_stats(stats_resp: list[dict] | dict) -> dict[str, int]:
+    """Reduce ``query-stats`` to a flat name→value map of integer
+    counters. The full payload is verbose and version-specific; we only
+    ever want individual scalar counters downstream."""
+    flat: dict[str, int] = {}
+    items = stats_resp if isinstance(stats_resp, list) else [stats_resp]
+    for entry in items:
+        for s in entry.get("stats", []) or []:
+            name = s.get("name")
+            value = s.get("value")
+            if isinstance(name, str) and isinstance(value, int):
+                flat[name] = value
+    return flat
+
+
+# ---- run loop --------------------------------------------------------------
+
+
+def run_loop(
+    socket_path: str | Path,
+    output_path: Path,
+    t_mono_origin_ns: int,
+    interval_ms: int,
+    stop_event: threading.Event,
+) -> int:
+    """Connect to ``socket_path`` and sample at ``interval_ms`` until
+    ``stop_event``. Returns the number of rows written.
+
+    A single missed sample (transient QMP error) is logged and skipped;
+    repeated failures terminate the loop so the episode finishes cleanly
+    rather than hanging on a dead hypervisor."""
+    interval_ns = interval_ms * 1_000_000
+    client = QMPClient(socket_path)
+    try:
+        client.connect(timeout_s=5.0)
+    except (OSError, QMPError) as e:
+        log.warning("QMP connect to %s failed: %s — collector exits cleanly", socket_path, e)
+        return 0
+
+    rows = 0
+    consecutive_failures = 0
+    next_tick = time.monotonic_ns()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    try:
+        with output_path.open("a", buffering=1) as f:
+            while not stop_event.is_set():
+                try:
+                    row = collect_once(client, t_mono_origin_ns)
+                    f.write(json.dumps(row) + "\n")
+                    rows += 1
+                    consecutive_failures = 0
+                except (QMPError, OSError) as e:
+                    consecutive_failures += 1
+                    log.warning("QMP sample %d failed: %s", rows, e)
+                    if consecutive_failures >= 5:
+                        log.warning("5 consecutive QMP failures; bailing")
+                        break
+
+                next_tick += interval_ns
+                sleep_ns = next_tick - time.monotonic_ns()
+                if sleep_ns > 0:
+                    stop_event.wait(sleep_ns / 1_000_000_000)
+                else:
+                    next_tick = time.monotonic_ns()
+    finally:
+        client.close()
+    return rows
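
The handshake and the event-draining loop in `QMPClient.execute` above can be tried against a stub server. This is a minimal sketch, assuming a `socket.socketpair()` in place of `$RUN_DIR/qmp.sock`; `fake_qmp_server` and the free-standing `execute` function are hypothetical illustrations, and the canned replies are simplified rather than captured from a real QEMU:

```python
import json
import socket
import threading


def fake_qmp_server(conn: socket.socket) -> None:
    """Stub server: greet, then answer each command line by line."""
    f = conn.makefile("rwb")
    f.write(b'{"QMP": {"version": {}, "capabilities": []}}\n')
    f.flush()
    for raw in f:  # ends when the client closes its side
        command = json.loads(raw)["execute"]
        if command == "query-status":
            # An async event arrives before the actual response.
            f.write(b'{"event": "RESUME", "timestamp": {"seconds": 0, "microseconds": 0}}\n')
            f.write(b'{"return": {"status": "running", "running": true}}\n')
        else:
            f.write(b'{"return": {}}\n')
        f.flush()
    f.close()
    conn.close()


def execute(f, command: str):
    """Send one command and skip async events until its response."""
    f.write((json.dumps({"execute": command}) + "\n").encode("utf-8"))
    f.flush()
    for _ in range(64):  # bounded, mirroring the collector's loop
        resp = json.loads(f.readline())
        if "return" in resp:
            return resp["return"]
        if "error" in resp:
            raise RuntimeError(f"{command}: {resp['error']}")
    raise RuntimeError(f"{command}: no response")


client_sock, server_sock = socket.socketpair()
server = threading.Thread(target=fake_qmp_server, args=(server_sock,), daemon=True)
server.start()

f = client_sock.makefile("rwb")
greeting = json.loads(f.readline())   # the server speaks first
execute(f, "qmp_capabilities")        # required before any other command
status = execute(f, "query-status")   # the RESUME event is skipped
f.close()
client_sock.close()
server.join(timeout=2)
```

The client never sees the `RESUME` event as a result; it is read and discarded on the way to the matching `return` object.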
--- a/etc/cis490-orchestrator.service
+++ b/etc/cis490-orchestrator.service
@@ -1,8 +1,8 @@
 [Unit]
-Description=CIS490 lab-host episode orchestrator (queue mode)
+Description=CIS490 lab-host episode orchestrator (fleet mode)
 Documentation=https://maxgit.wg/spectral/CIS490
-# Episodes need KVM and (for Tier 3+) msfrpcd up. msfrpcd is brought
-# up out-of-band; this unit only requires the kernel + WG.
+# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
+# by cis490-msfrpcd.service when installed.
 After=network-online.target wg-quick@wg0.service
 Wants=network-online.target

@@ -11,13 +11,18 @@ Type=simple
 User=cis490
 Group=cis490
 WorkingDirectory=/opt/cis490
-# Queue mode is currently a TODO — the binary takes a job-spec file
-# and runs episodes in a loop. Until that lands, this unit stays
-# disabled by default; lab-host operators kick off episodes by hand
-# via tools/run_*.py and let the shipper pick them up.
-ExecStart=/opt/cis490/.venv/bin/python -m orchestrator --queue /var/lib/cis490/data/queue
-Restart=on-failure
-RestartSec=10
+EnvironmentFile=-/etc/cis490/lab-host.toml.env
+# Fleet mode: detect host capacity, run that many concurrent episodes
+# per wave with samples drawn from the manifest. Each invocation runs
+# one wave and exits; systemd respawns per Restart= below, giving us
+# a continuous stream of fresh-sample episodes per host. The shipper
+# picks them up as `done.marker` files appear.
+ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
+    --data-root /var/lib/cis490/data \
+    --manifest /opt/cis490/samples/manifest.toml \
+    --waves 1
+Restart=always
+RestartSec=15

 # Hardening
 NoNewPrivileges=true
--- a/orchestrator/episode.py
+++ b/orchestrator/episode.py
@@ -36,7 +36,7 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Callable

-from collectors import proc_qemu
+from collectors import guest_agent, proc_qemu, qmp

 from .ulid import new_ulid

@@ -61,6 +61,11 @@ class EpisodeConfig:
    # When set, walk this schedule and ignore duration_s for sleep timing.
    # ``duration_s`` still goes in meta.schedule for record-keeping.
    phase_schedule: PhaseSchedule | None = None
+    # Optional: paths to QEMU sockets exposed by the launcher. When
+    # set, EpisodeRunner spins up additional collector threads.
+    qmp_socket: Path | None = None
+    qmp_interval_ms: int = 1000  # QMP queries are heavier than /proc reads
+    guest_agent_socket: Path | None = None


@dataclass
@@ -68,8 +73,10 @@ class EpisodeResult:
    episode_id: str
    episode_dir: Path
    rows_proc: int
-    pid_disappeared: bool
-    duration_observed_s: float
+    rows_qmp: int = 0
+    rows_guest: int = 0
+    pid_disappeared: bool = False
+    duration_observed_s: float = 0.0
    phases_observed: list[str] = field(default_factory=list)


@@ -102,10 +109,10 @@ class EpisodeRunner:

        self.emit_event("snapshot_load", snapshot=self.cfg.snapshot_name)

-        rows_holder: dict[str, int] = {"rows": 0}
+        rows_holder: dict[str, int] = {"proc": 0, "qmp": 0, "guest": 0}

-        def _collector() -> None:
-            rows_holder["rows"] = proc_qemu.run_loop(
+        def _proc_collector() -> None:
+            rows_holder["proc"] = proc_qemu.run_loop(
                pid=self.cfg.target_pid,
                output_path=self.episode_dir / "telemetry-proc.jsonl",
                t_mono_origin_ns=self._t_mono_origin_ns,
@@ -113,8 +120,33 @@ class EpisodeRunner:
                stop_event=self._stop,
            )

-        t = threading.Thread(target=_collector, daemon=True, name="proc_qemu")
-        t.start()
+        def _qmp_collector() -> None:
+            assert self.cfg.qmp_socket is not None
+            rows_holder["qmp"] = qmp.run_loop(
+                socket_path=self.cfg.qmp_socket,
+                output_path=self.episode_dir / "telemetry-qmp.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.qmp_interval_ms,
+                stop_event=self._stop,
+            )
+
+        def _guest_collector() -> None:
+            assert self.cfg.guest_agent_socket is not None
+            rows_holder["guest"] = guest_agent.run_loop(
+                socket_path=self.cfg.guest_agent_socket,
+                output_path=self.episode_dir / "telemetry-guest.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                stop_event=self._stop,
+            )
+
+        threads: list[threading.Thread] = []
+        threads.append(threading.Thread(target=_proc_collector, daemon=True, name="proc_qemu"))
+        if self.cfg.qmp_socket is not None:
+            threads.append(threading.Thread(target=_qmp_collector, daemon=True, name="qmp"))
+        if self.cfg.guest_agent_socket is not None:
+            threads.append(threading.Thread(target=_guest_collector, daemon=True, name="guest_agent"))
+        for t in threads:
+            t.start()

        phases_observed: list[str] = []
        try:
@@ -126,7 +158,8 @@ class EpisodeRunner:
                self._stop.wait(timeout=self.cfg.duration_s)
        finally:
            self._stop.set()
-            t.join(timeout=2.0)
+            for t in threads:
+                t.join(timeout=3.0)

        pid_alive = _pid_alive(self.cfg.target_pid)
        self.emit_event("episode_end", target_pid_alive=pid_alive)
@@ -135,7 +168,9 @@ class EpisodeRunner:
        meta["ended_at_wall"] = datetime.now(timezone.utc).isoformat()
        meta["result"] = {
            "phases_observed": phases_observed,
-            "rows_proc": rows_holder["rows"],
+            "rows_proc": rows_holder["proc"],
+            "rows_qmp": rows_holder["qmp"],
+            "rows_guest": rows_holder["guest"],
            "pid_alive_at_end": pid_alive,
            "duration_observed_s": end_mono_ns / 1_000_000_000,
        }
@@ -143,16 +178,18 @@ class EpisodeRunner:
        (self.episode_dir / "done.marker").touch()

        log.info(
-            "episode %s complete: rows=%d duration=%.2fs phases=%s",
+            "episode %s complete: proc=%d qmp=%d guest=%d duration=%.2fs phases=%s",
            self.episode_id,
-            rows_holder["rows"],
+            rows_holder["proc"], rows_holder["qmp"], rows_holder["guest"],
            end_mono_ns / 1e9,
            phases_observed,
        )
        return EpisodeResult(
            episode_id=self.episode_id,
            episode_dir=self.episode_dir,
-            rows_proc=rows_holder["rows"],
+            rows_proc=rows_holder["proc"],
+            rows_qmp=rows_holder["qmp"],
+            rows_guest=rows_holder["guest"],
            pid_disappeared=not pid_alive,
            duration_observed_s=end_mono_ns / 1_000_000_000,
            phases_observed=phases_observed,
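Reviewer sketch, not part of the patch: the hunks above start one thread per configured collector, share a single stop event, and join each thread with a bounded timeout. The pattern can be exercised in isolation as below. `run_collectors` and `fake_collector` are hypothetical names used only for illustration; the real collectors take more arguments (socket path, output path, monotonic origin).

```python
import threading
from typing import Callable, Optional

# Hypothetical stand-in for a collector's run_loop: it receives the shared
# stop event and returns the number of rows it produced.
Collector = Callable[[threading.Event], int]


def fake_collector(interval_s: float) -> Collector:
    def _loop(stop: threading.Event) -> int:
        rows = 0
        while True:
            rows += 1  # pretend one row was sampled and written
            if stop.wait(interval_s):  # returns True once stop is set
                break
        return rows
    return _loop


def run_collectors(
    collectors: dict[str, Optional[Collector]],
    duration_s: float,
    join_timeout_s: float = 3.0,
) -> dict[str, int]:
    """Start every non-None collector, wait, then stop and join them all."""
    stop = threading.Event()
    rows = {name: 0 for name in collectors}  # disabled collectors stay at 0

    def _make_target(name: str, fn: Collector) -> Callable[[], None]:
        def _target() -> None:
            rows[name] = fn(stop)
        return _target

    threads = [
        threading.Thread(target=_make_target(name, fn), daemon=True, name=name)
        for name, fn in collectors.items()
        if fn is not None
    ]
    for t in threads:
        t.start()
    try:
        stop.wait(timeout=duration_s)
    finally:
        stop.set()
        for t in threads:
            t.join(timeout=join_timeout_s)
    return rows


if __name__ == "__main__":
    print(run_collectors(
        {"proc": fake_collector(0.01), "qmp": None, "guest": fake_collector(0.01)},
        duration_s=0.05,
    ))
```

One property worth noting: a row count is only recorded when the collector returns, so a collector that overruns the join timeout reports 0 rows even if it wrote data.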
--- a/orchestrator/fleet.py
+++ b/orchestrator/fleet.py
@@ -0,0 +1,362 @@
+"""Fleet runner — concurrent VM episodes with resource awareness.
+
+The lab host detects its own capacity, picks how many VMs to run in
+parallel without driving the box into swap or starving the host
+itself, and runs that many episodes simultaneously. Each slot gets a
+``Sample`` from the manifest, chosen deterministically by hashing
+(host_id, slot, episode_index); slots usually differ, but two slots
+can hash to the same sample, so distinctness is not guaranteed.
+
+Capacity heuristic — defaults documented inline so they're auditable:
+
+  cores_total      = os.cpu_count()
+  cores_reserved   = max(1, cores_total // 8)        # host + collectors
+  ram_per_vm_mib   = 320                             # Alpine fits in 256
+                                                     # but leave 64 for
+                                                     # overhead (qemu+ovmf)
+  ram_headroom_mib = max(1024, ram_total // 8)       # never starve host
+  max_by_cores     = cores_total - cores_reserved
+  max_by_ram       = (ram_available - ram_headroom) // ram_per_vm
+  max_by_load      = max_by_cores, halved (floor 1) if load_1m / cores_total > 0.75
+
+The smallest of these wins. The reasoning string is logged + saved
+into each episode's meta.json under ``fleet`` so post-hoc analysis
+can correlate "this episode was run when 6 VMs were concurrent" with
+its observed envelope.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import shutil
+import signal
+import subprocess
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass, field
+from pathlib import Path
+
+from samples.manifest import Sample, SampleManifest
+
+
+log = logging.getLogger("cis490.fleet")
+
+
+@dataclass(frozen=True)
+class FleetCapacity:
+    cores_total: int
+    cores_reserved: int
+    ram_total_mib: int
+    ram_available_mib: int
+    ram_per_vm_mib: int
+    ram_headroom_mib: int
+    load_1m: float
+    max_by_cores: int
+    max_by_ram: int
+    max_by_load: int
+    max_concurrent: int
+    rationale: str
+
+    def to_dict(self) -> dict:
+        return {
+            "cores_total": self.cores_total,
+            "cores_reserved": self.cores_reserved,
+            "ram_total_mib": self.ram_total_mib,
+            "ram_available_mib": self.ram_available_mib,
+            "ram_per_vm_mib": self.ram_per_vm_mib,
+            "ram_headroom_mib": self.ram_headroom_mib,
+            "load_1m": self.load_1m,
+            "max_by_cores": self.max_by_cores,
+            "max_by_ram": self.max_by_ram,
+            "max_by_load": self.max_by_load,
+            "max_concurrent": self.max_concurrent,
+            "rationale": self.rationale,
+        }
+
+
+@dataclass
+class FleetConfig:
+    host_id: str
+    repo_root: Path
+    data_root: Path
+    manifest: SampleManifest
+    # VM resource shape — must match what the launcher requests.
+    ram_per_vm_mib: int = 320
+    # Cap concurrency below the calculated max (e.g. for a smoke test).
+    max_concurrent_override: int | None = None
+    # Run only samples whose kind is "real" (sha256-pinned); drop mimics.
+    require_real_samples: bool = False
+
+
+def _read_meminfo() -> dict[str, int]:
+    out: dict[str, int] = {}
+    try:
+        with open("/proc/meminfo") as f:
+            for line in f:
+                k, _, rest = line.partition(":")
+                v = rest.strip()
+                if v.endswith(" kB"):
+                    try:
+                        out[k] = int(v[:-3]) * 1024
+                    except ValueError:
+                        pass
+    except OSError:
+        pass
+    return out
+
+
+def _read_loadavg() -> float:
+    try:
+        with open("/proc/loadavg") as f:
+            return float(f.read().split()[0])
+    except (OSError, ValueError, IndexError):
+        return 0.0
+
+
+def detect_capacity(*, ram_per_vm_mib: int = 320) -> FleetCapacity:
+    cores_total = os.cpu_count() or 1
+    # Reserve at least 1 core, more if the host has many.
+    cores_reserved = max(1, cores_total // 8)
+
+    mem = _read_meminfo()
+    ram_total_b = mem.get("MemTotal", 0)
+    ram_avail_b = mem.get("MemAvailable", ram_total_b)
+    ram_total_mib = ram_total_b // (1024 * 1024)
+    ram_available_mib = ram_avail_b // (1024 * 1024)
+    # Keep at least 1 GiB, or 1/8 of total RAM if larger, free for the host.
+    ram_headroom_mib = max(1024, ram_total_mib // 8)
+
+    load_1m = _read_loadavg()
+
+    max_by_cores = max(0, cores_total - cores_reserved)
+    if ram_per_vm_mib <= 0:
+        max_by_ram = max_by_cores
+    else:
+        max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
+
+    # Load-based cap: if the host is already busy, run fewer VMs.
+    if cores_total and load_1m / cores_total > 0.75:
+        # Halve, floor 1.
+        max_by_load = max(1, max_by_cores // 2)
+    else:
+        max_by_load = max_by_cores
+
+    candidates = [max_by_cores, max_by_ram, max_by_load]
+    max_concurrent = max(0, min(candidates))
+
+    binding = ["cores", "ram", "load"][candidates.index(max_concurrent)] \
+        if max_concurrent < max_by_cores else "cores"
+    rationale = (
+        f"cores_total={cores_total} reserved={cores_reserved} "
+        f"ram_avail_mib={ram_available_mib} headroom={ram_headroom_mib} "
+        f"per_vm={ram_per_vm_mib} load_1m={load_1m:.2f} "
+        f"-> max_concurrent={max_concurrent} (binding={binding})"
+    )
+    log.info("capacity: %s", rationale)
+
+    return FleetCapacity(
+        cores_total=cores_total,
+        cores_reserved=cores_reserved,
+        ram_total_mib=ram_total_mib,
+        ram_available_mib=ram_available_mib,
+        ram_per_vm_mib=ram_per_vm_mib,
+        ram_headroom_mib=ram_headroom_mib,
+        load_1m=load_1m,
+        max_by_cores=max_by_cores,
+        max_by_ram=max_by_ram,
+        max_by_load=max_by_load,
+        max_concurrent=max_concurrent,
+        rationale=rationale,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Per-slot episode execution
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class SlotResult:
+    slot: int
+    sample_name: str
+    sample_kind: str
+    episode_id: str | None
+    rc: int
+    duration_s: float
+    error: str | None = None
+    extra: dict = field(default_factory=dict)
+
+
+def _run_slot(
+    cfg: FleetConfig,
+    slot: int,
+    sample: Sample,
+    episode_index: int,
+    capacity: FleetCapacity,
+) -> SlotResult:
+    """Run one Tier-2-shaped episode in a dedicated slot.
+
+    For now the per-slot driver shells out to ``tools/run_real_vm_demo.py``
+    with SLOT, RUN_DIR, and SAMPLE_PROFILE in the environment so each slot
+    gets a unique RUN_DIR and the load mimic varies by sample. When the
+    Tier-3/4 paths land, add a sample-kind dispatch here."""
+    env = os.environ.copy()
+    env["SLOT"] = str(slot)
+    env["RUN_DIR"] = f"/tmp/cis490-vm-fleet-{slot}"
+    env["SAMPLE_NAME"] = sample.name
+    env["SAMPLE_PROFILE"] = sample.profile
+    env["SAMPLE_KIND"] = sample.kind
+    env["FLEET_HOST_ID"] = cfg.host_id
+    env["FLEET_EPISODE_INDEX"] = str(episode_index)
+    env["FLEET_MAX_CONCURRENT"] = str(capacity.max_concurrent)
+
+    log_dir = cfg.data_root / "fleet-logs"
+    log_dir.mkdir(parents=True, exist_ok=True)
+    out_log = log_dir / f"slot-{slot}-ep-{episode_index}.log"
+
+    started = time.monotonic()
+    try:
+        with out_log.open("ab") as logf:
+            proc = subprocess.run(
+                [
+                    "/usr/bin/env", "python3",
+                    str(cfg.repo_root / "tools" / "run_real_vm_demo.py"),
+                    "--data-root", str(cfg.data_root),
+                ],
+                cwd=str(cfg.repo_root),
+                env=env,
+                stdout=logf,
+                stderr=subprocess.STDOUT,
+                check=False,
+            )
+        rc = proc.returncode
+        err = None
+    except (OSError, subprocess.SubprocessError) as e:
+        rc = -1
+        err = str(e)
+    duration = time.monotonic() - started
+
+    return SlotResult(
+        slot=slot,
+        sample_name=sample.name,
+        sample_kind=sample.kind,
+        episode_id=None,  # parsed from the log later by the driver
+        rc=rc,
+        duration_s=duration,
+        error=err,
+    )
+
+
+# ---------------------------------------------------------------------------
+# FleetRunner
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class FleetRunResult:
+    capacity: FleetCapacity
+    slots: list[SlotResult]
+    total_duration_s: float
+
+
+class FleetRunner:
+    def __init__(self, cfg: FleetConfig) -> None:
+        self.cfg = cfg
+        self._stop = threading.Event()
+
+    def stop(self) -> None:
+        self._stop.set()
+
+    def run(
+        self,
+        *,
+        episodes: int = 1,
+        episode_index_base: int = 0,
+        capacity_override: FleetCapacity | None = None,
+    ) -> FleetRunResult:
+        capacity = capacity_override or detect_capacity(
+            ram_per_vm_mib=self.cfg.ram_per_vm_mib,
+        )
+        n_slots = capacity.max_concurrent
+        if self.cfg.max_concurrent_override is not None:
+            n_slots = min(n_slots, self.cfg.max_concurrent_override)
+        if n_slots <= 0:
+            log.warning(
+                "fleet capacity is zero (%s); cannot run", capacity.rationale,
+            )
+            return FleetRunResult(
+                capacity=capacity, slots=[], total_duration_s=0.0,
+            )
+
+        log.info(
+            "fleet host=%s slots=%d episodes=%d manifest_size=%d",
+            self.cfg.host_id, n_slots, episodes, len(self.cfg.manifest),
+        )
+
+        all_results: list[SlotResult] = []
+        t_start = time.monotonic()
+        for ep in range(episodes):
+            if self._stop.is_set():
+                break
+            episode_index = episode_index_base + ep
+            slot_samples = [
+                self.cfg.manifest.select(
+                    host_id=self.cfg.host_id,
+                    slot=slot,
+                    episode_index=episode_index,
+                )
+                for slot in range(n_slots)
+            ]
+            if self.cfg.require_real_samples:
+                slot_samples = [s for s in slot_samples if s.kind == "real"]
+                if not slot_samples:
+                    log.warning("require_real_samples: no real samples selected for this wave; skipping")
+                    continue
+
+            log.info(
+                "wave %d/%d: %s",
+                ep + 1, episodes,
+                [(i, s.name, s.kind) for i, s in enumerate(slot_samples)],
+            )
+
+            with ThreadPoolExecutor(max_workers=n_slots) as pool:
+                futures = [
+                    pool.submit(
+                        _run_slot, self.cfg, slot, sample, episode_index, capacity,
+                    )
+                    for slot, sample in enumerate(slot_samples)
+                ]
+                for fut in as_completed(futures):
+                    res = fut.result()
+                    log.info(
+                        "slot %d sample=%s rc=%d duration=%.1fs",
+                        res.slot, res.sample_name, res.rc, res.duration_s,
+                    )
+                    all_results.append(res)
+
+        total = time.monotonic() - t_start
+        return FleetRunResult(
+            capacity=capacity,
+            slots=all_results,
+            total_duration_s=total,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Friendly capacity report (used by tools/run_fleet.py --capacity)
+# ---------------------------------------------------------------------------
+
+
+def capacity_report() -> str:
+    c = detect_capacity()
+    return (
+        f"cores: {c.cores_total} (reserve {c.cores_reserved})\n"
+        f"ram:   {c.ram_total_mib} MiB total, {c.ram_available_mib} MiB available "
+        f"(headroom {c.ram_headroom_mib} MiB, per-vm {c.ram_per_vm_mib} MiB)\n"
+        f"load:  1m={c.load_1m:.2f}\n"
+        f"caps:  by_cores={c.max_by_cores}, by_ram={c.max_by_ram}, "
+        f"by_load={c.max_by_load}\n"
+        f"--> max_concurrent VMs: {c.max_concurrent}\n"
+    )
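Reviewer sketch, not part of the patch: the capacity heuristic in `detect_capacity` reduces to the arithmetic below. `max_concurrent_vms` is a hypothetical helper that takes the inputs as arguments instead of reading `/proc`, and it assumes `cores_total >= 1` and `ram_per_vm_mib > 0`.

```python
def max_concurrent_vms(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> tuple[int, str]:
    """Return (max_concurrent, binding constraint) for the given host shape."""
    cores_reserved = max(1, cores_total // 8)          # host + collectors
    ram_headroom_mib = max(1024, ram_total_mib // 8)   # kept free for the host

    max_by_cores = max(0, cores_total - cores_reserved)
    max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)

    # A busy host (1-minute load above 75% of its cores) gets half the slots.
    if load_1m / cores_total > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    caps = {"cores": max_by_cores, "ram": max_by_ram, "load": max_by_load}
    # min() keeps the first key on ties, so "cores" wins when caps are equal.
    binding = min(caps, key=caps.get)
    return caps[binding], binding


if __name__ == "__main__":
    # 8 cores, 16 GiB, idle: cores bind at 7.
    print(max_concurrent_vms(8, 16384, 14000, 0.0))
    # 8 cores, 4 GiB with 2 GiB available: (2048 - 1024) // 320 = 3.
    print(max_concurrent_vms(8, 4096, 2048, 0.0))
    # 8 cores, load 7.0: 7 // 2 = 3.
    print(max_concurrent_vms(8, 16384, 14000, 7.0))
```

The three calls mirror the cases in tests/test_fleet.py, so the numbers can be checked by hand against the test comments.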
--- a/samples/__init__.py
+++ b/samples/__init__.py
--- a/samples/manifest.py
+++ b/samples/manifest.py
@@ -0,0 +1,105 @@
+"""Sample manifest loader + per-(host, slot) deterministic selection.
+
+The manifest at ``samples/manifest.toml`` defines the catalog of
+samples (real or mimic) the fleet draws from. Selection is
+**deterministic** given ``(host_id, slot, episode_index)``: each triple
+hashes to one catalog index, so two lab hosts usually pick *different*
+samples for the same slot, though collisions and repeats can occur.
+
+This gives us "all hosts on the network generating novel data" without
+needing a coordinator: every host's `host_id` seeds its own
+sample-rotation order, and the orderings spread across the catalog.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import tomllib
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+_VALID_CATEGORIES = {
+    "cryptominer", "botnet", "ransomware", "banking-trojan",
+    "fileless", "rat", "worm", "loader", "wiper", "other",
+}
+
+
+@dataclass(frozen=True)
+class Sample:
+    name: str
+    family: str
+    category: str
+    profile: str
+    description: str = ""
+    source: str | None = None
+    sha256: str | None = None
+    url: str | None = None
+
+    @property
+    def kind(self) -> str:
+        """``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
+        Trainers filter on this so the realistic-model pipeline only
+        consumes real-malware episodes."""
+        return "real" if self.sha256 else "mimic"
+
+
+@dataclass(frozen=True)
+class SampleManifest:
+    samples: list[Sample] = field(default_factory=list)
+
+    def __len__(self) -> int:
+        return len(self.samples)
+
+    def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
+        """Deterministic selection. The host_id mixes into the seed so
+        different hosts visit the catalog in different orders; slot +
+        episode_index tick within a host. Same inputs always give the
+        same sample — replay-friendly for debugging."""
+        if not self.samples:
+            raise ValueError("manifest is empty")
+        # SHA-256 of the seed gives a uniformly distributed integer.
+        seed = f"{host_id}|{slot}|{episode_index}".encode()
+        h = hashlib.sha256(seed).digest()
+        idx = int.from_bytes(h[:8], "big") % len(self.samples)
+        return self.samples[idx]
+
+    @classmethod
+    def load(cls, path: str | Path) -> "SampleManifest":
+        with open(path, "rb") as f:
+            data = tomllib.load(f)
+        raw = data.get("sample") or []
+        if not isinstance(raw, list):
+            raise ValueError(f"{path}: 'sample' must be an array of tables")
+
+        samples: list[Sample] = []
+        for i, entry in enumerate(raw):
+            if not isinstance(entry, dict):
+                raise ValueError(f"{path}: sample[{i}] is not a table")
+            for key in ("name", "family", "category", "profile"):
+                if not isinstance(entry.get(key), str) or not entry[key]:
+                    raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
+            if entry["category"] not in _VALID_CATEGORIES:
+                raise ValueError(
+                    f"{path}: sample[{i}] category {entry['category']!r} "
+                    f"not in {sorted(_VALID_CATEGORIES)}"
+                )
+            samples.append(Sample(
+                name=entry["name"],
+                family=entry["family"],
+                category=entry["category"],
+                profile=entry["profile"],
+                description=entry.get("description", ""),
+                source=entry.get("source"),
+                sha256=entry.get("sha256"),
+                url=entry.get("url"),
+            ))
+
+        # Reject duplicate names — trainers join on this.
+        seen: set[str] = set()
+        for s in samples:
+            if s.name in seen:
+                raise ValueError(f"{path}: duplicate sample name {s.name!r}")
+            seen.add(s.name)
+
+        return cls(samples=samples)
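Reviewer sketch, not part of the patch: the selection rule in `SampleManifest.select` is an independent hash draw per `(host_id, slot, episode_index)`. The standalone version below shows that it is deterministic, and that a wave with more slots than catalog entries must reuse a sample. `select_index`, `wave_indices` and the host name `lab-a` are hypothetical, chosen only for illustration.

```python
import hashlib


def select_index(host_id: str, slot: int, episode_index: int, catalog_size: int) -> int:
    """Same derivation as the patch: SHA-256 of the seed, first 8 bytes, modulo."""
    if catalog_size <= 0:
        raise ValueError("manifest is empty")
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % catalog_size


def wave_indices(host_id: str, n_slots: int, episode_index: int, catalog_size: int) -> list[int]:
    """Catalog indices picked by every slot of one wave."""
    return [select_index(host_id, s, episode_index, catalog_size) for s in range(n_slots)]


if __name__ == "__main__":
    first = wave_indices("lab-a", n_slots=7, episode_index=0, catalog_size=6)
    again = wave_indices("lab-a", n_slots=7, episode_index=0, catalog_size=6)
    print(first)
    print("replayable:", first == again)
    # 7 slots drawing from 6 samples cannot all be distinct (pigeonhole).
    print("distinct samples in wave:", len(set(first)))
```

If strictly distinct samples per wave are wanted, one option would be to derive a per-host permutation of the catalog and index into it; that is a design alternative, not what the patch does.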
--- a/samples/manifest.toml
+++ b/samples/manifest.toml
@@ -0,0 +1,61 @@
+# Sample manifest — what each fleet slot picks from.
+#
+# Each entry has three things:
+#   - identity (name, family, category) for labeling
+#   - acquisition (source, sha256, url) for reproducibility
+#   - behaviour (profile) so the synthetic load mimic can run a
+#     reasonable proxy until the real sample lands in samples/store/
+#
+# When the real malware binary is present at samples/store/<sha256>,
+# the orchestrator runs THAT inside the guest. When it's absent, the
+# orchestrator falls back to running tools/load_mimic.py with the
+# matching profile so the fleet still produces *labeled, varied* data
+# while we collect the real samples. Either way, meta.json records
+# which path the episode took, so trainers can filter on
+# meta.sample.kind ∈ {real, mimic}.
+
+[[sample]]
+name = "xmrig-cryptominer"
+family = "XMRig"
+category = "cryptominer"
+profile = "cpu-saturate"
+# A real XMRig fetch goes here when MalwareBazaar pull is wired up:
+# source = "MalwareBazaar"
+# sha256 = "TBD"
+# url    = "https://bazaar.abuse.ch/sample/TBD/"
+description = "Sustained 1-vCPU saturation, very low IO/net. Pure compute."
+
+[[sample]]
+name = "mirai-class-bot"
+family = "Mirai"
+category = "botnet"
+profile = "scan-and-dial"
+description = "SYN scans across the bridge IP space + periodic dial-home. High net, low CPU."
+
+[[sample]]
+name = "ransomware-mimic"
+family = "Cryptolocker-class"
+category = "ransomware"
+profile = "io-walk"
+description = "Heavy disk write + filesystem walk producing a per-file overwrite envelope."
+
+[[sample]]
+name = "dridex-class-trojan"
+family = "Dridex"
+category = "banking-trojan"
+profile = "bursty-c2"
+description = "Long idle, periodic short bursts of TCP egress to a fixed peer (C2 beacon shape)."
+
+[[sample]]
+name = "kovter-class-stealth"
+family = "Kovter"
+category = "fileless"
+profile = "low-and-slow"
+description = "Low CPU, periodic memory churn, no persistent on-disk artifacts. Hardest to label from /proc alone."
+
+[[sample]]
+name = "reverse-shell-resident"
+family = "Reverse-Shell"
+category = "rat"
+profile = "shell-resident"
+description = "Single TCP socket pinned to an attacker IP, occasional command bursts."
--- a/scripts/fetch-metasploitable2.sh
+++ b/scripts/fetch-metasploitable2.sh
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+# Fetch + sha256-verify the Metasploitable2 disk image.
+#
+# Rapid7's official download is gated behind a registration form, so
+# we take the URL + sha256 from env vars; both are required and have
+# no defaults. The operator runs this once per lab host.
+#
+# Inputs (env):
+#   IMAGE_URL  — direct download URL for the metasploitable2 archive
+#   IMAGE_SHA256 — expected sha256 of the archive
+#   OUT_DIR    — where to drop the qcow2 (default vm/images/)
+#
+# Outputs:
+#   $OUT_DIR/metasploitable2.qcow2 — converted from the original VMDK
+#                                    if needed.
+#
+# We do NOT bake an image url+hash into the repo because the canonical
+# distribution is a registration-walled zip on Rapid7. Operators must
+# supply both; the rest is mechanical.
+
+set -euo pipefail
+
+IMAGE_URL="${IMAGE_URL:-}"
+IMAGE_SHA256="${IMAGE_SHA256:-}"
+OUT_DIR="${OUT_DIR:-$(cd "$(dirname "$0")/.." && pwd)/vm/images}"
+WORK_DIR="${WORK_DIR:-/tmp/cis490-metasploitable-fetch}"
+
+log() { printf '[fetch-metasploitable2] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+[[ -n "$IMAGE_URL" ]] || die "set IMAGE_URL to the Metasploitable2 download URL"
+[[ -n "$IMAGE_SHA256" ]] || die "set IMAGE_SHA256 to the expected sha256 of the archive"
+
+mkdir -p "$OUT_DIR" "$WORK_DIR"
+
+ARCHIVE="$WORK_DIR/$(basename "$IMAGE_URL")"
+log "downloading $IMAGE_URL → $ARCHIVE"
+if [[ -f "$ARCHIVE" ]]; then
+    log "archive already present; skipping download"
+else
+    curl -fL --retry 3 --retry-delay 5 -o "$ARCHIVE.partial" "$IMAGE_URL"
+    mv "$ARCHIVE.partial" "$ARCHIVE"
+fi
+
+log "verifying sha256"
+ACTUAL="$(sha256sum "$ARCHIVE" | awk '{print $1}')"
+if [[ "$ACTUAL" != "$IMAGE_SHA256" ]]; then
+    die "sha256 mismatch: expected $IMAGE_SHA256, got $ACTUAL"
+fi
+log "sha256 ok"
+
+# Extract — handle either zip or 7z, since various mirrors choose one
+# or the other.
+case "$ARCHIVE" in
+    *.zip) ( cd "$WORK_DIR" && unzip -o "$ARCHIVE" ) ;;
+    *.7z|*.7zip) command -v 7z >/dev/null || die "7z not installed"; \
+                 ( cd "$WORK_DIR" && 7z x -y "$ARCHIVE" ) ;;
+    *) die "unsupported archive type: $ARCHIVE" ;;
+esac
+
+VMDK="$(find "$WORK_DIR" -name 'Metasploitable*.vmdk' -print -quit)"
+[[ -n "$VMDK" ]] || die "no Metasploitable*.vmdk in extracted archive"
+
+log "converting $VMDK → qcow2"
+command -v qemu-img >/dev/null || die "qemu-img required (apt install qemu-utils)"
+qemu-img convert -O qcow2 "$VMDK" "$OUT_DIR/metasploitable2.qcow2"
+
+log "done: $OUT_DIR/metasploitable2.qcow2"
+log "Tier-3 ready when msfrpcd is up. See scripts/install-msfrpcd.sh."
--- a/scripts/install-msfrpcd.sh
+++ b/scripts/install-msfrpcd.sh
@@ -0,0 +1,124 @@
+#!/usr/bin/env bash
+# Install + configure ``msfrpcd`` for the Tier-3 exploit driver.
+#
+# Idempotent: re-running on a host that already has msfrpcd refreshes
+# the systemd unit and credentials but doesn't reinstall the framework.
+#
+# Steps:
+#   1. Install metasploit-framework via the host package manager (or
+#      report the right one-liner for that distro). Big download —
+#      ~1 GiB and several minutes.
+#   2. Generate a strong password and store at /etc/cis490/msfrpc.env
+#      (mode 0640, owner root:cis490).
+#   3. Drop /etc/systemd/system/cis490-msfrpcd.service that runs
+#      msfrpcd bound to 127.0.0.1:55553 with the generated password.
+#   4. Enable + start.
+#
+# After this runs, ``MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env;
+# echo $MSFRPC_PASSWORD)`` makes tools/run_tier3_demo.py work zero-touch.
+
+set -euo pipefail
+
+ETC_ROOT="/etc/cis490"
+ENV_FILE="$ETC_ROOT/msfrpc.env"
+UNIT="/etc/systemd/system/cis490-msfrpcd.service"
+PORT="${MSFRPC_PORT:-55553}"
+USER_NAME="${MSFRPC_USER:-msf}"
+
+log() { printf '[install-msfrpcd] %s\n' "$*" >&2; }
+die() { log "FATAL: $*"; exit 1; }
+
+[[ $EUID -eq 0 ]] || die "must run as root"
+command -v systemctl >/dev/null || die "systemd not found"
+
+# --- 1. install metasploit-framework -----------------------------------
+if ! command -v msfrpcd >/dev/null; then
+    log "msfrpcd not found; installing metasploit-framework"
+    if command -v apt-get >/dev/null; then
+        # The Debian/Ubuntu metasploit-framework package isn't in
+        # the default repos for most distros. Use Rapid7's official
+        # nightly installer when available.
+        if [[ ! -x /opt/metasploit-framework/bin/msfrpcd ]]; then
+            # No automated install path on apt-based hosts yet: point
+            # the operator at Rapid7's nightly-installer docs and stop,
+            # rather than fetching anything as root.
+            log "automated install not available — install manually:"
+            log "  https://docs.metasploit.com/docs/using-metasploit/getting-started/nightly-installers.html"
+            die "rerun once msfrpcd is on PATH"
+        fi
+        # Symlink the wrapper so ``msfrpcd`` is on PATH.
+        ln -sf /opt/metasploit-framework/bin/msfrpcd /usr/local/bin/msfrpcd
+    elif command -v pacman >/dev/null; then
+        log "pacman -S metasploit"
+        pacman -Sy --noconfirm metasploit
+    elif command -v dnf >/dev/null; then
+        die "Fedora/RHEL: install metasploit-framework manually, then re-run"
+    else
+        die "unknown package manager — install metasploit-framework manually"
+    fi
+fi
+
+command -v msfrpcd >/dev/null || die "msfrpcd still missing after install attempt"
+
+# --- 2. generate password ----------------------------------------------
+install -d -m 0755 -o root -g root "$ETC_ROOT"
+if ! id -u cis490 >/dev/null 2>&1; then
+    useradd --system --no-create-home --shell /usr/sbin/nologin cis490
+fi
+if [[ ! -f "$ENV_FILE" ]]; then
+    log "generating msfrpc password"
+    PW="$(openssl rand -base64 24 | tr -d '/+=' | head -c 32)"
+    install -m 0640 -o root -g cis490 /dev/stdin "$ENV_FILE" <<EOF
+# Auto-generated by install-msfrpcd.sh — do not edit.
+MSFRPC_HOST=127.0.0.1
+MSFRPC_PORT=$PORT
+MSFRPC_USER=$USER_NAME
+MSFRPC_PASSWORD=$PW
+EOF
+else
+    log "$ENV_FILE exists; preserving existing password"
+fi
+
+# --- 3. systemd unit ----------------------------------------------------
+log "installing systemd unit"
+cat > "$UNIT" <<EOF
+[Unit]
+Description=CIS490 — Metasploit RPC daemon (loopback only)
+Documentation=https://maxgit.wg/spectral/CIS490
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+EnvironmentFile=$ENV_FILE
+# msfrpcd flags:
+#   -P <pw>   password
+#   -U <user> username
+#   -a <ip>   bind address (loopback only — Tier-3 driver runs locally)
+#   -p <port> port
+#   -f        foreground (no daemonization, so systemd manages PID)
+ExecStart=/usr/bin/env msfrpcd -P \${MSFRPC_PASSWORD} -U \${MSFRPC_USER} -a 127.0.0.1 -p \${MSFRPC_PORT} -f
+Restart=on-failure
+RestartSec=5
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=full
+ProtectHome=true
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable --now cis490-msfrpcd
+
+# --- 4. final smoke -----------------------------------------------------
+sleep 2
+if ! ss -ltn 2>/dev/null | grep -q ":$PORT"; then
+    log "WARN: nothing listening on 127.0.0.1:$PORT yet — check"
+    log "       journalctl -u cis490-msfrpcd"
+fi
+
+log "done. To run a Tier-3 episode:"
+log "  set -a; . $ENV_FILE; set +a"
+log "  python tools/run_tier3_demo.py --module vsftpd_234_backdoor"
--- a/tests/test_fleet.py
+++ b/tests/test_fleet.py
@@ -0,0 +1,204 @@
+"""Tests for fleet capacity calculation + sample manifest selection.
+
+Capacity is unit-tested via deterministic monkeypatching of /proc and
+os.cpu_count so the math is exercised independently of the host
+running the suite. Sample selection has its own tests covering the
+"different hosts pick different samples" property.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from orchestrator import fleet
+from samples.manifest import Sample, SampleManifest
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+
+
+# ---------------------------------------------------------------------------
+# Capacity
+# ---------------------------------------------------------------------------
+
+
+def _patch_capacity_inputs(
+    monkeypatch,
+    *,
+    cores: int,
+    ram_total_mib: int,
+    ram_available_mib: int,
+    load_1m: float = 0.0,
+) -> None:
+    monkeypatch.setattr(fleet.os, "cpu_count", lambda: cores)
+    monkeypatch.setattr(
+        fleet, "_read_meminfo",
+        lambda: {
+            "MemTotal": ram_total_mib * 1024 * 1024,
+            "MemAvailable": ram_available_mib * 1024 * 1024,
+        },
+    )
+    monkeypatch.setattr(fleet, "_read_loadavg", lambda: load_1m)
+
+
+def test_capacity_8core_idle_box(monkeypatch) -> None:
+    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.cores_total == 8
+    assert c.cores_reserved == 1  # 8 // 8 = 1
+    assert c.max_by_cores == 7
+    # Plenty of RAM, idle → cores binding.
+    assert c.max_concurrent == 7
+    assert "binding=cores" in c.rationale
+
+
+def test_capacity_low_ram_caps_below_cores(monkeypatch) -> None:
+    # 8 cores but only ~2 GiB free → ram caps below cores.
+    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=4096, ram_available_mib=2048)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    # headroom = max(1024, 4096//8) = 1024
+    # max_by_ram = (2048 - 1024) // 320 = 3
+    assert c.max_by_ram == 3
+    assert c.max_concurrent == 3
+
+
+def test_capacity_high_load_halves_concurrency(monkeypatch) -> None:
+    # 8 cores, plenty of RAM, but load_1m / cores > 0.75
+    _patch_capacity_inputs(
+        monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000,
+        load_1m=7.0,  # 7/8 = 0.875 > 0.75
+    )
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    # max_by_cores = 7; max_by_load = max(1, 7//2) = 3
+    assert c.max_by_load == 3
+    assert c.max_concurrent == 3
+
+
+def test_capacity_pi5_class(monkeypatch) -> None:
+    """4 cores + 8 GiB → reserve 1 core, run 3 concurrent."""
+    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=7951, ram_available_mib=5223)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.cores_total == 4
+    assert c.max_concurrent == 3
+
+
+def test_capacity_minimal_box(monkeypatch) -> None:
+    """1-core 1 GiB host shouldn't try to run any VMs."""
+    _patch_capacity_inputs(monkeypatch, cores=1, ram_total_mib=1024, ram_available_mib=512)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    assert c.max_concurrent == 0
+
+
+def test_capacity_to_dict_round_trips(monkeypatch) -> None:
+    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=8000, ram_available_mib=6000)
+    c = fleet.detect_capacity(ram_per_vm_mib=320)
+    d = c.to_dict()
+    assert d["cores_total"] == 4
+    assert d["max_concurrent"] == c.max_concurrent
+    assert "rationale" in d
+
+
+# ---------------------------------------------------------------------------
+# Sample manifest
+# ---------------------------------------------------------------------------
+
+
+def test_repo_manifest_loads() -> None:
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    assert len(m) >= 4
+    # Every entry has required fields.
+    for s in m.samples:
+        assert s.name and s.family and s.category and s.profile
+    # All "mimic" today; will switch as real samples are added.
+    assert all(s.kind == "mimic" for s in m.samples)
+
+
+def test_selection_is_deterministic() -> None:
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    a = m.select(host_id="lab-1", slot=2, episode_index=5)
+    b = m.select(host_id="lab-1", slot=2, episode_index=5)
+    assert a is b
+
+
+def test_selection_differs_across_hosts() -> None:
+    """Two hosts on the same slot/episode should generally hit
+    different samples (probabilistic — assert distribution, not
+    individual equality).
+    """
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    if len(m) < 2:
+        pytest.skip("manifest too small for diversity check")
+    matches = 0
+    for slot in range(20):
+        a = m.select(host_id="alice", slot=slot, episode_index=0)
+        b = m.select(host_id="bob",   slot=slot, episode_index=0)
+        if a is b:
+            matches += 1
+    # If the catalog has N samples, the naive collision rate is ~1/N. With
+    # 20 trials and N≥4 we expect at most ~5 matches; allow up to 14.
+    assert matches < 15, "host_id seed isn't producing variety"
+
+
+def test_selection_walks_catalog_across_episodes() -> None:
+    """A single host over many episodes should hit every sample at
+    least once."""
+    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
+    seen = set()
+    for ep in range(200):
+        seen.add(m.select(host_id="lab-x", slot=0, episode_index=ep).name)
+    assert len(seen) == len(m), f"only saw {len(seen)}/{len(m)} samples"
+
+
+def test_manifest_rejects_missing_required_field(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\n'
+        'family = "y"\n'
+        '# missing category\n'
+        'profile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="category"):
+        SampleManifest.load(p)
+
+
+def test_manifest_rejects_unknown_category(tmp_path: Path) -> None:
+    p = tmp_path / "bad.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\n'
+        'family = "y"\n'
+        'category = "fish"\n'
+        'profile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="category"):
+        SampleManifest.load(p)
+
+
+def test_manifest_rejects_duplicate_names(tmp_path: Path) -> None:
+    p = tmp_path / "dup.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+        '\n[[sample]]\n'
+        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+    )
+    with pytest.raises(ValueError, match="duplicate"):
+        SampleManifest.load(p)
+
+
+def test_manifest_marks_real_when_sha256_present(tmp_path: Path) -> None:
+    p = tmp_path / "real.toml"
+    p.write_text(
+        '[[sample]]\n'
+        'name = "real-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+        'sha256 = "abc123"\n'
+        '\n[[sample]]\n'
+        'name = "mimic-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
+    )
+    m = SampleManifest.load(p)
+    by_name = {s.name: s for s in m.samples}
+    assert by_name["real-one"].kind == "real"
+    assert by_name["mimic-one"].kind == "mimic"
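The capacity tests above pin down the arithmetic only through their comments and expected values. The sketch below reconstructs that arithmetic so the numbers can be checked by hand. It is not the code in orchestrator/fleet.py: `sketch_capacity` and `CapacitySketch` are hypothetical names, and the core-reservation rule (`max(1, cores // 8)`) is an assumption inferred from the "8 // 8 = 1" comment and the Pi 5 docstring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacitySketch:
    max_by_cores: int
    max_by_ram: int
    max_by_load: int
    max_concurrent: int
    binding: str


def sketch_capacity(
    *,
    cores: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float = 0.0,
    ram_per_vm_mib: int = 320,
) -> CapacitySketch:
    # Assumption: reserve one core per eight, but never fewer than one.
    cores_reserved = max(1, cores // 8)
    max_by_cores = max(0, cores - cores_reserved)

    # From the test comments: headroom = max(1024, total // 8), in MiB.
    headroom_mib = max(1024, ram_total_mib // 8)
    max_by_ram = max(0, (ram_available_mib - headroom_mib) // ram_per_vm_mib)

    # From the test comments: load per core above 0.75 halves the
    # core-based ceiling, keeping at least one slot.
    if cores and load_1m / cores > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    # The smallest ceiling wins; ties go to the first key listed.
    limits = {"cores": max_by_cores, "ram": max_by_ram, "load": max_by_load}
    binding = min(limits, key=limits.get)
    return CapacitySketch(
        max_by_cores, max_by_ram, max_by_load, limits[binding], binding
    )


if __name__ == "__main__":
    print(sketch_capacity(cores=8, ram_total_mib=16384, ram_available_mib=14000))
```

Under these assumptions the sketch reproduces every expected value in the tests above, including the zero-VM result for the 1-core box.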
--- a/tests/test_guest_agent.py
+++ b/tests/test_guest_agent.py
@@ -0,0 +1,152 @@
+"""Tests for the host-side guest-agent collector.
+
+We simulate the in-guest agent by spinning up a unix socket server
+(stand-in for the QEMU virtio-serial chardev) that writes a few
+JSON-lines rows. The collector should read them, re-stamp with the
+host's monotonic clock, and persist to telemetry-guest.jsonl.
+"""
+
+from __future__ import annotations
+
+import json
+import socket
+import threading
+import time
+from pathlib import Path
+
+import pytest
+
+from collectors import guest_agent
+
+
+class FakeAgentServer(threading.Thread):
+    def __init__(self, sock_path: Path, rows: list[dict], delay_s: float = 0.05) -> None:
+        super().__init__(daemon=True)
+        self.sock_path = sock_path
+        self.rows = rows
+        self.delay_s = delay_s
+        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self._sock.bind(str(sock_path))
+        self._sock.listen(1)
+        self._sock.settimeout(5.0)
+
+    def run(self) -> None:
+        try:
+            conn, _ = self._sock.accept()
+        except socket.timeout:
+            return
+        try:
+            for row in self.rows:
+                conn.sendall((json.dumps(row) + "\n").encode())
+                time.sleep(self.delay_s)
+            time.sleep(0.1)
+        finally:
+            conn.close()
+            self._sock.close()
+
+
+def test_collector_reads_jsonl_and_restamps(tmp_path: Path) -> None:
+    sock_path = tmp_path / "agent.sock"
+    rows_in = [
+        {
+            "t_guest_mono_ns": 1, "t_guest_wall_ns": 2,
+            "source": "guest_agent", "available_in_deployment": True,
+            "mem_total_bytes": 256 * 1024 * 1024,
+            "mem_available_bytes": 200 * 1024 * 1024,
+            "load_1m_5m_15m": [0.1, 0.05, 0.0],
+            "cpu_total_jiffies": {"user": 10, "system": 5, "idle": 1000},
+        },
+        {
+            "t_guest_mono_ns": 100_000_000, "t_guest_wall_ns": 100_000_002,
+            "source": "guest_agent", "available_in_deployment": True,
+            "mem_total_bytes": 256 * 1024 * 1024,
+            "mem_available_bytes": 198 * 1024 * 1024,
+        },
+    ]
+    server = FakeAgentServer(sock_path, rows_in, delay_s=0.02)
+    server.start()
+    out_path = tmp_path / "telemetry-guest.jsonl"
+    stop = threading.Event()
+
+    def stop_after(ms: int) -> None:
+        time.sleep(ms / 1000.0)
+        stop.set()
+
+    threading.Thread(target=stop_after, args=(300,), daemon=True).start()
+
+    rows_written = guest_agent.run_loop(
+        socket_path=sock_path,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=stop,
+        connect_timeout_s=2.0,
+    )
+    server.join(timeout=2)
+
+    assert rows_written == 2
+    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
+    assert len(persisted) == 2
+    for orig, got in zip(rows_in, persisted):
+        # Original guest timestamps preserved.
+        assert got["t_guest_mono_ns"] == orig["t_guest_mono_ns"]
+        # Host-clock fields added.
+        assert "t_mono_ns" in got
+        assert "t_wall_ns" in got
+        assert got["source"] == "guest_agent"
+        assert got["available_in_deployment"] is True
+
+
+def test_collector_returns_zero_when_socket_missing(tmp_path: Path) -> None:
+    rows = guest_agent.run_loop(
+        socket_path=tmp_path / "no-socket-here.sock",
+        output_path=tmp_path / "out.jsonl",
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=threading.Event(),
+        connect_timeout_s=0.5,
+    )
+    assert rows == 0
+
+
+def test_collector_drops_malformed_lines_but_keeps_going(tmp_path: Path) -> None:
+    sock_path = tmp_path / "agent.sock"
+    # Will be sent verbatim; the malformed line should be skipped.
+    payload = (
+        b'{"source":"guest_agent","mem_total_bytes":1}\n'
+        b'this-is-not-json\n'
+        b'{"source":"guest_agent","mem_total_bytes":2}\n'
+    )
+
+    class Server(threading.Thread):
+        def __init__(self) -> None:
+            super().__init__(daemon=True)
+            self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+            self._sock.bind(str(sock_path))
+            self._sock.listen(1)
+
+        def run(self) -> None:
+            conn, _ = self._sock.accept()
+            try:
+                conn.sendall(payload)
+                time.sleep(0.2)
+            finally:
+                conn.close()
+                self._sock.close()
+
+    s = Server()
+    s.start()
+    out_path = tmp_path / "out.jsonl"
+    stop = threading.Event()
+    threading.Thread(
+        target=lambda: (time.sleep(0.4), stop.set()), daemon=True
+    ).start()
+    rows = guest_agent.run_loop(
+        socket_path=sock_path,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        stop_event=stop,
+        connect_timeout_s=2.0,
+    )
+    s.join(timeout=2)
+    assert rows == 2
+    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
+    assert [r["mem_total_bytes"] for r in persisted] == [1, 2]
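The guest-agent tests above describe three behaviours: JSON-lines parsing, re-stamping with host clocks, and skipping malformed lines. A minimal sketch of that loop body follows. `restamp_lines` is a hypothetical helper, not the collector's API, and treating `t_mono_ns` as an offset from `t_mono_origin_ns` is an assumption; the tests only require that the field exists.

```python
import json
import time
from typing import Iterable, Iterator


def restamp_lines(
    lines: Iterable[bytes], *, t_mono_origin_ns: int
) -> Iterator[dict]:
    """Yield one dict per valid JSON object line, with host timestamps added."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            row = json.loads(raw)
        except ValueError:
            # JSONDecodeError and UnicodeDecodeError both derive from
            # ValueError; a bad line is dropped, the stream continues.
            continue
        if not isinstance(row, dict):
            continue
        # Guest-side fields (t_guest_mono_ns, ...) are left untouched.
        row["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
        row["t_wall_ns"] = time.time_ns()
        yield row


if __name__ == "__main__":
    payload = [
        b'{"source":"guest_agent","mem_total_bytes":1}\n',
        b"this-is-not-json\n",
        b'{"source":"guest_agent","mem_total_bytes":2}\n',
    ]
    for row in restamp_lines(payload, t_mono_origin_ns=time.monotonic_ns()):
        print(row["mem_total_bytes"], row["t_mono_ns"])
```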
--- a/tests/test_pcap.py
+++ b/tests/test_pcap.py
@@ -0,0 +1,188 @@
+"""Tests for the pcap collector's pure-Python parser + bucketizer.
+
+We synthesize a tiny pcap file in memory (Ethernet + IPv4 + TCP/UDP
+records with controlled timestamps), feed it to ``bucketize()``, and
+verify the produced netflow.jsonl rows are correct.
+"""
+
+from __future__ import annotations
+
+import json
+import struct
+from pathlib import Path
+
+import pytest
+
+from collectors import pcap
+
+
+# ---------------------------------------------------------------------------
+# pcap synthesis helpers
+# ---------------------------------------------------------------------------
+
+
+_PCAP_GLOBAL_HDR = struct.pack(
+    "<IHHiIII",
+    0xa1b2c3d4,  # magic (us)
+    2, 4,        # version
+    0,           # thiszone
+    0,           # sigfigs
+    65535,       # snaplen
+    1,           # linktype = LINKTYPE_ETHERNET
+)
+
+
+def _ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
+    s = bytes(int(x) for x in src.split("."))
+    d = bytes(int(x) for x in dst.split("."))
+    total_len = 20 + len(payload)
+    return struct.pack(
+        ">BBHHHBBH",      # network byte order; src/dst are appended as raw bytes
+        0x45,             # version=4, IHL=5
+        0,                # tos
+        total_len,
+        0, 0, 64, proto,
+        0,                # checksum (don't care)
+    ) + s + d + payload
+
+
+def _tcp(sport: int, dport: int, flags: int) -> bytes:
+    # Minimal 20-byte TCP header: sport, dport, seq, ack, off+flags, win, csum, urg
+    return struct.pack(">HHIIBBHHH",
+                       sport, dport,
+                       0, 0,
+                       0x50,           # data offset = 5 (no options)
+                       flags,
+                       0, 0, 0)
+
+
+def _udp(sport: int, dport: int, length: int = 8) -> bytes:
+    return struct.pack(">HHHH", sport, dport, length, 0)
+
+
+def _ether(payload: bytes, ethertype: int = 0x0800) -> bytes:
+    return b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + struct.pack(">H", ethertype) + payload
+
+
+def _record(ts_ns: int, frame: bytes) -> bytes:
+    sec = ts_ns // 1_000_000_000
+    usec = (ts_ns // 1000) % 1_000_000
+    return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame
+
+
+def _build_pcap(records: list[tuple[int, bytes]]) -> bytes:
+    out = bytearray(_PCAP_GLOBAL_HDR)
+    for ts, frame in records:
+        out += _record(ts, frame)
+    return bytes(out)
+
+
+def _write_pcap(path: Path, records: list[tuple[int, bytes]]) -> None:
+    path.write_bytes(_build_pcap(records))
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+def test_iter_pcap_reads_records_back(tmp_path: Path) -> None:
+    p = tmp_path / "a.pcap"
+    frame = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
+    _write_pcap(p, [(1_000_000_000, frame)])
+
+    records = list(pcap._iter_pcap(p))
+    assert len(records) == 1
+    t_ns, data = records[0]
+    assert t_ns == 1_000_000_000
+    assert data == frame
+
+
+def test_decode_tcp_syn() -> None:
+    f = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
+    d = pcap._decode(f)
+    assert d["ethertype"] == 0x0800
+    assert d["ip_proto"] == 6
+    assert d["src_ip"] == "10.200.0.1"
+    assert d["dst_ip"] == "10.200.0.10"
+    assert d["src_port"] == 40000
+    assert d["dst_port"] == 21
+    assert d["tcp_flags"] & 0x02
+
+
+def test_decode_udp_dns_query() -> None:
+    f = _ether(_ipv4("10.200.0.10", "10.200.0.1", 17, _udp(33333, 53)))
+    d = pcap._decode(f)
+    assert d["ip_proto"] == 17
+    assert d["dst_port"] == 53
+
+
+def test_bucketize_collapses_per_window(tmp_path: Path) -> None:
+    pcap_path = tmp_path / "ep.pcap"
+    netflow_path = tmp_path / "netflow.jsonl"
+
+    bridge_ip = "10.200.0.1"
+    guest_ip = "10.200.0.10"
+    base_ns = 1_700_000_000_000_000_000  # arbitrary, aligned-friendly
+
+    records = [
+        # Bucket A (0..100ms)
+        (base_ns + 5_000_000,
+         _ether(_ipv4(guest_ip, bridge_ip, 6, _tcp(40000, 21, flags=0x02)))),
+        (base_ns + 9_000_000,
+         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x12)))),
+        # Bucket B (100..200ms): UDP DNS query
+        (base_ns + 105_000_000,
+         _ether(_ipv4(guest_ip, bridge_ip, 17, _udp(33333, 53)))),
+        # Bucket B: TCP RST
+        (base_ns + 199_000_000,
+         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x04)))),
+    ]
+    _write_pcap(pcap_path, records)
+
+    rows_written = pcap.bucketize(
+        pcap_path, netflow_path,
+        bucket_ms=100,
+        t_mono_origin_ns=base_ns,
+        bridge_ip=bridge_ip,
+    )
+    assert rows_written == 2
+
+    rows = [json.loads(l) for l in netflow_path.read_text().splitlines()]
+    a, b = rows
+    assert a["bucket_ms"] == 100
+    # Bucket A: 1 in (SYN), 1 out (SYN-ACK)
+    assert a["pkts_in"] == 1
+    assert a["pkts_out"] == 1
+    assert a["syn_count"] == 2
+    assert a["tcp_new_flows"] == 1  # only the bare SYN counts as new flow
+    assert a["dns_query_count"] == 0
+    assert a["unique_dst_ips"] == 2
+
+    # Bucket B: DNS + RST
+    assert b["dns_query_count"] == 1
+    assert b["rst_count"] == 1
+
+
+def test_bucketize_returns_zero_for_missing_file(tmp_path: Path) -> None:
+    rows = pcap.bucketize(
+        tmp_path / "nope.pcap",
+        tmp_path / "netflow.jsonl",
+        bucket_ms=100,
+        t_mono_origin_ns=0,
+    )
+    assert rows == 0
+
+
+def test_bucketize_handles_unknown_ethertype(tmp_path: Path) -> None:
+    p = tmp_path / "x.pcap"
+    netflow = tmp_path / "n.jsonl"
+    # ARP frame (ethertype 0x0806) — counted but not decoded.
+    f = _ether(b"\x00" * 28, ethertype=0x0806)
+    _write_pcap(p, [(1_000_000_000, f)])
+    rows = pcap.bucketize(p, netflow, bucket_ms=100, t_mono_origin_ns=0)
+    assert rows == 1
+    out = json.loads(netflow.read_text().splitlines()[0])
+    # No IP info, but byte/packet count survives.
+    assert out["pkts_in"] + out["pkts_out"] == 1
+    assert out["tcp_count"] == 0
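The bucketizer test above implies a simple counting rule per 100 ms window: every SYN bit counts toward `syn_count`, only a SYN without ACK opens a new flow, and a UDP packet to port 53 is a DNS query. The sketch below applies that rule to already-decoded packets. `bucket_counts` and its `pkts` field are hypothetical; direction counters (`pkts_in` / `pkts_out`) are left out because the diff does not show how direction is decided.

```python
from collections import defaultdict
from typing import Iterable

TCP_SYN, TCP_RST, TCP_ACK = 0x02, 0x04, 0x10


def _empty_bucket() -> dict:
    return {
        "pkts": 0,
        "syn_count": 0,
        "rst_count": 0,
        "tcp_new_flows": 0,
        "dns_query_count": 0,
    }


def bucket_counts(
    packets: Iterable[tuple], *, bucket_ms: int, t_origin_ns: int
) -> dict:
    """Group (t_ns, decoded_packet) pairs into fixed-width time buckets."""
    width_ns = bucket_ms * 1_000_000
    buckets = defaultdict(_empty_bucket)
    for t_ns, pkt in packets:
        bucket = buckets[(t_ns - t_origin_ns) // width_ns]
        bucket["pkts"] += 1
        proto = pkt.get("ip_proto")
        if proto == 6:  # TCP
            flags = pkt.get("tcp_flags", 0)
            if flags & TCP_SYN:
                bucket["syn_count"] += 1
                if not flags & TCP_ACK:
                    # Bare SYN: the initiator opening a connection.
                    bucket["tcp_new_flows"] += 1
            if flags & TCP_RST:
                bucket["rst_count"] += 1
        elif proto == 17 and pkt.get("dst_port") == 53:  # UDP to DNS
            bucket["dns_query_count"] += 1
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    demo = [
        (5_000_000, {"ip_proto": 6, "dst_port": 21, "tcp_flags": 0x02}),
        (9_000_000, {"ip_proto": 6, "dst_port": 40000, "tcp_flags": 0x12}),
        (105_000_000, {"ip_proto": 17, "dst_port": 53}),
    ]
    print(bucket_counts(demo, bucket_ms=100, t_origin_ns=0))
```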
--- a/tests/test_qmp.py
+++ b/tests/test_qmp.py
@@ -0,0 +1,295 @@
+"""Tests for the QMP collector against an in-process fake QMP server.
+
+The fake speaks just enough QMP to exercise:
+  - the greeting + qmp_capabilities handshake
+  - query-status
+  - query-blockstats
+  - query-stats target=vm
+  - error responses
+  - async events interleaved with command responses
+"""
+
+from __future__ import annotations
+
+import json
+import socket
+import tempfile
+import threading
+import time
+from pathlib import Path
+from typing import Any
+
+import pytest
+
+from collectors import qmp
+
+
+# ---------------------------------------------------------------------------
+# Fake QMP server
+# ---------------------------------------------------------------------------
+
+
+class FakeQMPServer(threading.Thread):
+    """Single-connection fake. Each line received from the client is
+    parsed as JSON; we look up ``execute`` in ``responses`` and emit
+    the configured reply. Optionally interleaves an async event before
+    the response."""
+
+    def __init__(
+        self,
+        socket_path: Path,
+        *,
+        responses: dict[str, Any] | None = None,
+        emit_event_before: set[str] | None = None,
+    ) -> None:
+        super().__init__(daemon=True)
+        self.socket_path = socket_path
+        self.responses = responses or {}
+        self.emit_event_before = emit_event_before or set()
+        self.received: list[dict] = []
+        self._stop_event = threading.Event()  # "_stop" would shadow Thread's private _stop()
+        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self._sock.bind(str(socket_path))
+        self._sock.listen(1)
+        self._sock.settimeout(5.0)
+
+    def run(self) -> None:
+        try:
+            conn, _ = self._sock.accept()
+        except socket.timeout:
+            return
+        conn.settimeout(5.0)
+        try:
+            # Greeting
+            conn.sendall(b'{"QMP": {"version": {"qemu": {"major":9,"minor":0,"micro":0}}, "capabilities": []}}\n')
+            buf = b""
+            while not self._stop_event.is_set():
+                try:
+                    chunk = conn.recv(4096)
+                except socket.timeout:
+                    if self._stop_event.is_set():
+                        return
+                    continue
+                if not chunk:
+                    return
+                buf += chunk
+                while b"\n" in buf:
+                    line, _, buf = buf.partition(b"\n")
+                    if not line.strip():
+                        continue
+                    msg = json.loads(line)
+                    self.received.append(msg)
+                    cmd = msg.get("execute")
+                    if cmd == "qmp_capabilities":
+                        conn.sendall(b'{"return": {}}\n')
+                        continue
+                    if cmd in self.emit_event_before:
+                        conn.sendall(b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n')
+                    if cmd in self.responses:
+                        resp = self.responses[cmd]
+                        conn.sendall((json.dumps(resp) + "\n").encode())
+                    else:
+                        conn.sendall(b'{"error": {"class": "CommandNotFound", "desc": "unknown"}}\n')
+        finally:
+            conn.close()
+
+    def shutdown(self) -> None:
+        self._stop_event.set()
+        try:
+            self._sock.close()
+        except OSError:
+            pass
+
+
+@pytest.fixture
+def qmp_server(tmp_path: Path):
+    sock_path = tmp_path / "qmp.sock"
+    return sock_path
+
+
+# ---------------------------------------------------------------------------
+# Client tests
+# ---------------------------------------------------------------------------
+
+
+def test_connect_negotiates_capabilities(qmp_server: Path) -> None:
+    server = FakeQMPServer(qmp_server)
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        greeting = client.connect()
+        assert "version" in greeting
+    finally:
+        client.close()
+        server.shutdown()
+    # Server saw the qmp_capabilities call during connect().
+    assert any(m.get("execute") == "qmp_capabilities" for m in server.received)
+
+
+def test_execute_returns_payload(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        out = client.execute("query-status")
+        assert out == {"status": "running", "running": True}
+    finally:
+        client.close()
+        server.shutdown()
+
+
+def test_execute_skips_async_events_before_response(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+        },
+        emit_event_before={"query-status"},
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        out = client.execute("query-status")
+        assert out["running"] is True
+    finally:
+        client.close()
+        server.shutdown()
+
+
+def test_execute_raises_on_qmp_error(qmp_server: Path) -> None:
+    server = FakeQMPServer(qmp_server)  # no responses → server sends error
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        with pytest.raises(qmp.QMPError):
+            client.execute("totally-fake-command")
+    finally:
+        client.close()
+        server.shutdown()
+
+
+# ---------------------------------------------------------------------------
+# Row builder tests
+# ---------------------------------------------------------------------------
+
+
+def test_collect_once_assembles_full_row(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": [{
+                "device": "virtio0",
+                "stats": {
+                    "rd_operations": 12, "wr_operations": 4,
+                    "rd_bytes": 49152, "wr_bytes": 16384,
+                    "flush_operations": 1,
+                },
+            }]},
+            "query-stats": {"return": [{"stats": [
+                {"name": "halt_exits", "value": 17000},
+                {"name": "io_exits",   "value": 942},
+                {"name": "string-skipped", "value": "not-an-int"},
+            ]}]},
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
+    finally:
+        client.close()
+        server.shutdown()
+
+    assert row["source"] == "host_qmp"
+    assert row["available_in_deployment"] is False
+    assert row["vm_running"] is True
+    assert row["blockstats"]["virtio0"]["rd_bytes"] == 49152
+    assert row["blockstats"]["virtio0"]["flush_ops"] == 1
+    assert row["kvm_stats"]["halt_exits"] == 17000
+    assert "string-skipped" not in row["kvm_stats"]
+
+
+def test_collect_once_tolerates_missing_query_stats(qmp_server: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": []},
+            # query-stats deliberately absent → server returns CommandNotFound
+        },
+    )
+    server.start()
+    try:
+        client = qmp.QMPClient(qmp_server)
+        client.connect()
+        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
+    finally:
+        client.close()
+        server.shutdown()
+
+    # Older qemu without query-stats: row still exists, kvm_stats absent.
+    assert "kvm_stats" not in row
+    assert row["vm_running"] is True
+    assert row["blockstats"] == {}
+
+
+# ---------------------------------------------------------------------------
+# run_loop tests
+# ---------------------------------------------------------------------------
+
+
+def test_run_loop_writes_rows_and_stops_cleanly(qmp_server: Path, tmp_path: Path) -> None:
+    server = FakeQMPServer(
+        qmp_server,
+        responses={
+            "query-status": {"return": {"status": "running", "running": True}},
+            "query-blockstats": {"return": []},
+            "query-stats": {"error": {"class": "CommandNotFound", "desc": "n/a"}},
+        },
+    )
+    server.start()
+    out_path = tmp_path / "telemetry-qmp.jsonl"
+    stop = threading.Event()
+
+    def stop_after(ms: int) -> None:
+        time.sleep(ms / 1000.0)
+        stop.set()
+
+    threading.Thread(target=stop_after, args=(350,), daemon=True).start()
+    rows = qmp.run_loop(
+        socket_path=qmp_server,
+        output_path=out_path,
+        t_mono_origin_ns=time.monotonic_ns(),
+        interval_ms=100,
+        stop_event=stop,
+    )
+    server.shutdown()
+
+    assert rows >= 2, f"expected >=2 rows, got {rows}"
+    lines = [json.loads(l) for l in out_path.read_text().splitlines()]
+    assert len(lines) == rows
+    for r in lines:
+        assert r["source"] == "host_qmp"
+        assert r["vm_running"] is True
+
+
+def test_run_loop_returns_zero_when_socket_missing(tmp_path: Path) -> None:
+    # No server bound to the socket path.
+    rows = qmp.run_loop(
+        socket_path=tmp_path / "nonexistent.sock",
+        output_path=tmp_path / "telemetry-qmp.jsonl",
+        t_mono_origin_ns=time.monotonic_ns(),
+        interval_ms=100,
+        stop_event=threading.Event(),
+    )
+    assert rows == 0
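The client tests above depend on one QMP rule: a command's reply is the next message carrying `return` or `error`, and any `event` messages that arrive first must be skipped. A minimal sketch of that read loop, independent of collectors/qmp.py, follows. `read_response` and `QMPErrorSketch` are hypothetical names, not the names used by the real client.

```python
import json
from typing import Any, Iterable


class QMPErrorSketch(Exception):
    """Raised when the server answers a command with an error object."""


def read_response(lines: Iterable[bytes]) -> Any:
    """Return the payload of the first command reply in a QMP line stream."""
    for raw in lines:
        if not raw.strip():
            continue
        msg = json.loads(raw)
        if "event" in msg:
            # Asynchronous events can arrive at any time; they are not
            # the reply to the command that was just sent.
            continue
        if "error" in msg:
            err = msg["error"]
            raise QMPErrorSketch(f"{err.get('class')}: {err.get('desc')}")
        if "return" in msg:
            return msg["return"]
    raise ConnectionError("QMP stream ended before a reply arrived")


if __name__ == "__main__":
    stream = [
        b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n',
        b'{"return": {"status": "running", "running": true}}\n',
    ]
    print(read_response(stream))
```

A row builder that wants to tolerate an older qemu can catch the error for `query-stats` alone and omit the `kvm_stats` key, which is the behaviour the missing-query-stats test expects.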
--- a/tools/build_cidata.py
+++ b/tools/build_cidata.py
@@ -28,7 +28,7 @@ from pathlib import Path
 import pycdlib


-DEFAULT_USER_DATA = """\
+DEFAULT_USER_DATA_HEAD = """\
 #cloud-config
 hostname: cis490
 manage_etc_hosts: true
@@ -45,10 +45,70 @@ chpasswd:
  list: |
    root:cis490
    cis490:cis490
-runcmd:
-  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]
 """

+# OpenRC service file shipped inside the guest. Alpine uses OpenRC;
+# the runcmd at the bottom of user-data wires it up on first boot.
+OPENRC_SERVICE = """\
+#!/sbin/openrc-run
+
+description="CIS490 in-guest telemetry agent"
+command="/usr/local/bin/cis490-agent"
+command_args="--port /dev/virtio-ports/cis490.guest.agent"
+command_background=true
+pidfile="/run/cis490-agent.pid"
+output_log="/var/log/cis490-agent.log"
+error_log="/var/log/cis490-agent.log"
+
+depend() {
+    need localmount
+}
+"""
+
+DEFAULT_META_DATA = """\
+instance-id: cis490-vm-001
+local-hostname: cis490
+"""
+
+
+def _indent(text: str, n: int) -> str:
+    pad = " " * n
+    return "\n".join(pad + line if line else line for line in text.splitlines())
+
+
+def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
+    """Build a cloud-init user-data document. When ``embed_agent`` is
+    True, also stuff the in-guest agent + an OpenRC service into
+    ``write_files`` and arrange to start the service on first boot."""
+    head = DEFAULT_USER_DATA_HEAD
+    if not embed_agent:
+        return (head + 'runcmd:\n  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n').encode()
+
+    if agent_path is None:
+        agent_path = Path(__file__).resolve().parent.parent / "vm" / "guest-agent" / "cis490_agent.py"
+    if not agent_path.exists():
+        raise FileNotFoundError(f"agent script not found: {agent_path}")
+    agent_src = agent_path.read_text()
+
+    body = head + (
+        "write_files:\n"
+        "  - path: /usr/local/bin/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(agent_src, 6)}\n"
+        "  - path: /etc/init.d/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(OPENRC_SERVICE, 6)}\n"
+        "runcmd:\n"
+        '  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n'
+        '  - [ sh, -c, "command -v rc-update >/dev/null && rc-update add cis490-agent default || true" ]\n'
+        '  - [ sh, -c, "command -v rc-service >/dev/null && rc-service cis490-agent start || true" ]\n'
+    )
+    return body.encode()
+
 DEFAULT_META_DATA = """\
 instance-id: cis490-vm-001
 local-hostname: cis490
@@ -93,11 +153,26 @@ def main() -> int:
        default=None,
        help="path to a custom meta-data file",
    )
+    parser.add_argument(
+        "--no-embed-agent",
+        action="store_true",
+        help="don't bake the in-guest agent into user-data",
+    )
+    parser.add_argument(
+        "--agent-path",
+        type=Path,
+        default=None,
+        help="path to the in-guest agent (default: vm/guest-agent/cis490_agent.py)",
+    )
    args = parser.parse_args()

-    user_data = (
-        args.user_data.read_bytes() if args.user_data else DEFAULT_USER_DATA.encode()
-    )
+    if args.user_data:
+        user_data = args.user_data.read_bytes()
+    else:
+        user_data = build_user_data(
+            embed_agent=not args.no_embed_agent,
+            agent_path=args.agent_path,
+        )
    meta_data = (
        args.meta_data.read_bytes() if args.meta_data else DEFAULT_META_DATA.encode()
    )
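The `build_user_data` change above embeds whole files in cloud-init `write_files` entries by indenting them under a YAML `content: |` block scalar. The sketch below shows that indentation step on a tiny script. `indent_block` mirrors the `_indent` helper in the diff; `write_files_entry` and the demo path are hypothetical and exist only for illustration.

```python
def indent_block(text: str, n: int) -> str:
    """Pad every non-empty line with n spaces; leave blank lines empty."""
    pad = " " * n
    return "\n".join(pad + line if line else line for line in text.splitlines())


def write_files_entry(path: str, content: str, mode: str = "0755") -> str:
    """Render one write_files list item with the file body as a block scalar."""
    return (
        f"  - path: {path}\n"
        f"    permissions: '{mode}'\n"
        "    owner: root:root\n"
        "    content: |\n"
        f"{indent_block(content, 6)}\n"
    )


if __name__ == "__main__":
    print("write_files:")
    print(write_files_entry("/usr/local/bin/demo", "#!/bin/sh\n\necho hi\n"), end="")
```

The body sits six spaces in, two deeper than the `content:` key, so the block scalar ends at the next list item. Blank lines stay empty, which keeps trailing whitespace out of the generated document.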
--- a/tools/run_fleet.py
+++ b/tools/run_fleet.py
@@ -0,0 +1,97 @@
+"""``cis490-fleet`` — run as many concurrent labeled episodes as the
+host can handle, drawing samples from the manifest.
+
+Modes:
+
+  --capacity     Print the resource calculation and exit. No VMs spawned.
+  --waves N      Run N waves of episodes (one wave = max_concurrent
+                 episodes, each in its own slot). Default: 1.
+  --max-concurrent N
+                 Cap concurrency below the auto-detected ceiling.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import signal
+import sys
+from pathlib import Path
+
+# Allow running as a script.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from orchestrator.fleet import (  # noqa: E402
+    FleetConfig, FleetRunner, capacity_report, detect_capacity,
+)
+from samples.manifest import SampleManifest  # noqa: E402
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-fleet")
+    p.add_argument("--capacity", action="store_true")
+    p.add_argument("--waves", type=int, default=1)
+    p.add_argument("--max-concurrent", type=int, default=None)
+    p.add_argument("--manifest",
+                   default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"))
+    p.add_argument("--data-root", default="data")
+    p.add_argument("--host-id", default=os.environ.get("FLEET_HOST_ID") or os.uname().nodename)
+    p.add_argument("--ram-per-vm-mib", type=int, default=320)
+    p.add_argument("--require-real-samples", action="store_true")
+    p.add_argument("--log-level", default="INFO")
+    args = p.parse_args(argv)
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+
+    if args.capacity:
+        print(capacity_report())
+        return 0
+
+    manifest = SampleManifest.load(args.manifest)
+    repo_root = Path(__file__).resolve().parent.parent
+
+    cfg = FleetConfig(
+        host_id=args.host_id,
+        repo_root=repo_root,
+        data_root=Path(args.data_root).resolve(),
+        manifest=manifest,
+        ram_per_vm_mib=args.ram_per_vm_mib,
+        max_concurrent_override=args.max_concurrent,
+        require_real_samples=args.require_real_samples,
+    )
+
+    runner = FleetRunner(cfg)
+
+    def _stop(signum, frame):  # noqa: ARG001
+        runner.stop()
+    signal.signal(signal.SIGTERM, _stop)
+    signal.signal(signal.SIGINT, _stop)
+
+    result = runner.run(episodes=args.waves)
+
+    print(json.dumps({
+        "host_id": args.host_id,
+        "capacity": result.capacity.to_dict(),
+        "slots": [
+            {
+                "slot": s.slot,
+                "sample": s.sample_name,
+                "sample_kind": s.sample_kind,
+                "rc": s.rc,
+                "duration_s": s.duration_s,
+                "error": s.error,
+            } for s in result.slots
+        ],
+        "total_duration_s": result.total_duration_s,
+    }, indent=2))
+
+    return 0 if all(s.rc == 0 for s in result.slots) else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/vm/guest-agent/cis490_agent.py
+++ b/vm/guest-agent/cis490_agent.py
@@ -0,0 +1,274 @@
+#!/usr/bin/env python3
+"""In-guest telemetry agent — runs INSIDE the VM.
+
+Writes one JSON-lines row per tick to a virtio-serial port that the
+host has wired up as ``cis490.guest.agent``. The host-side collector
+(`collectors.guest_agent`) reads these rows and stamps them with the
+host's monotonic clock before persisting to ``telemetry-guest.jsonl``.
+
+Stdlib only — no `psutil`, no extra deps to bake into the guest. Every
+field is read from /proc on the guest, so this works on busybox-based
+Alpine, on Cirros, and on Metasploitable2 unchanged.
+
+Wire path inside the guest:
+    /dev/virtio-ports/cis490.guest.agent
+
+The host side opens the matching unix socket on the hypervisor.
+The protocol is intentionally trivial: the agent emits newline-
+delimited JSON; the host emits nothing back. One direction.
+
+This source is the **deployable** side — every row is tagged
+``available_in_deployment: true``. See docs/threat-model.md.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import platform
+import sys
+import time
+from typing import Any
+
+
+SOURCE = "guest_agent"
+AVAILABLE_IN_DEPLOYMENT = True
+DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
+DEFAULT_INTERVAL_MS = 100  # 10 Hz
+DEFAULT_TOP_N = 8
+
+
+# ---------- /proc parsers ---------------------------------------------------
+
+
+def _read(path: str) -> str | None:
+    try:
+        with open(path, "rb") as f:
+            return f.read().decode("ascii", errors="replace")
+    except OSError:  # incl. ProcessLookupError when a /proc/<pid> entry vanishes mid-read
+        return None
+
+
+def read_loadavg() -> tuple[float, float, float] | None:
+    text = _read("/proc/loadavg")
+    if text is None:
+        return None
+    parts = text.split()
+    return float(parts[0]), float(parts[1]), float(parts[2])
+
+
+def read_meminfo() -> dict[str, int]:
+    text = _read("/proc/meminfo")
+    out: dict[str, int] = {}
+    if text is None:
+        return out
+    for line in text.splitlines():
+        k, _, rest = line.partition(":")
+        v = rest.strip()
+        if v.endswith(" kB"):
+            try:
+                out[k] = int(v[:-3]) * 1024
+            except ValueError:
+                pass
+    return out
+
+
+def read_cpu_total() -> dict[str, int] | None:
+    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
+    in jiffies since boot."""
+    text = _read("/proc/stat")
+    if text is None:
+        return None
+    line = text.splitlines()[0]
+    fields = line.split()
+    # cpu user nice system idle iowait irq softirq steal guest guest_nice
+    if not fields or fields[0] != "cpu":
+        return None
+    nums = [int(x) for x in fields[1:]]
+    pad = nums + [0] * max(0, 10 - len(nums))
+    return {
+        "user":      pad[0],
+        "nice":      pad[1],
+        "system":    pad[2],
+        "idle":      pad[3],
+        "iowait":    pad[4],
+        "irq":       pad[5],
+        "softirq":   pad[6],
+        "steal":     pad[7],
+        "guest":     pad[8],
+        "guest_nice":pad[9],
+    }
+
+
+def read_thermal_milli_c() -> int | None:
+    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
+    text = _read("/sys/class/thermal/thermal_zone0/temp")
+    if text is None:
+        return None
+    try:
+        return int(text.strip())
+    except ValueError:
+        return None
+
+
+def read_net_devs() -> dict[str, dict[str, int]]:
+    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
+    text = _read("/proc/net/dev")
+    out: dict[str, dict[str, int]] = {}
+    if text is None:
+        return out
+    lines = text.splitlines()
+    for line in lines[2:]:
+        if ":" not in line:
+            continue
+        name, _, rest = line.partition(":")
+        name = name.strip()
+        if name == "lo":
+            continue
+        cols = rest.split()
+        if len(cols) < 16:
+            continue
+        out[name] = {
+            "rx_bytes": int(cols[0]),
+            "rx_pkts":  int(cols[1]),
+            "tx_bytes": int(cols[8]),
+            "tx_pkts":  int(cols[9]),
+        }
+    return out
+
+
+def read_listen_ports() -> list[int]:
+    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
+    out: set[int] = set()
+    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
+        text = _read(path)
+        if not text:
+            continue
+        for line in text.splitlines()[1:]:
+            cols = line.split()
+            if len(cols) < 4:
+                continue
+            if cols[3] != "0A":
+                continue
+            local = cols[1]  # "ADDR:PORT" with PORT in hex
+            _, _, port_hex = local.rpartition(":")
+            try:
+                out.add(int(port_hex, 16))
+            except ValueError:
+                pass
+    return sorted(out)
+
+
+def read_top_procs(top_n: int) -> list[dict[str, Any]]:
+    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
+    procs: list[dict[str, Any]] = []
+    try:
+        entries = os.listdir("/proc")
+    except OSError:
+        return procs
+    for ent in entries:
+        if not ent.isdigit():
+            continue
+        pid = int(ent)
+        stat = _read(f"/proc/{pid}/stat")
+        if stat is None:
+            continue
+        try:
+            rparen = stat.rindex(")")
+            comm = stat[stat.index("(") + 1 : rparen]
+            fields = stat[rparen + 2:].split()
+            utime = int(fields[11])
+            stime = int(fields[12])
+            rss_pages = int(fields[21])
+        except (ValueError, IndexError):
+            continue
+        procs.append({
+            "pid": pid,
+            "comm": comm[:32],
+            "cpu_jiffies": utime + stime,
+            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
+        })
+    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
+    return procs[:top_n]
+
+
+# ---------- one tick --------------------------------------------------------
+
+
+def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
+    mem = read_meminfo()
+    cpu = read_cpu_total()
+    load = read_loadavg()
+    return {
+        "t_guest_mono_ns": time.monotonic_ns(),
+        "t_guest_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "kernel": platform.release(),
+        "cpu_total_jiffies": cpu,
+        "load_1m_5m_15m": list(load) if load else None,
+        "mem_total_bytes":     (mem.get("MemTotal") or 0),
+        "mem_available_bytes": (mem.get("MemAvailable") or 0),
+        "mem_buffers_bytes":   (mem.get("Buffers") or 0),
+        "mem_cached_bytes":    (mem.get("Cached") or 0),
+        "swap_used_bytes":     (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
+        "thermal_milli_c": read_thermal_milli_c(),
+        "net": read_net_devs(),
+        "listen_ports": read_listen_ports(),
+        "top_procs": read_top_procs(top_n),
+    }
+
+
+# ---------- main loop -------------------------------------------------------
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-guest-agent")
+    p.add_argument("--port", default=DEFAULT_PORT,
+                   help="virtio-serial port path inside the guest")
+    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
+    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
+    p.add_argument("--once", action="store_true",
+                   help="emit a single row and exit (for smoke tests)")
+    args = p.parse_args(argv)
+
+    if args.once:
+        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
+        sys.stdout.flush()
+        return 0
+
+    # Open the virtio-serial port. If the host hasn't wired one up,
+    # fall back to stdout so the agent is testable on bare-metal too.
+    out_fp: Any
+    if os.path.exists(args.port):
+        out_fp = open(args.port, "wb", buffering=0)
+    else:
+        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
+        out_fp = sys.stdout.buffer
+
+    interval_ns = args.interval_ms * 1_000_000
+    next_tick = time.monotonic_ns()
+    try:
+        while True:
+            row = collect_once(args.top_n)
+            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
+            try:
+                out_fp.flush()
+            except (AttributeError, OSError):
+                pass
+            next_tick += interval_ns
+            sleep_ns = next_tick - time.monotonic_ns()
+            if sleep_ns > 0:
+                time.sleep(sleep_ns / 1_000_000_000)
+            else:
+                next_tick = time.monotonic_ns()
+    except KeyboardInterrupt:
+        return 0
+    except (BrokenPipeError, OSError) as e:
+        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/vm/launch_demo.sh
+++ b/vm/launch_demo.sh
@@ -16,7 +16,12 @@ set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/alpine-baseline.qcow2}"
 CIDATA="${CIDATA:-$REPO_ROOT/vm/images/cidata.iso}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
+# SLOT lets the fleet runner spin up N concurrent VMs without socket /
+# port collisions. With SLOT unset, the ssh hostfwd port stays 2222;
+# the default RUN_DIR becomes /tmp/cis490-vm-0 (was /tmp/cis490-vm).
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-vm-$SLOT}"
+SSH_PORT="${SSH_PORT:-$((2222 + SLOT))}"

 mkdir -p "$RUN_DIR"
 QMP_SOCK="$RUN_DIR/qmp.sock"
@@ -32,8 +37,14 @@ if [[ ! -f "$CIDATA" ]]; then
    exit 1
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on routes guest writes through a temporary overlay so the qcow2
 # on disk is never mutated — every boot starts from the same bytes.
+#
+# Second virtio-serial port (cis490.guest.agent) carries telemetry
+# from the in-guest agent. Surfaces inside the guest at
+# /dev/virtio-ports/cis490.guest.agent and on the host at $AGENT_SOCK.
 exec qemu-system-x86_64 \
    -name cis490-vm \
    -machine q35,accel=kvm \
@@ -42,8 +53,11 @@ exec qemu-system-x86_64 \
    -m 256 \
    -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
    -drive file="$CIDATA",format=raw,if=virtio,readonly=on \
-    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2222-:22 \
+    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:"$SSH_PORT"-:22 \
    -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
    -nographic \
    -serial unix:"$RUN_DIR/serial.sock",server=on,wait=off \
    -monitor unix:"$MON_SOCK",server=on,wait=off \
--- a/vm/launch_target.sh
+++ b/vm/launch_target.sh
@@ -26,11 +26,14 @@ set -euo pipefail

 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/metasploitable2.qcow2}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-target}"
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-target-$SLOT}"
 RAM_MIB="${RAM_MIB:-512}"
 # Ports the host should forward to the guest. Comma-separated host:guest pairs.
-# Default covers the vsftpd module's RPORT.
-TARGET_PORTS="${TARGET_PORTS:-21:21}"
+# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
+# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
+PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
+TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
 # KVM if the host can take it; otherwise fall back to TCG. Cross-arch
 # images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.
 ACCEL="${ACCEL:-}"
@@ -77,7 +80,13 @@ if [[ "$ACCEL" == "kvm" ]]; then
    CPU_FLAGS=(-cpu host)
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on so the qcow2 is never mutated — every boot is identical.
+# Second virtio-serial port carries the in-guest agent's telemetry to
+# the host (see vm/guest-agent/). Targets without the agent installed
+# (e.g. unmodified Metasploitable2) leave the device unused — the
+# host-side collector simply gets no rows. Harmless.
 exec qemu-system-x86_64 \
    -name cis490-target \
    -machine q35,accel="$ACCEL" \
@@ -87,6 +96,9 @@ exec qemu-system-x86_64 \
    -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
    -netdev "$NETDEV" \
    -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
    -nographic \
    -serial unix:"$SERIAL_SOCK",server=on,wait=off \
    -monitor unix:"$MON_SOCK",server=on,wait=off \
--- a/vm/setup_bridge.sh
+++ b/vm/setup_bridge.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Create the host-only ``br-malware`` bridge for Tier-3+ episodes.
+#
+# Properties (from docs/architecture.md):
+#   - Bridge address 10.200.0.1/24 on the host side.
+#   - NO NAT, NO route, NO DNS: guests cannot reach the LAN or the
+#     internet. The bridge only carries traffic between the host and
+#     the guests on it.
+#   - Lab-host and target VMs both attach via tap devices created by
+#     the launcher.
+#
+# Run as root, once per boot (`ip link add` does not persist). Idempotent: re-running is safe.
+
+set -euo pipefail
+
+BRIDGE="${BRIDGE:-br-malware}"
+BRIDGE_IP="${BRIDGE_IP:-10.200.0.1/24}"
+
+log() { printf '[setup_bridge] %s\n' "$*" >&2; }
+
+[[ $EUID -eq 0 ]] || { log "must run as root"; exit 1; }
+
+if ! command -v ip >/dev/null; then
+    log 'iproute2 (`ip`) is required'
+    exit 1
+fi
+
+if ! ip link show "$BRIDGE" >/dev/null 2>&1; then
+    log "creating bridge $BRIDGE"
+    ip link add name "$BRIDGE" type bridge
+    # Disable spanning-tree on the host-only bridge — it isn't needed
+    # and adds startup delay.
+    ip link set "$BRIDGE" type bridge stp_state 0
+fi
+
+ip link set "$BRIDGE" up
+
+# Add the host-side address if not already there.
+if ! ip -4 addr show dev "$BRIDGE" | grep -qF "inet ${BRIDGE_IP%%/*}/"; then
+    log "adding $BRIDGE_IP to $BRIDGE"
+    ip addr add "$BRIDGE_IP" dev "$BRIDGE"
+fi
+
+# Warn if the kernel could forward between this bridge and any other
+# interface. This script does not change net.ipv4.ip_forward itself;
+# a misconfigured host could leak the malware bridge to the LAN.
+if [[ "$(cat /proc/sys/net/ipv4/ip_forward)" == "1" ]]; then
+    log "WARNING: net.ipv4.ip_forward=1; make sure iptables / nftables"
+    log "blocks traffic from $BRIDGE to non-loopback devices."
+fi
+
+log "bridge ready: $(ip -4 -br addr show "$BRIDGE")"
+log ""
+log "Launchers can now opt into tap+bridge mode by setting:"
+log "  BRIDGE=$BRIDGE   (tells launch_target.sh to attach a tap to this bridge)"
+log "Default launcher behaviour stays SLIRP usermode for simplicity."