Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts

This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py          — source 2 (oracle). Tiny synchronous QMP
                               client + row builder + run loop. Tolerates
                               older qemu without query-stats.
  collectors/guest_agent.py  — source 5 (deployable). Reads the
                               virtio-serial host-side socket, parses
                               agent JSON-lines, re-stamps to the host
                               monotonic clock, persists.
  collectors/pcap.py         — source 4 (deployable). tcpdump capture
                               + pure-Python pcap reader + 100 ms
                               netflow.jsonl bucketizer. Decodes
                               Ethernet/IPv4/TCP/UDP enough for the
                               schema in docs/data-model.md.

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.
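
The capacity heuristic can be sketched roughly as below. This is a minimal illustration built from the defaults documented in the orchestrator/fleet.py docstring; the function name `estimate_max_concurrent`, the exact way the load cap is tightened (halving), and the clamping to zero are assumptions made for the example, not the code in this commit.

```python
def estimate_max_concurrent(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> int:
    """Hypothetical sketch: the smallest of the core, RAM and load caps wins."""
    cores_reserved = max(1, cores_total // 8)          # host + collectors
    ram_headroom_mib = max(1024, ram_total_mib // 8)   # never starve the host
    max_by_cores = max(0, cores_total - cores_reserved)
    max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
    # Assumption: a busy host (load per core above 0.75) halves the core cap.
    if cores_total > 0 and load_1m / cores_total > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores
    return min(max_by_cores, max_by_ram, max_by_load)
```

Under these assumptions an idle 8-core host with plenty of free RAM is capped by cores, while a low-RAM or heavily loaded host is capped by whichever limit is smaller.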

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.
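
One way to get coordinator-free, deterministic selection is to derive a per-host ordering of the catalog from a hash. The sketch below is an assumption about how such a `select` could work, not the implementation in samples/manifest.py; `host_order` and the list-of-names catalog are hypothetical.

```python
import hashlib


def host_order(sample_names: list[str], host_id: str) -> list[str]:
    """Return the catalog in an order that depends only on host_id."""
    def rank(name: str) -> bytes:
        return hashlib.sha256(f"{host_id}:{name}".encode("utf-8")).digest()
    return sorted(sample_names, key=rank)


def select(sample_names: list[str], host_id: str, slot: int, episode_index: int) -> str:
    """Pick a sample for (host, slot, episode) without any shared state."""
    if not sample_names:
        raise ValueError("sample catalog is empty")
    order = host_order(sample_names, host_id)
    # Consecutive slots of the same wave land on consecutive catalog entries,
    # so they differ as long as the wave is no larger than the catalog.
    return order[(episode_index + slot) % len(order)]
```

Because the ordering is a pure function of `host_id` and the sample names, two hosts can compute their schedules independently and re-running an episode index reproduces the same choice.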

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.
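
The fetch script itself is shell and is not shown in this chunk. As an illustration of the verify step only, a streaming sha256 check could look like the following; the helper names `sha256_of` and `verify_image` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large images never sit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def verify_image(path: Path, expected_sha256: str) -> bool:
    """Compare against the operator-supplied digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.strip().lower())
```

A caller would refuse to convert or install the image when `verify_image` returns False.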

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py       — fake QMP server, capability handshake,
                            blockstats, async-event interleaving,
                            5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
                              re-stamp, malformed-line tolerance
  tests/test_pcap.py      — synthetic pcap with TCP/UDP/ARP frames,
                            bucketize correctness across windows
  tests/test_fleet.py     — capacity math (8-core idle / low-RAM /
                            high-load / Pi5 / 1-core box), manifest
                            selection determinism + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
max 2026-04-30 00:02:27 -05:00
parent 2579683efb
commit 1b6c7b2f4a
22 changed files with 2825 additions and 40 deletions


@@ -94,15 +94,19 @@ tools/show_envelope.sh data/episodes/<episode_id>
 ## Status
-- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
+- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — running on Pi5 via Caddy + mTLS (wg-pki client CA)
 - ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
-- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
+- ✅ Host /proc oracle collector (source 1) @ 10 Hz
+- ✅ **QMP collector** (source 2) — query-status / query-blockstats / query-stats, 1 Hz
+- ✅ **Bridge pcap** (source 4) — pure-Python pcap parser + 100 ms-bucketed netflow.jsonl
+- ✅ **In-guest agent** (source 5) — virtio-serial; cidata-embedded for first-boot install on Alpine; host-side reader re-stamps to host clock
 - ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
-- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
+- ✅ Real VM (Alpine 3.21 cloud-init under KVM)
 - ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
-- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed (`exploits/`); end-to-end run against a live `msfrpcd` + Metasploitable2 image still pending.
-- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
-- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
+- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed; `scripts/install-msfrpcd.sh` automates msfrpcd setup; `scripts/fetch-metasploitable2.sh` pulls + verifies the target image (URL+sha256 from operator). Driver v2 (sample-profile-driven workloads) is the next step for ML diversity.
+- ✅ **Shipper** — lab-host ↔ Pi receiver via tar+zstd PUT over WG with mTLS; `--ping` smoke mode
+- ✅ **Fleet runner** — host-capacity-aware concurrency (`tools/run_fleet.py`); resource detector reserves cores + RAM headroom; sample manifest with deterministic per-(host, slot, episode) selection so every host on the network produces *novel, varied, labeled* data
+- ✅ **Sample manifest** — six initial profiles (cryptominer / botnet / ransomware / banking-trojan / fileless / RAT). Real-malware fetch from MalwareBazaar is the Tier-4 follow-up.
 > **Topology note:** in this project the **Pi5 is the WireGuard-side
 > *collector*** that receives episode tarballs from one or more lab hosts.

collectors/guest_agent.py (new file, 119 lines)

@@ -0,0 +1,119 @@
"""Source 5 (feature, deployable): in-guest agent reader.

QEMU exposes a virtio-serial channel two ways:
  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
  - on the host: a unix socket at ``$RUN_DIR/agent.sock``

The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
JSON-lines row per tick into the guest-side device. Bytes traverse the
virtio bus and surface on the host socket. This collector reads them,
re-stamps with the host's monotonic clock (so rows align with all
other telemetry on a single timeline), and persists to
``telemetry-guest.jsonl``.

Why re-stamp? The agent's clock is the *guest* clock, which can drift
from the host (rare in KVM, but happens during live-migration tests
and on heavy host load). The original guest timestamps stay in the row
under ``t_guest_*`` so analysts can quantify drift if they care.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""
from __future__ import annotations

import json
import logging
import socket
import threading
import time
from pathlib import Path

log = logging.getLogger("cis490.collectors.guest_agent")

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True


def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
    """Retry the unix-socket connect until ``timeout_s`` elapses."""
    deadline = time.monotonic() + timeout_s
    last_err: OSError | None = None
    while time.monotonic() < deadline:
        s: socket.socket | None = None
        try:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            s.settimeout(2.0)
            s.connect(str(socket_path))
            return s
        except OSError as e:
            if s is not None:
                s.close()  # don't leak a descriptor on every failed attempt
            last_err = e
            time.sleep(0.5)
    if last_err is not None:
        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
    return None


def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
    """Replace the agent's wall-only timestamps with host-clock ones,
    keeping the originals under ``t_guest_*`` for drift analysis."""
    out = dict(row)
    # Keep whatever the agent already stored under ``t_guest_*``. Otherwise
    # fall back to its ``t_mono_ns`` / ``t_wall_ns`` before the host values
    # below overwrite them. This assumes the agent uses one of those two
    # naming schemes; a missing field is recorded as None.
    out.setdefault("t_guest_mono_ns", row.get("t_mono_ns"))
    out.setdefault("t_guest_wall_ns", row.get("t_wall_ns"))
    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    out["t_wall_ns"] = time.time_ns()
    out.setdefault("source", SOURCE)
    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
    return out


def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    stop_event: threading.Event,
    *,
    connect_timeout_s: float = 30.0,
) -> int:
    """Read agent JSON-lines from the host-side virtio-serial unix
    socket. Re-stamp each row with the host clock and persist."""
    sock_path = Path(socket_path)
    sock = _connect(sock_path, connect_timeout_s)
    if sock is None:
        return 0
    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    buf = b""
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    sock.settimeout(0.5)
                    chunk = sock.recv(8192)
                except socket.timeout:
                    continue
                except OSError as e:
                    log.warning("guest-agent recv failed: %s", e)
                    break
                if not chunk:
                    log.info("guest-agent socket closed")
                    break
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except ValueError as e:
                        # Covers JSONDecodeError and UnicodeDecodeError alike.
                        log.warning("dropping malformed guest-agent line: %s", e)
                        continue
                    if not isinstance(row, dict):
                        log.warning("dropping non-object guest-agent line")
                        continue
                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
                    rows += 1
    finally:
        try:
            sock.close()
        except OSError:
            pass
    return rows
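
The line-framing technique used by the reader above (accumulate bytes, split on newlines, keep the unfinished tail for the next read) can be exercised in isolation. The helper below is a hypothetical sketch for illustration; `feed` is not part of collectors/guest_agent.py.

```python
import json


def feed(buf: bytes, chunk: bytes) -> tuple[list[dict], bytes]:
    """Append chunk to buf; return (decoded rows, leftover partial line)."""
    buf += chunk
    rows: list[dict] = []
    while b"\n" in buf:
        line, _, buf = buf.partition(b"\n")
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except ValueError:
            continue  # malformed line: drop it and keep reading
        if isinstance(row, dict):
            rows.append(row)
    return rows, buf
```

Feeding `b'{"cpu": 1}\n{"cp'` yields one row and leaves `b'{"cp'` buffered, so a row split across two `recv()` calls is only parsed once its newline arrives.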

collectors/pcap.py (new file, 288 lines)

@@ -0,0 +1,288 @@
"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.

Captures packets on the host-only ``br-malware`` bridge during an
episode, writes the raw pcap, and produces a bucketed JSONL file the
trainer can consume directly.

The capture is **gateway-side**: the orchestrator sees the same
packets a real upstream router/gateway would see in deployment, so
features derived here transfer 1:1 to the deployment-time gateway
observer.

Implementation:
  - ``run_capture()`` spawns ``tcpdump -i <bridge> -U -w <out.pcap>``
    as a subprocess for the episode duration. ``-U`` flushes per
    packet so the file is consumable mid-flight.
  - ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
    rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
    dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
    the schema in docs/data-model.md.

The pure-Python parser is intentionally minimal: it does NOT do
fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
the cases that occur on a host-only bridge for malware behaviour:
plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
the byte/packet level but skipped for protocol-specific stats.
"""
from __future__ import annotations

import json
import logging
import os
import signal
import struct
import subprocess
import threading
import time
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path

log = logging.getLogger("cis490.collectors.pcap")

SOURCE = "bridge_pcap"
AVAILABLE_IN_DEPLOYMENT = True

# Pcap file-level header
_PCAP_GLOBAL_HDR = "<IHHiIII"
_PCAP_GLOBAL_HDR_SIZE = 24
_PCAP_REC_HDR = "<IIII"
_PCAP_REC_HDR_SIZE = 16
_PCAP_MAGIC_USEC = 0xa1b2c3d4
_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant


# ---------------------------------------------------------------------------
# Capture
# ---------------------------------------------------------------------------

@dataclass
class CaptureHandle:
    proc: subprocess.Popen
    pcap_path: Path
    bridge: str
    started_mono_ns: int


def run_capture(
    *,
    bridge: str,
    pcap_path: Path,
    snaplen: int = 256,
    bpf: str | None = None,
) -> CaptureHandle:
    """Start a tcpdump capture on ``bridge``. Returns a handle the
    caller stops via ``stop_capture()``."""
    pcap_path.parent.mkdir(parents=True, exist_ok=True)
    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
    if bpf:
        args.append(bpf)
    log.info("starting pcap: %s", " ".join(args))
    proc = subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
    )
    return CaptureHandle(
        proc=proc, pcap_path=pcap_path, bridge=bridge,
        started_mono_ns=time.monotonic_ns(),
    )


def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
    """SIGINT tcpdump (the Right Signal — flushes buffers + exits 0).
    Returns the process exit code."""
    proc = handle.proc
    if proc.poll() is None:
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # tcpdump ignored SIGINT within the grace period: force it down.
            proc.kill()
            proc.wait(timeout=timeout_s)
    return proc.returncode


# ---------------------------------------------------------------------------
# Pure-Python pcap parser
# ---------------------------------------------------------------------------

def _iter_pcap(path: Path):
    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
    file. Tolerates either microsecond or nanosecond magics."""
    with path.open("rb") as f:
        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
            return
        magic = struct.unpack("<I", hdr[:4])[0]
        if magic == _PCAP_MAGIC_USEC:
            sub_mult = 1000  # us → ns
        elif magic == _PCAP_MAGIC_NSEC:
            sub_mult = 1
        else:
            log.warning("unknown pcap magic %#x in %s", magic, path)
            return
        while True:
            rec = f.read(_PCAP_REC_HDR_SIZE)
            if len(rec) < _PCAP_REC_HDR_SIZE:
                return
            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
            data = f.read(caplen)
            if len(data) < caplen:
                return
            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
            yield t_ns, data


def _decode(frame: bytes) -> dict:
    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
    protocols return only the ethertype + lengths."""
    out: dict = {"size": len(frame)}
    if len(frame) < 14:
        return out
    ethertype = struct.unpack(">H", frame[12:14])[0]
    out["ethertype"] = ethertype
    if ethertype != 0x0800:  # not IPv4 — count, don't decode further
        return out
    ip = frame[14:]
    if len(ip) < 20:
        return out
    ihl = (ip[0] & 0x0F) * 4
    if ihl < 20 or len(ip) < ihl:
        return out
    proto = ip[9]
    src = ip[12:16]
    dst = ip[16:20]
    out["ip_proto"] = proto
    out["src_ip"] = ".".join(str(b) for b in src)
    out["dst_ip"] = ".".join(str(b) for b in dst)
    payload = ip[ihl:]
    if proto == 6 and len(payload) >= 20:  # TCP
        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
        flags = off_flags & 0x003F
        out["src_port"] = sport
        out["dst_port"] = dport
        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
    elif proto == 17 and len(payload) >= 8:  # UDP
        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
        out["src_port"] = sport
        out["dst_port"] = dport
    return out


def bucketize(
    pcap_path: Path,
    netflow_path: Path,
    *,
    bucket_ms: int = 100,
    t_mono_origin_ns: int = 0,
    bridge_ip: str = "10.200.0.1",
) -> int:
    """Read a pcap and emit one row per ``bucket_ms`` window into
    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
    perspective (host = ``bridge_ip``):

      out = packet whose src is the host-side address (host → guest)
      in  = anything else seen on the bridge (guest → host or
            guest-to-guest)

    Pcap record timestamps are wall-clock (epoch) values, so
    ``t_mono_origin_ns`` has to be expressed on that same clock for the
    emitted ``t_mono_ns`` to be meaningful.

    Returns the number of rows written."""
    if not pcap_path.exists():
        return 0
    bucket_ns = bucket_ms * 1_000_000
    netflow_path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    bucket_start: int | None = None
    agg: dict = _empty_bucket()
    with netflow_path.open("a", buffering=1) as out:
        for t_pkt_ns, frame in _iter_pcap(pcap_path):
            d = _decode(frame)
            # Establish first bucket origin on first packet.
            if bucket_start is None:
                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
            # Flush every window that ended before this packet, including
            # empty ones, so the output has no gaps in time.
            while t_pkt_ns >= bucket_start + bucket_ns:
                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
                rows += 1
                agg = _empty_bucket()
                bucket_start += bucket_ns
            _accumulate(agg, d, bridge_ip)
        # Flush the final, partially filled window.
        if bucket_start is not None and any(agg.values()):
            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
            rows += 1
    return rows


def _empty_bucket() -> dict:
    return {
        "pkts_in": 0, "pkts_out": 0,
        "bytes_in": 0, "bytes_out": 0,
        "syn_count": 0, "fin_count": 0, "rst_count": 0,
        "udp_count": 0, "tcp_count": 0,
        "dns_query_count": 0,
        "dst_ips": set(), "dst_ports": set(),
        "tcp_new_flows": 0,
    }


def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
    sz = d.get("size", 0)
    is_out = d.get("src_ip") == bridge_ip
    if is_out:
        agg["pkts_out"] += 1
        agg["bytes_out"] += sz
    else:
        agg["pkts_in"] += 1
        agg["bytes_in"] += sz
    proto = d.get("ip_proto")
    if proto == 6:
        agg["tcp_count"] += 1
        flags = d.get("tcp_flags", 0)
        if flags & 0x02:  # SYN
            agg["syn_count"] += 1
            if not (flags & 0x10):  # SYN without ACK = new flow
                agg["tcp_new_flows"] += 1
        if flags & 0x01:
            agg["fin_count"] += 1
        if flags & 0x04:
            agg["rst_count"] += 1
    elif proto == 17:
        agg["udp_count"] += 1
        if d.get("dst_port") == 53:
            agg["dns_query_count"] += 1
    dst = d.get("dst_ip")
    if dst:
        agg["dst_ips"].add(dst)
    dport = d.get("dst_port")
    if dport is not None:
        agg["dst_ports"].add(dport)


def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
    row = {
        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
        "t_wall_ns": bucket_start_ns,
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "bucket_ms": bucket_ns // 1_000_000,
        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
        "syn_count": agg["syn_count"],
        "fin_count": agg["fin_count"],
        "rst_count": agg["rst_count"],
        "udp_count": agg["udp_count"],
        "tcp_count": agg["tcp_count"],
        "dns_query_count": agg["dns_query_count"],
        "unique_dst_ips": len(agg["dst_ips"]),
        "unique_dst_ports": len(agg["dst_ports"]),
        "tcp_new_flows": agg["tcp_new_flows"],
    }
    out.write(json.dumps(row) + "\n")
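
To make the on-disk layout the parser relies on concrete, the sketch below builds a tiny classic pcap in memory and reads it back, using the same `<IHHiIII` global header and `<IIII` record header formats as above. `build_pcap` and `read_records` are hypothetical helpers written for this illustration (the real test suite may construct its synthetic pcap differently), and only the little-endian microsecond variant is handled.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4
LINKTYPE_ETHERNET = 1


def build_pcap(packets: list[tuple[int, bytes]], snaplen: int = 65535) -> bytes:
    """Serialize (timestamp_ns, frame) pairs as a microsecond-resolution pcap."""
    out = [struct.pack("<IHHiIII", PCAP_MAGIC_USEC, 2, 4, 0, 0, snaplen, LINKTYPE_ETHERNET)]
    for t_ns, frame in packets:
        sec, rem_ns = divmod(t_ns, 1_000_000_000)
        # Record header: ts_sec, ts_usec, captured length, original length.
        out.append(struct.pack("<IIII", sec, rem_ns // 1000, len(frame), len(frame)))
        out.append(frame)
    return b"".join(out)


def read_records(blob: bytes) -> list[tuple[int, bytes]]:
    """Parse the blob back into (timestamp_ns, frame) pairs."""
    if len(blob) < 24 or struct.unpack_from("<I", blob, 0)[0] != PCAP_MAGIC_USEC:
        raise ValueError("not a little-endian microsecond pcap")
    pos = 24  # skip the 24-byte global header
    records: list[tuple[int, bytes]] = []
    while pos + 16 <= len(blob):
        sec, usec, caplen, _orig_len = struct.unpack_from("<IIII", blob, pos)
        pos += 16
        frame = blob[pos:pos + caplen]
        if len(frame) < caplen:
            break  # truncated final record
        pos += caplen
        records.append((sec * 1_000_000_000 + usec * 1000, frame))
    return records
```

Timestamps round-trip exactly only when they are whole microseconds, which is the resolution this variant of the format stores.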

collectors/qmp.py (new file, 244 lines)

@@ -0,0 +1,244 @@
"""Source 2 (oracle): QEMU QMP sampler.

Connects to the QEMU monitor protocol socket exposed by the launcher
($RUN_DIR/qmp.sock) and periodically queries the hypervisor for
per-VM stats that don't show up in /proc/<qemu_pid>:
  - per-disk block I/O (rd_bytes, wr_bytes, rd_ops, wr_ops)
  - VM run state (running / paused / shutdown)
  - per-netdev tx/rx counters (when available)
  - KVM stat counters (when available; introspection differs by qemu
    version, so anything we can't read is skipped silently)

This source is **oracle-only**: it does not exist on a deployed
device. Every row carries ``available_in_deployment: false``.

Wire format: QMP is line-delimited JSON. The handshake is fixed:
    server  {"QMP": {capabilities: [...], version: ...}}
    client  {"execute": "qmp_capabilities"}
    server  {"return": {}}
    (client may now issue commands)

We use a dedicated synchronous client because QMP is request/response
and we don't need pipelining; one query batch per tick keeps the
on-disk schema simple.
"""
from __future__ import annotations

import json
import logging
import socket
import threading
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any

log = logging.getLogger("cis490.collectors.qmp")

SOURCE = "host_qmp"
AVAILABLE_IN_DEPLOYMENT = False


class QMPError(RuntimeError):
    pass


@dataclass
class _SockReader:
    sock: socket.socket
    buf: bytes = b""

    def read_line(self, timeout_s: float = 5.0) -> str:
        deadline = time.monotonic() + timeout_s
        while b"\n" not in self.buf:
            self.sock.settimeout(max(0.1, deadline - time.monotonic()))
            try:
                chunk = self.sock.recv(8192)
            except socket.timeout as e:
                raise QMPError(f"QMP read timed out: {e}") from e
            if not chunk:
                raise QMPError("QMP connection closed by peer")
            self.buf += chunk
        line, _, rest = self.buf.partition(b"\n")
        self.buf = rest
        return line.decode("utf-8", errors="replace")


class QMPClient:
    """Tiny synchronous QMP client over a unix socket."""

    def __init__(self, socket_path: str | Path) -> None:
        self.path = str(socket_path)
        self._sock: socket.socket | None = None
        self._reader: _SockReader | None = None

    def connect(self, timeout_s: float = 5.0) -> dict[str, Any]:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(timeout_s)
        s.connect(self.path)
        self._sock = s
        self._reader = _SockReader(s)
        # Read greeting.
        try:
            greeting = json.loads(self._reader.read_line(timeout_s=timeout_s))
        except ValueError as e:
            raise QMPError(f"malformed QMP greeting: {e}") from e
        if "QMP" not in greeting:
            raise QMPError(f"unexpected QMP greeting: {greeting!r}")
        # Negotiate capabilities (no flags requested).
        self.execute("qmp_capabilities")
        return greeting["QMP"]

    def execute(self, command: str, **arguments: Any) -> Any:
        if self._sock is None or self._reader is None:
            raise QMPError("not connected")
        msg: dict[str, Any] = {"execute": command}
        if arguments:
            msg["arguments"] = arguments
        body = (json.dumps(msg) + "\n").encode("utf-8")
        self._sock.sendall(body)
        # QMP can interleave async events with the response — drain
        # until we see the matching {"return": ...} or {"error": ...}.
        for _ in range(64):  # bounded to avoid an infinite loop on bugs
            line = self._reader.read_line()
            if not line.strip():
                continue
            try:
                resp = json.loads(line)
            except ValueError as e:
                # Surface garbage on the wire as QMPError so callers that
                # only catch (QMPError, OSError) are not killed by it.
                raise QMPError(f"{command}: malformed QMP reply: {e}") from e
            if "return" in resp:
                return resp["return"]
            if "error" in resp:
                raise QMPError(f"{command}: {resp['error']}")
            # Otherwise it's an async event; ignore and keep reading.
        raise QMPError(f"{command}: too many async events without a response")

    def close(self) -> None:
        if self._sock is not None:
            try:
                self._sock.close()
            except OSError:
                pass
        self._sock = None
        self._reader = None


# ---- row builders ----------------------------------------------------------

def _flatten_blockstats(blockstats: list[dict] | None) -> dict[str, dict[str, int]]:
    """Compact ``query-blockstats`` to ``{device: {rd_ops, wr_ops, ...}}``."""
    out: dict[str, dict[str, int]] = {}
    for entry in blockstats or []:
        name = entry.get("device") or entry.get("qdev") or "unknown"
        s = entry.get("stats") or {}
        out[name] = {
            "rd_ops": int(s.get("rd_operations", 0)),
            "wr_ops": int(s.get("wr_operations", 0)),
            "rd_bytes": int(s.get("rd_bytes", 0)),
            "wr_bytes": int(s.get("wr_bytes", 0)),
            "flush_ops": int(s.get("flush_operations", 0)),
        }
    return out


def collect_once(client: QMPClient, t_mono_origin_ns: int) -> dict[str, Any]:
    row: dict[str, Any] = {
        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
        "t_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
    }
    # query-status is dirt cheap and tells us whether the guest is
    # paused (rare) or running. It is a basic command that does not
    # depend on newer QEMU features, so a failure here is treated as an
    # unhealthy connection and allowed to propagate: run_loop() counts
    # it towards its consecutive-failure limit instead of writing
    # timestamp-only rows forever against a hung hypervisor.
    status = client.execute("query-status")
    row["vm_status"] = status.get("status")
    row["vm_running"] = bool(status.get("running"))
    try:
        bs = client.execute("query-blockstats")
        row["blockstats"] = _flatten_blockstats(bs)
    except QMPError as e:
        log.debug("query-blockstats failed: %s", e)
    # query-stats is QEMU 7.1+ and the schema varies across versions.
    # We only ask for KVM stats and tolerate any subset of fields.
    try:
        stats = client.execute("query-stats", target="vm")
        row["kvm_stats"] = _summarize_query_stats(stats)
    except QMPError as e:
        log.debug("query-stats not supported: %s", e)
    return row


def _summarize_query_stats(stats_resp: list[dict] | dict) -> dict[str, int]:
    """Reduce ``query-stats`` to a flat name→value map of integer
    counters. The full payload is verbose and version-specific; we only
    ever want individual scalar counters downstream."""
    flat: dict[str, int] = {}
    items = stats_resp if isinstance(stats_resp, list) else [stats_resp]
    for entry in items:
        for s in entry.get("stats", []) or []:
            name = s.get("name")
            value = s.get("value")
            if isinstance(name, str) and isinstance(value, int):
                flat[name] = value
    return flat


# ---- run loop --------------------------------------------------------------

def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    interval_ms: int,
    stop_event: threading.Event,
) -> int:
    """Connect to ``socket_path`` and sample at ``interval_ms`` until
    ``stop_event``. Returns the number of rows written.

    A single missed sample (transient QMP error) is logged and skipped;
    repeated failures terminate the loop so the episode finishes cleanly
    rather than hanging on a dead hypervisor."""
    interval_ns = interval_ms * 1_000_000
    client = QMPClient(socket_path)
    try:
        client.connect(timeout_s=5.0)
    except (OSError, QMPError) as e:
        log.warning("QMP connect to %s failed: %s — collector exits cleanly", socket_path, e)
        return 0
    rows = 0
    consecutive_failures = 0
    next_tick = time.monotonic_ns()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    row = collect_once(client, t_mono_origin_ns)
                    f.write(json.dumps(row) + "\n")
                    rows += 1
                    consecutive_failures = 0
                except (QMPError, OSError) as e:
                    consecutive_failures += 1
                    log.warning("QMP sample %d failed: %s", rows, e)
                    if consecutive_failures >= 5:
                        log.warning("5 consecutive QMP failures; bailing")
                        break
                next_tick += interval_ns
                sleep_ns = next_tick - time.monotonic_ns()
                if sleep_ns > 0:
                    stop_event.wait(sleep_ns / 1_000_000_000)
                else:
                    next_tick = time.monotonic_ns()
    finally:
        client.close()
    return rows
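
The handshake and event-skipping behaviour described in the module docstring can be tried without QEMU by talking to a stand-in server over a socket pair. Everything below is a hypothetical sketch: `fake_qmp_server`, `execute` and `demo` are illustration-only names, the fake server answers exactly two commands, and this is not the fake server used in tests/test_qmp.py.

```python
import json
import socket
import threading


def fake_qmp_server(conn: socket.socket) -> None:
    """Stand-in for QEMU: greet, then answer qmp_capabilities and query-status."""
    f = conn.makefile("rwb")
    f.write(b'{"QMP": {"version": {}, "capabilities": []}}\n')
    f.flush()
    json.loads(f.readline())  # expect {"execute": "qmp_capabilities"}
    f.write(b'{"return": {}}\n')
    f.flush()
    json.loads(f.readline())  # expect {"execute": "query-status"}
    # An async event may arrive before the reply; clients must skip it.
    f.write(b'{"event": "RESUME", "timestamp": {"seconds": 0, "microseconds": 0}}\n')
    f.write(b'{"return": {"status": "running", "running": true}}\n')
    f.flush()
    f.close()
    conn.close()


def execute(f, command: str) -> dict:
    """Send one command and return its reply, ignoring interleaved events."""
    f.write((json.dumps({"execute": command}) + "\n").encode("utf-8"))
    f.flush()
    while True:
        line = f.readline()
        if not line:
            raise ConnectionError("peer closed the QMP socket")
        resp = json.loads(line)
        if "return" in resp:
            return resp["return"]
        if "error" in resp:
            raise RuntimeError(resp["error"])
        # Anything else is an async event: keep reading.


def demo() -> dict:
    client, server = socket.socketpair()
    t = threading.Thread(target=fake_qmp_server, args=(server,), daemon=True)
    t.start()
    f = client.makefile("rwb")
    greeting = json.loads(f.readline())
    if "QMP" not in greeting:
        raise RuntimeError("unexpected greeting")
    execute(f, "qmp_capabilities")
    status = execute(f, "query-status")
    f.close()
    client.close()
    t.join(timeout=2.0)
    return status
```

Unlike the collector's client, this sketch does not bound the number of events it will skip; the 64-line cap in `QMPClient.execute` is what prevents an endless read on a misbehaving peer.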


@@ -1,8 +1,8 @@
 [Unit]
-Description=CIS490 lab-host episode orchestrator (queue mode)
+Description=CIS490 lab-host episode orchestrator (fleet mode)
 Documentation=https://maxgit.wg/spectral/CIS490
-# Episodes need KVM and (for Tier 3+) msfrpcd up. msfrpcd is brought
-# up out-of-band; this unit only requires the kernel + WG.
+# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
+# by cis490-msfrpcd.service when installed.
 After=network-online.target wg-quick@wg0.service
 Wants=network-online.target
@@ -11,13 +11,18 @@ Type=simple
 User=cis490
 Group=cis490
 WorkingDirectory=/opt/cis490
-# Queue mode is currently a TODO — the binary takes a job-spec file
-# and runs episodes in a loop. Until that lands, this unit stays
-# disabled by default; lab-host operators kick off episodes by hand
-# via tools/run_*.py and let the shipper pick them up.
-ExecStart=/opt/cis490/.venv/bin/python -m orchestrator --queue /var/lib/cis490/data/queue
-Restart=on-failure
-RestartSec=10
 EnvironmentFile=-/etc/cis490/lab-host.toml.env
+# Fleet mode: detect host capacity, run that many concurrent episodes
+# per wave with samples drawn from the manifest. Each invocation runs
+# one wave and exits; systemd respawns per Restart= below, giving us
+# a continuous stream of fresh-sample episodes per host. The shipper
+# picks them up as `done.marker` files appear.
+ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
+    --data-root /var/lib/cis490/data \
+    --manifest /opt/cis490/samples/manifest.toml \
+    --waves 1
+Restart=always
+RestartSec=15
 # Hardening
 NoNewPrivileges=true


@@ -36,7 +36,7 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Callable
-from collectors import proc_qemu
+from collectors import guest_agent, proc_qemu, qmp
 from .ulid import new_ulid
@@ -61,6 +61,11 @@ class EpisodeConfig:
     # When set, walk this schedule and ignore duration_s for sleep timing.
     # ``duration_s`` still goes in meta.schedule for record-keeping.
     phase_schedule: PhaseSchedule | None = None
+    # Optional: paths to QEMU sockets exposed by the launcher. When
+    # set, EpisodeRunner spins up additional collector threads.
+    qmp_socket: Path | None = None
+    qmp_interval_ms: int = 1000  # QMP queries are heavier than /proc reads
+    guest_agent_socket: Path | None = None
 @dataclass
@@ -68,8 +73,10 @@ class EpisodeResult:
     episode_id: str
     episode_dir: Path
     rows_proc: int
-    pid_disappeared: bool
-    duration_observed_s: float
+    rows_qmp: int = 0
+    rows_guest: int = 0
+    pid_disappeared: bool = False
+    duration_observed_s: float = 0.0
     phases_observed: list[str] = field(default_factory=list)
@@ -102,10 +109,10 @@ class EpisodeRunner:
         self.emit_event("snapshot_load", snapshot=self.cfg.snapshot_name)
-        rows_holder: dict[str, int] = {"rows": 0}
+        rows_holder: dict[str, int] = {"proc": 0, "qmp": 0, "guest": 0}
-        def _collector() -> None:
-            rows_holder["rows"] = proc_qemu.run_loop(
+        def _proc_collector() -> None:
+            rows_holder["proc"] = proc_qemu.run_loop(
                 pid=self.cfg.target_pid,
                 output_path=self.episode_dir / "telemetry-proc.jsonl",
                 t_mono_origin_ns=self._t_mono_origin_ns,
@@ -113,8 +120,33 @@ class EpisodeRunner:
                 stop_event=self._stop,
             )
-        t = threading.Thread(target=_collector, daemon=True, name="proc_qemu")
-        t.start()
+        def _qmp_collector() -> None:
+            assert self.cfg.qmp_socket is not None
+            rows_holder["qmp"] = qmp.run_loop(
+                socket_path=self.cfg.qmp_socket,
+                output_path=self.episode_dir / "telemetry-qmp.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.qmp_interval_ms,
+                stop_event=self._stop,
+            )
+        def _guest_collector() -> None:
+            assert self.cfg.guest_agent_socket is not None
+            rows_holder["guest"] = guest_agent.run_loop(
+                socket_path=self.cfg.guest_agent_socket,
+                output_path=self.episode_dir / "telemetry-guest.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                stop_event=self._stop,
+            )
+        threads: list[threading.Thread] = []
+        threads.append(threading.Thread(target=_proc_collector, daemon=True, name="proc_qemu"))
+        if self.cfg.qmp_socket is not None:
+            threads.append(threading.Thread(target=_qmp_collector, daemon=True, name="qmp"))
+        if self.cfg.guest_agent_socket is not None:
+            threads.append(threading.Thread(target=_guest_collector, daemon=True, name="guest_agent"))
+        for t in threads:
+            t.start()
         phases_observed: list[str] = []
         try:
@@ -126,7 +158,8 @@ class EpisodeRunner:
                 self._stop.wait(timeout=self.cfg.duration_s)
         finally:
             self._stop.set()
-            t.join(timeout=2.0)
+            for t in threads:
+                t.join(timeout=3.0)
         pid_alive = _pid_alive(self.cfg.target_pid)
         self.emit_event("episode_end", target_pid_alive=pid_alive)
@@ -135,7 +168,9 @@ class EpisodeRunner:
         meta["ended_at_wall"] = datetime.now(timezone.utc).isoformat()
         meta["result"] = {
             "phases_observed": phases_observed,
-            "rows_proc": rows_holder["rows"],
+            "rows_proc": rows_holder["proc"],
+            "rows_qmp": rows_holder["qmp"],
+            "rows_guest": rows_holder["guest"],
             "pid_alive_at_end": pid_alive,
             "duration_observed_s": end_mono_ns / 1_000_000_000,
         }
@@ -143,16 +178,18 @@ class EpisodeRunner:
         (self.episode_dir / "done.marker").touch()
         log.info(
-            "episode %s complete: rows=%d duration=%.2fs phases=%s",
+            "episode %s complete: proc=%d qmp=%d guest=%d duration=%.2fs phases=%s",
             self.episode_id,
-            rows_holder["rows"],
+            rows_holder["proc"], rows_holder["qmp"], rows_holder["guest"],
             end_mono_ns / 1e9,
             phases_observed,
         )
         return EpisodeResult(
             episode_id=self.episode_id,
             episode_dir=self.episode_dir,
-            rows_proc=rows_holder["rows"],
+            rows_proc=rows_holder["proc"],
+            rows_qmp=rows_holder["qmp"],
+            rows_guest=rows_holder["guest"],
             pid_disappeared=not pid_alive,
             duration_observed_s=end_mono_ns / 1_000_000_000,
             phases_observed=phases_observed,

orchestrator/fleet.py (new file, 362 lines)

@ -0,0 +1,362 @@
"""Fleet runner — concurrent VM episodes with resource awareness.
The lab host detects its own capacity, picks how many VMs to run in
parallel without driving the box into swap or starving the host
itself, and runs that many episodes simultaneously. Each slot gets a
distinct ``Sample`` from the manifest (deterministically chosen by
host_id + slot index), so every concurrent VM produces novel,
labelable data.
Capacity heuristic defaults documented inline so they're auditable:
cores_total = os.cpu_count()
cores_reserved = max(1, cores_total // 8) # host + collectors
ram_per_vm_mib = 320 # Alpine fits in 256
# but leave 64 for
# overhead (qemu+ovmf)
ram_headroom_mib = max(1024, ram_total // 8) # never starve host
max_by_cores = cores_total - cores_reserved
max_by_ram = (ram_available - ram_headroom) // ram_per_vm
max_by_load = max(1, max_by_cores // 2) if load_1m / cores_total > 0.75 else max_by_cores
The smallest of these wins. The reasoning string is logged + saved
into each episode's meta.json under ``fleet`` so post-hoc analysis
can correlate "this episode was run when 6 VMs were concurrent" with
its observed envelope.
"""
from __future__ import annotations
import logging
import os
import shutil
import signal
import subprocess
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from pathlib import Path
from samples.manifest import Sample, SampleManifest
log = logging.getLogger("cis490.fleet")
@dataclass(frozen=True)
class FleetCapacity:
cores_total: int
cores_reserved: int
ram_total_mib: int
ram_available_mib: int
ram_per_vm_mib: int
ram_headroom_mib: int
load_1m: float
max_by_cores: int
max_by_ram: int
max_by_load: int
max_concurrent: int
rationale: str
def to_dict(self) -> dict:
return {
"cores_total": self.cores_total,
"cores_reserved": self.cores_reserved,
"ram_total_mib": self.ram_total_mib,
"ram_available_mib": self.ram_available_mib,
"ram_per_vm_mib": self.ram_per_vm_mib,
"ram_headroom_mib": self.ram_headroom_mib,
"load_1m": self.load_1m,
"max_by_cores": self.max_by_cores,
"max_by_ram": self.max_by_ram,
"max_by_load": self.max_by_load,
"max_concurrent": self.max_concurrent,
"rationale": self.rationale,
}
@dataclass
class FleetConfig:
host_id: str
repo_root: Path
data_root: Path
manifest: SampleManifest
# VM resource shape — must match what the launcher requests.
ram_per_vm_mib: int = 320
# Cap concurrency below the calculated max (e.g. for a smoke test).
max_concurrent_override: int | None = None
# Skip episodes whose sample requires a real binary that's not present.
require_real_samples: bool = False
def _read_meminfo() -> dict[str, int]:
out: dict[str, int] = {}
try:
with open("/proc/meminfo") as f:
for line in f:
k, _, rest = line.partition(":")
v = rest.strip()
if v.endswith(" kB"):
try:
out[k] = int(v[:-3]) * 1024
except ValueError:
pass
except OSError:
pass
return out
def _read_loadavg() -> float:
try:
with open("/proc/loadavg") as f:
return float(f.read().split()[0])
except (OSError, ValueError, IndexError):
return 0.0
def detect_capacity(*, ram_per_vm_mib: int = 320) -> FleetCapacity:
cores_total = os.cpu_count() or 1
# Reserve at least 1 core, more if the host has many.
cores_reserved = max(1, cores_total // 8)
mem = _read_meminfo()
ram_total_b = mem.get("MemTotal", 0)
ram_avail_b = mem.get("MemAvailable", ram_total_b)
ram_total_mib = ram_total_b // (1024 * 1024)
ram_available_mib = ram_avail_b // (1024 * 1024)
# Keep headroom for the host: at least 1 GiB, or 1/8 of total RAM if that is larger.
ram_headroom_mib = max(1024, ram_total_mib // 8)
load_1m = _read_loadavg()
max_by_cores = max(0, cores_total - cores_reserved)
if ram_per_vm_mib <= 0:
max_by_ram = max_by_cores
else:
max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
# Load-based cap: if the host is already busy, run fewer VMs.
if cores_total and load_1m / cores_total > 0.75:
# Halve, floor 1.
max_by_load = max(1, max_by_cores // 2)
else:
max_by_load = max_by_cores
candidates = [max_by_cores, max_by_ram, max_by_load]
max_concurrent = max(0, min(candidates))
binding = ["cores", "ram", "load"][candidates.index(max_concurrent)] \
if max_concurrent < max_by_cores else "cores"
rationale = (
f"cores_total={cores_total} reserved={cores_reserved} "
f"ram_avail_mib={ram_available_mib} headroom={ram_headroom_mib} "
f"per_vm={ram_per_vm_mib} load_1m={load_1m:.2f} "
f"-> max_concurrent={max_concurrent} (binding={binding})"
)
log.info("capacity: %s", rationale)
return FleetCapacity(
cores_total=cores_total,
cores_reserved=cores_reserved,
ram_total_mib=ram_total_mib,
ram_available_mib=ram_available_mib,
ram_per_vm_mib=ram_per_vm_mib,
ram_headroom_mib=ram_headroom_mib,
load_1m=load_1m,
max_by_cores=max_by_cores,
max_by_ram=max_by_ram,
max_by_load=max_by_load,
max_concurrent=max_concurrent,
rationale=rationale,
)
# ---------------------------------------------------------------------------
# Per-slot episode execution
# ---------------------------------------------------------------------------
@dataclass
class SlotResult:
slot: int
sample_name: str
sample_kind: str
episode_id: str | None
rc: int
duration_s: float
error: str | None = None
extra: dict = field(default_factory=dict)
def _run_slot(
cfg: FleetConfig,
slot: int,
sample: Sample,
episode_index: int,
capacity: FleetCapacity,
) -> SlotResult:
"""Run one Tier-2-shaped episode in a dedicated slot.
For now the per-slot driver shells out to ``tools/run_real_vm_demo.py``
with SLOT, RUN_DIR and SAMPLE_PROFILE set in the environment, so each
slot gets its own RUN_DIR and the load mimic varies by sample. When the
Tier-3/4 paths land, add a sample-kind dispatch here."""
env = os.environ.copy()
env["SLOT"] = str(slot)
env["RUN_DIR"] = f"/tmp/cis490-vm-fleet-{slot}"
env["SAMPLE_NAME"] = sample.name
env["SAMPLE_PROFILE"] = sample.profile
env["SAMPLE_KIND"] = sample.kind
env["FLEET_HOST_ID"] = cfg.host_id
env["FLEET_EPISODE_INDEX"] = str(episode_index)
env["FLEET_MAX_CONCURRENT"] = str(capacity.max_concurrent)
log_dir = cfg.data_root / "fleet-logs"
log_dir.mkdir(parents=True, exist_ok=True)
out_log = log_dir / f"slot-{slot}-ep-{episode_index}.log"
started = time.monotonic()
try:
with out_log.open("ab") as logf:
proc = subprocess.run(
[
"/usr/bin/env", "python3",
str(cfg.repo_root / "tools" / "run_real_vm_demo.py"),
"--data-root", str(cfg.data_root),
],
cwd=str(cfg.repo_root),
env=env,
stdout=logf,
stderr=subprocess.STDOUT,
check=False,
)
rc = proc.returncode
err = None
except (OSError, subprocess.SubprocessError) as e:
rc = -1
err = str(e)
duration = time.monotonic() - started
return SlotResult(
slot=slot,
sample_name=sample.name,
sample_kind=sample.kind,
episode_id=None, # parsed from the log later by the driver
rc=rc,
duration_s=duration,
error=err,
)
# ---------------------------------------------------------------------------
# FleetRunner
# ---------------------------------------------------------------------------
@dataclass
class FleetRunResult:
capacity: FleetCapacity
slots: list[SlotResult]
total_duration_s: float
class FleetRunner:
def __init__(self, cfg: FleetConfig) -> None:
self.cfg = cfg
self._stop = threading.Event()
def stop(self) -> None:
self._stop.set()
def run(
self,
*,
episodes: int = 1,
episode_index_base: int = 0,
capacity_override: FleetCapacity | None = None,
) -> FleetRunResult:
capacity = capacity_override or detect_capacity(
ram_per_vm_mib=self.cfg.ram_per_vm_mib,
)
n_slots = capacity.max_concurrent
if self.cfg.max_concurrent_override is not None:
n_slots = min(n_slots, self.cfg.max_concurrent_override)
if n_slots <= 0:
log.warning(
"fleet capacity is zero (%s); cannot run", capacity.rationale,
)
return FleetRunResult(
capacity=capacity, slots=[], total_duration_s=0.0,
)
log.info(
"fleet host=%s slots=%d episodes=%d manifest_size=%d",
self.cfg.host_id, n_slots, episodes, len(self.cfg.manifest),
)
all_results: list[SlotResult] = []
t_start = time.monotonic()
for ep in range(episodes):
if self._stop.is_set():
break
episode_index = episode_index_base + ep
slot_samples = [
self.cfg.manifest.select(
host_id=self.cfg.host_id,
slot=slot,
episode_index=episode_index,
)
for slot in range(n_slots)
]
if self.cfg.require_real_samples:
slot_samples = [s for s in slot_samples if s.kind == "real"]
if not slot_samples:
log.warning("require_real_samples: no real samples in manifest; skipping wave")
continue
log.info(
"wave %d/%d: %s",
ep + 1, episodes,
[(i, s.name, s.kind) for i, s in enumerate(slot_samples)],
)
with ThreadPoolExecutor(max_workers=n_slots) as pool:
futures = [
pool.submit(
_run_slot, self.cfg, slot, sample, episode_index, capacity,
)
for slot, sample in enumerate(slot_samples)
]
for fut in as_completed(futures):
res = fut.result()
log.info(
"slot %d sample=%s rc=%d duration=%.1fs",
res.slot, res.sample_name, res.rc, res.duration_s,
)
all_results.append(res)
total = time.monotonic() - t_start
return FleetRunResult(
capacity=capacity,
slots=all_results,
total_duration_s=total,
)
# ---------------------------------------------------------------------------
# Friendly capacity report (used by tools/run_fleet.py --capacity)
# ---------------------------------------------------------------------------
def capacity_report() -> str:
c = detect_capacity()
return (
f"cores: {c.cores_total} (reserve {c.cores_reserved})\n"
f"ram: {c.ram_total_mib} MiB total, {c.ram_available_mib} MiB available "
f"(headroom {c.ram_headroom_mib} MiB, per-vm {c.ram_per_vm_mib} MiB)\n"
f"load: 1m={c.load_1m:.2f}\n"
f"caps: by_cores={c.max_by_cores}, by_ram={c.max_by_ram}, "
f"by_load={c.max_by_load}\n"
f"--> max_concurrent VMs: {c.max_concurrent}\n"
)
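
Worked example for the capacity heuristic above. This is a minimal standalone restatement of the arithmetic in detect_capacity(), not the module itself: the name `sketch_max_concurrent` is hypothetical, inputs are passed in directly instead of being read from /proc, and it assumes cores_total >= 1 and ram_per_vm_mib > 0. The three scenarios reuse the numbers from tests/test_fleet.py.

```python
def sketch_max_concurrent(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> int:
    """Illustrative restatement of the fleet capacity math (hypothetical helper)."""
    # Reserve one core per eight, never fewer than one.
    cores_reserved = max(1, cores_total // 8)
    # Keep at least 1 GiB, or 1/8 of total RAM, free for the host.
    ram_headroom_mib = max(1024, ram_total_mib // 8)

    max_by_cores = max(0, cores_total - cores_reserved)
    max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)

    # A busy host (1-minute load above 75% of its cores) gets half the core cap.
    if load_1m / cores_total > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    # The smallest cap wins.
    return min(max_by_cores, max_by_ram, max_by_load)


print(sketch_max_concurrent(8, 16384, 14000, 0.0))  # idle 8-core box: 7 (cores bind)
print(sketch_max_concurrent(8, 4096, 2048, 0.0))    # low RAM: 3 (ram binds)
print(sketch_max_concurrent(8, 16384, 14000, 7.0))  # high load: 3 (load binds)
```

Note that a 1-core host yields 0 even under load, because the core cap is already 0 and the minimum is taken last.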

0
samples/__init__.py Normal file

105
samples/manifest.py Normal file

@@ -0,0 +1,105 @@
"""Sample manifest loader + per-(host, slot) deterministic selection.
The manifest at ``samples/manifest.toml`` defines the catalog of
samples (real or mimic) the fleet draws from. Selection is
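**deterministic** given ``(host_id, slot, episode_index)``. Because the
index comes from a hash, two lab hosts on the same fleet usually (not
always) pick *different* samples for the same slot index, and a single
host covers the whole catalog over enough episodes, though it may repeat
a sample before it has seen all of them.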
This gives us "all hosts on the network generating novel data" without
needing a coordinator: every host's `host_id` seeds its own
sample-rotation order, and the orderings spread across the catalog.
"""
from __future__ import annotations
import hashlib
import tomllib
from dataclasses import dataclass, field
from pathlib import Path
_VALID_CATEGORIES = {
"cryptominer", "botnet", "ransomware", "banking-trojan",
"fileless", "rat", "worm", "loader", "wiper", "other",
}
@dataclass(frozen=True)
class Sample:
name: str
family: str
category: str
profile: str
description: str = ""
source: str | None = None
sha256: str | None = None
url: str | None = None
@property
def kind(self) -> str:
"""``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
Trainers filter on this so the realistic-model pipeline only
consumes real-malware episodes."""
return "real" if self.sha256 else "mimic"
@dataclass(frozen=True)
class SampleManifest:
samples: list[Sample] = field(default_factory=list)
def __len__(self) -> int:
return len(self.samples)
def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
"""Deterministic selection. The host_id mixes into the seed so
different hosts visit the catalog in different orders; slot +
episode_index tick within a host. Same inputs always give the
same sample, which keeps replays debuggable."""
if not self.samples:
raise ValueError("manifest is empty")
# The first 8 bytes of SHA-256(seed) give a well-mixed 64-bit integer.
seed = f"{host_id}|{slot}|{episode_index}".encode()
h = hashlib.sha256(seed).digest()
idx = int.from_bytes(h[:8], "big") % len(self.samples)
return self.samples[idx]
@classmethod
def load(cls, path: str | Path) -> "SampleManifest":
with open(path, "rb") as f:
data = tomllib.load(f)
raw = data.get("sample") or []
if not isinstance(raw, list):
raise ValueError(f"{path}: 'sample' must be an array of tables")
samples: list[Sample] = []
for i, entry in enumerate(raw):
if not isinstance(entry, dict):
raise ValueError(f"{path}: sample[{i}] is not a table")
for key in ("name", "family", "category", "profile"):
if not isinstance(entry.get(key), str) or not entry[key]:
raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
if entry["category"] not in _VALID_CATEGORIES:
raise ValueError(
f"{path}: sample[{i}] category {entry['category']!r} "
f"not in {sorted(_VALID_CATEGORIES)}"
)
samples.append(Sample(
name=entry["name"],
family=entry["family"],
category=entry["category"],
profile=entry["profile"],
description=entry.get("description", ""),
source=entry.get("source"),
sha256=entry.get("sha256"),
url=entry.get("url"),
))
# Reject duplicate names — trainers join on this.
seen: set[str] = set()
for s in samples:
if s.name in seen:
raise ValueError(f"{path}: duplicate sample name {s.name!r}")
seen.add(s.name)
return cls(samples=samples)
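
A small sketch of what the hash-based selection in select() implies in practice. It assumes only the seed format shown above; `pick_index` and `episodes_to_cover` are hypothetical helpers, not part of samples/manifest.py. Since each pick is effectively an independent draw, a host can repeat a sample before it has seen the whole catalog. For a 6-entry catalog the coupon-collector expectation is about 6 * (1 + 1/2 + ... + 1/6) ≈ 14.7 episodes to see every sample.

```python
import hashlib


def pick_index(host_id: str, slot: int, episode_index: int, n: int) -> int:
    """Same seed format and hash-mod mapping as SampleManifest.select()."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % n


def episodes_to_cover(host_id: str, n: int, slot: int = 0, limit: int = 10_000):
    """Count episodes until every index in range(n) has been picked once.

    Returns None if the catalog is not covered within `limit` episodes.
    """
    seen: set[int] = set()
    for ep in range(limit):
        seen.add(pick_index(host_id, slot, ep, n))
        if len(seen) == n:
            return ep + 1
    return None


if __name__ == "__main__":
    # Determinism: the same inputs always map to the same index.
    assert pick_index("lab-1", 2, 5, 6) == pick_index("lab-1", 2, 5, 6)
    print("episodes needed by lab-x:", episodes_to_cover("lab-x", 6))
```

If strict "no repeats until exhausted" behaviour is ever needed, a per-host shuffled permutation indexed by episode would be the usual alternative; the hash-mod approach trades that guarantee for statelessness.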

61
samples/manifest.toml Normal file

@@ -0,0 +1,61 @@
# Sample manifest — what each fleet slot picks from.
#
# Each entry has three things:
# - identity (name, family, category) for labeling
# - acquisition (source, sha256, url) for reproducibility
# - behaviour (profile) so the synthetic load mimic can run a
# reasonable proxy until the real sample lands in samples/store/
#
# When the real malware binary is present at samples/store/<sha256>,
# the orchestrator runs THAT inside the guest. When it's absent, the
# orchestrator falls back to running tools/load_mimic.py with the
# matching profile so the fleet still produces *labeled, varied* data
# while we collect the real samples. Either way, meta.json records
# which path the episode took, so trainers can filter on
# meta.sample.kind ∈ {real, mimic}.
[[sample]]
name = "xmrig-cryptominer"
family = "XMRig"
category = "cryptominer"
profile = "cpu-saturate"
# A real XMRig fetch goes here when MalwareBazaar pull is wired up:
# source = "MalwareBazaar"
# sha256 = "TBD"
# url = "https://bazaar.abuse.ch/sample/TBD/"
description = "Sustained 1-vCPU saturation, very low IO/net. Pure compute."
[[sample]]
name = "mirai-class-bot"
family = "Mirai"
category = "botnet"
profile = "scan-and-dial"
description = "SYN scans across the bridge IP space + periodic dial-home. High net, low CPU."
[[sample]]
name = "ransomware-mimic"
family = "Cryptolocker-class"
category = "ransomware"
profile = "io-walk"
description = "Heavy disk write + filesystem walk producing a per-file overwrite envelope."
[[sample]]
name = "dridex-class-trojan"
family = "Dridex"
category = "banking-trojan"
profile = "bursty-c2"
description = "Long idle, periodic short bursts of TCP egress to a fixed peer (C2 beacon shape)."
[[sample]]
name = "kovter-class-stealth"
family = "Kovter"
category = "fileless"
profile = "low-and-slow"
description = "Low CPU, periodic memory churn, no persistent on-disk artifacts. Hardest to label from /proc alone."
[[sample]]
name = "reverse-shell-resident"
family = "Reverse-Shell"
category = "rat"
profile = "shell-resident"
description = "Single TCP socket pinned to an attacker IP, occasional command bursts."


@@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Fetch + sha256-verify the Metasploitable2 disk image.
#
# Rapid7's official download is gated behind a registration form, so
# we take the URL + sha256 from env vars. Both are required; there are
# no defaults. The user runs this once per lab host.
#
# Inputs (env):
# IMAGE_URL — direct download URL for the metasploitable2 archive
# IMAGE_SHA256 — expected sha256 of the archive
# OUT_DIR — where to drop the qcow2 (default vm/images/)
#
# Outputs:
# $OUT_DIR/metasploitable2.qcow2 — converted from the original VMDK
# if needed.
#
# We do NOT bake an image url+hash into the repo because the canonical
# distribution is a registration-walled zip on Rapid7. Operators must
# supply both; the rest is mechanical.
set -euo pipefail
IMAGE_URL="${IMAGE_URL:-}"
IMAGE_SHA256="${IMAGE_SHA256:-}"
OUT_DIR="${OUT_DIR:-$(cd "$(dirname "$0")/.." && pwd)/vm/images}"
WORK_DIR="${WORK_DIR:-/tmp/cis490-metasploitable-fetch}"
log() { printf '[fetch-metasploitable2] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }
[[ -n "$IMAGE_URL" ]] || die "set IMAGE_URL to the Metasploitable2 download URL"
[[ -n "$IMAGE_SHA256" ]] || die "set IMAGE_SHA256 to the expected sha256 of the archive"
mkdir -p "$OUT_DIR" "$WORK_DIR"
ARCHIVE="$WORK_DIR/$(basename "$IMAGE_URL")"
log "downloading $IMAGE_URL → $ARCHIVE"
if [[ -f "$ARCHIVE" ]]; then
log "archive already present; skipping download"
else
curl -fL --retry 3 --retry-delay 5 -o "$ARCHIVE.partial" "$IMAGE_URL"
mv "$ARCHIVE.partial" "$ARCHIVE"
fi
log "verifying sha256"
ACTUAL="$(sha256sum "$ARCHIVE" | awk '{print $1}')"
if [[ "$ACTUAL" != "$IMAGE_SHA256" ]]; then
die "sha256 mismatch: expected $IMAGE_SHA256, got $ACTUAL"
fi
log "sha256 ok"
# Extract — handle either zip or 7z, since various mirrors choose one
# or the other.
case "$ARCHIVE" in
*.zip) ( cd "$WORK_DIR" && unzip -o "$ARCHIVE" ) ;;
*.7z|*.7zip) command -v 7z >/dev/null || die "7z not installed"; \
( cd "$WORK_DIR" && 7z x -y "$ARCHIVE" ) ;;
*) die "unsupported archive type: $ARCHIVE" ;;
esac
VMDK="$(find "$WORK_DIR" -name 'Metasploitable*.vmdk' -print -quit)"
[[ -n "$VMDK" ]] || die "no Metasploitable*.vmdk in extracted archive"
log "converting $VMDK → qcow2"
command -v qemu-img >/dev/null || die "qemu-img required (apt install qemu-utils)"
qemu-img convert -O qcow2 "$VMDK" "$OUT_DIR/metasploitable2.qcow2"
log "done: $OUT_DIR/metasploitable2.qcow2"
log "Tier-3 ready when msfrpcd is up. See scripts/install-msfrpcd.sh."
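
For callers that want the same integrity check from Python (for example a future orchestrator-side fetch), here is a minimal sketch of the verify step. `sha256_of` and `verify_archive` are hypothetical names; nothing in this commit defines them. Unlike the string comparison in the script, this version lowercases the expected hash, so an upper-case IMAGE_SHA256 would still match.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> bool:
    """True if the file's sha256 equals the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_sha256.strip().lower()
```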

124
scripts/install-msfrpcd.sh Executable file

@@ -0,0 +1,124 @@
#!/usr/bin/env bash
# Install + configure ``msfrpcd`` for the Tier-3 exploit driver.
#
# Idempotent: re-running on a host that already has msfrpcd rewrites
# the systemd unit, keeps the existing password, and doesn't reinstall
# the framework.
#
# Steps:
# 1. Install metasploit-framework via the host package manager (or
# report the right one-liner for that distro). Big download —
# ~1 GiB and several minutes.
# 2. Generate a strong password and store at /etc/cis490/msfrpc.env
# (mode 0640, owner root:cis490).
# 3. Drop /etc/systemd/system/cis490-msfrpcd.service that runs
# msfrpcd bound to 127.0.0.1:55553 with the generated password.
# 4. Enable + start, then check that the port is listening.
#
# After this runs, ``MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env;
# echo $MSFRPC_PASSWORD)`` makes tools/run_tier3_demo.py work zero-touch.
set -euo pipefail
ETC_ROOT="/etc/cis490"
ENV_FILE="$ETC_ROOT/msfrpc.env"
UNIT="/etc/systemd/system/cis490-msfrpcd.service"
PORT="${MSFRPC_PORT:-55553}"
USER_NAME="${MSFRPC_USER:-msf}"
log() { printf '[install-msfrpcd] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }
[[ $EUID -eq 0 ]] || die "must run as root"
command -v systemctl >/dev/null || die "systemd not found"
# --- 1. install metasploit-framework -----------------------------------
if ! command -v msfrpcd >/dev/null; then
log "msfrpcd not found; installing metasploit-framework"
if command -v apt-get >/dev/null; then
# The Debian/Ubuntu metasploit-framework package isn't in
# the default repos for most distros. Use Rapid7's official
# nightly installer when available.
if [[ ! -x /opt/metasploit-framework/bin/msfrpcd ]]; then
log "fetching Rapid7 nightly installer"
curl -fsSL https://raw.githubusercontent.com/rapid7/metasploit-omnibus/master/config/templates/metasploit-framework-wrappers/msfupdate.erb \
-o /tmp/msfinstall.sh || true
log "automated install not available — install manually:"
log " https://docs.metasploit.com/docs/using-metasploit/getting-started/nightly-installers.html"
die "rerun once msfrpcd is on PATH"
fi
# Symlink the wrapper so ``msfrpcd`` is on PATH.
ln -sf /opt/metasploit-framework/bin/msfrpcd /usr/local/bin/msfrpcd
elif command -v pacman >/dev/null; then
log "pacman -S metasploit"
pacman -Sy --noconfirm metasploit
elif command -v dnf >/dev/null; then
die "Fedora/RHEL: install metasploit-framework manually, then re-run"
else
die "unknown package manager — install metasploit-framework manually"
fi
fi
command -v msfrpcd >/dev/null || die "msfrpcd still missing after install attempt"
# --- 2. generate password ----------------------------------------------
install -d -m 0755 -o root -g root "$ETC_ROOT"
if ! id -u cis490 >/dev/null 2>&1; then
useradd --system --no-create-home --shell /usr/sbin/nologin cis490
fi
if [[ ! -f "$ENV_FILE" ]]; then
log "generating msfrpc password"
PW="$(openssl rand -base64 24 | tr -d '/+=' | head -c 32)"
install -m 0640 -o root -g cis490 /dev/stdin "$ENV_FILE" <<EOF
# Auto-generated by install-msfrpcd.sh — do not edit.
MSFRPC_HOST=127.0.0.1
MSFRPC_PORT=$PORT
MSFRPC_USER=$USER_NAME
MSFRPC_PASSWORD=$PW
EOF
else
log "$ENV_FILE exists; preserving existing password"
fi
# --- 3. systemd unit ----------------------------------------------------
log "installing systemd unit"
cat > "$UNIT" <<EOF
[Unit]
Description=CIS490 — Metasploit RPC daemon (loopback only)
Documentation=https://maxgit.wg/spectral/CIS490
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
EnvironmentFile=$ENV_FILE
# msfrpcd flags:
# -P <pw> password
# -U <user> username
# -a <ip> bind address (loopback only — Tier-3 driver runs locally)
# -p <port> port
# -f foreground (no daemonization, so systemd manages PID)
ExecStart=/usr/bin/env msfrpcd -P \${MSFRPC_PASSWORD} -U \${MSFRPC_USER} -a 127.0.0.1 -p \${MSFRPC_PORT} -f
Restart=on-failure
RestartSec=5
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now cis490-msfrpcd
# --- 4. final smoke -----------------------------------------------------
sleep 2
if ! ss -ltn 2>/dev/null | grep -q ":$PORT"; then
log "WARN: nothing listening on 127.0.0.1:$PORT yet — check"
log " journalctl -u cis490-msfrpcd"
fi
log "done. To run a Tier-3 episode:"
log " set -a; . $ENV_FILE; set +a"
log " python tools/run_tier3_demo.py --module vsftpd_234_backdoor"
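
If a Python caller would rather read the credentials file directly than source it in a shell, a minimal parser for the KEY=VALUE format written above could look like this. `parse_env_text` is a hypothetical helper; tools/run_tier3_demo.py is not shown in this commit, so how it actually obtains MSFRPC_PASSWORD is not assumed here. The file is mode 0640 root:cis490, so the reader must be root or in the cis490 group.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments.

    Only covers the plain format generated by install-msfrpcd.sh:
    no quoting, no 'export' prefix, no variable expansion.
    """
    out: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            out[key.strip()] = value.strip()
    return out


# Example input shaped like the generated file; the password is a dummy value.
EXAMPLE = """\
# Auto-generated by install-msfrpcd.sh
MSFRPC_HOST=127.0.0.1
MSFRPC_PORT=55553
MSFRPC_USER=msf
MSFRPC_PASSWORD=dummy-not-a-real-password
"""

env = parse_env_text(EXAMPLE)
print(env["MSFRPC_HOST"], env["MSFRPC_PORT"], env["MSFRPC_USER"])
```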

204
tests/test_fleet.py Normal file

@@ -0,0 +1,204 @@
"""Tests for fleet capacity calculation + sample manifest selection.
Capacity is unit-tested via deterministic monkeypatching of /proc and
os.cpu_count so the math is exercised independently of the host
running the suite. Sample selection has its own tests covering the
"different hosts pick different samples" property.
"""
from __future__ import annotations
from pathlib import Path
import pytest
from orchestrator import fleet
from samples.manifest import Sample, SampleManifest
REPO_ROOT = Path(__file__).resolve().parent.parent
# ---------------------------------------------------------------------------
# Capacity
# ---------------------------------------------------------------------------
def _patch_capacity_inputs(
monkeypatch,
*,
cores: int,
ram_total_mib: int,
ram_available_mib: int,
load_1m: float = 0.0,
) -> None:
monkeypatch.setattr(fleet.os, "cpu_count", lambda: cores)
monkeypatch.setattr(
fleet, "_read_meminfo",
lambda: {
"MemTotal": ram_total_mib * 1024 * 1024,
"MemAvailable": ram_available_mib * 1024 * 1024,
},
)
monkeypatch.setattr(fleet, "_read_loadavg", lambda: load_1m)
def test_capacity_8core_idle_box(monkeypatch) -> None:
_patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000)
c = fleet.detect_capacity(ram_per_vm_mib=320)
assert c.cores_total == 8
assert c.cores_reserved == 1 # 8 // 8 = 1
assert c.max_by_cores == 7
# Plenty of RAM, idle → cores binding.
assert c.max_concurrent == 7
assert "binding=cores" in c.rationale
def test_capacity_low_ram_caps_below_cores(monkeypatch) -> None:
# 8 cores but only ~2 GiB free → ram caps below cores.
_patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=4096, ram_available_mib=2048)
c = fleet.detect_capacity(ram_per_vm_mib=320)
# headroom = max(1024, 4096//8) = 1024
# max_by_ram = (2048 - 1024) // 320 = 3
assert c.max_by_ram == 3
assert c.max_concurrent == 3
def test_capacity_high_load_halves_concurrency(monkeypatch) -> None:
# 8 cores, plenty of RAM, but load_1m / cores > 0.75
_patch_capacity_inputs(
monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000,
load_1m=7.0, # 7/8 = 0.875 > 0.75
)
c = fleet.detect_capacity(ram_per_vm_mib=320)
# max_by_cores = 7; max_by_load = max(1, 7//2) = 3
assert c.max_by_load == 3
assert c.max_concurrent == 3
def test_capacity_pi5_class(monkeypatch) -> None:
"""4 cores + 8 GiB → reserve 1 core, run 3 concurrent."""
_patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=7951, ram_available_mib=5223)
c = fleet.detect_capacity(ram_per_vm_mib=320)
assert c.cores_total == 4
assert c.max_concurrent == 3
def test_capacity_minimal_box(monkeypatch) -> None:
"""1-core 1 GiB host shouldn't try to run any VMs."""
_patch_capacity_inputs(monkeypatch, cores=1, ram_total_mib=1024, ram_available_mib=512)
c = fleet.detect_capacity(ram_per_vm_mib=320)
assert c.max_concurrent == 0
def test_capacity_to_dict_round_trips(monkeypatch) -> None:
_patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=8000, ram_available_mib=6000)
c = fleet.detect_capacity(ram_per_vm_mib=320)
d = c.to_dict()
assert d["cores_total"] == 4
assert d["max_concurrent"] == c.max_concurrent
assert "rationale" in d
# ---------------------------------------------------------------------------
# Sample manifest
# ---------------------------------------------------------------------------
def test_repo_manifest_loads() -> None:
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
assert len(m) >= 4
# Every entry has required fields.
for s in m.samples:
assert s.name and s.family and s.category and s.profile
# All "mimic" today; will switch as real samples are added.
assert all(s.kind == "mimic" for s in m.samples)
def test_selection_is_deterministic() -> None:
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
a = m.select(host_id="lab-1", slot=2, episode_index=5)
b = m.select(host_id="lab-1", slot=2, episode_index=5)
assert a is b
def test_selection_differs_across_hosts() -> None:
"""Two hosts on the same slot/episode should generally hit
different samples (probabilistic: assert on the distribution, not on
individual equality).
"""
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
if len(m) < 2:
pytest.skip("manifest too small for diversity check")
matches = 0
for slot in range(20):
a = m.select(host_id="alice", slot=slot, episode_index=0)
b = m.select(host_id="bob", slot=slot, episode_index=0)
if a is b:
matches += 1
# If the catalog has N samples, the naive collision rate is ~1/N. With
# 20 trials and N≥4 we expect at most ~5 matches; the loose bound of 15
# below keeps the test from flaking.
assert matches < 15, "host_id seed isn't producing variety"
def test_selection_walks_catalog_across_episodes() -> None:
"""A single host over many episodes should hit every sample at
least once."""
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
seen = set()
for ep in range(200):
seen.add(m.select(host_id="lab-x", slot=0, episode_index=ep).name)
assert len(seen) == len(m), f"only saw {len(seen)}/{len(m)} samples"
def test_manifest_rejects_missing_required_field(tmp_path: Path) -> None:
p = tmp_path / "bad.toml"
p.write_text(
'[[sample]]\n'
'name = "x"\n'
'family = "y"\n'
'# missing category\n'
'profile = "z"\n'
)
with pytest.raises(ValueError, match="category"):
SampleManifest.load(p)
def test_manifest_rejects_unknown_category(tmp_path: Path) -> None:
p = tmp_path / "bad.toml"
p.write_text(
'[[sample]]\n'
'name = "x"\n'
'family = "y"\n'
'category = "fish"\n'
'profile = "z"\n'
)
with pytest.raises(ValueError, match="category"):
SampleManifest.load(p)
def test_manifest_rejects_duplicate_names(tmp_path: Path) -> None:
p = tmp_path / "dup.toml"
p.write_text(
'[[sample]]\n'
'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
'\n[[sample]]\n'
'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
)
with pytest.raises(ValueError, match="duplicate"):
SampleManifest.load(p)
def test_manifest_marks_real_when_sha256_present(tmp_path: Path) -> None:
p = tmp_path / "real.toml"
p.write_text(
'[[sample]]\n'
'name = "real-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
'sha256 = "abc123"\n'
'\n[[sample]]\n'
'name = "mimic-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
)
m = SampleManifest.load(p)
by_name = {s.name: s for s in m.samples}
assert by_name["real-one"].kind == "real"
assert by_name["mimic-one"].kind == "mimic"

152
tests/test_guest_agent.py Normal file

@@ -0,0 +1,152 @@
"""Tests for the host-side guest-agent collector.
We simulate the in-guest agent by spinning up a unix socket server
(stand-in for the QEMU virtio-serial chardev) that writes a few
JSON-lines rows. The collector should read them, re-stamp with the
host's monotonic clock, and persist to telemetry-guest.jsonl.
"""
from __future__ import annotations
import json
import socket
import threading
import time
from pathlib import Path
import pytest
from collectors import guest_agent
class FakeAgentServer(threading.Thread):
def __init__(self, sock_path: Path, rows: list[dict], delay_s: float = 0.05) -> None:
super().__init__(daemon=True)
self.sock_path = sock_path
self.rows = rows
self.delay_s = delay_s
self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
self._sock.bind(str(sock_path))
self._sock.listen(1)
self._sock.settimeout(5.0)
def run(self) -> None:
try:
conn, _ = self._sock.accept()
except socket.timeout:
return
try:
for row in self.rows:
conn.sendall((json.dumps(row) + "\n").encode())
time.sleep(self.delay_s)
time.sleep(0.1)
finally:
conn.close()
self._sock.close()
def test_collector_reads_jsonl_and_restamps(tmp_path: Path) -> None:
sock_path = tmp_path / "agent.sock"
rows_in = [
{
"t_guest_mono_ns": 1, "t_guest_wall_ns": 2,
"source": "guest_agent", "available_in_deployment": True,
"mem_total_bytes": 256 * 1024 * 1024,
"mem_available_bytes": 200 * 1024 * 1024,
"load_1m_5m_15m": [0.1, 0.05, 0.0],
"cpu_total_jiffies": {"user": 10, "system": 5, "idle": 1000},
},
{
"t_guest_mono_ns": 100_000_000, "t_guest_wall_ns": 100_000_002,
"source": "guest_agent", "available_in_deployment": True,
"mem_total_bytes": 256 * 1024 * 1024,
"mem_available_bytes": 198 * 1024 * 1024,
},
]
server = FakeAgentServer(sock_path, rows_in, delay_s=0.02)
server.start()
out_path = tmp_path / "telemetry-guest.jsonl"
stop = threading.Event()
def stop_after(ms: int) -> None:
time.sleep(ms / 1000.0)
stop.set()
threading.Thread(target=stop_after, args=(300,), daemon=True).start()
rows_written = guest_agent.run_loop(
socket_path=sock_path,
output_path=out_path,
t_mono_origin_ns=time.monotonic_ns(),
stop_event=stop,
connect_timeout_s=2.0,
)
server.join(timeout=2)
assert rows_written == 2
persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
assert len(persisted) == 2
for orig, got in zip(rows_in, persisted):
# Original guest timestamps preserved.
assert got["t_guest_mono_ns"] == orig["t_guest_mono_ns"]
# Host-clock fields added.
assert "t_mono_ns" in got
assert "t_wall_ns" in got
assert got["source"] == "guest_agent"
assert got["available_in_deployment"] is True
def test_collector_returns_zero_when_socket_missing(tmp_path: Path) -> None:
rows = guest_agent.run_loop(
socket_path=tmp_path / "no-socket-here.sock",
output_path=tmp_path / "out.jsonl",
t_mono_origin_ns=time.monotonic_ns(),
stop_event=threading.Event(),
connect_timeout_s=0.5,
)
assert rows == 0
def test_collector_drops_malformed_lines_but_keeps_going(tmp_path: Path) -> None:
sock_path = tmp_path / "agent.sock"
# Will be sent verbatim; the malformed line should be skipped.
payload = (
b'{"source":"guest_agent","mem_total_bytes":1}\n'
b'this-is-not-json\n'
b'{"source":"guest_agent","mem_total_bytes":2}\n'
)
class Server(threading.Thread):
def __init__(self) -> None:
super().__init__(daemon=True)
self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
self._sock.bind(str(sock_path))
self._sock.listen(1)
def run(self) -> None:
conn, _ = self._sock.accept()
try:
conn.sendall(payload)
time.sleep(0.2)
finally:
conn.close()
self._sock.close()
s = Server()
s.start()
out_path = tmp_path / "out.jsonl"
stop = threading.Event()
threading.Thread(
target=lambda: (time.sleep(0.4), stop.set()), daemon=True
).start()
rows = guest_agent.run_loop(
socket_path=sock_path,
output_path=out_path,
t_mono_origin_ns=time.monotonic_ns(),
stop_event=stop,
connect_timeout_s=2.0,
)
s.join(timeout=2)
assert rows == 2
persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
assert [r["mem_total_bytes"] for r in persisted] == [1, 2]
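Review note: the three tests above pin down the collector's contract (guest timestamps preserved, host-clock fields added, malformed lines skipped). A minimal sketch of that per-line step follows. `restamp_line` is a hypothetical helper, not the function in `collectors/guest_agent.py`, and it assumes `t_mono_ns` is stored relative to `t_mono_origin_ns`, which this diff does not confirm.

```python
from __future__ import annotations

import json
import time
from typing import Any


def restamp_line(raw: bytes, t_mono_origin_ns: int) -> dict[str, Any] | None:
    """Parse one agent JSON line and add host-clock fields.

    Returns None for malformed input so the caller can skip the line
    and keep reading. Hypothetical helper, for illustration only.
    """
    try:
        row = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None
    if not isinstance(row, dict):
        return None
    # Guest-side fields (t_guest_mono_ns, t_guest_wall_ns) are left as-is.
    # Assumption: host monotonic time is recorded relative to the origin.
    row["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    row["t_wall_ns"] = time.time_ns()
    return row
```

Returning `None` instead of raising keeps the read loop simple: one bad line costs one row, not the connection.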

tests/test_pcap.py Normal file
@ -0,0 +1,188 @@
"""Tests for the pcap collector's pure-Python parser + bucketizer.
We synthesize a tiny pcap file in memory (Ethernet + IPv4 + TCP/UDP
records with controlled timestamps), feed it to ``bucketize()``, and
verify the produced netflow.jsonl rows are correct.
"""
from __future__ import annotations
import json
import struct
from pathlib import Path
import pytest
from collectors import pcap
# ---------------------------------------------------------------------------
# pcap synthesis helpers
# ---------------------------------------------------------------------------
_PCAP_GLOBAL_HDR = struct.pack(
"<IHHiIII",
0xa1b2c3d4, # magic (us)
2, 4, # version
0, # thiszone
0, # sigfigs
65535, # snaplen
1, # linktype = LINKTYPE_ETHERNET
)
def _ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
s = bytes(int(x) for x in src.split("."))
d = bytes(int(x) for x in dst.split("."))
total_len = 20 + len(payload)
return struct.pack(
">BBHHHBBH", # network byte order; src/dst addresses are appended below
0x45, # version=4, IHL=5
0, # tos
total_len,
0, 0, 64, proto,
0, # checksum (don't care)
) + s + d + payload
def _tcp(sport: int, dport: int, flags: int) -> bytes:
# Minimal 20-byte TCP header: sport, dport, seq, ack, off+flags, win, csum, urg
return struct.pack(">HHIIBBHHH",
sport, dport,
0, 0,
0x50, # data offset = 5 (no options)
flags,
0, 0, 0)
def _udp(sport: int, dport: int, length: int = 8) -> bytes:
return struct.pack(">HHHH", sport, dport, length, 0)
def _ether(payload: bytes, ethertype: int = 0x0800) -> bytes:
return b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + struct.pack(">H", ethertype) + payload
def _record(ts_ns: int, frame: bytes) -> bytes:
sec = ts_ns // 1_000_000_000
usec = (ts_ns // 1000) % 1_000_000
return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame
def _build_pcap(records: list[tuple[int, bytes]]) -> bytes:
out = bytearray(_PCAP_GLOBAL_HDR)
for ts, frame in records:
out += _record(ts, frame)
return bytes(out)
def _write_pcap(path: Path, records: list[tuple[int, bytes]]) -> None:
path.write_bytes(_build_pcap(records))
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
def test_iter_pcap_reads_records_back(tmp_path: Path) -> None:
p = tmp_path / "a.pcap"
frame = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
_write_pcap(p, [(1_000_000_000, frame)])
records = list(pcap._iter_pcap(p))
assert len(records) == 1
t_ns, data = records[0]
assert t_ns == 1_000_000_000
assert data == frame
def test_decode_tcp_syn() -> None:
f = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
d = pcap._decode(f)
assert d["ethertype"] == 0x0800
assert d["ip_proto"] == 6
assert d["src_ip"] == "10.200.0.1"
assert d["dst_ip"] == "10.200.0.10"
assert d["src_port"] == 40000
assert d["dst_port"] == 21
assert d["tcp_flags"] & 0x02
def test_decode_udp_dns_query() -> None:
f = _ether(_ipv4("10.200.0.10", "10.200.0.1", 17, _udp(33333, 53)))
d = pcap._decode(f)
assert d["ip_proto"] == 17
assert d["dst_port"] == 53
def test_bucketize_collapses_per_window(tmp_path: Path) -> None:
pcap_path = tmp_path / "ep.pcap"
netflow_path = tmp_path / "netflow.jsonl"
bridge_ip = "10.200.0.1"
guest_ip = "10.200.0.10"
base_ns = 1_700_000_000_000_000_000 # arbitrary, aligned-friendly
records = [
# Bucket A (0..100ms)
(base_ns + 5_000_000,
_ether(_ipv4(guest_ip, bridge_ip, 6, _tcp(40000, 21, flags=0x02)))),
(base_ns + 9_000_000,
_ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x12)))),
# Bucket B (100..200ms): UDP DNS query
(base_ns + 105_000_000,
_ether(_ipv4(guest_ip, bridge_ip, 17, _udp(33333, 53)))),
# Bucket B: TCP RST
(base_ns + 199_000_000,
_ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x04)))),
]
_write_pcap(pcap_path, records)
rows_written = pcap.bucketize(
pcap_path, netflow_path,
bucket_ms=100,
t_mono_origin_ns=base_ns,
bridge_ip=bridge_ip,
)
assert rows_written == 2
rows = [json.loads(l) for l in netflow_path.read_text().splitlines()]
a, b = rows
assert a["bucket_ms"] == 100
# Bucket A: 1 in (SYN), 1 out (SYN-ACK)
assert a["pkts_in"] == 1
assert a["pkts_out"] == 1
assert a["syn_count"] == 2
assert a["tcp_new_flows"] == 1 # only the bare SYN counts as new flow
assert a["dns_query_count"] == 0
assert a["unique_dst_ips"] == 2
# Bucket B: DNS + RST
assert b["dns_query_count"] == 1
assert b["rst_count"] == 1
def test_bucketize_returns_zero_for_missing_file(tmp_path: Path) -> None:
rows = pcap.bucketize(
tmp_path / "nope.pcap",
tmp_path / "netflow.jsonl",
bucket_ms=100,
t_mono_origin_ns=0,
)
assert rows == 0
def test_bucketize_handles_unknown_ethertype(tmp_path: Path) -> None:
p = tmp_path / "x.pcap"
netflow = tmp_path / "n.jsonl"
# ARP frame (ethertype 0x0806) — counted but not decoded.
f = _ether(b"\x00" * 28, ethertype=0x0806)
_write_pcap(p, [(1_000_000_000, f)])
rows = pcap.bucketize(p, netflow, bucket_ms=100, t_mono_origin_ns=0)
assert rows == 1
out = json.loads(netflow.read_text().splitlines()[0])
# No IP info, but byte/packet count survives.
assert out["pkts_in"] + out["pkts_out"] == 1
assert out["tcp_count"] == 0
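Review note: for readers unfamiliar with the on-disk layout these helpers synthesize, here is a minimal sketch of reading it back and assigning a packet to a window. It handles only the little-endian, microsecond-resolution variant that `_PCAP_GLOBAL_HDR` produces. `iter_pcap_bytes` and `bucket_index` are hypothetical names, not the functions in `collectors/pcap.py`.

```python
from __future__ import annotations

import struct
from typing import Iterator

_GLOBAL_HDR = struct.Struct("<IHHiIII")  # 24-byte file header
_RECORD_HDR = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len


def iter_pcap_bytes(blob: bytes) -> Iterator[tuple[int, bytes]]:
    """Yield (timestamp_ns, frame) pairs from an in-memory pcap."""
    if len(blob) < _GLOBAL_HDR.size:
        return
    magic = _GLOBAL_HDR.unpack_from(blob, 0)[0]
    if magic != 0xA1B2C3D4:
        raise ValueError(f"unsupported pcap magic: {magic:#x}")
    off = _GLOBAL_HDR.size
    while off + _RECORD_HDR.size <= len(blob):
        sec, usec, incl_len, _orig_len = _RECORD_HDR.unpack_from(blob, off)
        off += _RECORD_HDR.size
        frame = blob[off:off + incl_len]
        if len(frame) < incl_len:
            return  # truncated final record: stop quietly
        off += incl_len
        yield sec * 1_000_000_000 + usec * 1_000, frame


def bucket_index(t_ns: int, origin_ns: int, bucket_ms: int = 100) -> int:
    """Index of the fixed-width window that a timestamp falls into."""
    return (t_ns - origin_ns) // (bucket_ms * 1_000_000)
```

With `bucket_ms=100`, the four records in `test_bucketize_collapses_per_window` land in windows 0, 0, 1 and 1, which is why that test expects two rows.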

tests/test_qmp.py Normal file
@ -0,0 +1,295 @@
"""Tests for the QMP collector against an in-process fake QMP server.
The fake speaks just enough QMP to exercise:
- the greeting + qmp_capabilities handshake
- query-status
- query-blockstats
- query-stats target=vm
- error responses
- async events interleaved with command responses
"""
from __future__ import annotations
import json
import socket
import tempfile
import threading
import time
from pathlib import Path
from typing import Any
import pytest
from collectors import qmp
# ---------------------------------------------------------------------------
# Fake QMP server
# ---------------------------------------------------------------------------
class FakeQMPServer(threading.Thread):
"""Single-connection fake. Each line received from the client is
parsed as JSON; we look up ``execute`` in ``responses`` and emit
the configured reply. Optionally interleaves an async event before
the response."""
def __init__(
self,
socket_path: Path,
*,
responses: dict[str, Any] | None = None,
emit_event_before: set[str] | None = None,
) -> None:
super().__init__(daemon=True)
self.socket_path = socket_path
self.responses = responses or {}
self.emit_event_before = emit_event_before or set()
self.received: list[dict] = []
self._stop = threading.Event()
self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
self._sock.bind(str(socket_path))
self._sock.listen(1)
self._sock.settimeout(5.0)
def run(self) -> None:
try:
conn, _ = self._sock.accept()
except socket.timeout:
return
conn.settimeout(5.0)
try:
# Greeting
conn.sendall(b'{"QMP": {"version": {"qemu": {"major":9,"minor":0,"micro":0}}, "capabilities": []}}\n')
buf = b""
while not self._stop.is_set():
try:
chunk = conn.recv(4096)
except socket.timeout:
if self._stop.is_set():
return
continue
if not chunk:
return
buf += chunk
while b"\n" in buf:
line, _, buf = buf.partition(b"\n")
if not line.strip():
continue
msg = json.loads(line)
self.received.append(msg)
cmd = msg.get("execute")
if cmd == "qmp_capabilities":
conn.sendall(b'{"return": {}}\n')
continue
if cmd in self.emit_event_before:
conn.sendall(b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n')
if cmd in self.responses:
resp = self.responses[cmd]
conn.sendall((json.dumps(resp) + "\n").encode())
else:
conn.sendall(b'{"error": {"class": "CommandNotFound", "desc": "unknown"}}\n')
finally:
conn.close()
def shutdown(self) -> None:
self._stop.set()
try:
self._sock.close()
except OSError:
pass
@pytest.fixture
def qmp_server(tmp_path: Path):
sock_path = tmp_path / "qmp.sock"
return sock_path
# ---------------------------------------------------------------------------
# Client tests
# ---------------------------------------------------------------------------
def test_connect_negotiates_capabilities(qmp_server: Path) -> None:
server = FakeQMPServer(qmp_server)
server.start()
try:
client = qmp.QMPClient(qmp_server)
greeting = client.connect()
assert "version" in greeting
finally:
client.close()
server.shutdown()
# Server saw the qmp_capabilities call (other commands are not ruled out).
assert any(m.get("execute") == "qmp_capabilities" for m in server.received)
def test_execute_returns_payload(qmp_server: Path) -> None:
server = FakeQMPServer(
qmp_server,
responses={
"query-status": {"return": {"status": "running", "running": True}},
},
)
server.start()
try:
client = qmp.QMPClient(qmp_server)
client.connect()
out = client.execute("query-status")
assert out == {"status": "running", "running": True}
finally:
client.close()
server.shutdown()
def test_execute_skips_async_events_before_response(qmp_server: Path) -> None:
server = FakeQMPServer(
qmp_server,
responses={
"query-status": {"return": {"status": "running", "running": True}},
},
emit_event_before={"query-status"},
)
server.start()
try:
client = qmp.QMPClient(qmp_server)
client.connect()
out = client.execute("query-status")
assert out["running"] is True
finally:
client.close()
server.shutdown()
def test_execute_raises_on_qmp_error(qmp_server: Path) -> None:
server = FakeQMPServer(qmp_server) # no responses → server sends error
server.start()
try:
client = qmp.QMPClient(qmp_server)
client.connect()
with pytest.raises(qmp.QMPError):
client.execute("totally-fake-command")
finally:
client.close()
server.shutdown()
# ---------------------------------------------------------------------------
# Row builder tests
# ---------------------------------------------------------------------------
def test_collect_once_assembles_full_row(qmp_server: Path) -> None:
server = FakeQMPServer(
qmp_server,
responses={
"query-status": {"return": {"status": "running", "running": True}},
"query-blockstats": {"return": [{
"device": "virtio0",
"stats": {
"rd_operations": 12, "wr_operations": 4,
"rd_bytes": 49152, "wr_bytes": 16384,
"flush_operations": 1,
},
}]},
"query-stats": {"return": [{"stats": [
{"name": "halt_exits", "value": 17000},
{"name": "io_exits", "value": 942},
{"name": "string-skipped", "value": "not-an-int"},
]}]},
},
)
server.start()
try:
client = qmp.QMPClient(qmp_server)
client.connect()
row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
finally:
client.close()
server.shutdown()
assert row["source"] == "host_qmp"
assert row["available_in_deployment"] is False
assert row["vm_running"] is True
assert row["blockstats"]["virtio0"]["rd_bytes"] == 49152
assert row["blockstats"]["virtio0"]["flush_ops"] == 1
assert row["kvm_stats"]["halt_exits"] == 17000
assert "string-skipped" not in row["kvm_stats"]
def test_collect_once_tolerates_missing_query_stats(qmp_server: Path) -> None:
server = FakeQMPServer(
qmp_server,
responses={
"query-status": {"return": {"status": "running", "running": True}},
"query-blockstats": {"return": []},
# query-stats deliberately absent → server returns CommandNotFound
},
)
server.start()
try:
client = qmp.QMPClient(qmp_server)
client.connect()
row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
finally:
client.close()
server.shutdown()
# Older qemu without query-stats: row still exists, kvm_stats absent.
assert "kvm_stats" not in row
assert row["vm_running"] is True
assert row["blockstats"] == {}
# ---------------------------------------------------------------------------
# run_loop tests
# ---------------------------------------------------------------------------
def test_run_loop_writes_rows_and_stops_cleanly(qmp_server: Path, tmp_path: Path) -> None:
server = FakeQMPServer(
qmp_server,
responses={
"query-status": {"return": {"status": "running", "running": True}},
"query-blockstats": {"return": []},
"query-stats": {"error": {"class": "CommandNotFound", "desc": "n/a"}},
},
)
server.start()
out_path = tmp_path / "telemetry-qmp.jsonl"
stop = threading.Event()
def stop_after(ms: int) -> None:
time.sleep(ms / 1000.0)
stop.set()
threading.Thread(target=stop_after, args=(350,), daemon=True).start()
rows = qmp.run_loop(
socket_path=qmp_server,
output_path=out_path,
t_mono_origin_ns=time.monotonic_ns(),
interval_ms=100,
stop_event=stop,
)
server.shutdown()
assert rows >= 2, f"expected >=2 rows, got {rows}"
lines = [json.loads(l) for l in out_path.read_text().splitlines()]
assert len(lines) == rows
for r in lines:
assert r["source"] == "host_qmp"
assert r["vm_running"] is True
def test_run_loop_returns_zero_when_socket_missing(tmp_path: Path) -> None:
# No server bound to the socket path.
rows = qmp.run_loop(
socket_path=tmp_path / "nonexistent.sock",
output_path=tmp_path / "telemetry-qmp.jsonl",
t_mono_origin_ns=time.monotonic_ns(),
interval_ms=100,
stop_event=threading.Event(),
)
assert rows == 0
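Review note: the client tests above depend on one behaviour, namely that a command reply may be preceded by asynchronous events, which must be skipped. A minimal sketch of that read step over an iterable of JSON lines follows. `read_reply` and `QMPReplyError` are hypothetical stand-ins, not the API of `collectors/qmp.py`.

```python
from __future__ import annotations

import json
from typing import Any, Iterable


class QMPReplyError(Exception):
    """Hypothetical stand-in for the collector's own error type."""


def read_reply(lines: Iterable[bytes]) -> Any:
    """Return the payload of the first command reply in `lines`.

    Messages carrying an "event" key are asynchronous notifications
    and are skipped; an "error" reply is raised as an exception.
    """
    for line in lines:
        if not line.strip():
            continue
        msg = json.loads(line)
        if "event" in msg:
            continue  # async event interleaved before the reply
        if "error" in msg:
            err = msg["error"]
            raise QMPReplyError(f'{err.get("class")}: {err.get("desc")}')
        if "return" in msg:
            return msg["return"]
    raise QMPReplyError("stream ended before a reply arrived")
```

Tolerating a missing `query-stats`, as `test_collect_once_tolerates_missing_query_stats` expects, then amounts to catching this one exception around that single command.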

@ -28,7 +28,7 @@ from pathlib import Path
import pycdlib
DEFAULT_USER_DATA = """\
DEFAULT_USER_DATA_HEAD = """\
#cloud-config
hostname: cis490
manage_etc_hosts: true
@ -45,10 +45,70 @@ chpasswd:
list: |
root:cis490
cis490:cis490
runcmd:
- [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]
"""
# OpenRC service file shipped inside the guest. Alpine uses OpenRC;
# the runcmd at the bottom of user-data wires it up on first boot.
OPENRC_SERVICE = """\
#!/sbin/openrc-run
description="CIS490 in-guest telemetry agent"
command="/usr/local/bin/cis490-agent"
command_args="--port /dev/virtio-ports/cis490.guest.agent"
command_background=true
pidfile="/run/cis490-agent.pid"
output_log="/var/log/cis490-agent.log"
error_log="/var/log/cis490-agent.log"
depend() {
need localmount
}
"""
DEFAULT_META_DATA = """\
instance-id: cis490-vm-001
local-hostname: cis490
"""
def _indent(text: str, n: int) -> str:
pad = " " * n
return "\n".join(pad + line if line else line for line in text.splitlines())
def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
"""Build a cloud-init user-data document. When ``embed_agent`` is
True, also stuff the in-guest agent + an OpenRC service into
``write_files`` and arrange to start the service on first boot."""
head = DEFAULT_USER_DATA_HEAD
if not embed_agent:
return (head + 'runcmd:\n - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n').encode()
if agent_path is None:
agent_path = Path(__file__).resolve().parent.parent / "vm" / "guest-agent" / "cis490_agent.py"
if not agent_path.exists():
raise FileNotFoundError(f"agent script not found: {agent_path}")
agent_src = agent_path.read_text()
body = head + (
"write_files:\n"
" - path: /usr/local/bin/cis490-agent\n"
" permissions: '0755'\n"
" owner: root:root\n"
" content: |\n"
f"{_indent(agent_src, 6)}\n"
" - path: /etc/init.d/cis490-agent\n"
" permissions: '0755'\n"
" owner: root:root\n"
" content: |\n"
f"{_indent(OPENRC_SERVICE, 6)}\n"
"runcmd:\n"
' - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n'
' - [ sh, -c, "command -v rc-update >/dev/null && rc-update add cis490-agent default || true" ]\n'
' - [ sh, -c, "command -v rc-service >/dev/null && rc-service cis490-agent start || true" ]\n'
)
return body.encode()
DEFAULT_META_DATA = """\
instance-id: cis490-vm-001
local-hostname: cis490
@ -93,11 +153,26 @@ def main() -> int:
default=None,
help="path to a custom meta-data file",
)
parser.add_argument(
"--no-embed-agent",
action="store_true",
help="don't bake the in-guest agent into user-data",
)
parser.add_argument(
"--agent-path",
type=Path,
default=None,
help="path to the in-guest agent (default: vm/guest-agent/cis490_agent.py)",
)
args = parser.parse_args()
user_data = (
args.user_data.read_bytes() if args.user_data else DEFAULT_USER_DATA.encode()
)
if args.user_data:
user_data = args.user_data.read_bytes()
else:
user_data = build_user_data(
embed_agent=not args.no_embed_agent,
agent_path=args.agent_path,
)
meta_data = (
args.meta_data.read_bytes() if args.meta_data else DEFAULT_META_DATA.encode()
)
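Review note: `build_user_data` relies on indenting the embedded script so that it sits inside a YAML literal block scalar (`content: |`). The sketch below shows the idea in isolation. `write_files_entry` is a hypothetical helper, and the 2/4/6-space indentation is an assumption, since the rendered diff does not preserve the original whitespace.

```python
def indent_block(text: str, n: int) -> str:
    """Indent every non-empty line by n spaces; blank lines stay empty."""
    pad = " " * n
    return "\n".join(pad + line if line else line for line in text.splitlines())


def write_files_entry(path: str, content: str, mode: str = "0755") -> str:
    """Render one cloud-init write_files item with a literal block scalar.

    Assumes list items are indented 2 spaces, so the block's content
    must be indented deeper than the `content:` key (6 spaces here).
    """
    return (
        f"  - path: {path}\n"
        f"    permissions: '{mode}'\n"
        "    content: |\n"
        f"{indent_block(content, 6)}\n"
    )


if __name__ == "__main__":
    print(write_files_entry("/usr/local/bin/demo", "#!/bin/sh\n\necho hi\n"))
```

Leaving blank lines unpadded avoids trailing whitespace in the generated user-data; YAML accepts empty lines inside a literal block.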

tools/run_fleet.py Normal file
@ -0,0 +1,97 @@
"""``cis490-fleet`` — run as many concurrent labeled episodes as the
host can handle, drawing samples from the manifest.
Modes:
--capacity Print the resource calculation and exit. No VMs spawned.
--waves N Run N waves of episodes (one wave = max_concurrent
episodes, each in its own slot). Default: 1.
--max-concurrent N
Cap concurrency below the auto-detected ceiling.
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import signal
import sys
from pathlib import Path
# Allow running as a script.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from orchestrator.fleet import ( # noqa: E402
FleetConfig, FleetRunner, capacity_report, detect_capacity,
)
from samples.manifest import SampleManifest # noqa: E402
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser(prog="cis490-fleet")
p.add_argument("--capacity", action="store_true")
p.add_argument("--waves", type=int, default=1)
p.add_argument("--max-concurrent", type=int, default=None)
p.add_argument("--manifest",
default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"))
p.add_argument("--data-root", default="data")
p.add_argument("--host-id", default=os.environ.get("FLEET_HOST_ID") or os.uname().nodename)
p.add_argument("--ram-per-vm-mib", type=int, default=320)
p.add_argument("--require-real-samples", action="store_true")
p.add_argument("--log-level", default="INFO")
args = p.parse_args(argv)
logging.basicConfig(
level=getattr(logging, args.log_level.upper(), logging.INFO),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
if args.capacity:
print(capacity_report())
return 0
manifest = SampleManifest.load(args.manifest)
repo_root = Path(__file__).resolve().parent.parent
cfg = FleetConfig(
host_id=args.host_id,
repo_root=repo_root,
data_root=Path(args.data_root).resolve(),
manifest=manifest,
ram_per_vm_mib=args.ram_per_vm_mib,
max_concurrent_override=args.max_concurrent,
require_real_samples=args.require_real_samples,
)
runner = FleetRunner(cfg)
def _stop(signum, frame): # noqa: ARG001
runner.stop()
signal.signal(signal.SIGTERM, _stop)
signal.signal(signal.SIGINT, _stop)
result = runner.run(episodes=args.waves)
print(json.dumps({
"host_id": args.host_id,
"capacity": result.capacity.to_dict(),
"slots": [
{
"slot": s.slot,
"sample": s.sample_name,
"sample_kind": s.sample_kind,
"rc": s.rc,
"duration_s": s.duration_s,
"error": s.error,
} for s in result.slots
],
"total_duration_s": result.total_duration_s,
}, indent=2))
return 0 if all(s.rc == 0 for s in result.slots) else 1
if __name__ == "__main__":
sys.exit(main())

@ -0,0 +1,274 @@
#!/usr/bin/env python3
"""In-guest telemetry agent — runs INSIDE the VM.
Writes one JSON-lines row per tick to a virtio-serial port that the
host has wired up as ``cis490.guest.agent``. The host-side collector
(`collectors.guest_agent`) reads these rows and stamps them with the
host's monotonic clock before persisting to ``telemetry-guest.jsonl``.
Stdlib only: no `psutil`, no extra deps to bake into the guest. Every
field is read from /proc on the guest, so this works on busybox-based
Alpine, on Cirros, and on Metasploitable2 unchanged.
Wire path inside the guest:
/dev/virtio-ports/cis490.guest.agent
The host side opens the matching unix socket on the hypervisor.
The protocol is intentionally trivial: the agent emits newline-
delimited JSON; the host emits nothing back. One direction.
This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""
from __future__ import annotations
import argparse
import json
import os
import platform
import sys
import time
from typing import Any
SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True
DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
DEFAULT_INTERVAL_MS = 100 # 10 Hz
DEFAULT_TOP_N = 8
# ---------- /proc parsers ---------------------------------------------------
def _read(path: str) -> str | None:
try:
with open(path, "rb") as f:
return f.read().decode("ascii", errors="replace")
except OSError: # also covers a /proc/<pid> entry vanishing mid-read
return None
def read_loadavg() -> tuple[float, float, float] | None:
text = _read("/proc/loadavg")
if text is None:
return None
parts = text.split()
return float(parts[0]), float(parts[1]), float(parts[2])
def read_meminfo() -> dict[str, int]:
text = _read("/proc/meminfo")
out: dict[str, int] = {}
if text is None:
return out
for line in text.splitlines():
k, _, rest = line.partition(":")
v = rest.strip()
if v.endswith(" kB"):
try:
out[k] = int(v[:-3]) * 1024
except ValueError:
pass
return out
def read_cpu_total() -> dict[str, int] | None:
"""First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
in jiffies since boot."""
text = _read("/proc/stat")
if text is None:
return None
line = text.splitlines()[0]
fields = line.split()
# cpu user nice system idle iowait irq softirq steal guest guest_nice
if not fields or fields[0] != "cpu":
return None
nums = [int(x) for x in fields[1:]]
pad = nums + [0] * max(0, 10 - len(nums))
return {
"user": pad[0],
"nice": pad[1],
"system": pad[2],
"idle": pad[3],
"iowait": pad[4],
"irq": pad[5],
"softirq": pad[6],
"steal": pad[7],
"guest": pad[8],
"guest_nice":pad[9],
}
def read_thermal_milli_c() -> int | None:
"""Best-effort: /sys/class/thermal/thermal_zone0/temp."""
text = _read("/sys/class/thermal/thermal_zone0/temp")
if text is None:
return None
try:
return int(text.strip())
except ValueError:
return None
def read_net_devs() -> dict[str, dict[str, int]]:
"""Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
text = _read("/proc/net/dev")
out: dict[str, dict[str, int]] = {}
if text is None:
return out
lines = text.splitlines()
for line in lines[2:]:
if ":" not in line:
continue
name, _, rest = line.partition(":")
name = name.strip()
if name == "lo":
continue
cols = rest.split()
if len(cols) < 16:
continue
out[name] = {
"rx_bytes": int(cols[0]),
"rx_pkts": int(cols[1]),
"tx_bytes": int(cols[8]),
"tx_pkts": int(cols[9]),
}
return out
def read_listen_ports() -> list[int]:
"""TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
out: set[int] = set()
for path in ("/proc/net/tcp", "/proc/net/tcp6"):
text = _read(path)
if not text:
continue
for line in text.splitlines()[1:]:
cols = line.split()
if len(cols) < 4:
continue
if cols[3] != "0A":
continue
local = cols[1] # "ADDR:PORT" with PORT in hex
_, _, port_hex = local.rpartition(":")
try:
out.add(int(port_hex, 16))
except ValueError:
pass
return sorted(out)
def read_top_procs(top_n: int) -> list[dict[str, Any]]:
"""Top-N processes by RSS. Cheap O(N) scan of /proc."""
procs: list[dict[str, Any]] = []
try:
entries = os.listdir("/proc")
except OSError:
return procs
for ent in entries:
if not ent.isdigit():
continue
pid = int(ent)
stat = _read(f"/proc/{pid}/stat")
if stat is None:
continue
try:
rparen = stat.rindex(")")
comm = stat[stat.index("(") + 1 : rparen]
fields = stat[rparen + 2:].split()
utime = int(fields[11])
stime = int(fields[12])
rss_pages = int(fields[21])
except (ValueError, IndexError):
continue
procs.append({
"pid": pid,
"comm": comm[:32],
"cpu_jiffies": utime + stime,
"rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
})
procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
return procs[:top_n]
# ---------- one tick --------------------------------------------------------
def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
mem = read_meminfo()
cpu = read_cpu_total()
load = read_loadavg()
return {
"t_guest_mono_ns": time.monotonic_ns(),
"t_guest_wall_ns": time.time_ns(),
"source": SOURCE,
"available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
"kernel": platform.release(),
"cpu_total_jiffies": cpu,
"load_1m_5m_15m": list(load) if load else None,
"mem_total_bytes": (mem.get("MemTotal") or 0),
"mem_available_bytes": (mem.get("MemAvailable") or 0),
"mem_buffers_bytes": (mem.get("Buffers") or 0),
"mem_cached_bytes": (mem.get("Cached") or 0),
"swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
"thermal_milli_c": read_thermal_milli_c(),
"net": read_net_devs(),
"listen_ports": read_listen_ports(),
"top_procs": read_top_procs(top_n),
}
# ---------- main loop -------------------------------------------------------
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser(prog="cis490-guest-agent")
p.add_argument("--port", default=DEFAULT_PORT,
help="virtio-serial port path inside the guest")
p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
p.add_argument("--once", action="store_true",
help="emit a single row and exit (for smoke tests)")
args = p.parse_args(argv)
if args.once:
sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
sys.stdout.flush()
return 0
# Open the virtio-serial port. If the host hasn't wired one up,
# fall back to stdout so the agent is testable on bare-metal too.
out_fp: Any
if os.path.exists(args.port):
out_fp = open(args.port, "wb", buffering=0)
else:
sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
out_fp = sys.stdout.buffer
interval_ns = args.interval_ms * 1_000_000
next_tick = time.monotonic_ns()
try:
while True:
row = collect_once(args.top_n)
out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
try:
out_fp.flush()
except (AttributeError, OSError):
pass
next_tick += interval_ns
sleep_ns = next_tick - time.monotonic_ns()
if sleep_ns > 0:
time.sleep(sleep_ns / 1_000_000_000)
else:
next_tick = time.monotonic_ns()
except KeyboardInterrupt:
return 0
except (BrokenPipeError, OSError) as e:
sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
return 1
if __name__ == "__main__":
sys.exit(main())
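Review note: the agent emits `cpu_total_jiffies` as cumulative counters since boot, so any utilization figure has to be derived downstream from two consecutive rows. A minimal sketch of that derivation follows. `cpu_busy_fraction` is a hypothetical consumer-side helper that is not part of this change, and treating `iowait` as idle time is a choice made for this sketch.

```python
from __future__ import annotations

# guest and guest_nice are omitted on purpose: the kernel already
# includes them in user and nice, so adding them would double-count.
_KEYS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_busy_fraction(prev: dict[str, int], cur: dict[str, int]) -> float | None:
    """Fraction of CPU time spent non-idle between two snapshots.

    Returns None when no jiffies elapsed (or counters went backwards),
    since the ratio is undefined in that case.
    """
    delta = {k: cur[k] - prev[k] for k in _KEYS}
    total = sum(delta.values())
    if total <= 0:
        return None
    idle = delta["idle"] + delta["iowait"]
    return (total - idle) / total
```

At the default 100 ms tick only a handful of jiffies elapse per row on a single-vCPU guest, so the ratio is coarse; averaging over several rows may be preferable.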

@ -16,7 +16,12 @@ set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
IMAGE="${IMAGE:-$REPO_ROOT/vm/images/alpine-baseline.qcow2}"
CIDATA="${CIDATA:-$REPO_ROOT/vm/images/cidata.iso}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
# SLOT lets the fleet runner spin up N concurrent VMs without socket /
# port collisions. Default RUN_DIR + ssh hostfwd port keep single-VM
# usage unchanged.
SLOT="${SLOT:-0}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-vm-$SLOT}"
SSH_PORT="${SSH_PORT:-$((2222 + SLOT))}"
mkdir -p "$RUN_DIR"
QMP_SOCK="$RUN_DIR/qmp.sock"
@ -32,8 +37,14 @@ if [[ ! -f "$CIDATA" ]]; then
exit 1
fi
AGENT_SOCK="$RUN_DIR/agent.sock"
# snapshot=on routes guest writes through a temporary overlay so the qcow2
# on disk is never mutated — every boot starts from the same bytes.
#
# Second virtio-serial port (cis490.guest.agent) carries telemetry
# from the in-guest agent. Surfaces inside the guest at
# /dev/virtio-ports/cis490.guest.agent and on the host at $AGENT_SOCK.
exec qemu-system-x86_64 \
-name cis490-vm \
-machine q35,accel=kvm \
@ -42,8 +53,11 @@ exec qemu-system-x86_64 \
-m 256 \
-drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
-drive file="$CIDATA",format=raw,if=virtio,readonly=on \
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2222-:22 \
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:"$SSH_PORT"-:22 \
-device virtio-net-pci,netdev=n0 \
-device virtio-serial-pci,id=cis490vs0 \
-chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
-device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
-nographic \
-serial unix:"$RUN_DIR/serial.sock",server=on,wait=off \
-monitor unix:"$MON_SOCK",server=on,wait=off \

@ -26,11 +26,14 @@ set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
IMAGE="${IMAGE:-$REPO_ROOT/vm/images/metasploitable2.qcow2}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-target}"
SLOT="${SLOT:-0}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-target-$SLOT}"
RAM_MIB="${RAM_MIB:-512}"
# Ports the host should forward to the guest. Comma-separated host:guest pairs.
# Default covers the vsftpd module's RPORT.
TARGET_PORTS="${TARGET_PORTS:-21:21}"
# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
# KVM if the host can take it; otherwise fall back to TCG. Cross-arch
# images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.
ACCEL="${ACCEL:-}"
@ -77,7 +80,13 @@ if [[ "$ACCEL" == "kvm" ]]; then
CPU_FLAGS=(-cpu host)
fi
AGENT_SOCK="$RUN_DIR/agent.sock"
# snapshot=on so the qcow2 is never mutated — every boot is identical.
# Second virtio-serial port carries the in-guest agent's telemetry to
# the host (see vm/guest-agent/). Targets without the agent installed
# (e.g. unmodified Metasploitable2) leave the device unused — the
# host-side collector simply gets no rows. Harmless.
exec qemu-system-x86_64 \
-name cis490-target \
-machine q35,accel="$ACCEL" \
@ -87,6 +96,9 @@ exec qemu-system-x86_64 \
-drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
-netdev "$NETDEV" \
-device virtio-net-pci,netdev=n0 \
-device virtio-serial-pci,id=cis490vs0 \
-chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
-device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
-nographic \
-serial unix:"$SERIAL_SOCK",server=on,wait=off \
-monitor unix:"$MON_SOCK",server=on,wait=off \
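Review note: the two launchers derive per-slot paths and ports from `SLOT` with plain arithmetic. The sketch below mirrors those defaults in Python so the mapping can be checked for collisions. `SlotLayout` and `slot_layout` are hypothetical names that do not exist in the repository; the constants are the launcher defaults shown above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SlotLayout:
    slot: int
    demo_run_dir: str     # RUN_DIR default in launch_demo.sh
    target_run_dir: str   # RUN_DIR default in launch_target.sh
    ssh_port: int         # SSH_PORT default: 2222 + SLOT
    target_port: int      # PORT_BASE default: 21 + SLOT * 100


def slot_layout(slot: int) -> SlotLayout:
    """Mirror the launchers' default per-slot paths and host ports."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return SlotLayout(
        slot=slot,
        demo_run_dir=f"/tmp/cis490-vm-{slot}",
        target_run_dir=f"/tmp/cis490-target-{slot}",
        ssh_port=2222 + slot,
        target_port=21 + slot * 100,
    )


if __name__ == "__main__":
    for s in range(3):
        print(slot_layout(s))
```

One thing worth checking on the lab hosts: slots 0 through 10 map to host ports below 1024, which typically need elevated privileges to bind on Linux unless `PORT_BASE` or `TARGET_PORTS` is overridden.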

vm/setup_bridge.sh Executable file
@ -0,0 +1,56 @@
#!/usr/bin/env bash
# Create the host-only ``br-malware`` bridge for Tier-3+ episodes.
#
# Properties (from docs/architecture.md):
# - Bridge address 10.200.0.1/24 on the host side.
# - NO NAT, NO route, NO DNS: guests cannot reach the LAN or the
# internet. The bridge only carries traffic between the host and
# the guests on it.
# - Lab-host and target VMs both attach via tap devices created by
# the launcher.
#
# Run as root, ONCE per host. Idempotent — re-running is safe.
set -euo pipefail
BRIDGE="${BRIDGE:-br-malware}"
BRIDGE_IP="${BRIDGE_IP:-10.200.0.1/24}"
log() { printf '[setup_bridge] %s\n' "$*" >&2; }
[[ $EUID -eq 0 ]] || { log "must run as root"; exit 1; }
if ! command -v ip >/dev/null; then
log "iproute2 ('ip') is required"
exit 1
fi
if ! ip link show "$BRIDGE" >/dev/null 2>&1; then
log "creating bridge $BRIDGE"
ip link add name "$BRIDGE" type bridge
# Disable spanning-tree on the host-only bridge — it isn't needed
# and adds startup delay.
ip link set "$BRIDGE" type bridge stp_state 0
fi
ip link set "$BRIDGE" up
# Add the host-side address if not already there.
if ! ip -4 addr show dev "$BRIDGE" | grep -qwF -- "${BRIDGE_IP%%/*}"; then
log "adding $BRIDGE_IP to $BRIDGE"
ip addr add "$BRIDGE_IP" dev "$BRIDGE"
fi
# Warn if the kernel could forward between this bridge and any other
# interface. We don't want net.ipv4.ip_forward=1 to leak the malware
# bridge to the LAN.
if [[ "$(cat /proc/sys/net/ipv4/ip_forward)" == "1" ]]; then
log "WARNING: net.ipv4.ip_forward=1; make sure iptables / nftables"
log "block traffic from $BRIDGE to non-loopback devices."
fi
log "bridge ready: $(ip -4 -br addr show "$BRIDGE")"
log ""
log "Launchers can now opt into tap+bridge mode by setting:"
log " BRIDGE=$BRIDGE (tells launch_target.sh to attach a tap to this bridge)"
log "Default launcher behaviour stays SLIRP usermode for simplicity."