CIS490/tools/run_tier3_demo.py
Max Gorog 4ab5477226 PIPELINE §5 step 1: fix four root-cause defects
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
  - perf stat -j writes to stderr by default with -p; we read stdout.
    Add --log-fd 1 so JSON reaches stdout where the parser sees it.
  - Event names come back annotated with the privilege scope perf
    actually measured ("cycles:u" under perf_event_paranoid=2). Strip
    the suffix so _build_row's plain-name lookups hit. Without this
    every metric was None even when perf reported real numbers.
  - tests/test_collectors_emit.py covers the regression with a real
    busy-loop fixture; emit-test discipline per §4.4.
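
  The suffix fix above can be sketched roughly as follows. This is an
  illustration only, not the collector's actual code:
  normalize_event_name and parse_perf_json_lines are hypothetical
  names, the "event" / "counter-value" keys are assumed from perf's
  JSON-lines output, and the modifier alphabet is an assumption too.

```python
import json
from typing import Dict, Optional

# Assumed set of perf event modifier characters (":u", ":k", ":ppp", ...).
_MODIFIER_CHARS = set("ukhGHIpPSDWe")


def normalize_event_name(name: str) -> str:
    """Drop a trailing perf modifier, e.g. "cycles:u" -> "cycles"."""
    base, sep, modifier = name.rpartition(":")
    # Strip only when the suffix consists purely of modifier characters, so
    # tracepoint-style names such as "sched:sched_switch" stay untouched.
    if sep and modifier and set(modifier) <= _MODIFIER_CHARS:
        return base
    return name


def parse_perf_json_lines(text: str) -> Dict[str, Optional[float]]:
    """Map plain event names to counter values from JSON-lines text."""
    counters: Dict[str, Optional[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON record
        record = json.loads(line)
        event = record.get("event")
        if not event:
            continue
        try:
            value: Optional[float] = float(record.get("counter-value"))
        except (TypeError, ValueError):
            value = None  # e.g. "<not counted>"
        counters[normalize_event_name(event)] = value
    return counters
```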

guest-agent collector (rows_guest=0 on 100% of episodes):
  - Alpine cloud image doesn't ship python3, so the in-guest agent's
    `#!/usr/bin/env python3` shebang silently fails. Add packages:
    [python3] to cidata user-data so cloud-init installs it before
    the OpenRC service starts.
  - Guest agent now exits nonzero (was: silent stdout fallback) when
    /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
    reports the failure to /var/log/cis490-agent.log instead of the
    bytes vanishing into the void. Refs §1.
  - Host-side collector emits guest_agent_connected /
    guest_agent_first_byte / guest_agent_silent_window into the
    orchestrator's events.jsonl. Future episodes show the in-guest
    failure mode per-episode instead of inferring from rows_guest=0.

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
  Tier-3 episodes — was misclassified as host divergence):
  - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
    qmp_socket / guest_agent_socket / bridge_iface — even though
    launch_target.sh creates the underlying chardevs and BRIDGE
    supplies the iface. tools/run_real_vm_demo.py wires them
    correctly; Tier-3 had a copy-paste gap.
  - tests/test_collectors_emit.py adds a source-grep regression so
    the wiring stays honest.

samba_usermap_script never lands session (0/67 in §3 probe):
  - Bind handler default WfsDelay (~5s) gives up before bind_perl on
    Metasploitable2 has finished forking + binding LPORT under
    SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
    exploits/driver.py so framework + driver agree on the wait
    budget. Add ConnectTimeout=15 so the handler's bind connect has
    retry budget instead of one-shot.

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
  unconditional, so:
  - With BRIDGE set, requires_bridge modules were still being
    dropped — picker only ever returned samba_usermap_script across
    every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
    failure on HEAD).
  - env.pop("BRIDGE") fired even when BRIDGE was the operator's
    explicit setup, breaking modules that need bridge mode (vsftpd
    backdoor on hardcoded port 6200, distccd, etc.).
  Both made conditional on bridge_set so the picker walks the full
  catalog under bridge mode and SLIRP-only modules still get a
  clean SLIRP env when BRIDGE is unset.
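
  A minimal sketch of the conditional behaviour described above,
  assuming a catalog of entries that carry a requires_bridge flag.
  ModuleSpec, usable_modules and episode_env here are hypothetical
  stand-ins for illustration; the real orchestrator/fleet.py may be
  shaped differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModuleSpec:
    """Hypothetical catalog entry; only the fields this sketch needs."""
    name: str
    requires_bridge: bool = False


def usable_modules(catalog: List[ModuleSpec], bridge_set: bool) -> List[ModuleSpec]:
    """Return the full catalog under bridge mode, SLIRP-safe modules otherwise."""
    if bridge_set:
        return list(catalog)
    return [m for m in catalog if not m.requires_bridge]


def episode_env(base_env: Dict[str, str], bridge_set: bool) -> Dict[str, str]:
    """Copy the environment, dropping BRIDGE only when bridge mode is off."""
    env = dict(base_env)
    if not bridge_set:
        env.pop("BRIDGE", None)  # clean SLIRP env; no-op if already absent
    return env
```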

receiver/app.py: half-pregnant v2 schema state in HEAD. app.py was
  calling store.ingest_stream(episode_type=..., benign_profile=...)
  with kwargs whose matching store.py change was still in the WIP
  stash. Removed v2 awareness from app.py so v1 episodes (what the
  producer ships today) get accepted again. SCHEMA_VERSION default
  reset to 1 to match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:05:25 -05:00


"""Tier-3: real VM, real exploit, honest ``armed -> infecting`` transition.

Boots the vulnerable target VM, drives an msfrpcd-fired exploit module
against it, and lets the orchestrator's host /proc collector sample
the qemu-system pid throughout. Compared to ``run_real_vm_demo.py``:
the workload that crosses the ``armed -> infecting`` boundary is now
generated by an actual exploit landing a session, not by a script in
the guest.

Prereqs:
  - vm/images/<target>.qcow2 (e.g. Metasploitable2)
  - msfrpcd running locally:
        msfrpcd -P <password> -U msf -a 127.0.0.1 -p 55553
  - ``msgpack`` python package installed (added to runtime deps)

Run:
    MSFRPC_PASSWORD=<pass> uv run python tools/run_tier3_demo.py \\
        --module vsftpd_234_backdoor \\
        --data-root data
"""
from __future__ import annotations

import argparse
import logging
import os
import signal
import subprocess
import sys
import time
from pathlib import Path

# Allow running as a script.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from collectors import qmp  # noqa: E402
from exploits.driver import DriverConfig, MSFExploitDriver  # noqa: E402
from exploits.modules import load_module_config  # noqa: E402
from exploits.msfrpc import MSFRpcClient, MSFRpcConfig  # noqa: E402
from orchestrator.episode import EpisodeConfig, EpisodeRunner  # noqa: E402
from samples.manifest import SampleManifest  # noqa: E402

# Same envelope shape as Tier 2 so plots are comparable. Slightly more
# armed/infecting time because real exploit fire + session establishment
# takes hundreds of ms to a few seconds.
DEFAULT_SCHEDULE = [
    ("clean", 10.0),
    ("armed", 3.0),
    ("infecting", 5.0),
    ("infected_running", 25.0),
    ("dormant", 15.0),
    ("infected_running", 20.0),
    ("dormant", 5.0),
    ("clean", 5.0),
]


def _wait_for_path(path: Path, timeout_s: float) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists() and path.read_text().strip():
            return
        time.sleep(0.2)
    raise TimeoutError(f"{path} never appeared within {timeout_s}s")


def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None:
    """Probe a TCP port and block until a connection is accepted and the
    remote service is waiting for client input (recv timeout = success).

    SLIRP completes the TCP handshake before the guest OS boots, making
    a bare ``connect()`` unreliable. However, after the guest kernel and
    TCP stack are up, SLIRP forwards the connection to the guest. If the
    port is not yet open (service not started), the guest RSTs → OSError
    → retry. Once the service is listening and waiting for the client to
    speak first (e.g. Samba on 139), ``recv`` times out → we return.

    To avoid false-positives during early boot (before the guest TCP stack
    is running), callers should enforce a minimum wall-clock wait after
    QEMU starts before calling this function. With a 65 s floor,
    Metasploitable2's kernel and init are always up by the time we probe.
    """
    import socket

    deadline = time.monotonic() + timeout_s
    last_err: Exception | None = None
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0) as s:
                # recv with a generous timeout. Three outcomes:
                #   socket.timeout → service is up, waiting for client data ✓
                #   data received  → service sent a banner; also up ✓
                #   OSError/reset  → port closed; retry
                s.settimeout(3.0)
                try:
                    s.recv(1)
                except socket.timeout:
                    pass  # service alive, waiting for us to speak first
                return
        except OSError as e:
            last_err = e
            time.sleep(1.0)
    raise TimeoutError(
        f"target service {host}:{port} not reachable within {timeout_s}s "
        f"(last: {last_err})"
    )


# Metasploitable2 takes ~50-70 s to boot fully under normal load.
# SLIRP accepts TCP connections before the guest TCP stack is up, so we
# must wait at least this long before _wait_for_tcp will give a reliable
# signal. 65 s is a safe floor; the boot-timeout arg covers the rest.
_METASPLOITABLE2_MIN_BOOT_S: float = 65.0


def main() -> int:
    parser = argparse.ArgumentParser(prog="run_tier3_demo")
    parser.add_argument("--data-root", default="data")
    parser.add_argument("--interval-ms", type=int, default=100)
    parser.add_argument(
        "--module",
        default="vsftpd_234_backdoor",
        help="Module config name in exploits/modules/<name>.toml",
    )
    parser.add_argument(
        "--target-ip",
        default="127.0.0.1",
        help="Address the exploit module sets RHOSTS to. With the SLIRP "
        "launcher (default), the guest's vulnerable port is hostfwd'd to "
        "loopback; on a host-only bridge, this is the guest's bridge IP.",
    )
    parser.add_argument(
        "--target-port",
        type=int,
        default=21,
        help="Probe port to wait on before firing the exploit",
    )
    parser.add_argument(
        "--run-dir",
        # Per-slot defaults so the fleet runner's parallel calls don't
        # collide on the same /tmp dir. See run_real_vm_demo.py for
        # the same fix.
        default=(
            os.environ.get("RUN_DIR")
            or f"/tmp/cis490-target-{os.environ.get('SLOT', '0')}"
        ),
        help="QEMU run dir (sockets + pidfile)",
    )
    parser.add_argument(
        "--msfrpc-host", default=os.environ.get("MSFRPC_HOST", "127.0.0.1"),
    )
    parser.add_argument(
        "--msfrpc-port", type=int,
        default=int(os.environ.get("MSFRPC_PORT", "55553")),
    )
    parser.add_argument(
        "--msfrpc-user", default=os.environ.get("MSFRPC_USER", "msf"),
    )
    parser.add_argument(
        "--keep-vm",
        action="store_true",
        help="leave the VM running after the episode finishes",
    )
    parser.add_argument(
        "--target-boot-timeout",
        type=float,
        default=180.0,
        help="how long to wait for the guest's vulnerable service to listen",
    )
    parser.add_argument(
        "--sample",
        default=os.environ.get("SAMPLE_NAME"),
        help="Pick a workload profile from the manifest by name. Fleet runner "
        "passes this via SAMPLE_NAME env. Without it, falls back to the v1 yes-loop.",
    )
    parser.add_argument(
        "--manifest",
        default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"),
    )
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("cis490.run_tier3_demo")

    msfrpc_password = os.environ.get("MSFRPC_PASSWORD")
    if not msfrpc_password:
        log.error("MSFRPC_PASSWORD env var must be set")
        return 2

    repo_root = Path(__file__).resolve().parent.parent
    launcher = repo_root / "vm" / "launch_target.sh"
    modules_dir = repo_root / "exploits" / "modules"
    module_path = modules_dir / f"{args.module}.toml"
    if not module_path.exists():
        log.error("no module config at %s", module_path)
        return 2
    module = load_module_config(module_path)
    log.info("module loaded: %s (%s)", module.name, module.module_path)

    sample = None
    if args.sample:
        manifest = SampleManifest.load(args.manifest)
        sample = next((s for s in manifest.samples if s.name == args.sample), None)
        if sample is None:
            log.error("sample %r not in manifest %s", args.sample, args.manifest)
            return 2
        log.info("sample=%s profile=%s kind=%s",
                 sample.name, sample.profile, sample.kind)

    run_dir = Path(args.run_dir)
    # Kill any QEMU still holding this slot's run_dir from a previous wave.
    # QEMU is started with start_new_session=True, so it outlives an
    # orchestrator SIGTERM unless we clean it up explicitly here.
    old_pid_file = run_dir / "qemu.pid"
    if old_pid_file.exists():
        try:
            old_pid = int(old_pid_file.read_text().strip())
            # Signal the whole process group, not just the recorded pid.
            os.killpg(os.getpgid(old_pid), signal.SIGTERM)
            time.sleep(1.5)
        except (ProcessLookupError, ValueError, OSError):
            pass  # stale or unreadable pidfile; nothing to kill
    if run_dir.exists():
        import shutil
        shutil.rmtree(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    pid_file = run_dir / "qemu.pid"

    log.info("booting target VM via %s (RUN_DIR=%s)", launcher, run_dir)
    env = os.environ.copy()
    env["RUN_DIR"] = str(run_dir)
    qemu = subprocess.Popen(
        [str(launcher)],
        cwd=str(repo_root),
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    try:
        _wait_for_path(pid_file, timeout_s=15.0)
        qemu_pid = int(pid_file.read_text().strip())

        # Enforce a minimum boot floor before probing. SLIRP completes the
        # TCP handshake immediately even before the guest kernel loads, so
        # a bare connect() would always succeed. After the floor the guest
        # TCP stack is up and a recv-timeout from the probe means the
        # service is genuinely listening. The floor is counted from the
        # moment the pidfile appears, so no elapsed time is subtracted.
        boot_floor = _METASPLOITABLE2_MIN_BOOT_S
        if boot_floor > 0:
            log.info("qemu pid = %d; waiting %.0fs for target OS to boot before probing",
                     qemu_pid, boot_floor)
            time.sleep(boot_floor)
        remaining = max(0.0, args.target_boot_timeout - boot_floor)
        log.info("probing %s:%d (up to %.0fs remaining)",
                 args.target_ip, args.target_port, remaining)
        _wait_for_tcp(args.target_ip, args.target_port, timeout_s=remaining)
        log.info("target service is up")

        # Pre-exploit savevm so EpisodeConfig.revert_at_{start,end}
        # has a known-good baseline to load. Best-effort: we still
        # run the episode if savevm fails (just without revert
        # support). See run_real_vm_demo.py for the same pattern.
        qmp_sock = run_dir / "qmp.sock"
        if qmp_sock.exists():
            try:
                _qmp = qmp.QMPClient(qmp_sock)
                _qmp.connect()
                try:
                    out = _qmp.savevm("baseline-v1")
                    log.info("savevm baseline-v1 OK: %s", out.strip()[:160])
                finally:
                    _qmp.close()
            except Exception as e:
                log.warning("savevm failed; revert_at_start unusable: %s", e)

        client = MSFRpcClient(
            MSFRpcConfig(
                host=args.msfrpc_host,
                port=args.msfrpc_port,
                user=args.msfrpc_user,
                password=msfrpc_password,
            )
        )

        # Wire the same collector sockets the Tier-2 path wires. Without
        # these, EpisodeConfig defaults to None and the qmp / guest-agent
        # / pcap collectors never start, even though launch_target.sh
        # creates the qmp.sock + agent.sock chardevs and BRIDGE supplies
        # the iface. Refs PIPELINE.md §4.4: a collector that appears
        # configured but emits zero rows is exactly the silent-downgrade
        # pattern §1 forbids.
        agent_sock = run_dir / "agent.sock"
        cfg = EpisodeConfig(
            target_pid=qemu_pid,
            duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
            interval_ms=args.interval_ms,
            data_root=Path(args.data_root),
            phase_schedule=DEFAULT_SCHEDULE,
            image_name=module.name + "-target",
            snapshot_name="baseline-v1",
            qmp_socket=qmp_sock if qmp_sock.exists() else None,
            guest_agent_socket=agent_sock if agent_sock.exists() else None,
            bridge_iface=os.environ.get("BRIDGE") or None,
            sample=sample,
            exploit_meta={
                "framework": "metasploit",
                "module": module.module_path,
                "module_type": module.module_type,
                "module_name": module.name,
                "payload": module.payload_path,
                "rport": module.options.get("RPORT"),
                "rhost_template": module.options.get("RHOSTS"),
            },
        )
        runner = EpisodeRunner(cfg)
        driver = MSFExploitDriver(
            client=client,
            module=module,
            cfg=DriverConfig(
                target_ip=args.target_ip,
                # Override RPORT when target_port is an unprivileged host port
                # (i.e. fleet runner remapped the guest's privileged port to a
                # loopback port > 1024). When target_port == module RPORT the
                # caller wants direct guest access; leave RPORT unchanged.
                target_port=args.target_port if args.target_port > 1024 else None,
                sample_store_root=repo_root / "samples" / "store",
            ),
            emit_event=runner.emit_event,
            sample=sample,
        )
        runner.on_phase = driver.set_phase

        driver.setup()
        try:
            result = runner.run()
        finally:
            driver.teardown()

        print()
        print(f"episode_id = {result.episode_id}")
        print(f"path = {result.episode_dir}")
        print(f"rows_proc = {result.rows_proc}")
        print(f"phases = {result.phases_observed}")
        print(f"module = {module.module_path}")
        print()
        print("To plot:")
        print(f" uv run python tools/plot_envelope.py {result.episode_dir}")
        return 0
    finally:
        if not args.keep_vm:
            log.info("shutting down VM (pid=%d)", qemu.pid)
            try:
                os.killpg(os.getpgid(qemu.pid), signal.SIGTERM)
            except ProcessLookupError:
                pass
            try:
                qemu.wait(timeout=5)
            except subprocess.TimeoutExpired:
                os.killpg(os.getpgid(qemu.pid), signal.SIGKILL)


if __name__ == "__main__":
    sys.exit(main())