Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01)

Root causes and fixes documented in TIER3-BRINGUP.md. Summary:

1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
   instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.

2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
   modules selected on SLIRP runs; fix: always filter requires_bridge.

3. cmd/unix/interact creates no session.list entry → session_open_timeout
   every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.

4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
   fix: extra_host_port:extra_host_port mapping so guest binds the
   per-slot LPORT directly.

5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
   fix: requires_bridge=true filters it from SLIRP fleet runs.

6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
   boots (~60 s too early); fix: replace TCP probe with serial console
   _wait_for_serial_login that waits for actual "login:" prompt.

7. Stale QEMU survives orchestrator restart (start_new_session=True) →
   holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
   old pidfile before rmtree.

8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.

9. msfrpcd 6.x returns bytes for all string values even with raw=False;
   fix: MSFRpcClient._str() recursive decoder applied to all responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Elliott Kolden 2026-05-02 12:26:19 -06:00
parent 86bd9e21d7
commit 667f042707
14 changed files with 417 additions and 46 deletions

TIER3-BRINGUP.md Normal file

@ -0,0 +1,190 @@
# Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the `Dev_REL1_043026` merge of main.
---
## Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
**Symptom:** All Tier-3 slots timeout at 300 s waiting for the target
service. QEMU starts with `netdev tap` instead of `netdev user` (SLIRP).
**Root cause:** `launch_target.sh` checks `BRIDGE` to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
`BRIDGE=br-malware` from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.
**Fix:** `fleet.py` `_run_slot()` now calls `env.pop("BRIDGE", None)` before
launching `run_tier3_demo.py`. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
**Files:** `orchestrator/fleet.py`
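
A minimal sketch of the environment scrubbing described above, assuming the child environment is built from a copy of the parent's. The helper name `tier3_env` and the sample values are hypothetical; only the `env.pop("BRIDGE", None)` call is taken from the fix.

```python
import os


def tier3_env(parent=None):
    """Return a copy of the parent environment without BRIDGE.

    Hypothetical helper: launch_target.sh uses SLIRP+hostfwd when BRIDGE
    is unset, so the Tier-3 child must not inherit it from Tier-2 setup.
    """
    env = dict(os.environ if parent is None else parent)
    env.pop("BRIDGE", None)  # no-op when BRIDGE was never set
    return env


# Illustrative parent environment (values are made up).
parent = {"BRIDGE": "br-malware", "FLEET_HOST_ID": "lab-1"}
child = tier3_env(parent)
print(sorted(child))   # BRIDGE is gone from the child
print(sorted(parent))  # the parent mapping is left untouched
```
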
---
## Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
**Symptom:** `distccd_command_exec` and `php_cgi_arg_injection` appear in
`usable_modules` even on SLIRP-only runs. Exploit fires but the reverse-shell
payload can't call back (no guest egress on `restrict=on`).
**Root cause:** `usable_modules` filtering was conditioned on `bridge_iface`
being set in the environment. When BRIDGE was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.
**Fix:** `usable_modules` now always filters `requires_bridge=True` modules
regardless of the BRIDGE env var. The `requires_bridge` field in the module
TOML is authoritative.
**Files:** `orchestrator/fleet.py`, `exploits/modules/*.toml`
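
The unconditional filter can be sketched as follows. The `Module` class is a stand-in for the real module config and carries only the field used here; the `requires_bridge` flags in the sample catalog are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Stand-in for the real module config; only the fields used here."""
    name: str
    requires_bridge: bool = False


def usable_on_slirp(modules):
    """Keep modules that can land a session without bridge egress.

    The filter does not look at the BRIDGE env var at all: the
    per-module flag is the only input.
    """
    return {k: m for k, m in modules.items() if not m.requires_bridge}


# Sample catalog; the flag values are assumed, not read from the TOML files.
catalog = {
    "samba_usermap_script": Module("samba_usermap_script"),
    "distccd_command_exec": Module("distccd_command_exec", requires_bridge=True),
    "vsftpd_234_backdoor": Module("vsftpd_234_backdoor", requires_bridge=True),
}
print(sorted(usable_on_slirp(catalog)))
```
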
---
## Bug 3 — `cmd/unix/interact` creates no persistent session
**Symptom:** `samba_usermap_script` fires (job_id=None), no session appears in
`session.list` after 30 s. The exploit succeeds on the wire but the driver
reports `session_open_timeout`.
**Root cause:** `cmd/unix/interact` is a console-only payload. It attaches
directly to the module's job console — it does NOT create a background
Meterpreter/shell session visible via `session.list`. msfrpcd's
`module.execute` returns `job_id=None` (no background job), and
`wait_for_new_session` polls `session.list` until `session_open_timeout_s` (30 s) expires.
**Fix:** Changed payload to `cmd/unix/bind_perl` with `LPORT=4444`. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to `RHOSTS:LPORT` after the exploit fires, creating a proper shell session.
**Files:** `exploits/modules/samba_usermap_script.toml`
---
## Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
**Symptom:** For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. `ss -tlnp` on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.
**Root cause:** The extra hostfwd was `host:5444→guest:4444` (old guest port)
but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
**Fix:** Extra hostfwd now uses `extra_host_port:extra_host_port` on both
sides. `extra_host_port = base_port + slot * 1000` is the per-slot LPORT, and
the guest binds that exact port.
**Files:** `orchestrator/fleet.py`
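
A sketch of the per-slot port arithmetic, assuming the `+2000` service-port shift and `slot * 1000` offset used by the fleet runner. The function name `slot_port_plan` is hypothetical; it only illustrates how the `host:guest` pairs line up after the fix.

```python
def slot_port_plan(guest_port, slot, extra_guest_ports=()):
    """Return (TARGET_PORTS string, per-slot LPORT or None).

    The service port is shifted by +2000 and by slot*1000 on the host
    side. Each extra bind port is shifted by slot*1000 and forwarded to
    the SAME number inside the guest, because the payload is told to
    bind the per-slot LPORT rather than the module's base LPORT.
    """
    host_port = guest_port + 2000 + slot * 1000
    pairs = [f"{host_port}:{guest_port}"]
    lport = None
    for base in extra_guest_ports:
        lport = base + slot * 1000
        pairs.append(f"{lport}:{lport}")  # same port on host and guest
    return ",".join(pairs), lport


for slot in range(3):
    print(slot, slot_port_plan(139, slot, (4444,)))
```
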
---
## Bug 5 — vsftpd module port 6200 collision across concurrent slots
**Symptom:** Multiple Tier-3 slots running vsftpd_234_backdoor all try to
hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fail to start
because port 6200 is already bound by slot 0's QEMU.
**Root cause:** vsftpd's backdoor hardcodes port 6200 in both the vulnerable
binary and the Metasploit module. There is no LPORT override possible. With
SLIRP+hostfwd, all concurrent slots must use the same host port, which is
impossible.
**Fix:** Marked `vsftpd_234_backdoor.toml` with `requires_bridge = true`. The
fleet runner filters it from `usable_modules` on SLIRP runs. When a bridge is
available each guest gets its own IP, and msfrpcd connects to `guest_ip:6200`
directly.
**Files:** `exploits/modules/vsftpd_234_backdoor.toml`
---
## Bug 6 — SLIRP false-positive in `_wait_for_tcp` causes premature exploit fire
**Symptom:** Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30 to 60 s
to boot Samba. Result: `session_open_timeout` every episode.
**Root cause:** SLIRP's usermode TCP stack completes the TCP three-way
handshake (SYN-ACK) immediately for any port that has a `hostfwd` rule,
regardless of whether the guest OS has booted. A bare `socket.create_connection()`
always succeeds. Adding a `recv()` with a short timeout (0.5 s) does not help:
during very early boot SLIRP cannot RST the connection (the guest TCP stack is
not up yet), so the connection hangs open, `recv()` raises `socket.timeout`,
and the probe treats that timeout as "service is up".
**Fix:** Replaced `_wait_for_tcp` with `_wait_for_serial_login`. The new
function connects to QEMU's serial console socket (`serial.sock`) right after
the pidfile appears and streams boot output until `"login:"` is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.
Timing:
- `serial.sock` is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints `"metasploitable login:"` ≈ 50 to 70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires
when Samba is reliably up.
**Files:** `tools/run_tier3_demo.py`
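
The prompt-scanning loop can be exercised without QEMU. The sketch below is not the function in `run_tier3_demo.py`: it takes an already-connected socket, returns a bool instead of raising, and is driven by a local socket pair with made-up boot text. `wait_for_prompt` is a hypothetical name.

```python
import socket
import time


def wait_for_prompt(sock, timeout_s, prompt=b"login:"):
    """Read from a connected stream socket until `prompt` shows up.

    Returns True when the prompt is seen, False on EOF or deadline.
    The buffer is accumulated, so a prompt split across two reads is
    still matched. Matching is case-insensitive on the received data.
    """
    deadline = time.monotonic() + timeout_s
    buf = b""
    sock.settimeout(0.2)
    while time.monotonic() < deadline:
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            continue  # nothing yet; keep waiting until the deadline
        if not chunk:
            return False  # peer closed before the prompt appeared
        buf += chunk
        if prompt in buf.lower():
            return True
    return False


# Simulate a guest console with a local socket pair (no QEMU involved).
guest, host = socket.socketpair()
guest.sendall(b"booting...\nmetasploitable log")  # prompt split in two writes
guest.sendall(b"in: ")
print(wait_for_prompt(host, timeout_s=2.0))
guest.close()
host.close()
```
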
---
## Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
**Symptom:** After a systemd restart of `cis490-orchestrator`, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
`start_new_session=True` so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up — but the stale QEMU has different hostfwd mappings (no bind port
for the current module), so the exploit never lands.
**Fix:** `run_tier3_demo.py` reads the old `qemu.pid` file from the run
directory before recreating it. If a PID is found, `os.killpg(pgid, SIGTERM)`
terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
exit before the port is rebound.
**Files:** `tools/run_tier3_demo.py`
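
A sketch of the pidfile-based cleanup, assuming the recorded process was started in its own session (as with `start_new_session=True`), so its process group contains only that process and its children. `kill_stale_group` is a hypothetical helper; the guard against small PIDs is an addition for safety, not part of the fix.

```python
import os
import signal
from pathlib import Path


def kill_stale_group(pid_file):
    """SIGTERM the process group recorded in `pid_file`, if it is alive.

    Returns True when a signal was sent. A missing file, a garbage PID,
    or an already-dead process all count as "nothing to do".
    Only safe when the recorded process leads its own session; otherwise
    the group may include the caller.
    """
    try:
        pid = int(Path(pid_file).read_text().strip())
        if pid <= 1:
            return False  # never signal our own group (0) or init (1)
        os.killpg(os.getpgid(pid), signal.SIGTERM)
    except (FileNotFoundError, ValueError, ProcessLookupError, PermissionError):
        return False
    return True


# With no pidfile present there is nothing to kill.
print(kill_stale_group("/nonexistent/qemu.pid"))
```
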
---
## Bug 8 — `PORT_BASE` default uses privileged ports (< 1024)
**Symptom:** `launch_target.sh`'s default `PORT_BASE` was `21 + SLOT * 100`.
On Tier-2 hosts without Metasploitable2, standalone `run_tier3_demo.py` tries
to bind port 21 on loopback. The `cis490` service user cannot bind ports
< 1024. QEMU exits immediately.
**Fix:** Default changed to `2021 + SLOT * 100`. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).
**Files:** `vm/launch_target.sh`, `scripts/install-tier-3-4.sh`
---
## Bug 9 — msfrpc `module.execute` response is raw msgpack bytes, not str
**Symptom:** Key lookups on the `module.execute` response raise `KeyError`
or fail silently because msgpack returns `bin` type (bytes) for all string
values, even with `raw=False` on some Metasploit 6.x builds.
**Fix:** Added `MSFRpcClient._str()` to recursively decode bytes→str in all
msgpack response dicts. Applied to `module.execute` and `session.list`.
**Files:** `exploits/msfrpc.py`
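
To see why the lookups fail, here is a standalone sketch of the same recursive normalization applied to a hand-built response. The `normalize` name and the sample values are illustrative; no msgpack or msfrpcd is involved.

```python
def normalize(value):
    """Recursively decode bytes to str inside dicts and lists."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        return {normalize(k): normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


# Shape of a response whose strings arrived as bytes (values are made up).
raw = {b"job_id": 7, b"uuid": b"abc123", b"tags": [b"a", b"b"]}
print("job_id" in raw)             # str key does not match the bytes key
print("job_id" in normalize(raw))  # matches after normalization
```
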
---
## Net result after all fixes
With fixes 1-9 applied:
- All 4 Tier-3 slots use SLIRP+hostfwd with correct per-slot port mapping.
- `samba_usermap_script` fires `cmd/unix/bind_perl` with the correct per-slot
LPORT; msfrpcd connects to the bind port via hostfwd.
- The exploit fires only after Metasploitable2 confirms its login prompt on
the serial console (~60 s after QEMU start).
- Sessions open, workloads execute, episodes complete with `session_open`
events (not `session_open_timeout`).


@ -14,6 +14,9 @@ WorkingDirectory=/opt/cis490
# /etc/cis490/lab-host.env is written by scripts/install-lab-host.sh;
# carries FLEET_HOST_ID, BRIDGE, and any operator-supplied overrides.
EnvironmentFile=/etc/cis490/lab-host.env
# msfrpc credentials (written by install-msfrpcd.sh). Optional (-) so the
# unit still starts on Tier-2-only hosts where msfrpcd isn't installed.
EnvironmentFile=-/etc/cis490/msfrpc.env
# Fleet mode: detect host capacity, run that many concurrent episodes
# per wave with samples drawn from the manifest. Each invocation runs
# one wave and exits; systemd respawns per Restart= below, giving us
@ -22,7 +25,8 @@ EnvironmentFile=/etc/cis490/lab-host.env
ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
--data-root /var/lib/cis490/data \
--manifest /opt/cis490/samples/manifest.toml \
--waves 1
--waves 1 \
--max-tier3-slots 4
Restart=always
RestartSec=15


@ -27,6 +27,7 @@ adapter between the phase machine and msfrpc.
from __future__ import annotations
import logging
import os
import time
from dataclasses import dataclass
from typing import Callable
@ -52,6 +53,10 @@ EmitEvent = Callable[..., None]
class DriverConfig:
target_ip: str
session_open_timeout_s: float = 30.0
# HOST_PORT for the module's service. When set, overrides RPORT in the
# module's options so msfrpcd connects to the hostfwd'd loopback port
# rather than the guest's privileged port directly.
target_port: int | None = None
# Driver v1 fallback workload — used only when no Sample is passed
# in (Sample-driven runs override these via exploits.workloads).
# We keep the v1 path so existing callers keep working unchanged.
@ -185,6 +190,15 @@ class MSFExploitDriver:
log.debug("module already fired; skipping re-fire")
return
opts = self.module.render_options(target_ip=self.cfg.target_ip)
if self.cfg.target_port is not None:
opts["RPORT"] = self.cfg.target_port
# Fleet sets FLEET_PAYLOAD_LPORT to the per-slot host port for
# bind-shell payloads (cmd/unix/bind_perl etc.) so the handler
# connects to the right hostfwd'd loopback port.
fleet_lport = os.environ.get("FLEET_PAYLOAD_LPORT")
if fleet_lport and "LPORT" in opts:
opts["LPORT"] = int(fleet_lport)
log.info("LPORT overridden to %s (FLEET_PAYLOAD_LPORT)", fleet_lport)
self.emit(
"exploit_fire",
module=self.module.module_path,


@ -45,6 +45,11 @@ class ModuleConfig:
# The fleet runner skips these unless BRIDGE is set so episodes
# that fire them actually produce data.
requires_bridge: bool = False
# Guest ports the fleet must also hostfwd (in addition to RPORT).
# Used for bind-shell payloads where the handler connects to a
# separate port. Fleet calculates per-slot host ports and sets
# FLEET_PAYLOAD_LPORT so the driver can override LPORT at fire time.
extra_target_ports: tuple[int, ...] = ()
def render_options(self, *, target_ip: str) -> dict[str, Any]:
"""Substitute ``{{ target_ip }}`` placeholders in options.
@ -99,6 +104,9 @@ def load_module_config(path: Path) -> ModuleConfig:
expected_session_type=raw.get("session", {}).get("type", "shell"),
description=raw.get("description", ""),
requires_bridge=bool(raw.get("runtime", {}).get("requires_bridge", False)),
extra_target_ports=tuple(
int(p) for p in raw.get("runtime", {}).get("extra_target_ports", [])
),
)


@ -2,8 +2,10 @@ description = """
Samba 3.0.20 username-map command injection (CVE-2007-2447). Trigger
is a crafted username at SMB authentication; the Samba daemon shells
out via the username_map_script and runs whatever the attacker put in
the username. Standard Metasploitable2 vector. Returns a root shell
on the SMB socket; works with cmd/unix/interact.
the username. Standard Metasploitable2 vector. Uses a bind-perl
payload so msfrpcd can connect to the resulting shell via SLIRP
hostfwd; LPORT is fleet-assigned per slot (base 4444, +1000/slot)
to avoid collisions across concurrent episodes.
"""
[module]
@ -15,7 +17,16 @@ RHOSTS = "{{ target_ip }}"
RPORT = 139
[payload]
path = "cmd/unix/interact"
path = "cmd/unix/bind_perl"
[payload.options]
LPORT = 4444
[session]
type = "shell"
[runtime]
# bind_perl opens a new guest port; fleet hostfwds it via SLIRP.
# No bridge egress needed — host connects in, not guest out.
requires_bridge = false
extra_target_ports = [4444]


@ -1,8 +1,14 @@
description = """
vsftpd 2.3.4 intentional backdoor (CVE-2011-2523). Triggered by an FTP
USER name ending with ':)'. Standard Metasploitable2 exploit, fully
deterministic; perfect for a Tier-3 first-light run because the
exploit fire timing is bounded by a single FTP round-trip.
deterministic; perfect for a Tier-3 first-light run.
NOTE: The backdoor binds a shell on port 6200 (hardcoded in both the
vulnerable vsftpd binary AND the Metasploit module; not overridable).
msfrpcd connects to RHOSTS:6200 after triggering the backdoor. With
SLIRP+restrict=on and multiple concurrent slots, port 6200 can only be
hostfwd'd once, causing collisions. Requires BRIDGE so the exploit
handler can reach guest:6200 directly via the bridge IP.
"""
[module]
@ -12,12 +18,14 @@ path = "unix/ftp/vsftpd_234_backdoor"
[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 21
# The exploit returns its own command shell — we drive it with a
# minimal cmd/unix/interact payload so the session lands as a plain
# shell session usable by session.shell_write/read.
[payload]
path = "cmd/unix/interact"
[session]
type = "shell"
[runtime]
# Port 6200 (backdoor bind) is hardcoded; can't offset per-slot.
# Requires bridge so all concurrent slots get distinct guest IPs.
requires_bridge = true


@ -104,8 +104,8 @@ class MSFRpcClient:
if "job_id" not in resp:
raise MSFRpcError(f"module.execute returned no job_id: {resp!r}")
log.info(
"module.execute %s/%s -> job_id=%s uuid=%s",
module_type, module_name, resp["job_id"], resp.get("uuid"),
"module.execute %s/%s -> job_id=%s uuid=%s resp=%r",
module_type, module_name, resp["job_id"], resp.get("uuid"), resp,
)
return resp
@ -154,6 +154,22 @@ class MSFRpcClient:
def _call_no_auth(self, method: str, *args: Any) -> dict[str, Any]:
return self._raw_call([method, *args])
@staticmethod
def _str(v: Any) -> Any:
"""Decode bytes to str; recursively normalize dicts and lists.
msfrpcd (pacman metasploit 6.x) returns msgpack bin type for all
string values, so raw=False still gives bytes. Normalise the whole
response tree so callers can use plain str keys/values.
"""
if isinstance(v, bytes):
return v.decode("utf-8", errors="replace")
if isinstance(v, dict):
return {MSFRpcClient._str(k): MSFRpcClient._str(val) for k, val in v.items()}
if isinstance(v, list):
return [MSFRpcClient._str(i) for i in v]
return v
def _raw_call(self, payload: list[Any]) -> dict[str, Any]:
body = msgpack.packb(payload, use_bin_type=False)
conn = self._open_conn()
@ -180,7 +196,7 @@ class MSFRpcClient:
conn.close()
try:
decoded = msgpack.unpackb(raw, raw=False)
decoded = self._str(msgpack.unpackb(raw, raw=False))
except Exception as e:
raise MSFRpcError(f"could not decode msfrpcd response: {e}") from e
@ -221,11 +237,18 @@ def wait_for_new_session(
) -> tuple[int, dict[str, Any]] | None:
"""Poll ``session.list`` until a session id we haven't seen before
appears, or until timeout. Returns ``(session_id, info)`` or None."""
log = __import__("logging").getLogger("cis490.msfrpc")
deadline = time.monotonic() + timeout_s
logged_empty = False
while time.monotonic() < deadline:
sessions = client.session_list()
if not logged_empty:
log.debug("wait_for_new_session: seen=%r current=%r", seen, list(sessions.keys()))
logged_empty = True
for sid, info in sessions.items():
if sid not in seen:
return sid, info
time.sleep(poll_s)
# Log final state on timeout
log.debug("wait_for_new_session timeout: final sessions=%r", client.session_list())
return None


@ -109,6 +109,12 @@ class FleetConfig:
# Force Tier-2 even when msfrpcd is up; used by tests + dev runs
# that want a no-exploit baseline.
force_tier2: bool = False
# Limit how many slots per wave run as Tier-3. Slots 0..N-1 get
# Tier-3; the rest fall back to Tier-2. Metasploitable2 boot is IO-
# bound: running >~6 concurrent target VMs saturates disk and causes
# all slots to timeout waiting for the guest service to come up.
# None = no cap (all eligible slots use Tier-3).
max_tier3_slots: int | None = None
# msfrpcd connectivity (read by tier-3 driver via env).
msfrpcd_host: str = "127.0.0.1"
msfrpcd_port: int = 55553
@ -237,26 +243,19 @@ def _run_slot(
run_dir_base = "/tmp/cis490-vm-fleet"
# Decide tier.
bridge_iface = os.environ.get("BRIDGE") or None
# Filter the catalog to modules that can actually fire under the
# current launcher mode. Reverse / bind shells require the host-
# only bridge (no SLIRP+restrict=on guest egress), so skip those
# when BRIDGE isn't set; otherwise the exploit fires but the
# session never lands and the episode degenerates to a 30 s
# session_open_timeout.
if cfg.modules:
if bridge_iface:
usable_modules = dict(cfg.modules)
else:
usable_modules = {
k: v for k, v in cfg.modules.items() if not v.requires_bridge
}
else:
usable_modules = {}
# Tier-3 target VMs always use SLIRP+hostfwd so msfrpcd can reach
# the guest via loopback. BRIDGE tap is for the Tier-2 idle VM only
# (pcap source 4). Skip modules that need bridge egress (bind/reverse
# shells that open a callback port the guest dials back or binds).
usable_modules: dict[str, ModuleConfig] = (
{k: v for k, v in cfg.modules.items() if not v.requires_bridge}
if cfg.modules else {}
)
tier3_ready = (
not cfg.force_tier2
and bool(usable_modules)
and _msfrpcd_available(cfg.msfrpcd_host, cfg.msfrpcd_port)
and (cfg.max_tier3_slots is None or slot < cfg.max_tier3_slots)
)
env = os.environ.copy()
@ -280,15 +279,33 @@ def _run_slot(
usable_modules,
host_id=cfg.host_id, slot=slot, episode_index=episode_index,
)
target_port = module_target_port(module) or 21
guest_port = module_target_port(module) or 21
# HOST_PORT: unprivileged port QEMU hostfwd's to the guest service.
# +2000 shifts all base ports above 1024 (vsftpd:21->2021,
# http:80->2080, smb:139->2139, distcc:3632->5632, irc:6667->8667).
# Slot offset prevents concurrent targets from colliding on loopback.
host_port = guest_port + 2000 + slot * 1000
# Per-slot runner dir for the target VM.
run_dir = f"{run_dir_base}-target-{slot}"
env["RUN_DIR"] = run_dir
# Each slot gets a unique host-side hostfwd port so concurrent
# targets don't collide on the loopback port.
env["PORT_BASE"] = str(target_port + slot * 1000)
if bridge_iface:
env["BRIDGE"] = bridge_iface
env["PORT_BASE"] = str(host_port)
# Main service port pair, plus per-slot bind ports for payloads
# like cmd/unix/bind_perl that open a separate listener in the guest.
# Per-slot offset (base + slot*1000) prevents collisions.
target_ports = f"{host_port}:{guest_port}"
for extra_guest_port in module.extra_target_ports:
# Per-slot LPORT: base + slot*1000. FLEET_PAYLOAD_LPORT overrides
# the payload's LPORT so the guest binds this exact port. The
# hostfwd maps the same number on both sides because the guest's
# bind port equals the per-slot LPORT (not the module's base LPORT).
extra_host_port = extra_guest_port + slot * 1000
target_ports += f",{extra_host_port}:{extra_host_port}"
env["FLEET_PAYLOAD_LPORT"] = str(extra_host_port)
env["TARGET_PORTS"] = target_ports
# Remove BRIDGE so launch_target.sh uses SLIRP+hostfwd instead of
# tap. Target VM connectivity goes through the hostfwd loopback ports;
# tap/bridge requires guest-IP discovery which isn't wired up yet.
env.pop("BRIDGE", None)
cmd = [
py,
str(cfg.repo_root / "tools" / "run_tier3_demo.py"),
@ -296,7 +313,8 @@ def _run_slot(
"--run-dir", run_dir,
"--module", module.name,
"--sample", sample.name,
"--target-port", str(target_port + slot * 1000),
"--target-port", str(host_port),
"--target-boot-timeout", "300",
]
tier = "tier3"
module_name: str | None = module.name
@ -314,6 +332,10 @@ def _run_slot(
module_name = None
if not cfg.force_tier2 and not cfg.modules:
log.warning("slot=%d falling back to Tier 2: empty module catalog", slot)
elif not cfg.force_tier2 and not usable_modules:
log.warning("slot=%d falling back to Tier 2: no non-bridge modules available", slot)
elif not cfg.force_tier2 and cfg.max_tier3_slots is not None and slot >= cfg.max_tier3_slots:
log.debug("slot=%d Tier 2 by max_tier3_slots=%d cap", slot, cfg.max_tier3_slots)
elif not cfg.force_tier2:
log.warning("slot=%d falling back to Tier 2: msfrpcd unreachable at %s:%d",
slot, cfg.msfrpcd_host, cfg.msfrpcd_port)


@ -244,6 +244,8 @@ fi
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT/vm/images"
ln -sf "$ALPINE_IMG" "$INSTALL_ROOT/vm/images/alpine-baseline.qcow2" 2>/dev/null || true
ln -sf "$CIDATA_ISO" "$INSTALL_ROOT/vm/images/cidata.iso" 2>/dev/null || true
M2_IMG="$DATA_ROOT/vm/images/metasploitable2.qcow2"
[[ -f "$M2_IMG" ]] && ln -sf "$M2_IMG" "$INSTALL_ROOT/vm/images/metasploitable2.qcow2" 2>/dev/null || true
# --- 8. Tier-3 + Tier-4 deploy (auto, idempotent) ----------------------
# Bring up msfrpcd + Metasploitable2 + bridge + verify. Skipped only if


@ -102,6 +102,12 @@ else
fi
# --- 3. systemd unit ----------------------------------------------------
# msfrpcd writes module cache + logs to $HOME/.msf4. With ProtectHome=true
# the service can't reach /root, so we redirect HOME to a path under
# /var/lib/cis490 that is always writable.
MSF_HOME="/var/lib/cis490/msf4"
install -d -m 0755 -o root -g root "$MSF_HOME"
log "installing systemd unit"
cat > "$UNIT" <<EOF
[Unit]
@ -113,6 +119,7 @@ Wants=network-online.target
[Service]
Type=simple
EnvironmentFile=$ENV_FILE
Environment=HOME=$MSF_HOME
# msfrpcd flags:
# -P <pw> password
# -U <user> username


@ -101,7 +101,8 @@ if [[ -z "${SKIP_VERIFY:-}" ]]; then
[[ -x "$PY" ]] || PY="$(command -v python3)"
if ! sudo -E -u cis490 "$PY" "$INSTALL_ROOT/tools/run_tier3_demo.py" \
--module vsftpd_234_backdoor \
--target-port 21 \
--target-port 2021 \
--data-root "$DATA_ROOT/data" \
--target-boot-timeout 240 \
> /tmp/cis490-tier3-verify.log 2>&1; then
log "verify run failed — log at /tmp/cis490-tier3-verify.log; dumping last 30 lines:"


@ -45,6 +45,8 @@ def main(argv: list[str] | None = None) -> int:
p.add_argument("--require-real-samples", action="store_true")
p.add_argument("--force-tier2", action="store_true",
help="Skip Tier 3 even when msfrpcd is reachable")
p.add_argument("--max-tier3-slots", type=int, default=None,
help="Cap concurrent Tier-3 slots; slots >= N fall back to Tier-2")
p.add_argument("--log-level", default="INFO")
args = p.parse_args(argv)
@ -72,6 +74,7 @@ def main(argv: list[str] | None = None) -> int:
max_concurrent_override=args.max_concurrent,
require_real_samples=args.require_real_samples,
force_tier2=args.force_tier2,
max_tier3_slots=args.max_tier3_slots,
)
runner = FleetRunner(cfg)


@ -66,13 +66,20 @@ def _wait_for_path(path: Path, timeout_s: float) -> None:
def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None:
"""Legacy TCP probe — only reliable when the guest speaks first on connect.
Kept for reference; replaced by _wait_for_serial_login for SLIRP guests."""
import socket
deadline = time.monotonic() + timeout_s
last_err: Exception | None = None
while time.monotonic() < deadline:
try:
with socket.create_connection((host, port), timeout=1.0):
return
with socket.create_connection((host, port), timeout=1.0) as s:
s.settimeout(0.5)
try:
s.recv(1)
except socket.timeout:
pass
return
except OSError as e:
last_err = e
time.sleep(1.0)
@ -82,6 +89,58 @@ def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None:
)
def _wait_for_serial_login(
serial_sock: "Path",
timeout_s: float,
prompt: bytes = b"login:",
) -> None:
"""Wait for a shell login prompt on the QEMU serial console.
SLIRP completes the TCP handshake before the guest OS boots, making
TCP-based readiness probes on port 139/445 unreliable (they return
immediately even when Samba isn't running yet). The serial console is
authoritative: we connect right after QEMU writes its pidfile (before
the guest produces any output) and stream boot messages until the
"login:" prompt appears.
QEMU's serial chardev is ``server=on,wait=off``: the socket is created
at QEMU startup. Data written before a client connects is discarded, so
we must connect before the prompt appears. Since the pidfile is written
after QEMU finishes device init (well before the guest kernel loads), we
reliably connect in time.
"""
import socket as _socket
deadline = time.monotonic() + timeout_s
while not serial_sock.exists():
if time.monotonic() >= deadline:
raise TimeoutError(f"serial socket {serial_sock} never appeared")
time.sleep(0.2)
buf = b""
sock = _socket.socket(_socket.AF_UNIX, _socket.SOCK_STREAM)
sock.settimeout(2.0)
try:
sock.connect(str(serial_sock))
while time.monotonic() < deadline:
try:
chunk = sock.recv(4096)
if not chunk:
break
buf += chunk
if prompt in buf.lower():
return
except _socket.timeout:
pass
finally:
sock.close()
raise TimeoutError(
f"login prompt not seen on serial console within {timeout_s}s "
f"(last {min(200, len(buf))} bytes: {buf[-200:]!r})"
)
def main() -> int:
parser = argparse.ArgumentParser(prog="run_tier3_demo")
parser.add_argument("--data-root", default="data")
@ -181,6 +240,18 @@ def main() -> int:
sample.name, sample.profile, sample.kind)
run_dir = Path(args.run_dir)
# Kill any QEMU still holding this slot's run_dir from a previous wave.
# QEMU is started with start_new_session=True so it survives orchestrator
# SIGTERM without explicit cleanup here.
old_pid_file = run_dir / "qemu.pid"
if old_pid_file.exists():
try:
old_pid = int(old_pid_file.read_text().strip())
import os as _os
_os.killpg(_os.getpgid(old_pid), signal.SIGTERM)
time.sleep(1.5)
except (ProcessLookupError, ValueError, OSError):
pass
if run_dir.exists():
import shutil
shutil.rmtree(run_dir)
@ -202,11 +273,11 @@ def main() -> int:
try:
_wait_for_path(pid_file, timeout_s=15.0)
qemu_pid = int(pid_file.read_text().strip())
log.info("qemu pid = %d; waiting for service on %s:%d (timeout %.0fs)",
qemu_pid, args.target_ip, args.target_port,
args.target_boot_timeout)
_wait_for_tcp(args.target_ip, args.target_port, args.target_boot_timeout)
log.info("target service is up")
serial_sock = run_dir / "serial.sock"
log.info("qemu pid = %d; waiting for login prompt on serial console (timeout %.0fs)",
qemu_pid, args.target_boot_timeout)
_wait_for_serial_login(serial_sock, timeout_s=args.target_boot_timeout)
log.info("target guest OS ready (login prompt seen on serial console)")
# Pre-exploit savevm so EpisodeConfig.revert_at_{start,end}
# has a known-good baseline to load. Best-effort — we still
@ -260,6 +331,11 @@ def main() -> int:
module=module,
cfg=DriverConfig(
target_ip=args.target_ip,
# Override RPORT when target_port is an unprivileged host port
# (i.e. fleet runner remapped the guest's privileged port to a
# loopback port > 1024). When target_port == module RPORT the
# caller wants direct guest access; leave RPORT unchanged.
target_port=args.target_port if args.target_port > 1024 else None,
sample_store_root=repo_root / "samples" / "store",
),
emit_event=runner.emit_event,


@ -34,9 +34,11 @@ RAM_MIB="${RAM_MIB:-512}"
BRIDGE="${BRIDGE:-}"
TAP="${TAP:-cis490target$SLOT}"
# Ports the host should forward to the guest. Comma-separated host:guest pairs.
# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
# Default covers the vsftpd module's RPORT. Host port uses an unprivileged
# range (>1023) so the service user (cis490) can bind it without root.
# Slot offset makes per-VM fleet runs collision-free
# (slot 0 → 2021, slot 1 → 2121, slot 2 → 2221, ...).
PORT_BASE="${PORT_BASE:-$((2021 + SLOT * 100))}"
TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
# KVM if the host can take it; otherwise fall back to TCG. Cross-arch
# images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.