diff --git a/TIER3-BRINGUP.md b/TIER3-BRINGUP.md
new file mode 100644
index 0000000..7bc54e9
--- /dev/null
+++ b/TIER3-BRINGUP.md
@@ -0,0 +1,190 @@
+# Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
+
+Bugs found and fixed during the first real-exploit fleet run on this host.
+All fixes are in the commits following the `Dev_REL1_043026` merge of main.
+
+---
+
+## Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
+
+**Symptom:** All Tier-3 slots time out at 300 s waiting for the target
+service. QEMU starts with `netdev tap` instead of `netdev user` (SLIRP).
+
+**Root cause:** `launch_target.sh` checks `BRIDGE` to switch between SLIRP
+and tap networking. The fleet runner copied the parent environment (which had
+`BRIDGE=br-malware` from the Tier-2 tap setup) into the Tier-3 subprocess.
+The Tier-3 target VMs don't have a tap interface configured, so all guest
+traffic is dropped.
+
+**Fix:** `fleet.py` `_run_slot()` now calls `env.pop("BRIDGE", None)` before
+launching `run_tier3_demo.py`. Tier-2 idle VMs continue to use tap; Tier-3
+target VMs always use SLIRP+hostfwd.
+
+**Files:** `orchestrator/fleet.py`
+
+---
+
+## Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
+
+**Symptom:** `distccd_command_exec` and `php_cgi_arg_injection` appear in
+`usable_modules` even on SLIRP-only runs. Exploit fires but the reverse-shell
+payload can't call back (no guest egress on `restrict=on`).
+
+**Root cause:** `usable_modules` filtering was conditioned on `bridge_iface`
+being set in the environment. When BRIDGE was set, ALL modules were considered
+usable, even though Tier-3 targets run on SLIRP. Bridge-egress modules (reverse
+shells) silently fell through, fired, and timed out waiting for a session.
+
+**Fix:** `usable_modules` now always filters out `requires_bridge=True` modules
+regardless of the BRIDGE env var. The `requires_bridge` field in the module
+TOML is authoritative.
+
+**Files:** `orchestrator/fleet.py`, `exploits/modules/*.toml`
+
+---
+
+## Bug 3 — `cmd/unix/interact` creates no persistent session
+
+**Symptom:** `samba_usermap_script` fires (job_id=None), no session appears in
+`session.list` after 30 s. The exploit succeeds on the wire but the driver
+reports `session_open_timeout`.
+
+**Root cause:** `cmd/unix/interact` is a console-only payload. It attaches
+directly to the module's job console — it does NOT create a background
+Meterpreter/shell session visible via `session.list`. msfrpcd's
+`module.execute` returns `job_id=None` (no background job), and
+`wait_for_new_session` polls until its timeout expires.
+
+**Fix:** Changed payload to `cmd/unix/bind_perl` with `LPORT=4444`. The
+bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
+to `RHOSTS:LPORT` after the exploit fires, creating a proper shell session.
+
+**Files:** `exploits/modules/samba_usermap_script.toml`
+
+---
+
+## Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
+
+**Symptom:** For slots 1+, the bind-shell port is reachable on the host but
+msfrpcd cannot connect. `ss -tlnp` on the host shows port 5444 listening
+(QEMU) but the module tries to connect to port 4444.
+
+**Root cause:** The extra hostfwd was `host:5444→guest:4444` (old guest port)
+but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
+Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
+
+**Fix:** Extra hostfwd now uses `extra_host_port:extra_host_port` on both
+sides. `extra_host_port = extra_guest_port + slot * 1000` is the per-slot LPORT,
+and the guest binds that exact port.
+
+**Files:** `orchestrator/fleet.py`
+
+---
+
+## Bug 5 — vsftpd module port 6200 collision across concurrent slots
+
+**Symptom:** Multiple Tier-3 slots running vsftpd_234_backdoor all try to
+hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start
+because port 6200 is already bound by slot 0's QEMU.
+
+**Root cause:** vsftpd's backdoor hardcodes port 6200 in both the vulnerable
+binary and the Metasploit module. There is no LPORT override possible. With
+SLIRP+hostfwd, all concurrent slots must use the same host port, which is
+impossible.
+
+**Fix:** Marked `vsftpd_234_backdoor.toml` with `requires_bridge = true`. The
+fleet runner filters it from `usable_modules` on SLIRP runs. When a bridge is
+available each guest gets its own IP, and msfrpcd connects to `guest_ip:6200`
+directly.
+
+**Files:** `exploits/modules/vsftpd_234_backdoor.toml`
+
+---
+
+## Bug 6 — SLIRP false-positive in `_wait_for_tcp` causes premature exploit fire
+
+**Symptom:** Log shows "target service is up" within 0.5 s of QEMU start. The
+exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s
+to boot Samba. Result: `session_open_timeout` every episode.
+
+**Root cause:** SLIRP's usermode TCP stack completes the TCP three-way
+handshake (SYN-ACK) immediately for any port that has a `hostfwd` rule,
+regardless of whether the guest OS has booted. A bare `socket.create_connection()`
+always succeeds. Adding a `recv()` with a short timeout (0.5 s) does not help:
+during very early boot SLIRP cannot RST the connection (the guest TCP stack is
+not up yet), so the connection hangs open and `recv()` raises `socket.timeout`
+before SLIRP can determine the guest state.
+
+**Fix:** Replaced `_wait_for_tcp` with `_wait_for_serial_login`. The new
+function connects to QEMU's serial console socket (`serial.sock`) right after
+the pidfile appears and streams boot output until `"login:"` is seen. The
+serial console is authoritative: it reflects actual guest OS state, not
+SLIRP's synthetic TCP layer.
+
+Timing:
+- `serial.sock` is created by QEMU at device init, before the pidfile.
+- We connect immediately after the pidfile → we receive all boot output.
+- Metasploitable2 prints `"metasploitable login:"` ≈ 50–70 s after QEMU start.
+- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires
+  when Samba is reliably up.
+
+**Files:** `tools/run_tier3_demo.py`
+
+---
+
+## Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
+
+**Symptom:** After a systemd restart of `cis490-orchestrator`, the new wave's
+QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
+from the previous wave is still running (QEMU is started with
+`start_new_session=True` so it survives the orchestrator's SIGTERM). The new
+episode detects the stale QEMU answering the port probe and proceeds as if the
+target is up — but the stale QEMU has different hostfwd mappings (no bind port
+for the current module), so the exploit never lands.
+
+**Fix:** `run_tier3_demo.py` reads the old `qemu.pid` file from the run
+directory before recreating it. If a PID is found, `os.killpg(pgid, SIGTERM)`
+terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
+exit before the port is rebound.
+
+**Files:** `tools/run_tier3_demo.py`
+
+---
+
+## Bug 8 — `PORT_BASE` default uses privileged ports (< 1024)
+
+**Symptom:** `launch_target.sh`'s default `PORT_BASE` was `21 + SLOT * 100`.
+On Tier-2 hosts without Metasploitable2, standalone `run_tier3_demo.py` tries
+to bind port 21 on loopback. The `cis490` service user cannot bind ports
+< 1024. QEMU exits immediately.
+
+**Fix:** Default changed to `2021 + SLOT * 100`. Port 2021 is above 1024 and
+reflects the scheme used by the fleet runner (base_port + 2000).
+
+**Files:** `vm/launch_target.sh`, `scripts/install-tier-3-4.sh`
+
+---
+
+## Bug 9 — msfrpc `module.execute` response is raw msgpack bytes, not str
+
+**Symptom:** Key lookups on the `module.execute` response raise `KeyError`
+or fail silently because msgpack returns `bin` type (bytes) for all string
+values, even with `raw=False` on some Metasploit 6.x builds.
+
+**Fix:** Added `MSFRpcClient._str()` to recursively decode bytes→str in all
+msgpack response dicts. Applied to `module.execute` and `session.list`.
+
+**Files:** `exploits/msfrpc.py`
+
+---
+
+## Net result after all fixes
+
+With fixes 1–9 applied:
+- All 4 Tier-3 slots use SLIRP+hostfwd with correct per-slot port mapping.
+- `samba_usermap_script` fires `cmd/unix/bind_perl` with the correct per-slot
+  LPORT; msfrpcd connects to the bind port via hostfwd.
+- The exploit fires only after Metasploitable2 confirms its login prompt on
+  the serial console (~60 s after QEMU start).
+- Sessions open, workloads execute, episodes complete with `session_open`
+  events (not `session_open_timeout`).
diff --git a/etc/cis490-orchestrator.service b/etc/cis490-orchestrator.service
index fbb32b1..fae8fdf 100644
--- a/etc/cis490-orchestrator.service
+++ b/etc/cis490-orchestrator.service
@@ -14,6 +14,9 @@ WorkingDirectory=/opt/cis490
 # /etc/cis490/lab-host.env is written by scripts/install-lab-host.sh;
 # carries FLEET_HOST_ID, BRIDGE, and any operator-supplied overrides.
 EnvironmentFile=/etc/cis490/lab-host.env
+# msfrpc credentials (written by install-msfrpcd.sh). Optional (-) so the
+# unit still starts on Tier-2-only hosts where msfrpcd isn't installed.
+EnvironmentFile=-/etc/cis490/msfrpc.env
 # Fleet mode: detect host capacity, run that many concurrent episodes
 # per wave with samples drawn from the manifest.
Each invocation runs # one wave and exits; systemd respawns per Restart= below, giving us @@ -22,7 +25,8 @@ EnvironmentFile=/etc/cis490/lab-host.env ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \ --data-root /var/lib/cis490/data \ --manifest /opt/cis490/samples/manifest.toml \ - --waves 1 + --waves 1 \ + --max-tier3-slots 4 Restart=always RestartSec=15 diff --git a/exploits/driver.py b/exploits/driver.py index f116719..634a9c5 100644 --- a/exploits/driver.py +++ b/exploits/driver.py @@ -27,6 +27,7 @@ adapter between the phase machine and msfrpc. from __future__ import annotations import logging +import os import time from dataclasses import dataclass from typing import Callable @@ -52,6 +53,10 @@ EmitEvent = Callable[..., None] class DriverConfig: target_ip: str session_open_timeout_s: float = 30.0 + # HOST_PORT for the module's service. When set, overrides RPORT in the + # module's options so msfrpcd connects to the hostfwd'd loopback port + # rather than the guest's privileged port directly. + target_port: int | None = None # Driver v1 fallback workload — used only when no Sample is passed # in (Sample-driven runs override these via exploits.workloads). # We keep the v1 path so existing callers keep working unchanged. @@ -185,6 +190,15 @@ class MSFExploitDriver: log.debug("module already fired; skipping re-fire") return opts = self.module.render_options(target_ip=self.cfg.target_ip) + if self.cfg.target_port is not None: + opts["RPORT"] = self.cfg.target_port + # Fleet sets FLEET_PAYLOAD_LPORT to the per-slot host port for + # bind-shell payloads (cmd/unix/bind_perl etc.) so the handler + # connects to the right hostfwd'd loopback port. 
+ fleet_lport = os.environ.get("FLEET_PAYLOAD_LPORT") + if fleet_lport and "LPORT" in opts: + opts["LPORT"] = int(fleet_lport) + log.info("LPORT overridden to %s (FLEET_PAYLOAD_LPORT)", fleet_lport) self.emit( "exploit_fire", module=self.module.module_path, diff --git a/exploits/modules.py b/exploits/modules.py index b0f39c2..cfb01b1 100644 --- a/exploits/modules.py +++ b/exploits/modules.py @@ -45,6 +45,11 @@ class ModuleConfig: # The fleet runner skips these unless BRIDGE is set so episodes # that fire them actually produce data. requires_bridge: bool = False + # Guest ports the fleet must also hostfwd (in addition to RPORT). + # Used for bind-shell payloads where the handler connects to a + # separate port. Fleet calculates per-slot host ports and sets + # FLEET_PAYLOAD_LPORT so the driver can override LPORT at fire time. + extra_target_ports: tuple[int, ...] = () def render_options(self, *, target_ip: str) -> dict[str, Any]: """Substitute ``{{ target_ip }}`` placeholders in options. @@ -99,6 +104,9 @@ def load_module_config(path: Path) -> ModuleConfig: expected_session_type=raw.get("session", {}).get("type", "shell"), description=raw.get("description", ""), requires_bridge=bool(raw.get("runtime", {}).get("requires_bridge", False)), + extra_target_ports=tuple( + int(p) for p in raw.get("runtime", {}).get("extra_target_ports", []) + ), ) diff --git a/exploits/modules/samba_usermap_script.toml b/exploits/modules/samba_usermap_script.toml index a4ccce8..a1957dc 100644 --- a/exploits/modules/samba_usermap_script.toml +++ b/exploits/modules/samba_usermap_script.toml @@ -2,8 +2,10 @@ description = """ Samba 3.0.20 username-map command injection (CVE-2007-2447). Trigger is a crafted username at SMB authentication; the Samba daemon shells out via the username_map_script and runs whatever the attacker put in -the username. Standard Metasploitable2 vector. Returns a root shell -on the SMB socket — works with cmd/unix/interact. +the username. 
Standard Metasploitable2 vector. Uses a bind-perl +payload so msfrpcd can connect to the resulting shell via SLIRP +hostfwd; LPORT is fleet-assigned per slot (base 4444, +1000/slot) +to avoid collisions across concurrent episodes. """ [module] @@ -15,7 +17,16 @@ RHOSTS = "{{ target_ip }}" RPORT = 139 [payload] -path = "cmd/unix/interact" +path = "cmd/unix/bind_perl" + +[payload.options] +LPORT = 4444 [session] type = "shell" + +[runtime] +# bind_perl opens a new guest port; fleet hostfwds it via SLIRP. +# No bridge egress needed — host connects in, not guest out. +requires_bridge = false +extra_target_ports = [4444] diff --git a/exploits/modules/vsftpd_234_backdoor.toml b/exploits/modules/vsftpd_234_backdoor.toml index 4e7374f..49725fc 100644 --- a/exploits/modules/vsftpd_234_backdoor.toml +++ b/exploits/modules/vsftpd_234_backdoor.toml @@ -1,8 +1,14 @@ description = """ vsftpd 2.3.4 intentional backdoor (CVE-2011-2523). Triggered by an FTP USER name ending with ':)'. Standard Metasploitable2 exploit, fully -deterministic — perfect for a Tier-3 first-light run because the -exploit fire timing is bounded by a single FTP round-trip. +deterministic — perfect for a Tier-3 first-light run. + +NOTE: The backdoor binds a shell on port 6200 (hardcoded in both the +vulnerable vsftpd binary AND the Metasploit module — not overridable). +msfrpcd connects to RHOSTS:6200 after triggering the backdoor. With +SLIRP+restrict=on and multiple concurrent slots, port 6200 can only be +hostfwd'd once, causing collisions. Requires BRIDGE so the exploit +handler can reach guest:6200 directly via the bridge IP. """ [module] @@ -12,12 +18,14 @@ path = "unix/ftp/vsftpd_234_backdoor" [module.options] RHOSTS = "{{ target_ip }}" RPORT = 21 -# The exploit returns its own command shell — we drive it with a -# minimal cmd/unix/interact payload so the session lands as a plain -# shell session usable by session.shell_write/read. 
[payload] path = "cmd/unix/interact" [session] type = "shell" + +[runtime] +# Port 6200 (backdoor bind) is hardcoded; can't offset per-slot. +# Requires bridge so all concurrent slots get distinct guest IPs. +requires_bridge = true diff --git a/exploits/msfrpc.py b/exploits/msfrpc.py index f39ca49..e408357 100644 --- a/exploits/msfrpc.py +++ b/exploits/msfrpc.py @@ -104,8 +104,8 @@ class MSFRpcClient: if "job_id" not in resp: raise MSFRpcError(f"module.execute returned no job_id: {resp!r}") log.info( - "module.execute %s/%s -> job_id=%s uuid=%s", - module_type, module_name, resp["job_id"], resp.get("uuid"), + "module.execute %s/%s -> job_id=%s uuid=%s resp=%r", + module_type, module_name, resp["job_id"], resp.get("uuid"), resp, ) return resp @@ -154,6 +154,22 @@ class MSFRpcClient: def _call_no_auth(self, method: str, *args: Any) -> dict[str, Any]: return self._raw_call([method, *args]) + @staticmethod + def _str(v: Any) -> Any: + """Decode bytes to str; recursively normalize dicts and lists. + + msfrpcd (pacman metasploit 6.x) returns msgpack bin type for all + string values, so raw=False still gives bytes. Normalise the whole + response tree so callers can use plain str keys/values. 
+ """ + if isinstance(v, bytes): + return v.decode("utf-8", errors="replace") + if isinstance(v, dict): + return {MSFRpcClient._str(k): MSFRpcClient._str(val) for k, val in v.items()} + if isinstance(v, list): + return [MSFRpcClient._str(i) for i in v] + return v + def _raw_call(self, payload: list[Any]) -> dict[str, Any]: body = msgpack.packb(payload, use_bin_type=False) conn = self._open_conn() @@ -180,7 +196,7 @@ class MSFRpcClient: conn.close() try: - decoded = msgpack.unpackb(raw, raw=False) + decoded = self._str(msgpack.unpackb(raw, raw=False)) except Exception as e: raise MSFRpcError(f"could not decode msfrpcd response: {e}") from e @@ -221,11 +237,18 @@ def wait_for_new_session( ) -> tuple[int, dict[str, Any]] | None: """Poll ``session.list`` until a session id we haven't seen before appears, or until timeout. Returns ``(session_id, info)`` or None.""" + log = __import__("logging").getLogger("cis490.msfrpc") deadline = time.monotonic() + timeout_s + logged_empty = False while time.monotonic() < deadline: sessions = client.session_list() + if not logged_empty: + log.debug("wait_for_new_session: seen=%r current=%r", seen, list(sessions.keys())) + logged_empty = True for sid, info in sessions.items(): if sid not in seen: return sid, info time.sleep(poll_s) + # Log final state on timeout + log.debug("wait_for_new_session timeout: final sessions=%r", client.session_list()) return None diff --git a/orchestrator/fleet.py b/orchestrator/fleet.py index 339cefe..b58a9dd 100644 --- a/orchestrator/fleet.py +++ b/orchestrator/fleet.py @@ -109,6 +109,12 @@ class FleetConfig: # Force Tier-2 even when msfrpcd is up; used by tests + dev runs # that want a no-exploit baseline. force_tier2: bool = False + # Limit how many slots per wave run as Tier-3. Slots 0..N-1 get + # Tier-3; the rest fall back to Tier-2. Metasploitable2 boot is IO- + # bound: running >~6 concurrent target VMs saturates disk and causes + # all slots to timeout waiting for the guest service to come up. 
+ # None = no cap (all eligible slots use Tier-3). + max_tier3_slots: int | None = None # msfrpcd connectivity (read by tier-3 driver via env). msfrpcd_host: str = "127.0.0.1" msfrpcd_port: int = 55553 @@ -237,26 +243,19 @@ def _run_slot( run_dir_base = "/tmp/cis490-vm-fleet" # Decide tier. - bridge_iface = os.environ.get("BRIDGE") or None - # Filter the catalog to modules that can actually fire under the - # current launcher mode. Reverse / bind shells require the host- - # only bridge (no SLIRP+restrict=on guest egress), so skip those - # when BRIDGE isn't set; otherwise the exploit fires but the - # session never lands and the episode degenerates to a 30 s - # session_open_timeout. - if cfg.modules: - if bridge_iface: - usable_modules = dict(cfg.modules) - else: - usable_modules = { - k: v for k, v in cfg.modules.items() if not v.requires_bridge - } - else: - usable_modules = {} + # Tier-3 target VMs always use SLIRP+hostfwd so msfrpcd can reach + # the guest via loopback. BRIDGE tap is for the Tier-2 idle VM only + # (pcap source 4). Skip modules marked requires_bridge (reverse shells + # that need guest egress, or bind ports that can't be offset per slot). + usable_modules: dict[str, ModuleConfig] = ( + {k: v for k, v in cfg.modules.items() if not v.requires_bridge} + if cfg.modules else {} + ) tier3_ready = ( not cfg.force_tier2 and bool(usable_modules) and _msfrpcd_available(cfg.msfrpcd_host, cfg.msfrpcd_port) + and (cfg.max_tier3_slots is None or slot < cfg.max_tier3_slots) ) env = os.environ.copy() @@ -280,15 +279,33 @@ def _run_slot( usable_modules, host_id=cfg.host_id, slot=slot, episode_index=episode_index, ) - target_port = module_target_port(module) or 21 + guest_port = module_target_port(module) or 21 + # HOST_PORT: unprivileged port QEMU hostfwd's to the guest service. + # +2000 shifts all base ports above 1024 (vsftpd:21->2021, + # http:80->2080, smb:139->2139, distcc:3632->5632, irc:6667->8667). 
+ # Slot offset prevents concurrent targets from colliding on loopback. + host_port = guest_port + 2000 + slot * 1000 # Per-slot runner dir for the target VM. run_dir = f"{run_dir_base}-target-{slot}" env["RUN_DIR"] = run_dir - # Each slot gets a unique host-side hostfwd port so concurrent - # targets don't collide on the loopback port. - env["PORT_BASE"] = str(target_port + slot * 1000) - if bridge_iface: - env["BRIDGE"] = bridge_iface + env["PORT_BASE"] = str(host_port) + # Main service port pair, plus per-slot bind ports for payloads + # like cmd/unix/bind_perl that open a separate listener in the guest. + # Per-slot offset (base + slot*1000) prevents collisions. + target_ports = f"{host_port}:{guest_port}" + for extra_guest_port in module.extra_target_ports: + # Per-slot LPORT: base + slot*1000. FLEET_PAYLOAD_LPORT overrides + # the payload's LPORT so the guest binds this exact port. The + # hostfwd maps the same number on both sides because the guest's + # bind port equals the per-slot LPORT (not the module's base LPORT). + extra_host_port = extra_guest_port + slot * 1000 + target_ports += f",{extra_host_port}:{extra_host_port}" + env["FLEET_PAYLOAD_LPORT"] = str(extra_host_port) + env["TARGET_PORTS"] = target_ports + # Remove BRIDGE so launch_target.sh uses SLIRP+hostfwd instead of + # tap. Target VM connectivity goes through the hostfwd loopback ports; + # tap/bridge requires guest-IP discovery which isn't wired up yet. 
+ env.pop("BRIDGE", None) cmd = [ py, str(cfg.repo_root / "tools" / "run_tier3_demo.py"), @@ -296,7 +313,8 @@ def _run_slot( "--run-dir", run_dir, "--module", module.name, "--sample", sample.name, - "--target-port", str(target_port + slot * 1000), + "--target-port", str(host_port), + "--target-boot-timeout", "300", ] tier = "tier3" module_name: str | None = module.name @@ -314,6 +332,10 @@ def _run_slot( module_name = None if not cfg.force_tier2 and not cfg.modules: log.warning("slot=%d falling back to Tier 2: empty module catalog", slot) + elif not cfg.force_tier2 and not usable_modules: + log.warning("slot=%d falling back to Tier 2: no non-bridge modules available", slot) + elif not cfg.force_tier2 and cfg.max_tier3_slots is not None and slot >= cfg.max_tier3_slots: + log.debug("slot=%d Tier 2 by max_tier3_slots=%d cap", slot, cfg.max_tier3_slots) elif not cfg.force_tier2: log.warning("slot=%d falling back to Tier 2: msfrpcd unreachable at %s:%d", slot, cfg.msfrpcd_host, cfg.msfrpcd_port) diff --git a/scripts/install-lab-host.sh b/scripts/install-lab-host.sh index 113a8f1..44c4d53 100755 --- a/scripts/install-lab-host.sh +++ b/scripts/install-lab-host.sh @@ -244,6 +244,8 @@ fi install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT/vm/images" ln -sf "$ALPINE_IMG" "$INSTALL_ROOT/vm/images/alpine-baseline.qcow2" 2>/dev/null || true ln -sf "$CIDATA_ISO" "$INSTALL_ROOT/vm/images/cidata.iso" 2>/dev/null || true +M2_IMG="$DATA_ROOT/vm/images/metasploitable2.qcow2" +[[ -f "$M2_IMG" ]] && ln -sf "$M2_IMG" "$INSTALL_ROOT/vm/images/metasploitable2.qcow2" 2>/dev/null || true # --- 8. Tier-3 + Tier-4 deploy (auto, idempotent) ---------------------- # Bring up msfrpcd + Metasploitable2 + bridge + verify. Skipped only if diff --git a/scripts/install-msfrpcd.sh b/scripts/install-msfrpcd.sh index ec7477a..0141fd7 100755 --- a/scripts/install-msfrpcd.sh +++ b/scripts/install-msfrpcd.sh @@ -102,6 +102,12 @@ else fi # --- 3. 
systemd unit ---------------------------------------------------- +# msfrpcd writes module cache + logs to $HOME/.msf4. With ProtectHome=true +# the service can't reach /root, so we redirect HOME to a path under +# /var/lib/cis490 that is always writable. +MSF_HOME="/var/lib/cis490/msf4" +install -d -m 0755 -o root -g root "$MSF_HOME" + log "installing systemd unit" cat > "$UNIT" < password # -U username diff --git a/scripts/install-tier-3-4.sh b/scripts/install-tier-3-4.sh index ca2e7b9..49dbc88 100755 --- a/scripts/install-tier-3-4.sh +++ b/scripts/install-tier-3-4.sh @@ -101,7 +101,8 @@ if [[ -z "${SKIP_VERIFY:-}" ]]; then [[ -x "$PY" ]] || PY="$(command -v python3)" if ! sudo -E -u cis490 "$PY" "$INSTALL_ROOT/tools/run_tier3_demo.py" \ --module vsftpd_234_backdoor \ - --target-port 21 \ + --target-port 2021 \ + --data-root "$DATA_ROOT/data" \ --target-boot-timeout 240 \ > /tmp/cis490-tier3-verify.log 2>&1; then log "verify run failed — log at /tmp/cis490-tier3-verify.log; dumping last 30 lines:" diff --git a/tools/run_fleet.py b/tools/run_fleet.py index bfbe5a2..5dfb29e 100644 --- a/tools/run_fleet.py +++ b/tools/run_fleet.py @@ -45,6 +45,8 @@ def main(argv: list[str] | None = None) -> int: p.add_argument("--require-real-samples", action="store_true") p.add_argument("--force-tier2", action="store_true", help="Skip Tier 3 even when msfrpcd is reachable") + p.add_argument("--max-tier3-slots", type=int, default=None, + help="Cap concurrent Tier-3 slots; slots >= N fall back to Tier-2") p.add_argument("--log-level", default="INFO") args = p.parse_args(argv) @@ -72,6 +74,7 @@ def main(argv: list[str] | None = None) -> int: max_concurrent_override=args.max_concurrent, require_real_samples=args.require_real_samples, force_tier2=args.force_tier2, + max_tier3_slots=args.max_tier3_slots, ) runner = FleetRunner(cfg) diff --git a/tools/run_tier3_demo.py b/tools/run_tier3_demo.py index 536cdf6..95ffdd8 100644 --- a/tools/run_tier3_demo.py +++ b/tools/run_tier3_demo.py @@ 
-66,13 +66,20 @@ def _wait_for_path(path: Path, timeout_s: float) -> None: def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None: + """Legacy TCP probe — only reliable when the guest speaks first on connect. + Kept for reference; replaced by _wait_for_serial_login for SLIRP guests.""" import socket deadline = time.monotonic() + timeout_s last_err: Exception | None = None while time.monotonic() < deadline: try: - with socket.create_connection((host, port), timeout=1.0): - return + with socket.create_connection((host, port), timeout=1.0) as s: + s.settimeout(0.5) + try: + s.recv(1) + except socket.timeout: + pass + return except OSError as e: last_err = e time.sleep(1.0) @@ -82,6 +89,58 @@ def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None: ) +def _wait_for_serial_login( + serial_sock: "Path", + timeout_s: float, + prompt: bytes = b"login:", +) -> None: + """Wait for a shell login prompt on the QEMU serial console. + + SLIRP completes the TCP handshake before the guest OS boots, making + TCP-based readiness probes on port 139/445 unreliable (they return + immediately even when Samba isn't running yet). The serial console is + authoritative: we connect right after QEMU writes its pidfile (before + the guest produces any output) and stream boot messages until the + "login:" prompt appears. + + QEMU's serial chardev is ``server=on,wait=off``: the socket is created + at QEMU startup. Data written before a client connects is discarded, so + we must connect before the prompt appears. Since the pidfile is written + after QEMU finishes device init (well before the guest kernel loads), we + reliably connect in time. 
+ """ + import socket as _socket + + deadline = time.monotonic() + timeout_s + while not serial_sock.exists(): + if time.monotonic() >= deadline: + raise TimeoutError(f"serial socket {serial_sock} never appeared") + time.sleep(0.2) + + buf = b"" + sock = _socket.socket(_socket.AF_UNIX, _socket.SOCK_STREAM) + sock.settimeout(2.0) + try: + sock.connect(str(serial_sock)) + while time.monotonic() < deadline: + try: + chunk = sock.recv(4096) + if not chunk: + break + buf += chunk + if prompt in buf.lower(): + return + except _socket.timeout: + pass + finally: + sock.close() + + raise TimeoutError( + f"login prompt not seen on serial console within {timeout_s}s " + f"(last {min(200, len(buf))} bytes: {buf[-200:]!r})" + ) + + def main() -> int: parser = argparse.ArgumentParser(prog="run_tier3_demo") parser.add_argument("--data-root", default="data") @@ -181,6 +240,18 @@ def main() -> int: sample.name, sample.profile, sample.kind) run_dir = Path(args.run_dir) + # Kill any QEMU still holding this slot's run_dir from a previous wave. + # QEMU is started with start_new_session=True so it survives orchestrator + # SIGTERM without explicit cleanup here. 
+ old_pid_file = run_dir / "qemu.pid" + if old_pid_file.exists(): + try: + old_pid = int(old_pid_file.read_text().strip()) + import os as _os + _os.killpg(_os.getpgid(old_pid), signal.SIGTERM) + time.sleep(1.5) + except (ProcessLookupError, ValueError, OSError): + pass if run_dir.exists(): import shutil shutil.rmtree(run_dir) @@ -202,11 +273,11 @@ def main() -> int: try: _wait_for_path(pid_file, timeout_s=15.0) qemu_pid = int(pid_file.read_text().strip()) - log.info("qemu pid = %d; waiting for service on %s:%d (timeout %.0fs)", - qemu_pid, args.target_ip, args.target_port, - args.target_boot_timeout) - _wait_for_tcp(args.target_ip, args.target_port, args.target_boot_timeout) - log.info("target service is up") + serial_sock = run_dir / "serial.sock" + log.info("qemu pid = %d; waiting for login prompt on serial console (timeout %.0fs)", + qemu_pid, args.target_boot_timeout) + _wait_for_serial_login(serial_sock, timeout_s=args.target_boot_timeout) + log.info("target guest OS ready (login prompt seen on serial console)") # Pre-exploit savevm so EpisodeConfig.revert_at_{start,end} # has a known-good baseline to load. Best-effort — we still @@ -260,6 +331,11 @@ def main() -> int: module=module, cfg=DriverConfig( target_ip=args.target_ip, + # Override RPORT when target_port is an unprivileged host port + # (i.e. fleet runner remapped the guest's privileged port to a + # loopback port > 1024). When target_port == module RPORT the + # caller wants direct guest access; leave RPORT unchanged. + target_port=args.target_port if args.target_port > 1024 else None, sample_store_root=repo_root / "samples" / "store", ), emit_event=runner.emit_event, diff --git a/vm/launch_target.sh b/vm/launch_target.sh index 882a7a1..055e881 100755 --- a/vm/launch_target.sh +++ b/vm/launch_target.sh @@ -34,9 +34,11 @@ RAM_MIB="${RAM_MIB:-512}" BRIDGE="${BRIDGE:-}" TAP="${TAP:-cis490target$SLOT}" # Ports the host should forward to the guest. Comma-separated host:guest pairs. 
-# Default covers the vsftpd module's RPORT. Slot offset makes per-VM -# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...). -PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}" +# Default covers the vsftpd module's RPORT. Host port uses an unprivileged +# range (>1023) so the service user (cis490) can bind it without root. +# Slot offset makes per-VM fleet runs collision-free +# (slot 0 → 2021, slot 1 → 2121, slot 2 → 2221, ...). +PORT_BASE="${PORT_BASE:-$((2021 + SLOT * 100))}" TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}" # KVM if the host can take it; otherwise fall back to TCG. Cross-arch # images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.