# Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)

Bugs found and fixed during the first real-exploit fleet run on this host. All fixes are in the commits following the `Dev_REL1_043026` merge of main.

---

## Bug 1 — BRIDGE env var breaks Tier-3 target VM networking

**Symptom:** All Tier-3 slots time out at 300 s waiting for the target service. QEMU starts with `netdev tap` instead of `netdev user` (SLIRP).

**Root cause:** `launch_target.sh` checks `BRIDGE` to switch between SLIRP and tap networking. The fleet runner copied the parent environment (which had `BRIDGE=br-malware` from the Tier-2 tap setup) into the Tier-3 subprocess. The Tier-3 target VMs don't have a tap interface configured, so all guest traffic is dropped.

**Fix:** `fleet.py` `_run_slot()` now calls `env.pop("BRIDGE", None)` before launching `run_tier3_demo.py`. Tier-2 idle VMs continue to use tap; Tier-3 target VMs always use SLIRP+hostfwd.

**Files:** `orchestrator/fleet.py`

---

## Bug 2 — Bridge-requiring modules selected when BRIDGE is not available

**Symptom:** `distccd_command_exec` and `php_cgi_arg_injection` appear in `usable_modules` even on SLIRP-only runs. The exploit fires, but the reverse-shell payload can't call back (no guest egress on `restrict=on`).

**Root cause:** `usable_modules` filtering was conditioned on `bridge_iface` being set in the environment. When BRIDGE was not set, ALL modules were considered usable. Modules that require bridge egress (reverse shells) silently fell through, fired, and timed out waiting for a session.

**Fix:** `usable_modules` now always filters `requires_bridge=True` modules regardless of the BRIDGE env var. The `requires_bridge` field in the module TOML is authoritative.

**Files:** `orchestrator/fleet.py`, `exploits/modules/*.toml`

---

## Bug 3 — `cmd/unix/interact` creates no persistent session

**Symptom:** `samba_usermap_script` fires (`job_id=None`), but no session appears in `session.list` after 30 s.
The exploit succeeds on the wire but the driver reports `session_open_timeout`.

**Root cause:** `cmd/unix/interact` is a console-only payload. It attaches directly to the module's job console; it does NOT create a background Meterpreter/shell session visible via `session.list`. msfrpcd's `module.execute` returns `job_id=None` (no background job), and `wait_for_new_session` polls until it times out.

**Fix:** Changed payload to `cmd/unix/bind_perl` with `LPORT=4444`. The bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects to `RHOSTS:LPORT` after the exploit fires, creating a proper shell session.

**Files:** `exploits/modules/samba_usermap_script.toml`

---

## Bug 4 — Per-slot LPORT/hostfwd port mapping wrong

**Symptom:** For slots 1+, the bind-shell port is reachable on the host but msfrpcd cannot connect. `ss -tlnp` on the host shows port 5444 listening (QEMU) but the module tries to connect to port 4444.

**Root cause:** The extra hostfwd was `host:5444→guest:4444` (old guest port) but `FLEET_PAYLOAD_LPORT=5444` instructed the guest bind_perl to listen on 5444. Mismatch: the guest binds 5444, while hostfwd forwards host:5444→guest:4444. No path.

**Fix:** Extra hostfwd now uses `extra_host_port:extra_host_port` on both sides. `extra_host_port = base_port + slot * 1000` is the per-slot LPORT, and the guest binds that exact port.

**Files:** `orchestrator/fleet.py`

---

## Bug 5 — vsftpd module port 6200 collision across concurrent slots

**Symptom:** Multiple Tier-3 slots running `vsftpd_234_backdoor` all try to hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start because port 6200 is already bound by slot 0's QEMU.

**Root cause:** vsftpd's backdoor hardcodes port 6200 in both the vulnerable binary and the Metasploit module. There is no LPORT override possible. With SLIRP+hostfwd, all concurrent slots would have to use the same host port, which is impossible.

**Fix:** Marked `vsftpd_234_backdoor.toml` with `requires_bridge = true`.
The fleet runner filters it from `usable_modules` on SLIRP runs. When a bridge is available each guest gets its own IP, and msfrpcd connects to `guest_ip:6200` directly.

**Files:** `exploits/modules/vsftpd_234_backdoor.toml`

---

## Bug 6 — SLIRP false-positive in `_wait_for_tcp` causes premature exploit fire

**Symptom:** Log shows "target service is up" within 0.5 s of QEMU start. The exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s to boot Samba. Result: `session_open_timeout` every episode.

**Root cause:** SLIRP's usermode TCP stack completes the TCP three-way handshake (SYN-ACK) immediately for any port that has a `hostfwd` rule, regardless of whether the guest OS has booted. A bare `socket.create_connection()` always succeeds. Even a `recv()` with a short timeout (0.5 s) fires with `socket.timeout` because during very early boot SLIRP cannot RST the connection (the guest TCP stack is not up yet), so the connection hangs open and the recv deadline fires before SLIRP can determine the guest state.

**Fix:** Replaced `_wait_for_tcp` with `_wait_for_serial_login`. The new function connects to QEMU's serial console socket (`serial.sock`) right after the pidfile appears and streams boot output until `"login:"` is seen. The serial console is authoritative: it reflects actual guest OS state, not SLIRP's synthetic TCP layer.

Timing:

- `serial.sock` is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints `"metasploitable login:"` ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.

**Files:** `tools/run_tier3_demo.py`

---

## Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts

**Symptom:** After a systemd restart of `cis490-orchestrator`, the new wave's QEMU processes fail to bind their hostfwd ports (e.g., 2139).
The old QEMU from the previous wave is still running (QEMU is started with `start_new_session=True` so it survives the orchestrator's SIGTERM). The new episode detects the stale QEMU answering the port probe and proceeds as if the target is up — but the stale QEMU has different hostfwd mappings (no bind port for the current module), so the exploit never lands.

**Fix:** `run_tier3_demo.py` reads the old `qemu.pid` file from the run directory before recreating it. If a PID is found, `os.killpg(pgid, SIGTERM)` terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU exit before the port is rebound.

**Files:** `tools/run_tier3_demo.py`

---

## Bug 8 — `PORT_BASE` default uses privileged ports (< 1024)

**Symptom:** `launch_target.sh`'s default `PORT_BASE` was `21 + SLOT * 100`. On Tier-2 hosts without Metasploitable2, standalone `run_tier3_demo.py` tries to bind port 21 on loopback. The `cis490` service user cannot bind ports < 1024. QEMU exits immediately.

**Fix:** Default changed to `2021 + SLOT * 100`. Port 2021 is above 1024 and reflects the scheme used by the fleet runner (base_port + 2000).

**Files:** `vm/launch_target.sh`, `scripts/install-tier-3-4.sh`

---

## Bug 9 — msfrpc `module.execute` response is raw msgpack bytes, not str

**Symptom:** Key lookups on the `module.execute` response raise `KeyError` or fail silently because msgpack returns `bin` type (bytes) for all string values, even with `raw=False` on some Metasploit 6.x builds.

**Fix:** Added `MSFRpcClient._str()` to recursively decode bytes→str in all msgpack response dicts. Applied to `module.execute` and `session.list`.

**Files:** `exploits/msfrpc.py`

---

## Bug 10 — `_wait_for_tcp` returns success on `b''` (connection-closed-by-peer)

**Symptom:** Log shows "target service is up" within 0.5 s of the 65 s boot floor, but all exploit fires time out waiting for a session. FTP (port 21), Samba (139), and distccd (3632) all returned `b''`.
The VM's services were not up; the probe was wrong.

**Root cause:** When `recv(1)` returns `b''` (empty bytes), Python raises no exception. The code fell through to `return`, incorrectly reporting "service is up". `b''` means SLIRP forwarded the connection to the guest, the guest's TCP stack RST'd (no service listening), and SLIRP converted RST→FIN → the host sees connection closed. Only `socket.timeout` (remote end holding the connection open, waiting for client data) and non-empty `data` (banner received) are genuine ready signals.

**Fix:** Changed `recv(1)` to save the return value. On `socket.timeout`, return immediately (genuine up). On non-empty `data`, return (banner). On `b''`, set `last_err` and `continue` (retry).

**Files:** `tools/run_tier3_demo.py`

---

## Bug 11 — `distccd` and `unreal_ircd` incorrectly marked `requires_bridge = true`

**Symptom:** `distcc_exec` and `unreal_ircd_3281_backdoor` were filtered from `usable_modules` on every SLIRP-only run, even though their `cmd/unix/bind_perl` payloads create an inward-connecting bind shell (host connects to guest), which does NOT require the bridge.

**Root cause:** The comment in `distccd_command_exec.toml` said "needs bridge so the guest can reach the attacker" — correct for reverse_tcp payloads, wrong for bind_perl. bind_perl listens on the guest; msfrpcd connects to the hostfwd'd loopback port. No guest egress is needed.

**Fix:** Set `requires_bridge = false` in both modules. The fleet already adds per-slot hostfwd entries for `extra_target_ports`, so these modules now work on SLIRP+hostfwd runs without any other change.

**Files:** `exploits/modules/distccd_command_exec.toml`, `exploits/modules/unreal_ircd_3281_backdoor.toml`

---

## Bug 12 — `msgpack.unpackb` crashes on integer session IDs

**Symptom:** `wait_for_new_session` raises `ValueError: int is not allowed for map key` when msfrpcd returns a session dict keyed by integer session IDs. Traceback seen in slot-0 logs on 2026-05-01.

**Root cause:** `msgpack.unpackb(raw, raw=False)` defaults to `strict_map_key=True`, which rejects non-string keys. Metasploit 6.x msfrpcd encodes session IDs as msgpack int64 map keys.

**Fix:** Added `strict_map_key=False` to the `unpackb` call in `_raw_call`.

**Files:** `exploits/msfrpc.py`

---

## Bug 13 — `samba_usermap_script` never opens a session (removed from catalog)

**Symptom:** `multi/samba/usermap_script` fired, port 4444 bound in guest, but Metasploit reported `Rex::Proto::SMB::Exceptions::NoReply` on every run. `session.list` stayed empty for the full 30 s timeout.

**Root cause:** The SMB auth connection is disrupted when Samba's `username map script` executes the injected command (smbd kills the auth handler). Metasploit never received an SMB response → marked exploit "failed" → skipped calling the bind-shell handler → session never created.

**Fix:** Removed `samba_usermap_script.toml` from the catalog. The fleet now uses `distccd_command_exec` and `unreal_ircd_3281_backdoor` as SLIRP-capable modules (see Bug 11 fix). Both protocols return a proper response after the exploit fires, so Metasploit's handler is called and sessions open.

**Files:** `exploits/modules/samba_usermap_script.toml` (deleted), `orchestrator/fleet.py`

---

## Bug 14 — QEMU launch config incompatible with Metasploitable2 (boot hang)

**Symptom:** Every `_wait_for_tcp` probe returns `b''` for the full timeout (even after the Bug 10 fix). No service (FTP, Samba, distccd, IRC) ever becomes reachable. The VM consumes CPU (QEMU runs) but nothing listens.

**Root cause (three compounding issues in `launch_target.sh`):**

1. `-drive if=virtio` presents the disk as `/dev/vda`. Metasploitable2's GRUB was built for VMware SCSI (`/dev/sda`). Ubuntu 8.04's kernel command line says `root=/dev/sda1`. The kernel can't mount root because the disk appears as `/dev/vda` → kernel panic immediately after decompression. Services never start.
2. `-machine q35` is a PCIe chipset (Q35 + ICH9).
Old ISA-emulated devices and BIOS assumptions in Ubuntu 8.04 break under q35.
3. `-cpu host` exposes AVX/XSAVE and other modern CPU features. Linux 2.6.24 doesn't know how to save/restore these in context switches; the kernel freezes or mishandles the first SIMD operation during boot.

**Fix:** Four changes in `vm/launch_target.sh`:

- `-machine q35` → `-machine pc` (i440fx, the classic PC compatible machine)
- `-drive if=virtio` → `-drive if=ide` (Ubuntu 8.04 libata presents this as `/dev/sda`, matching the GRUB `root=` line)
- `-cpu host` (KVM) → `-cpu kvm32` (safe 32-bit KVM model, no exotic flags)
- `-device virtio-net-pci` → `-device e1000` (Intel e1000: universally supported since Linux 2.2, in every kernel config Metasploitable2 uses)

**Files:** `vm/launch_target.sh`

---

## Bug 15 — Tier-3 verify uses vsftpd (bridge-only, always fails on SLIRP)

**Symptom:** `install-tier-3-4.sh` verify step always fails. The vsftpd module's backdoor opens port 6200 (hardcoded in the binary and the MSF module). On SLIRP, all slots would need to share the same host port 6200, which QEMU refuses. The verify fails either in `_wait_for_tcp` or because the exploit never reaches a session.

**Root cause:** The verify step was left on `vsftpd_234_backdoor` after Bug 5 marked that module `requires_bridge = true`. The verify subprocess doesn't have a bridge configured and doesn't set up the extra hostfwd for port 6200.

**Fix:** Changed verify to `distccd_command_exec` with correct SLIRP port mappings: `TARGET_PORTS="5632:3632,4444:4444"` and `--target-port 5632`. distccd doesn't hardcode a backdoor port; the bind shell uses the fleet-assigned `LPORT`. No bridge needed.

**Files:** `scripts/install-tier-3-4.sh`

---

## Net result after all fixes

With fixes 1–15 applied:

- Metasploitable2 boots correctly under KVM (pc machine, kvm32 CPU, ide disk, e1000 network). Services start ~60–70 s after QEMU launch.
- `_wait_for_tcp` correctly waits until a service is genuinely listening (returns only on `socket.timeout` or non-empty banner data).
- `distccd_command_exec` and `unreal_ircd_3281_backdoor` are admitted to the module catalog; both are SLIRP-compatible with `cmd/unix/bind_perl`.
- `samba_usermap_script` removed from catalog (NoReply, sessions never open).
- `msgpack.unpackb` accepts integer session ID keys without crashing.
- The verify step uses `distccd_command_exec` on SLIRP+hostfwd.
- Sessions open, workloads execute, episodes complete with `session_open` events.
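
The probe rule in the list above (ready only on `socket.timeout` or a non-empty banner, retry on `b''`, as described under Bug 10) can be sketched roughly as follows. This is a minimal illustration and not the code in `tools/run_tier3_demo.py`: the names `classify_probe` and `probe_once`, the `"up"`/`"retry"` return values, and the timeout defaults are all assumptions made for the example. Bug 6 reports that a timeout can also occur under SLIRP during very early boot, so a rule like this is probably only meaningful once the guest is past that stage.

```python
import socket


def classify_probe(sock: socket.socket, recv_timeout: float = 0.5) -> str:
    """Classify one readiness probe on an already-connected socket.

    Hypothetical helper: returns "up" when the peer sent data or is holding
    the connection open, and "retry" when the peer closed or reset it.
    """
    sock.settimeout(recv_timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # No data before the deadline, but the connection is still open:
        # the remote end is waiting for the client to speak first.
        return "up"
    except OSError:
        # Reset or another socket error: treat the service as not ready.
        return "retry"
    # b'' means the peer closed the connection; any byte counts as a banner.
    return "up" if data else "retry"


def probe_once(host: str, port: int,
               connect_timeout: float = 2.0,
               recv_timeout: float = 0.5) -> str:
    """Connect once and classify the result (hypothetical wrapper)."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout) as sock:
            return classify_probe(sock, recv_timeout)
    except OSError:
        # Refused, unreachable, or connect timed out.
        return "retry"
```

A caller would loop on `probe_once` with a short sleep between attempts and an overall deadline. Keeping the classification separate from the connect step makes the three cases easy to exercise with `socket.socketpair()` without booting a VM.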