Root causes and fixes documented in TIER3-BRINGUP.md. Summary:
1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.
2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
modules selected on SLIRP runs; fix: always filter requires_bridge.
3. cmd/unix/interact creates no session.list entry → session_open_timeout
every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.
4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
fix: extra_host_port:extra_host_port mapping so guest binds the
per-slot LPORT directly.
5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
fix: requires_bridge=true filters it from SLIRP fleet runs.
6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
boots (~60 s too early); fix: replace TCP probe with serial console
_wait_for_serial_login that waits for actual "login:" prompt.
7. Stale QEMU survives orchestrator restart (start_new_session=True) →
holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
old pidfile before rmtree.
8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.
9. msfrpcd 6.x returns bytes for all string values even with raw=False;
fix: MSFRpcClient._str() recursive decoder applied to all responses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the Dev_REL1_043026 merge of main.
Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
Symptom: All Tier-3 slots time out at 300 s waiting for the target
service. QEMU starts with netdev tap instead of netdev user (SLIRP).
Root cause: launch_target.sh checks BRIDGE to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
BRIDGE=br-malware from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.
Fix: fleet.py _run_slot() now calls env.pop("BRIDGE", None) before
launching run_tier3_demo.py. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
Files: orchestrator/fleet.py
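The fix above can be sketched as follows. This is a minimal illustration, not the actual code in fleet.py: the helper name tier3_env is hypothetical, and only the env.pop("BRIDGE", None) call is taken from the fix description.

```python
def tier3_env(parent_env):
    """Return a copy of parent_env that is safe to pass to a Tier-3 subprocess.

    Hypothetical helper: copies the mapping so the parent's own environment
    (still needed by Tier-2 tap VMs) is left untouched.
    """
    env = dict(parent_env)
    # Dropping BRIDGE makes launch_target.sh fall back to SLIRP networking.
    # The None default keeps this a no-op when BRIDGE was never set.
    env.pop("BRIDGE", None)
    return env


parent = {"PATH": "/usr/bin", "BRIDGE": "br-malware"}
child = tier3_env(parent)
```

The resulting mapping would then be passed as env=child to subprocess.run() or subprocess.Popen() when launching run_tier3_demo.py.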
Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
Symptom: distccd_command_exec and php_cgi_arg_injection appear in
usable_modules even on SLIRP-only runs. Exploit fires but the reverse-shell
payload can't call back (no guest egress on restrict=on).
Root cause: usable_modules filtering was conditioned on bridge_iface
being set in the environment. When BRIDGE was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.
Fix: usable_modules now always filters requires_bridge=True modules
regardless of the BRIDGE env var. The requires_bridge field in the module
TOML is authoritative.
Files: orchestrator/fleet.py, exploits/modules/*.toml
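A rough sketch of the unconditional filter, assuming each module TOML has already been parsed into a dict with a name key and an optional requires_bridge key. The function name and dict layout are illustrative assumptions; only the requires_bridge field comes from the report.

```python
def filter_usable_modules(modules):
    """Keep only modules that can run on a SLIRP-only target.

    The decision depends solely on each module's requires_bridge field,
    never on whether BRIDGE happens to be set in the environment.
    A missing field is treated as False (assumption).
    """
    return [m for m in modules if not m.get("requires_bridge", False)]


modules = [
    {"name": "samba_usermap_script", "requires_bridge": False},
    {"name": "distccd_command_exec", "requires_bridge": True},
    {"name": "php_cgi_arg_injection", "requires_bridge": True},
    {"name": "vsftpd_234_backdoor", "requires_bridge": True},
]
usable = [m["name"] for m in filter_usable_modules(modules)]
```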
Bug 3 — cmd/unix/interact creates no persistent session
Symptom: samba_usermap_script fires (job_id=None), no session appears in
session.list after 30 s. The exploit succeeds on the wire but the driver
reports session_open_timeout.
Root cause: cmd/unix/interact is a console-only payload. It attaches
directly to the module's job console and does NOT create a background
Meterpreter/shell session visible via session.list. msfrpcd's
module.execute returns job_id=None (no background job), so
wait_for_new_session polls until its deadline without ever seeing a session.
Fix: Changed payload to cmd/unix/bind_perl with LPORT=4444. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to RHOSTS:LPORT after the exploit fires, creating a proper shell session.
Files: exploits/modules/samba_usermap_script.toml
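For context, the polling behaviour described above might look roughly like this. The signature is an assumption (the real wait_for_new_session is not shown in this report); list_sessions stands in for whatever wraps the session.list RPC call and is assumed to return a dict keyed by integer session id.

```python
import time


def wait_for_new_session(list_sessions, known_ids, timeout=30.0, interval=1.0):
    """Poll list_sessions() until a session id outside known_ids appears.

    Returns the lowest new session id, or None once the deadline passes
    (the driver would then report session_open_timeout).
    """
    deadline = time.monotonic() + timeout
    while True:
        new_ids = set(list_sessions()) - set(known_ids)
        if new_ids:
            return min(new_ids)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With a payload that never registers a session, every call to list_sessions() returns the same ids and the function always ends on the None branch, which matches the symptom.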
Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
Symptom: For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. ss -tlnp on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.
Root cause: The extra hostfwd was host:5444→guest:4444 (old guest port)
but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
Fix: Extra hostfwd now uses extra_host_port:extra_host_port on both
sides. extra_host_port = base_port + slot * 1000 is the per-slot LPORT, and
the guest binds that exact port.
Files: orchestrator/fleet.py
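The corrected mapping can be illustrated with a small sketch. The function name and the 127.0.0.1 bind address are assumptions; the port formula is the one given in the fix, and the hostfwd string follows QEMU's user-mode netdev syntax (hostfwd=tcp:hostaddr:hostport-guestaddr:guestport).

```python
def extra_hostfwd(base_port, slot, host="127.0.0.1"):
    """Build the per-slot bind-shell port and its QEMU hostfwd rule.

    The same port number is used on the host and guest sides, so the
    guest's bind_perl listener (LPORT) is what the forward points at.
    """
    extra_host_port = base_port + slot * 1000
    rule = f"hostfwd=tcp:{host}:{extra_host_port}-:{extra_host_port}"
    return extra_host_port, rule


port, rule = extra_hostfwd(base_port=4444, slot=1)
```

For slot 1 with a base port of 4444 this gives LPORT 5444 and a forward from host 5444 to guest 5444, replacing the broken 5444→4444 mapping.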
Bug 5 — vsftpd module port 6200 collision across concurrent slots
Symptom: Multiple Tier-3 slots running vsftpd_234_backdoor all try to
hostfwd port 6200 (the backdoor bind port). The QEMU instances for slots 1+
fail to start because port 6200 is already bound by slot 0's QEMU.
Root cause: vsftpd's backdoor hardcodes port 6200 in both the vulnerable
binary and the Metasploit module. There is no LPORT override possible. With
SLIRP+hostfwd, all concurrent slots would have to share the same host port,
which is impossible.
Fix: Marked vsftpd_234_backdoor.toml with requires_bridge = true. The
fleet runner filters it from usable_modules on SLIRP runs. When a bridge is
available each guest gets its own IP, and msfrpcd connects to guest_ip:6200
directly.
Files: exploits/modules/vsftpd_234_backdoor.toml
Bug 6 — SLIRP false-positive in _wait_for_tcp causes premature exploit fire
Symptom: Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s
to boot Samba. Result: session_open_timeout every episode.
Root cause: SLIRP's usermode TCP stack completes the TCP three-way
handshake immediately for any port that has a hostfwd rule, regardless of
whether the guest OS has booted, so a bare socket.create_connection() always
succeeds. Adding a recv() with a short timeout (0.5 s) does not help: during
very early boot SLIRP cannot RST the connection (the guest TCP stack is not up
yet), so the connection hangs open and recv() raises socket.timeout before
SLIRP can determine the guest state.
Fix: Replaced _wait_for_tcp with _wait_for_serial_login. The new
function connects to QEMU's serial console socket (serial.sock) right after
the pidfile appears and streams boot output until "login:" is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.
Timing:
- serial.sock is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints "metasploitable login:" ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.
Files: tools/run_tier3_demo.py
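A minimal sketch of the serial-console wait, assuming serial.sock is a Unix stream socket exposed by QEMU. The signature, buffer size, and return convention are assumptions and may differ from the real _wait_for_serial_login.

```python
import socket
import time


def wait_for_serial_login(sock_path, timeout=300.0, marker=b"login:"):
    """Stream QEMU serial output until marker appears or timeout expires.

    Returns True when the marker is seen, False on timeout or if QEMU
    closes the console first. Connection errors propagate to the caller.
    """
    deadline = time.monotonic() + timeout
    buf = b""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            s.settimeout(remaining)
            try:
                chunk = s.recv(4096)
            except socket.timeout:
                return False
            if not chunk:
                return False  # peer closed the console socket
            buf += chunk
            if marker in buf:
                return True
            # Keep only a short tail so a marker split across two
            # reads is still detected without unbounded growth.
            buf = buf[-len(marker):]
```

Unlike the TCP probe, this only returns True when the guest itself has written the prompt, so a SLIRP-side handshake cannot produce a false positive.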
Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
Symptom: After a systemd restart of cis490-orchestrator, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
start_new_session=True so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up — but the stale QEMU has different hostfwd mappings (no bind port
for the current module), so the exploit never lands.
Fix: run_tier3_demo.py reads the old qemu.pid file from the run
directory before that directory is removed and recreated. If a PID is found,
os.killpg(pgid, SIGTERM) terminates the old QEMU process group, followed by a
1.5 s sleep to let QEMU exit before the port is rebound.
Files: tools/run_tier3_demo.py
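The cleanup step might be sketched like this. The function name and return convention are assumptions, as is the guard against signalling the caller's own process group, which is an added safety check not mentioned in the report.

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """Terminate the process group recorded in an old pidfile, if any.

    Returns True if a SIGTERM was sent, False if there was nothing to kill.
    Caveat: a recycled PID would cause an unrelated group to be signalled.
    """
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    try:
        pgid = os.getpgid(pid)
        if pgid == os.getpgrp():
            return False  # never signal our own process group
        os.killpg(pgid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        return False
    time.sleep(grace)  # give QEMU time to exit and release hostfwd ports
    return True
```

Because QEMU is started with start_new_session=True, its process group id equals its own PID, so killing by group reaches QEMU and any helpers it spawned without touching the orchestrator.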
Bug 8 — PORT_BASE default uses privileged ports (< 1024)
Symptom: launch_target.sh's default PORT_BASE was 21 + SLOT * 100.
On Tier-2 hosts without Metasploitable2, standalone run_tier3_demo.py tries
to bind port 21 on loopback. The cis490 service user cannot bind ports
< 1024. QEMU exits immediately.
Fix: Default changed to 2021 + SLOT * 100. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).
Files: vm/launch_target.sh, scripts/install-tier-3-4.sh
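The new default is simple arithmetic. As an illustration in Python (the real default lives in launch_target.sh, and the function name and range check here are assumptions):

```python
PRIVILEGED_PORT_LIMIT = 1024  # ports below this need elevated privileges


def default_port_base(slot):
    """Return the default PORT_BASE for a slot: 2021 + slot * 100."""
    port = 2021 + slot * 100
    if not PRIVILEGED_PORT_LIMIT <= port <= 65535:
        raise ValueError(f"slot {slot} yields unusable port {port}")
    return port


old_default_slot0 = 21 + 0 * 100   # previous scheme, privileged
new_default_slot0 = default_port_base(0)
```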
Bug 9 — msfrpc module.execute response is raw msgpack bytes, not str
Symptom: Key lookups on the module.execute response raise KeyError
or fail silently because msgpack returns bin type (bytes) for all string
values, even with raw=False on some Metasploit 6.x builds. A lookup such as
resp["job_id"] misses because the actual key is b"job_id".
Fix: Added MSFRpcClient._str() to recursively decode bytes→str in all
msgpack response dicts. Applied to module.execute and session.list.
Files: exploits/msfrpc.py
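A recursive decoder along these lines would cover the cases described. It is a sketch only: the standalone function name, the UTF-8 assumption, and the errors="replace" policy are illustrative choices and not necessarily what MSFRpcClient._str() does.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively convert bytes to str in a decoded msgpack structure.

    Handles dict keys and values, lists, and tuples (returned as lists).
    Other types are returned unchanged.
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors="replace")
    if isinstance(obj, dict):
        return {
            decode_bytes(k, encoding): decode_bytes(v, encoding)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [decode_bytes(item, encoding) for item in obj]
    return obj


raw = {b"job_id": None, b"uuid": b"abc123"}
resp = decode_bytes(raw)
```

Decoding every bytes value is lossy for genuinely binary fields, so a real client may want to skip responses that carry raw data.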
Net result after all fixes
With fixes 1–9 applied:
- All 4 Tier-3 slots use SLIRP+hostfwd with correct per-slot port mapping.
- samba_usermap_script fires cmd/unix/bind_perl with the correct per-slot LPORT; msfrpcd connects to the bind port via hostfwd.
- The exploit fires only after Metasploitable2 confirms its login prompt on the serial console (~60 s after QEMU start).
- Sessions open, workloads execute, episodes complete with session_open events (not session_open_timeout).