Bug 10: _wait_for_tcp returned on recv()→b'' (connection closed by peer),
falsely signalling service-ready. Only socket.timeout or non-empty data
are genuine ready signals; b'' now retries.
Bug 11: distccd_command_exec and unreal_ircd_3281_backdoor incorrectly
had requires_bridge=true. bind_perl payloads connect inward (host→guest
via hostfwd), not outward — no bridge egress needed. Both modules now
run on SLIRP-only fleet slots.
Bug 12: msgpack.unpackb crashed on integer session IDs from msfrpcd 6.x
(strict_map_key=True default). Added strict_map_key=False.
Bug 13 (documented): samba_usermap_script removed from catalog (NoReply
on every fire — already handled in dca6144 on origin/main).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the Dev_REL1_043026 merge of main.
Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
Symptom: All Tier-3 slots time out at 300 s waiting for the target
service. QEMU starts with netdev tap instead of netdev user (SLIRP).
Root cause: launch_target.sh checks BRIDGE to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
BRIDGE=br-malware from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.
Fix: fleet.py _run_slot() now calls env.pop("BRIDGE", None) before
launching run_tier3_demo.py. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
Files: orchestrator/fleet.py
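The environment scrubbing described in the Bug 1 fix can be sketched as follows. This is a minimal illustration, not the code in fleet.py: the helper name tier3_env is hypothetical, and only the env.pop("BRIDGE", None) call comes from the fix itself.

```python
import os


def tier3_env(parent_env=None):
    """Return a copy of the environment with BRIDGE removed.

    Hypothetical helper: the real _run_slot() may build its
    environment differently.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    # pop() with a default is a no-op when BRIDGE is absent,
    # so the same code path works on hosts without Tier-2 tap setup.
    env.pop("BRIDGE", None)
    return env
```

The result would typically be passed as the env argument of subprocess.Popen when launching run_tier3_demo.py, so the parent's BRIDGE value never reaches launch_target.sh.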
Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
Symptom: distccd_command_exec and php_cgi_arg_injection appear in
usable_modules even on SLIRP-only runs. Exploit fires but the reverse-shell
payload can't call back (no guest egress on restrict=on).
Root cause: usable_modules filtering was conditioned on bridge_iface
being set in the environment. When BRIDGE was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.
Fix: usable_modules now always filters requires_bridge=True modules
regardless of the BRIDGE env var. The requires_bridge field in the module
TOML is authoritative.
Files: orchestrator/fleet.py, exploits/modules/*.toml
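The filtering rule from the Bug 2 fix might look roughly like this. The sketch assumes each module is a dict parsed from its TOML file, that a missing requires_bridge key means false, and that bridge availability is passed in explicitly; all three are assumptions, and the function signature is hypothetical.

```python
def usable_modules(modules, bridge_available):
    """Keep only modules that can run with the networking on offer.

    The requires_bridge field is treated as authoritative: the filter
    always runs, and it never infers usability from whether the
    BRIDGE environment variable happens to be set.
    """
    return [
        m for m in modules
        if bridge_available or not m.get("requires_bridge", False)
    ]
```

On a SLIRP-only run, usable_modules(catalog, bridge_available=False) drops every module marked requires_bridge = true, so reverse-shell modules are never fired without guest egress.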
Bug 3 — cmd/unix/interact creates no persistent session
Symptom: samba_usermap_script fires (job_id=None), no session appears in
session.list after 30 s. The exploit succeeds on the wire but the driver
reports session_open_timeout.
Root cause: cmd/unix/interact is a console-only payload. It attaches
directly to the module's job console — it does NOT create a background
Meterpreter/shell session visible via session.list. msfrpcd's
module.execute returns job_id=None (no background job), and
wait_for_new_session polls until its timeout expires.
Fix: Changed payload to cmd/unix/bind_perl with LPORT=4444. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to RHOSTS:LPORT after the exploit fires, creating a proper shell session.
Files: exploits/modules/samba_usermap_script.toml
Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
Symptom: For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. ss -tlnp on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.
Root cause: The extra hostfwd was host:5444→guest:4444 (old guest port)
but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
Fix: Extra hostfwd now uses extra_host_port:extra_host_port on both
sides. extra_host_port = base_port + slot * 1000 is the per-slot LPORT, and
the guest binds that exact port.
Files: orchestrator/fleet.py
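The port arithmetic in the Bug 4 fix can be checked with a short sketch. The formula is the one given above; the helper names, the base port of 4444 (inferred from slot 1 mapping to 5444), and the 127.0.0.1 host address are assumptions for illustration.

```python
def extra_host_port(base_port, slot):
    """Per-slot LPORT: the guest binds this port and the host forwards it."""
    return base_port + slot * 1000


def hostfwd_rule(base_port, slot, host_addr="127.0.0.1"):
    """Build a QEMU user-mode hostfwd rule using the same port on both sides."""
    port = extra_host_port(base_port, slot)
    # Host port and guest port are identical, so the path
    # msfrpcd -> host:port -> guest:port matches what bind_perl listens on.
    return f"hostfwd=tcp:{host_addr}:{port}-:{port}"
```

For slot 1 with a base of 4444, this gives hostfwd=tcp:127.0.0.1:5444-:5444, which is the mapping the guest's bind shell needs.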
Bug 5 — vsftpd module port 6200 collision across concurrent slots
Symptom: Multiple Tier-3 slots running vsftpd_234_backdoor all try to hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start because port 6200 is already bound by slot 0's QEMU.
Root cause: vsftpd's backdoor hardcodes port 6200 in both the vulnerable binary and the Metasploit module. No LPORT override is possible. With SLIRP+hostfwd, all concurrent slots would need to bind the same host port, which is impossible.
Fix: Marked vsftpd_234_backdoor.toml with requires_bridge = true. The
fleet runner filters it from usable_modules on SLIRP runs. When a bridge is
available each guest gets its own IP, and msfrpcd connects to guest_ip:6200
directly.
Files: exploits/modules/vsftpd_234_backdoor.toml
Bug 6 — SLIRP false-positive in _wait_for_tcp causes premature exploit fire
Symptom: Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s
to boot Samba. Result: session_open_timeout every episode.
Root cause: SLIRP's usermode TCP stack completes the TCP three-way
handshake (SYN-ACK) immediately for any port that has a hostfwd rule,
regardless of whether the guest OS has booted. A bare socket.create_connection()
always succeeds. Even a recv() with a short timeout (0.5 s) ends in
socket.timeout, because during very early boot SLIRP cannot RST the connection
(the guest TCP stack is not up yet): the connection hangs open and the recv
deadline expires before SLIRP can determine the guest state.
Fix: Replaced _wait_for_tcp with _wait_for_serial_login. The new
function connects to QEMU's serial console socket (serial.sock) right after
the pidfile appears and streams boot output until "login:" is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.
Timing:
- serial.sock is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints "metasploitable login:" ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.
Files: tools/run_tier3_demo.py
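A serial-console wait of the kind described in the Bug 6 fix could be sketched as below. This is not the actual _wait_for_serial_login: the function name, the 120 s default timeout, and the assumption that serial.sock is a UNIX stream socket are illustrative choices.

```python
import socket
import time


def wait_for_serial_login(sock_path, timeout=120.0, marker=b"login:"):
    """Stream serial console output until marker appears.

    Returns True once the marker is seen, False on timeout or if the
    console closes first.
    """
    deadline = time.monotonic() + timeout
    buf = b""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            s.settimeout(remaining)
            try:
                chunk = s.recv(4096)
            except socket.timeout:
                return False
            if not chunk:
                return False  # console closed before the prompt appeared
            buf += chunk
            if marker in buf:
                return True
            # Keep a short tail so a marker split across two reads is still found.
            buf = buf[-len(marker):]
```

Matching on the bare "login:" suffix rather than the full "metasploitable login:" string keeps the sketch independent of the guest hostname.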
Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
Symptom: After a systemd restart of cis490-orchestrator, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
start_new_session=True so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up — but the stale QEMU has different hostfwd mappings (no bind port
for the current module), so the exploit never lands.
Fix: run_tier3_demo.py reads the old qemu.pid file from the run
directory before recreating it. If a PID is found, os.killpg(pgid, SIGTERM)
terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
exit before the port is rebound.
Files: tools/run_tier3_demo.py
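The stale-process cleanup from the Bug 7 fix might be sketched like this. The function name and return values are hypothetical, and the guard against signalling the caller's own process group is an extra precaution not mentioned in the fix.

```python
import os
import signal
import time


def kill_stale_qemu(pidfile, grace=1.5):
    """Terminate the process group recorded in an old pidfile.

    Returns True if a SIGTERM was sent, False if there was nothing to kill.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or it did not contain a PID
    try:
        pgid = os.getpgid(pid)
        if pgid == os.getpgid(0):
            return False  # never signal our own process group
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # the old QEMU has already exited
    time.sleep(grace)  # give QEMU time to release its hostfwd ports
    return True
```

Because QEMU is started with start_new_session=True, its process group ID equals its own PID, so killpg reaches QEMU and any helpers it spawned without touching the orchestrator. A reused PID would still be signalled; a more careful version would verify the process name first.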
Bug 8 — PORT_BASE default uses privileged ports (< 1024)
Symptom: launch_target.sh's default PORT_BASE was 21 + SLOT * 100.
On Tier-2 hosts without Metasploitable2, standalone run_tier3_demo.py tries
to bind port 21 on loopback. The cis490 service user cannot bind ports
< 1024. QEMU exits immediately.
Fix: Default changed to 2021 + SLOT * 100. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).
Files: vm/launch_target.sh, scripts/install-tier-3-4.sh
Bug 9 — msfrpc module.execute response is raw msgpack bytes, not str
Symptom: Key lookups on the module.execute response raise KeyError
or fail silently because msgpack returns bin type (bytes) for all string
values, even with raw=False on some Metasploit 6.x builds.
Fix: Added MSFRpcClient._str() to recursively decode bytes→str in all
msgpack response dicts. Applied to module.execute and session.list.
Files: exploits/msfrpc.py
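A recursive bytes-to-str decoder of the kind described in the Bug 9 fix can be sketched as follows. The standalone function name, the UTF-8 choice, and the errors="replace" policy are assumptions; the real MSFRpcClient._str() may differ.

```python
def decode_bytes(obj):
    """Recursively convert bytes keys and values to str.

    Non-bytes scalars (ints, None, bools) pass through unchanged,
    so integer session IDs used as dict keys survive.
    """
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        return {decode_bytes(k): decode_bytes(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [decode_bytes(x) for x in obj]
    return obj
```

After decoding, a lookup such as resp["job_id"] works whether msfrpcd sent the key as str or as bin.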
Bug 10 — _wait_for_tcp returns success on b'' (connection-closed-by-peer)
Symptom: Log shows "target service is up" within 0.5 s of the 65 s boot
floor, but all exploit fires time out waiting for a session. FTP (port 21),
Samba (139), and distccd (3632) all returned b''. The VM's services were not
up; the probe was wrong.
Root cause: When recv(1) returns b'' (empty bytes), Python raises no
exception. The code fell through to return, incorrectly reporting "service
is up". b'' means SLIRP forwarded the connection to the guest, the guest's
TCP stack RST'd (no service listening), and SLIRP converted RST→FIN → the
host sees connection closed. Only socket.timeout (remote end holding the
connection open, waiting for client data) and non-empty data (banner
received) are genuine ready signals.
Fix: Changed recv(1) to save the return value. On socket.timeout,
return immediately (genuine up). On non-empty data, return (banner). On
b'', set last_err and continue (retry).
Files: tools/run_tier3_demo.py
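The three recv() outcomes from the Bug 10 fix can be sketched as a probe plus a retry loop. Function names, the 1 s retry interval, and the 300 s default deadline are assumptions for illustration; the sketch covers only the b'' handling and does not address the early-boot socket.timeout case from Bug 6.

```python
import socket
import time


def probe_once(host, port, recv_timeout=0.5):
    """Return True if the service looks up, False if the peer closed."""
    with socket.create_connection((host, port), timeout=recv_timeout) as s:
        s.settimeout(recv_timeout)
        try:
            data = s.recv(1)
        except socket.timeout:
            return True  # peer holds the connection open, waiting for client data
        # Non-empty data is a banner byte (up); b'' means closed by peer (not up).
        return data != b""


def wait_for_tcp(host, port, deadline_s=300.0, interval=1.0):
    """Retry probe_once until it reports ready or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    last_err = None
    while time.monotonic() < deadline:
        try:
            if probe_once(host, port):
                return
            last_err = "connection closed by peer (recv returned b'')"
        except OSError as exc:  # refused, reset, or connect timeout
            last_err = str(exc)
        time.sleep(interval)
    raise TimeoutError(f"{host}:{port} not ready: {last_err}")
```

Saving the return value of recv(1) is the essential change: without it, b'' and a real banner are indistinguishable to the caller.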
Bug 11 — distccd and unreal_ircd incorrectly marked requires_bridge = true
Symptom: distccd_command_exec and unreal_ircd_3281_backdoor were filtered from
usable_modules on every SLIRP-only run, even though their cmd/unix/bind_perl
payloads create an inward-connecting bind shell (host connects to guest), which
does NOT require the bridge.
Root cause: The comment in distccd_command_exec.toml said "needs bridge so
the guest can reach the attacker" — correct for reverse_tcp payloads, wrong for
bind_perl. bind_perl listens on the guest; msfrpcd connects to the hostfwd'd
loopback port. No guest egress is needed.
Fix: Set requires_bridge = false in both modules. The fleet already adds
per-slot hostfwd entries for extra_target_ports, so these modules now work on
SLIRP+hostfwd runs without any other change.
Files: exploits/modules/distccd_command_exec.toml,
exploits/modules/unreal_ircd_3281_backdoor.toml
Bug 12 — msgpack.unpackb crashes on integer session IDs
Symptom: wait_for_new_session raises ValueError: int is not allowed for map key when msfrpcd returns a session dict keyed by integer session IDs.
Traceback seen in slot-0 logs on 2026-05-01.
Root cause: msgpack.unpackb(raw, raw=False) defaults to
strict_map_key=True, which rejects non-string keys. Metasploit 6.x msfrpcd
encodes session IDs as msgpack int64 map keys.
Fix: Added strict_map_key=False to the unpackb call in _raw_call.
Files: exploits/msfrpc.py
Bug 13 — samba_usermap_script never opens a session (removed from catalog)
Symptom: multi/samba/usermap_script fired, port 4444 bound in guest, but
Metasploit reported Rex::Proto::SMB::Exceptions::NoReply on every run.
session.list stayed empty for the full 30 s timeout.
Root cause: The SMB auth connection is disrupted when Samba's
username map script executes the injected command (smbd kills the auth
handler). Metasploit never received an SMB response → marked exploit "failed"
→ skipped calling the bind-shell handler → session never created.
Fix: Removed samba_usermap_script.toml from the catalog. The fleet now
uses distccd_command_exec and unreal_ircd_3281_backdoor as SLIRP-capable
modules (see Bug 11 fix). Both protocols return a proper response after the
exploit fires, so Metasploit's handler is called and sessions open.
Files: exploits/modules/samba_usermap_script.toml (deleted),
orchestrator/fleet.py
Net result after all fixes
With fixes 1–13 applied:
- _wait_for_tcp correctly waits until a service is genuinely listening (returns only on socket.timeout or non-empty banner data).
- distccd_command_exec and unreal_ircd_3281_backdoor are now available on SLIRP-only runs; samba_usermap_script is removed.
- msgpack.unpackb accepts integer session ID keys without crashing.
- Sessions open, workloads execute, episodes complete with session_open events.