Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc (i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci) caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1. Services never started; the b'' probe fix (Bug 10) then correctly waited out the full timeout with no result.

Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor which is requires_bridge=true and has a hardcoded port-6200 backdoor. Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".

manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest egress, SLIRP-safe). distccd returns a valid protocol response so MSF's handler runs and session_open fires.

Verified against Metasploitable2 sourceforge image sha256 a8c019c3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the Dev_REL1_043026 merge of main.
Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
Symptom: All Tier-3 slots time out at 300 s waiting for the target
service. QEMU starts with netdev tap instead of netdev user (SLIRP).
Root cause: launch_target.sh checks BRIDGE to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
BRIDGE=br-malware from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.
Fix: fleet.py _run_slot() now calls env.pop("BRIDGE", None) before
launching run_tier3_demo.py. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
Files: orchestrator/fleet.py
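The environment scrub described in the fix can be sketched as follows. This is a minimal illustration, not the actual _run_slot() code; the helper name build_tier3_env is hypothetical.

```python
import os


def build_tier3_env(parent_env=None):
    """Return a copy of the parent environment with BRIDGE removed.

    build_tier3_env is a hypothetical helper name; the source only states
    that _run_slot() calls env.pop("BRIDGE", None) before launching.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    # Without BRIDGE, launch_target.sh falls back to SLIRP (netdev user).
    env.pop("BRIDGE", None)
    return env


parent = {"PATH": "/usr/bin", "BRIDGE": "br-malware"}
child = build_tier3_env(parent)
print("BRIDGE" in child)
```

The copy matters: the parent mapping keeps BRIDGE for the Tier-2 tap setup, and only the Tier-3 subprocess sees the scrubbed environment (for example via subprocess.Popen(cmd, env=child)).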
Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
Symptom: distccd_command_exec and php_cgi_arg_injection appear in
usable_modules even on SLIRP-only runs. Exploit fires but the reverse-shell
payload can't call back (no guest egress on restrict=on).
Root cause: usable_modules filtering was conditioned on bridge_iface
being set in the environment. When BRIDGE was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.
Fix: usable_modules now always filters requires_bridge=True modules
regardless of the BRIDGE env var. The requires_bridge field in the module
TOML is authoritative.
Files: orchestrator/fleet.py, exploits/modules/*.toml
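A minimal sketch of the unconditional filter, assuming the catalog is already loaded into a dict keyed by module name. The dict shape, the function name slirp_usable_modules, and the sample flag values are illustrative assumptions, not the fleet.py implementation.

```python
def slirp_usable_modules(catalog):
    """Return the names of modules that may run on a SLIRP-only slot.

    The decision depends only on each module's requires_bridge field,
    never on whether BRIDGE happens to be set in the environment.
    """
    usable = []
    for name, meta in catalog.items():
        # Assumption: a missing field is treated as "needs a bridge" so an
        # incomplete entry cannot slip through and fire without egress.
        if not meta.get("requires_bridge", True):
            usable.append(name)
    return usable


# Illustrative catalog; flag values here are examples only.
catalog = {
    "php_cgi_arg_injection": {"requires_bridge": True},
    "vsftpd_234_backdoor": {"requires_bridge": True},
    "unreal_ircd_3281_backdoor": {"requires_bridge": False},
}
print(slirp_usable_modules(catalog))
```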
Bug 3 — cmd/unix/interact creates no persistent session
Symptom: samba_usermap_script fires (job_id=None), no session appears in
session.list after 30 s. The exploit succeeds on the wire but the driver
reports session_open_timeout.
Root cause: cmd/unix/interact is a console-only payload. It attaches
directly to the module's job console and does NOT create a background
Meterpreter/shell session visible via session.list. msfrpcd's
module.execute returns job_id=None (no background job), and
wait_for_new_session polls until its 30 s timeout expires.
Fix: Changed payload to cmd/unix/bind_perl with LPORT=4444. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to RHOSTS:LPORT after the exploit fires, creating a proper shell session.
Files: exploits/modules/samba_usermap_script.toml
Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
Symptom: For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. ss -tlnp on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.
Root cause: The extra hostfwd was host:5444→guest:4444 (old guest port)
but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
Fix: Extra hostfwd now uses extra_host_port:extra_host_port on both
sides. extra_host_port = base_port + slot * 1000 is the per-slot LPORT, and
the guest binds that exact port.
Files: orchestrator/fleet.py
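The port arithmetic can be checked with a short sketch. The function names and the 127.0.0.1 bind address are assumptions; the formula base_port + slot * 1000 and the same-port-on-both-sides rule come from the fix above.

```python
def slot_lport(base_port, slot):
    """Per-slot LPORT: base_port + slot * 1000."""
    return base_port + slot * 1000


def bind_shell_hostfwd(base_port, slot, host_addr="127.0.0.1"):
    """Build the QEMU hostfwd rule for the bind-shell port of one slot.

    The same port appears on the host side and the guest side because the
    guest's bind_perl listens on the per-slot LPORT, not on the base port.
    """
    port = slot_lport(base_port, slot)
    return f"hostfwd=tcp:{host_addr}:{port}-:{port}"


for slot in range(3):
    print(slot, bind_shell_hostfwd(4444, slot))
```

For slot 1 with a base of 4444 this yields a 5444-to-5444 mapping, which is the path that was missing when the rule still pointed at guest port 4444.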
Bug 5 — vsftpd module port 6200 collision across concurrent slots
Symptom: Multiple Tier-3 slots running vsftpd_234_backdoor all try to hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start because port 6200 is already bound by slot 0's QEMU.
Root cause: vsftpd's backdoor hardcodes port 6200 in both the vulnerable binary and the Metasploit module. There is no LPORT override possible. With SLIRP+hostfwd, all concurrent slots must use the same host port, which is impossible.
Fix: Marked vsftpd_234_backdoor.toml with requires_bridge = true. The
fleet runner filters it from usable_modules on SLIRP runs. When a bridge is
available each guest gets its own IP, and msfrpcd connects to guest_ip:6200
directly.
Files: exploits/modules/vsftpd_234_backdoor.toml
Bug 6 — SLIRP false-positive in _wait_for_tcp causes premature exploit fire
Symptom: Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s
to boot Samba. Result: session_open_timeout every episode.
Root cause: SLIRP's usermode TCP stack completes the TCP three-way
handshake (SYN-ACK) immediately for any port that has a hostfwd rule,
regardless of whether the guest OS has booted. A bare socket.create_connection()
always succeeds. Even a recv() with a short timeout (0.5 s) fires with
socket.timeout because during very early boot SLIRP cannot RST the connection
(the guest TCP stack is not up yet), so the connection hangs open and the recv
deadline fires before SLIRP can determine the guest state.
Fix: Replaced _wait_for_tcp with _wait_for_serial_login. The new
function connects to QEMU's serial console socket (serial.sock) right after
the pidfile appears and streams boot output until "login:" is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.
Timing:
- serial.sock is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints "metasploitable login:" ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.
Files: tools/run_tier3_demo.py
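A rough sketch of the serial-console wait, assuming the caller has already connected a stream socket to serial.sock. The signature, the 120 s default, and the buffer handling are illustrative choices, not the code in run_tier3_demo.py. A socketpair stands in for QEMU here so the example runs without a VM.

```python
import socket
import time


def wait_for_serial_login(sock, marker=b"login:", timeout=120.0):
    """Read boot output from a connected socket until marker appears.

    Returns True once the marker is seen, False on timeout or if the
    peer closes the console first.
    """
    deadline = time.monotonic() + timeout
    tail = b""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            return False
        if not chunk:
            return False  # console closed before the login prompt
        buf = tail + chunk
        if marker in buf:
            return True
        # Keep a short tail so a marker split across two reads is found.
        tail = buf[-len(marker):]


# Stand-in for QEMU's serial.sock: the prompt arrives split in two writes.
qemu_side, our_side = socket.socketpair()
qemu_side.sendall(b"Starting Samba daemons\nmetasploitable lo")
qemu_side.sendall(b"gin: ")
print(wait_for_serial_login(our_side, timeout=2.0))
```

Against a real VM the socket would be created with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) and connected to the serial.sock path in the run directory.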
Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
Symptom: After a systemd restart of cis490-orchestrator, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
start_new_session=True so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up — but the stale QEMU has different hostfwd mappings (no bind port
for the current module), so the exploit never lands.
Fix: run_tier3_demo.py reads the old qemu.pid file from the run
directory before recreating it. If a PID is found, os.killpg(pgid, SIGTERM)
terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
exit before the port is rebound.
Files: tools/run_tier3_demo.py
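The stale-process cleanup might look like the sketch below. The function name kill_stale_qemu and the own-process-group guard are assumptions added for illustration; the pidfile read, os.killpg with SIGTERM, and the 1.5 s grace period follow the fix described above.

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """Send SIGTERM to the process group recorded in pidfile, if any.

    Returns True if a signal was sent, False if there was nothing to kill.
    """
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or it does not contain a PID
    try:
        pgid = os.getpgid(pid)
    except (ProcessLookupError, PermissionError):
        return False  # process already gone, or not ours
    # Extra precaution (not from the source): never signal our own group.
    if pgid == os.getpgid(0):
        return False
    try:
        os.killpg(pgid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        return False
    time.sleep(grace)  # give QEMU time to exit and release its hostfwd ports
    return True
```

Calling this before the run directory's qemu.pid is recreated keeps a previous wave's QEMU from answering the new wave's port probes.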
Bug 8 — PORT_BASE default uses privileged ports (< 1024)
Symptom: launch_target.sh's default PORT_BASE was 21 + SLOT * 100.
On Tier-2 hosts without Metasploitable2, standalone run_tier3_demo.py tries
to bind port 21 on loopback. The cis490 service user cannot bind ports
< 1024. QEMU exits immediately.
Fix: Default changed to 2021 + SLOT * 100. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).
Files: vm/launch_target.sh, scripts/install-tier-3-4.sh
Bug 9 — msfrpc module.execute response is raw msgpack bytes, not str
Symptom: Key lookups on the module.execute response raise KeyError
or fail silently because msgpack returns bin type (bytes) for all string
values, even with raw=False on some Metasploit 6.x builds.
Fix: Added MSFRpcClient._str() to recursively decode bytes→str in all
msgpack response dicts. Applied to module.execute and session.list.
Files: exploits/msfrpc.py
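A recursive decoder along the lines of MSFRpcClient._str() could be sketched like this. The standalone function name to_str and the choice of errors="replace" are assumptions; the real method may differ.

```python
def to_str(obj):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        # errors="replace" is an assumption: it avoids raising on
        # non-UTF-8 payload bytes at the cost of lossy decoding.
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        return {to_str(k): to_str(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_str(item) for item in obj]
    return obj  # ints, None, bools, and str pass through unchanged


raw = {b"job_id": None, b"uuid": b"abc123", b"tags": [b"a", b"b"]}
print(to_str(raw))
```

Non-bytes keys such as integer session IDs pass through untouched, so the same helper can be applied to session.list responses.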
Bug 10 — _wait_for_tcp returns success on b'' (connection-closed-by-peer)
Symptom: Log shows "target service is up" within 0.5 s of the 65 s boot
floor, but all exploit fires time out waiting for a session. FTP (port 21),
Samba (139), and distccd (3632) all returned b''. The VM's services were not
up; the probe was wrong.
Root cause: When recv(1) returns b'' (empty bytes), Python raises no
exception. The code fell through to return, incorrectly reporting "service
is up". b'' means SLIRP forwarded the connection to the guest, the guest's
TCP stack RST'd (no service listening), and SLIRP converted RST→FIN → the
host sees connection closed. Only socket.timeout (remote end holding the
connection open, waiting for client data) and non-empty data (banner
received) are genuine ready signals.
Fix: Changed recv(1) to save the return value. On socket.timeout,
return immediately (genuine up). On non-empty data, return (banner). On
b'', set last_err and continue (retry).
Files: tools/run_tier3_demo.py
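The three recv(1) outcomes can be separated as in this sketch. The function names, timeouts, and retry interval are illustrative assumptions rather than the actual _wait_for_tcp code.

```python
import socket
import time


def probe_once(host, port, recv_timeout=0.5):
    """Return True if the service looks genuinely up, False on b''.

    Connection errors propagate as OSError for the caller to handle.
    """
    with socket.create_connection((host, port), timeout=recv_timeout) as s:
        s.settimeout(recv_timeout)
        try:
            data = s.recv(1)
        except socket.timeout:
            return True  # peer holds the connection open, waiting for us
    # Non-empty data is a banner; b'' means the peer closed the connection.
    return data != b""


def wait_for_tcp(host, port, timeout=300.0, interval=1.0):
    """Retry probe_once until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout
    last_err = None
    while time.monotonic() < deadline:
        try:
            if probe_once(host, port):
                return
            last_err = "connection closed by peer (recv returned b'')"
        except OSError as exc:
            last_err = str(exc)
        time.sleep(interval)
    raise TimeoutError(f"{host}:{port} not ready: {last_err}")
```

Only the recv() call sits inside the timeout handler, so a connect timeout is counted as a failure rather than mistaken for a silent but listening service.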
Bug 11 — distccd and unreal_ircd incorrectly marked requires_bridge = true
Symptom: distccd_command_exec and unreal_ircd_3281_backdoor were filtered from
usable_modules on every SLIRP-only run, even though their cmd/unix/bind_perl
payloads create an inward-connecting bind shell (host connects to guest), which
does NOT require the bridge.
Root cause: The comment in distccd_command_exec.toml said "needs bridge so
the guest can reach the attacker" — correct for reverse_tcp payloads, wrong for
bind_perl. bind_perl listens on the guest; msfrpcd connects to the hostfwd'd
loopback port. No guest egress is needed.
Fix: Set requires_bridge = false in both modules. The fleet already adds
per-slot hostfwd entries for extra_target_ports, so these modules now work on
SLIRP+hostfwd runs without any other change.
Files: exploits/modules/distccd_command_exec.toml,
exploits/modules/unreal_ircd_3281_backdoor.toml
Bug 12 — msgpack.unpackb crashes on integer session IDs
Symptom: wait_for_new_session raises ValueError: int is not allowed for map key when msfrpcd returns a session dict keyed by integer session IDs.
Traceback seen in slot-0 logs on 2026-05-01.
Root cause: msgpack.unpackb(raw, raw=False) defaults to
strict_map_key=True, which rejects non-string keys. Metasploit 6.x msfrpcd
encodes session IDs as msgpack int64 map keys.
Fix: Added strict_map_key=False to the unpackb call in _raw_call.
Files: exploits/msfrpc.py
Bug 13 — samba_usermap_script never opens a session (removed from catalog)
Symptom: multi/samba/usermap_script fired, port 4444 bound in guest, but
Metasploit reported Rex::Proto::SMB::Exceptions::NoReply on every run.
session.list stayed empty for the full 30 s timeout.
Root cause: The SMB auth connection is disrupted when Samba's
username map script executes the injected command (smbd kills the auth
handler). Metasploit never received an SMB response → marked exploit "failed"
→ skipped calling the bind-shell handler → session never created.
Fix: Removed samba_usermap_script.toml from the catalog. The fleet now
uses distccd_command_exec and unreal_ircd_3281_backdoor as SLIRP-capable
modules (see Bug 11 fix). Both protocols return a proper response after the
exploit fires, so Metasploit's handler is called and sessions open.
Files: exploits/modules/samba_usermap_script.toml (deleted),
orchestrator/fleet.py
Bug 14 — QEMU launch config incompatible with Metasploitable2 (boot hang)
Symptom: Every _wait_for_tcp probe returns b'' for the full timeout
(even after the Bug 10 fix). No service — FTP, Samba, distccd, IRC — ever
becomes reachable. The VM consumes CPU (QEMU runs) but nothing listens.
Root cause (three compounding issues in launch_target.sh):
- -drive if=virtio presents the disk as /dev/vda. Metasploitable2's GRUB was built for VMware SCSI (/dev/sda). Ubuntu 8.04's kernel command line says root=/dev/sda1. The kernel can't mount root on /dev/vda → kernel panic immediately after decompression. Services never start.
- -machine q35 is a PCIe chipset (Intel Q35 + ICH9). Old ISA-emulated devices and BIOS assumptions in Ubuntu 8.04 break under q35.
- -cpu host exposes AVX/XSAVE and other modern CPU features. Linux 2.6.24 doesn't know how to save/restore these in context switches; the kernel freezes or mishandles the first SIMD operation during boot.
Fix: Four changes in vm/launch_target.sh:
- -machine q35 → -machine pc (i440fx, the classic PC compatible machine)
- -drive if=virtio → -drive if=ide (Ubuntu 8.04 libata presents this as /dev/sda, matching the GRUB root= line)
- -cpu host (KVM) → -cpu kvm32 (safe 32-bit KVM model, no exotic flags)
- -device virtio-net-pci → -device e1000 (Intel e1000: universally supported since Linux 2.2, in every kernel config Metasploitable2 uses)
Files: vm/launch_target.sh
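For reference, the corrected device selection can be written out as an argument list. launch_target.sh is a shell script; this Python sketch only illustrates the flags. The binary name, the -enable-kvm flag, the disk path, and the helper name are assumptions, and options such as memory size, serial socket, and pidfile are omitted.

```python
def metasploitable2_qemu_args(disk_image, hostfwd_rules):
    """Build a QEMU argv with the legacy-friendly devices from the fix.

    hostfwd_rules are strings such as "tcp:127.0.0.1:5632-:3632".
    """
    netdev = "user,id=net0,restrict=on"
    for rule in hostfwd_rules:
        netdev += f",hostfwd={rule}"
    return [
        "qemu-system-x86_64",  # assumption: the binary name is not in the source
        "-enable-kvm",         # assumption: the source only says the VM runs under KVM
        "-machine", "pc",      # i440fx instead of q35
        "-cpu", "kvm32",       # no AVX/XSAVE exposed to Linux 2.6.24
        "-drive", f"file={disk_image},if=ide",  # guest sees /dev/sda
        "-netdev", netdev,
        "-device", "e1000,netdev=net0",         # instead of virtio-net-pci
    ]


args = metasploitable2_qemu_args(
    "Metasploitable.vmdk",  # hypothetical image path
    ["tcp:127.0.0.1:5632-:3632", "tcp:127.0.0.1:4444-:4444"],
)
print(" ".join(args))
```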
Bug 15 — Tier-3 verify uses vsftpd (bridge-only, always fails on SLIRP)
Symptom: The install-tier-3-4.sh verify step always fails. The vsftpd
module's backdoor opens port 6200 (hardcoded in the binary and the MSF
module). On SLIRP, all slots would need to share the same host port 6200,
which QEMU refuses. The verify run fails either at the _wait_for_tcp timeout
or because the exploit never opens a session.
Root cause: The verify step was left on vsftpd_234_backdoor after Bug 5
marked that module requires_bridge = true. The verify subprocess doesn't
have a bridge configured and doesn't set up the extra hostfwd for port 6200.
Fix: Changed verify to distccd_command_exec with correct SLIRP port
mappings: TARGET_PORTS="5632:3632,4444:4444" and --target-port 5632.
distccd doesn't hardcode a backdoor port — the bind shell uses the
fleet-assigned LPORT. No bridge needed.
Files: scripts/install-tier-3-4.sh
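The TARGET_PORTS value can be read as a comma-separated list of host:guest pairs. That ordering is inferred from 5632:3632 together with --target-port 5632 (3632 being distccd's port in the guest) and should be treated as an assumption, as should the function names and loopback address in this sketch.

```python
def parse_target_ports(spec):
    """Parse "host:guest,host:guest" into a list of (host, guest) ints."""
    pairs = []
    for item in spec.split(","):
        host_port, guest_port = item.strip().split(":")
        pairs.append((int(host_port), int(guest_port)))
    return pairs


def hostfwd_rules(spec, host_addr="127.0.0.1"):
    """Turn a TARGET_PORTS string into QEMU hostfwd rule strings."""
    return [
        f"tcp:{host_addr}:{host_port}-:{guest_port}"
        for host_port, guest_port in parse_target_ports(spec)
    ]


spec = "5632:3632,4444:4444"
print(parse_target_ports(spec))
print(hostfwd_rules(spec))
```

The first pair exposes distccd on host port 5632; the second forwards the bind-shell LPORT unchanged so msfrpcd can connect after the exploit fires.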
Net result after all fixes
With fixes 1–15 applied:
- Metasploitable2 boots correctly under KVM (pc machine, kvm32 CPU, ide disk, e1000 network). Services start ~60–70 s after QEMU launch.
- _wait_for_tcp correctly waits until a service is genuinely listening (returns only on socket.timeout or non-empty banner data).
- distccd_command_exec and unreal_ircd_3281_backdoor are admitted to the module catalog; both are SLIRP-compatible with cmd/unix/bind_perl.
- samba_usermap_script removed from catalog (NoReply, sessions never open).
- msgpack.unpackb accepts integer session ID keys without crashing.
- The verify step uses distccd_command_exec on SLIRP+hostfwd.
- Sessions open, workloads execute, episodes complete with session_open events.