CIS490/TIER3-BRINGUP.md
Elliott Kolden b29d30a1b2 Tier-3: fix QEMU boot, catalog admission, verify module
Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc
(i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous
config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci)
caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1.
Services never started; the b'' probe fix (Bug 10) then correctly waited
out the full timeout with no result.

Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor
which is requires_bridge=true and has a hardcoded port-6200 backdoor.
Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".

manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor
to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest
egress, SLIRP-safe). distccd returns a valid protocol response so MSF's
handler runs and session_open fires. Verified against Metasploitable2
sourceforge image sha256 a8c019c3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 16:41:41 -06:00


Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)

Bugs found and fixed during the first real-exploit fleet run on this host. All fixes are in the commits following the Dev_REL1_043026 merge of main.


Bug 1 — BRIDGE env var breaks Tier-3 target VM networking

Symptom: All Tier-3 slots time out at 300 s waiting for the target service. QEMU starts with netdev tap instead of netdev user (SLIRP).

Root cause: launch_target.sh checks BRIDGE to switch between SLIRP and tap networking. The fleet runner copied the parent environment (which had BRIDGE=br-malware from the Tier-2 tap setup) into the Tier-3 subprocess. The Tier-3 target VMs don't have a tap interface configured, so all guest traffic is dropped.

Fix: fleet.py _run_slot() now calls env.pop("BRIDGE", None) before launching run_tier3_demo.py. Tier-2 idle VMs continue to use tap; Tier-3 target VMs always use SLIRP+hostfwd.
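
As a minimal sketch of the fix (the helper name `tier3_env` is hypothetical, not taken from fleet.py), the child environment can be built as a copy of the parent's with BRIDGE removed, so the parent's own setting stays untouched for the Tier-2 path:

```python
import os


def tier3_env(parent_env=None):
    """Return a copy of the environment with BRIDGE removed.

    Hypothetical helper: the real _run_slot() may build its env differently.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    # pop() with a default never raises, whether or not BRIDGE was set.
    env.pop("BRIDGE", None)
    return env


parent = {"PATH": "/usr/bin", "BRIDGE": "br-malware"}
child = tier3_env(parent)
print("BRIDGE" in child, "BRIDGE" in parent)
```

The resulting dict would then be passed as `env=` to the subprocess call that launches run_tier3_demo.py.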

Files: orchestrator/fleet.py


Bug 2 — Bridge-requiring modules selected when BRIDGE is not available

Symptom: distccd_command_exec and php_cgi_arg_injection appear in usable_modules even on SLIRP-only runs. Exploit fires but the reverse-shell payload can't call back (no guest egress on restrict=on).

Root cause: usable_modules filtering was conditioned on bridge_iface being set in the environment. When BRIDGE was not set, ALL modules were considered usable. Modules that require bridge egress (reverse shells) silently fell through, fired, and timed out waiting for a session.

Fix: usable_modules now always filters requires_bridge=True modules regardless of the BRIDGE env var. The requires_bridge field in the module TOML is authoritative.
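
The filtering rule can be sketched as follows. The `Module` class and the `bridge_available` parameter are assumptions for illustration; the point is that an unset bridge must drop bridge-only modules rather than admit everything. The flag values shown are the ones the catalog ends up with after Bugs 5 and 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    # Hypothetical in-memory view of one module TOML file.
    name: str
    requires_bridge: bool


def usable_modules(catalog, bridge_available=False):
    """Keep a module only if it needs no bridge, or a bridge really exists."""
    return [m for m in catalog if bridge_available or not m.requires_bridge]


catalog = [
    Module("vsftpd_234_backdoor", requires_bridge=True),
    Module("distccd_command_exec", requires_bridge=False),
]
print([m.name for m in usable_modules(catalog)])
```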

Files: orchestrator/fleet.py, exploits/modules/*.toml


Bug 3 — cmd/unix/interact creates no persistent session

Symptom: samba_usermap_script fires (job_id=None), no session appears in session.list after 30 s. The exploit succeeds on the wire but the driver reports session_open_timeout.

Root cause: cmd/unix/interact is a console-only payload. It attaches directly to the module's job console and does NOT create a background Meterpreter/shell session visible via session.list. msfrpcd's module.execute returns job_id=None (no background job), and wait_for_new_session polls until it hits the session-open timeout.

Fix: Changed payload to cmd/unix/bind_perl with LPORT=4444. The bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects to RHOSTS:LPORT after the exploit fires, creating a proper shell session.

Files: exploits/modules/samba_usermap_script.toml


Bug 4 — Per-slot LPORT/hostfwd port mapping wrong

Symptom: For slots 1+, the bind-shell port is reachable on the host but msfrpcd cannot connect. ss -tlnp on the host shows port 5444 listening (QEMU) but the module tries to connect to port 4444.

Root cause: The extra hostfwd was host:5444→guest:4444 (old guest port) but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444. Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.

Fix: Extra hostfwd now uses extra_host_port:extra_host_port on both sides. extra_host_port = base_port + slot * 1000 is the per-slot LPORT, and the guest binds that exact port.
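
A short sketch of the port arithmetic, assuming a base port of 4444 and a loopback host address (the function names are illustrative). The same number appears on both sides of the hostfwd rule, so the port the guest binds is the port QEMU forwards to:

```python
def slot_lport(base_port, slot):
    """Per-slot bind-shell port: base_port + slot * 1000."""
    return base_port + slot * 1000


def bind_shell_hostfwd(base_port, slot, host_addr="127.0.0.1"):
    """Build a QEMU user-net hostfwd rule with identical host and guest ports."""
    port = slot_lport(base_port, slot)
    return f"hostfwd=tcp:{host_addr}:{port}-:{port}"


for slot in range(3):
    print(slot, bind_shell_hostfwd(4444, slot))
```

For slot 1 this yields host port 5444 forwarded to guest port 5444, which matches the LPORT the payload is told to bind.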

Files: orchestrator/fleet.py


Bug 5 — vsftpd module port 6200 collision across concurrent slots

Symptom: Multiple Tier-3 slots running vsftpd_234_backdoor all try to hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start because port 6200 is already bound by slot 0's QEMU.

Root cause: vsftpd's backdoor hardcodes port 6200 in both the vulnerable binary and the Metasploit module. There is no LPORT override possible. With SLIRP+hostfwd, all concurrent slots must use the same host port, which is impossible.

Fix: Marked vsftpd_234_backdoor.toml with requires_bridge = true. The fleet runner filters it from usable_modules on SLIRP runs. When a bridge is available each guest gets its own IP, and msfrpcd connects to guest_ip:6200 directly.

Files: exploits/modules/vsftpd_234_backdoor.toml


Bug 6 — SLIRP false-positive in _wait_for_tcp causes premature exploit fire

Symptom: Log shows "target service is up" within 0.5 s of QEMU start. The exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30-60 s to boot Samba. Result: session_open_timeout every episode.

Root cause: SLIRP's usermode TCP stack completes the TCP three-way handshake (SYN-ACK) immediately for any port that has a hostfwd rule, regardless of whether the guest OS has booted. A bare socket.create_connection() always succeeds. Even a recv() with a short timeout (0.5 s) fires with socket.timeout because during very early boot SLIRP cannot RST the connection (the guest TCP stack is not up yet), so the connection hangs open and the recv deadline fires before SLIRP can determine the guest state.

Fix: Replaced _wait_for_tcp with _wait_for_serial_login. The new function connects to QEMU's serial console socket (serial.sock) right after the pidfile appears and streams boot output until "login:" is seen. The serial console is authoritative: it reflects actual guest OS state, not SLIRP's synthetic TCP layer.

Timing:

  • serial.sock is created by QEMU at device init, before the pidfile.
  • We connect immediately after the pidfile → we receive all boot output.
  • Metasploitable2 prints "metasploitable login:" ≈ 50-70 s after QEMU start.
  • The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.
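
The serial-console wait described above might look roughly like this. It is a sketch, not the code in run_tier3_demo.py: the function name, the 120 s default, and the choice to treat EOF as failure are all assumptions. It takes an already-connected socket so it can be exercised without QEMU.

```python
import socket
import time


def wait_for_login(sock, marker=b"login:", timeout=120.0):
    """Stream console output until marker appears.

    Returns True when the marker is seen, False on timeout or EOF.
    """
    deadline = time.monotonic() + timeout
    tail = b""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            return False
        if not chunk:
            return False  # console closed, e.g. QEMU exited
        buf = tail + chunk
        if marker in buf:
            return True
        # Keep a short tail so a marker split across two reads still matches.
        tail = buf[-len(marker):]


# Demo with a socketpair standing in for QEMU's serial.sock.
console, guest = socket.socketpair()
guest.sendall(b"Starting services...\nmetasploitable log")
guest.sendall(b"in: ")
print(wait_for_login(console, timeout=2.0))
```

Against a real VM the socket would instead be an `AF_UNIX` stream socket connected to the serial.sock path in the run directory.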

Files: tools/run_tier3_demo.py


Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts

Symptom: After a systemd restart of cis490-orchestrator, the new wave's QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU from the previous wave is still running (QEMU is started with start_new_session=True so it survives the orchestrator's SIGTERM). The new episode detects the stale QEMU answering the port probe and proceeds as if the target is up — but the stale QEMU has different hostfwd mappings (no bind port for the current module), so the exploit never lands.

Fix: run_tier3_demo.py reads the old qemu.pid file from the run directory before recreating it. If a PID is found, os.killpg(pgid, SIGTERM) terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU exit before the port is rebound.
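
A sketch of the stale-process cleanup, assuming the pidfile holds a single decimal PID (the function name and return values are illustrative). Because QEMU was started in its own session, its process group is looked up from the PID rather than taken from the orchestrator:

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """SIGTERM the process group recorded in pidfile, then wait briefly.

    Returns True if a signal was sent, False if there was nothing to kill.
    """
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or unreadable contents
    try:
        os.killpg(os.getpgid(pid), signal.SIGTERM)
    except ProcessLookupError:
        return False  # process already gone
    # Give QEMU time to exit so its hostfwd ports are released.
    time.sleep(grace)
    return True
```

One caveat of any pidfile approach: if the old QEMU already exited and the PID was reused, the signal would go to an unrelated process group, so a stricter version might check the process name first.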

Files: tools/run_tier3_demo.py


Bug 8 — PORT_BASE default uses privileged ports (< 1024)

Symptom: launch_target.sh's default PORT_BASE was 21 + SLOT * 100. On Tier-2 hosts without Metasploitable2, standalone run_tier3_demo.py tries to bind port 21 on loopback. The cis490 service user cannot bind ports < 1024. QEMU exits immediately.

Fix: Default changed to 2021 + SLOT * 100. Port 2021 is above 1024 and reflects the scheme used by the fleet runner (base_port + 2000).

Files: vm/launch_target.sh, scripts/install-tier-3-4.sh


Bug 9 — msfrpc module.execute response is raw msgpack bytes, not str

Symptom: Key lookups on the module.execute response raise KeyError or fail silently because msgpack returns bin type (bytes) for all string values, even with raw=False on some Metasploit 6.x builds.

Fix: Added MSFRpcClient._str() to recursively decode bytes→str in all msgpack response dicts. Applied to module.execute and session.list.
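
A recursive decoder along these lines would cover nested dicts and lists. This is a standalone sketch, not the body of MSFRpcClient._str(); decoding with `errors="replace"` is an assumption chosen so that malformed bytes cannot raise.

```python
def to_str(obj):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        # Decode keys as well as values; non-bytes keys pass through unchanged.
        return {to_str(k): to_str(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_str(x) for x in obj]
    return obj


response = {b"job_id": None, b"uuid": b"abc123"}
print(to_str(response))
```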

Files: exploits/msfrpc.py


Bug 10 — _wait_for_tcp returns success on b'' (connection-closed-by-peer)

Symptom: Log shows "target service is up" within 0.5 s of the 65 s boot floor, but all exploit fires time out waiting for a session. FTP (port 21), Samba (139), and distccd (3632) all returned b''. The VM's services were not up; the probe was wrong.

Root cause: When recv(1) returns b'' (empty bytes), Python raises no exception. The code fell through to return, incorrectly reporting "service is up". b'' means SLIRP forwarded the connection to the guest, the guest's TCP stack RST'd (no service listening), and SLIRP converted RST→FIN → the host sees connection closed. Only socket.timeout (remote end holding the connection open, waiting for client data) and non-empty data (banner received) are genuine ready signals.

Fix: Changed recv(1) to save the return value. On socket.timeout, return immediately (genuine up). On non-empty data, return (banner). On b'', set last_err and continue (retry).
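
The three outcomes can be sketched as a single probe function. Names and the 0.5 s default are illustrative; the real _wait_for_tcp wraps this kind of check in a retry loop and records last_err.

```python
import socket


def probe_once(host, port, recv_timeout=0.5):
    """Classify one connection attempt as 'up' or 'retry'."""
    try:
        with socket.create_connection((host, port), timeout=recv_timeout) as s:
            s.settimeout(recv_timeout)
            try:
                data = s.recv(1)
            except socket.timeout:
                # Peer holds the connection open waiting for client data.
                return "up"
            # Non-empty data is a banner; b'' means the peer closed on us.
            return "up" if data else "retry"
    except OSError:
        # Connection refused, reset, or connect timeout.
        return "retry"
```

The inner `except socket.timeout` has to come before the outer `except OSError`, since a timeout is itself an OSError and would otherwise be misread as a failure.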

Files: tools/run_tier3_demo.py


Bug 11 — distccd and unreal_ircd incorrectly marked requires_bridge = true

Symptom: distccd_command_exec and unreal_ircd_3281_backdoor were filtered from usable_modules on every SLIRP-only run, even though their cmd/unix/bind_perl payloads create an inward-connecting bind shell (host connects to guest), which does NOT require the bridge.

Root cause: The comment in distccd_command_exec.toml said "needs bridge so the guest can reach the attacker" — correct for reverse_tcp payloads, wrong for bind_perl. bind_perl listens on the guest; msfrpcd connects to the hostfwd'd loopback port. No guest egress is needed.

Fix: Set requires_bridge = false in both modules. The fleet already adds per-slot hostfwd entries for extra_target_ports, so these modules now work on SLIRP+hostfwd runs without any other change.

Files: exploits/modules/distccd_command_exec.toml, exploits/modules/unreal_ircd_3281_backdoor.toml


Bug 12 — msgpack.unpackb crashes on integer session IDs

Symptom: wait_for_new_session raises ValueError: int is not allowed for map key when msfrpcd returns a session dict keyed by integer session IDs. Traceback seen in slot-0 logs on 2026-05-01.

Root cause: msgpack.unpackb(raw, raw=False) defaults to strict_map_key=True, which rejects non-string keys. Metasploit 6.x msfrpcd encodes session IDs as msgpack int64 map keys.

Fix: Added strict_map_key=False to the unpackb call in _raw_call.

Files: exploits/msfrpc.py


Bug 13 — samba_usermap_script never opens a session (removed from catalog)

Symptom: multi/samba/usermap_script fired, port 4444 bound in guest, but Metasploit reported Rex::Proto::SMB::Exceptions::NoReply on every run. session.list stayed empty for the full 30 s timeout.

Root cause: The SMB auth connection is disrupted when Samba's username map script executes the injected command (smbd kills the auth handler). Metasploit never received an SMB response → marked exploit "failed" → skipped calling the bind-shell handler → session never created.

Fix: Removed samba_usermap_script.toml from the catalog. The fleet now uses distccd_command_exec and unreal_ircd_3281_backdoor as SLIRP-capable modules (see Bug 11 fix). Both protocols return a proper response after the exploit fires, so Metasploit's handler is called and sessions open.

Files: exploits/modules/samba_usermap_script.toml (deleted), orchestrator/fleet.py


Bug 14 — QEMU launch config incompatible with Metasploitable2 (boot hang)

Symptom: Every _wait_for_tcp probe returns b'' for the full timeout (even after the Bug 10 fix). No service — FTP, Samba, distccd, IRC — ever becomes reachable. The VM consumes CPU (QEMU runs) but nothing listens.

Root cause (three compounding issues in launch_target.sh):

  1. -drive if=virtio presents the disk as /dev/vda. Metasploitable2's GRUB was built for VMware SCSI (/dev/sda). Ubuntu 8.04's kernel command line says root=/dev/sda1. The kernel can't mount root on /dev/vda → kernel panic immediately after decompression. Services never start.

  2. -machine q35 is a PCIe chipset (Intel Q35/ICH9, circa 2007). Old ISA-emulated devices and BIOS assumptions in Ubuntu 8.04 break under q35.

  3. -cpu host exposes AVX/XSAVE and other modern CPU features. Linux 2.6.24 doesn't know how to save/restore these in context switches; the kernel freezes or mishandles the first SIMD operation during boot.

Fix: Four changes in vm/launch_target.sh:

  • -machine q35 → -machine pc (i440fx, the classic PC compatible machine)
  • -drive if=virtio → -drive if=ide (Ubuntu 8.04 libata presents this as /dev/sda, matching the GRUB root= line)
  • -cpu host (KVM) → -cpu kvm32 (safe 32-bit KVM model, no exotic flags)
  • -device virtio-net-pci → -device e1000 (Intel e1000: universally supported since Linux 2.2, in every kernel config Metasploitable2 uses)
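
For illustration, the resulting option set can be written as a Python argv fragment. This is a sketch only: launch_target.sh is a shell script, the function name and the `net0` netdev id are made up here, and the QEMU binary name plus any KVM or memory options are deliberately left out because the report does not state them.

```python
def metasploitable2_qemu_opts(disk_image, hostfwd_rules):
    """Return the QEMU options for the legacy-friendly machine config."""
    netdev = ",".join(
        ["user", "id=net0", "restrict=on"]
        + [f"hostfwd={rule}" for rule in hostfwd_rules]
    )
    return [
        "-machine", "pc",                        # i440fx instead of q35
        "-cpu", "kvm32",                         # no modern CPU feature flags
        "-drive", f"file={disk_image},if=ide",   # disk appears as /dev/sda
        "-netdev", netdev,                       # SLIRP with host forwards
        "-device", "e1000,netdev=net0",          # NIC the old kernel supports
    ]


opts = metasploitable2_qemu_opts("Metasploitable.vmdk", ["tcp:127.0.0.1:5632-:3632"])
print(" ".join(opts))
```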

Files: vm/launch_target.sh


Bug 15 — Tier-3 verify uses vsftpd (bridge-only, always fails on SLIRP)

Symptom: install-tier-3-4.sh verify step always fails. The vsftpd module's backdoor opens port 6200 (hardcoded in the binary and the MSF module). On SLIRP, all slots would need to share the same host port 6200, which QEMU refuses. The verify is killed by _wait_for_tcp or by the exploit itself never reaching a session.

Root cause: The verify step was left on vsftpd_234_backdoor after Bug 5 marked that module requires_bridge = true. The verify subprocess doesn't have a bridge configured and doesn't set up the extra hostfwd for port 6200.

Fix: Changed verify to distccd_command_exec with correct SLIRP port mappings: TARGET_PORTS="5632:3632,4444:4444" and --target-port 5632. distccd doesn't hardcode a backdoor port — the bind shell uses the fleet-assigned LPORT. No bridge needed.
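
To make the mapping concrete, here is a sketch that parses a TARGET_PORTS value into hostfwd rules. It assumes each entry is `host_port:guest_port`, which is consistent with `--target-port 5632` pointing at distccd's guest port 3632, but the actual parsing lives in launch_target.sh and may differ.

```python
def parse_target_ports(spec):
    """Parse 'host:guest,host:guest' into a list of (host, guest) int pairs."""
    pairs = []
    for item in spec.split(","):
        host, guest = item.strip().split(":")
        pairs.append((int(host), int(guest)))
    return pairs


def to_hostfwd(pairs, host_addr="127.0.0.1"):
    """Render (host, guest) pairs as QEMU user-net hostfwd rule bodies."""
    return [f"tcp:{host_addr}:{h}-:{g}" for h, g in pairs]


pairs = parse_target_ports("5632:3632,4444:4444")
print(to_hostfwd(pairs))
```

The first pair exposes distccd on host port 5632; the second forwards the bind-shell LPORT unchanged, as required by the Bug 4 fix.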

Files: scripts/install-tier-3-4.sh


Net result after all fixes

With fixes 1-15 applied:

  • Metasploitable2 boots correctly under KVM (pc machine, kvm32 CPU, ide disk, e1000 network). Services start ~60-70 s after QEMU launch.
  • _wait_for_tcp correctly waits until a service is genuinely listening (returns only on socket.timeout or non-empty banner data).
  • distccd_command_exec and unreal_ircd_3281_backdoor are admitted to the module catalog; both are SLIRP-compatible with cmd/unix/bind_perl.
  • samba_usermap_script removed from catalog (NoReply, sessions never open).
  • msgpack.unpackb accepts integer session ID keys without crashing.
  • The verify step uses distccd_command_exec on SLIRP+hostfwd.
  • Sessions open, workloads execute, episodes complete with session_open events.