CIS490/TIER3-BRINGUP.md
Elliott Kolden 667f042707 Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01)
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:

1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
   instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.

2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
   modules selected on SLIRP runs; fix: always filter requires_bridge.

3. cmd/unix/interact creates no session.list entry → session_open_timeout
   every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.

4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
   fix: extra_host_port:extra_host_port mapping so guest binds the
   per-slot LPORT directly.

5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
   fix: requires_bridge=true filters it from SLIRP fleet runs.

6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
   boots (~60 s too early); fix: replace TCP probe with serial console
   _wait_for_serial_login that waits for actual "login:" prompt.

7. Stale QEMU survives orchestrator restart (start_new_session=True) →
   holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
   old pidfile before rmtree.

8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.

9. msfrpcd 6.x returns bytes for all string values even with raw=False;
   fix: MSFRpcClient._str() recursive decoder applied to all responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:26:19 -06:00

Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)

Bugs found and fixed during the first real-exploit fleet run on this host. All fixes are in the commits following the Dev_REL1_043026 merge of main.


Bug 1 — BRIDGE env var breaks Tier-3 target VM networking

Symptom: All Tier-3 slots time out at 300 s waiting for the target service. QEMU starts with netdev tap instead of netdev user (SLIRP).

Root cause: launch_target.sh checks BRIDGE to switch between SLIRP and tap networking. The fleet runner copied the parent environment (which had BRIDGE=br-malware from the Tier-2 tap setup) into the Tier-3 subprocess. The Tier-3 target VMs don't have a tap interface configured, so all guest traffic is dropped.

Fix: fleet.py _run_slot() now calls env.pop("BRIDGE", None) before launching run_tier3_demo.py. Tier-2 idle VMs continue to use tap; Tier-3 target VMs always use SLIRP+hostfwd.
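
A minimal sketch of the environment scrub described above. The helper name `tier3_env` is hypothetical and not taken from fleet.py; only the `env.pop("BRIDGE", None)` call comes from the fix itself.

```python
import os


def tier3_env(parent_env=None):
    """Return a copy of the environment with BRIDGE removed.

    Hypothetical helper: the real _run_slot() may build its env differently.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    # The default of None makes this a no-op when BRIDGE is not set.
    env.pop("BRIDGE", None)
    return env


# Assumed usage: subprocess.run([...run_tier3_demo.py...], env=tier3_env())
```

Copying the mapping first keeps the parent environment intact, so Tier-2 code in the same process would still see BRIDGE.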

Files: orchestrator/fleet.py


Bug 2 — Bridge-requiring modules selected when BRIDGE is not available

Symptom: distccd_command_exec and php_cgi_arg_injection appear in usable_modules even on SLIRP-only runs. Exploit fires but the reverse-shell payload can't call back (no guest egress on restrict=on).

Root cause: usable_modules filtering was conditioned on bridge_iface being set in the environment. When BRIDGE was not set, ALL modules were considered usable. Modules that require bridge egress (reverse shells) silently fell through, fired, and timed out waiting for a session.

Fix: usable_modules now always filters requires_bridge=True modules regardless of the BRIDGE env var. The requires_bridge field in the module TOML is authoritative.
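
The unconditional filter could look roughly like this. It assumes each module TOML has already been parsed into a dict with a `name` key and an optional `requires_bridge` key; that shape and the function name `filter_usable` are illustrative assumptions, not the project's actual schema.

```python
def filter_usable(modules):
    """Drop every module that declares requires_bridge = true.

    The BRIDGE environment variable is deliberately not consulted:
    the per-module flag is treated as authoritative.
    """
    return [m for m in modules if not m.get("requires_bridge", False)]


# Illustrative data only; the flags mirror the behavior described in this report.
modules = [
    {"name": "samba_usermap_script"},
    {"name": "distccd_command_exec", "requires_bridge": True},
    {"name": "php_cgi_arg_injection", "requires_bridge": True},
]
usable = [m["name"] for m in filter_usable(modules)]
```

A module with no `requires_bridge` key is treated as SLIRP-safe in this sketch; a stricter variant could reject modules that omit the field.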

Files: orchestrator/fleet.py, exploits/modules/*.toml


Bug 3 — cmd/unix/interact creates no persistent session

Symptom: samba_usermap_script fires (job_id=None), but no session appears in session.list after 30 s. The exploit succeeds on the wire but the driver reports session_open_timeout.

Root cause: cmd/unix/interact is a console-only payload. It attaches directly to the module's job console and does NOT create a background Meterpreter/shell session visible via session.list. msfrpcd's module.execute returns job_id=None (no background job), so wait_for_new_session polls session.list until it times out.

Fix: Changed payload to cmd/unix/bind_perl with LPORT=4444. The bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects to RHOSTS:LPORT after the exploit fires, creating a proper shell session.
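
For context, a bounded poll for a new session might be sketched as follows. This is not the project's wait_for_new_session; the signature, the `list_sessions` callable (standing in for an RPC call to session.list that returns a dict keyed by session id), and the defaults are all assumptions made for illustration.

```python
import time


def wait_for_new_session(list_sessions, known_ids, timeout=30.0, interval=1.0):
    """Poll until a session id outside known_ids appears, or give up.

    Returns the set of new session ids, or None on timeout.
    """
    known = set(known_ids)
    deadline = time.monotonic() + timeout
    while True:
        # Iterating a dict yields its keys, i.e. the session ids.
        new = set(list_sessions()) - known
        if new:
            return new
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With a console-only payload the poll can never succeed, because no session is ever registered; with a bind-shell payload the new id shows up once msfrpcd has connected to the bind port.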

Files: exploits/modules/samba_usermap_script.toml


Bug 4 — Per-slot LPORT/hostfwd port mapping wrong

Symptom: For slots 1+, the bind-shell port is reachable on the host but msfrpcd cannot connect. ss -tlnp on the host shows port 5444 listening (QEMU) but the module tries to connect to port 4444.

Root cause: The extra hostfwd was host:5444→guest:4444 (old guest port) but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444. Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.

Fix: Extra hostfwd now uses extra_host_port:extra_host_port on both sides. extra_host_port = base_port + slot * 1000 is the per-slot LPORT, and the guest binds that exact port.
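
The corrected mapping can be sketched as below. The formula `base_port + slot * 1000` is from the fix; the function name, the loopback host address, and the example base of 4444 (the LPORT mentioned for Bug 3) are assumptions. The string follows QEMU's `hostfwd=tcp:hostaddr:hostport-guestaddr:guestport` form with the guest address left empty.

```python
def extra_hostfwd(base_port, slot, host_addr="127.0.0.1"):
    """Return (per-slot LPORT, hostfwd rule) with the same port on both sides."""
    port = base_port + slot * 1000
    # Host side and guest side use the identical port, so whatever the
    # payload binds inside the guest is exactly what the host forwards to.
    return port, f"hostfwd=tcp:{host_addr}:{port}-:{port}"


lport, rule = extra_hostfwd(4444, 1)
# lport == 5444, rule == "hostfwd=tcp:127.0.0.1:5444-:5444"
```

Deriving the payload LPORT and the forwarding rule from one value removes the chance of the two drifting apart again.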

Files: orchestrator/fleet.py


Bug 5 — vsftpd module port 6200 collision across concurrent slots

Symptom: Multiple Tier-3 slots running vsftpd_234_backdoor all try to hostfwd port 6200 (the backdoor bind port). QEMU for slots 1+ fails to start because port 6200 is already bound by slot 0's QEMU.

Root cause: vsftpd's backdoor hardcodes port 6200 in both the vulnerable binary and the Metasploit module. There is no LPORT override possible. With SLIRP+hostfwd, all concurrent slots must use the same host port, which is impossible.

Fix: Marked vsftpd_234_backdoor.toml with requires_bridge = true. The fleet runner filters it from usable_modules on SLIRP runs. When a bridge is available each guest gets its own IP, and msfrpcd connects to guest_ip:6200 directly.

Files: exploits/modules/vsftpd_234_backdoor.toml


Bug 6 — SLIRP false-positive in _wait_for_tcp causes premature exploit fire

Symptom: Log shows "target service is up" within 0.5 s of QEMU start. The exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s to boot Samba. Result: session_open_timeout every episode.

Root cause: SLIRP's usermode TCP stack completes the TCP three-way handshake (SYN-ACK) immediately for any port that has a hostfwd rule, regardless of whether the guest OS has booted. A bare socket.create_connection() always succeeds. Even a recv() with a short timeout (0.5 s) fires with socket.timeout because during very early boot SLIRP cannot RST the connection (the guest TCP stack is not up yet), so the connection hangs open and the recv deadline fires before SLIRP can determine the guest state.

Fix: Replaced _wait_for_tcp with _wait_for_serial_login. The new function connects to QEMU's serial console socket (serial.sock) right after the pidfile appears and streams boot output until "login:" is seen. The serial console is authoritative: it reflects actual guest OS state, not SLIRP's synthetic TCP layer.

Timing:

  • serial.sock is created by QEMU at device init, before the pidfile.
  • We connect immediately after the pidfile → we receive all boot output.
  • Metasploitable2 prints "metasploitable login:" ≈ 50–70 s after QEMU start.
  • The clean phase (10 s) runs AFTER the login prompt, so the exploit fires when Samba is reliably up.
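
A rough sketch of a serial-console wait, assuming serial.sock is a Unix stream socket exposed by QEMU. The function name matches the one in this report, but the signature, buffer handling, and defaults here are illustrative assumptions rather than the code in run_tier3_demo.py.

```python
import socket
import time


def wait_for_serial_login(sock_path, marker=b"login:", timeout=300.0):
    """Stream console output until marker is seen; False on timeout or EOF."""
    deadline = time.monotonic() + timeout
    buf = b""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        while time.monotonic() < deadline:
            # Never pass 0 here: that would switch the socket to non-blocking.
            s.settimeout(max(0.1, min(1.0, deadline - time.monotonic())))
            try:
                chunk = s.recv(4096)
            except socket.timeout:
                continue
            if not chunk:
                return False  # peer closed the console socket
            buf += chunk
            if marker in buf:
                return True
            # Keep only a short tail so a marker split across reads still matches.
            buf = buf[-len(marker):]
    return False
```

Unlike a TCP probe through SLIRP, this only returns True once the guest itself has written the prompt to its console.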

Files: tools/run_tier3_demo.py


Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts

Symptom: After a systemd restart of cis490-orchestrator, the new wave's QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU from the previous wave is still running (QEMU is started with start_new_session=True so it survives the orchestrator's SIGTERM). The new episode detects the stale QEMU answering the port probe and proceeds as if the target is up — but the stale QEMU has different hostfwd mappings (no bind port for the current module), so the exploit never lands.

Fix: run_tier3_demo.py reads the old qemu.pid file from the run directory before that directory is removed and recreated. If a PID is found, os.killpg(pgid, SIGTERM) terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU exit before the port is rebound.
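
The cleanup step might look roughly like this. The helper name `kill_stale_qemu` and its return convention are hypothetical; the pidfile read, os.killpg with SIGTERM, and the 1.5 s grace period come from the fix described above.

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """SIGTERM the process group named by an old pidfile.

    Returns True if a signal was sent, False if there was nothing to kill.
    """
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or it did not contain a PID
    try:
        # start_new_session=True makes QEMU its own group leader,
        # so signalling the group reaches QEMU and any children.
        os.killpg(os.getpgid(pid), signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        return False  # already gone, or not ours to signal
    time.sleep(grace)  # let QEMU exit and release its hostfwd ports
    return True
```

One caveat of this sketch: a recycled PID could point at an unrelated process, so a stricter variant would verify the process is actually QEMU before signalling.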

Files: tools/run_tier3_demo.py


Bug 8 — PORT_BASE default uses privileged ports (< 1024)

Symptom: launch_target.sh's default PORT_BASE was 21 + SLOT * 100. On Tier-2 hosts without Metasploitable2, standalone run_tier3_demo.py tries to bind port 21 on loopback. The cis490 service user cannot bind ports < 1024. QEMU exits immediately.

Fix: Default changed to 2021 + SLOT * 100. Port 2021 is above 1024 and reflects the scheme used by the fleet runner (base_port + 2000).
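
The arithmetic of the new default, written in Python for illustration (the actual default lives in launch_target.sh). The function name and the explicit privileged-port check are assumptions added for the sketch.

```python
def default_port_base(slot, base=2021, stride=100):
    """Compute the per-slot PORT_BASE and reject privileged ports."""
    port = base + slot * stride
    if port < 1024:
        raise ValueError(f"PORT_BASE {port} is privileged (< 1024)")
    return port


ports = [default_port_base(s) for s in range(4)]
# ports == [2021, 2121, 2221, 2321]
```

With the old base of 21, slot 0 would have failed this check immediately instead of letting QEMU exit at bind time.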

Files: vm/launch_target.sh, scripts/install-tier-3-4.sh


Bug 9 — msfrpc module.execute response is raw msgpack bytes, not str

Symptom: Key lookups on the module.execute response raise KeyError or fail silently because msgpack returns bin type (bytes) for all string values, even with raw=False on some Metasploit 6.x builds.

Fix: Added MSFRpcClient._str() to recursively decode bytes→str in all msgpack response dicts. Applied to module.execute and session.list.
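
A recursive decoder in the spirit of `MSFRpcClient._str()` could be sketched as below. The standalone function name, the UTF-8 default, and the `errors="replace"` policy are assumptions; the real method may differ.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        # errors="replace" keeps odd session output from raising.
        return obj.decode(encoding, errors="replace")
    if isinstance(obj, dict):
        # Keys need decoding too, otherwise resp["job_id"] still misses b"job_id".
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [decode_bytes(x, encoding) for x in obj]
    return obj  # ints, None, bools, and str pass through unchanged


resp = decode_bytes({b"job_id": 3, b"uuid": b"abc", b"tags": [b"x", None]})
# resp == {"job_id": 3, "uuid": "abc", "tags": ["x", None]}
```

Applying the decoder once at the RPC boundary means callers can use plain string keys regardless of how a given msfrpcd build encodes its responses.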

Files: exploits/msfrpc.py


Net result after all fixes

With fixes 1–9 applied:

  • All 4 Tier-3 slots use SLIRP+hostfwd with correct per-slot port mapping.
  • samba_usermap_script fires cmd/unix/bind_perl with the correct per-slot LPORT; msfrpcd connects to the bind port via hostfwd.
  • The exploit fires only after Metasploitable2 confirms its login prompt on the serial console (~60 s after QEMU start).
  • Sessions open, workloads execute, episodes complete with session_open events (not session_open_timeout).