CIS490/TIER3-BRINGUP.md
Elliott Kolden b29d30a1b2 Tier-3: fix QEMU boot, catalog admission, verify module
Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc
(i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous
config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci)
caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1.
Services never started; the b'' probe fix (Bug 10) then correctly waited
out the full timeout with no result.

Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor
which is requires_bridge=true and has a hardcoded port-6200 backdoor.
Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".

manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor
to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest
egress, SLIRP-safe). distccd returns a valid protocol response so MSF's
handler runs and session_open fires. Verified against Metasploitable2
sourceforge image sha256 a8c019c3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 16:41:41 -06:00


# Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)
Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the `Dev_REL1_043026` merge of main.
---
## Bug 1 — BRIDGE env var breaks Tier-3 target VM networking
**Symptom:** All Tier-3 slots timeout at 300 s waiting for the target
service. QEMU starts with `netdev tap` instead of `netdev user` (SLIRP).
**Root cause:** `launch_target.sh` checks `BRIDGE` to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
`BRIDGE=br-malware` from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.
**Fix:** `fleet.py` `_run_slot()` now calls `env.pop("BRIDGE", None)` before
launching `run_tier3_demo.py`. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
**Files:** `orchestrator/fleet.py`
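
A minimal sketch of the environment scrubbing described in the fix. The helper name `tier3_child_env` is hypothetical; only the `env.pop("BRIDGE", None)` call comes from the fix itself.

```python
import os


def tier3_child_env(parent_env=None):
    """Return a copy of the environment that is safe for a Tier-3 subprocess.

    Hypothetical helper: the real logic lives in fleet.py's _run_slot().
    """
    # Copy so the parent's (Tier-2) environment is left untouched.
    env = dict(os.environ if parent_env is None else parent_env)
    # Dropping BRIDGE makes launch_target.sh fall back to SLIRP networking.
    env.pop("BRIDGE", None)
    return env
```

The returned dict would then be passed as `env=` to `subprocess.Popen` or `subprocess.run` when launching `run_tier3_demo.py`.
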
---
## Bug 2 — Bridge-requiring modules selected when BRIDGE is not available
**Symptom:** `distccd_command_exec` and `php_cgi_arg_injection` appear in
`usable_modules` even on SLIRP-only runs. Exploit fires but the reverse-shell
payload can't call back (no guest egress on `restrict=on`).
**Root cause:** `usable_modules` filtering was conditioned on `bridge_iface`
being set in the environment. When BRIDGE was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.
**Fix:** `usable_modules` now always filters `requires_bridge=True` modules
regardless of the BRIDGE env var. The `requires_bridge` field in the module
TOML is authoritative.
**Files:** `orchestrator/fleet.py`, `exploits/modules/*.toml`
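
The filtering rule can be sketched as follows. The catalog shape (module name mapped to its parsed TOML dict) and the function name are assumptions for illustration, as is treating a missing `requires_bridge` field as `false`.

```python
def slirp_usable_modules(catalog):
    """Return the names of modules that can run without a bridge.

    catalog: dict of module name -> parsed TOML dict (assumed shape).
    The filter does not consult the BRIDGE environment variable at all;
    the requires_bridge field is the only input.
    """
    return sorted(
        name
        for name, meta in catalog.items()
        # Assumption: a module that omits the field does not need a bridge.
        if not meta.get("requires_bridge", False)
    )
```
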
---
## Bug 3 — `cmd/unix/interact` creates no persistent session
**Symptom:** `samba_usermap_script` fires (job_id=None), no session appears in
`session.list` after 30 s. The exploit succeeds on the wire but the driver
reports `session_open_timeout`.
**Root cause:** `cmd/unix/interact` is a console-only payload. It attaches
directly to the module's job console — it does NOT create a background
Meterpreter/shell session visible via `session.list`. msfrpcd's
`module.execute` returns `job_id=None` (no background job), and
`wait_for_new_session` polls until its timeout expires.
**Fix:** Changed payload to `cmd/unix/bind_perl` with `LPORT=4444`. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to `RHOSTS:LPORT` after the exploit fires, creating a proper shell session.
**Files:** `exploits/modules/samba_usermap_script.toml`
---
## Bug 4 — Per-slot LPORT/hostfwd port mapping wrong
**Symptom:** For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. `ss -tlnp` on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.
**Root cause:** The extra hostfwd was `host:5444→guest:4444` (old guest port)
but FLEET_PAYLOAD_LPORT=5444 instructed the guest bind_perl to listen on 5444.
Mismatch: guest binds 5444, hostfwd forwards host:5444→guest:4444. No path.
**Fix:** Extra hostfwd now uses `extra_host_port:extra_host_port` on both
sides. `extra_host_port = base_port + slot * 1000` is the per-slot LPORT, and
the guest binds that exact port.
**Files:** `orchestrator/fleet.py`
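
The port arithmetic can be sketched like this. The function names are hypothetical, and binding the forward to `127.0.0.1` is an assumption; the `hostfwd=tcp:HOSTADDR:HOSTPORT-:GUESTPORT` form is QEMU's user-mode networking syntax.

```python
def slot_lport(base_port, slot):
    """Per-slot LPORT: each slot is offset by 1000 from the base port."""
    return base_port + slot * 1000


def bind_shell_hostfwd(base_port, slot, host_addr="127.0.0.1"):
    """Build the extra hostfwd rule for a slot's bind shell.

    The same port number is used on both sides, because the guest's
    bind_perl listens on the per-slot LPORT rather than on the base port.
    """
    port = slot_lport(base_port, slot)
    return f"hostfwd=tcp:{host_addr}:{port}-:{port}"
```

With `base_port=4444`, slot 1 gets LPORT 5444 and a forward from host 5444 to guest 5444, which removes the mismatch described in the root cause.
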
---
## Bug 5 — vsftpd module port 6200 collision across concurrent slots
**Symptom:** Multiple Tier-3 slots running vsftpd_234_backdoor all try to
hostfwd port 6200 (the backdoor bind port). The QEMU processes for slots 1+ fail to start
because port 6200 is already bound by slot 0's QEMU.
**Root cause:** vsftpd's backdoor hardcodes port 6200 in both the vulnerable
binary and the Metasploit module. There is no LPORT override possible. With
SLIRP+hostfwd, all concurrent slots must use the same host port, which is
impossible.
**Fix:** Marked `vsftpd_234_backdoor.toml` with `requires_bridge = true`. The
fleet runner filters it from `usable_modules` on SLIRP runs. When a bridge is
available each guest gets its own IP, and msfrpcd connects to `guest_ip:6200`
directly.
**Files:** `exploits/modules/vsftpd_234_backdoor.toml`
---
## Bug 6 — SLIRP false-positive in `_wait_for_tcp` causes premature exploit fire
**Symptom:** Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase) but Metasploitable2 needs 30–60 s
to boot Samba. Result: `session_open_timeout` every episode.
**Root cause:** SLIRP's usermode TCP stack completes the TCP three-way
handshake (SYN-ACK) immediately for any port that has a `hostfwd` rule,
regardless of whether the guest OS has booted. A bare `socket.create_connection()`
always succeeds. Even a `recv()` with a short timeout (0.5 s) fires with
`socket.timeout` because during very early boot SLIRP cannot RST the connection
(the guest TCP stack is not up yet), so the connection hangs open and the recv
deadline fires before SLIRP can determine the guest state.
**Fix:** Replaced `_wait_for_tcp` with `_wait_for_serial_login`. The new
function connects to QEMU's serial console socket (`serial.sock`) right after
the pidfile appears and streams boot output until `"login:"` is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.
Timing:
- `serial.sock` is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile → we receive all boot output.
- Metasploitable2 prints `"metasploitable login:"` ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires
when Samba is reliably up.
**Files:** `tools/run_tier3_demo.py`
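
A minimal sketch of a serial-console wait loop, assuming the console is exposed as a Unix stream socket. The function names, buffer size, and default timeout are illustrative and not taken from `run_tier3_demo.py`.

```python
import socket
import time


def connect_serial(path):
    """Connect to QEMU's serial console socket (assumed AF_UNIX stream)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def wait_for_login(sock, marker=b"login:", timeout=300.0):
    """Stream console output until marker appears; False on timeout or EOF."""
    deadline = time.monotonic() + timeout
    tail = b""
    keep = len(marker) - 1  # bytes to retain so a split marker is still found
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            return False
        if not chunk:
            return False  # console closed before the prompt appeared
        buf = tail + chunk
        if marker in buf:
            return True
        tail = buf[-keep:] if keep else b""
```

Keeping a short tail between reads matters because the prompt can arrive split across two `recv()` calls.
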
---
## Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts
**Symptom:** After a systemd restart of `cis490-orchestrator`, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
`start_new_session=True` so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up — but the stale QEMU has different hostfwd mappings (no bind port
for the current module), so the exploit never lands.
**Fix:** `run_tier3_demo.py` reads the old `qemu.pid` file from the run
directory before recreating it. If a PID is found, `os.killpg(pgid, SIGTERM)`
terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
exit before the port is rebound.
**Files:** `tools/run_tier3_demo.py`
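
The cleanup step might look like the sketch below. The function name and return value are hypothetical, and the guards against a non-positive PID or our own process group are additions for illustration that the fix description does not mention.

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """SIGTERM the process group recorded in pidfile; True if a signal was sent."""
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no previous run, or an unreadable pidfile
    if pid <= 0:
        return False
    try:
        pgid = os.getpgid(pid)
    except ProcessLookupError:
        return False  # the old QEMU has already exited
    if pgid == os.getpgrp():
        return False  # illustrative guard: never signal our own group
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    time.sleep(grace)  # give QEMU time to release its hostfwd ports
    return True
```
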
---
## Bug 8 — `PORT_BASE` default uses privileged ports (< 1024)
**Symptom:** `launch_target.sh`'s default `PORT_BASE` was `21 + SLOT * 100`.
On Tier-2 hosts without Metasploitable2, standalone `run_tier3_demo.py` tries
to bind port 21 on loopback. The `cis490` service user cannot bind ports
< 1024. QEMU exits immediately.
**Fix:** Default changed to `2021 + SLOT * 100`. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).
**Files:** `vm/launch_target.sh`, `scripts/install-tier-3-4.sh`
---
## Bug 9 — msfrpc `module.execute` response is raw msgpack bytes, not str
**Symptom:** Key lookups on the `module.execute` response raise `KeyError`
or fail silently because msgpack returns `bin` type (bytes) for all string
values, even with `raw=False` on some Metasploit 6.x builds.
**Fix:** Added `MSFRpcClient._str()` to recursively decode bytes→str in all
msgpack response dicts. Applied to `module.execute` and `session.list`.
**Files:** `exploits/msfrpc.py`
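
A recursive decoder along these lines would do the job. This is a standalone sketch, not the actual `MSFRpcClient._str()` body; the UTF-8 decoding with `errors="replace"` is an assumption.

```python
def to_str(obj):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        # Assumption: msfrpcd strings are UTF-8; undecodable bytes are replaced.
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        # Keys are decoded too, so response["job_id"] works after conversion.
        return {to_str(k): to_str(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_str(item) for item in obj]
    return obj  # ints, None, bools, and str pass through unchanged
```
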
---
## Bug 10 — `_wait_for_tcp` returns success on `b''` (connection-closed-by-peer)
**Symptom:** Log shows "target service is up" within 0.5 s of the 65 s boot
floor, but all exploit fires time out waiting for a session. FTP (port 21),
Samba (139), and distccd (3632) all returned `b''`. The VM's services were not
up; the probe was wrong.
**Root cause:** When `recv(1)` returns `b''` (empty bytes), Python raises no
exception. The code fell through to `return`, incorrectly reporting "service
is up". `b''` means SLIRP forwarded the connection to the guest, the guest's
TCP stack RST'd (no service listening), and SLIRP converted RST→FIN, so the
host sees the connection closed. Only `socket.timeout` (remote end holding the
connection open, waiting for client data) and non-empty `data` (banner
received) are genuine ready signals.
**Fix:** Changed `recv(1)` to save the return value. On `socket.timeout`,
return immediately (genuine up). On non-empty `data`, return (banner). On
`b''`, set `last_err` and `continue` (retry).
**Files:** `tools/run_tier3_demo.py`
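
The three outcomes of the probe can be sketched on an already-connected socket. The function name is hypothetical, and treating any other `OSError` as "not ready" is an assumption.

```python
import socket


def service_answered(sock, timeout=0.5):
    """Classify one probe on a connected socket: True means 'genuinely up'."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # The peer is holding the connection open and waiting for client data.
        return True
    except OSError:
        # Assumption: a reset or other socket error counts as not ready.
        return False
    # Non-empty data is a banner; b'' means the connection was closed, so retry.
    return data != b""
```

`socket.timeout` is a subclass of `OSError`, so it has to be caught first. A caller would loop on this with a fresh connection each time until it returns `True` or the overall deadline passes.
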
---
## Bug 11 — `distccd` and `unreal_ircd` incorrectly marked `requires_bridge = true`
**Symptom:** `distcc_exec` and `unreal_ircd_3281_backdoor` were filtered from
`usable_modules` on every SLIRP-only run, even though their `cmd/unix/bind_perl`
payloads create an inward-connecting bind shell (host connects to guest), which
does NOT require the bridge.
**Root cause:** The comment in `distccd_command_exec.toml` said "needs bridge so
the guest can reach the attacker", which is correct for reverse_tcp payloads but wrong for
bind_perl. bind_perl listens on the guest; msfrpcd connects to the hostfwd'd
loopback port. No guest egress is needed.
**Fix:** Set `requires_bridge = false` in both modules. The fleet already adds
per-slot hostfwd entries for `extra_target_ports`, so these modules now work on
SLIRP+hostfwd runs without any other change.
**Files:** `exploits/modules/distccd_command_exec.toml`,
`exploits/modules/unreal_ircd_3281_backdoor.toml`
---
## Bug 12 — `msgpack.unpackb` crashes on integer session IDs
**Symptom:** `wait_for_new_session` raises `ValueError: int is not allowed for
map key` when msfrpcd returns a session dict keyed by integer session IDs.
Traceback seen in slot-0 logs on 2026-05-01.
**Root cause:** `msgpack.unpackb(raw, raw=False)` defaults to
`strict_map_key=True`, which rejects non-string keys. Metasploit 6.x msfrpcd
encodes session IDs as msgpack int64 map keys.
**Fix:** Added `strict_map_key=False` to the `unpackb` call in `_raw_call`.
**Files:** `exploits/msfrpc.py`
---
## Bug 13 — `samba_usermap_script` never opens a session (removed from catalog)
**Symptom:** `multi/samba/usermap_script` fired, port 4444 bound in guest, but
Metasploit reported `Rex::Proto::SMB::Exceptions::NoReply` on every run.
`session.list` stayed empty for the full 30 s timeout.
**Root cause:** The SMB auth connection is disrupted when Samba's
`username map script` executes the injected command (smbd kills the auth
handler). Metasploit never received an SMB response, marked the exploit as
failed, and skipped calling the bind-shell handler, so no session was created.
**Fix:** Removed `samba_usermap_script.toml` from the catalog. The fleet now
uses `distccd_command_exec` and `unreal_ircd_3281_backdoor` as SLIRP-capable
modules (see Bug 11 fix). Both protocols return a proper response after the
exploit fires, so Metasploit's handler is called and sessions open.
**Files:** `exploits/modules/samba_usermap_script.toml` (deleted),
`orchestrator/fleet.py`
---
## Bug 14 — QEMU launch config incompatible with Metasploitable2 (boot hang)
**Symptom:** Every `_wait_for_tcp` probe returns `b''` for the full timeout
(even after the Bug 10 fix). No service (FTP, Samba, distccd, IRC) ever
becomes reachable. The VM consumes CPU (QEMU runs) but nothing listens.
**Root cause (three compounding issues in `launch_target.sh`):**
1. `-drive if=virtio` presents the disk as `/dev/vda`. Metasploitable2's GRUB
was built for VMware SCSI (`/dev/sda`). Ubuntu 8.04's kernel command line
says `root=/dev/sda1`. The kernel can't mount root on `/dev/vda`, so it
panics immediately after decompression. Services never start.
2. `-machine q35` is a PCIe chipset (Sandy Bridge era). Old ISA-emulated
devices and BIOS assumptions in Ubuntu 8.04 break under q35.
3. `-cpu host` exposes AVX/XSAVE and other modern CPU features. Linux 2.6.24
doesn't know how to save/restore these in context switches; the kernel
freezes or mishandles the first SIMD operation during boot.
**Fix:** Four changes in `vm/launch_target.sh`:
- `-machine q35` → `-machine pc` (i440fx, the classic PC-compatible machine)
- `-drive if=virtio` → `-drive if=ide` (Ubuntu 8.04 libata presents this as
`/dev/sda`, matching the GRUB `root=` line)
- `-cpu host` (KVM) → `-cpu kvm32` (safe 32-bit KVM model, no exotic flags)
- `-device virtio-net-pci` → `-device e1000` (Intel e1000: long supported by
mainline Linux, including the 2.6.24 kernel Metasploitable2 runs)
**Files:** `vm/launch_target.sh`
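
For reference, the corrected option set can be expressed as an argument list. This is a Python rendering for illustration only: the real configuration lives in the shell script, and the function name, the `net0` netdev id, and the omission of the QEMU binary, memory, and KVM flags are all assumptions of this sketch.

```python
def target_machine_args(image_path, hostfwd_rules):
    """Build the machine, CPU, disk, and NIC options for Metasploitable2.

    hostfwd_rules: strings such as "tcp:127.0.0.1:5632-:3632".
    """
    # SLIRP user networking with guest egress blocked, plus one hostfwd per rule.
    netdev = "user,id=net0,restrict=on"
    for rule in hostfwd_rules:
        netdev += f",hostfwd={rule}"
    return [
        "-machine", "pc",                       # i440fx instead of q35
        "-cpu", "kvm32",                        # no modern CPU feature flags
        "-drive", f"file={image_path},if=ide",  # guest sees /dev/sda
        "-netdev", netdev,
        "-device", "e1000,netdev=net0",         # instead of virtio-net-pci
    ]
```
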
---
## Bug 15 — Tier-3 verify uses vsftpd (bridge-only, always fails on SLIRP)
**Symptom:** `install-tier-3-4.sh` verify step always fails. The vsftpd
module's backdoor opens port 6200 (hardcoded in the binary and the MSF
module). On SLIRP, all slots would need to share the same host port 6200,
which QEMU refuses. The verify is killed by `_wait_for_tcp` or by the exploit
itself never reaching a session.
**Root cause:** The verify step was left on `vsftpd_234_backdoor` after Bug 5
marked that module `requires_bridge = true`. The verify subprocess doesn't
have a bridge configured and doesn't set up the extra hostfwd for port 6200.
**Fix:** Changed verify to `distccd_command_exec` with correct SLIRP port
mappings: `TARGET_PORTS="5632:3632,4444:4444"` and `--target-port 5632`.
distccd doesn't hardcode a backdoor port; the bind shell uses the
fleet-assigned `LPORT`. No bridge needed.
**Files:** `scripts/install-tier-3-4.sh`
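
A sketch of how a `TARGET_PORTS` value could be parsed and cross-checked against `--target-port`. The `host:guest` ordering is inferred from the values above (5632 is the host-side port passed as `--target-port`, 3632 is distccd's port in the guest); the function names are hypothetical.

```python
def parse_target_ports(spec):
    """Parse "host:guest,host:guest" into a list of (host, guest) int pairs."""
    pairs = []
    for item in spec.split(","):
        host, guest = item.strip().split(":")
        pairs.append((int(host), int(guest)))
    return pairs


def guest_port_for(spec, target_port):
    """Return the guest port that a host-side target port forwards to.

    Raises KeyError if target_port is not mapped, which would catch a
    verify step pointed at a port with no hostfwd rule.
    """
    return dict(parse_target_ports(spec))[target_port]
```
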
---
## Net result after all fixes
With fixes 1–15 applied:
- Metasploitable2 boots correctly under KVM (pc machine, kvm32 CPU, ide disk,
e1000 network). Services start ~60–70 s after QEMU launch.
- `_wait_for_tcp` correctly waits until a service is genuinely listening
(returns only on `socket.timeout` or non-empty banner data).
- `distccd_command_exec` and `unreal_ircd_3281_backdoor` are admitted to the
module catalog; both are SLIRP-compatible with `cmd/unix/bind_perl`.
- `samba_usermap_script` removed from catalog (NoReply, sessions never open).
- `msgpack.unpackb` accepts integer session ID keys without crashing.
- The verify step uses `distccd_command_exec` on SLIRP+hostfwd.
- Sessions open, workloads execute, episodes complete with `session_open`
events.