Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc (i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci) caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1. Services never started; the b'' probe fix (Bug 10) then correctly waited out the full timeout with no result.

Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor which is requires_bridge=true and has a hardcoded port-6200 backdoor. Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".

manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest egress, SLIRP-safe). distccd returns a valid protocol response so MSF's handler runs and session_open fires.

Verified against Metasploitable2 sourceforge image sha256 a8c019c3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Tier-3 Bring-up Bug Report — elliott-ThinkPad (2026-05-01)

Bugs found and fixed during the first real-exploit fleet run on this host.
All fixes are in the commits following the `Dev_REL1_043026` merge of main.

---

## Bug 1 — BRIDGE env var breaks Tier-3 target VM networking

**Symptom:** All Tier-3 slots time out at 300 s waiting for the target
service. QEMU starts with `netdev tap` instead of `netdev user` (SLIRP).

**Root cause:** `launch_target.sh` checks `BRIDGE` to switch between SLIRP
and tap networking. The fleet runner copied the parent environment (which had
`BRIDGE=br-malware` from the Tier-2 tap setup) into the Tier-3 subprocess.
The Tier-3 target VMs don't have a tap interface configured, so all guest
traffic is dropped.

**Fix:** `fleet.py` `_run_slot()` now calls `env.pop("BRIDGE", None)` before
launching `run_tier3_demo.py`. Tier-2 idle VMs continue to use tap; Tier-3
target VMs always use SLIRP+hostfwd.
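
The environment scrub described in the fix can be sketched as below. This is a minimal illustration, not the code in `fleet.py`: the helper name `tier3_env` and its `parent_env` parameter are hypothetical, and only the `env.pop("BRIDGE", None)` call comes from the fix itself.

```python
import os


def tier3_env(parent_env=None):
    """Return a copy of the parent environment with BRIDGE removed.

    `tier3_env` is a hypothetical helper name used for illustration only.
    """
    # Copy first so the parent (Tier-2) environment is left untouched.
    env = dict(os.environ if parent_env is None else parent_env)
    # pop() with a default does not raise when BRIDGE is absent.
    env.pop("BRIDGE", None)
    return env
```

The returned dict would then be passed as `env=` to `subprocess.Popen` or `subprocess.run` when launching the Tier-3 subprocess.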

**Files:** `orchestrator/fleet.py`

---

## Bug 2 — Bridge-requiring modules selected when BRIDGE is not available

**Symptom:** `distccd_command_exec` and `php_cgi_arg_injection` appear in
`usable_modules` even on SLIRP-only runs. The exploit fires, but the
reverse-shell payload can't call back (no guest egress on `restrict=on`).

**Root cause:** `usable_modules` filtering was conditioned on `bridge_iface`
being set in the environment. When `BRIDGE` was not set, ALL modules were
considered usable. Modules that require bridge egress (reverse shells) silently
fell through, fired, and timed out waiting for a session.

**Fix:** `usable_modules` now always filters `requires_bridge=True` modules
regardless of the `BRIDGE` env var. The `requires_bridge` field in the module
TOML is authoritative.
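
A minimal sketch of such a filter follows. The `Module` dataclass and the function signature are assumptions made for illustration; the real catalog is loaded from the module TOML files and may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical in-memory view of one module TOML entry."""
    name: str
    requires_bridge: bool = False


def usable_modules(catalog):
    """Keep only modules that can run on SLIRP+hostfwd.

    The decision depends solely on the module's own requires_bridge
    field and never consults the BRIDGE environment variable.
    """
    return [m for m in catalog if not m.requires_bridge]
```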

**Files:** `orchestrator/fleet.py`, `exploits/modules/*.toml`

---

## Bug 3 — `cmd/unix/interact` creates no persistent session

**Symptom:** `samba_usermap_script` fires (job_id=None), but no session
appears in `session.list` after 30 s. The exploit succeeds on the wire but the
driver reports `session_open_timeout`.

**Root cause:** `cmd/unix/interact` is a console-only payload. It attaches
directly to the module's job console; it does NOT create a background
Meterpreter/shell session visible via `session.list`. msfrpcd's
`module.execute` returns `job_id=None` (no background job), and
`wait_for_new_session` polls until its timeout expires.

**Fix:** Changed payload to `cmd/unix/bind_perl` with `LPORT=4444`. The
bind-shell payload instructs the guest to listen on LPORT; msfrpcd connects
to `RHOSTS:LPORT` after the exploit fires, creating a proper shell session.
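
To show why a console-only payload ends in `session_open_timeout`, here is a rough sketch of a session poll. It is not the project's `wait_for_new_session`: the signature is an assumption, and `list_sessions` is a hypothetical callable standing in for the `session.list` RPC call.

```python
import time


def wait_for_new_session(list_sessions, known_ids, timeout=30.0, interval=1.0):
    """Poll until a session ID outside known_ids appears.

    list_sessions is a stand-in for the session.list RPC call and must
    return a dict keyed by session ID. Returns the set of new IDs, or an
    empty set if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_ids = set(list_sessions()) - set(known_ids)
        if new_ids:
            return new_ids
        if time.monotonic() >= deadline:
            # With a console-only payload the list never changes,
            # so every run ends here.
            return set()
        time.sleep(interval)
```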

**Files:** `exploits/modules/samba_usermap_script.toml`

---

## Bug 4 — Per-slot LPORT/hostfwd port mapping wrong

**Symptom:** For slots 1+, the bind-shell port is reachable on the host but
msfrpcd cannot connect. `ss -tlnp` on the host shows port 5444 listening
(QEMU) but the module tries to connect to port 4444.

**Root cause:** The extra hostfwd was `host:5444→guest:4444` (old guest port),
but `FLEET_PAYLOAD_LPORT=5444` instructed the guest bind_perl to listen on 5444.
Mismatch: the guest binds 5444 while hostfwd forwards host:5444→guest:4444, so
there is no path to the shell.

**Fix:** Extra hostfwd now uses `extra_host_port:extra_host_port` on both
sides. `extra_host_port = base_port + slot * 1000` is the per-slot LPORT, and
the guest binds that exact port.
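
The port arithmetic can be checked with a short sketch. The helper names `slot_lport` and `extra_hostfwd` and the `127.0.0.1` bind address are assumptions for illustration; only the `base_port + slot * 1000` formula and the same-port-on-both-sides rule come from the fix above.

```python
def slot_lport(base_port: int, slot: int) -> int:
    """Per-slot LPORT: the port the guest bind shell listens on."""
    return base_port + slot * 1000


def extra_hostfwd(base_port: int, slot: int, host_addr: str = "127.0.0.1") -> str:
    """Build a QEMU hostfwd rule that uses the same port on host and guest."""
    port = slot_lport(base_port, slot)
    # QEMU syntax: hostfwd=tcp:HOSTADDR:HOSTPORT-GUESTADDR:GUESTPORT
    # (an empty guest address selects the default guest).
    return f"hostfwd=tcp:{host_addr}:{port}-:{port}"
```

With a base port of 4444, slot 1 maps host port 5444 to guest port 5444, which matches what the guest binds.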

**Files:** `orchestrator/fleet.py`

---

## Bug 5 — vsftpd module port 6200 collision across concurrent slots

**Symptom:** Multiple Tier-3 slots running `vsftpd_234_backdoor` all try to
hostfwd port 6200 (the backdoor bind port). The QEMU instances for slots 1+
fail to start because port 6200 is already bound by slot 0's QEMU.

**Root cause:** vsftpd's backdoor hardcodes port 6200 in both the vulnerable
binary and the Metasploit module. There is no LPORT override possible. With
SLIRP+hostfwd, all concurrent slots must use the same host port, which is
impossible.

**Fix:** Marked `vsftpd_234_backdoor.toml` with `requires_bridge = true`. The
fleet runner filters it from `usable_modules` on SLIRP runs. When a bridge is
available, each guest gets its own IP, and msfrpcd connects to `guest_ip:6200`
directly.

**Files:** `exploits/modules/vsftpd_234_backdoor.toml`

---

## Bug 6 — SLIRP false-positive in `_wait_for_tcp` causes premature exploit fire

**Symptom:** Log shows "target service is up" within 0.5 s of QEMU start. The
exploit fires at t=10 s (end of clean phase), but Metasploitable2 needs 30–60 s
to boot Samba. Result: `session_open_timeout` every episode.

**Root cause:** With SLIRP, the TCP three-way handshake on any port that has a
`hostfwd` rule completes immediately, because QEMU accepts the connection on
the host side regardless of whether the guest OS has booted. A bare
`socket.create_connection()` therefore always succeeds. Even a `recv()` with a
short timeout (0.5 s) is not a reliable check: during very early boot the guest
TCP stack is not up yet, so SLIRP cannot RST the connection. The connection
hangs open, the recv deadline fires with `socket.timeout` before SLIRP can
determine the guest state, and the probe reads that as "service is up".

**Fix:** Replaced `_wait_for_tcp` with `_wait_for_serial_login`. The new
function connects to QEMU's serial console socket (`serial.sock`) right after
the pidfile appears and streams boot output until `"login:"` is seen. The
serial console is authoritative: it reflects actual guest OS state, not
SLIRP's synthetic TCP layer.

Timing:
- `serial.sock` is created by QEMU at device init, before the pidfile.
- We connect immediately after the pidfile, so we receive all boot output.
- Metasploitable2 prints `"metasploitable login:"` ≈ 50–70 s after QEMU start.
- The clean phase (10 s) runs AFTER the login prompt, so the exploit fires
  when Samba is reliably up.
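
The serial-console wait can be sketched roughly as follows. This is not the project's `_wait_for_serial_login`: the signature, the 120 s default, and the choice to take an already-connected socket are assumptions. In practice the socket would come from `socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)` connected to the `serial.sock` path.

```python
import socket
import time


def wait_for_serial_login(sock, marker=b"login:", timeout=120.0):
    """Read console output from a connected socket until marker appears.

    Returns True when the marker is seen, False if the console closes
    or the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    tail = b""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(min(1.0, remaining))
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            continue  # no output yet; keep waiting until the deadline
        if not chunk:
            return False  # console socket closed (QEMU exited)
        tail += chunk
        if marker in tail:
            return True
        # Keep only enough bytes to catch a marker split across two reads.
        tail = tail[-len(marker):]
```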

**Files:** `tools/run_tier3_demo.py`

---

## Bug 7 — Stale QEMU processes hold hostfwd ports across orchestrator restarts

**Symptom:** After a systemd restart of `cis490-orchestrator`, the new wave's
QEMU processes fail to bind their hostfwd ports (e.g., 2139). The old QEMU
from the previous wave is still running (QEMU is started with
`start_new_session=True` so it survives the orchestrator's SIGTERM). The new
episode detects the stale QEMU answering the port probe and proceeds as if the
target is up. However, the stale QEMU has different hostfwd mappings (no bind
port for the current module), so the exploit never lands.

**Fix:** `run_tier3_demo.py` reads the old `qemu.pid` file from the run
directory before recreating it. If a PID is found, `os.killpg(pgid, SIGTERM)`
terminates the old QEMU process group, followed by a 1.5 s sleep to let QEMU
exit before the port is rebound.
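
A minimal sketch of that cleanup, under stated assumptions: the function name `kill_stale_qemu`, its return value, and the guard against signalling our own process group are illustrative additions, not taken from `run_tier3_demo.py`. Note that a PID read from an old pidfile could in principle have been reused by an unrelated process; this sketch does not check for that.

```python
import os
import signal
import time
from pathlib import Path


def kill_stale_qemu(pidfile, grace=1.5):
    """SIGTERM the process group of the PID recorded in pidfile.

    Returns True if a signal was sent, False if there was nothing to kill.
    """
    try:
        pid = int(Path(pidfile).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or unreadable contents
    try:
        pgid = os.getpgid(pid)
        if pgid == os.getpgrp():
            return False  # never signal our own process group
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # process already exited
    time.sleep(grace)  # give QEMU time to release its hostfwd ports
    return True
```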

**Files:** `tools/run_tier3_demo.py`

---

## Bug 8 — `PORT_BASE` default uses privileged ports (< 1024)

**Symptom:** `launch_target.sh`'s default `PORT_BASE` was `21 + SLOT * 100`.
On Tier-2 hosts without Metasploitable2, standalone `run_tier3_demo.py` tries
to bind port 21 on loopback. The `cis490` service user cannot bind ports
< 1024. QEMU exits immediately.

**Fix:** Default changed to `2021 + SLOT * 100`. Port 2021 is above 1024 and
reflects the scheme used by the fleet runner (base_port + 2000).

**Files:** `vm/launch_target.sh`, `scripts/install-tier-3-4.sh`

---

## Bug 9 — msfrpc `module.execute` response is raw msgpack bytes, not str

**Symptom:** Key lookups on the `module.execute` response raise `KeyError`
or fail silently. On some Metasploit 6.x builds, msgpack returns `bin` type
(bytes) for all string values, even with `raw=False`.

**Fix:** Added `MSFRpcClient._str()` to recursively decode bytes→str in all
msgpack response dicts. Applied to `module.execute` and `session.list`.
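
A recursive decode of this kind might look like the sketch below. It is a standalone function with the hypothetical name `to_str`, not the actual `MSFRpcClient._str()` method; the UTF-8 decoding with `errors="replace"` is an assumption.

```python
def to_str(obj):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        # Assumption: responses are UTF-8; undecodable bytes are replaced.
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        # Decode keys as well, so lookups like resp["job_id"] work.
        return {to_str(k): to_str(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_str(v) for v in obj]
    return obj  # ints, None, bools, and str pass through unchanged
```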

**Files:** `exploits/msfrpc.py`

---

## Bug 10 — `_wait_for_tcp` returns success on `b''` (connection-closed-by-peer)

**Symptom:** Log shows "target service is up" within 0.5 s of the 65 s boot
floor, but all exploit fires time out waiting for a session. FTP (port 21),
Samba (139), and distccd (3632) all returned `b''`. The VM's services were not
up; the probe was wrong.

**Root cause:** When `recv(1)` returns `b''` (empty bytes), Python raises no
exception. The code fell through to `return`, incorrectly reporting "service
is up". `b''` means SLIRP forwarded the connection to the guest, the guest's
TCP stack RST'd (no service listening), and SLIRP converted the RST into a FIN,
so the host sees the connection closed. Only `socket.timeout` (remote end
holding the connection open, waiting for client data) and non-empty `data`
(banner received) are genuine ready signals.

**Fix:** The return value of `recv(1)` is now saved and checked. On
`socket.timeout`, return immediately (genuine up). On non-empty `data`, return
(banner). On `b''`, set `last_err` and `continue` (retry).
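
The three outcomes can be illustrated with a small classifier. This is a sketch of the decision logic only, assuming an already-connected socket; the function name `classify_probe` and its string return values are hypothetical, and the real `_wait_for_tcp` wraps this logic in a retry loop.

```python
import socket


def classify_probe(sock, timeout=0.5):
    """Classify one recv(1) on a connected socket as "up" or "retry"."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # The peer holds the connection open and waits for client data.
        return "up"
    except OSError:
        # Reset or other socket error: nothing usable is listening yet.
        return "retry"
    # Non-empty data is a banner; b'' means the peer closed the connection.
    return "up" if data else "retry"
```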

**Files:** `tools/run_tier3_demo.py`

---

## Bug 11 — `distccd` and `unreal_ircd` incorrectly marked `requires_bridge = true`

**Symptom:** `distccd_command_exec` and `unreal_ircd_3281_backdoor` were
filtered from `usable_modules` on every SLIRP-only run, even though their
`cmd/unix/bind_perl` payloads create an inward-connecting bind shell (host
connects to guest), which does NOT require the bridge.

**Root cause:** The comment in `distccd_command_exec.toml` said "needs bridge so
the guest can reach the attacker". That is correct for reverse_tcp payloads but
wrong for bind_perl: bind_perl listens on the guest, and msfrpcd connects to the
hostfwd'd loopback port. No guest egress is needed.

**Fix:** Set `requires_bridge = false` in both modules. The fleet already adds
per-slot hostfwd entries for `extra_target_ports`, so these modules now work on
SLIRP+hostfwd runs without any other change.

**Files:** `exploits/modules/distccd_command_exec.toml`,
`exploits/modules/unreal_ircd_3281_backdoor.toml`

---

## Bug 12 — `msgpack.unpackb` crashes on integer session IDs

**Symptom:** `wait_for_new_session` raises `ValueError: int is not allowed for
map key` when msfrpcd returns a session dict keyed by integer session IDs.
Traceback seen in slot-0 logs on 2026-05-01.

**Root cause:** `msgpack.unpackb(raw, raw=False)` defaults to
`strict_map_key=True`, which rejects non-string keys. Metasploit 6.x msfrpcd
encodes session IDs as msgpack int64 map keys.

**Fix:** Added `strict_map_key=False` to the `unpackb` call in `_raw_call`.

**Files:** `exploits/msfrpc.py`

---

## Bug 13 — `samba_usermap_script` never opens a session (removed from catalog)

**Symptom:** `multi/samba/usermap_script` fired, port 4444 bound in guest, but
Metasploit reported `Rex::Proto::SMB::Exceptions::NoReply` on every run.
`session.list` stayed empty for the full 30 s timeout.

**Root cause:** The SMB auth connection is disrupted when Samba's
`username map script` executes the injected command (smbd kills the auth
handler). Metasploit never received an SMB response, marked the exploit as
"failed", and skipped calling the bind-shell handler, so the session was never
created.

**Fix:** Removed `samba_usermap_script.toml` from the catalog. The fleet now
uses `distccd_command_exec` and `unreal_ircd_3281_backdoor` as SLIRP-capable
modules (see Bug 11 fix). Both protocols return a proper response after the
exploit fires, so Metasploit's handler is called and sessions open.

**Files:** `exploits/modules/samba_usermap_script.toml` (deleted),
`orchestrator/fleet.py`

---

## Bug 14 — QEMU launch config incompatible with Metasploitable2 (boot hang)

**Symptom:** Every `_wait_for_tcp` probe returns `b''` for the full timeout
(even after the Bug 10 fix). No service (FTP, Samba, distccd, IRC) ever
becomes reachable. The VM consumes CPU (QEMU runs) but nothing listens.

**Root cause (three compounding issues in `launch_target.sh`):**

1. `-drive if=virtio` presents the disk as `/dev/vda`. Metasploitable2's GRUB
   was built for VMware SCSI (`/dev/sda`). Ubuntu 8.04's kernel command line
   says `root=/dev/sda1`. The kernel can't mount root on `/dev/vda`, so it
   panics immediately after decompression. Services never start.

2. `-machine q35` is a PCIe chipset (Intel Q35 + ICH9). Old ISA-emulated
   devices and BIOS assumptions in Ubuntu 8.04 break under q35.

3. `-cpu host` exposes AVX/XSAVE and other modern CPU features. Linux 2.6.24
   doesn't know how to save/restore these in context switches; the kernel
   freezes or mishandles the first SIMD operation during boot.

**Fix:** Four changes in `vm/launch_target.sh`:
- `-machine q35` → `-machine pc` (i440fx, the classic PC compatible machine)
- `-drive if=virtio` → `-drive if=ide` (Ubuntu 8.04 libata presents this as
  `/dev/sda`, matching the GRUB `root=` line)
- `-cpu host` (KVM) → `-cpu kvm32` (safe 32-bit KVM model, no exotic flags)
- `-device virtio-net-pci` → `-device e1000` (Intel e1000: supported by
  mainline Linux well before 2.6.24, so the Metasploitable2 kernel has the
  driver)
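
For illustration, the resulting flag set can be assembled as an argument list. This is a sketch, not a transcription of `launch_target.sh` (which is a shell script): the binary name `qemu-system-x86_64`, the `-enable-kvm` flag, the netdev id `n0`, and the helper name `qemu_args` are all assumptions. Only the four flag choices and `restrict=on` come from this report.

```python
def qemu_args(image, hostfwd_rules):
    """Build a QEMU argv using the Metasploitable2-compatible flags.

    hostfwd_rules is a list of strings such as "tcp:127.0.0.1:5632-:3632".
    """
    netdev = "user,id=n0,restrict=on"
    for rule in hostfwd_rules:
        netdev += f",hostfwd={rule}"
    return [
        "qemu-system-x86_64",           # assumed binary name
        "-enable-kvm",                  # assumed; the report says KVM is used
        "-machine", "pc",               # i440fx instead of q35
        "-cpu", "kvm32",                # conservative 32-bit CPU model
        "-drive", f"file={image},if=ide",  # guest sees the disk as /dev/sda
        "-netdev", netdev,              # SLIRP user networking with hostfwd
        "-device", "e1000,netdev=n0",   # e1000 NIC instead of virtio-net-pci
    ]
```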

**Files:** `vm/launch_target.sh`

---

## Bug 15 — Tier-3 verify uses vsftpd (bridge-only, always fails on SLIRP)

**Symptom:** `install-tier-3-4.sh` verify step always fails. The vsftpd
module's backdoor opens port 6200 (hardcoded in the binary and the MSF
module). On SLIRP, all slots would need to share the same host port 6200,
which QEMU refuses. The verify run ends either in the `_wait_for_tcp` timeout
or with the exploit never reaching a session.

**Root cause:** The verify step was left on `vsftpd_234_backdoor` after Bug 5
marked that module `requires_bridge = true`. The verify subprocess doesn't
have a bridge configured and doesn't set up the extra hostfwd for port 6200.

**Fix:** Changed verify to `distccd_command_exec` with correct SLIRP port
mappings: `TARGET_PORTS="5632:3632,4444:4444"` and `--target-port 5632`.
distccd doesn't hardcode a backdoor port; the bind shell uses the
fleet-assigned `LPORT`. No bridge needed.
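
To make the mapping concrete, the sketch below parses a `TARGET_PORTS` value into QEMU `hostfwd` rules. It assumes each comma-separated item is `host_port:guest_port`, which is inferred from the `5632:3632` example and `--target-port 5632` rather than stated in the script. The helper names and the `127.0.0.1` bind address are hypothetical.

```python
def parse_target_ports(spec):
    """Parse "5632:3632,4444:4444" into [(host_port, guest_port), ...]."""
    pairs = []
    for item in spec.split(","):
        host, guest = item.strip().split(":")
        pairs.append((int(host), int(guest)))
    return pairs


def hostfwd_rules(pairs, host_addr="127.0.0.1"):
    """Turn (host_port, guest_port) pairs into QEMU hostfwd options."""
    return [f"hostfwd=tcp:{host_addr}:{h}-:{g}" for h, g in pairs]
```

Under that assumption, host port 5632 reaches distccd on guest port 3632, and port 4444 is forwarded unchanged for the bind shell.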

**Files:** `scripts/install-tier-3-4.sh`

---

## Net result after all fixes

With fixes 1–15 applied:
- Metasploitable2 boots correctly under KVM (pc machine, kvm32 CPU, ide disk,
  e1000 network). Services start ~60–70 s after QEMU launch.
- `_wait_for_tcp` correctly waits until a service is genuinely listening
  (returns only on `socket.timeout` or non-empty banner data).
- `distccd_command_exec` and `unreal_ircd_3281_backdoor` are admitted to the
  module catalog; both are SLIRP-compatible with `cmd/unix/bind_perl`.
- `samba_usermap_script` removed from catalog (NoReply, sessions never open).
- `msgpack.unpackb` accepts integer session ID keys without crashing.
- The verify step uses `distccd_command_exec` on SLIRP+hostfwd.
- Sessions open, workloads execute, episodes complete with `session_open`
  events.