docs: fix notes for Dev_REL3_050126 — all 7 Tier-3 bring-up bugs

Branch HEAD: 656a015443

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
elliott 2026-05-02 12:20:35 -06:00
parent 656a015443
commit 0fb2f3b9a6

@@ -0,0 +1,142 @@
# Fix Notes — Dev_REL3_050126
Branch HEAD: `656a015443f54dffeab66ae29fa726eee36a51ed`
Date: 2026-05-02
Author: elliott (k-gamingcom lab host)
## Summary
Seven bugs found and fixed during Tier-3 + Tier-4 bring-up on k-gamingcom,
following the AGENTS.md runbook. All fixes are committed to `Dev_REL3_050126`
and deployed to `/opt/cis490`.
---
## Fixes (oldest → newest)
### 1. `cis490-msfrpcd` crashes with EROFS on `/root/.msf4` — commit `1dd484d`
**File:** `scripts/install-msfrpcd.sh`
**Symptom:** msfrpcd service failed immediately with `EROFS` because
`ProtectHome=true` in the generated systemd unit made `/root` a read-only
overlay. msfrpcd defaulted `$HOME` to `/root` and could not create `.msf4/`.
**Fix:** Pre-create `/var/lib/cis490/msf4`, add `Environment=HOME=/var/lib/cis490/msf4`
and `ReadWritePaths=/var/lib/cis490` to the generated unit.
---
### 2. Two Tier-3 install bugs — `metasploitable2` symlink + msfrpcd HOME — commit `ae4b80d`
**File:** `scripts/install-tier-3-4.sh`
**Symptom A:** `install-tier-3-4.sh` fetched the Metasploitable2 image to
`$DATA_ROOT/vm/images/` but never symlinked it to `$INSTALL_ROOT/vm/images/`.
`launch_target.sh` resolved `IMAGE` relative to `$INSTALL_ROOT/vm/images/`
and exited immediately; `qemu.pid` never appeared.
**Fix:** Added `install -d` + `ln -sf` step after the fetch.
**Symptom B:** The same commit also carries the `HOME` fix from fix 1 over into
the install script's live-patch path.
---
### 3. PORT_BASE=21 is privileged; RPORT not propagated — commit `f4eef81`
**Files:** `vm/launch_target.sh`, `exploits/driver.py`, `tools/run_tier3_demo.py`
**Symptom:** `launch_target.sh` defaulted `PORT_BASE` to `$((21 + SLOT * 100))`.
Slot 0 → port 21, which `cis490` (non-root) cannot bind. QEMU printed
`bind(AF_INET, ...): Permission denied` and exited before booting the guest.
Even with a bindable port, `DriverConfig` had no way to override `RPORT`,
so the exploit module would still have connected to port 21 rather than the
host-forwarded (`hostfwd`) port.
**Fix:**
- `launch_target.sh`: `PORT_BASE` default → `$((2121 + SLOT * 100))`
- `DriverConfig`: added `target_port: int | None` field
- `MSFExploitDriver._fire()`: if `target_port` set and RPORT in opts, override
- `run_tier3_demo.py`: pass `target_port=args.target_port` to `DriverConfig`
- `install-tier-3-4.sh` verify call: `--target-port 2121`
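
The override logic in the list above can be sketched as follows. This is a minimal illustration, not the code in `exploits/driver.py`: `port_base`, `apply_rport_override`, and `PRIVILEGED_PORT_LIMIT` are hypothetical names, and the `DriverConfig` shown here carries only the one field the notes mention.

```python
from dataclasses import dataclass
from typing import Optional

# On Linux, binding a port below 1024 normally needs root or CAP_NET_BIND_SERVICE.
PRIVILEGED_PORT_LIMIT = 1024


def port_base(slot: int, base: int = 2121) -> int:
    """Python mirror of the shell default $((2121 + SLOT * 100))."""
    return base + slot * 100


@dataclass
class DriverConfig:
    # Hypothetical subset: only target_port is taken from the fix notes.
    target_port: Optional[int] = None


def apply_rport_override(opts: dict, cfg: DriverConfig) -> dict:
    """Return a copy of opts with RPORT replaced when target_port is set."""
    merged = dict(opts)
    if cfg.target_port is not None and "RPORT" in merged:
        merged["RPORT"] = cfg.target_port
    return merged


if __name__ == "__main__":
    # Old default (base 21) is privileged for slot 0; the new default is not.
    print(port_base(0, base=21) < PRIVILEGED_PORT_LIMIT)
    print(port_base(0) < PRIVILEGED_PORT_LIMIT)
    print(apply_rport_override({"RPORT": 21}, DriverConfig(target_port=2121)))
```

Returning a copy rather than mutating `opts` keeps the module's default options intact for later runs; whether the real `_fire()` does the same is not stated in these notes.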
---
### 4. `run_tier3_demo.py --data-root` defaults to relative `"data"` — commit `d2716b4`
**File:** `scripts/install-tier-3-4.sh`
**Symptom:** `run_tier3_demo.py` defaults `--data-root` to `"data"` (relative).
When invoked via `sudo -u cis490`, the CWD was `/`, so episode dirs resolved to
`/data/episodes/` which doesn't exist; `mkdir` raised `PermissionError`.
**Fix:** Pass `--data-root "$DATA_ROOT/data"` explicitly in the install script.
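
The path resolution behind this failure can be reproduced without touching the filesystem. The sketch below is illustrative only: `resolve_data_root` is a hypothetical helper, and `/srv/cis490` is a placeholder standing in for whatever `$DATA_ROOT` expands to on the host.

```python
from pathlib import PurePosixPath


def resolve_data_root(data_root: str, cwd: str) -> PurePosixPath:
    """Resolve data_root the way a process whose working directory is cwd would."""
    # Joining onto an absolute data_root discards cwd, matching POSIX semantics.
    return PurePosixPath(cwd) / data_root


if __name__ == "__main__":
    # Relative default with CWD "/" lands at the filesystem root.
    print(resolve_data_root("data", "/") / "episodes")
    # An absolute path (placeholder value) is independent of the CWD.
    print(resolve_data_root("/srv/cis490/data", "/") / "episodes")
```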
---
### 5. msfrpc bytes/str normalisation — commit `4262625` (closes #20)
**File:** `exploits/msfrpc.py`
**Symptom:** msfrpcd encodes all response strings as msgpack `bin` type (always
Python `bytes`). `unpackb(raw=False)` only converts the legacy `raw` type;
`bin` comes out as `bytes` regardless. `auth.login` received
`{b'result': b'success', b'token': b'TEMP...'}` and `resp.get("result")`
returned `None` → `MSFRpcError("auth.login failed: ...")`.
**Fix:** Added `_decode_response()` recursive `bytes → str` normaliser and
called it in `_raw_call` immediately after `msgpack.unpackb`.
---
### 6. Orchestrator never received `MSFRPC_PASSWORD` — commit `d294eb9`
**File:** `etc/cis490-orchestrator.service`
**Symptom:** The orchestrator unit only loaded `lab-host.env`, which contains
`FLEET_HOST_ID` and `BRIDGE` but not `MSFRPC_PASSWORD`. `run_tier3_demo.py`
checks for the env var at startup and exits `rc=2` immediately if unset.
All tier3 slots were failing in ~240 ms with `rc=2`.
**Fix:** Added `EnvironmentFile=-/etc/cis490/msfrpc.env` to the unit (the `-`
prefix silences the error on Tier-2-only hosts where the file doesn't exist).
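
The startup check that produced `rc=2` can be approximated as below. This is a sketch of the behaviour described above, not the code in `run_tier3_demo.py`; `require_msfrpc_password` and the error message are hypothetical.

```python
import os
import sys


def require_msfrpc_password(environ=None) -> str:
    """Return MSFRPC_PASSWORD, or exit with status 2 when it is unset or empty."""
    env = os.environ if environ is None else environ
    password = env.get("MSFRPC_PASSWORD")
    if not password:
        print("MSFRPC_PASSWORD is not set", file=sys.stderr)
        raise SystemExit(2)
    return password


if __name__ == "__main__":
    # With only lab-host.env loaded, the variable is absent and the process exits 2.
    require_msfrpc_password({"FLEET_HOST_ID": "k-gamingcom"})
```

Accepting an explicit mapping makes the check testable without modifying the real process environment.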
---
### 7. Fleet port formula produces privileged ports; boot timeout too tight — commit `656a015`
**File:** `orchestrator/fleet.py`
**Symptom A:** `PORT_BASE = target_port + slot * 1000` produced host ports
< 1024 for `samba_usermap_script` (RPORT=139, slot 0 → port 139) and
`php_cgi_arg_injection` (RPORT=80, slot 0 → port 80). `cis490` lacks
`CAP_NET_BIND_SERVICE`; QEMU's SLIRP `hostfwd` silently failed. The service
was never reachable. All 7 slots returned `rc=1` after timing out.
**Symptom B:** `--target-boot-timeout` was not passed to `run_tier3_demo.py`,
which uses a 180 s default. Seven concurrent VMs contending on I/O during boot
cannot reliably start their services within 180 s.
**Fix:**
- Port formula: `host_port = (target_port % 1000) + 2000 + slot * 1000`
  (minimum host port 2000; no collisions across slots, and none across module
  types as long as no two target ports are congruent modulo 1000)
- Pass `--target-boot-timeout 300` explicitly from the fleet runner
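
The two formulas can be compared directly. The function names below are hypothetical; only the arithmetic is taken from these notes.

```python
def old_host_port(target_port: int, slot: int) -> int:
    """Original formula: slot 0 maps straight onto the guest port."""
    return target_port + slot * 1000


def new_host_port(target_port: int, slot: int) -> int:
    """Fixed formula: always at least 2000, one 1000-port band per slot."""
    return (target_port % 1000) + 2000 + slot * 1000


if __name__ == "__main__":
    for name, rport in (("samba_usermap_script", 139), ("php_cgi_arg_injection", 80)):
        print(name, old_host_port(rport, 0), new_host_port(rport, 0))
```

For slot 0 this yields 2139 for `samba_usermap_script` and 2080 for `php_cgi_arg_injection`. Because of the `% 1000`, two target ports that differ by a multiple of 1000 (for example 139 and 1139) would map to the same host port within a slot.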
---
## Verification
After all fixes were applied:
- `install-tier-3-4.sh` step 4 produced episode `01KQJM5WGWC33P0QWJXRDJV1EN`
- `install-tier-3-4.sh` step 5 staged 6 real binaries in `samples/store/`
- Fleet wave at 19:55:57 UTC-6 confirmed slot 0 samba probing port 2139
with 300 s timeout — first wave to actually run to completion
## Still outstanding
- Pi-side mTLS cert for k-gamingcom not yet issued (shipper in
"waiting on mTLS material" state). Blocked on Pi operator running
`deploy-cis490-cert.sh k-gamingcom <wg_ip>`. No action needed on
lab-host side.