CIS490/docs/fix-notes-Dev_REL3_050126.md
elliott 0fb2f3b9a6 docs: fix notes for Dev_REL3_050126 — all 7 Tier-3 bring-up bugs
Branch HEAD: 656a015443

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:20:35 -06:00

Fix Notes — Dev_REL3_050126

Branch HEAD: 656a015443f54dffeab66ae29fa726eee36a51ed
Date: 2026-05-02
Author: elliott (k-gamingcom lab host)

Summary

Seven bugs found and fixed during Tier-3 + Tier-4 bring-up on k-gamingcom, following the AGENTS.md runbook. All fixes are committed to Dev_REL3_050126 and deployed to /opt/cis490.


Fixes (oldest → newest)

1. cis490-msfrpcd crashes with EROFS on /root/.msf4 — commit 1dd484d

File: scripts/install-msfrpcd.sh

Symptom: msfrpcd service failed immediately with EROFS because ProtectHome=true in the generated systemd unit made /root a read-only overlay. msfrpcd defaulted $HOME to /root and could not create .msf4/.

Fix: Pre-create /var/lib/cis490/msf4, add Environment=HOME=/var/lib/cis490/msf4 and ReadWritePaths=/var/lib/cis490 to the generated unit.


File: scripts/install-tier-3-4.sh

Symptom A: install-tier-3-4.sh fetched the Metasploitable2 image to $DATA_ROOT/vm/images/ but never symlinked it to $INSTALL_ROOT/vm/images/. launch_target.sh resolved IMAGE relative to $INSTALL_ROOT/vm/images/ and exited immediately; qemu.pid never appeared.

Fix: Added install -d + ln -sf step after the fetch.

Symptom B: The install script's live-patch path did not yet include the HOME fix from item 1. The same change carries that fix over.


3. PORT_BASE=21 is privileged; RPORT not propagated — commit f4eef81

Files: vm/launch_target.sh, exploits/driver.py, tools/run_tier3_demo.py

Symptom: launch_target.sh defaulted PORT_BASE to $((21 + SLOT * 100)). Slot 0 → port 21, which cis490 (non-root) cannot bind. QEMU printed bind(AF_INET, ...): Permission denied and exited before booting the guest. Even if the bind had succeeded, DriverConfig had no way to override RPORT, so the exploit module would still have connected to port 21 rather than the hostfwd'd port.

Fix:

  • launch_target.sh: PORT_BASE default → $((2121 + SLOT * 100))
  • DriverConfig: added target_port: int | None field
  • MSFExploitDriver._fire(): if target_port set and RPORT in opts, override
  • run_tier3_demo.py: pass target_port=args.target_port to DriverConfig
  • install-tier-3-4.sh verify call: --target-port 2121
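
The RPORT override in the bullets above could be sketched roughly as follows. Only the target_port field name and the "override only if RPORT is already in opts" rule come from these notes. The target_host field, the apply_target_port helper, and the shape of the opts dict are hypothetical and do not reproduce the real DriverConfig or MSFExploitDriver._fire() code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DriverConfig:
    """Illustrative stand-in; only target_port is taken from the notes."""
    target_host: str = "127.0.0.1"   # hypothetical field, for illustration
    target_port: int | None = None   # host-side forwarded port, if any


def apply_target_port(opts: dict, cfg: DriverConfig) -> dict:
    """Return a copy of opts with RPORT replaced when target_port is set.

    The override applies only when the module already exposes RPORT,
    so modules without that option are left untouched.
    """
    patched = dict(opts)
    if cfg.target_port is not None and "RPORT" in patched:
        patched["RPORT"] = cfg.target_port
    return patched


# Module default is port 21 inside the guest; the host forwards 2121 to it.
print(apply_target_port({"RPORT": 21}, DriverConfig(target_port=2121)))
```

Leaving target_port as None keeps the module's own RPORT, which preserves the old behaviour for callers that do not pass --target-port.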

4. run_tier3_demo.py --data-root defaults to relative "data" — commit d2716b4

File: scripts/install-tier-3-4.sh

Symptom: run_tier3_demo.py defaults --data-root to "data" (relative). When invoked via sudo -u cis490, the CWD was /, so episode dirs resolved to /data/episodes/ which doesn't exist; mkdir raised PermissionError.

Fix: Pass --data-root "$DATA_ROOT/data" explicitly in the install script.
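
To make the failure mode concrete, here is a small sketch of how a relative --data-root resolves against the working directory. The resolve_data_root helper and the /srv/example path are hypothetical; the real script may build its paths differently.

```python
from pathlib import PurePosixPath


def resolve_data_root(data_root: str, cwd: str) -> PurePosixPath:
    """Resolve data_root the way a process whose working directory is cwd would."""
    path = PurePosixPath(data_root)
    return path if path.is_absolute() else PurePosixPath(cwd) / path


# Relative default with CWD "/" (the sudo -u cis490 case described above).
print(resolve_data_root("data", "/") / "episodes")               # /data/episodes

# An absolute path is unaffected by the working directory.
print(resolve_data_root("/srv/example/data", "/") / "episodes")  # /srv/example/data/episodes
```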


5. msfrpc bytes/str normalisation — commit 4262625 (closes #20)

File: exploits/msfrpc.py

Symptom: msfrpcd encodes all response strings as msgpack bin type (always Python bytes). unpackb(raw=False) only converts the legacy raw type; bin comes out as bytes regardless. auth.login received {b'result': b'success', b'token': b'TEMP...'}, so resp.get("result") returned None and the client raised MSFRpcError("auth.login failed: ...").

Fix: Added _decode_response() recursive bytes → str normaliser and called it in _raw_call immediately after msgpack.unpackb.
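
A recursive bytes → str normaliser of this kind might look like the sketch below. It is not the actual _decode_response() from exploits/msfrpc.py: the function name, the UTF-8 default, and the errors="replace" policy are assumptions, and it works on an already-unpacked object so that it runs without msgpack installed.

```python
def decode_response(obj, encoding="utf-8"):
    """Recursively convert bytes to str in dict keys, dict values, and sequences.

    Assumption: every bytes value is text. Undecodable bytes are replaced
    rather than raising, which would mangle genuinely binary payloads.
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors="replace")
    if isinstance(obj, dict):
        return {
            decode_response(key, encoding): decode_response(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [decode_response(item, encoding) for item in obj]
    return obj  # ints, floats, bools, None, and str pass through unchanged


# Shape of the auth.login reply described above (token value is a placeholder).
raw = {b"result": b"success", b"token": b"TEMP"}
print(raw.get("result"))                    # None: the key is b"result", not "result"
print(decode_response(raw).get("result"))   # success
```

Normalising once at the transport layer means callers can keep using plain str keys instead of each call site handling both forms.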


6. Orchestrator never received MSFRPC_PASSWORD — commit d294eb9

File: etc/cis490-orchestrator.service

Symptom: The orchestrator unit only loaded lab-host.env, which contains FLEET_HOST_ID and BRIDGE but not MSFRPC_PASSWORD. run_tier3_demo.py checks for the env var at startup and exits rc=2 immediately if unset. All tier3 slots were failing in ~240 ms with rc=2.

Fix: Added EnvironmentFile=-/etc/cis490/msfrpc.env to the unit (the - prefix silences the error on Tier-2-only hosts where the file doesn't exist).
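
The rc=2 exits are consistent with a fail-fast startup check along the lines of the sketch below. The notes only say that run_tier3_demo.py exits rc=2 when the variable is unset; the require_env helper, its signature, and the error message are hypothetical.

```python
import os
import sys


def require_env(name: str, environ=os.environ) -> str:
    """Return the value of an environment variable, or exit with status 2."""
    value = environ.get(name)
    if not value:
        print(f"{name} is not set", file=sys.stderr)
        raise SystemExit(2)
    return value


# With the variable present (as after the EnvironmentFile= fix) the call
# simply returns it. The value here is a placeholder.
print(require_env("MSFRPC_PASSWORD", {"MSFRPC_PASSWORD": "placeholder"}))
```

A check like this runs before any VM or RPC work starts, which would explain slots failing within a few hundred milliseconds.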


7. Fleet port formula produces privileged ports; boot timeout too tight — commit 656a015

File: orchestrator/fleet.py

Symptom A: PORT_BASE = target_port + slot * 1000 produced host ports < 1024 for samba_usermap_script (RPORT=139, slot 0 → port 139) and php_cgi_arg_injection (RPORT=80, slot 0 → port 80). cis490 lacks CAP_NET_BIND_SERVICE; QEMU's SLIRP hostfwd silently failed. The service was never reachable. All 7 slots returned rc=1 after timing out.

Symptom B: --target-boot-timeout was not passed to run_tier3_demo.py, which uses a 180 s default. Seven concurrent VMs contending on I/O during boot cannot reliably start their services within 180 s.

Fix:

  • Port formula: host_port = (target_port % 1000) + 2000 + slot * 1000 (minimum host port 2000; each slot gets its own 1000-port range, and modules do not collide as long as their target ports differ modulo 1000)
  • Pass --target-boot-timeout 300 explicitly from the fleet runner
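
The two port formulas can be compared directly. The sketch below uses the formulas and RPORT values quoted in this section; the function names are hypothetical and this is not the code in orchestrator/fleet.py.

```python
def old_host_port(target_port: int, slot: int) -> int:
    """Original formula: can land below 1024 for slot 0."""
    return target_port + slot * 1000


def new_host_port(target_port: int, slot: int) -> int:
    """Fixed formula: always at least 2000, one 1000-port range per slot."""
    return (target_port % 1000) + 2000 + slot * 1000


for module, rport in (("samba_usermap_script", 139), ("php_cgi_arg_injection", 80)):
    print(module, old_host_port(rport, 0), new_host_port(rport, 0))
# samba_usermap_script 139 2139
# php_cgi_arg_injection 80 2080
```

Slot 0 samba mapping to 2139 matches the port seen in the fleet wave under Verification. Two modules whose target ports are equal modulo 1000 (for example 80 and 1080) would map to the same host port within a slot.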

Verification

After all fixes were applied:

  • install-tier-3-4.sh step 4 produced episode 01KQJM5WGWC33P0QWJXRDJV1EN
  • install-tier-3-4.sh step 5 staged 6 real binaries in samples/store/
  • Fleet wave at 19:55:57 UTC-6 confirmed slot 0 samba probing port 2139 with 300 s timeout — first wave to actually run to completion

Still outstanding

  • Pi-side mTLS cert for k-gamingcom not yet issued (shipper in "waiting on mTLS material" state). Blocked on Pi operator running deploy-cis490-cert.sh k-gamingcom <wg_ip>. No action needed on lab-host side.