CIS490

Author	SHA1	Message	Date
max	f9b2e5c4e6	shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors Three robustness items off the future-work list: 1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The daemon sends READY=1 after queue construction and WATCHDOG=1 once per scan pass via a heartbeat callback wired into run_forever. Restart=on-failure only catches process death; silent stalls (deadlock, hung tar subprocess, blocked I/O past timeout) used to leave a hung daemon running with the data backlog growing. Now systemd kills + restarts the daemon if no WATCHDOG=1 arrives within 180s. Verified end-to-end against systemd via `systemd-run --transient --property=Type=notify --property=WatchdogSec=10`: unit transitions to active on READY=1; SIGSTOP'ing the process triggers `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at exactly t+10s, then unit goes failed → restart cycle. 2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew forever as fatal episodes piled up. New ShipperConfig fields: `quarantine_keep_days = 30` (opt-out: 0 disables) and `quarantine_cleanup_interval_s = 3600` (gate so the 5s tick doesn't statx() the whole tree). Cleanup runs at the start of run_once() but is gated to once per hour. Removed entries are logged. 3. Doctor surfaces shipping errors. Tails 10 minutes of the cis490-shipper journal and surfaces 412/400/transient patterns as red/yellow rows with the canonical fix command. An on-device agent running cis490_doctor.py now sees one line ("12 ship(s) rejected as out-of-window") instead of needing to grep the journal. Tests: 200/200 (was 188). New coverage: heartbeat callback fires + survives exceptions; quarantine cleanup respects keep_days, gate, and opt-out; doctor parser correctly classifies 412/400/transient/clean/empty/journalctl-denied; when both error classes are present together, 412 is prioritised (more actionable). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:02:59 -05:00
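The watchdog wiring described in f9b2e5c4e6 (READY=1 once, then WATCHDOG=1 per scan pass through a heartbeat callback that must survive exceptions) can be sketched roughly as follows. This is a minimal sketch, not the shipper's actual code: `sd_notify` here is a hand-rolled helper that writes to the `NOTIFY_SOCKET` datagram socket systemd provides for Type=notify units, and `run_passes` is a hypothetical, bounded stand-in for run_forever.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one notification datagram to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process is not
    running under a Type=notify unit.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = b"\0" + path[1:].encode()
    else:
        addr = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), addr)
    return True


def run_passes(scan_once, heartbeat, passes: int) -> None:
    """Hypothetical stand-in for run_forever: one heartbeat per scan pass."""
    for _ in range(passes):
        scan_once()
        try:
            heartbeat()
        except Exception:
            # A failing heartbeat must not take down the scan loop.
            pass
```

A daemon built this way would call `sd_notify("READY=1")` once after queue construction and pass `lambda: sd_notify("WATCHDOG=1")` as the heartbeat; if a scan pass stalls, no datagram arrives and systemd's WatchdogSec timer fires.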
max	ed5e6b0581	docs+doctor: surface VERSION-stamp + fallback wiring receiver.toml.example: the local_repo_path comment was wrong about when it kicks in. With the new fallback path, it's used both when forgejo_url is unset (sole backend) AND when forgejo is unreachable (failover). Document that, plus the auto-detect of /opt/cis490/.git. cis490_doctor: add a VERSION-stamp check for lab-host role. If /opt/cis490/VERSION is missing or malformed, the orchestrator stamps "unknown" → receiver gate rejects every PUT → quarantine. Surface this as a red row with the canonical fix (re-run install-lab-host.sh) so an on-device agent doesn't have to grep journal logs to figure it out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:54:36 -05:00
elliott	95ac56a382	fix: three install-time bugs found during first lab-host bring-up on k-gamingcom 1. pyproject.toml: move pycdlib to main deps (was dev-only; cidata build fails on first install because the venv doesn't include dev extras). 2. scripts/install-lab-host.sh: create vm/images/ dir before symlinking alpine-baseline.qcow2 and cidata.iso into INSTALL_ROOT. Without the mkdir the ln -sf silently fails (|| true), leaving the launchers unable to find the images and causing every episode to fail within 15 s. 3. tools/cis490_doctor.py (two fixes): a. Insert repo_root into sys.path at doctor startup so the inline `from exploits.modules import ...` succeeds when running from /opt/cis490 (package = false means nothing is installed into site-packages). b. Pass cwd=/opt/cis490 to the shipper --ping subprocess so python -m shipper resolves the module correctly regardless of the caller's CWD. Tested on k-gamingcom: install script now builds cidata.iso on first run, 7-slot fleet wave completes with rc=0, doctor shows 13 ok / 4 warn / 2 fail (remaining failures are mTLS certs + collector.wg DNS; both need Pi-side action, not code changes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 15:05:00 -06:00
max	507eac617b	Solvable Tier-3 holes: callback payloads, busybox workloads, bridge by default Closes the next batch of issues from the post-mortem. The previous "each run uses a different vulnerability" commit shipped 5 modules but 3 of them couldn't actually fire under SLIRP+restrict=on: their reverse-shell payloads needed a callback channel the launcher didn't provide, AND their LHOST options were set to {{ target_ip }} (the target's IP, not the attacker's; a copy-paste from RHOSTS). At the same time, the workloads.py shell commands used bash-only /dev/tcp redirects that silently no-op'd in the busybox shell sessions Metasploitable2 returns. Net effect: episodes that selected those modules would have produced session_open_timeout + dead workloads. Module configs (the three callback ones): exploits/modules/distccd_command_exec.toml exploits/modules/php_cgi_arg_injection.toml exploits/modules/unreal_ircd_3281_backdoor.toml - Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl so the target listens on a known port; msfrpcd connects to it via the host's hostfwd (no callback path required). - Drop the bogus LHOST = "{{ target_ip }}"; bind shells don't use LHOST. - Add [runtime] table: requires_bridge = true extra_target_ports = [<bind_lport>] Both fields are honored by the loader (ModuleConfig.requires_bridge) and the launcher (TARGET_PORTS gets the extra port hostfwd'd when BRIDGE mode is active). orchestrator/fleet.py When BRIDGE is unset in env, _run_slot filters the module catalog down to modules where requires_bridge=False before calling select_module. Two same-socket-shell modules (vsftpd_234_backdoor + samba_usermap_script) survive, so the fleet still has variety; it just doesn't pick modules whose payloads can't land. With BRIDGE set, the full catalog rotates as before, AND BRIDGE is propagated to the per-slot subprocess env so launch_target.sh enters tap+bridge mode. 
exploits/workloads.py Replaced bash-only constructs in three profiles: scan-and-dial: /dev/tcp/HOST/PORT redirects → nc -z -w 1; bursty-c2: same fix; shell-resident: exec 3<>/dev/tcp/... → piping into nc -w. All three now run cleanly in busybox / dash / Metasploitable2's default shell. The remaining three profiles (cpu-saturate, io-walk, low-and-slow) were already busybox-portable. scripts/install-lab-host.sh - lab-host.env now defaults BRIDGE=br-malware (was commented out). Operator opt-out is to comment the line back out. - New step 6b: provisions br-malware via vm/setup_bridge.sh AND pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo, cis490target0..7 for Tier-3 target) all attached to br-malware and brought up. Launchers reference these by SLOT, so no sudo is needed at episode time. - On bridge-setup failure, the script auto-comments BRIDGE in the env file with an "auto-disabled: bridge setup failed" note so the fleet falls back to same-socket modules + Tier-2 cleanly. tools/cis490_doctor.py Three new checks for the lab-host role: bridge: br-malware exists / up; tier3: msfrpcd listening on 127.0.0.1:55553; tier3: module catalog parses (counts same-socket vs requires_bridge). All three are warn-level: they don't fail an otherwise-healthy Tier-2-only setup; they tell the operator what's missing for full Tier-3 + source 4 coverage. Tests: 132 (was 129). New cases: test_fleet.py +3 - fleet skips requires_bridge modules when BRIDGE unset (asserted across 20 episodes; never picks a callback module) - fleet uses the full catalog when BRIDGE is set - BRIDGE env propagates to per-slot subprocess What's still untested live: the bind_perl payloads against a real Metasploitable2 in the bridge-enabled launcher path. That's a deployment validation, not a code change. The unit tests confirm the dispatch / filter logic; the live test is the next operator action. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:32:52 -05:00
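The catalog filter that 507eac617b describes for orchestrator/fleet.py could look something like the sketch below. It assumes a simplified `ModuleConfig` carrying only a name and the `requires_bridge` flag, and a hypothetical helper `eligible_modules`; the real _run_slot and select_module are not shown in the log. Treating an empty BRIDGE value the same as an unset one is also an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleConfig:
    """Simplified stand-in for the loader's module config."""
    name: str
    requires_bridge: bool = False


def eligible_modules(catalog, env=None):
    """Return the modules a slot may pick from.

    With BRIDGE set, the full catalog is eligible. Without it, modules
    whose payloads need the bridge (bind-shell ports) are filtered out.
    """
    env = os.environ if env is None else env
    if env.get("BRIDGE"):  # assumption: empty string counts as unset
        return list(catalog)
    return [m for m in catalog if not m.requires_bridge]


CATALOG = [
    ModuleConfig("vsftpd_234_backdoor"),
    ModuleConfig("samba_usermap_script"),
    ModuleConfig("distccd_command_exec", requires_bridge=True),
    ModuleConfig("php_cgi_arg_injection", requires_bridge=True),
    ModuleConfig("unreal_ircd_3281_backdoor", requires_bridge=True),
]
```

With an empty environment, `eligible_modules(CATALOG, env={})` keeps only the two same-socket modules; with `env={"BRIDGE": "br-malware"}` all five remain eligible.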
max	6f8b744c33	cis490-doctor + AGENTS.md operator runbook + louder install script Adds the missing diagnostic + onboarding tools so an agent (AI or human) handed a fresh lab host can get to "shipping data" without re-deriving every step from logs. tools/cis490_doctor.py — one-shot health check that walks the full stack from the bottom up. Each row is green/yellow/red with an exact fix command for the red rows. Checks: - repo: branch, tree-clean, distance from origin/main - install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml, /etc/cis490/lab-host.env - mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify - systemd: cis490-{shipper,orchestrator,receiver} active state - net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg - vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2, cidata.iso - tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level) - end-to-end: cis490-shipper --ping Modes: --role {lab-host,receiver}, --json (machine-readable), --no-tier3 (skip optional checks). Exits non-zero on any red row. ANSI color (auto-disabled on non-tty / NO_COLOR). AGENTS.md gains a "How a lab host gets to shipping data" canonical flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh → install-lab-host.sh → cis490-doctor → systemctl enable. Plus an "on-demand episode" recipe + a "smallest E2E test" snippet for agents that need to verify the pipe without waiting on the timer. The strict "cloning the repo by itself does nothing" callout makes the failure mode mu and elliott-lab hit explicit. scripts/install-lab-host.sh prints a 5-step banner on first install that points at cis490_doctor.py + the deploy-cis490-cert.sh flow, plus an always-printed footer warning that "cloning + running launchers manually is NOT enough." Same message the AGENTS.md section reinforces. 
Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime" claim that turned out to be untrue because no cis490-shipper service was running on elliott-lab — exactly the case this diagnostic tool targets). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:11:57 -05:00
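The row model that 6f8b744c33 describes for the doctor (green/yellow/red rows, a fix command on red rows, a --json mode, non-zero exit on any red row) might be shaped roughly like this. The `Row` class, its field names, and the "ok"/"warn"/"fail" labels are assumptions for illustration, not the contents of tools/cis490_doctor.py.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Row:
    """One health-check result (hypothetical shape)."""
    check: str
    status: str      # "ok" (green), "warn" (yellow), or "fail" (red)
    detail: str = ""
    fix: str = ""    # exact fix command, expected on "fail" rows


def exit_code(rows) -> int:
    """Non-zero when any row is red; warn rows alone do not fail the run."""
    return 1 if any(r.status == "fail" for r in rows) else 0


def to_json(rows) -> str:
    """Machine-readable form, as a --json style mode might emit."""
    return json.dumps([asdict(r) for r in rows], indent=2)


rows = [
    Row("vm prereqs: /dev/kvm", "ok"),
    Row("tier3 prereqs: msfrpcd", "warn", "not listening"),
    Row("systemd: cis490-shipper", "fail", "inactive",
        fix="systemctl enable --now cis490-shipper"),
]
```

Keeping the fix command on the row itself means the text and JSON outputs carry the same remediation hint, so an agent reading either form does not need to consult the journal.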

5 commits