CIS490

Author	SHA1	Message	Date
Elliott Kolden	b73f5559dc	Tier-3 fixes: b'' probe false-positive, requires_bridge, msgpack Bug 10: _wait_for_tcp returned on recv()→b'' (connection closed by peer), falsely signalling service-ready. Only socket.timeout or non-empty data are genuine ready signals; b'' now retries. Bug 11: distccd_command_exec and unreal_ircd_3281_backdoor incorrectly had requires_bridge=true. bind_perl payloads connect inward (host→guest via hostfwd), not outward — no bridge egress needed. Both modules now run on SLIRP-only fleet slots. Bug 12: msgpack.unpackb crashed on integer session IDs from msfrpcd 6.x (strict_map_key=True default). Added strict_map_key=False. Bug 13 (documented): samba_usermap_script removed from catalog (NoReply on every fire — already handled in `dca6144` on origin/main). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 15:15:18 -06:00
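A minimal sketch of the readiness rule described in Bug 10 of commit b73f5559dc: only a recv timeout or non-empty data counts as "service ready", while an empty read (peer closed) or a connection error means "retry". The names `probe_once` and `wait_for_tcp`, and every default value, are illustrative assumptions and not the repository's actual `_wait_for_tcp` signature.

```python
import socket
import time


def probe_once(host: str, port: int, recv_timeout_s: float = 3.0) -> bool:
    """Return True only on a genuine ready signal from the service."""
    try:
        with socket.create_connection((host, port), timeout=recv_timeout_s) as sock:
            sock.settimeout(recv_timeout_s)
            try:
                data = sock.recv(1)
            except socket.timeout:
                # Connected, and the peer is waiting for client data: ready.
                return True
            # Non-empty data (for example a banner) is ready.
            # b"" means the peer closed the connection: not ready.
            return data != b""
    except OSError:
        # Refused, reset, or connect timeout: the port is not ready yet.
        return False


def wait_for_tcp(host: str, port: int, deadline_s: float = 120.0,
                 interval_s: float = 1.0, recv_timeout_s: float = 3.0) -> bool:
    """Retry probe_once until it succeeds or the deadline passes."""
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        if probe_once(host, port, recv_timeout_s):
            return True
        time.sleep(interval_s)
    return False
```

The inner `except socket.timeout` has to come before the outer `except OSError`, because `socket.timeout` is itself an `OSError` subclass and would otherwise be treated as "not ready".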
Max Gorog	3d4f282e9c	Tier-2 episodes use clean-only schedule; .gitignore VERSION Two correctness fixes that the §4.5 event-driven labeller surfaced: 1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule (clean → armed → infecting → infected_running → ...) for episodes with no exploit firing. Pre-§4.5 those episodes wrote dishonest `infected_running` labels from the schedule clock — exactly the §3 evidence pattern. Post-§4.5 they write `failed` at the infecting transition (the justifying exploit_fire never arrives), which is honest about what happened but useless for training. The honest fix: Tier-2 episodes have a clean-only schedule. All telemetry tagged `clean` because nothing infected anything. The total duration matches the canonical Tier-3 schedule so episode lengths are comparable across tiers — no length-bias in the dataset (§10). Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py derives `[("clean", total_seconds)]` from the canonical schedule. `tier3_schedule_from(schedule)` renders the legacy `[(name, seconds)]` shape EpisodeConfig still expects. Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from. Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from. Drops the hardcoded DEFAULT_SCHEDULE constants from both — the canonical manifest is the single source of truth (§4.1). 2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp writes /opt/cis490/VERSION so episodes can record code provenance without /opt/cis490 carrying a .git directory. But /opt/cis490 IS typically a git checkout on lab hosts (auto-update.sh pulls into it), so writing VERSION leaves the working tree dirty. Every episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance gate's rule 4 would then reject every episode without CIS490_ALLOW_DIRTY=1 set — which would break the data flow. Now VERSION is .gitignored: install-lab-host.sh stamps it, git status doesn't see it, dirty=false, gate rule 4 passes naturally. 
These two changes together keep the data flowing AND honest. Tier-2 episodes pass with `phases=[clean]` + every collector emitting real rows. Tier-3 episodes (none today, empty catalog) walk the full event-driven schedule when a verified module gets re-admitted. 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:55:37 -05:00
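The Tier-2 schedule derivation in commit 3d4f282e9c can be sketched as below. The commit only states that `tier2_schedule_from(schedule)` derives `[("clean", total_seconds)]` from the canonical schedule; the input shape used here (a list of `(phase_name, seconds)` pairs) is an assumption, and the real helper in orchestrator/manifest.py may accept a different type.

```python
from typing import List, Tuple

Schedule = List[Tuple[str, float]]


def tier2_schedule_from(schedule: Schedule) -> Schedule:
    """Collapse a multi-phase schedule into one clean phase of equal length."""
    # Keeping the total duration identical keeps episode lengths
    # comparable across tiers, so length cannot leak the tier.
    total_seconds = sum(seconds for _name, seconds in schedule)
    return [("clean", total_seconds)]


# Hypothetical canonical schedule, for illustration only.
canonical = [("clean", 10), ("armed", 5), ("infecting", 5), ("infected_running", 10)]
print(tier2_schedule_from(canonical))
```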
Max Gorog	d9f913fc97	PIPELINE §5 step 6: event-driven labeller (§4.5) Phase labels are written ONLY when justifying events arrive. The schedule clock is now a budget — an upper bound — never a label source. This is the core honesty fix the §3 evidence demanded: Before: every Tier-3 episode wrote `infected_running` from the schedule clock regardless of whether session_open ever fired. Per §10 every dishonest label is a poisoned training example. 67/67 of the §3 probe episodes were poisoned this way. After: `infecting` writes ONLY when exploit_fire is observed in events.jsonl. `infected_running` writes ONLY when session_open is observed. Either timing out or seeing session_open_timeout terminates the walker with a `failed` label that the §4.6 acceptance gate will reject. PHASE_JUSTIFYING_EVENTS in orchestrator/episode.py declares which events justify which phases: "clean": None # orchestrator-emitted "armed": None # orchestrator-emitted "infecting": ("exploit_fire",) "infected_running": ("session_open",) TERMINAL_FAILURE_EVENTS = {"session_open_timeout"} short-circuit any event-driven wait into a `failed` label. `dormant` is intentionally OFF the canonical schedule. §4.5 calls for dormant to be event-driven (session_idle / session_active) too, but the driver doesn't emit those yet. Per §1 default-to-removal we ship without dormant rather than label it from the clock; when the driver gains those emits, dormant re-enters the schedule with proper justification. EpisodeRunner now owns: * `_event_log` — every emit_event appends here * `_event_cv` — condition variable for waiters * `_wait_for_event(names, since_t_mono_ns, timeout_s)` — returns the first matching event in the log with t_mono >= threshold; threshold catches events that fired during the previous on_phase callback. When an event-driven phase's justifier already arrived (e.g. 
exploit_fire emitted by driver._fire() inside on_phase("armed")), the walker uses the EVENT's t_mono on the label — not the time the walker noticed. The label means "this is when this thing actually happened." manifest.toml: dropped the dormant cycle from the canonical schedule. Episode is shorter (~30s) but every label is event-justified. 14 new tests in tests/test_event_driven_labeller.py covering: justifier mapping invariants, _wait_for_event semantics (already-arrived, future, timeout, since-threshold, first-of-multiple-names), walker behavior (orchestrator-emitted phases, event-driven phases, missing event → failed, terminal-failure-event short-circuit, stop event, event-t_mono on label, phase_transition events with justified_by). 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:43:16 -05:00
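The event-driven labelling rule from commit d9f913fc97 can be sketched as follows. The two mappings are copied from the commit message; the `EventLog` class, the `label_phase` function, and the event dictionary layout are hypothetical stand-ins for the `EpisodeRunner` internals, which are not shown in this log.

```python
import threading
import time

PHASE_JUSTIFYING_EVENTS = {
    "clean": None,              # orchestrator-emitted
    "armed": None,              # orchestrator-emitted
    "infecting": ("exploit_fire",),
    "infected_running": ("session_open",),
}
TERMINAL_FAILURE_EVENTS = {"session_open_timeout"}


class EventLog:
    """Append-only event log with a condition variable for waiters."""

    def __init__(self):
        self._event_log = []
        self._event_cv = threading.Condition()

    def emit_event(self, name, t_mono_ns=None):
        stamp = time.monotonic_ns() if t_mono_ns is None else t_mono_ns
        event = {"name": name, "t_mono_ns": stamp}
        with self._event_cv:
            self._event_log.append(event)
            self._event_cv.notify_all()
        return event

    def wait_for_event(self, names, since_t_mono_ns, timeout_s):
        """Return the first logged event matching names at or after the threshold."""
        deadline = time.monotonic() + timeout_s
        with self._event_cv:
            while True:
                for event in self._event_log:
                    if event["name"] in names and event["t_mono_ns"] >= since_t_mono_ns:
                        return event
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                self._event_cv.wait(remaining)


def label_phase(log, phase, since_t_mono_ns, budget_s):
    """Return (label, t_mono_ns); the schedule clock is only a budget."""
    justifiers = PHASE_JUSTIFYING_EVENTS[phase]
    if justifiers is None:
        return phase, time.monotonic_ns()
    wanted = set(justifiers) | TERMINAL_FAILURE_EVENTS
    event = log.wait_for_event(wanted, since_t_mono_ns, budget_s)
    if event is None or event["name"] in TERMINAL_FAILURE_EVENTS:
        return "failed", time.monotonic_ns()
    # Use the event's own timestamp, not the time the walker noticed it.
    return phase, event["t_mono_ns"]
```

Because `wait_for_event` scans the log before blocking, a justifier that already arrived during the previous phase callback is found immediately.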
Max Gorog	0d51b9b253	PIPELINE §5 step 5: collector admission emit tests (§4.4) Adds the missing emit-tests so every collector in KNOWN_COLLECTORS has end-to-end coverage: * test_proc_emits_rows_against_self_pid Samples /proc/<own pid> for ~0.6s. Asserts ≥3 rows + populated core fields (cpu_user_jiffies, rss_bytes, vsize_bytes). Works anywhere with /proc. * test_pcap_bucketize_emits_rows_from_synthetic_capture Builds a 2-packet Ethernet+IPv4+TCP pcap in-memory, feeds it to pcap.bucketize, asserts ≥1 row written + total packet count across buckets matches input. Covers BOTH the pcap and netflow collectors (netflow IS the bucketized pcap output). * test_every_known_collector_has_emit_coverage Cross-cutting tripwire: for every name in KNOWN_COLLECTORS, either there's a test_collectors_emit.py test or there's an explicit COLLECTOR_TEST_CARVE_OUTS entry. Adding a collector to KNOWN_COLLECTORS without an emit test fails this. Carve-outs today: qmp (covered by tests/test_qmp.py — needs running QEMU for real-binary emit) and guest_agent (covered by tests/test_guest_agent.py — needs a real VM with the agent baked in). The carve-outs are explicit, not implicit. A drift where someone adds a new collector without a real-binary emit test fails CI before the manifest can include it. 272 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:37:40 -05:00
Max Gorog	22269e175d	PIPELINE §5 step 4: catalog admission verifier (§4.3) tools/verify_catalog.py runs the §4.3 end-to-end verification flow against every entry in manifest.toml's [catalog].modules (or a single named module). The flow follows §4.3 exactly: 1. Load the module config + the verified-against target spec. 2. Resolve the published image path; fail loudly if absent. 3. Boot the target VM under §4.13 containment (restrict=on, snapshot=on, no shared FS, unprivileged QEMU — same posture as verify.sh). 4. Wait for the service on the spec'd port. 5. Login to msfrpcd, snapshot the existing session set, fire the module against `127.0.0.1:<host_port>` (the SLIRP hostfwd to the guest's promised service port). 6. Wait for `session_open` — NOT session_open_timeout, which is the §4.5 failed-label outcome. 7. Round-trip a shell command (`id`); confirm uid= shape. 8. Confirm a guest-side artifact (touch marker; ls + echo VERIFY_OK). Per-module exit code is 0 only when EVERY step passes. CLI exit is 0 only when EVERY requested module passes — partial credit isn't an option (§1 default-to-removal: a module that can't pass shouldn't be in the catalog). Structured JSON output with per-step timings + detail strings, written to stdout or --out <path>. Operator pulls this into a successful CI run + signs off on the manifest.toml [[catalog.modules]] amendment with a fresh `last_verified = <commit_sha>` per §15. Tests (tests/test_verify_catalog.py, 8 cases): exercise the flow with a mocked MSFRpcClient + mocked qemu boot. Cover happy path, every short-circuit failure mode (image missing, service never up, session timeout, shell round-trip wrong, guest artifact missing), and spec-load errors. Real verification needs lab hardware; the mocked flow proves the orchestration contract. 269 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:35:32 -05:00
Max Gorog	4d29b7236d	PIPELINE §5 step 3: target VM build infrastructure + containment posture §4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands every target ship the same isolation posture (no upstream egress, no host-shared FS, unprivileged QEMU, fresh snapshot per episode). This commit lands the infrastructure for both. New surface: * orchestrator/target_spec.py Loads + validates `vm/targets/<name>/spec.toml`. Containment fields are not knobs — each has exactly ONE safe value, and a spec asserting the unsafe value is rejected at load time. There's no `--containment-override`; weakening §4.13 requires amending PIPELINE.md and operator sign-off. * tools/build_target.py Orchestrates build → verify → publish for a single target. Spec invalid → exit 78 (sysadmin error). build.sh failure → image not published. verify.sh failure → image discarded; that's the §4.2 acceptance gate. Publishes sha256 + the manifest.toml stanza the operator copies in to admit the image (§16 substantive amendment with sign-off per §15). * vm/targets/<name>/{spec.toml,build.sh,verify.sh} Template structure. spec.toml is the contract; build.sh produces $OUT_PATH; verify.sh boots the produced image under the §4.13 containment posture and asserts every promise. * vm/targets/shellshock/ First real working target. CVE-2014-6271 (Apache mod_cgi + bash 4.2 mis-parsing function-export environment values). Replaces the SourceForge Metasploitable2 path that §3 evidence proved unverifiable. Bash 4.2 is built from sha256-pinned GNU source inside an Alpine 3.21 cloudinit guest; the build script asserts the produced bash actually triggers shellshock; the verifier re-asserts it under restrict=on with a real CVE-2014-6271 probe. * vm/targets/README.md How operators add a target. Walks the spec → build → verify → manifest amendment loop. 
Containment regression tests (tests/test_containment.py) — 20 new assertions, parameterized over every target with a build/verify trio: * verify.sh MUST contain `restrict=on` on its netdev (§4.13) * verify.sh MUST contain `snapshot=on` on the boot drive (§4.13) * verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs * verify.sh + build.sh MUST NOT wrap qemu-system in `sudo` * Every target must ship the complete spec.toml + build.sh + verify.sh trio — no half-built targets (§1 default-to-removal) Spec validation tests (tests/test_target_spec.py): 13 new tests over spec parse, name/dir mismatch, missing fields, out-of-range port, and the §4.13 containment field validators (each unsafe value rejected with a clear error). The shellshock target's image is NOT yet published to manifest.toml's [[targets.images]] — that's the §15 sign-off amendment that lands after a successful operator-driven build_target.py run on a lab host with KVM. Building takes ~10 min on x86_64; cannot run on the Pi under TCG. Operator drives the first build, verifies the sha256, then amends manifest.toml in a follow-up commit. 261 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:31:40 -05:00
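A rough sketch of the "containment fields are not knobs" rule from commit 4d29b7236d: each field has exactly one safe value, and a spec asserting anything else is rejected at load time. The field names in `SAFE_CONTAINMENT` and the `SpecError` class are hypothetical; the actual keys in `spec.toml` and the validators in orchestrator/target_spec.py are not visible in this log.

```python
class SpecError(ValueError):
    """Raised when a target spec fails validation."""


# Hypothetical field names; each maps to its single permitted value.
SAFE_CONTAINMENT = {
    "restrict": True,          # no upstream egress on the netdev
    "snapshot": True,          # fresh snapshot per episode
    "shared_fs": False,        # no host-shared filesystem
    "privileged_qemu": False,  # QEMU runs unprivileged
}


def validate_containment(containment: dict) -> None:
    """Reject a spec whose containment block is missing or unsafe."""
    for field, safe_value in SAFE_CONTAINMENT.items():
        if field not in containment:
            raise SpecError(f"containment.{field} is required")
        # Identity check so that 1 or "on" cannot pass for True.
        if containment[field] is not safe_value:
            raise SpecError(
                f"containment.{field} must be {safe_value!r}, "
                f"got {containment[field]!r}"
            )
```

There is deliberately no override parameter, matching the commit's statement that weakening the posture requires amending PIPELINE.md.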
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file: manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob; ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run_*_demo.py --manifest (samples path now from canonical) New surface: * manifest.toml: the single source of truth * orchestrator/manifest.py: load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta: populated by run_*_demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance * cis490-orchestrator.service: RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh: validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 
12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
Max Gorog	dca6144a4a	catalog: remove samba_usermap_script — never landed sessions in prod PIPELINE.md §1 (default-to-removal), §4.3 (catalog admission), §10 (every dishonest label is a poisoned training example). Empirical evidence on commits `4ab5477` → `c41763b`: samba_usermap_script fired its bind_perl payload but the framework's bind handler never managed to connect to the guest's listening port within session_open_timeout_s=30 (or even with WfsDelay=30 bumped on the framework side). All 67 attempts in the §3 probe ended in session_open_timeout. Yet the schedule clock was still writing `infected_running` labels for the failed exploit — exactly the §10 poisoned-example pattern. Until §5 step 3 builds an in-house target VM and step 4 re-admits modules with `verified_against` recorded (§4.3), the production catalog should consist of zero verified Tier-3 modules. That's the state after this removal: the four remaining modules (vsftpd_234_backdoor, distccd_command_exec, php_cgi_arg_injection, unreal_ircd_3281_backdoor) are all `requires_bridge=true`, which the fleet picker filters out unconditionally (the post-revert behavior from commit `0390eb2`). Net effect: production runs Tier-2 only, producing honest Tier-2 episodes and zero dishonest Tier-3 infected_running labels. Test fixture updated to inject synthetic in-memory ModuleConfigs instead of loading from disk, so Tier-3 dispatch logic stays tested even though no production module qualifies. test_exploits asserts the new "every shipped module is requires_bridge until §4.3 admits something verified" invariant — flips into a tripwire if anyone reintroduces an unverified non-bridge module. 229 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 22:48:03 -05:00
Max Gorog	c41763bd28	install-lab-host: auto-install perf + tcpdump on Arch / Debian / RHEL PIPELINE.md §4.4 requires every collector in the active set to actually work end-to-end. On k-gamingcom (commit `dac03d2` episode at 02:21Z) the new perf_unavailable lifecycle event surfaced a concrete cause: `reason: binary_not_on_path`, meaning perf is enabled but the binary isn't installed. Same story with tcpdump on k-gamingcom (pcap_unavailable events with `error: tcpdump not found`). The canonical install script is the right place to ensure the deps are present. detect_os reads /etc/os-release; ensure_collector_packages installs `perf` (Arch / RHEL) or `linux-perf` + `linux-tools-generic` (Debian/Ubuntu) plus `tcpdump`. After the install attempt the script re-checks `command -v` and dies loudly if either is still missing. Silent failure is forbidden per §1, so install failure has to be observable. Idempotent (`--needed` / equivalent skips already-installed packages). Operator owns full system upgrades; this only does targeted package install. On unknown distros logs a warning and dies on the followup check, with a clear pointer to install perf/tcpdump by hand. The next autoupdate tick on k-gamingcom should pull this and self-install perf + tcpdump, after which rows_perf > 0 and pcap should start producing bytes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 21:26:28 -05:00
Max Gorog	dac03d2eff	perf: emit per-episode lifecycle events; emit row even with empty agg Validation on k-gamingcom (commit `ac7b85f`) showed perf enabled in production but rows_perf=0 on every episode. Without lifecycle events the failure mode is indistinguishable from "perf wasn't enabled" — §1 silent-downgrade. The events now surface the actual cause: - perf_unavailable — binary missing OR launch failed (with reason) - perf_started — perf is running (pid, events, interval) - perf_first_row — first row written; counters_populated tells whether any event was actually counted - perf_finished — final tally (intervals_seen, intervals_with_values) - perf_no_counters — perf was alive but every interval came back <not counted> (likely paranoid > 2 or PID ownership mismatch) `_flush()` now writes a row whenever an interval is observed, even when every event was <not counted>. The all-None row is honest data ("perf observed this interval and counted nothing"), and the rows become a count of observed intervals rather than a count of successful measurements — distinct from rows_proc / rows_qmp which do count successful measurements. Trainers filter on `cycles is not None` etc. when they need only populated rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:08:42 -05:00
Max Gorog	0390eb20b6	fix: revert speculative fleet picker change — was producing dishonest labels Empirical evidence from k-gamingcom (commit `4ab5477`, 2026-05-03 22:20Z vsftpd_234_backdoor episode): the picker selected vsftpd because BRIDGE was set on that host. The exploit fires against target_ip=127.0.0.1 (SLIRP loopback) but vsftpd's hardcoded port-6200 backdoor is reachable only at the guest's bridge IP. Result: session_open_timeout, AND a schedule-clock-driven `infected_running` label was still written for the failed exploit — exactly the §10 poisoned-training-example pattern. Until guest-IP discovery for bridge mode is wired (a separate piece of infrastructure), bridge-only modules can't actually reach their target even when the operator sets BRIDGE for Tier-2's pcap source. Revert the picker to its prior conservative form: drop requires_bridge modules unconditionally regardless of BRIDGE state. Same for the BRIDGE env strip in the Tier-3 launch path — it was correct as unconditional. Replaces the two aspirational tests (test_fleet_uses_all_modules_when_bridge_set, test_fleet_propagates_bridge_env_to_runner) with their honest negatives (test_tier3_drops_requires_bridge_modules_unconditionally, test_tier3_strips_bridge_env_even_when_set). The previous tests asserted behavior the rest of the pipeline can't deliver; they were false signals. 229 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:58:43 -05:00
Max Gorog	ac7b85ff8d	PIPELINE §5 step 1 follow-up: enable perf in production launchers The §5 step 1 fixes correct the perf collector's stdout/stderr + event-name parser bugs, but the launchers (run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so production episodes still ship with rows_perf=0 — silently disabled collector, which is exactly the §1 / §4.4 pattern. Turn it on in both launchers. Failure modes (perf binary missing, paranoid level too high) are logged as warnings + return 0 rows visibly, not silently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:40:37 -05:00
Max Gorog	4ab5477226	PIPELINE §5 step 1: fix four root-cause defects Diagnoses + fixes for the silent-collector / never-lands-session failures that the 200-episode quality probe surfaced (§3 evidence). All four address the producer; no compensating layers added. perf collector (rows_perf=0 on 100% of episodes): - perf stat -j writes to stderr by default with -p; we read stdout. Add --log-fd 1 so JSON reaches stdout where the parser sees it. - Event names come back annotated with the privilege scope perf actually measured ("cycles:u" under perf_event_paranoid=2). Strip the suffix so _build_row's plain-name lookups hit. Without this every metric was None even when perf reported real numbers. - tests/test_collectors_emit.py covers the regression with a real busy-loop fixture; emit-test discipline per §4.4. guest-agent collector (rows_guest=0 on 100% of episodes): - Alpine cloud image doesn't ship python3, so the in-guest agent's `#!/usr/bin/env python3` shebang silently fails. Add packages: [python3] to cidata user-data so cloud-init installs it before the OpenRC service starts. - Guest agent now exits nonzero (was: silent stdout fallback) when /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC reports the failure to /var/log/cis490-agent.log instead of the bytes vanishing into the void. Refs §1. - Host-side collector emits guest_agent_connected / guest_agent_first_byte / guest_agent_silent_window into the orchestrator's events.jsonl. Future episodes show the in-guest failure mode per-episode instead of inferring from rows_guest=0. k-gamingcom missing qmp/netflow/pcap (also affected elliott on Tier-3 episodes — was misclassified as host divergence): - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT qmp_socket / guest_agent_socket / bridge_iface — even though launch_target.sh creates the underlying chardevs and BRIDGE supplies the iface. tools/run_real_vm_demo.py wires them correctly; Tier-3 had a copy-paste gap. 
- tests/test_collectors_emit.py adds a source-grep regression so the wiring stays honest. samba_usermap_script never lands session (0/67 in §3 probe): - Bind handler default WfsDelay (~5s) gives up before bind_perl on Metasploitable2 has finished forking + binding LPORT under SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in exploits/driver.py so framework + driver agree on the wait budget. Add ConnectTimeout=15 so the handler's bind connect has retry budget instead of one-shot. orchestrator/fleet.py: usable_modules + BRIDGE handling were both unconditional, so: - With BRIDGE set, requires_bridge modules were still being dropped — picker only ever returned samba_usermap_script across every slot/episode (the test_fleet_uses_all_modules_when_bridge_set failure on HEAD). - env.pop("BRIDGE") fired even when BRIDGE was the operator's explicit setup, breaking modules that need bridge mode (vsftpd backdoor on hardcoded port 6200, distccd, etc.). Both made conditional on bridge_set so the picker walks the full catalog under bridge mode and SLIRP-only modules still get a clean SLIRP env when BRIDGE is unset. receiver/app.py: half-pregnant v2 schema state in HEAD — calling store.ingest_stream(episode_type=..., benign_profile=...) with kwargs the matching store.py change was in the WIP stash. Removed v2 awareness from app.py so v1 episodes (what the producer ships today) get accepted again. SCHEMA_VERSION default reset to 1 to match. 229 passed, 0 failed. (HEAD had 15 failures, all linked to the half-pregnant v2 state above.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:05:25 -05:00
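The perf event-name fix in commit 4ab5477226 (stripping the privilege-scope suffix so that "cycles:u" matches a plain "cycles" lookup) might look roughly like this. The function name `strip_privilege_suffix` and the exact set of modifier letters are assumptions; the commit only mentions the ":u" case seen under perf_event_paranoid=2.

```python
import re

# Assumed subset of perf event modifiers (user, kernel, hypervisor, guest, host).
_MODIFIER_SUFFIX = re.compile(r":[ukhGH]+$")


def strip_privilege_suffix(event_name: str) -> str:
    """Drop a trailing privilege modifier such as ':u' from a perf event name."""
    return _MODIFIER_SUFFIX.sub("", event_name)


for raw in ("cycles:u", "instructions", "sched:sched_switch"):
    print(raw, "->", strip_privilege_suffix(raw))
```

Anchoring the pattern to a short run of modifier letters at the end of the name leaves tracepoint-style names that contain a colon untouched.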
Max Gorog	bfb1c491f8	PIPELINE.md is canonical; rewrite AGENTS.md; delete FIXYOURSELF.md PIPELINE.md is the canonical plan for the data-collection / emulation / labelling pipeline. It supersedes any guidance in AGENTS.md, README.md, or other repo docs that contradicts it (§17). Future sessions read it before changing anything in the pipeline. AGENTS.md is rewritten to point at PIPELINE.md as canonical and to strip the prescriptive symptom→fix table that absorbed producer-side defects instead of fixing them (§7.1 compensating-layer pattern). FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The states it covered are made impossible by the §4.6 acceptance gate landing later in §5; recovering from a state that shouldn't exist is itself the bandaid we're removing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:04:43 -05:00
max	05bf785f0a	fleet-health: exit 0 when alerts found (don't mark unit failed) The detector previously returned 1 on alerts, which made systemd mark cis490-fleet-health.service as 'failed' every tick that found a sick host. That's the wrong UX — a detector finding a fault is working correctly, not crashing. The alert is the signal (via WARNING log + alerts.jsonl); the unit's success state should mean "the detector itself ran cleanly." Test added. Caught while live-deploying on the Pi: the first run found elliott-thinkpad fatal-only at 943×4xx + 1425×5xx and correctly emitted the alert — but systemd showed the unit red, which would have caused operators to chase the wrong tail. Side note: the same first run also caught a real bug — pycache for receiver.store on /opt/cis490 was stale after I deployed the new app.py + store.py from main, causing 1464 × 500 responses. Cleared the pycache and the index immediately resumed growing (4465 → 4515 in 30 seconds). The detector earned its keep on the very first cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:51:20 -05:00
max	49eba2fd60	fleet-health: proactive alerts on the Pi + per-host doctor reports Two pieces of self-monitoring so the maintainer isn't the alarm: (2) Receiver-side fleet health monitor cis490-fleet-health.timer runs check_fleet_health.py every 5 min. Detects three symptoms and writes them to /var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy to forward to a notifier): silent — host shipped in last 24h but has been quiet >30 min fatal-only — actively shipping but every PUT 4xx unstamped — shipping without X-Cis490-Code-Commit header Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault fires once per hour, not every 5 min. 15 unit tests cover the index parser, three detectors, and dedup. (3) Per-host doctor snapshots Lab hosts run cis490-doctor-check.timer once a day (10 min after boot, then daily with 30-min jitter). The timer runs cis490_doctor.py --json and PUTs the result to a new endpoint: PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json GET /v1/host-health → aggregate across all hosts Endpoint is NOT gated by version_gate — sick hosts running stale code MUST still be able to report sickness. 11 unit tests cover PUT/GET, atomic-write semantics, bearer auth, and the not-gated-by-version-gate property. ship_health_check.py reuses the existing shipper transport (mTLS + bearer + receiver URL from lab-host.toml) so we don't reimplement auth. Both timers wired into install-lab-host.sh — the loop also enables the previously-added autoupdate + cert-fetch timers, so a single install run gives a host all four self-healing mechanisms. Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2 pre-existing test_fleet.py failures from the elliott-ThinkPad merge (`667f042`) are unrelated to this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:48:31 -05:00
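The alert dedup rule in commit 49eba2fd60, keyed on (host, symptom, hour-bucket) so that a sustained fault fires once per hour instead of every 5 minutes, can be sketched as below. `AlertDeduper` and `dedup_key` are hypothetical names, and the in-memory set stands in for whatever persistence check_fleet_health.py actually uses.

```python
from datetime import datetime, timezone


def dedup_key(host: str, symptom: str, when: datetime) -> tuple:
    """Build the (host, symptom, hour-bucket) key; `when` must be timezone-aware."""
    bucket = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return (host, symptom, bucket)


class AlertDeduper:
    """Remember which keys already fired so repeats inside an hour are dropped."""

    def __init__(self):
        self._seen = set()

    def should_fire(self, host: str, symptom: str, when: datetime) -> bool:
        key = dedup_key(host, symptom, when)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With a 5-minute timer, up to twelve checks land in each hour bucket and only the first one for a given host and symptom produces an alert.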
Elliott Kolden	1bac2f0135	run_tier3_demo: replace serial probe with min-wait + TCP probe The serial console approach failed: Metasploitable2's kernel is not configured with console=ttyS0, so only GRUB output reaches the QEMU serial socket; the OS boot and login prompt never appear there. New approach: 1. Sleep _METASPLOITABLE2_MIN_BOOT_S (65 s) after QEMU writes its pidfile. By this point the guest kernel and init are always up. 2. Call _wait_for_tcp with a 3 s recv timeout. Post-floor, SLIRP has forwarded the connection to the guest TCP stack, so: - socket.timeout → service listening, waiting for client data ✓ - OSError/RST → port still closed (service not ready); retry ✓ Eliminates the early-boot false-positive that caused exploits to fire ~60 s before Samba was actually listening. Also update TIER3-BRINGUP.md bug 6 to reflect the correct final fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:38:22 -06:00
max	3180f7b5ac	lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry k-gamingcom symptom (2026-05-02): the on-device agent successfully finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS material" because the cert auto-fetch step in install-lab-host.sh either ran with host_id still REPLACE_ME, or hit a transient bootstrap.wg failure, and there's no automatic retry. The Pi-side cert IS minted and the bootstrap endpoint serves it — the failure mode is purely "lab-host hasn't pulled it down." Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh (idempotent, no-op when certs are already on disk, no-op when host_id is unset, exit-0 on transient network failure so the unit doesn't get pinned as failed), and run it from a 5-minute systemd timer. The timer handles all three "stuck waiting on mTLS" cases without operator action: - operator edited host_id post-install but didn't re-run install - bootstrap.wg was briefly unreachable during install - lab host was offline when install ran but came up later The script `try-restart`s cis490-shipper after a successful fetch so the daemon picks up the new cert immediately instead of waiting for its lazy retry. install-lab-host.sh still calls the script on install for fast first-time bring-up — the timer is the safety net. Tarball extract is staged through a temp dir + atomic rename so a mid-extract crash never leaves us with a mismatched cert/key pair. AGENTS.md row 4 updated: "waiting on mTLS material" remediation now points at the timer, with the exact `systemctl start cis490-cert-fetch.service` command to force an immediate retry. Tests: 267/267 unchanged. The fetch script is idempotent + has all its happy/error paths handled inline; a unit test would mostly be testing systemd's behaviour. The integration test path is the timer running on a real lab host, which is the actual production case. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:30:16 -05:00
Elliott Kolden	667f042707	Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01) Root causes and fixes documented in TIER3-BRINGUP.md. Summary: 1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot. 2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring modules selected on SLIRP runs; fix: always filter requires_bridge. 3. cmd/unix/interact creates no session.list entry → session_open_timeout every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl. 4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444); fix: extra_host_port:extra_host_port mapping so guest binds the per-slot LPORT directly. 5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots; fix: requires_bridge=true filters it from SLIRP fleet runs. 6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba boots (~60 s too early); fix: replace TCP probe with serial console _wait_for_serial_login that waits for actual "login:" prompt. 7. Stale QEMU survives orchestrator restart (start_new_session=True) → holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from old pidfile before rmtree. 8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100. 9. msfrpcd 6.x returns bytes for all string values even with raw=False; fix: MSFRpcClient._str() recursive decoder applied to all responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:26:19 -06:00
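Bug 9 in commit 667f042707 describes a recursive decoder applied to every msfrpcd response. A minimal sketch of that idea follows; `decode_strings` is a hypothetical free function, not the actual `MSFRpcClient._str()` method, and the UTF-8 choice with replacement of undecodable bytes is an assumption.

```python
def decode_strings(value):
    """Recursively convert bytes to str inside dicts, lists, and tuples."""
    if isinstance(value, bytes):
        # Assumed encoding; undecodable bytes are replaced, not raised.
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        # Keys may be bytes too, and may also be integers (session IDs).
        return {decode_strings(k): decode_strings(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [decode_strings(item) for item in value]
    return value


# Made-up response shape, for illustration only.
response = {b"result": b"success", b"sessions": {1: {b"type": b"shell"}}}
print(decode_strings(response))
```

Non-string values such as integer keys pass through unchanged, and tuples come back as lists.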
Elliott Kolden	86bd9e21d7	Merge remote-tracking branch 'origin/main' into Dev_REL1_043026	2026-05-02 12:24:08 -06:00
max	cd67624eef	receiver: 4xx remediation points at FIXYOURSELF.md The shipper on a stuck lab host logs the receiver's response body verbatim as ERROR (queue.py:_log_412). That's the ONLY inbound channel from this Pi to a lab host without ssh — every PUT the shipper makes pulls down a fresh remediation message. Update the 400 (missing-commit) and 412 (not-in-window) bodies to explicitly call out FIXYOURSELF.md and the diverged-HEAD case (§B), not just "pull and reinstall" — because if the host is on a local commit that's not on origin/main, plain `git pull --ff-only` fails and the agent needs to know about §B's three resolutions. elliott-thinkpad has been hitting the receiver ~1/sec for 19 hours; it'll receive this updated body on its very next PUT. The on-device agent (or whoever is reading the journal) sees the path forward without the maintainer having to push through any other channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:55:36 -05:00
max	d1e1b1132d	FIXYOURSELF.md: explicit decision tree for stuck lab hosts The auto-update timer (`98dcd4f`) covers the routine case of a host falling behind origin/main. It deliberately refuses to fast-forward when local HEAD isn't an ancestor of origin/main — the right call for safety, but it leaves on-device agents with no automatic path out when they (or an operator) made a local commit. That's exactly the elliott-thinkpad incident: ~31,738 episodes shipped over 19 hours, all stamped with local commit `5568d77` that isn't on origin/main, all 412'd. Auto-update can't fix it; the on-device agent had no doc telling it what to do. FIXYOURSELF.md is that doc. Pure decision tree, six branches (behind / diverged / no-network / no-git / dirty-tree / clean) each with verbatim commands and the order to try them. The diverged-HEAD branch (§B) is the elliott-thinkpad case and offers three resolutions (push, reset, file-issue-and-wait) so an agent that doesn't have push permission isn't backed into discarding work. Linked from the AGENTS.md top-of-file symptom table so a smaller model finds it without having to know the filename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:53:16 -05:00
max	98dcd4f9f8	lab-host: cis490-autoupdate.timer for self-healing on push Today's incident: post-cutover, k-gamingcom went silent and elliott-thinkpad kept shipping pre-stamp episodes that the receiver gate 400'd in a 2300+ PUT loop. Both required `git pull && install-lab-host.sh` on the host; neither the on-device AI agent nor the operator pulled in time, and from the receiver Pi I cannot reach in (sshd off on the lab hosts). Fix the recurrence directly: a 30-min systemd timer that does git fetch + (if behind) ff-only pull + re-run install-lab-host.sh. Hosts catch up on the next tick on their own, with no human or agent action required. Mechanics: - scripts/auto-update.sh runs as root, drops to cis490 for git ops to satisfy /opt/cis490 ownership ("dubious ownership" guard). - Refuses ff if local HEAD isn't an ancestor of origin/main, which protects operator hand-edits from silent overwrite. - Network failures exit 0 (offline is normal, don't pin a unit failure); divergence + install failures exit non-zero so the journal records what broke. - RandomizedDelaySec=10min on the timer prevents thundering-herd when several hosts boot together. - Hands off to install-lab-host.sh via exec: exactly one path through bring-up; no special "auto" flow. The version-gate provides the quality boundary, so even if origin/main moves forward unsafely, the receiver's allow-list still controls what lands in the index. install-lab-host.sh enables cis490-autoupdate.timer on every run, idempotently; existing hosts pick it up the next time they pull manually. Filed Forgejo #18 with the canonical command for elliott-thinkpad + k-gamingcom to bootstrap themselves out of the current incident (auto-update doesn't help them retroactively; it has to be running before the cutover to catch the next one). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:59:31 -05:00
max	20ff76c1e0	AGENTS.md: prescriptive symptom→command table for on-device agents Smaller models running on lab hosts read AGENTS.md top-to-bottom and need explicit if-this-then-that. Restructure to put a decision-tree table at the very top mapping every realistic symptom to the exact command to run (verbatim — no paraphrasing instruction). Adds an unambiguous HARD RULES list. Also fixes accumulated drift: - Tier-4 section had two contradictory descriptions (theZoo flow + legacy MalwareBazaar flow). Removed the MalwareBazaar paragraphs; the table's MALWAREBAZAAR_API_KEY env var is gone (theZoo needs no auth). The "DO NOT push API key" bullet was about a flow that no longer exists. - Canonical bring-up step 6 said the Metasploitable2 download was "registration-walled" requiring an operator-supplied URL+sha256. Not true since the SourceForge mirror + TOFU pinning fix — install-lab-host.sh handles it. Removed the manual step entirely and noted Tier-3+4 are part of step 1. - The "Three install bugs in 95ac56a" historical table was churn that doesn't help current agents. Replaced with a generic "outdated-clone? pull main and re-run install-lab-host.sh" block that explicitly enumerates what the install script does (VERSION stamp, queue drain, daemon-reload+restart, watchdog). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:19:53 -05:00
max	f9b2e5c4e6	shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors Three robustness items off the future-work list: 1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The daemon sends READY=1 after queue construction and WATCHDOG=1 once per scan pass via a heartbeat callback wired into run_forever. Restart=on-failure only catches process death; silent stalls (deadlock, hung tar subprocess, blocked I/O past timeout) used to leave a zombie running with the data backlog growing. Now systemd kills + restarts the daemon if no WATCHDOG=1 arrives within 180s. Verified end-to-end against systemd via `systemd-run --transient --property=Type=notify --property=WatchdogSec=10`: unit transitions to active on READY=1; SIGSTOP'ing the process triggers `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at exactly t+10s, then unit goes failed → restart cycle. 2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew forever as fatal episodes piled up. New ShipperConfig fields: quarantine_keep_days = 30 (opt-out: 0 disables) and quarantine_cleanup_interval_s = 3600 (gate so the 5s tick doesn't statx() the whole tree). Cleanup runs at the start of run_once() but is gated to once per hour. Removed entries are logged. 3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper journal and surfaces 412/400/transient patterns as red/yellow rows with the canonical fix command. An on-device agent running cis490_doctor.py now sees one line ("12 ship(s) rejected as out-of-window") instead of needing to grep the journal. Tests: 200/200 (was 188). New coverage: heartbeat callback fires + survives exceptions; quarantine cleanup respects keep_days, gate, and opt-out; doctor parser correctly classifies 412/400/transient/clean/empty/journalctl-denied; both error classes prioritise 412 (more actionable) when present together. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:02:59 -05:00
max	ed5e6b0581	docs+doctor: surface VERSION-stamp + fallback wiring receiver.toml.example: the local_repo_path comment was wrong about when it kicks in. With the new fallback path, it's used both when forgejo_url is unset (sole backend) AND when forgejo is unreachable (failover). Document that, plus the auto-detect of /opt/cis490/.git. cis490_doctor: add a VERSION-stamp check for lab-host role. If /opt/cis490/VERSION is missing or malformed, the orchestrator stamps "unknown" → receiver gate rejects every PUT → quarantine. Surface this as a red row with the canonical fix (re-run install-lab-host.sh) so an on-device agent doesn't have to grep journal logs to figure it out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:54:36 -05:00
max	5cebe7096a	robustness: gate falls back to local git, queue sweeps stale tarballs Two follow-ups from the post-cutover diagnosis: 1. version_gate: forgejo → local git fallback. If forgejo refresh returns empty AND a local repo path is configured, retry against `git log` from the local checkout. The receiver service runs on the same Pi as forgejo, so a simultaneous restart used to leave the gate's cache empty and reject every PUT with not-in-window. Auto-detects /opt/cis490/.git when the operator hasn't set local_repo_path explicitly — that path is always present on a production receiver and ProtectSystem=strict still allows reads. Logs `source=git-fallback` so this isn't silent. 2. shipper/queue: sweep orphaned outbox tarballs. The lifecycle invariant is `outbox/<id>.tar.zst exists ⇒ episodes/<id>/ exists` — broken historically by the now-fixed fatal-loop, by operator `rm` of an episode dir, or by an OS crash between rename(2) and the post-ship cleanup. Without sweeping, dead bytes pile up forever. New _sweep_outbox runs at the start of every scan, bounded by the file count in outbox/. Tests cover: fallback fires when forgejo unreachable + repo_path set; no fallback when repo_path None (opt-in); orphan tarball + partial get swept on the next pass; live tarballs untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:49:38 -05:00
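The outbox invariant in item 2 above (`outbox/<id>.tar.zst exists ⇒ episodes/<id>/ exists`) can be sketched as a small sweep. This is an illustrative sketch only, not the real `_sweep_outbox`: the function name `sweep_outbox`, its signature, and the demo directory layout are assumptions, and handling of partial files is left out.

```python
import tempfile
from pathlib import Path

SUFFIX = ".tar.zst"


def sweep_outbox(outbox, episodes):
    """Delete outbox tarballs whose episode directory no longer exists.

    Returns the episode IDs that were swept. Work is bounded by the
    number of files in the outbox directory.
    """
    removed = []
    for tarball in sorted(Path(outbox).glob("*" + SUFFIX)):
        episode_id = tarball.name[: -len(SUFFIX)]
        if not (Path(episodes) / episode_id).is_dir():
            tarball.unlink()  # orphan: nothing left to ship it for
            removed.append(episode_id)
    return removed


# Demo on a throwaway tree: one live episode, one orphaned tarball.
with tempfile.TemporaryDirectory() as tmp:
    outbox = Path(tmp, "outbox")
    episodes = Path(tmp, "episodes")
    outbox.mkdir()
    episodes.mkdir()
    (episodes / "ep-live").mkdir()
    (outbox / "ep-live.tar.zst").write_bytes(b"x")
    (outbox / "ep-orphan.tar.zst").write_bytes(b"x")
    swept = sweep_outbox(outbox, episodes)
    remaining = sorted(p.name for p in outbox.iterdir())
```

Running the sweep at the start of each scan keeps the check cheap, because it only lists the outbox rather than walking the episode tree.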
max	f294e97875	AGENTS.md: how to recover from 400/412 commit-rejected loops Smaller models running as on-device agents need a direct, prescriptive remediation block for the gate-failure modes — the receiver's response body is good but only visible if the agent reads journalctl carefully. Document the exact sequence (git pull → install-lab-host.sh) and what the install script now does on its own (drain pre-stamp queue, restart services). Also calls out the two anti-patterns we don't want agents trying: silencing the shipper to stop log noise, or fabricating a code_version field to bypass the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:46:04 -05:00
max	eda6164897	fix: lab-host install loop after commit-gate cutover Why services weren't starting after the gate went live: 1. install-lab-host.sh self-copy. The receiver's 400 remediation tells the agent to `cd /opt/cis490 && git pull && sudo ./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same file"; `set -e` aborts before the systemd units install or anything restarts. Detect the same-dir case and skip the cp; chown still runs. 2. Services never restart. install-lab-host.sh and install-tier-3-4.sh both ended by telling the operator to restart, then exiting. The running shipper/orchestrator kept executing pre-gate code from the old module objects, so new `code_version` stamping never reached an episode. Both scripts now `systemctl restart` the units they own when those units are enabled. 3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't move the episode out of `data/episodes/`. Next scan re-tarred and re-PUT the same dir, getting 400 again. With 4465+ pre-stamp episodes on k-gamingcom this burned ~1 PUT/sec for 5+ hours of receiver log. Fatal episodes now move to data/quarantine/<id>/ with a quarantine_reason.json beside them; the outbox tarball is deleted. 4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a one-shot that scans data/episodes/ and quarantines anything without a 40-char-hex code_version.commit. Wired into install-lab-host.sh step 9 so a re-install drains the queue automatically. Idempotent; safe to run while the shipper is active. Tests cover the queue's new fatal-quarantine path and every drain behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:36:21 -05:00
Elliott Kolden	5568d77df8	Merge remote-tracking branch 'origin/main' into Dev_REL1_043026	2026-05-01 07:51:34 -06:00
max	e2bb76144f	tools/verify_tier3_local.py: Pi-runnable Tier-3 verifier Closes the "have you tested it" gap as much as we can without x86 KVM. The Pi is ARM64 — can't boot Metasploitable2 or run KVM-accelerated guests. But most of the Tier-3 chain doesn't need x86: * chunked_real_binary_upload is just shell commands over a pipe * exploit module TOMLs and the deterministic selector are pure Python * manifest loading + sample selection are pure Python * msfrpcd itself runs on ARM (Ruby + Java) * the receiver's commit gate is the same on any arch verify_tier3_local.py exercises each of those end-to-end, in process, on this Pi: PASS exploits/modules/.toml parse + selector deterministic PASS manifest loads + selector covers every sample PASS chunked binary upload survives a real /bin/sh round-trip (150 KB binary, 26 chunks, sha256-verified end to end) PASS staged samples are Linux i386 ELF (when staged) PASS msfrpcd round-trips core.version (when listening) PASS receiver /v1/health + gate enforces commit allow-list Live result on this Pi today: 5 PASS, 1 SKIP (msfrpcd not installed on the Pi, which is correct — the Pi is the receiver, not a lab host). When run on a lab host after install-tier-3-4.sh, all 6 PASS gives full Tier-3 readiness. What this script does NOT verify (still needs x86 KVM on a lab host, covered by install-tier-3-4.sh's verify step): Metasploitable2 boots under QEMU/KVM * vsftpd_234_backdoor lands a session against it * the chunked-upload binary actually executes inside that session But the chunked-upload step proves every byte of the upload path (printf '%s', heredoc-free path, base64 decode, sha256 verify, chmod, exec scaffold) works against a real POSIX shell. An msfrpc session presents the same shell interface, so a passing local-sh test is strong evidence the production path will work. tests/test_tier3_local_verify.py wraps the deterministic steps (module parse, manifest, chunked upload) so pytest catches regressions automatically. 
174/174 total. Operator workflow: ssh into Pi (or lab host), run: /opt/cis490/.venv/bin/python tools/verify_tier3_local.py Each step prints PASS/FAIL/SKIP with detail. Exit 1 if any FAIL.	2026-05-01 03:41:21 -05:00
max	b809e1e26e	auto_fetch_samples: pick Linux i386 ELF; manifest matches theZoo User caught it: I shipped the theZoo path without running it end-to-end. A real fetch on the Pi exposed two bugs: 1. Family-name matcher was substring-strict. "Cryptolocker-class" wouldn't match the dir "CryptoLocker_22Jan2014" because "-class" isn't in the dir name. Now expands to a sequence of tokens (full, head-of-dash, head-of-dot, head-of-underscore) and tries each. First match wins. 2. Extraction picker was "largest non-text" — a bad heuristic for theZoo, where each Linux.* zip often contains MULTIPLE binaries for different platforms (Linux i386, x86-64, ARM, FreeBSD, sometimes even Windows PE). The largest is rarely the i386 Linux ELF that would actually run on Metasploitable2. Now sniffs ELF magic bytes in stdlib and tiers: 1. Linux i386 ELF (largest first) 2. any other ELF (best-effort, may not execute) 3. largest non-text (Wine fallback) Verified end-to-end on the Pi against a real theZoo clone (~500 MB, 263 family dirs, 2026-05-01 fresh pull): linux-encoder-ransomware → ELF 32-bit Intel i386 SYSV (278 KB) linux-wirenet-rat → ELF 32-bit Intel i386 SYSV (64 KB) linux-rex-ransomware → ELF 32-bit Intel i386 SYSV Go (7.6 MB) linux-neurevt-bot → ELF 32-bit Intel i386 SYSV (3.0 MB) linux-earthkrahang-apt → ELF 32-bit Intel i386 GNU/Linux (5.8 MB) 5/5 picks are runnable Linux i386 ELFs. Manifest rewrites in place add source/sha256/url; meta.sample.kind goes to "real" automatically. Manifest rewritten: - Old families (XMRig, Mirai, Cryptolocker-class, Dridex, Kovter, Reverse-Shell) → mostly absent from theZoo's Linux catalog or matched the wrong arch. - New families chosen against a verified theZoo presence list: Linux.Encoder, Linux.Wirenet, Ransomware.Rex, Neurevt, EarthKrahang. - XMRig + Kovter remain as mimic-only fallbacks (theZoo lacks a runnable Linux i386 binary for these; orchestrator falls back to the mimic profile). 
Tests added (tests/test_auto_fetch_samples.py): 13 cases covering ELF magic detection (i386 accepted, FreeBSD/x86-64/ARM/PE32/text all rejected), family-token expansion (the "-class" suffix bug), extraction picker (prefers Linux i386 over larger non-Linux ELFs), manifest in-place rewrite preserves mode + skips entries that already have sha256. What's still NOT verified end-to-end (requires a lab host with KVM x86): - Metasploitable2 boot under QEMU - vsftpd_234_backdoor exploit fire via msfrpcd - chunked binary upload through a real shell session - real binary executing inside a Metasploitable2 guest The Pi is ARM64 — can't run Metasploitable2. install-tier-3-4.sh's verify step (run_tier3_demo.py) covers all four on a real lab host; deploy verifies on first run there. 171/171 tests pass.	2026-05-01 03:28:26 -05:00
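The ELF sniffing and tiered picker described in this commit can be sketched with the standard library alone. The following is a hedged illustration, not the code in tools/auto_fetch_samples.py: the names `is_linux_i386_elf`, `pick_binary`, and `fake_header` are hypothetical, accepting both the SYSV and Linux OS/ABI values is an assumption based on the "SYSV" and "GNU/Linux" results listed above, and the third tier skips the text detection the commit mentions.

```python
import struct

EM_386 = 3                      # e_machine value for Intel 80386
ELFOSABI_SYSV, ELFOSABI_LINUX = 0, 3


def is_linux_i386_elf(data: bytes) -> bool:
    """True if data starts with a 32-bit little-endian i386 ELF header."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        return False
    ei_class, ei_data, ei_osabi = data[4], data[5], data[7]
    if ei_class != 1 or ei_data != 1:          # ELFCLASS32, little-endian
        return False
    if ei_osabi not in (ELFOSABI_SYSV, ELFOSABI_LINUX):
        return False                           # e.g. FreeBSD is 9
    (e_machine,) = struct.unpack_from("<H", data, 18)
    return e_machine == EM_386


def pick_binary(candidates):
    """candidates: list of (name, data). Lowest tier wins, largest first."""
    def rank(item):
        _, data = item
        if is_linux_i386_elf(data):
            tier = 0
        elif data[:4] == b"\x7fELF":
            tier = 1                           # other ELF, best effort
        else:
            tier = 2                           # simplified: any non-ELF
        return (tier, -len(data))
    return min(candidates, key=rank)[0]


def fake_header(ei_class, osabi, machine):
    """Build a minimal 20-byte ELF header prefix for the demo."""
    ident = b"\x7fELF" + bytes([ei_class, 1, 1, osabi]) + bytes(8)
    return ident + struct.pack("<HH", 2, machine)  # e_type, e_machine


chosen = pick_binary([
    ("big_x86_64", fake_header(2, 0, 62) + bytes(1000)),
    ("small_i386", fake_header(1, 0, EM_386) + bytes(10)),
])
```

Ranking by `(tier, -size)` is what lets a small i386 binary beat a larger binary for another platform, which is the failure the "largest non-text" heuristic had.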
max	cc0c96953e	version_gate: Forgejo as canonical commit source (no fs perms needed) Initial git-log-based gate ran into a permission wall: the cis490 service user can't read /home/max/cis490/.git (ProtectHome=true + home-dir mode). Switching the production source to the local Forgejo HTTP API (already accessible to all WG peers, single source of truth both lab hosts and the receiver pull from). When the maintainer pushes new code to spectral/CIS490, the next 5-second cache refresh sees the new commit and lab hosts can immediately ship under it. VersionGate now takes either: - forgejo_url + repo_owner + repo_name + branch (+ optional auth_token for private repos): hits /api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n> - repo_path: dev-only fallback, runs `git log` locally Local-git path retained for tests + the dev-only case. receiver.toml.example gains forgejo_url/repo_owner/repo_name/branch with auth_token commented; live-deployed receiver.toml on the Pi has the spectral org + token. Live state on the Pi: 41 valid hashes loaded, head=f8ad02b. Verified end-to-end: bogus commit → 412 + remediation HEAD commit → clears gate (fails downstream at sha-mismatch as expected for the empty-body verify probe) Test added: test_forgejo_backend_accepts_returned_commits stands up a tiny canned-response HTTPServer in-process, exercises the parser without depending on a live Forgejo instance. Brings test_version_gate to 10 cases; total 158/158.	2026-05-01 01:42:45 -05:00
max	f8ad02b2d7	Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed) Stops out-of-date lab hosts from polluting the dataset with episodes generated by buggy code. The valid-commits set mirrors the maintainer's working clone on the Pi automatically — when the maintainer pulls or pushes a new commit, the receiver picks it up within the 5-second cache TTL with no service restart. Receiver changes: - receiver/version_gate.py (new): VersionGate(repo_path, window). Each check() consults a frozenset of the last `window` commit hashes from `git -C <repo> log --format=%H -n <window>`, refreshed every 5s under a lock. Resilient to transient git failure (keeps prior cache so a flaky `git` doesn't lock out every shipper). - receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check() before ingest. Rejects with: 400 + remediation if header missing or malformed 412 + remediation + your_commit + head_commit if not in window Remediation block is verbatim copy-pasteable into the lab-host shell: cd /opt/cis490 && sudo -u cis490 git pull origin main sudo /opt/cis490/scripts/install-lab-host.sh sudo systemctl restart cis490-orchestrator - receiver/store.py: ingest_stream takes commit kwarg, stamps it on the index.jsonl row (new optional field). Backfilled rows from index_backfill.py also pull commit out of meta.json. - receiver/config.py + etc/receiver.toml.example: new [version_gate] section. enabled=true, repo_path=/home/max/cis490, window=100 by default. Enabled toggle exists for emergency disable-and-collect. Shipper changes: - shipper/transport.py: ship_tarball() takes commit kwarg, sends X-Cis490-Code-Commit header. 412 maps to status='fatal' so the queue doesn't infinite-retry — operator must pull and reinstall before the next ship will succeed. - shipper/queue.py: reads meta.json::code_version.commit per episode, passes through. On 412, logs the receiver's full remediation block at ERROR level so journalctl on the lab host shows exactly what to run. 
Tests: 9 in test_version_gate (including 2 end-to-end via starlette.testclient), 2 cover the boundary where new commits land mid-cache and where missing-repo gracefully keeps prior cache. 157/157 total. Index schema: existing rows stay valid (commit field is optional on read). New rows from receiver-direct AND from index_backfill.py include commit.	2026-05-01 01:38:50 -05:00
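The gate behaviour described in this commit (a frozenset of recent commits, refreshed on a 5-second TTL under a lock, with the prior cache kept on fetch failure) might look roughly like the sketch below. It is not receiver/version_gate.py: the class name `CommitWindowGate`, the injected `fetch_commits` callable, the fake clock, and the 40-hex header check in `status_for` are assumptions for illustration.

```python
import re
import threading
import time

_HEX40 = re.compile(r"[0-9a-f]{40}")


class CommitWindowGate:
    def __init__(self, fetch_commits, ttl_s=5.0, clock=time.monotonic):
        self._fetch = fetch_commits      # returns an iterable of hashes
        self._ttl = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._valid = frozenset()
        self._refreshed_at = None

    def check(self, commit):
        with self._lock:
            now = self._clock()
            stale = (self._refreshed_at is None
                     or now - self._refreshed_at >= self._ttl)
            if stale:
                try:
                    self._valid = frozenset(self._fetch())
                except Exception:
                    pass                 # transient failure: keep prior cache
                self._refreshed_at = now
            return commit in self._valid


def status_for(gate, header):
    """Return 400 (missing/malformed), 412 (not in window) or None (pass)."""
    if header is None or not _HEX40.fullmatch(header):
        return 400
    return None if gate.check(header) else 412


# Demo with a fake clock and a fetcher that can be made to fail.
state = {"now": 0.0, "commits": ["a" * 40], "fail": False}


def fake_fetch():
    if state["fail"]:
        raise RuntimeError("git unavailable")
    return list(state["commits"])


gate = CommitWindowGate(fake_fetch, ttl_s=5.0, clock=lambda: state["now"])
r1 = gate.check("a" * 40)                # first call populates the cache
state["commits"].append("b" * 40)
r2 = gate.check("b" * 40)                # cache still fresh, new commit unseen
state["now"] = 6.0
r3 = gate.check("b" * 40)                # TTL elapsed, refresh sees it
state["fail"], state["now"] = True, 12.0
r4 = gate.check("a" * 40)                # fetch fails, prior cache kept
```

Injecting the fetcher and the clock keeps the cache logic testable without a real repository or real sleeps.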
max	5c0bc9af8e	meta.json: stamp code_version (commit, branch, dirty) per episode Closes a real reproducibility gap. Three weeks of bug fixes have shipped (probe fix in `2707709`, multi-signal classifier in `321ea63`, mandatory tier-4 in `265f3ad`, etc.); without a per-episode code_version, trainers can't tell which episodes came from buggy pre-fix code and have to scan every tarball to guess. Resolution priority (cached across episodes): 1. $INSTALL_ROOT/VERSION (production: install-lab-host.sh writes it at install time since /opt/cis490 is a flat copy with no .git) 2. git rev-parse HEAD from the repo root (dev clones) 3. {"commit": "unknown", "source": "unknown"} so the field is always present (filterable) Output shape, always present in meta.json: "code_version": { "commit": "<40-hex>" | "unknown", "branch": "<name>" | null, "dirty": bool | null, "source": "VERSION-file" | "git" | "unknown" } install-lab-host.sh writes VERSION at install time with the source repo's git rev-parse HEAD + branch + clean-tree flag + install timestamp. Lab-host agents that pull main + re-run install-lab-host.sh get a fresh stamp automatically. 148/148 tests pass; test_episode_against_self_pid_produces_full_directory asserts the field's presence + valid `source` value.	2026-05-01 01:29:01 -05:00
max	265f3ad313	Tier-4 sample source: theZoo (no auth, no operator action) Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo). theZoo is a public security-research repo with hundreds of malware samples organized by family, password-protected with the well-known 'infected'. No API key, no signup, nothing for an operator to do — which is what zero-touch tier-4 actually means. Changes: - tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB) to /var/lib/cis490/theZoo on first run, then for each manifest family without a sha256 it locates a matching Binaries/<Name> dir, extracts the .zip with password 'infected', picks the largest non-text payload as the binary, sha256s it, stages at samples/store/<sha256>, and rewrites manifest.toml in place (atomic tempfile + os.replace, stat preserved). Mandatory exit semantic: non-zero if no real samples landed. - scripts/install-tier-3-4.sh: dropped the MB-key resolution chain (env var → local file → bootstrap.wg fetch). Now just runs auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4 remains as the explicit override but is documented as defeating the project. - bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service: removed the /v1/secret/<name> endpoint and the --secrets-root flag. Dead code now that no API key needs distributing. Live-rolled back on the Pi (404 verified post-restart, stale /etc/cis490/secrets dir removed). - scripts/set-malwarebazaar-key.sh: deleted. No MB key means no one-time operator step. - tests/test_bootstrap_secrets.py: deleted (route removed). - AGENTS.md: rewrote tier-4 section to reflect zero-operator model. 148/148 tests pass. Bootstrap service rolled back live.	2026-05-01 01:17:50 -05:00
max	5d0e8e33a9	Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key User: 'we don't want it to be optional, this real malware IS the data we want.' Acknowledged. Three changes make Tier 4 actually mandatory without forcing per-host operator action: 1. bootstrap.wg /v1/secret/<name> endpoint - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts over the same trust boundary as the cert endpoint (WG mesh, iptmonads-gated). Strict allow-list — only `malwarebazaar` resolves; everything else 404s. Secret returned as bare text with Cache-Control: no-store. Live-verified on the Pi. - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned, 200 with token, 404 unknown name, 500 on empty file. 2. install-tier-3-4.sh: Tier 4 is no longer optional - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token → https://bootstrap.wg/v1/secret/malwarebazaar. - Caches the bootstrap-fetched key locally so re-runs are offline. - If all three resolution paths fail, dies with the exact remediation command for the operator (one-time set-malwarebazaar-key.sh on the Pi). - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still works for emergency overrides but logs a warning that the host will produce only mimics). Deploy fails if zero binaries land in samples/store/ — no silent mimic-only fallback. - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'. 3. scripts/set-malwarebazaar-key.sh - Pi-side helper: one operator command per fleet, ever. Accepts key via env or stdin, validates length, drops at the right path with the right perms. Lab hosts pull the rest automatically. AGENTS.md: rewrote the Tier-4 section to reflect mandatory status + the one-time-on-Pi distribution model. 152/152 tests pass. Bootstrap service updated live on the Pi.	2026-05-01 00:44:41 -05:00
max	683bfe9ce6	Tier 3 + Tier 4 auto-deploy: zero operator interaction Replaces the manual runbook with scripts that just work. install-lab-host.sh now runs the full Tier-3 deploy automatically as its 8th step (after the mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set. Changes: - install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when metasploit-framework isn't present (was: bail with "install manually"). apt-get and dnf paths both go through the same omnibus script with DEBIAN_FRONTEND=noninteractive. Idempotent. - fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL (https://downloads.sourceforge.net/project/metasploitable/...) so no operator URL is required. sha256 is now optional and TOFU-pinned — first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256; subsequent runs verify against that. Skips if qcow2 already present. - scripts/install-tier-3-4.sh (new): orchestrates the four steps (msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4 env knobs for partial deploys. - tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set, queries MB by each manifest entry's `family` (signature match), pulls the first match via fetch_sample.py, and rewrites manifest.toml in place (atomic tempfile + os.replace, preserving stat). Skips entries that already have sha256. - install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh automatically when mTLS certs are on disk. --skip-tier3 flag for operators who want Tier 2 only. Skipped silently before certs land so first-pass install (host_id=REPLACE_ME) still works. - AGENTS.md: rewrote the Tier-3 section to point at the one-shot script. Removed the old multi-command runbook so on-device agents can't accidentally follow stale steps. Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key present) from a single sudo invocation. 
No operator picks for image URLs, no manual metasploit installs, no manual manifest edits.	2026-04-30 23:12:08 -05:00
max	02b9d0a645	Tier 3 + Tier 4 deploy runbook in AGENTS.md Repo has all the code paths for Tier 3 (real exploit fire via msfrpcd) and Tier 4 (real malware execution via chunked upload), but neither lab host has run a single Tier-3 episode because msfrpcd and the Metasploitable2 image aren't deployed there. 3009 episodes in flight to date are all Tier 2 (mimic workloads in clean Alpine), which is useful pipeline-validation data but cannot answer the actual research question. This commit makes the deploy push-button: - AGENTS.md: new "Tier 3 + Tier 4 deploy" section listing the three prereqs (install-msfrpcd.sh, fetch-metasploitable2.sh, setup_bridge.sh), the foreground verify command (run_tier3_demo.py), and the Tier-4 promotion path (MB API key → fetch_sample.py → manifest edit → orchestrator restart). - samples/manifest.toml: clearer per-entry comment showing the 4-step sha256 → real-binary promotion path. Replaces the earlier "TBD" placeholder which suggested a single edit unlocks Tier 4 when in fact you need to fetch the binary too. The fleet runner already auto-detects msfrpcd (orchestrator/fleet.py _msfrpcd_available()); once the lab-host operator-AI lands the prereqs, episodes flip to Tier 3 with no orchestrator config change. Tier 4 follows automatically the next time the deterministic selector picks a sample whose sha256 file exists in samples/store/.	2026-04-30 22:57:23 -05:00
max	321ea63803	Multi-signal prune classifier: rescue valid episodes /proc misses A laptop-class lab host (elliott-thinkpad) running 14 parallel fleet slots can't deliver host /proc CPU% signal for the bursty profiles — the per-VM share gets buried under contention. But the workloads ARE running: qmp blockstats record 90+ MB written during infected_running for io-walk episodes, netflow shows real packet bursts for scan-and-dial, and the in-guest agent (when alive) shows load_1m deltas the host can't see. The classifier now cross-checks four sources before flagging an episode: - /proc CPU% medians (host-side qemu) - netflow byte totals (bridge_pcap) - qmp blockstats per-phase DELTA (cumulative counters; deltas matter, not raw values) - guest-agent load_1m An episode flags only if every available source agrees no inter-phase signal. Missing sources are "unknown", not "flat". Time-base bug also fixed: phase mapping now uses t_wall_ns (which all sources stamp from CLOCK_REALTIME) rather than t_mono_ns — netflow uses qemu boot-monotonic, /proc uses orchestrator-relative, they don't share a number line. Result on the live receiver: - 1067 active episodes, 100% kept under the new logic - 143 episodes rescued from a previous false-positive archive - Only the 9 genuinely-broken pre-Sample-propagation elliott-lab episodes remain archived (no-sample + no-workload-events) Two new tests (test_flat_proc_rescued_by_netflow, test_flat_everywhere_still_flags) pin the boundary so a future regression surfaces immediately. AGENTS.md gains a "classifier is multi-source" section explaining the cross-check and the t_wall_ns invariant.	2026-04-30 19:10:01 -05:00
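The cross-check rule in this commit (flag only when every available source agrees there is no inter-phase signal, and treat missing sources as unknown rather than flat) reduces to a few lines. This is a sketch under assumptions, not the prune classifier itself: the function name `should_flag`, the source keys, and the choice not to flag when no source is available at all are illustrative.

```python
from typing import Mapping, Optional


def should_flag(signals: Mapping[str, Optional[bool]]) -> bool:
    """Decide whether an episode looks flat across all its telemetry.

    signals maps a source name to True (inter-phase signal seen),
    False (flat), or None (source missing, i.e. unknown).
    """
    available = [seen for seen in signals.values() if seen is not None]
    if not available:
        # Assumption: with no evidence either way, this check stays silent.
        return False
    return not any(available)


# Flat host /proc CPU, but netflow shows real bursts: episode is kept.
rescued = should_flag({"proc_cpu": False, "netflow": True,
                       "qmp_blockstats": None, "guest_load_1m": None})

# Every available source is flat: episode is flagged.
flagged = should_flag({"proc_cpu": False, "netflow": False,
                       "qmp_blockstats": False, "guest_load_1m": None})
```

Using `None` for a missing source is what stops an absent guest agent from counting as evidence that nothing ran.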
Elliott Kolden	3d4936a227	Merge remote-tracking branch 'origin/main' into Dev_REL1_043026	2026-04-30 16:34:01 -06:00
max	2707709299	Fix workload-silent false-positive on Alpine busybox guests (closes #15) On-device agent (k-gamingcom) ran the diagnostic probe sequence and proved the workload IS running on Alpine (yes saturating the vCPU, loadavg=1.05, three yes PIDs visible), but two busybox incompatibilities made every episode look silent: 1. _probe() used `pgrep -c yes`. The -c flag is procps-ng/util-linux, not busybox. busybox pgrep exits 1 with a usage banner; the `|| echo 0` fallback then reported yes=0 every time. Switched to `pgrep yes | wc -l` which both pgrep variants support. 2. _wrap_loop appended `disown` after the nohup-backgrounded script. busybox sh / ash have no disown builtin, so each infected_running phase printed `sh: disown: not found` into run()'s captured output. The script kept running (nohup gives SIGHUP immunity, which is what disown was for), but the spurious error is now gone. Cross-validation in the classifier: - prune_episodes.py: workload-silent now requires the probe AND host-side /proc CPU envelope (flat-cpu) to AGREE. A probe-only zero is treated as the busybox false-positive and dropped. This means the 244 already-on-disk episodes from elliott-thinkpad and k-gamingcom are correctly classified without re-collecting. Test coverage: - test_workload_silent_flag updated to require both signals - test_workload_silent_suppressed_when_host_cpu_real new regression for the busybox false-positive AGENTS.md gains a "Don't trust the in-guest probe alone" section with the busybox-vs-procps gotcha + a list of busybox-incompatible patterns to avoid in any new in-guest diagnostic.	2026-04-30 17:28:48 -05:00
elliott	4e8d2bdb04	etc/lab-host.toml.example: pin Caddy root, not wg-pki client CA (closes #14) ca_bundle is what the shipper uses to verify collector.wg's TLS cert. That cert is signed by the Caddy Local Authority, bundled in the repo as etc/caddy-root.crt. Pointing it at wg-ca.pem (the wg-pki CIS490 Lab-Host Client CA, which is the receiver's trust anchor for our client cert) caused CERTIFICATE_VERIFY_FAILED on every ship. Original fix authored by the on-device agent on k-gamingcom in Dev_REL2_043026@786b8da; cherry-picked here onto main. Co-Authored-By: k-gamingcom on-device agent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:26:36 -05:00
Elliott Kolden	b42d073669	Merge remote-tracking branch 'origin/main' into Dev_REL1_043026	2026-04-30 15:48:23 -06:00
max	8d2d0d2e99	prune+receiver: preserve index ownership and add a backfill helper (closes #13) Root cause of #13 (PUT 500s on first ship, retries return already-present): my earlier prune-tool session ran as root and rewrote the live index via os.replace(), which drops the original ownership/mode. The new file was root:root and the cis490 service user couldn't append to it. Every fresh PUT 500'd on _append_index after the tarball had already landed via os.replace, so retries always saw "already-present" and never recovered the missing index row. Two fixes: - tools/prune_episodes.py: snapshot the index's stat before the rename and restore uid/gid/mode after. Best-effort chown so non-root prune runs (where chown would EPERM) still succeed; non-root callers matched the original owner anyway. - tools/index_backfill.py: new tool. Walks episodes/<host>/*.tar.zst, computes sha256+size, and appends rows for episodes missing from the index. Preserves "backfilled: true" so trainers can distinguish reconstructed rows. Always opens the index in append mode (never replaces), so it cannot reproduce the ownership bug it's recovering from. Regression test: tests/test_prune.py::test_archive_preserves_index_mode. Operator note for the live receiver: ran the chown fix manually (chown cis490:cis490 /var/lib/cis490/index.jsonl) and ran the backfill once to recover 140 elliott-thinkpad rows that 500'd before the chown landed.	2026-04-30 16:36:05 -05:00
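The ownership fix described in the commit above (snapshot the index's stat before the rename, restore uid/gid/mode afterwards, with a best-effort chown) might look roughly like the sketch below. The function name and the temp-file prefix are hypothetical, and the actual change in tools/prune_episodes.py may be structured differently.

```python
import os
import stat
import tempfile


def replace_preserving_owner(path: str, new_text: str) -> None:
    """Atomically rewrite `path` while keeping its original owner and mode."""
    before = os.stat(path)  # snapshot BEFORE the rename drops ownership/mode
    directory = os.path.dirname(os.path.abspath(path))

    # Write the new contents next to the target so os.replace stays atomic
    # (same filesystem). mkstemp creates the file as the caller, mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".index-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Best-effort chown: a non-root caller gets EPERM when the original owner
    # differs, and that case is tolerated rather than treated as fatal.
    try:
        os.chown(path, before.st_uid, before.st_gid)
    except PermissionError:
        pass
    # Restore the permission bits after chown, since chown may clear some bits.
    os.chmod(path, stat.S_IMODE(before.st_mode))
```

Opening the file in append mode, as the backfill tool is described as doing, avoids the problem entirely because the inode and its ownership are never replaced.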
max	f6d7d07837	Make mTLS bring-up unmistakable for on-device agents Sysadmin observed lab-host agents still trying to "secure the connection" — minting certs, generating CSRs, or otherwise reinventing a cert-delivery flow that's already automated through bootstrap.wg. Three reinforcements so an agent reading any of the three surfaces (AGENTS.md, install script output, journalctl) gets the same message: - AGENTS.md gains a top-of-file "do not mint your own certs" callout + a dedicated "Securing the connection (mTLS)" section with the one fix (re-run install-lab-host.sh after setting host_id) and an explicit "what NOT to do" list (no openssl, no copy from another host, no verify_tls=false). - install-lab-host.sh's FIRST-INSTALL NEXT STEPS now spells out that the cert auto-fetch is silently skipped while host_id is REPLACE_ME, and that the operator MUST re-run the script after editing host_id. Step 2 is now "RE-RUN THIS SCRIPT" with a DO NOT openssl warning. - The shipper's "waiting on mTLS material" warning now embeds the exact remediation command + a pointer to AGENTS.md, so an agent reading journalctl without ever opening the repo still gets it. Tests: 12/12 in test_shipper still pass; warning string change is not asserted on (only the dataclass error field).	2026-04-30 16:23:44 -05:00
max	c80a36d3ae	AGENTS.md: prescriptive guidance for smaller models on lab hosts Smaller (non-4.7) Claude models act as on-device agents on CIS490 lab hosts and have hit the install gotchas that became issues #10–#12. Their reports describe symptoms well but miss inferred context — so this expands the runbook with explicit "do this, not that" notes: - run tools from /opt/cis490 not a clone (CWD-on-sys.path trap) - shipper "waiting on mTLS material" is expected and self-heals; do not try to fix it manually - table of the three install bugs already closed in main, so a fresh agent can recognize the symptom and pull instead of re-filing - "fix one red row at a time" rather than batching attempts Closes nothing new; this is the followup to #10/#11/#12 promised during their resolution.	2026-04-30 16:19:09 -05:00
Elliott Kolden	7c35bf7d49	Merge commit '86a088c' into Dev_REL1_043026	2026-04-30 15:16:41 -06:00
max	86a088c204	shipper: defer SSL context build until cert/CA paths exist (closes #11) First-boot bring-up enables cis490-shipper before the Pi has issued the mTLS leaf, so ssl.create_default_context(cafile=...) raised FileNotFoundError out of __init__ and systemd crash-looped the unit every RestartSec=5. Now the transport pre-flights the configured ca_bundle / client_cert / client_key paths, raises a recoverable _CertNotReadyError, and ping/ship_tarball retry the build on each request; the daemon self-heals once the cert lands, without a restart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:13:59 -05:00
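The deferred context build from the commit above can be illustrated with a small sketch. The class name `LazyTLS` and the exception `CertNotReadyError` are hypothetical stand-ins (the commit names a private `_CertNotReadyError`); only `ssl.create_default_context` and `SSLContext.load_cert_chain` are real standard-library calls.

```python
import os
import ssl
from typing import Optional


class CertNotReadyError(Exception):
    """Recoverable: mTLS material has not been delivered yet."""


class LazyTLS:
    """Holds the configured paths and builds the SSL context on first use."""

    def __init__(self, ca_bundle: str, client_cert: str, client_key: str) -> None:
        # Nothing touches the filesystem here, so construction cannot raise
        # FileNotFoundError while the cert is still pending.
        self.ca_bundle = ca_bundle
        self.client_cert = client_cert
        self.client_key = client_key
        self._ctx: Optional[ssl.SSLContext] = None

    def context(self) -> ssl.SSLContext:
        """Return the cached context, building it once all paths exist."""
        if self._ctx is not None:
            return self._ctx
        paths = (self.ca_bundle, self.client_cert, self.client_key)
        missing = [p for p in paths if not os.path.isfile(p)]
        if missing:
            # Pre-flight failed: callers catch this and retry on the next request.
            raise CertNotReadyError("waiting on mTLS material: " + ", ".join(missing))
        ctx = ssl.create_default_context(cafile=self.ca_bundle)
        ctx.load_cert_chain(certfile=self.client_cert, keyfile=self.client_key)
        self._ctx = ctx
        return ctx
```

A request path would call `context()` on each attempt and treat `CertNotReadyError` as "skip this round", which is what lets the daemon recover without a restart once the files appear.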
Elliott Kolden	7683b64929	Merge origin/main into Dev_REL1_043026; accept main's service files Cherry-picks all upstream additions (fleet runner, full collector suite, shipper module, exploit driver, samples, scripts/, cis490_doctor, etc.) and resolves the two service-file conflicts by accepting main's production versions over the stubs we wrote on Day 1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 15:05:51 -06:00

80 commits