CIS490

Author	SHA1	Message	Date
Max Gorog	3d4f282e9c	Tier-2 episodes use clean-only schedule; .gitignore VERSION Two correctness fixes that the §4.5 event-driven labeller surfaced: 1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule (clean → armed → infecting → infected_running → ...) for episodes with no exploit firing. Pre-§4.5 those episodes wrote dishonest `infected_running` labels from the schedule clock — exactly the §3 evidence pattern. Post-§4.5 they write `failed` at the infecting transition (the justifying exploit_fire never arrives), which is honest about what happened but useless for training. The honest fix: Tier-2 episodes have a clean-only schedule. All telemetry tagged `clean` because nothing infected anything. The total duration matches the canonical Tier-3 schedule so episode lengths are comparable across tiers — no length-bias in the dataset (§10). Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py derives `[("clean", total_seconds)]` from the canonical schedule. `tier3_schedule_from(schedule)` renders the legacy `[(name, seconds)]` shape EpisodeConfig still expects. Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from. Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from. Drops the hardcoded DEFAULT_SCHEDULE constants from both — the canonical manifest is the single source of truth (§4.1). 2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp writes /opt/cis490/VERSION so episodes can record code provenance without /opt/cis490 carrying a .git directory. But /opt/cis490 IS typically a git checkout on lab hosts (auto-update.sh pulls into it), so writing VERSION leaves the working tree dirty. Every episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance gate's rule 4 would then reject every episode without CIS490_ALLOW_DIRTY=1 set — which would break the data flow. Now VERSION is .gitignored: install-lab-host.sh stamps it, git status doesn't see it, dirty=false, gate rule 4 passes naturally. 
These two changes together keep the data flowing AND honest. Tier-2 episodes pass with `phases=[clean]` + every collector emitting real rows. Tier-3 episodes (none today, empty catalog) walk the full event-driven schedule when a verified module gets re-admitted. 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:55:37 -05:00
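A minimal sketch of the clean-only derivation described in commit 3d4f282e9c. It assumes the canonical schedule is already available as a list of `(name, seconds)` pairs; the real `tier2_schedule_from` in orchestrator/manifest.py may accept a different input type, so treat the signature as illustrative.

```python
from typing import List, Tuple

Schedule = List[Tuple[str, float]]


def tier2_schedule_from(schedule: Schedule) -> Schedule:
    """Collapse a multi-phase schedule into one `clean` phase.

    The total duration is preserved so Tier-2 and Tier-3 episodes stay
    comparable in length. Assumption: `schedule` is a list of
    (phase_name, seconds) pairs.
    """
    total_seconds = sum(seconds for _name, seconds in schedule)
    return [("clean", total_seconds)]


canonical = [("clean", 10), ("armed", 5), ("infecting", 5), ("infected_running", 10)]
tier2 = tier2_schedule_from(canonical)
```

Under these assumptions every Tier-2 row is tagged `clean`, and the episode length still matches the Tier-3 budget.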
Max Gorog	d9f913fc97	PIPELINE §5 step 6: event-driven labeller (§4.5) Phase labels are written ONLY when justifying events arrive. The schedule clock is now a budget — an upper bound — never a label source. This is the core honesty fix the §3 evidence demanded: Before: every Tier-3 episode wrote `infected_running` from the schedule clock regardless of whether session_open ever fired. Per §10 every dishonest label is a poisoned training example. 67/67 of the §3 probe episodes were poisoned this way. After: `infecting` writes ONLY when exploit_fire is observed in events.jsonl. `infected_running` writes ONLY when session_open is observed. Either timing out or seeing session_open_timeout terminates the walker with a `failed` label that the §4.6 acceptance gate will reject. PHASE_JUSTIFYING_EVENTS in orchestrator/episode.py declares which events justify which phases: "clean": None # orchestrator-emitted "armed": None # orchestrator-emitted "infecting": ("exploit_fire",) "infected_running": ("session_open",) TERMINAL_FAILURE_EVENTS = {"session_open_timeout"} short-circuit any event-driven wait into a `failed` label. `dormant` is intentionally OFF the canonical schedule. §4.5 calls for dormant to be event-driven (session_idle / session_active) too, but the driver doesn't emit those yet. Per §1 default-to-removal we ship without dormant rather than label it from the clock; when the driver gains those emits, dormant re-enters the schedule with proper justification. EpisodeRunner now owns: * `_event_log` — every emit_event appends here * `_event_cv` — condition variable for waiters * `_wait_for_event(names, since_t_mono_ns, timeout_s)` — returns the first matching event in the log with t_mono >= threshold; threshold catches events that fired during the previous on_phase callback. When an event-driven phase's justifier already arrived (e.g. 
exploit_fire emitted by driver._fire() inside on_phase("armed")), the walker uses the EVENT's t_mono on the label — not the time the walker noticed. The label means "this is when this thing actually happened." manifest.toml: dropped the dormant cycle from the canonical schedule. Episode is shorter (~30s) but every label is event-justified. 14 new tests in tests/test_event_driven_labeller.py covering: justifier mapping invariants, _wait_for_event semantics (already-arrived, future, timeout, since-threshold, first-of-multiple-names), walker behavior (orchestrator-emitted phases, event-driven phases, missing event → failed, terminal-failure-event short-circuit, stop event, event-t_mono on label, phase_transition events with justified_by). 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:43:16 -05:00
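The wait semantics in commit d9f913fc97 (already-arrived, future, timeout, since-threshold, first-of-multiple-names) can be sketched with a condition variable. This is a standalone illustration, not the EpisodeRunner code: the `EventLog` class, the event dict shape, and the `t_mono_ns` key are assumptions made for the example.

```python
import threading
import time
from typing import Iterable, Optional


class EventLog:
    """Hypothetical stand-in for the runner's _event_log / _event_cv pair."""

    def __init__(self) -> None:
        self._event_log: list = []
        self._event_cv = threading.Condition()

    def emit_event(self, name: str, t_mono_ns: Optional[int] = None) -> dict:
        event = {
            "name": name,
            "t_mono_ns": time.monotonic_ns() if t_mono_ns is None else t_mono_ns,
        }
        with self._event_cv:
            self._event_log.append(event)
            self._event_cv.notify_all()  # wake every waiter to re-scan the log
        return event

    def wait_for_event(
        self, names: Iterable[str], since_t_mono_ns: int, timeout_s: float
    ) -> Optional[dict]:
        """Return the first logged event matching `names` at or after the
        threshold, or None once `timeout_s` has elapsed."""
        wanted = set(names)
        deadline = time.monotonic() + timeout_s
        with self._event_cv:
            while True:
                # Scan the whole log so events that arrived before the wait
                # started (e.g. during the previous phase callback) still count.
                for event in self._event_log:
                    if event["name"] in wanted and event["t_mono_ns"] >= since_t_mono_ns:
                        return event
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                self._event_cv.wait(remaining)
```

A caller that gets `None` back would write the `failed` label; a caller that gets an event would stamp the label with the event's own `t_mono_ns` rather than the time the wait returned.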
Max Gorog	4d29b7236d	PIPELINE §5 step 3: target VM build infrastructure + containment posture §4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands every target ship the same isolation posture (no upstream egress, no host-shared FS, unprivileged QEMU, fresh snapshot per episode). This commit lands the infrastructure for both. New surface: * orchestrator/target_spec.py Loads + validates `vm/targets/<name>/spec.toml`. Containment fields are not knobs — each has exactly ONE safe value, and a spec asserting the unsafe value is rejected at load time. There's no `--containment-override`; weakening §4.13 requires amending PIPELINE.md and operator sign-off. * tools/build_target.py Orchestrates build → verify → publish for a single target. Spec invalid → exit 78 (sysadmin error). build.sh failure → image not published. verify.sh failure → image discarded; that's the §4.2 acceptance gate. Publishes sha256 + the manifest.toml stanza the operator copies in to admit the image (§16 substantive amendment with sign-off per §15). * vm/targets/<name>/{spec.toml,build.sh,verify.sh} Template structure. spec.toml is the contract; build.sh produces $OUT_PATH; verify.sh boots the produced image under the §4.13 containment posture and asserts every promise. * vm/targets/shellshock/ First real working target. CVE-2014-6271 (Apache mod_cgi + bash 4.2 mis-parsing function-export environment values). Replaces the SourceForge Metasploitable2 path that §3 evidence proved unverifiable. Bash 4.2 is built from sha256-pinned GNU source inside an Alpine 3.21 cloudinit guest; the build script asserts the produced bash actually triggers shellshock; the verifier re-asserts it under restrict=on with a real CVE-2014-6271 probe. * vm/targets/README.md How operators add a target. Walks the spec → build → verify → manifest amendment loop. 
Containment regression tests (tests/test_containment.py) — 20 new assertions, parameterized over every target with a build/verify trio: * verify.sh MUST contain `restrict=on` on its netdev (§4.13) * verify.sh MUST contain `snapshot=on` on the boot drive (§4.13) * verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs * verify.sh + build.sh MUST NOT wrap qemu-system in `sudo` * Every target must ship the complete spec.toml + build.sh + verify.sh trio — no half-built targets (§1 default-to-removal) Spec validation tests (tests/test_target_spec.py): 13 new tests over spec parse, name/dir mismatch, missing fields, out-of-range port, and the §4.13 containment field validators (each unsafe value rejected with a clear error). The shellshock target's image is NOT yet published to manifest.toml's [[targets.images]] — that's the §15 sign-off amendment that lands after a successful operator-driven build_target.py run on a lab host with KVM. Building takes ~10 min on x86_64; cannot run on the Pi under TCG. Operator drives the first build, verifies the sha256, then amends manifest.toml in a follow-up commit. 261 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:31:40 -05:00
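The containment assertions listed for commit 4d29b7236d amount to text checks over each target's scripts. The sketch below shows one way such a check could look; the function name, the message strings, and the `sudo` regex are illustrative assumptions, not the contents of tests/test_containment.py.

```python
import re
from typing import List

# Tokens that indicate a host-shared filesystem, per the commit message.
FORBIDDEN_FS_TOKENS = ("-virtfs", "-fsdev", "9pfs")


def containment_violations(verify_sh: str, build_sh: str) -> List[str]:
    """Return a human-readable list of containment problems (empty if none)."""
    problems: List[str] = []
    if "restrict=on" not in verify_sh:
        problems.append("verify.sh: netdev lacks restrict=on")
    if "snapshot=on" not in verify_sh:
        problems.append("verify.sh: boot drive lacks snapshot=on")
    for script_name, text in (("verify.sh", verify_sh), ("build.sh", build_sh)):
        for token in FORBIDDEN_FS_TOKENS:
            if token in text:
                problems.append(f"{script_name}: forbidden token {token}")
        # Heuristic only: flags `sudo` appearing before qemu-system on one line.
        if re.search(r"\bsudo\b[^\n]*qemu-system", text):
            problems.append(f"{script_name}: qemu-system wrapped in sudo")
    return problems


safe_verify = "qemu-system-x86_64 -netdev user,id=n0,restrict=on -drive file=img.qcow2,snapshot=on"
safe_build = "qemu-system-x86_64 -drive file=out.qcow2"
unsafe_verify = "sudo qemu-system-x86_64 -virtfs local,path=/srv"
```

Parameterizing a test over every `vm/targets/<name>/` directory and asserting the returned list is empty would give the per-target coverage the commit describes.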
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file — manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob — ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run_*_demo.py --manifest (samples path now from canonical) New surface: * manifest.toml — the single source of truth * orchestrator/manifest.py — load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta — populated by run_*_demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance * cis490-orchestrator.service — RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh — validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 
12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
Max Gorog	dac03d2eff	perf: emit per-episode lifecycle events; emit row even with empty agg Validation on k-gamingcom (commit `ac7b85f`) showed perf enabled in production but rows_perf=0 on every episode. Without lifecycle events the failure mode is indistinguishable from "perf wasn't enabled" — §1 silent-downgrade. The events now surface the actual cause: - perf_unavailable — binary missing OR launch failed (with reason) - perf_started — perf is running (pid, events, interval) - perf_first_row — first row written; counters_populated tells whether any event was actually counted - perf_finished — final tally (intervals_seen, intervals_with_values) - perf_no_counters — perf was alive but every interval came back <not counted> (likely paranoid > 2 or PID ownership mismatch) `_flush()` now writes a row whenever an interval is observed, even when every event was <not counted>. The all-None row is honest data ("perf observed this interval and counted nothing"), and the rows become a count of observed intervals rather than a count of successful measurements — distinct from rows_proc / rows_qmp which do count successful measurements. Trainers filter on `cycles is not None` etc. when they need only populated rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:08:42 -05:00
Max Gorog	0390eb20b6	fix: revert speculative fleet picker change — was producing dishonest labels Empirical evidence from k-gamingcom (commit `4ab5477`, 2026-05-03 22:20Z vsftpd_234_backdoor episode): the picker selected vsftpd because BRIDGE was set on that host. The exploit fires against target_ip=127.0.0.1 (SLIRP loopback) but vsftpd's hardcoded port-6200 backdoor is reachable only at the guest's bridge IP. Result: session_open_timeout, AND a schedule-clock-driven `infected_running` label was still written for the failed exploit — exactly the §10 poisoned-training-example pattern. Until guest-IP discovery for bridge mode is wired (a separate piece of infrastructure), bridge-only modules can't actually reach their target even when the operator sets BRIDGE for Tier-2's pcap source. Revert the picker to its prior conservative form: drop requires_bridge modules unconditionally regardless of BRIDGE state. Same for the BRIDGE env strip in the Tier-3 launch path — it was correct as unconditional. Replaces the two aspirational tests (test_fleet_uses_all_modules_when_bridge_set, test_fleet_propagates_bridge_env_to_runner) with their honest negatives (test_tier3_drops_requires_bridge_modules_unconditionally, test_tier3_strips_bridge_env_even_when_set). The previous tests asserted behavior the rest of the pipeline can't deliver; they were false signals. 229 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:58:43 -05:00
Max Gorog	4ab5477226	PIPELINE §5 step 1: fix four root-cause defects Diagnoses + fixes for the silent-collector / never-lands-session failures that the 200-episode quality probe surfaced (§3 evidence). All four address the producer; no compensating layers added. perf collector (rows_perf=0 on 100% of episodes): - perf stat -j writes to stderr by default with -p; we read stdout. Add --log-fd 1 so JSON reaches stdout where the parser sees it. - Event names come back annotated with the privilege scope perf actually measured ("cycles:u" under perf_event_paranoid=2). Strip the suffix so _build_row's plain-name lookups hit. Without this every metric was None even when perf reported real numbers. - tests/test_collectors_emit.py covers the regression with a real busy-loop fixture; emit-test discipline per §4.4. guest-agent collector (rows_guest=0 on 100% of episodes): - Alpine cloud image doesn't ship python3, so the in-guest agent's `#!/usr/bin/env python3` shebang silently fails. Add packages: [python3] to cidata user-data so cloud-init installs it before the OpenRC service starts. - Guest agent now exits nonzero (was: silent stdout fallback) when /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC reports the failure to /var/log/cis490-agent.log instead of the bytes vanishing into the void. Refs §1. - Host-side collector emits guest_agent_connected / guest_agent_first_byte / guest_agent_silent_window into the orchestrator's events.jsonl. Future episodes show the in-guest failure mode per-episode instead of inferring from rows_guest=0. k-gamingcom missing qmp/netflow/pcap (also affected elliott on Tier-3 episodes — was misclassified as host divergence): - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT qmp_socket / guest_agent_socket / bridge_iface — even though launch_target.sh creates the underlying chardevs and BRIDGE supplies the iface. tools/run_real_vm_demo.py wires them correctly; Tier-3 had a copy-paste gap. 
- tests/test_collectors_emit.py adds a source-grep regression so the wiring stays honest. samba_usermap_script never lands session (0/67 in §3 probe): - Bind handler default WfsDelay (~5s) gives up before bind_perl on Metasploitable2 has finished forking + binding LPORT under SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in exploits/driver.py so framework + driver agree on the wait budget. Add ConnectTimeout=15 so the handler's bind connect has retry budget instead of one-shot. orchestrator/fleet.py: usable_modules + BRIDGE handling were both unconditional, so: - With BRIDGE set, requires_bridge modules were still being dropped — picker only ever returned samba_usermap_script across every slot/episode (the test_fleet_uses_all_modules_when_bridge_set failure on HEAD). - env.pop("BRIDGE") fired even when BRIDGE was the operator's explicit setup, breaking modules that need bridge mode (vsftpd backdoor on hardcoded port 6200, distccd, etc.). Both made conditional on bridge_set so the picker walks the full catalog under bridge mode and SLIRP-only modules still get a clean SLIRP env when BRIDGE is unset. receiver/app.py: half-pregnant v2 schema state in HEAD — calling store.ingest_stream(episode_type=..., benign_profile=...) with kwargs the matching store.py change was in the WIP stash. Removed v2 awareness from app.py so v1 episodes (what the producer ships today) get accepted again. SCHEMA_VERSION default reset to 1 to match. 229 passed, 0 failed. (HEAD had 15 failures, all linked to the half-pregnant v2 state above.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:05:25 -05:00
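The perf fix in commit 4ab5477226 strips the privilege-scope annotation (for example `cycles:u`) before looking event names up. A hedged sketch of that normalization follows; the helper name and the modifier character set are assumptions for illustration, and the real collector may handle this differently.

```python
# Assumption: single-letter perf event modifiers that may follow the final
# colon. Anything else after a colon (e.g. a tracepoint name such as
# "sched:sched_switch") is treated as part of the event name and kept.
_MODIFIER_CHARS = set("ukhIGHpPSDWeb")


def strip_privilege_suffix(event_name: str) -> str:
    """Return the event name without a trailing ':<modifiers>' annotation."""
    base, sep, suffix = event_name.rpartition(":")
    if sep and suffix and set(suffix) <= _MODIFIER_CHARS:
        return base
    return event_name


normalized = [strip_privilege_suffix(n) for n in ("cycles:u", "instructions", "cache-misses:uk")]
```

With names normalized this way, plain-name lookups such as `row["cycles"]` match regardless of the `perf_event_paranoid` level the host was measured under.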
Elliott Kolden	667f042707	Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01) Root causes and fixes documented in TIER3-BRINGUP.md. Summary: 1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot. 2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring modules selected on SLIRP runs; fix: always filter requires_bridge. 3. cmd/unix/interact creates no session.list entry → session_open_timeout every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl. 4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444); fix: extra_host_port:extra_host_port mapping so guest binds the per-slot LPORT directly. 5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots; fix: requires_bridge=true filters it from SLIRP fleet runs. 6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba boots (~60 s too early); fix: replace TCP probe with serial console _wait_for_serial_login that waits for actual "login:" prompt. 7. Stale QEMU survives orchestrator restart (start_new_session=True) → holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from old pidfile before rmtree. 8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100. 9. msfrpcd 6.x returns bytes for all string values even with raw=False; fix: MSFRpcClient._str() recursive decoder applied to all responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:26:19 -06:00
max	5c0bc9af8e	meta.json: stamp code_version (commit, branch, dirty) per episode Closes a real reproducibility gap. Three weeks of bug fixes have shipped (probe fix in `2707709`, multi-signal classifier in `321ea63`, mandatory tier-4 in `265f3ad`, etc.); without a per-episode code_version, trainers can't tell which episodes came from buggy pre-fix code and have to scan every tarball to guess. Resolution priority (cached across episodes): 1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes it at install time since /opt/cis490 is a flat copy with no .git) 2. git rev-parse HEAD from the repo root (dev clones) 3. {"commit": "unknown", "source": "unknown"} so the field is always present (filterable) Output shape, always present in meta.json: "code_version": { "commit": "<40-hex>" | "unknown", "branch": "<name>" | null, "dirty": bool | null, "source": "VERSION-file" | "git" | "unknown" } install-lab-host.sh writes VERSION at install time with the source repo's git rev-parse HEAD + branch + clean-tree flag + install timestamp. Lab-host agents that pull main + re-run install-lab-host.sh get a fresh stamp automatically. 148/148 tests pass; test_episode_against_self_pid_produces_full_directory asserts the field's presence + valid `source` value.	2026-05-01 01:29:01 -05:00
max	507eac617b	Solvable Tier-3 holes: callback payloads, busybox workloads, bridge by default Closes the next batch of issues from the post-mortem. The previous "each run uses a different vulnerability" commit shipped 5 modules but 3 of them couldn't actually fire under SLIRP+restrict=on: their reverse-shell payloads needed a callback channel the launcher didn't provide, AND their LHOST options were set to {{ target_ip }} (the target's IP, not the attacker's — copy-paste from RHOSTS). Same time, the workloads.py shell commands used bash-only /dev/tcp redirects that silently no-op'd in the busybox shell sessions Metasploitable2 returns. Net effect: episodes that selected those modules would have produced session_open_timeout + dead workloads. Module configs (the three callback ones): exploits/modules/distccd_command_exec.toml exploits/modules/php_cgi_arg_injection.toml exploits/modules/unreal_ircd_3281_backdoor.toml - Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl so the target listens on a known port; msfrpcd connects to it via the host's hostfwd (no callback path required). - Drop the bogus LHOST = "{{ target_ip }}" — bind shells don't use LHOST. - Add [runtime] table: requires_bridge = true extra_target_ports = [<bind_lport>] Both fields are honored by the loader (ModuleConfig.requires_bridge) and the launcher (TARGET_PORTS gets the extra port hostfwd'd when BRIDGE mode is active). orchestrator/fleet.py When BRIDGE is unset in env, _run_slot filters the module catalog down to modules where requires_bridge=False before calling select_module. Two same-socket-shell modules (vsftpd_234_backdoor + samba_usermap_script) survive — fleet still has variety; just doesn't pick modules whose payloads can't land. With BRIDGE set, the full catalog rotates as before, AND BRIDGE is propagated to the per-slot subprocess env so launch_target.sh enters tap+bridge mode. 
exploits/workloads.py Replaced bash-only constructs in three profiles: scan-and-dial /dev/tcp/HOST/PORT redirects → nc -z -w 1 bursty-c2 same fix shell-resident exec 3<>/dev/tcp/... → piping into nc -w All three now run cleanly in busybox / dash / Metasploitable2's default shell. The remaining three profiles (cpu-saturate, io-walk, low-and-slow) were already busybox-portable. scripts/install-lab-host.sh - lab-host.env now defaults BRIDGE=br-malware (was commented out). Operator opt-out is to comment the line out again. - New step 6b: provisions br-malware via vm/setup_bridge.sh AND pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo, cis490target0..7 for Tier-3 target) all attached to br-malware and brought up. Launchers reference these by SLOT — no sudo needed at episode time. - On bridge-setup failure, the script auto-comments BRIDGE in the env file with a "auto-disabled: bridge setup failed" note so the fleet falls back to same-socket modules + Tier-2 cleanly. tools/cis490_doctor.py Two new checks for the lab-host role: bridge: br-malware exists / up tier3: msfrpcd listening on 127.0.0.1:55553 tier3: module catalog parses (counts same-socket vs requires_bridge) All three are warn-level — they don't fail an otherwise-healthy Tier-2-only setup; they tell the operator what's missing for full Tier-3 + source 4 coverage. Tests: 132 (was 129). New cases: test_fleet.py +3 - fleet skips requires_bridge modules when BRIDGE unset (asserted across 20 episodes; never picks a callback module) - fleet uses the full catalog when BRIDGE is set - BRIDGE env propagates to per-slot subprocess What's still untested live: the bind_perl payloads against a real Metasploitable2 in the bridge-enabled launcher path. That's a deployment validation, not a code change. The unit tests confirm the dispatch / filter logic; the live test is the next operator action. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:32:52 -05:00
max	a193d17ead	fleet: rotate exploit modules per (host, slot, ep); Tier 3 by default Closes the "every run hits the same vulnerability" gap. Before this commit, the fleet shipped Tier-2 episodes (no exploit at all) with only the post-infection sample varying. Tier-3 had a single canned module — vsftpd_234_backdoor — so even when exploit fire was exercised, the entry vector never changed. Trainer would see one shape of `armed → infecting` and learn nothing about how varied real exploits look on the wire / in /proc. What landed: exploits/modules/ + samba_usermap_script.toml CVE-2007-2447, SMB:139 + distccd_command_exec.toml CVE-2004-2687, distcc:3632 + php_cgi_arg_injection.toml CVE-2012-1823, http:80 + unreal_ircd_3281_backdoor.toml CVE-2010-2075, ircd:6667 (vsftpd_234_backdoor.toml unchanged) All five are canonical Metasploitable2 vectors with stable Metasploit modules. Each TOML carries the RPORT the launcher needs to wire its hostfwd at, plus a payload tuned to a clean shell session (cmd/unix/interact for in-band shells, cmd/unix/reverse* with deterministic LPORTs for reverse shells). exploits/modules.py + select_module(catalog, host_id, slot, episode_index) — same SHA-256-keyed deterministic selection shape SampleManifest uses for samples. Two hosts at the same slot/episode hash to different modules; one host walks the full catalog within ~len(catalog) episodes. + module_target_port() — pulls RPORT off the module config so the fleet can plumb the launcher's hostfwd at the right service. orchestrator/fleet.py - _run_slot now decides Tier 3 vs Tier 2 from msfrpcd reachability + module-catalog populated. Default is Tier 3 when both are true; Tier 2 fallback when not (logged + recorded in SlotResult.tier so trainers can filter no-exploit episodes). - Per-slot module via select_module() — each concurrent slot in a wave gets a different vector AND a different sample. 
- PORT_BASE per slot (target_port + slot * 1000) so concurrent Tier-3 targets don't collide on the host-side hostfwd port. - _msfrpcd_available() probe gates the dispatch. - Fleet-side log line records (slot, ep, tier, sample, module, run_dir) so the operator can see at a glance what each wave is exercising. - SlotResult grows tier + module_name fields; FleetConfig grows modules + force_tier2 + msfrpcd_{host,port} fields. orchestrator/episode.py + EpisodeConfig.exploit_meta — plain dict the runner stamps into meta.exploit so every Tier-3 episode records {framework, module path, module type, payload, RPORT, RHOSTS template}. Trainers join on meta.exploit.module_name to stratify by entry vector; meta.sample.name to stratify by post-infection family. tools/run_tier3_demo.py + Builds exploit_meta from the loaded ModuleConfig and passes it to EpisodeConfig. Sample is now also passed (was missing). tools/run_fleet.py + --modules-dir (default exploits/modules/) — load module catalog on startup; pass to FleetConfig. + --force-tier2 — escape hatch for dev / smoke tests. + JSON output now includes per-slot {tier, module} so the operator can see at a glance what each slot ran without grepping logs. Tests: 129 (was 119). New cases: test_exploits.py +6 - catalog has at least the five canonical Metasploitable2 vectors - select_module is deterministic per (host, slot, ep) - select_module diversifies across hosts - select_module walks the full catalog over many episodes - module_target_port pulls RPORT for each shipped TOML test_fleet.py +4 - _run_slot dispatches to run_tier3_demo.py when msfrpcd up - falls back to run_real_vm_demo.py when msfrpcd unreachable - falls back when module catalog empty - --force-tier2 overrides msfrpcd availability - PORT_BASE is unique per concurrent slot (no hostfwd collision) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:22:49 -05:00
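Commit a193d17ead describes `select_module` as SHA-256-keyed and deterministic per (host, slot, episode), with one host walking the whole catalog over roughly `len(catalog)` episodes. The sketch below is one way to get both properties and is an assumption, not the shipped exploits/modules.py: it hashes (host, slot) into a starting offset and then steps by episode index. The `port_base_for_slot` helper simply restates the `target_port + slot * 1000` rule from the same commit.

```python
import hashlib
from typing import Sequence


def select_module(catalog: Sequence[str], host_id: str, slot: int, episode_index: int) -> str:
    """Pick a module name deterministically for (host, slot, episode).

    Assumption: `catalog` is a sequence of module names. Sorting makes the
    result independent of load order.
    """
    if not catalog:
        raise ValueError("module catalog is empty")
    names = sorted(catalog)
    digest = hashlib.sha256(f"{host_id}:{slot}".encode("utf-8")).digest()
    offset = int.from_bytes(digest[:8], "big")
    # Stepping by episode index visits every module once per len(names) episodes.
    return names[(offset + episode_index) % len(names)]


def port_base_for_slot(target_port: int, slot: int) -> int:
    """Host-side port base so concurrent Tier-3 targets do not collide."""
    return target_port + slot * 1000


CATALOG = [
    "vsftpd_234_backdoor",
    "samba_usermap_script",
    "distccd_command_exec",
    "php_cgi_arg_injection",
    "unreal_ircd_3281_backdoor",
]
walked = {select_module(CATALOG, "lab-host-a", 0, ep) for ep in range(len(CATALOG))}
```

A pure per-episode hash would also be deterministic but would not guarantee full coverage within `len(catalog)` episodes, which is why this sketch uses an offset plus a step.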
max	d86502d950	workload audit trail: meta.sample + per-phase events + pre-kill probe The elliott-lab episode showed every phase median'd 20% CPU because the in-guest workload silently never fired — and there was no signal in events.jsonl to detect that from outside, so a trainer would treat the labels as ground truth and learn "all phases look identical". This commit closes the audit gap so the failure is visible in meta: orchestrator/episode.py EpisodeConfig.sample: Sample | None — the manifest entry that drove this episode's workload selection. Stamped into meta.sample as {name, family, category, profile, kind, sha256} so trainers can join cleanly without re-deriving from events. None means the v1 yes-loop fallback path ran (and the trainer should treat the episode with appropriate skepticism). tools/vm_load_controller.py VMLoadController gains an emit_event callable. Every phase now emits a workload_* event into the runner's events.jsonl: workload_setup login + initial cleanup OK workload_killed clean / dormant. Dormant carries a `pre_kill_probe` dict from inside the guest (`pgrep -c yes`, `pgrep -c sh`, /proc/loadavg) so the trainer can detect the elliott-lab failure mode where the workload never actually ran. workload_armed armed handshake fired workload_infecting dd urandom / payload write fired workload_started infected_running command sent workload_failed any of the above raised inside SerialClient (timeout, EOF, partial login). The runner would have silently swallowed the exception via its on_phase try/except; the audit row makes the failure detectable. Exceptions in shell calls surface as workload_failed events but do NOT propagate, matching the runner's existing on_phase contract. tools/run_real_vm_demo.py Wires the controller's emit_event to the runner's emit_event via a small forward-reference closure (controller is built before runner; runner.emit_event needs to be the sink). 
Sample also flows into EpisodeConfig.sample so meta.sample matches what the controller actually ran. Tests: 119 (was 106). New cases: tests/test_vm_load_controller.py (11 tests against a FakeSerial) - setup emits workload_setup - infected_running runs the v1 yes-loop AND emits workload_started - dormant probes BEFORE killing and stamps pre_kill_probe - dormant probe records "yes=0" (the elliott-lab fingerprint) - clean / armed / infecting all emit their respective events - serial.run() exception → workload_failed event, no propagation - sample-with-profile dispatches to exploits.workloads command (NOT the v1 yes-loop) - missing emit_event callback is a no-op (back-compat) tests/test_episode.py (2 new) - meta.sample carries name/family/category/profile/kind/sha256 when EpisodeConfig.sample is set - meta.sample stays null in the v1 fallback path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:12:34 -05:00
max	8753340ea3	fleet: fix per-slot run-dir collision so concurrent VMs actually run Root cause of "fleet says max_concurrent=3 but only one episode ships per wave" symptom on elliott-lab: 1. orchestrator/fleet.py::_run_slot set env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot. 2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm (NO slot suffix), then UNCONDITIONALLY overwrote the env's RUN_DIR with that flag's value before exec'ing the launcher. 3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots collided on the same socket dir. 4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's rmtree literally deleted slot 0's pidfile + sockets mid-boot. 5. Net effect: one VM survives per wave on a multi-core host that should be running ~cores-1 in parallel. Throughput collapses to 1/N. Fix: tools/run_real_vm_demo.py + tools/run_tier3_demo.py: --run-dir default cascade — 1) explicit CLI flag 2) RUN_DIR env (set by fleet runner) 3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0) Same change in both runners so Tier-2 + Tier-3 fleet waves parallelize cleanly. orchestrator/fleet.py::_run_slot: Pass --run-dir explicitly to the subprocess so the per-slot path is audit-visible in the fleet log instead of buried in env. Also flip the subprocess interpreter to repo_root/.venv/bin/python when present (was /usr/bin/env python3 — worked by luck because the orchestrator path doesn't import msgpack/httpx, but a Tier-3 fleet wave would have died at import-time on a host without those in system Python). etc/cis490-orchestrator.service: Removed the duplicate [Service] hardening block at the bottom of the file that was silently overriding the AmbientCapabilities grant (NoNewPrivileges=true at the bottom flipped the NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit them). Sources 3 + 4 would have failed silently inside the sandbox. 
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable. 106/106 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:55:56 -05:00
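The --run-dir default cascade described in the commit above can be sketched as follows. This is a minimal illustration, not the code in tools/run_real_vm_demo.py: the function name `resolve_run_dir` and its signature are hypothetical, and only the three-step precedence (CLI flag, then RUN_DIR, then a per-SLOT default) comes from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_run_dir(cli_value: Optional[str],
                    env: Mapping[str, str] = os.environ) -> str:
    """Pick the run dir with the precedence named in the commit message.

    `resolve_run_dir` is a hypothetical helper name used for illustration.
    """
    # 1) An explicit --run-dir on the command line always wins.
    if cli_value:
        return cli_value
    # 2) Otherwise honour RUN_DIR, which the fleet runner sets per slot.
    env_value = env.get("RUN_DIR")
    if env_value:
        return env_value
    # 3) Fall back to a slot-suffixed default so two slots never share a dir.
    slot = env.get("SLOT", "0")
    return f"/tmp/cis490-vm-{slot}"
```

Passing `env` as a parameter keeps the precedence testable without touching the real process environment.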
max	7311802822	orchestrator: emit snapshot_load before _write_meta to keep t_mono ~0 On slower disks (Pi5 SD cards, mu's hardware) the json.dump → write → os.replace path inside _write_meta takes more than 1 ms, so when the snapshot_load event fired afterwards its t_mono_ns drifted past the "<1 ms after origin" assertion in test_driver_events_persist_to_events_jsonl. Fix: emit snapshot_load immediately after setting _t_mono_origin_ns, before any file I/O. Matches the semantic intent (snapshot_load marks episode clock = 0) and removes the disk-speed dependency from the event timeline. Diagnosis + suggested patch from spectral/CIS490#7 (filed by mu). Closes spectral/CIS490#7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:49:50 -05:00
max	a88ac83db0	Close out the deployment-readiness gaps Wraps the gaps surfaced in the "what is not implemented" audit so the fleet really is shippable end-to-end. Verified live on the Pi: - cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the new wg-pki client CA leaf - real episode dir → tar+zstd → PUT → HTTP 201 stored - re-ship same bytes → 200 (idempotent) - re-ship different bytes under same id → 409 (conflict) Changes: orchestrator/episode.py - EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/revert per docs/architecture.md). When set + qmp_socket present, EpisodeRunner issues loadvm <snapshot_name> and emits snapshot_revert / snapshot_revert_failed events on the same monotonic clock as everything else. collectors/qmp.py - savevm() / loadvm() helpers using human-monitor-command, plus a test against the fake QMP server. exploits/workloads.py - chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer-busting payload. Includes a finalize step that sha256-verifies on the guest before exec. - real_binary_workload() now wraps the chunked plan for backwards compat with single-shot callers. exploits/driver.py - Tier-4 dispatch walks the chunked plan in MSFExploitDriver: each chunk is a separate session_shell_write; finalize verifies; exec only runs on sha-ok. New events: real_binary_upload_begin, real_binary_verify, real_binary_aborted. etc/cis490-orchestrator.service - Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE). - Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) + CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors work under hardening. scripts/install-lab-host.sh - Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID defaulting to `hostname -s`. 
- Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and builds cidata.iso with the in-guest agent embedded; symlinks both into /opt/cis490/vm/images/ so launchers find them. scripts/fetch-alpine-baseline.sh - Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2 matching the sha512 in docs/sources.md. tools/plot_envelope.py - Rebuilt to render whatever telemetry the episode dir contains: proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs → guest agent load/mem. Missing sources are silently skipped. tools/index_reader.py - cis490-index CLI: filter receiver's index.jsonl by host / sample / time range, sort, count-by group. Closest thing to a query interface until we stand up Postgres/Timescale. samples/README.md - Rewritten to match the new manifest schema, the kind=real vs mimic split, the per-(host, slot, ep) selection mechanic, and the chunked-upload safety story. Tests: 106 pass (was 102). New cases: - test_qmp.py — savevm + loadvm (HMP wrapper + error path) - test_tier4.py — chunked plan splitting, sha-pinned finalize, end-to-end driver walks all chunks + verify + exec via the fake msfrpc client Closes the "what is not implemented" punch list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:31:55 -05:00
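The chunk arithmetic in the commit above (8 KiB base64 chunks carrying about 6 KiB of binary each, with a sha256 check before exec) can be illustrated with a small sketch. The names `UploadPlan`, `plan_upload` and `reassemble_and_verify` are hypothetical and are not the repository's `ChunkedUpload` API; the sketch only shows why 6144 raw bytes map to exactly 8192 base64 characters and how a digest pin catches a corrupted reassembly.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import List

# 6144 raw bytes is a multiple of 3, so it encodes to exactly 8192 base64
# characters with no padding; only the final chunk may be shorter or padded.
RAW_CHUNK_BYTES = 6 * 1024


@dataclass(frozen=True)
class UploadPlan:
    """Hypothetical plan shape: ordered base64 chunks plus the pinned digest."""
    chunks: List[str]
    sha256: str


def plan_upload(payload: bytes, raw_chunk_bytes: int = RAW_CHUNK_BYTES) -> UploadPlan:
    chunks = [
        base64.b64encode(payload[i:i + raw_chunk_bytes]).decode("ascii")
        for i in range(0, len(payload), raw_chunk_bytes)
    ]
    return UploadPlan(chunks=chunks, sha256=hashlib.sha256(payload).hexdigest())


def reassemble_and_verify(plan: UploadPlan) -> bytes:
    """Stand-in for the guest-side finalize step: decode, then check the hash."""
    data = b"".join(base64.b64decode(chunk) for chunk in plan.chunks)
    if hashlib.sha256(data).hexdigest() != plan.sha256:
        raise ValueError("sha256 mismatch: refusing to execute payload")
    return data
```

Encoding each chunk independently means the receiver can decode and append chunk by chunk without buffering the whole base64 stream.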
max	bdcd2ecbef	Close out the open issues: bridge pcap wiring, perf collector, Tier-4 Wraps the three remaining 🚧 items from the README so every collector the threat-model promises is actually live, and the Tier-4 path (real-malware fetch + upload + exec) works end-to-end as soon as a sha256 lands in samples/store/. Closes spectral/CIS490#4, #5, #6. == #6 — Bridge pcap wiring == EpisodeConfig grows three optional fields: bridge_iface: str | None # e.g. "br-malware" bridge_ip: str = "10.200.0.1" pcap_snaplen: int = 256 When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration of the schedule (network.pcap), stops it cleanly on episode end, and runs collectors.pcap.bucketize() to produce netflow.jsonl per the 100-ms schema in docs/data-model.md. EpisodeResult + meta.result gain rows_netflow + pcap_bytes counters. vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode and tap+bridge based on $BRIDGE — operator pre-creates the tap as a bridge member, no sudo from the launcher. run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can opt entire waves into pcap mode by exporting BRIDGE before invocation. == #5 — Source 3 perf collector == collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j`` and parses the per-event JSON stream. Aggregates one row per interval across the canonical event set (cycles/instructions/cache-{refs,misses}/ branches/branch-misses/page-faults/context-switches), computes IPC + cache-miss rate. Tolerates missing events (``<not counted>`` / ``<not supported>``) without dropping the row, and skips cleanly when ``perf`` isn't on PATH or the process can't be attached. EpisodeConfig.enable_perf=True opts into the collector — off by default because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When enabled, runs as a parallel thread alongside the other collectors; EpisodeResult.rows_perf records the count. 
== #4 — Tier 4 (real-malware fetch + upload + exec) == tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar (API key from env or samples/.bazaar.token), unzips with the standard "infected" password, verifies the resulting binary's sha256, lands at samples/store/<sha256>. Idempotent — already-staged correct binaries return immediately. samples/manifest.py: Sample.binary_path(store_root) resolves to the staged binary path, or None for mimics / not-yet-fetched real samples. exploits/workloads.py: real_binary_workload(bytes, sample) builds a Workload that base64-uploads the binary into the shell session via a heredoc, decodes + chmods + execs it in the background, captures the PID for clean stop on dormant. Per-profile pid/bin paths so concurrent samples in the same guest don't collide. exploits/driver.py: dispatch order is now: 1) sample.kind == "real" + binary staged at sample_store_root → real_binary_workload (Tier 4) 2) profile mimic from workloads.workload_for() (Tier 3 v2) 3) None → driver v1 fallback yes-loop DriverConfig.sample_store_root is the new field; run_tier3_demo.py wires it to repo_root/samples/store. driver_setup event records sample_sha256 so trainers can join Tier-4 episodes against the manifest by hash. samples/store/.gitkeep added (binaries themselves are gitignored). Tests: 102 pass (was 86). New suites: tests/test_perf_qemu.py — parser + builder + perf-missing fallback tests/test_tier4.py — real_binary_workload base64 round-trip, stop-cmd kills pidfile, per-profile path isolation, driver dispatch chooses real vs mimic correctly, fetcher input validation and cached-fast-path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:17:49 -05:00
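The derived perf metrics mentioned under #5 (IPC and cache-miss rate, tolerating uncounted events without dropping the row) can be sketched like this. It is an assumption-laden illustration: the function name `derive_rates`, the row keys, and the convention of representing an uncounted event as `None` are all hypothetical and are not taken from collectors/perf_qemu.py.

```python
from typing import Dict, Optional


def _ratio(num: Optional[float], den: Optional[float]) -> Optional[float]:
    # A missing or zero counter yields None instead of raising, so the row
    # is still emitted with the metrics that could be computed.
    if num is None or den is None or den == 0:
        return None
    return num / den


def derive_rates(counts: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Compute per-interval derived metrics from raw event counts.

    `counts` maps an event name to its value for one interval, with None
    standing in for an event perf could not count (hypothetical convention).
    """
    return {
        "ipc": _ratio(counts.get("instructions"), counts.get("cycles")),
        "cache_miss_rate": _ratio(counts.get("cache-misses"),
                                  counts.get("cache-references")),
    }
```

Returning `None` rather than `0.0` for an uncomputable ratio keeps "no data" distinguishable from a genuinely zero rate downstream.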
max	1b6c7b2f4a	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts This is the chunk that makes "real data" actually flow on multiple hosts in parallel. End-to-end pipe was up at `613c6fa` / 2579683; now the lab-host side has the diversity + concurrency it needs. Collectors landed: collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP client + row builder + run loop. Tolerates older qemu without query-stats. collectors/guest_agent.py — source 5 (deployable). Reads the virtio-serial host-side socket, parses agent JSON-lines, re-stamps to the host monotonic clock, persists. collectors/pcap.py — source 4 (deployable). tcpdump capture + pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md. In-guest agent: vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs, thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent. tools/build_cidata.py — embeds the agent + an OpenRC service into user-data so first boot of the Alpine cidata image auto-starts it. Launchers: vm/launch_demo.sh / launch_target.sh — second virtio-serial port for the agent socket; SLOT env support so multiple VMs run without socket / port collisions; PORT_BASE on launch_target so multiple target VMs hostfwd different host ports. vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24, no NAT). Idempotent. Fleet: orchestrator/fleet.py — capacity detector (cores / RAM / load headroom) + concurrent-slot runner. Per-slot ENV selects the sample. FleetCapacity dataclass round-trips into meta.json so "this episode ran with 6 concurrent VMs" is auditable post-hoc. tools/run_fleet.py — CLI: --capacity report; --waves N runs N waves of (max_concurrent) episodes each, every slot with a different sample. 
etc/cis490-orchestrator.service — now drives the fleet runner with Restart=always so each invocation runs one wave and respawns, giving a continuous stream. Samples: samples/manifest.toml — six profiles spanning the five major behaviour shapes. Each entry is real OR mimic (sha256 distinguishes). samples/manifest.py — strict TOML loader (rejects dups, unknown categories) + deterministic select(host_id, slot, episode_index) so different hosts on the network walk the catalog in different orders without any coordinator. EpisodeRunner: orchestrator/episode.py — optional qmp_socket + guest_agent_socket fields on EpisodeConfig; when set, additional collector threads run alongside proc_qemu. EpisodeResult now carries rows_qmp + rows_guest counters. Tier-3 setup automation: scripts/install-msfrpcd.sh — installs metasploit-framework where the package manager has it, generates a strong password into /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to 127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch once MSFRPC_PASSWORD is sourced. scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256 from the operator (Rapid7 download is registration-walled), pulls, verifies, converts vmdk → qcow2, lands at vm/images/. Tests: 82 pass (was 51). New suites: tests/test_qmp.py — fake QMP server, capability handshake, blockstats, async-event interleaving, 5-failure backoff tests/test_guest_agent.py — fake virtio socket, JSON-lines read + re-stamp, malformed-line tolerance tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames, bucketize correctness across windows tests/test_fleet.py — capacity math (8-core idle / low-RAM / high-load / Pi5 / 1-core box), manifest selection determinism + diversity What's queued for the next commit (already discussed in convo): - MSFExploitDriver v2: map sample.profile → distinct in-session workload so Tier-3 episodes don't all produce the same yes-loop envelope. Critical for ML to learn varied malware shapes. 
- Real-sample fetch from MalwareBazaar by sha256. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:02:27 -05:00
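One way a coordinator-free, deterministic `select(host_id, slot, episode_index)` could work is sketched below. The commit message states only the properties (deterministic, different hosts walk the catalog in different orders); the hashing scheme, the helper `host_order`, and the extra `names` parameter are assumptions for illustration and may differ from samples/manifest.py.

```python
import hashlib
from typing import List, Sequence


def host_order(names: Sequence[str], host_id: str) -> List[str]:
    """Return a host-specific but reproducible ordering of sample names."""
    def key(name: str) -> str:
        # Hashing host_id together with the name gives each host its own
        # stable permutation without any shared state.
        return hashlib.sha256(f"{host_id}:{name}".encode("utf-8")).hexdigest()
    return sorted(names, key=key)


def select(names: Sequence[str], host_id: str, slot: int, episode_index: int) -> str:
    """Pick a sample name for this (host, slot, episode) triple."""
    if not names:
        raise ValueError("sample catalog is empty")
    order = host_order(names, host_id)
    # Adjacent slots in the same wave land on adjacent catalog entries, so
    # they differ whenever the wave is no wider than the catalog.
    return order[(episode_index + slot) % len(order)]
```

Because the only inputs are the catalog and the triple itself, any host can recompute which sample an episode should have used when auditing after the fact.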
max	613c6fa223	Tier 3: msfrpc-driven exploit driver + first module config Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into EpisodeRunner.on_phase, fires a Metasploit module against a target VM via msfrpcd, watches for the resulting session, and stamps each transition (exploit_fire, session_open, session_landing_probe, sample_executed, session_dormant, session_killed) into the episode's events.jsonl on the orchestrator's monotonic clock. What landed: - exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth, module.execute, job/session lifecycle) so we don't depend on a third-party MSF wrapper. - exploits/driver.py — phase-to-msfrpc adapter; idempotent fire, session-open polling with timeout, workload start/stop, teardown. - exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml — TOML module configs with {{ target_ip }} placeholders, replacing the imperative .rc-script approach the README previously hinted at. - vm/launch_target.sh — SLIRP+restrict=on launcher for the intentionally-vulnerable target VM (host can reach guest via hostfwd, guest cannot reach host or internet). - tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo. - tests/test_exploits.py — 12 new tests against a fake MSFRpcClient, including an integration test that drives a real EpisodeRunner. Plumbing changes: - EpisodeRunner._emit_event → public emit_event, so external drivers share the runner's monotonic clock and events.jsonl. - mkdir for episode_dir moved to __init__ so emit_event is callable before run() (driver_setup fires pre-schedule). Status: driver + tests pass (40/40); end-to-end against a live msfrpcd + Metasploitable2 image is the next bring-up step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:11:52 -05:00
Maximus Gorog	970698af83	Synthetic envelope demo: phase-driven load mimic + plotter End-to-end pipeline now produces a labeled envelope from a single command. Drives the orchestrator through an 8-phase XMRig-shaped schedule and renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands sourced from labels.jsonl. Real telemetry, simulated load — validates the collection + labeling shape before a real VM is involved. Components: - tools/load_mimic.py phase-driven load generator. Reads phase commands on stdin; CPU/IO behavior matches the named phase (clean=idle, armed=light burst, infecting=disk burst+CPU, infected_running= CPU saturation+stratum-shaped writes, dormant=quieter than clean). - tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with a default 85s schedule that includes the classic infected_running → dormant → re-entry pattern. - tools/plot_envelope.py reads telemetry + labels from an episode dir, writes envelope.png with colored phase bands. orchestrator: EpisodeRunner now takes an optional phase_schedule and an on_phase callback. Walks the schedule emitting one label per transition. Backwards-compatible — existing single-phase tests still green. Doc fix (user pushback): README + architecture + threat-model no longer imply the Pi5 is the deployment target. Pi5's actual role here is the WireGuard-side collector for episode tarballs. Deployment target is generic ("constrained Linux device"). The "gateway observer" concept remains a deployment pattern, decoupled from the Pi5's collector role. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:53:20 -06:00
Maximus Gorog	064387b7a0	Add v0 orchestrator + first oracle collector (host /proc) End-to-end: ``python -m orchestrator --target-pid <pid> --duration N`` now writes a complete episode directory matching docs/data-model.md, with phase labels, events, and a 10 Hz host /proc telemetry stream. No VM yet — pid is arbitrary so we can validate the loop against e.g. ``sleep 5`` while the lab side comes up. collectors/proc_qemu.py — parses /proc/<pid>/{stat,io,status} (handles parens in comm), single-shot collect_once(), and a stop-event-driven run_loop() that ticks at a fixed cadence and exits when the pid disappears. Tagged ``available_in_deployment: false`` per the threat-model doc. orchestrator/episode.py — EpisodeRunner: creates data/episodes/<ulid>/, atomic meta.json, events.jsonl + labels.jsonl writers, drives the collector in a thread for duration_s, writes done.marker last so the shipper never sees a half-finished episode. orchestrator/ulid.py — tiny 26-char Crockford-base32 ULID generator. Time-sortable, no third-party dep. orchestrator/__main__.py — CLI entry point. Tests (15 new, 28 total green): - proc_qemu: real-ish stat with parens-in-comm, missing /proc/<pid>/io, missing pid, run_loop cadence, run_loop terminates when pid disappears. - episode: full directory shape against os.getpid(), id override, done.marker written after meta.json finalize. - ulid: length+alphabet, 2000-burst uniqueness, time-sortability. Smoke-tested against ``sleep 10``: 16 rows over 1.5s at 100ms cadence, monotonic clock, RSS stable at ~3.5 MiB as expected for an idle sleep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:25 -06:00
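The "handles parens in comm" note in the commit above refers to a well-known pitfall of /proc/<pid>/stat: the command name is wrapped in parentheses and may itself contain spaces or parentheses, so a plain whitespace split misaligns every later field. A minimal sketch of the usual workaround, splitting on the last `)`, follows. The function name `parse_stat` and the returned keys are hypothetical, not the collector's actual interface; the field positions follow the proc(5) numbering.

```python
from typing import Dict, Union


def parse_stat(text: str) -> Dict[str, Union[int, str]]:
    """Parse a /proc/<pid>/stat line, tolerating ')' and spaces inside comm."""
    lparen = text.index("(")
    rparen = text.rindex(")")  # last ')' closes comm, whatever comm contains
    rest = text[rparen + 1:].split()
    # rest[0] is field 3 (state) in proc(5) numbering, so field k is rest[k - 3].
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "utime_ticks": int(rest[14 - 3]),
        "stime_ticks": int(rest[15 - 3]),
        "rss_pages": int(rest[24 - 3]),
    }
```

The tick and page counts are returned raw; converting them to seconds or bytes needs the host's clock-tick rate and page size.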
Maximus Gorog	fa1574a0a6	Scaffold project: docs, repo skeleton, transport + deploy design Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet — schema and topology are decided first so collection can start without rework. Docs: - README: project goal, navigation - architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning - threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy - data-model: per-episode JSONL layout, row schemas, phase enum - transport: WG-native shipper/receiver design, idempotent uploads - deploy: one-command install for lab-host and receiver roles - lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose). Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:21:00 -06:00

21 commits