CIS490

Author	SHA1	Message	Date
Max Gorog	3d4f282e9c	Tier-2 episodes use clean-only schedule; .gitignore VERSION Two correctness fixes that the §4.5 event-driven labeller surfaced: 1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule (clean → armed → infecting → infected_running → ...) for episodes with no exploit firing. Pre-§4.5 those episodes wrote dishonest `infected_running` labels from the schedule clock — exactly the §3 evidence pattern. Post-§4.5 they write `failed` at the infecting transition (the justifying exploit_fire never arrives), which is honest about what happened but useless for training. The honest fix: Tier-2 episodes have a clean-only schedule. All telemetry tagged `clean` because nothing infected anything. The total duration matches the canonical Tier-3 schedule so episode lengths are comparable across tiers — no length-bias in the dataset (§10). Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py derives `[("clean", total_seconds)]` from the canonical schedule. `tier3_schedule_from(schedule)` renders the legacy `[(name, seconds)]` shape EpisodeConfig still expects. Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from. Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from. Drops the hardcoded DEFAULT_SCHEDULE constants from both — the canonical manifest is the single source of truth (§4.1). 2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp writes /opt/cis490/VERSION so episodes can record code provenance without /opt/cis490 carrying a .git directory. But /opt/cis490 IS typically a git checkout on lab hosts (auto-update.sh pulls into it), so writing VERSION leaves the working tree dirty. Every episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance gate's rule 4 would then reject every episode without CIS490_ALLOW_DIRTY=1 set — which would break the data flow. Now VERSION is .gitignored: install-lab-host.sh stamps it, git status doesn't see it, dirty=false, gate rule 4 passes naturally. 
These two changes together keep the data flowing AND honest. Tier-2 episodes pass with `phases=[clean]` + every collector emitting real rows. Tier-3 episodes (none today, empty catalog) walk the full event-driven schedule when a verified module gets re-admitted. 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:55:37 -05:00
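The clean-only derivation described in this commit can be sketched as follows. This is a minimal illustration, not the code in orchestrator/manifest.py: it assumes the canonical schedule is a list of `(name, seconds)` pairs, and the phase durations in the usage note are made up.

```python
from typing import List, Tuple

# Assumed shape: the legacy [(name, seconds)] list the commit mentions.
Schedule = List[Tuple[str, float]]


def tier2_schedule_from(schedule: Schedule) -> Schedule:
    """Collapse a schedule into one clean phase with the same total duration."""
    total_seconds = sum(seconds for _name, seconds in schedule)
    return [("clean", total_seconds)]


# Hypothetical Tier-3-shaped schedule; the real durations live in manifest.toml.
EXAMPLE_SCHEDULE: Schedule = [
    ("clean", 30),
    ("armed", 10),
    ("infecting", 15),
    ("infected_running", 60),
    ("dormant", 20),
    ("clean", 30),
]
```

With the example values, `tier2_schedule_from(EXAMPLE_SCHEDULE)` yields a single `clean` phase of 165 seconds, so a Tier-2 episode is as long as a Tier-3 one.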
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file — manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob — ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run__demo.py --manifest (samples path now from canonical) New surface: manifest.toml — the single source of truth * orchestrator/manifest.py — load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta — populated by run__demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance cis490-orchestrator.service — RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh — validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 
12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
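A rough sketch of the kind of strict validation the new tests exercise. Only `ManifestError` and the failure categories come from the commit message. The function name, its arguments, and the phase list are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


class ManifestError(ValueError):
    """Raised on any manifest validation failure (name taken from the commit)."""


# Assumed phase vocabulary, based on the phases named elsewhere in this log.
KNOWN_PHASES = {"clean", "armed", "infecting", "infected_running", "dormant"}


def validate_experiment(
    phases: Iterable[Tuple[str, float]],
    collectors: List[str],
    ram_per_vm_mib: int,
) -> None:
    """Reject unknown phases, negative durations, duplicate collectors, negative RAM."""
    for name, seconds in phases:
        if name not in KNOWN_PHASES:
            raise ManifestError(f"unknown phase: {name!r}")
        if seconds < 0:
            raise ManifestError(f"negative seconds for phase {name!r}: {seconds}")
    if len(set(collectors)) != len(collectors):
        raise ManifestError("duplicate collector in manifest")
    if ram_per_vm_mib < 0:
        raise ManifestError(f"negative ram_per_vm_mib: {ram_per_vm_mib}")
```

A caller that wants the stuck-and-loud behaviour described above could catch `ManifestError` at startup and exit with status 78, which is the status the service unit is configured not to restart on.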
Max Gorog	ac7b85ff8d	PIPELINE §5 step 1 follow-up: enable perf in production launchers The §5 step 1 fixes correct the perf collector's stdout/stderr + event-name parser bugs, but the launchers (run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so production episodes still ship with rows_perf=0 — silently disabled collector, which is exactly the §1 / §4.4 pattern. Turn it on in both launchers. Failure modes (perf binary missing, paranoid level too high) are logged as warnings + return 0 rows visibly, not silently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:40:37 -05:00
max	642f7a94d6	runners: take savevm baseline-v1 after boot so revert_at_* actually works EpisodeConfig.revert_at_start / revert_at_end have been issuing loadvm "baseline-v1" via QMP since the snapshot/revert wiring landed, but no part of the system was running savevm — so loadvm targeted a snapshot that didn't exist and silently emitted snapshot_revert_failed every time. The reverted-baseline mode was, in effect, dead code. Both runners now take a savevm immediately after the guest is up and reachable, before any workload runs: run_real_vm_demo.py — after SerialClient.login() succeeds (Tier 2) run_tier3_demo.py — after _wait_for_tcp on the vulnerable port (Tier 3, before the exploit fires) Both call qmp.QMPClient.savevm("baseline-v1"). Best-effort: if savevm fails (older qemu, non-qcow2 disk, KVM nesting issue), we log a warning and run the episode anyway — just without revert support. The snapshot_name in EpisodeConfig is unified to "baseline-v1" across both runners (Tier 3 was previously stamping "qcow2-snapshot-on" into meta, which didn't match what loadvm would target). Why both runners take savevm individually instead of a unified path: the two runners boot different launchers (launch_demo.sh for the Alpine cidata image, launch_target.sh for the vulnerable target). Each is responsible for its own QMP socket lifecycle. A shared savevm helper module would just be a one-line wrapper around the existing qmp.QMPClient.savevm; not worth the indirection. Existing test coverage: tests/test_qmp.py exercises QMPClient.savevm/loadvm against a fake server (HMP wrapper, error path). The runner-side call is exercised in production but not in unit tests — would need a fake launcher subprocess, which is outside this commit's scope. 132/132 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:37:05 -05:00
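The best-effort snapshot step described above might look roughly like this. Only the `savevm("baseline-v1")` call shape comes from the commit. The helper name `take_baseline`, the logger name, and the `SnapshotClient` protocol are hypothetical.

```python
import logging
from typing import Protocol

log = logging.getLogger("runner")  # hypothetical logger name


class SnapshotClient(Protocol):
    """Anything with a savevm(name) method, e.g. the repo's QMP client."""

    def savevm(self, name: str) -> None: ...


def take_baseline(qmp: SnapshotClient, name: str = "baseline-v1") -> bool:
    """Try to snapshot the booted guest; warn and carry on if it fails."""
    try:
        qmp.savevm(name)
    except Exception as exc:  # older qemu, non-qcow2 disk, nested KVM, ...
        log.warning("savevm %s failed (%s); running without revert support", name, exc)
        return False
    return True
```

The boolean return lets a runner record whether revert support is actually available for the episode instead of assuming it.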
max	d86502d950	workload audit trail: meta.sample + per-phase events + pre-kill probe The elliott-lab episode showed every phase median'd 20% CPU because the in-guest workload silently never fired — and there was no signal in events.jsonl to detect that from outside, so a trainer would treat the labels as ground truth and learn "all phases look identical". This commit closes the audit gap so the failure is visible in meta: orchestrator/episode.py EpisodeConfig.sample: Sample | None — the manifest entry that drove this episode's workload selection. Stamped into meta.sample as {name, family, category, profile, kind, sha256} so trainers can join cleanly without re-deriving from events. None means the v1 yes-loop fallback path ran (and the trainer should treat the episode with appropriate skepticism). tools/vm_load_controller.py VMLoadController gains an emit_event callable. Every phase now emits a workload_* event into the runner's events.jsonl: workload_setup login + initial cleanup OK workload_killed clean / dormant. Dormant carries a `pre_kill_probe` dict from inside the guest (`pgrep -c yes`, `pgrep -c sh`, /proc/loadavg) so the trainer can detect the elliott-lab failure mode where the workload never actually ran. workload_armed armed handshake fired workload_infecting dd urandom / payload write fired workload_started infected_running command sent workload_failed any of the above raised inside SerialClient (timeout, EOF, partial login). The runner would have silently swallowed the exception via its on_phase try/except; the audit row makes the failure detectable. Exceptions in shell calls surface as workload_failed events but do NOT propagate, matching the runner's existing on_phase contract. tools/run_real_vm_demo.py Wires the controller's emit_event to the runner's emit_event via a small forward-reference closure (controller is built before runner; runner.emit_event needs to be the sink). 
Sample also flows into EpisodeConfig.sample so meta.sample matches what the controller actually ran. Tests: 119 (was 106). New cases: tests/test_vm_load_controller.py (11 tests against a FakeSerial) - setup emits workload_setup - infected_running runs the v1 yes-loop AND emits workload_started - dormant probes BEFORE killing and stamps pre_kill_probe - dormant probe records "yes=0" (the elliott-lab fingerprint) - clean / armed / infecting all emit their respective events - serial.run() exception → workload_failed event, no propagation - sample-with-profile dispatches to exploits.workloads command (NOT the v1 yes-loop) - missing emit_event callback is a no-op (back-compat) tests/test_episode.py (2 new) - meta.sample carries name/family/category/profile/kind/sha256 when EpisodeConfig.sample is set - meta.sample stays null in the v1 fallback path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:12:34 -05:00
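The "emit an audit row, never propagate" contract can be illustrated with a small stand-in. `PhaseAuditor` and its arguments are hypothetical; only the event names and the no-op behaviour for a missing callback come from the commit message.

```python
from typing import Callable, Dict, Optional

Event = Dict[str, object]


class PhaseAuditor:
    """Hypothetical stand-in for the controller's per-phase event emission."""

    def __init__(
        self,
        run_cmd: Callable[[str], str],
        emit_event: Optional[Callable[[Event], None]] = None,
    ) -> None:
        self._run = run_cmd
        # A missing callback degrades to a no-op (back-compat).
        self._emit = emit_event or (lambda event: None)

    def set_phase(self, phase: str, command: str, ok_event: str) -> None:
        """Run the phase command; record success or failure as an event."""
        try:
            self._run(command)
        except Exception as exc:  # timeout, EOF, partial login, ...
            # Surface the failure in the audit trail but do not re-raise.
            self._emit({"event": "workload_failed", "phase": phase, "error": repr(exc)})
            return
        self._emit({"event": ok_event, "phase": phase})
```

A trainer reading events.jsonl can then treat any episode containing a `workload_failed` row as suspect, which was not possible while the exception was swallowed without a trace.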
max	8753340ea3	fleet: fix per-slot run-dir collision so concurrent VMs actually run Root cause of "fleet says max_concurrent=3 but only one episode ships per wave" symptom on elliott-lab: 1. orchestrator/fleet.py::_run_slot set env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot. 2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm (NO slot suffix), then UNCONDITIONALLY overwrote the env's RUN_DIR with that flag's value before exec'ing the launcher. 3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots collided on the same socket dir. 4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's rmtree literally deleted slot 0's pidfile + sockets mid-boot. 5. Net effect: one VM survives per wave on a multi-core host that should be running ~cores-1 in parallel. Throughput collapses to 1/N. Fix: tools/run_real_vm_demo.py + tools/run_tier3_demo.py: --run-dir default cascade — 1) explicit CLI flag 2) RUN_DIR env (set by fleet runner) 3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0) Same change in both runners so Tier-2 + Tier-3 fleet waves parallelize cleanly. orchestrator/fleet.py::_run_slot: Pass --run-dir explicitly to the subprocess so the per-slot path is audit-visible in the fleet log instead of buried in env. Also flip the subprocess interpreter to repo_root/.venv/bin/python when present (was /usr/bin/env python3 — worked by luck because the orchestrator path doesn't import msgpack/httpx, but a Tier-3 fleet wave would have died at import-time on a host without those in system Python). etc/cis490-orchestrator.service: Removed the duplicate [Service] hardening block at the bottom of the file that was silently overriding the AmbientCapabilities grant (NoNewPrivileges=true at the bottom flipped the NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit them). Sources 3 + 4 would have failed silently inside the sandbox. 
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable. 106/106 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:55:56 -05:00
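The three-step `--run-dir` cascade from this fix can be sketched as a single resolver. The function name `resolve_run_dir` is hypothetical, and reading the slot from an environment variable literally named `SLOT` is an assumption based on the wording "SLOT from env".

```python
import os
from typing import Mapping, Optional


def resolve_run_dir(
    cli_run_dir: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the run dir: CLI flag, then RUN_DIR env, then a per-slot default."""
    if cli_run_dir:                      # 1) explicit --run-dir
        return cli_run_dir
    if env.get("RUN_DIR"):               # 2) set by the fleet runner
        return env["RUN_DIR"]
    slot = env.get("SLOT", "0")          # 3) per-slot fallback, slot 0 by default
    return f"/tmp/cis490-vm-{slot}"
```

Because the environment is only consulted when no flag is given, a runner following this order can no longer clobber the fleet's per-slot `RUN_DIR` with a shared default.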
max	bdcd2ecbef	Close out the open issues: bridge pcap wiring, perf collector, Tier-4 Wraps the three remaining 🚧 items from the README so every collector the threat-model promises is actually live, and the Tier-4 path (real-malware fetch + upload + exec) works end-to-end as soon as a sha256 lands in samples/store/. Closes spectral/CIS490#4, #5, #6. == #6 — Bridge pcap wiring == EpisodeConfig grows three optional fields: bridge_iface: str | None # e.g. "br-malware" bridge_ip: str = "10.200.0.1" pcap_snaplen: int = 256 When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration of the schedule (network.pcap), stops it cleanly on episode end, and runs collectors.pcap.bucketize() to produce netflow.jsonl per the 100-ms schema in docs/data-model.md. EpisodeResult + meta.result gain rows_netflow + pcap_bytes counters. vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode and tap+bridge based on $BRIDGE — operator pre-creates the tap as a bridge member, no sudo from the launcher. run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can opt entire waves into pcap mode by exporting BRIDGE before invocation. == #5 — Source 3 perf collector == collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j`` and parses the per-event JSON stream. Aggregates one row per interval across the canonical event set (cycles/instructions/cache-{refs,misses}/ branches/branch-misses/page-faults/context-switches), computes IPC + cache-miss rate. Tolerates missing events (``<not counted>`` / ``<not supported>``) without dropping the row, and skips cleanly when ``perf`` isn't on PATH or the process can't be attached. EpisodeConfig.enable_perf=True opts into the collector — off by default because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When enabled, runs as a parallel thread alongside the other collectors; EpisodeResult.rows_perf records the count. 
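The derived columns mentioned in #5 (IPC and cache-miss rate, tolerant of missing events) could be computed along these lines. This is a sketch only: it assumes one interval has already been parsed into a dict of event name to count, with `None` standing in for `<not counted>` / `<not supported>`. The key names and the `derive_rates` helper are assumptions, not the collector's actual schema.

```python
from typing import Dict, Optional

Counts = Dict[str, Optional[float]]


def derive_rates(counts: Counts) -> Counts:
    """Return the interval row plus ipc and cache_miss_rate (None if not computable)."""

    def ratio(numerator: str, denominator: str) -> Optional[float]:
        n, d = counts.get(numerator), counts.get(denominator)
        if n is None or d is None or d == 0:
            return None  # keep the row, just leave the derived field empty
        return n / d

    row: Counts = dict(counts)
    row["ipc"] = ratio("instructions", "cycles")
    row["cache_miss_rate"] = ratio("cache-misses", "cache-references")
    return row
```

Returning `None` for a derived field rather than dropping the row matches the stated goal of tolerating missing events without losing the interval.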
== #4 — Tier 4 (real-malware fetch + upload + exec) == tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar (API key from env or samples/.bazaar.token), unzips with the standard "infected" password, verifies the resulting binary's sha256, lands at samples/store/<sha256>. Idempotent — already-staged correct binaries return immediately. samples/manifest.py: Sample.binary_path(store_root) resolves to the staged binary path, or None for mimics / not-yet-fetched real samples. exploits/workloads.py: real_binary_workload(bytes, sample) builds a Workload that base64-uploads the binary into the shell session via a heredoc, decodes + chmods + execs it in the background, captures the PID for clean stop on dormant. Per-profile pid/bin paths so concurrent samples in the same guest don't collide. exploits/driver.py: dispatch order is now: 1) sample.kind == "real" + binary staged at sample_store_root → real_binary_workload (Tier 4) 2) profile mimic from workloads.workload_for() (Tier 3 v2) 3) None → driver v1 fallback yes-loop DriverConfig.sample_store_root is the new field; run_tier3_demo.py wires it to repo_root/samples/store. driver_setup event records sample_sha256 so trainers can join Tier-4 episodes against the manifest by hash. samples/store/.gitkeep added (binaries themselves are gitignored). Tests: 102 pass (was 86). New suites: tests/test_perf_qemu.py — parser + builder + perf-missing fallback tests/test_tier4.py — real_binary_workload base64 round-trip, stop-cmd kills pidfile, per-profile path isolation, driver dispatch chooses real vs mimic correctly, fetcher input validation and cached-fast-path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:17:49 -05:00
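The three-way dispatch order in exploits/driver.py can be sketched with a simplified stand-in. The `Sample` dataclass below is a reduced, hypothetical version of the real class, and `choose_workload` returns a label rather than building an actual `Workload`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Sample:
    """Hypothetical, reduced stand-in for samples.manifest.Sample."""

    name: str
    kind: str                      # "real" or a mimic kind
    profile: Optional[str] = None
    sha256: Optional[str] = None

    def binary_path(self, store_root: Path) -> Optional[Path]:
        """Path of the staged binary, or None for mimics / not-yet-fetched samples."""
        if self.kind != "real" or not self.sha256:
            return None
        path = store_root / self.sha256
        return path if path.is_file() else None


def choose_workload(sample: Optional[Sample], store_root: Path) -> str:
    """Mirror the dispatch order: real binary, then profile mimic, then v1 fallback."""
    if sample is not None and sample.kind == "real" and sample.binary_path(store_root):
        return "real_binary"               # 1) Tier 4
    if sample is not None and sample.profile:
        return f"mimic:{sample.profile}"   # 2) Tier 3 v2
    return "v1-yes-loop"                   # 3) fallback
```

A `real` sample whose binary has not been fetched yet falls through to its profile mimic, so the episode still gets a profile-shaped workload.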
max	b80986d99c	Driver v2: sample-profile-driven workloads (Tier-2 + Tier-3) The v1 driver ran ``yes > /dev/null`` for every sample, which produced the same envelope shape regardless of which malware family the orchestrator claimed to be running. That's a poor training signal: the model sees identical /proc + QMP traces tagged "cryptominer" / "ransomware" / "RAT" with no distinguishing features. v2 fixes this. What landed: exploits/workloads.py — six ``Workload`` profiles, each producing a distinct in-session shell command pair (start_cmd / stop_cmd) that backgrounds a profile-shaped loop: cpu-saturate — sustained 1-vCPU saturation (XMRig shape) scan-and-dial — periodic SYN-style probes across 10.200.0.0/24 + dial-home to gateway (Mirai shape) io-walk — fs traversal + 4 KiB urandom writes, periodic re-read (ransomware shape) bursty-c2 — long idle, periodic 3-packet TCP egress burst (Dridex C2 beacon shape) low-and-slow — minimal CPU + periodic awk-driven memory churn (Kovter / fileless shape) shell-resident — single long-lived TCP socket pinned to gateway with periodic 6-byte command ticks (RAT shape) Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so the stop_cmd can cleanly kill the loop and its descendants. exploits/driver.py — MSFExploitDriver now accepts an optional ``Sample``. With one supplied, ``infected_running`` dispatches to the matching workload via exploits.workloads.workload_for(); the ``sample_executed`` event records profile + sample name + sample kind so the trainer can join cleanly. Without a sample, the v1 yes-loop path remains unchanged (backwards compat). tools/vm_load_controller.py — the same dispatch on the Tier-2 path (no exploit, real Alpine guest driven over the serial console). A fleet wave now produces six visually distinct envelopes per wave whether the underlying mode is Tier 2 or Tier 3. 
tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent sockets into the EpisodeConfig so all three new collectors (sources 2, 4, 5) run alongside source 1 by default. tools/run_tier3_demo.py — same ``--sample`` plumbing for the exploit-driven path. Tests: 86 pass (was 82). New v2 cases: - profile dispatch routes infected_running to the workload's start_cmd (NOT the v1 yes-loop) when a Sample is set - all six profiles produce distinct start_cmds (the property the ML model needs) - unknown profile string falls back to cpu-saturate with a warning - v1 path (no Sample) still uses yes-loop (backwards compat) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:06:15 -05:00
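Two small properties from this commit, the unknown-profile fallback and the per-profile pid/sh file pair, can be sketched as below. The profile names and the path pattern come from the message. The function names and logger are hypothetical and do not reproduce `workload_for()` itself.

```python
import logging
from typing import Tuple

log = logging.getLogger("workloads")  # hypothetical logger name

PROFILES = (
    "cpu-saturate",
    "scan-and-dial",
    "io-walk",
    "bursty-c2",
    "low-and-slow",
    "shell-resident",
)


def resolve_profile(profile: str) -> str:
    """Return a known profile, falling back to cpu-saturate with a warning."""
    if profile in PROFILES:
        return profile
    log.warning("unknown workload profile %r; falling back to cpu-saturate", profile)
    return "cpu-saturate"


def workload_paths(profile: str) -> Tuple[str, str]:
    """Per-profile (pidfile, script) pair so the stop command can find its loop."""
    base = f"/tmp/.cis490-workload-{profile}"
    return f"{base}.pid", f"{base}.sh"
```

Keying the pidfile on the profile name is what allows a stop command to kill exactly the loop it started.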
Maximus Gorog	7216ec09bd	Tier 2: real Alpine VM, real workload, real envelope End-to-end now drives a real KVM guest through the full XMRig-shaped phase schedule with the workload running INSIDE the guest. Telemetry is host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven over the serial console at every phase transition. The plotted envelope shows clean idle → armed → infecting (disk spike) → infected_running (100% CPU plateau) → dormant → re-entry → final clean. Components: vm/launch_demo.sh now boots Alpine 3.21 nocloud-cloudinit (Cirros 0.6.x's cirros-init blocks on the EC2 metadata service for ~17 min before falling through to NoCloud — abandoned). Mounts a cidata ISO as a second drive. tools/build_cidata.py pure-Python NoCloud ISO builder (pycdlib). Sets root password and ssh_pwauth via runcmd so we don't depend on a specific cloud-init version's plain_text_passwd handling. tools/vm_serial.py serial-console client (stdlib socket). Idempotent login (detects already-in-shell state), sentinel-bracketed run() that distinguishes shell output from the TTY echo of input by requiring a leading \r\n boundary on the marker. tools/vm_load_controller.py in-guest load controller. set_phase() dispatches the per-phase shell command over the serial connection. tools/run_real_vm_demo.py ties it all together: boot VM, wait for cloud-init runcmd, log in, run the EpisodeRunner with on_phase=controller, shut down VM. Deps: paramiko, pycdlib added. docs/sources.md updated with Alpine cloud image (sha512 pinned), and the new Python deps. README leads with the tier-2 plot now (real VM, real workload). The previous synthetic plot is moved below with explicit "host-side mimic, not a VM" labelling. Tier-2 status flipped to ✅ in the tier table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:38:53 -06:00
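One way to read the sentinel-bracketed `run()` idea: the TTY echoes the typed command line, markers included, but only the real output places each marker directly after a `\r\n`. The sketch below shows that parsing rule. It is an illustration, not the code in tools/vm_serial.py, and the marker strings are made up.

```python
from typing import Optional


def extract_output(buffer: str, begin: str, end: str) -> Optional[str]:
    """Return the text between the two markers, or None if output is incomplete.

    A marker only counts when it follows a CR-LF, which skips the echoed
    command line where the markers appear after 'echo '.
    """
    begin_tag = "\r\n" + begin
    end_tag = "\r\n" + end
    start = buffer.find(begin_tag)
    if start == -1:
        return None
    body_start = start + len(begin_tag)
    stop = buffer.find(end_tag, body_start)
    if stop == -1:
        return None
    body = buffer[body_start:stop]
    # Drop the line break that terminates the begin marker's own line.
    return body[2:] if body.startswith("\r\n") else body


# Hypothetical capture: echoed input first, then the real bracketed output.
RAW = "echo __B__; uname; echo __E__\r\n__B__\r\nLinux\r\n__E__\r\n# "
```

With `RAW` as shown, `extract_output(RAW, "__B__", "__E__")` returns `"Linux"`. A buffer that stops before the end marker returns `None`, so the caller knows to keep reading.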

9 commits