Two correctness fixes that the §4.5 event-driven labeller surfaced:
1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
(clean → armed → infecting → infected_running → ...) for episodes
with no exploit firing. Pre-§4.5 those episodes wrote dishonest
`infected_running` labels from the schedule clock — exactly the §3
evidence pattern. Post-§4.5 they write `failed` at the infecting
transition (the justifying exploit_fire never arrives), which is
honest about what happened but useless for training.
The honest fix: Tier-2 episodes have a clean-only schedule. All
telemetry tagged `clean` because nothing infected anything. The
total duration matches the canonical Tier-3 schedule so episode
lengths are comparable across tiers — no length-bias in the
dataset (§10).
Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
derives `[("clean", total_seconds)]` from the canonical schedule.
`tier3_schedule_from(schedule)` renders the legacy
`[(name, seconds)]` shape EpisodeConfig still expects.
Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
canonical manifest is the single source of truth (§4.1).
2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
writes /opt/cis490/VERSION so episodes can record code provenance
without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
typically a git checkout on lab hosts (auto-update.sh pulls into
it), so writing VERSION leaves the working tree dirty. Every
episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance
gate's rule 4 would then reject every episode without
CIS490_ALLOW_DIRTY=1 set — which would break the data flow.
Now VERSION is .gitignored: install-lab-host.sh stamps it, git
status doesn't see it, dirty=false, gate rule 4 passes naturally.
These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase labels are written ONLY when justifying events arrive. The
schedule clock is now a budget — an upper bound — never a label
source. This is the core honesty fix the §3 evidence demanded:
Before: every Tier-3 episode wrote `infected_running` from the
schedule clock regardless of whether session_open ever
fired. Per §10 every dishonest label is a poisoned
training example. 67/67 of the §3 probe episodes were
poisoned this way.
After: `infecting` writes ONLY when exploit_fire is observed in
events.jsonl. `infected_running` writes ONLY when
session_open is observed. Either timing out or seeing
session_open_timeout terminates the walker with a `failed`
label that the §4.6 acceptance gate will reject.
PHASE_JUSTIFYING_EVENTS in orchestrator/episode.py declares which
events justify which phases:
"clean": None # orchestrator-emitted
"armed": None # orchestrator-emitted
"infecting": ("exploit_fire",)
"infected_running": ("session_open",)
TERMINAL_FAILURE_EVENTS = {"session_open_timeout"} short-circuit any
event-driven wait into a `failed` label.
`dormant` is intentionally OFF the canonical schedule. §4.5 calls
for dormant to be event-driven (session_idle / session_active) too,
but the driver doesn't emit those yet. Per §1 default-to-removal we
ship without dormant rather than label it from the clock; when the
driver gains those emits, dormant re-enters the schedule with
proper justification.
EpisodeRunner now owns:
* `_event_log` — every emit_event appends here
* `_event_cv` — condition variable for waiters
* `_wait_for_event(names, since_t_mono_ns, timeout_s)` — returns
the first matching event in the log
with t_mono >= threshold; threshold
catches events that fired during
the previous on_phase callback.
When an event-driven phase's justifier already arrived (e.g.
exploit_fire emitted by driver._fire() inside on_phase("armed")),
the walker uses the EVENT's t_mono on the label — not the time the
walker noticed. The label means "this is when this thing actually
happened."
manifest.toml: dropped the dormant cycle from the canonical schedule.
Episode is shorter (~30s) but every label is event-justified.
14 new tests in tests/test_event_driven_labeller.py covering: justifier
mapping invariants, _wait_for_event semantics (already-arrived,
future, timeout, since-threshold, first-of-multiple-names), walker
behavior (orchestrator-emitted phases, event-driven phases, missing
event → failed, terminal-failure-event short-circuit, stop event,
event-t_mono on label, phase_transition events with justified_by).
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands
every target ship the same isolation posture (no upstream egress, no
host-shared FS, unprivileged QEMU, fresh snapshot per episode). This
commit lands the infrastructure for both.
New surface:
* orchestrator/target_spec.py
Loads + validates `vm/targets/<name>/spec.toml`. Containment
fields are not knobs — each has exactly ONE safe value, and a
spec asserting the unsafe value is rejected at load time. There's
no `--containment-override`; weakening §4.13 requires amending
PIPELINE.md and operator sign-off.
* tools/build_target.py
Orchestrates build → verify → publish for a single target. Spec
invalid → exit 78 (sysadmin error). build.sh failure → image not
published. verify.sh failure → image discarded; that's the §4.2
acceptance gate. Publishes sha256 + the manifest.toml stanza the
operator copies in to admit the image (§16 substantive amendment
with sign-off per §15).
* vm/targets/<name>/{spec.toml,build.sh,verify.sh}
Template structure. spec.toml is the contract; build.sh produces
$OUT_PATH; verify.sh boots the produced image under the §4.13
containment posture and asserts every promise.
* vm/targets/shellshock/
First real working target. CVE-2014-6271 (Apache mod_cgi + bash
4.2 mis-parsing function-export environment values). Replaces
the SourceForge Metasploitable2 path that §3 evidence proved
unverifiable. Bash 4.2 is built from sha256-pinned GNU source
inside an Alpine 3.21 cloudinit guest; the build script asserts
the produced bash actually triggers shellshock; the verifier
re-asserts it under restrict=on with a real CVE-2014-6271 probe.
* vm/targets/README.md
How operators add a target. Walks the spec → build → verify →
manifest amendment loop.
Containment regression tests (tests/test_containment.py) — 20 new
assertions, parameterized over every target with a build/verify trio:
* verify.sh MUST contain `restrict=on` on its netdev (§4.13)
* verify.sh MUST contain `snapshot=on` on the boot drive (§4.13)
* verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs
* verify.sh + build.sh MUST NOT wrap qemu-system in `sudo`
* Every target must ship the complete spec.toml + build.sh + verify.sh
trio — no half-built targets (§1 default-to-removal)
Spec validation tests (tests/test_target_spec.py): 13 new tests over
spec parse, name/dir mismatch, missing fields, out-of-range port, and
the §4.13 containment field validators (each unsafe value rejected
with a clear error).
The shellshock target's image is NOT yet published to manifest.toml's
[[targets.images]] — that's the §15 sign-off amendment that lands
after a successful operator-driven build_target.py run on a lab host
with KVM. Building takes ~10 min on x86_64; cannot run on the Pi
under TCG. Operator drives the first build, verifies the sha256, then
amends manifest.toml in a follow-up commit.
261 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.
Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
* --manifest, --modules-dir (paths now derived)
* --ram-per-vm-mib (in manifest.experiment)
* --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling)
* --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots)
* --force-tier2 (not a §14 sanctioned override knob —
ship empty catalog to disable Tier-3)
* --require-real-samples (sample-side concern; out of fleet scope)
* tools/run_*_demo.py --manifest (samples path now from canonical)
New surface:
* manifest.toml — the single source of truth
* orchestrator/manifest.py — load_canonical() + Manifest dataclass
with strict validation, raises
ManifestError on any failure
* EpisodeConfig.experiment_meta — populated by run_*_demo.py from
the canonical manifest; stamped
into every episode's meta.json
under "experiment" key for
provenance
* cis490-orchestrator.service — RestartPreventExitStatus=78 so
manifest-load failures stay
stuck-and-loud (§9, §4.7)
* install-lab-host.sh — validates manifest.toml at
install time; missing or invalid
= die with clear message
Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.
Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation on k-gamingcom (commit ac7b85f) showed perf enabled in
production but rows_perf=0 on every episode. Without lifecycle events
the failure mode is indistinguishable from "perf wasn't enabled" — §1
silent-downgrade. The events now surface the actual cause:
- perf_unavailable — binary missing OR launch failed (with reason)
- perf_started — perf is running (pid, events, interval)
- perf_first_row — first row written; counters_populated tells
whether any event was actually counted
- perf_finished — final tally (intervals_seen,
intervals_with_values)
- perf_no_counters — perf was alive but every interval came back
<not counted> (likely paranoid > 2 or PID
ownership mismatch)
`_flush()` now writes a row whenever an interval is observed, even
when every event was <not counted>. The all-None row is honest data
("perf observed this interval and counted nothing"), and the rows
become a count of observed intervals rather than a count of
successful measurements — distinct from rows_proc / rows_qmp which
do count successful measurements. Trainers filter on
`cycles is not None` etc. when they need only populated rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical evidence from k-gamingcom (commit 4ab5477, 2026-05-03 22:20Z
vsftpd_234_backdoor episode): the picker selected vsftpd because BRIDGE
was set on that host. The exploit fires against target_ip=127.0.0.1
(SLIRP loopback) but vsftpd's hardcoded port-6200 backdoor is reachable
only at the guest's bridge IP. Result: session_open_timeout, AND a
schedule-clock-driven `infected_running` label was still written for
the failed exploit — exactly the §10 poisoned-training-example pattern.
Until guest-IP discovery for bridge mode is wired (a separate piece of
infrastructure), bridge-only modules can't actually reach their target
even when the operator sets BRIDGE for Tier-2's pcap source. Revert
the picker to its prior conservative form: drop requires_bridge modules
unconditionally regardless of BRIDGE state. Same for the BRIDGE env
strip in the Tier-3 launch path — it was correct as unconditional.
Replaces the two aspirational tests
(test_fleet_uses_all_modules_when_bridge_set,
test_fleet_propagates_bridge_env_to_runner) with their honest negatives
(test_tier3_drops_requires_bridge_modules_unconditionally,
test_tier3_strips_bridge_env_even_when_set). The previous tests asserted
behavior the rest of the pipeline can't deliver; they were false signals.
229 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.
perf collector (rows_perf=0 on 100% of episodes):
- perf stat -j writes to stderr by default with -p; we read stdout.
Add --log-fd 1 so JSON reaches stdout where the parser sees it.
- Event names come back annotated with the privilege scope perf
actually measured ("cycles:u" under perf_event_paranoid=2). Strip
the suffix so _build_row's plain-name lookups hit. Without this
every metric was None even when perf reported real numbers.
- tests/test_collectors_emit.py covers the regression with a real
busy-loop fixture; emit-test discipline per §4.4.
guest-agent collector (rows_guest=0 on 100% of episodes):
- Alpine cloud image doesn't ship python3, so the in-guest agent's
`#!/usr/bin/env python3` shebang silently fails. Add packages:
[python3] to cidata user-data so cloud-init installs it before
the OpenRC service starts.
- Guest agent now exits nonzero (was: silent stdout fallback) when
/dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
reports the failure to /var/log/cis490-agent.log instead of the
bytes vanishing into the void. Refs §1.
- Host-side collector emits guest_agent_connected /
guest_agent_first_byte / guest_agent_silent_window into the
orchestrator's events.jsonl. Future episodes show the in-guest
failure mode per-episode instead of inferring from rows_guest=0.
k-gamingcom missing qmp/netflow/pcap (also affected elliott on
Tier-3 episodes — was misclassified as host divergence):
- tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
qmp_socket / guest_agent_socket / bridge_iface — even though
launch_target.sh creates the underlying chardevs and BRIDGE
supplies the iface. tools/run_real_vm_demo.py wires them
correctly; Tier-3 had a copy-paste gap.
- tests/test_collectors_emit.py adds a source-grep regression so
the wiring stays honest.
samba_usermap_script never lands session (0/67 in §3 probe):
- Bind handler default WfsDelay (~5s) gives up before bind_perl on
Metasploitable2 has finished forking + binding LPORT under
SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
exploits/driver.py so framework + driver agree on the wait
budget. Add ConnectTimeout=15 so the handler's bind connect has
retry budget instead of one-shot.
orchestrator/fleet.py: usable_modules + BRIDGE handling were both
unconditional, so:
- With BRIDGE set, requires_bridge modules were still being
dropped — picker only ever returned samba_usermap_script across
every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
failure on HEAD).
- env.pop("BRIDGE") fired even when BRIDGE was the operator's
explicit setup, breaking modules that need bridge mode (vsftpd
backdoor on hardcoded port 6200, distccd, etc.).
Both made conditional on bridge_set so the picker walks the full
catalog under bridge mode and SLIRP-only modules still get a
clean SLIRP env when BRIDGE is unset.
receiver/app.py: half-pregnant v2 schema state in HEAD — calling
store.ingest_stream(episode_type=..., benign_profile=...) with
kwargs the matching store.py change was in the WIP stash. Removed
v2 awareness from app.py so v1 episodes (what the producer ships
today) get accepted again. SCHEMA_VERSION default reset to 1 to
match.
229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:
1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.
2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
modules selected on SLIRP runs; fix: always filter requires_bridge.
3. cmd/unix/interact creates no session.list entry → session_open_timeout
every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.
4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
fix: extra_host_port:extra_host_port mapping so guest binds the
per-slot LPORT directly.
5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
fix: requires_bridge=true filters it from SLIRP fleet runs.
6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
boots (~60 s too early); fix: replace TCP probe with serial console
_wait_for_serial_login that waits for actual "login:" prompt.
7. Stale QEMU survives orchestrator restart (start_new_session=True) →
holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
old pidfile before rmtree.
8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.
9. msfrpcd 6.x returns bytes for all string values even with raw=False;
fix: MSFRpcClient._str() recursive decoder applied to all responses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes a real reproducibility gap. Three weeks of bug fixes have
shipped (probe fix in 2707709, multi-signal classifier in 321ea63,
mandatory tier-4 in 265f3ad, etc.); without a per-episode
code_version, trainers can't tell which episodes came from buggy
pre-fix code and have to scan every tarball to guess.
Resolution priority (cached across episodes):
1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes
it at install time since /opt/cis490 is a flat copy with no .git)
2. git rev-parse HEAD from the repo root (dev clones)
3. {"commit": "unknown", source: "unknown"} so the field is always
present (filterable)
Output shape, always present in meta.json:
"code_version": {
"commit": "<40-hex>" | "unknown",
"branch": "<name>" | null,
"dirty": bool | null,
"source": "VERSION-file" | "git" | "unknown"
}
install-lab-host.sh writes VERSION at install time with the source
repo's git rev-parse HEAD + branch + clean-tree flag + install
timestamp. Lab-host agents that pull main + re-run install-lab-host.sh
get a fresh stamp automatically.
148/148 tests pass; test_episode_against_self_pid_produces_full_directory
asserts the field's presence + valid `source` value.
Closes the next batch of issues from the post-mortem. The previous
"each run uses a different vulnerability" commit shipped 5 modules
but 3 of them couldn't actually fire under SLIRP+restrict=on:
their reverse-shell payloads needed a callback channel the launcher
didn't provide, AND their LHOST options were set to {{ target_ip }}
(the target's IP, not the attacker's — copy-paste from RHOSTS).
Same time, the workloads.py shell commands used bash-only /dev/tcp
redirects that silently no-op'd in the busybox shell sessions
Metasploitable2 returns. Net effect: episodes that selected those
modules would have produced session_open_timeout + dead workloads.
Module configs (the three callback ones):
exploits/modules/distccd_command_exec.toml
exploits/modules/php_cgi_arg_injection.toml
exploits/modules/unreal_ircd_3281_backdoor.toml
- Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl
so the target listens on a known port; msfrpcd connects to it
via the host's hostfwd (no callback path required).
- Drop the bogus LHOST = "{{ target_ip }}" — bind shells don't
use LHOST.
- Add [runtime] table:
requires_bridge = true
extra_target_ports = [<bind_lport>]
Both fields are honored by the loader (ModuleConfig.requires_bridge)
and the launcher (TARGET_PORTS gets the extra port hostfwd'd
when BRIDGE mode is active).
orchestrator/fleet.py
When BRIDGE is unset in env, _run_slot filters the module catalog
down to modules where requires_bridge=False before calling
select_module. Two same-socket-shell modules (vsftpd_234_backdoor +
samba_usermap_script) survive — fleet still has variety; just
doesn't pick modules whose payloads can't land. With BRIDGE set,
the full catalog rotates as before, AND BRIDGE is propagated to
the per-slot subprocess env so launch_target.sh enters tap+bridge
mode.
exploits/workloads.py
Replaced bash-only constructs in three profiles:
scan-and-dial /dev/tcp/HOST/PORT redirects → nc -z -w 1
bursty-c2 same fix
shell-resident exec 3<>/dev/tcp/... → piping into nc -w
All three now run cleanly in busybox / dash / Metasploitable2's
default shell. The remaining three profiles (cpu-saturate, io-walk,
low-and-slow) were already busybox-portable.
scripts/install-lab-host.sh
- lab-host.env now defaults BRIDGE=br-malware (was commented out).
Operator opt-out is to comment the line back in.
- New step 6b: provisions br-malware via vm/setup_bridge.sh AND
pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo,
cis490target0..7 for Tier-3 target) all attached to br-malware
and brought up. Launchers reference these by SLOT — no sudo
needed at episode time.
- On bridge-setup failure, the script auto-comments BRIDGE in the
env file with a "auto-disabled: bridge setup failed" note so
the fleet falls back to same-socket modules + Tier-2 cleanly.
tools/cis490_doctor.py
Two new checks for the lab-host role:
bridge: br-malware exists / up
tier3: msfrpcd listening on 127.0.0.1:55553
tier3: module catalog parses (counts same-socket vs requires_bridge)
All three are warn-level — they don't fail an otherwise-healthy
Tier-2-only setup; they tell the operator what's missing for full
Tier-3 + source 4 coverage.
Tests: 132 (was 129). New cases:
test_fleet.py +3
- fleet skips requires_bridge modules when BRIDGE unset (asserted
across 20 episodes; never picks a callback module)
- fleet uses the full catalog when BRIDGE is set
- BRIDGE env propagates to per-slot subprocess
What's still untested live: the bind_perl payloads against a real
Metasploitable2 in the bridge-enabled launcher path. That's a
deployment validation, not a code change. The unit tests confirm
the dispatch / filter logic; the live test is the next operator
action.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the "every run hits the same vulnerability" gap. Before this
commit, the fleet shipped Tier-2 episodes (no exploit at all) with
only the post-infection sample varying. Tier-3 had a single canned
module — vsftpd_234_backdoor — so even when exploit fire was
exercised, the entry vector never changed. Trainer would see one
shape of `armed → infecting` and learn nothing about how varied
real exploits look on the wire / in /proc.
What landed:
exploits/modules/
+ samba_usermap_script.toml CVE-2007-2447, SMB:139
+ distccd_command_exec.toml CVE-2004-2687, distcc:3632
+ php_cgi_arg_injection.toml CVE-2012-1823, http:80
+ unreal_ircd_3281_backdoor.toml CVE-2010-2075, ircd:6667
(vsftpd_234_backdoor.toml unchanged)
All five are canonical Metasploitable2 vectors with stable
Metasploit modules. Each TOML carries the RPORT the launcher
needs to wire its hostfwd at, plus a payload tuned to a clean
shell session (cmd/unix/interact for in-band shells,
cmd/unix/reverse* with deterministic LPORTs for reverse shells).
exploits/modules.py
+ select_module(catalog, host_id, slot, episode_index) — same
SHA-256-keyed deterministic selection shape SampleManifest uses
for samples. Two hosts at the same slot/episode hash to
different modules; one host walks the full catalog within
~len(catalog) episodes.
+ module_target_port() — pulls RPORT off the module config so
the fleet can plumb the launcher's hostfwd at the right service.
orchestrator/fleet.py
- _run_slot now decides Tier 3 vs Tier 2 from msfrpcd reachability
+ module-catalog populated. Default is Tier 3 when both are true;
Tier 2 fallback when not (logged + recorded in SlotResult.tier
so trainers can filter no-exploit episodes).
- Per-slot module via select_module() — each concurrent slot in a
wave gets a different vector AND a different sample.
- PORT_BASE per slot (target_port + slot * 1000) so concurrent
Tier-3 targets don't collide on the host-side hostfwd port.
- _msfrpcd_available() probe gates the dispatch.
- Fleet-side log line records (slot, ep, tier, sample, module,
run_dir) so the operator can see at a glance what each wave is
exercising.
- SlotResult grows tier + module_name fields; FleetConfig grows
modules + force_tier2 + msfrpcd_{host,port} fields.
orchestrator/episode.py
+ EpisodeConfig.exploit_meta — plain dict the runner stamps into
meta.exploit so every Tier-3 episode records {framework,
module path, module type, payload, RPORT, RHOSTS template}.
Trainers join on meta.exploit.module_name to stratify by entry
vector; meta.sample.name to stratify by post-infection family.
tools/run_tier3_demo.py
+ Builds exploit_meta from the loaded ModuleConfig and passes it
to EpisodeConfig. Sample is now also passed (was missing).
tools/run_fleet.py
+ --modules-dir (default exploits/modules/) — load module catalog
on startup; pass to FleetConfig.
+ --force-tier2 — escape hatch for dev / smoke tests.
+ JSON output now includes per-slot {tier, module} so the operator
can see at a glance what each slot ran without grepping logs.
Tests: 129 (was 119). New cases:
test_exploits.py +6
- catalog has at least the five canonical Metasploitable2 vectors
- select_module is deterministic per (host, slot, ep)
- select_module diversifies across hosts
- select_module walks the full catalog over many episodes
- module_target_port pulls RPORT for each shipped TOML
test_fleet.py +4
- _run_slot dispatches to run_tier3_demo.py when msfrpcd up
- falls back to run_real_vm_demo.py when msfrpcd unreachable
- falls back when module catalog empty
- --force-tier2 overrides msfrpcd availability
- PORT_BASE is unique per concurrent slot (no hostfwd collision)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The elliott-lab episode showed every phase median'd 20% CPU because
the in-guest workload silently never fired — and there was no signal
in events.jsonl to detect that from outside, so a trainer would
treat the labels as ground truth and learn "all phases look identical".
This commit closes the audit gap so the failure is visible in meta:
orchestrator/episode.py
EpisodeConfig.sample: Sample | None — the manifest entry that
drove this episode's workload selection. Stamped into meta.sample
as {name, family, category, profile, kind, sha256} so trainers
can join cleanly without re-deriving from events. None means the
v1 yes-loop fallback path ran (and the trainer should treat the
episode with appropriate skepticism).
tools/vm_load_controller.py
VMLoadController gains an emit_event callable. Every phase now
emits a workload_* event into the runner's events.jsonl:
workload_setup login + initial cleanup OK
workload_killed clean / dormant. Dormant carries a
`pre_kill_probe` dict from inside the
guest (`pgrep -c yes`, `pgrep -c sh`,
/proc/loadavg) so the trainer can detect
the elliott-lab failure mode where the
workload never actually ran.
workload_armed armed handshake fired
workload_infecting dd urandom / payload write fired
workload_started infected_running command sent
workload_failed any of the above raised inside SerialClient
(timeout, EOF, partial login). The runner
would have silently swallowed the
exception via its on_phase try/except;
the audit row makes the failure detectable.
Exceptions in shell calls surface as workload_failed events but
do NOT propagate, matching the runner's existing on_phase
contract.
tools/run_real_vm_demo.py
Wires the controller's emit_event to the runner's emit_event via
a small forward-reference closure (controller is built before
runner; runner.emit_event needs to be the sink). Sample also
flows into EpisodeConfig.sample so meta.sample matches what the
controller actually ran.
Tests: 119 (was 106). New cases:
tests/test_vm_load_controller.py (11 tests against a FakeSerial)
- setup emits workload_setup
- infected_running runs the v1 yes-loop AND emits workload_started
- dormant probes BEFORE killing and stamps pre_kill_probe
- dormant probe records "yes=0" (the elliott-lab fingerprint)
- clean / armed / infecting all emit their respective events
- serial.run() exception → workload_failed event, no propagation
- sample-with-profile dispatches to exploits.workloads command
(NOT the v1 yes-loop)
- missing emit_event callback is a no-op (back-compat)
tests/test_episode.py (2 new)
- meta.sample carries name/family/category/profile/kind/sha256
when EpisodeConfig.sample is set
- meta.sample stays null in the v1 fallback path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:
1. orchestrator/fleet.py::_run_slot set
env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
(NO slot suffix), then UNCONDITIONALLY overwrote the env's
RUN_DIR with that flag's value before exec'ing the launcher.
3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
collided on the same socket dir.
4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
rmtree literally deleted slot 0's pidfile + sockets mid-boot.
5. Net effect: one VM survives per wave on a multi-core host that
should be running ~cores-1 in parallel. Throughput collapses
to 1/N.
Fix:
tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
--run-dir default cascade —
1) explicit CLI flag
2) RUN_DIR env (set by fleet runner)
3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0)
Same change in both runners so Tier-2 + Tier-3 fleet waves
parallelize cleanly.
orchestrator/fleet.py::_run_slot:
Pass --run-dir explicitly to the subprocess so the per-slot path
is audit-visible in the fleet log instead of buried in env.
Also flip the subprocess interpreter to repo_root/.venv/bin/python
when present (was /usr/bin/env python3 — worked by luck because
the orchestrator path doesn't import msgpack/httpx, but a Tier-3
fleet wave would have died at import-time on a host without those
in system Python).
etc/cis490-orchestrator.service:
Removed the duplicate [Service] hardening block at the bottom of
the file that was silently overriding the AmbientCapabilities
grant (NoNewPrivileges=true at the bottom flipped the
NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_
ADMIN + CAP_PERFMON before per-episode subprocesses inherit
them). Sources 3 + 4 would have failed silently inside the
sandbox.
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.
106/106 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On slower disks (Pi5 SD cards, mu's hardware) the json.dump → write →
os.replace path inside _write_meta takes more than 1 ms, so when the
snapshot_load event fired afterwards its t_mono_ns drifted past the
"<1 ms after origin" assertion in test_driver_events_persist_to_events_jsonl.
Fix: emit snapshot_load immediately after setting _t_mono_origin_ns,
before any file I/O. Matches the semantic intent (snapshot_load marks
episode clock = 0) and removes the disk-speed dependency from the
event timeline.
Diagnosis + suggested patch from spectral/CIS490#7 (filed by mu).
Closesspectral/CIS490#7.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the gaps surfaced in the "what is not implemented" audit so the
fleet really is shippable end-to-end. Verified live on the Pi:
- cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the
new wg-pki client CA leaf
- real episode dir → tar+zstd → PUT → HTTP 201 stored
- re-ship same bytes → 200 (idempotent)
- re-ship different bytes under same id → 409 (conflict)
Changes:
orchestrator/episode.py
- EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/
revert per docs/architecture.md). When set + qmp_socket present,
EpisodeRunner issues loadvm <snapshot_name> and emits
snapshot_revert / snapshot_revert_failed events on the same
monotonic clock as everything else.
collectors/qmp.py
- savevm() / loadvm() helpers using human-monitor-command, plus a
test against the fake QMP server.
exploits/workloads.py
- chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB
base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer-
busting payload. Includes a finalize step that sha256-verifies on
the guest before exec.
- real_binary_workload() now wraps the chunked plan for backwards
compat with single-shot callers.
exploits/driver.py
- Tier-4 dispatch walks the chunked plan in MSFExploitDriver:
each chunk is a separate session_shell_write; finalize verifies;
exec only runs on sha-ok. New events: real_binary_upload_begin,
real_binary_verify, real_binary_aborted.
etc/cis490-orchestrator.service
- Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE).
- Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) +
CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors
work under hardening.
scripts/install-lab-host.sh
- Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID
defaulting to `hostname -s`.
- Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and
builds cidata.iso with the in-guest agent embedded; symlinks both
into /opt/cis490/vm/images/ so launchers find them.
scripts/fetch-alpine-baseline.sh
- Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2
matching the sha512 in docs/sources.md.
tools/plot_envelope.py
- Rebuilt to render whatever telemetry the episode dir contains:
proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs →
guest agent load/mem. Missing sources are silently skipped.
tools/index_reader.py
- cis490-index CLI: filter receiver's index.jsonl by host / sample
/ time range, sort, count-by group. Closest thing to a query
interface until we stand up Postgres/Timescale.
samples/README.md
- Rewritten to match the new manifest schema, the kind=real vs mimic
split, the per-(host, slot, ep) selection mechanic, and the
chunked-upload safety story.
Tests: 106 pass (was 102). New cases:
- test_qmp.py — savevm + loadvm (HMP wrapper + error path)
- test_tier4.py — chunked plan splitting, sha-pinned finalize,
end-to-end driver walks all chunks + verify + exec via the fake
msfrpc client
Closes the "what is not implemented" punch list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.
Closesspectral/CIS490#4, #5, #6.
== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
bridge_iface: str | None # e.g. "br-malware"
bridge_ip: str = "10.200.0.1"
pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.
vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.
run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.
== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.
EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.
== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.
samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.
exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.
exploits/driver.py: dispatch order is now:
1) sample.kind == "real" + binary staged at sample_store_root
→ real_binary_workload (Tier 4)
2) profile mimic from workloads.workload_for() (Tier 3 v2)
3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.
samples/store/.gitkeep added (binaries themselves are gitignored).
Tests: 102 pass (was 86). New suites:
tests/test_perf_qemu.py — parser + builder + perf-missing fallback
tests/test_tier4.py — real_binary_workload base64 round-trip,
stop-cmd kills pidfile, per-profile path
isolation, driver dispatch chooses real vs
mimic correctly, fetcher input validation
and cached-fast-path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.
Collectors landed:
collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP
client + row builder + run loop. Tolerates
older qemu without query-stats.
collectors/guest_agent.py — source 5 (deployable). Reads the
virtio-serial host-side socket, parses
agent JSON-lines, re-stamps to the host
monotonic clock, persists.
collectors/pcap.py — source 4 (deployable). tcpdump capture
+ pure-Python pcap reader + 100 ms
netflow.jsonl bucketizer. Decodes
Ethernet/IPv4/TCP/UDP enough for the
schema in docs/data-model.md.
In-guest agent:
vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
/proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
tools/build_cidata.py — embeds the agent + an OpenRC service into
user-data so first boot of the Alpine cidata image auto-starts it.
Launchers:
vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
the agent socket; SLOT env support so multiple VMs run without
socket / port collisions; PORT_BASE on launch_target so multiple
target VMs hostfwd different host ports.
vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
no NAT). Idempotent.
Fleet:
orchestrator/fleet.py — capacity detector (cores / RAM / load
headroom) + concurrent-slot runner. Per-slot ENV selects the
sample. FleetCapacity dataclass round-trips into meta.json so
"this episode ran with 6 concurrent VMs" is auditable post-hoc.
tools/run_fleet.py — CLI: --capacity report; --waves N runs N
waves of (max_concurrent) episodes each, every slot with a
different sample.
etc/cis490-orchestrator.service — now drives the fleet runner with
Restart=always so each invocation runs one wave and respawns,
giving a continuous stream.
Samples:
samples/manifest.toml — six profiles spanning the five major
behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
samples/manifest.py — strict TOML loader (rejects dups, unknown
categories) + deterministic select(host_id, slot, episode_index)
so different hosts on the network walk the catalog in different
orders without any coordinator.
EpisodeRunner:
orchestrator/episode.py — optional qmp_socket + guest_agent_socket
fields on EpisodeConfig; when set, additional collector threads
run alongside proc_qemu. EpisodeResult now carries rows_qmp +
rows_guest counters.
Tier-3 setup automation:
scripts/install-msfrpcd.sh — installs metasploit-framework where
the package manager has it, generates a strong password into
/etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
once MSFRPC_PASSWORD is sourced.
scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
from the operator (Rapid7 download is registration-walled), pulls,
verifies, converts vmdk → qcow2, lands at vm/images/.
Tests: 82 pass (was 51). New suites:
tests/test_qmp.py — fake QMP server, capability handshake,
blockstats, async-event interleaving,
5-failure backoff
tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
re-stamp, malformed-line tolerance
tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames,
bucketize correctness across windows
tests/test_fleet.py — capacity math (8-core idle / low-RAM /
high-load / Pi5 / 1-core box), manifest
selection determinism + diversity
What's queued for the next commit (already discussed in convo):
- MSFExploitDriver v2: map sample.profile → distinct in-session
workload so Tier-3 episodes don't all produce the same yes-loop
envelope. Critical for ML to learn varied malware shapes.
- Real-sample fetch from MalwareBazaar by sha256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.
What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
module.execute, job/session lifecycle) so we don't depend on a
third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
TOML module configs with {{ target_ip }} placeholders, replacing the
imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
intentionally-vulnerable target VM (host can reach guest via
hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
including an integration test that drives a real EpisodeRunner.
Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
before run() (driver_setup fires pre-schedule).
Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.
Components:
- tools/load_mimic.py phase-driven load generator. Reads phase
commands on stdin; CPU/IO behavior matches
the named phase (clean=idle, armed=light burst,
infecting=disk burst+CPU, infected_running=
CPU saturation+stratum-shaped writes,
dormant=quieter than clean).
- tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with
a default 85s schedule that includes the
classic infected_running → dormant → re-entry
pattern.
- tools/plot_envelope.py reads telemetry + labels from an episode dir,
writes envelope.png with colored phase bands.
orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.
Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end: ``python -m orchestrator --target-pid <pid> --duration N`` now
writes a complete episode directory matching docs/data-model.md, with phase
labels, events, and a 10 Hz host /proc telemetry stream. No VM yet — pid is
arbitrary so we can validate the loop against e.g. ``sleep 5`` while the lab
side comes up.
collectors/proc_qemu.py — parses /proc/<pid>/{stat,io,status} (handles parens
in comm), single-shot collect_once(), and a stop-event-driven run_loop()
that ticks at a fixed cadence and exits when the pid disappears. Tagged
``available_in_deployment: false`` per the threat-model doc.
orchestrator/episode.py — EpisodeRunner: creates data/episodes/<ulid>/,
atomic meta.json, events.jsonl + labels.jsonl writers, drives the collector
in a thread for duration_s, writes done.marker last so the shipper never
sees a half-finished episode.
orchestrator/ulid.py — tiny 26-char Crockford-base32 ULID generator.
Time-sortable, no third-party dep.
orchestrator/__main__.py — CLI entry point.
Tests (15 new, 28 total green):
- proc_qemu: real-ish stat with parens-in-comm, missing /proc/<pid>/io,
missing pid, run_loop cadence, run_loop terminates when pid disappears.
- episode: full directory shape against os.getpid(), id override,
done.marker written after meta.json finalize.
- ulid: length+alphabet, 2000-burst uniqueness, time-sortability.
Smoke-tested against ``sleep 10``: 16 rows over 1.5s at 100ms cadence,
monotonic clock, RSS stable at ~3.5 MiB as expected for an idle sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>