Two correctness fixes that the §4.5 event-driven labeller surfaced:
1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
(clean → armed → infecting → infected_running → ...) for episodes
with no exploit firing. Pre-§4.5 those episodes wrote dishonest
`infected_running` labels from the schedule clock — exactly the §3
evidence pattern. Post-§4.5 they write `failed` at the infecting
transition (the justifying exploit_fire never arrives), which is
honest about what happened but useless for training.
The honest fix: Tier-2 episodes have a clean-only schedule. All
telemetry tagged `clean` because nothing infected anything. The
total duration matches the canonical Tier-3 schedule so episode
lengths are comparable across tiers — no length-bias in the
dataset (§10).
Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
derives `[("clean", total_seconds)]` from the canonical schedule.
`tier3_schedule_from(schedule)` renders the legacy
`[(name, seconds)]` shape EpisodeConfig still expects.
Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
canonical manifest is the single source of truth (§4.1).
2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
writes /opt/cis490/VERSION so episodes can record code provenance
without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
typically a git checkout on lab hosts (auto-update.sh pulls into
it), so writing VERSION leaves the working tree dirty. Every
episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance
gate's rule 4 would then reject every episode without
CIS490_ALLOW_DIRTY=1 set — which would break the data flow.
Now VERSION is .gitignored: install-lab-host.sh stamps it, git
status doesn't see it, dirty=false, gate rule 4 passes naturally.
These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.
Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
* --manifest, --modules-dir (paths now derived)
* --ram-per-vm-mib (in manifest.experiment)
* --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling)
* --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots)
* --force-tier2 (not a §14 sanctioned override knob —
ship empty catalog to disable Tier-3)
* --require-real-samples (sample-side concern; out of fleet scope)
* tools/run_*_demo.py --manifest (samples path now from canonical)
New surface:
* manifest.toml — the single source of truth
* orchestrator/manifest.py — load_canonical() + Manifest dataclass
with strict validation, raises
ManifestError on any failure
* EpisodeConfig.experiment_meta — populated by run_*_demo.py from
the canonical manifest; stamped
into every episode's meta.json
under "experiment" key for
provenance
* cis490-orchestrator.service — RestartPreventExitStatus=78 so
manifest-load failures stay
stuck-and-loud (§9, §4.7)
* install-lab-host.sh — validates manifest.toml at
install time; missing or invalid
= die with clear message
Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.
Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The §5 step 1 fixes correct the perf collector's stdout/stderr +
event-name parser bugs, but the launchers
(run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so
production episodes still ship with rows_perf=0 — silently disabled
collector, which is exactly the §1 / §4.4 pattern.
Turn it on in both launchers. Failure modes (perf binary missing,
paranoid level too high) are logged as warnings + return 0 rows
visibly, not silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EpisodeConfig.revert_at_start / revert_at_end have been issuing
loadvm "baseline-v1" via QMP since the snapshot/revert wiring landed,
but no part of the system was running savevm — so loadvm targeted a
snapshot that didn't exist and silently emitted snapshot_revert_failed
every time. The reverted-baseline mode was, in effect, dead code.
Both runners now take a savevm immediately after the guest is up
and reachable, before any workload runs:
run_real_vm_demo.py — after SerialClient.login() succeeds (Tier 2)
run_tier3_demo.py — after _wait_for_tcp on the vulnerable port
(Tier 3, before the exploit fires)
Both call qmp.QMPClient.savevm("baseline-v1"). Best-effort: if savevm
fails (older qemu, non-qcow2 disk, KVM nesting issue), we log a
warning and run the episode anyway — just without revert support.
The snapshot_name in EpisodeConfig is unified to "baseline-v1" across
both runners (Tier 3 was previously stamping "qcow2-snapshot-on" into
meta, which didn't match what loadvm would target).
Why both runners take savevm individually instead of a unified path:
the two runners boot different launchers (launch_demo.sh for the
Alpine cidata image, launch_target.sh for the vulnerable target).
Each is responsible for its own QMP socket lifecycle. A shared
savevm helper module would just be a one-line wrapper around the
existing qmp.QMPClient.savevm; not worth the indirection.
Existing test coverage: tests/test_qmp.py exercises
QMPClient.savevm/loadvm against a fake server (HMP wrapper, error
path). The runner-side call is exercised in production but not in
unit tests — would need a fake launcher subprocess, which is outside
this commit's scope.
132/132 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The elliott-lab episode showed every phase median'd 20% CPU because
the in-guest workload silently never fired — and there was no signal
in events.jsonl to detect that from outside, so a trainer would
treat the labels as ground truth and learn "all phases look identical".
This commit closes the audit gap so the failure is visible in meta:
orchestrator/episode.py
EpisodeConfig.sample: Sample | None — the manifest entry that
drove this episode's workload selection. Stamped into meta.sample
as {name, family, category, profile, kind, sha256} so trainers
can join cleanly without re-deriving from events. None means the
v1 yes-loop fallback path ran (and the trainer should treat the
episode with appropriate skepticism).
tools/vm_load_controller.py
VMLoadController gains an emit_event callable. Every phase now
emits a workload_* event into the runner's events.jsonl:
workload_setup login + initial cleanup OK
workload_killed clean / dormant. Dormant carries a
`pre_kill_probe` dict from inside the
guest (`pgrep -c yes`, `pgrep -c sh`,
/proc/loadavg) so the trainer can detect
the elliott-lab failure mode where the
workload never actually ran.
workload_armed armed handshake fired
workload_infecting dd urandom / payload write fired
workload_started infected_running command sent
workload_failed any of the above raised inside SerialClient
(timeout, EOF, partial login). The runner
would have silently swallowed the
exception via its on_phase try/except;
the audit row makes the failure detectable.
Exceptions in shell calls surface as workload_failed events but
do NOT propagate, matching the runner's existing on_phase
contract.
tools/run_real_vm_demo.py
Wires the controller's emit_event to the runner's emit_event via
a small forward-reference closure (controller is built before
runner; runner.emit_event needs to be the sink). Sample also
flows into EpisodeConfig.sample so meta.sample matches what the
controller actually ran.
Tests: 119 (was 106). New cases:
tests/test_vm_load_controller.py (11 tests against a FakeSerial)
- setup emits workload_setup
- infected_running runs the v1 yes-loop AND emits workload_started
- dormant probes BEFORE killing and stamps pre_kill_probe
- dormant probe records "yes=0" (the elliott-lab fingerprint)
- clean / armed / infecting all emit their respective events
- serial.run() exception → workload_failed event, no propagation
- sample-with-profile dispatches to exploits.workloads command
(NOT the v1 yes-loop)
- missing emit_event callback is a no-op (back-compat)
tests/test_episode.py (2 new)
- meta.sample carries name/family/category/profile/kind/sha256
when EpisodeConfig.sample is set
- meta.sample stays null in the v1 fallback path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:
1. orchestrator/fleet.py::_run_slot set
env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
(NO slot suffix), then UNCONDITIONALLY overwrote the env's
RUN_DIR with that flag's value before exec'ing the launcher.
3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
collided on the same socket dir.
4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
rmtree literally deleted slot 0's pidfile + sockets mid-boot.
5. Net effect: one VM survives per wave on a multi-core host that
should be running ~cores-1 in parallel. Throughput collapses
to 1/N.
Fix:
tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
--run-dir default cascade —
1) explicit CLI flag
2) RUN_DIR env (set by fleet runner)
3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0)
Same change in both runners so Tier-2 + Tier-3 fleet waves
parallelize cleanly.
orchestrator/fleet.py::_run_slot:
Pass --run-dir explicitly to the subprocess so the per-slot path
is audit-visible in the fleet log instead of buried in env.
Also flip the subprocess interpreter to repo_root/.venv/bin/python
when present (was /usr/bin/env python3 — worked by luck because
the orchestrator path doesn't import msgpack/httpx, but a Tier-3
fleet wave would have died at import-time on a host without those
in system Python).
etc/cis490-orchestrator.service:
Removed the duplicate [Service] hardening block at the bottom of
the file that was silently overriding the AmbientCapabilities
grant (NoNewPrivileges=true at the bottom flipped the
NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_
ADMIN + CAP_PERFMON before per-episode subprocesses inherit
them). Sources 3 + 4 would have failed silently inside the
sandbox.
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.
106/106 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.
Closesspectral/CIS490#4, #5, #6.
== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
bridge_iface: str | None # e.g. "br-malware"
bridge_ip: str = "10.200.0.1"
pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.
vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.
run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.
== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.
EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.
== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.
samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.
exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.
exploits/driver.py: dispatch order is now:
1) sample.kind == "real" + binary staged at sample_store_root
→ real_binary_workload (Tier 4)
2) profile mimic from workloads.workload_for() (Tier 3 v2)
3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.
samples/store/.gitkeep added (binaries themselves are gitignored).
Tests: 102 pass (was 86). New suites:
tests/test_perf_qemu.py — parser + builder + perf-missing fallback
tests/test_tier4.py — real_binary_workload base64 round-trip,
stop-cmd kills pidfile, per-profile path
isolation, driver dispatch chooses real vs
mimic correctly, fetcher input validation
and cached-fast-path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v1 driver ran ``yes > /dev/null`` for every sample, which
produced the same envelope shape regardless of which malware family
the orchestrator claimed to be running. That's a poor training
signal: the model sees identical /proc + QMP traces tagged
"cryptominer" / "ransomware" / "RAT" with no distinguishing
features. v2 fixes this.
What landed:
exploits/workloads.py — six ``Workload`` profiles, each producing
a distinct in-session shell command pair (start_cmd / stop_cmd)
that backgrounds a profile-shaped loop:
cpu-saturate — sustained 1-vCPU saturation (XMRig shape)
scan-and-dial — periodic SYN-style probes across 10.200.0.0/24
+ dial-home to gateway (Mirai shape)
io-walk — fs traversal + 4 KiB urandom writes, periodic
re-read (ransomware shape)
bursty-c2 — long idle, periodic 3-packet TCP egress burst
(Dridex C2 beacon shape)
low-and-slow — minimal CPU + periodic awk-driven memory churn
(Kovter / fileless shape)
shell-resident — single long-lived TCP socket pinned to gateway
with periodic 6-byte command ticks (RAT shape)
Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so
the stop_cmd can cleanly kill the loop and its descendants.
exploits/driver.py — MSFExploitDriver now accepts an optional
``Sample``. With one supplied, ``infected_running`` dispatches to
the matching workload via exploits.workloads.workload_for(); the
``sample_executed`` event records profile + sample name + sample
kind so the trainer can join cleanly. Without a sample, the v1
yes-loop path remains unchanged (backwards compat).
tools/vm_load_controller.py — the same dispatch on the Tier-2 path
(no exploit, real Alpine guest driven over the serial console).
A fleet wave now produces six visually distinct envelopes per
wave whether the underlying mode is Tier 2 or Tier 3.
tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or
SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent
sockets into the EpisodeConfig so all three new collectors
(sources 2, 4, 5) run alongside source 1 by default.
tools/run_tier3_demo.py — same ``--sample`` plumbing for the
exploit-driven path.
Tests: 86 pass (was 82). New v2 cases:
- profile dispatch routes infected_running to the workload's
start_cmd (NOT the v1 yes-loop) when a Sample is set
- all six profiles produce distinct start_cmds (the property the
ML model needs)
- unknown profile string falls back to cpu-saturate with a warning
- v1 path (no Sample) still uses yes-loop (backwards compat)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.
Components:
vm/launch_demo.sh now boots Alpine 3.21 nocloud-cloudinit
(Cirros 0.6.x's cirros-init blocks on the
EC2 metadata service for ~17 min before
falling through to NoCloud — abandoned).
Mounts a cidata ISO as a second drive.
tools/build_cidata.py pure-Python NoCloud ISO builder (pycdlib).
Sets root password and ssh_pwauth via
runcmd so we don't depend on a specific
cloud-init version's plain_text_passwd
handling.
tools/vm_serial.py serial-console client (stdlib socket).
Idempotent login (detects already-in-shell
state), sentinel-bracketed run() that
distinguishes shell output from the TTY
echo of input by requiring a leading
\r\n boundary on the marker.
tools/vm_load_controller.py in-guest load controller. set_phase()
dispatches the per-phase shell command
over the serial connection.
tools/run_real_vm_demo.py ties it all together: boot VM, wait for
cloud-init runcmd, log in, run the
EpisodeRunner with on_phase=controller,
shut down VM.
Deps: paramiko, pycdlib added.
docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.
README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>