9 commits

Author SHA1 Message Date
Max Gorog
3d4f282e9c Tier-2 episodes use clean-only schedule; .gitignore VERSION
Two correctness fixes that the §4.5 event-driven labeller surfaced:

1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
   (clean → armed → infecting → infected_running → ...) for episodes
   with no exploit firing. Pre-§4.5 those episodes wrote dishonest
   `infected_running` labels from the schedule clock — exactly the §3
   evidence pattern. Post-§4.5 they write `failed` at the infecting
   transition (the justifying exploit_fire never arrives), which is
   honest about what happened but useless for training.

   The honest fix: Tier-2 episodes have a clean-only schedule. All
   telemetry tagged `clean` because nothing infected anything. The
   total duration matches the canonical Tier-3 schedule so episode
   lengths are comparable across tiers — no length-bias in the
   dataset (§10).

   Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
   derives `[("clean", total_seconds)]` from the canonical schedule.
   `tier3_schedule_from(schedule)` renders the legacy
   `[(name, seconds)]` shape EpisodeConfig still expects.

   Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
   Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
   Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
   canonical manifest is the single source of truth (§4.1).

2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
   writes /opt/cis490/VERSION so episodes can record code provenance
   without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
   typically a git checkout on lab hosts (auto-update.sh pulls into
   it), so writing VERSION leaves the working tree dirty and every
   episode records meta.code_version.dirty=true. Rule 4 of the
   PIPELINE.md §4.6 acceptance gate would then reject every episode
   unless CIS490_ALLOW_DIRTY=1 is set, which would break the data flow.

   Now VERSION is .gitignored: install-lab-host.sh stamps it, git
   status doesn't see it, dirty=false, gate rule 4 passes naturally.
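
The Tier-2 derivation in item 1 can be sketched as follows. This is a minimal illustration, assuming the canonical schedule is an ordered list of (phase, seconds) pairs; the real Manifest type and the durations shown are not taken from the repo.

```python
from typing import List, Tuple

# Assumed shape: the canonical schedule as ordered (phase, seconds) pairs.
Schedule = List[Tuple[str, float]]


def tier2_schedule_from(schedule: Schedule) -> Schedule:
    """Collapse a multi-phase schedule into one clean phase of equal length."""
    total_seconds = sum(seconds for _phase, seconds in schedule)
    return [("clean", total_seconds)]


# Hypothetical durations, for illustration only.
canonical = [
    ("clean", 30.0),
    ("armed", 10.0),
    ("infecting", 5.0),
    ("infected_running", 60.0),
    ("dormant", 20.0),
]
tier2 = tier2_schedule_from(canonical)
print(tier2)
```

Because the total is summed from the canonical phases, a Tier-2 episode stays the same length as a Tier-3 one even if the manifest durations change later.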

These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.

286 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:55:37 -05:00
Max Gorog
207a902c3e PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.

Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
  * --manifest, --modules-dir       (paths now derived)
  * --ram-per-vm-mib                (in manifest.experiment)
  * --max-concurrent                (manifest.experiment.fleet.max_concurrent_ceiling)
  * --max-tier3-slots               (manifest.experiment.fleet.max_tier3_slots)
  * --force-tier2                   (not a §14 sanctioned override knob —
                                     ship empty catalog to disable Tier-3)
  * --require-real-samples          (sample-side concern; out of fleet scope)
  * tools/run_*_demo.py --manifest  (samples path now from canonical)

New surface:
  * manifest.toml                   — the single source of truth
  * orchestrator/manifest.py        — load_canonical() + Manifest dataclass
                                      with strict validation, raises
                                      ManifestError on any failure
  * EpisodeConfig.experiment_meta   — populated by run_*_demo.py from
                                      the canonical manifest; stamped
                                      into every episode's meta.json
                                      under "experiment" key for
                                      provenance
  * cis490-orchestrator.service     — RestartPreventExitStatus=78 so
                                      manifest-load failures stay
                                      stuck-and-loud (§9, §4.7)
  * install-lab-host.sh             — validates manifest.toml at
                                      install time; missing or invalid
                                      = die with clear message

Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
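
A rough sketch of the admission rule above, under stated assumptions: `admit_modules`, `run`, and the `on_disk` dict are hypothetical stand-ins for illustration, and the `ManifestError` class here only mirrors the name used in orchestrator/manifest.py.

```python
import sys
from typing import Dict, Iterable

EX_CONFIG = 78  # exit status the service unit is told not to restart on


class ManifestError(Exception):
    """Stand-in for the error raised on any manifest validation failure."""


def admit_modules(admitted: Iterable[str], on_disk: Dict[str, dict]) -> Dict[str, dict]:
    """Keep only modules named in the manifest; a missing toml is fatal."""
    catalog = {}
    for name in admitted:
        if name not in on_disk:
            raise ManifestError(f"manifest.catalog admits {name!r} but no toml was found")
        catalog[name] = on_disk[name]
    return catalog


def run(admitted: Iterable[str], on_disk: Dict[str, dict]) -> int:
    """Map a manifest failure to exit status 78 instead of a traceback."""
    try:
        catalog = admit_modules(admitted, on_disk)
    except ManifestError as exc:
        print(f"manifest error: {exc}", file=sys.stderr)
        return EX_CONFIG
    print(f"admitted modules: {sorted(catalog)}")
    return 0
```

Modules present on disk but absent from the manifest are simply never loaded, which is how an empty catalog disables Tier-3 without a CLI flag.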

Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.

12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.

Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:25:01 -05:00
Max Gorog
ac7b85ff8d PIPELINE §5 step 1 follow-up: enable perf in production launchers
The §5 step 1 fixes correct the perf collector's stdout/stderr +
event-name parser bugs, but the launchers
(run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so
production episodes still ship with rows_perf=0 — silently disabled
collector, which is exactly the §1 / §4.4 pattern.

Turn it on in both launchers. Failure modes (perf binary missing,
paranoid level too high) are logged as warnings + return 0 rows
visibly, not silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:40:37 -05:00
max
642f7a94d6 runners: take savevm baseline-v1 after boot so revert_at_* actually works
EpisodeConfig.revert_at_start / revert_at_end have been issuing
loadvm "baseline-v1" via QMP since the snapshot/revert wiring landed,
but no part of the system was running savevm — so loadvm targeted a
snapshot that didn't exist and silently emitted snapshot_revert_failed
every time. The reverted-baseline mode was, in effect, dead code.

Both runners now take a savevm immediately after the guest is up
and reachable, before any workload runs:

  run_real_vm_demo.py — after SerialClient.login() succeeds (Tier 2)
  run_tier3_demo.py   — after _wait_for_tcp on the vulnerable port
                        (Tier 3, before the exploit fires)

Both call qmp.QMPClient.savevm("baseline-v1"). Best-effort: if savevm
fails (older qemu, non-qcow2 disk, KVM nesting issue), we log a
warning and run the episode anyway — just without revert support.

The snapshot_name in EpisodeConfig is unified to "baseline-v1" across
both runners (Tier 3 was previously stamping "qcow2-snapshot-on" into
meta, which didn't match what loadvm would target).

Why both runners take savevm individually instead of a unified path:
the two runners boot different launchers (launch_demo.sh for the
Alpine cidata image, launch_target.sh for the vulnerable target).
Each is responsible for its own QMP socket lifecycle. A shared
savevm helper module would just be a one-line wrapper around the
existing qmp.QMPClient.savevm; not worth the indirection.

Existing test coverage: tests/test_qmp.py exercises
QMPClient.savevm/loadvm against a fake server (HMP wrapper, error
path). The runner-side call is exercised in production but not in
unit tests — would need a fake launcher subprocess, which is outside
this commit's scope.

132/132 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:37:05 -05:00
max
d86502d950 workload audit trail: meta.sample + per-phase events + pre-kill probe
The elliott-lab episode showed a median of 20% CPU in every phase because
the in-guest workload silently never fired — and there was no signal
in events.jsonl to detect that from outside, so a trainer would
treat the labels as ground truth and learn "all phases look identical".
This commit closes the audit gap so the failure is visible in meta:

orchestrator/episode.py
  EpisodeConfig.sample: Sample | None — the manifest entry that
  drove this episode's workload selection. Stamped into meta.sample
  as {name, family, category, profile, kind, sha256} so trainers
  can join cleanly without re-deriving from events. None means the
  v1 yes-loop fallback path ran (and the trainer should treat the
  episode with appropriate skepticism).

tools/vm_load_controller.py
  VMLoadController gains an emit_event callable. Every phase now
  emits a workload_* event into the runner's events.jsonl:
    workload_setup        login + initial cleanup OK
    workload_killed       clean / dormant. Dormant carries a
                          `pre_kill_probe` dict from inside the
                          guest (`pgrep -c yes`, `pgrep -c sh`,
                          /proc/loadavg) so the trainer can detect
                          the elliott-lab failure mode where the
                          workload never actually ran.
    workload_armed        armed handshake fired
    workload_infecting    dd urandom / payload write fired
    workload_started      infected_running command sent
    workload_failed       any of the above raised inside SerialClient
                          (timeout, EOF, partial login). The runner
                          would have silently swallowed the
                          exception via its on_phase try/except;
                          the audit row makes the failure detectable.
  Exceptions in shell calls surface as workload_failed events but
  do NOT propagate, matching the runner's existing on_phase
  contract.
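
The emit-but-do-not-propagate contract might look roughly like this. `PhaseDispatcher`, `dispatch`, and the payload keys are hypothetical names for illustration, not the controller's actual API.

```python
from typing import Callable, Optional

EmitEvent = Callable[[str, dict], None]


class PhaseDispatcher:
    """Runs one shell command per phase and records the outcome as an event."""

    def __init__(self, run: Callable[[str], str], emit_event: Optional[EmitEvent] = None):
        self._run = run
        # A missing callback degrades to a no-op, for back-compat.
        self._emit = emit_event or (lambda name, payload: None)

    def dispatch(self, phase: str, command: str, ok_event: str) -> None:
        try:
            self._run(command)
        except Exception as exc:
            # Surface the failure in the audit trail, but do not re-raise.
            self._emit("workload_failed", {"phase": phase, "error": repr(exc)})
            return
        self._emit(ok_event, {"phase": phase})


events = []


def flaky_run(command: str) -> str:
    """Fake serial run() that times out on one specific command."""
    if "boom" in command:
        raise TimeoutError("serial timeout")
    return ""


dispatcher = PhaseDispatcher(flaky_run, lambda name, payload: events.append((name, payload)))
dispatcher.dispatch("armed", "true", "workload_armed")
dispatcher.dispatch("infecting", "boom", "workload_infecting")
print([name for name, _payload in events])
```

A trainer reading events.jsonl can then drop or down-weight any episode containing a workload_failed row.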

tools/run_real_vm_demo.py
  Wires the controller's emit_event to the runner's emit_event via
  a small forward-reference closure (controller is built before
  runner; runner.emit_event needs to be the sink). Sample also
  flows into EpisodeConfig.sample so meta.sample matches what the
  controller actually ran.

Tests: 119 (was 106). New cases:
  tests/test_vm_load_controller.py  (11 tests against a FakeSerial)
    - setup emits workload_setup
    - infected_running runs the v1 yes-loop AND emits workload_started
    - dormant probes BEFORE killing and stamps pre_kill_probe
    - dormant probe records "yes=0" (the elliott-lab fingerprint)
    - clean / armed / infecting all emit their respective events
    - serial.run() exception → workload_failed event, no propagation
    - sample-with-profile dispatches to exploits.workloads command
      (NOT the v1 yes-loop)
    - missing emit_event callback is a no-op (back-compat)
  tests/test_episode.py  (2 new)
    - meta.sample carries name/family/category/profile/kind/sha256
      when EpisodeConfig.sample is set
    - meta.sample stays null in the v1 fallback path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:12:34 -05:00
max
8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run
Root cause of the "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:

  1. orchestrator/fleet.py::_run_slot set
     env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
  2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
     (NO slot suffix), then UNCONDITIONALLY overwrote the env's
     RUN_DIR with that flag's value before exec'ing the launcher.
  3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
     collided on the same socket dir.
  4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
     rmtree literally deleted slot 0's pidfile + sockets mid-boot.
  5. Net effect: one VM survives per wave on a multi-core host that
     should be running ~cores-1 in parallel. Throughput collapses
     to 1/N.

Fix:

  tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
    --run-dir default cascade —
      1) explicit CLI flag
      2) RUN_DIR env (set by fleet runner)
      3) /tmp/cis490-vm-<SLOT>  (SLOT from env, default 0)
    Same change in both runners so Tier-2 + Tier-3 fleet waves
    parallelize cleanly.

  orchestrator/fleet.py::_run_slot:
    Pass --run-dir explicitly to the subprocess so the per-slot path
    is audit-visible in the fleet log instead of buried in env.
    Also flip the subprocess interpreter to repo_root/.venv/bin/python
    when present (was /usr/bin/env python3 — worked by luck because
    the orchestrator path doesn't import msgpack/httpx, but a Tier-3
    fleet wave would have died at import-time on a host without those
    in system Python).

  etc/cis490-orchestrator.service:
    Removed the duplicate [Service] hardening block at the bottom of
    the file that was silently overriding the AmbientCapabilities
    grant (NoNewPrivileges=true at the bottom flipped the
    NoNewPrivileges=false at the top, dropping CAP_NET_RAW +
    CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses
    inherit them). Sources 3 + 4 would have failed silently inside
    the sandbox.
    Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.
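
The --run-dir default cascade from the fix can be sketched as below. `resolve_run_dir` is a hypothetical helper name; the precedence order and the path template are the ones listed in the cascade.

```python
import os
from typing import Mapping, Optional


def resolve_run_dir(cli_value: Optional[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the run dir: explicit flag, then RUN_DIR env, then a per-slot default."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    if env.get("RUN_DIR"):
        return env["RUN_DIR"]
    return f"/tmp/cis490-vm-{env.get('SLOT', '0')}"


print(resolve_run_dir(None, {"RUN_DIR": "/tmp/cis490-vm-fleet-2"}))
print(resolve_run_dir(None, {"SLOT": "3"}))
```

The important property is that the flag's default no longer shadows the environment: with no flag given, each fleet slot resolves to a distinct directory.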

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:55:56 -05:00
max
bdcd2ecbef Close out the open issues: bridge pcap wiring, perf collector, Tier-4
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.

Closes spectral/CIS490#4, #5, #6.

== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
  bridge_iface: str | None        # e.g. "br-malware"
  bridge_ip:    str = "10.200.0.1"
  pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.
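
A minimal sketch of 100-ms bucketing, to illustrate the idea only. The field names (`t_ms`, `packets`, `bytes`) and the input shape are assumptions; the real schema lives in docs/data-model.md, and this is not the body of collectors.pcap.bucketize(). Empty buckets are skipped here, which the real collector may or may not do.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

BUCKET_US = 100_000  # 100 ms expressed in microseconds


def bucketize_packets(packets: Iterable[Tuple[int, int]]) -> List[dict]:
    """Aggregate (timestamp_us, length) pairs into 100-ms rows."""
    totals = defaultdict(lambda: [0, 0])  # bucket index -> [packet count, byte count]
    for ts_us, length in packets:
        bucket = ts_us // BUCKET_US
        totals[bucket][0] += 1
        totals[bucket][1] += length
    return [
        {"t_ms": bucket * 100, "packets": count, "bytes": size}
        for bucket, (count, size) in sorted(totals.items())
    ]


rows = bucketize_packets([(0, 60), (50_000, 100), (250_000, 40)])
print(rows)
```

Integer microsecond timestamps keep the bucket boundaries exact, which avoids float rounding at the 100-ms edges.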

vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.

run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.

== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.
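
The derived metrics and the missing-event tolerance can be sketched as follows. `build_row` and the row keys are hypothetical; the sketch assumes the per-event values have already been pulled out of the perf JSON stream as strings keyed by event name.

```python
from typing import Dict, Optional

MISSING = {"<not counted>", "<not supported>"}


def to_count(raw: Optional[str]) -> Optional[float]:
    """Convert a perf counter string to a number, or None when unavailable."""
    if raw is None or raw in MISSING:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Divide only when both sides are present and the denominator is non-zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def build_row(t: float, counters: Dict[str, str]) -> dict:
    """One row per interval; a missing event nulls its metric, not the row."""
    values = {event: to_count(raw) for event, raw in counters.items()}
    return {
        "t": t,
        **values,
        "ipc": ratio(values.get("instructions"), values.get("cycles")),
        "cache_miss_rate": ratio(values.get("cache-misses"), values.get("cache-references")),
    }


row = build_row(0.1, {
    "cycles": "2000",
    "instructions": "1000",
    "cache-references": "<not counted>",
    "cache-misses": "5",
})
print(row["ipc"], row["cache_miss_rate"])
```

Emitting None for an unavailable metric keeps the row count aligned with the other collectors' intervals.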

EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.

== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.
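
The verify-then-stage step with its cached fast path might look roughly like this. `stage_sample` and the `fetch` callable are hypothetical; the download and unzip steps are deliberately left out, and the payload below is harmless test bytes.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stage_sample(sha256: str, fetch: Callable[[], bytes], store_root: Path) -> Path:
    """Land a binary at store_root/<sha256>, fetching only when needed."""
    dest = Path(store_root) / sha256
    if dest.is_file() and sha256_of(dest) == sha256:
        return dest  # already staged and correct: skip the fetch
    payload = fetch()
    if hashlib.sha256(payload).hexdigest() != sha256:
        raise ValueError(f"sha256 mismatch for {sha256}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest


calls = []
data = b"harmless test bytes"
digest = hashlib.sha256(data).hexdigest()


def fake_fetch() -> bytes:
    """Stand-in for the network fetch; counts how often it is invoked."""
    calls.append(1)
    return data


with tempfile.TemporaryDirectory() as root:
    first = stage_sample(digest, fake_fetch, Path(root))
    second = stage_sample(digest, fake_fetch, Path(root))
    same_path = first == second

print(len(calls), same_path)
```

Re-hashing the file on the fast path means a truncated or corrupted earlier download is refetched rather than trusted.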

samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.

exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.

exploits/driver.py: dispatch order is now:
  1) sample.kind == "real" + binary staged at sample_store_root
     → real_binary_workload (Tier 4)
  2) profile mimic from workloads.workload_for() (Tier 3 v2)
  3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.
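
The dispatch order above, reduced to a sketch. The `Sample` dataclass here is a cut-down stand-in, and `choose_workload` with its `staged` set of hashes is hypothetical; the real driver checks the filesystem under sample_store_root.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Sample:
    name: str
    kind: str                      # "real" for Tier-4 samples
    profile: Optional[str] = None
    sha256: Optional[str] = None


def choose_workload(sample: Optional[Sample], staged: Set[str]) -> str:
    """Return which path would run: real binary, profile mimic, or v1 fallback."""
    if sample is not None and sample.kind == "real" and sample.sha256 in staged:
        return "real_binary"
    if sample is not None and sample.profile:
        return f"mimic:{sample.profile}"
    return "v1_yes_loop"


real = Sample("example", "real", "cpu-saturate", "abc123")
print(choose_workload(real, {"abc123"}))
print(choose_workload(real, set()))
print(choose_workload(None, set()))
```

A real sample whose binary has not been fetched yet degrades to its profile mimic instead of failing the episode.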

samples/store/.gitkeep added (binaries themselves are gitignored).

Tests: 102 pass (was 86). New suites:
  tests/test_perf_qemu.py — parser + builder + perf-missing fallback
  tests/test_tier4.py     — real_binary_workload base64 round-trip,
                            stop-cmd kills pidfile, per-profile path
                            isolation, driver dispatch chooses real vs
                            mimic correctly, fetcher input validation
                            and cached-fast-path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:17:49 -05:00
max
b80986d99c Driver v2: sample-profile-driven workloads (Tier-2 + Tier-3)
The v1 driver ran ``yes > /dev/null`` for every sample, which
produced the same envelope shape regardless of which malware family
the orchestrator claimed to be running. That's a poor training
signal: the model sees identical /proc + QMP traces tagged
"cryptominer" / "ransomware" / "RAT" with no distinguishing
features. v2 fixes this.

What landed:

  exploits/workloads.py — six ``Workload`` profiles, each producing
    a distinct in-session shell command pair (start_cmd / stop_cmd)
    that backgrounds a profile-shaped loop:

      cpu-saturate    — sustained 1-vCPU saturation (XMRig shape)
      scan-and-dial   — periodic SYN-style probes across 10.200.0.0/24
                        + dial-home to gateway (Mirai shape)
      io-walk         — fs traversal + 4 KiB urandom writes, periodic
                        re-read (ransomware shape)
      bursty-c2       — long idle, periodic 3-packet TCP egress burst
                        (Dridex C2 beacon shape)
      low-and-slow    — minimal CPU + periodic awk-driven memory churn
                        (Kovter / fileless shape)
      shell-resident  — single long-lived TCP socket pinned to gateway
                        with periodic 6-byte command ticks (RAT shape)

  Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so
  the stop_cmd can cleanly kill the loop and its descendants.

  exploits/driver.py — MSFExploitDriver now accepts an optional
    ``Sample``. With one supplied, ``infected_running`` dispatches to
    the matching workload via exploits.workloads.workload_for(); the
    ``sample_executed`` event records profile + sample name + sample
    kind so the trainer can join cleanly. Without a sample, the v1
    yes-loop path remains unchanged (backwards compat).

  tools/vm_load_controller.py — the same dispatch on the Tier-2 path
    (no exploit, real Alpine guest driven over the serial console).
    A fleet wave now produces six visually distinct envelopes per
    wave whether the underlying mode is Tier 2 or Tier 3.

  tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or
    SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent
    sockets into the EpisodeConfig so all three new collectors
    (sources 2, 4, 5) run alongside source 1 by default.

  tools/run_tier3_demo.py — same ``--sample`` plumbing for the
    exploit-driven path.

Tests: 86 pass (was 82). New v2 cases:
  - profile dispatch routes infected_running to the workload's
    start_cmd (NOT the v1 yes-loop) when a Sample is set
  - all six profiles produce distinct start_cmds (the property the
    ML model needs)
  - unknown profile string falls back to cpu-saturate with a warning
  - v1 path (no Sample) still uses yes-loop (backwards compat)
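
The profile fallback and the per-profile pid/sh pair can be sketched as below. `resolve_profile` and `paths_for` are hypothetical helper names; the profile names and the path template come from the list above.

```python
import logging
from typing import Tuple

log = logging.getLogger("workloads")

PROFILES = (
    "cpu-saturate",
    "scan-and-dial",
    "io-walk",
    "bursty-c2",
    "low-and-slow",
    "shell-resident",
)


def resolve_profile(profile: str) -> str:
    """Unknown profile strings fall back to cpu-saturate, with a warning."""
    if profile in PROFILES:
        return profile
    log.warning("unknown profile %r; falling back to cpu-saturate", profile)
    return "cpu-saturate"


def paths_for(profile: str) -> Tuple[str, str]:
    """Per-profile pidfile and script paths, so stop_cmd hits the right loop."""
    base = f"/tmp/.cis490-workload-{profile}"
    return f"{base}.pid", f"{base}.sh"


print(resolve_profile("no-such-profile"))
print(paths_for("io-walk"))
```

Keying the paths on the profile name is what allows two workloads in the same guest to be stopped independently.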

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:06:15 -05:00
Maximus Gorog
7216ec09bd Tier 2: real Alpine VM, real workload, real envelope
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.

Components:

  vm/launch_demo.sh              now boots Alpine 3.21 nocloud-cloudinit
                                 (Cirros 0.6.x's cirros-init blocks on the
                                 EC2 metadata service for ~17 min before
                                 falling through to NoCloud — abandoned).
                                 Mounts a cidata ISO as a second drive.

  tools/build_cidata.py          pure-Python NoCloud ISO builder (pycdlib).
                                 Sets root password and ssh_pwauth via
                                 runcmd so we don't depend on a specific
                                 cloud-init version's plain_text_passwd
                                 handling.

  tools/vm_serial.py             serial-console client (stdlib socket).
                                 Idempotent login (detects already-in-shell
                                 state), sentinel-bracketed run() that
                                 distinguishes shell output from the TTY
                                 echo of input by requiring a leading
                                 \r\n boundary on the marker.

  tools/vm_load_controller.py    in-guest load controller. set_phase()
                                 dispatches the per-phase shell command
                                 over the serial connection.

  tools/run_real_vm_demo.py      ties it all together: boot VM, wait for
                                 cloud-init runcmd, log in, run the
                                 EpisodeRunner with on_phase=controller,
                                 shut down VM.
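
The sentinel-bracketing idea behind tools/vm_serial.py can be sketched as follows. `extract_output` is a hypothetical function and the marker string is made up; the sketch only shows why requiring a leading \r\n on the marker skips the TTY echo of the typed command.

```python
from typing import Optional


def extract_output(buffer: str, marker: str) -> Optional[str]:
    """Return the text between two markers that each start a new line."""
    boundary = "\r\n" + marker
    start = buffer.find(boundary)
    if start == -1:
        return None
    body_start = start + len(boundary)
    end = buffer.find(boundary, body_start)
    if end == -1:
        return None  # closing marker not received yet
    return buffer[body_start:end].strip("\r\n")


# The echoed command line contains the marker too, but never right after \r\n.
serial_buffer = (
    "# echo MARK; ls; echo MARK"
    "\r\nMARK\r\nfile1\r\nfile2\r\nMARK\r\n# "
)
print(repr(extract_output(serial_buffer, "MARK")))
```

A None result means the caller should keep reading from the socket until the closing marker arrives or a timeout fires.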

Deps: paramiko, pycdlib added.

docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.

README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to  in the tier table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:38:53 -06:00