Commit graph

9 commits

Author SHA1 Message Date
Max Gorog
4ab5477226 PIPELINE §5 step 1: fix four root-cause defects
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
  - perf stat -j writes to stderr by default with -p; we read stdout.
    Add --log-fd 1 so JSON reaches stdout where the parser sees it.
  - Event names come back annotated with the privilege scope perf
    actually measured ("cycles:u" under perf_event_paranoid=2). Strip
    the suffix so _build_row's plain-name lookups hit. Without this
    every metric was None even when perf reported real numbers.
  - tests/test_collectors_emit.py covers the regression with a real
    busy-loop fixture; emit-test discipline per §4.4.

guest-agent collector (rows_guest=0 on 100% of episodes):
  - Alpine cloud image doesn't ship python3, so the in-guest agent's
    `#!/usr/bin/env python3` shebang silently fails. Add packages:
    [python3] to cidata user-data so cloud-init installs it before
    the OpenRC service starts.
  - Guest agent now exits nonzero (was: silent stdout fallback) when
    /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
    reports the failure to /var/log/cis490-agent.log instead of the
    bytes vanishing into the void. Refs §1.
  - Host-side collector emits guest_agent_connected /
    guest_agent_first_byte / guest_agent_silent_window into the
    orchestrator's events.jsonl. Future episodes show the in-guest
    failure mode per-episode instead of inferring from rows_guest=0.

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
  Tier-3 episodes — was misclassified as host divergence):
  - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
    qmp_socket / guest_agent_socket / bridge_iface — even though
    launch_target.sh creates the underlying chardevs and BRIDGE
    supplies the iface. tools/run_real_vm_demo.py wires them
    correctly; Tier-3 had a copy-paste gap.
  - tests/test_collectors_emit.py adds a source-grep regression so
    the wiring stays honest.

samba_usermap_script never lands session (0/67 in §3 probe):
  - Bind handler default WfsDelay (~5s) gives up before bind_perl on
    Metasploitable2 has finished forking + binding LPORT under
    SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
    exploits/driver.py so framework + driver agree on the wait
    budget. Add ConnectTimeout=15 so the handler's bind connect has
    retry budget instead of one-shot.

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
  unconditional, so:
  - With BRIDGE set, requires_bridge modules were still being
    dropped — picker only ever returned samba_usermap_script across
    every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
    failure on HEAD).
  - env.pop("BRIDGE") fired even when BRIDGE was the operator's
    explicit setup, breaking modules that need bridge mode (vsftpd
    backdoor on hardcoded port 6200, distccd, etc.).
  Both made conditional on bridge_set so the picker walks the full
  catalog under bridge mode and SLIRP-only modules still get a
  clean SLIRP env when BRIDGE is unset.

receiver/app.py: half-pregnant v2 schema state in HEAD — calling
  store.ingest_stream(episode_type=..., benign_profile=...) with
  kwargs the matching store.py change was in the WIP stash. Removed
  v2 awareness from app.py so v1 episodes (what the producer ships
  today) get accepted again. SCHEMA_VERSION default reset to 1 to
  match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:05:25 -05:00
1bac2f0135 run_tier3_demo: replace serial probe with min-wait + TCP probe
The serial console approach failed: Metasploitable2's kernel is not
configured with console=ttyS0, so only GRUB output reaches the QEMU
serial socket; the OS boot and login prompt never appear there.

New approach:
1. Sleep _METASPLOITABLE2_MIN_BOOT_S (65 s) after QEMU writes its
   pidfile. By this point the guest kernel and init are always up.
2. Call _wait_for_tcp with a 3 s recv timeout. Post-floor, SLIRP has
   forwarded the connection to the guest TCP stack, so:
   - socket.timeout → service listening, waiting for client data ✓
   - OSError/RST    → port still closed (service not ready); retry ✓
   Eliminates the early-boot false-positive that caused exploits to
   fire ~60 s before Samba was actually listening.

Also update TIER3-BRINGUP.md bug 6 to reflect the correct final fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:38:22 -06:00
667f042707 Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01)
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:

1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
   instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.

2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
   modules selected on SLIRP runs; fix: always filter requires_bridge.

3. cmd/unix/interact creates no session.list entry → session_open_timeout
   every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.

4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
   fix: extra_host_port:extra_host_port mapping so guest binds the
   per-slot LPORT directly.

5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
   fix: requires_bridge=true filters it from SLIRP fleet runs.

6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
   boots (~60 s too early); fix: replace TCP probe with serial console
   _wait_for_serial_login that waits for actual "login:" prompt.

7. Stale QEMU survives orchestrator restart (start_new_session=True) →
   holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
   old pidfile before rmtree.

8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.

9. msfrpcd 6.x returns bytes for all string values even with raw=False;
   fix: MSFRpcClient._str() recursive decoder applied to all responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:26:19 -06:00
max
642f7a94d6 runners: take savevm baseline-v1 after boot so revert_at_* actually works
EpisodeConfig.revert_at_start / revert_at_end have been issuing
loadvm "baseline-v1" via QMP since the snapshot/revert wiring landed,
but no part of the system was running savevm — so loadvm targeted a
snapshot that didn't exist and silently emitted snapshot_revert_failed
every time. The reverted-baseline mode was, in effect, dead code.

Both runners now take a savevm immediately after the guest is up
and reachable, before any workload runs:

  run_real_vm_demo.py — after SerialClient.login() succeeds (Tier 2)
  run_tier3_demo.py   — after _wait_for_tcp on the vulnerable port
                        (Tier 3, before the exploit fires)

Both call qmp.QMPClient.savevm("baseline-v1"). Best-effort: if savevm
fails (older qemu, non-qcow2 disk, KVM nesting issue), we log a
warning and run the episode anyway — just without revert support.

The snapshot_name in EpisodeConfig is unified to "baseline-v1" across
both runners (Tier 3 was previously stamping "qcow2-snapshot-on" into
meta, which didn't match what loadvm would target).

Why both runners take savevm individually instead of a unified path:
the two runners boot different launchers (launch_demo.sh for the
Alpine cidata image, launch_target.sh for the vulnerable target).
Each is responsible for its own QMP socket lifecycle. A shared
savevm helper module would just be a one-line wrapper around the
existing qmp.QMPClient.savevm; not worth the indirection.

Existing test coverage: tests/test_qmp.py exercises
QMPClient.savevm/loadvm against a fake server (HMP wrapper, error
path). The runner-side call is exercised in production but not in
unit tests — would need a fake launcher subprocess, which is outside
this commit's scope.

132/132 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:37:05 -05:00
max
a193d17ead fleet: rotate exploit modules per (host, slot, ep); Tier 3 by default
Closes the "every run hits the same vulnerability" gap. Before this
commit, the fleet shipped Tier-2 episodes (no exploit at all) with
only the post-infection sample varying. Tier-3 had a single canned
module — vsftpd_234_backdoor — so even when exploit fire was
exercised, the entry vector never changed. Trainer would see one
shape of `armed → infecting` and learn nothing about how varied
real exploits look on the wire / in /proc.

What landed:

exploits/modules/
  + samba_usermap_script.toml          CVE-2007-2447, SMB:139
  + distccd_command_exec.toml          CVE-2004-2687, distcc:3632
  + php_cgi_arg_injection.toml         CVE-2012-1823, http:80
  + unreal_ircd_3281_backdoor.toml     CVE-2010-2075, ircd:6667
  (vsftpd_234_backdoor.toml unchanged)
  All five are canonical Metasploitable2 vectors with stable
  Metasploit modules. Each TOML carries the RPORT the launcher
  needs to wire its hostfwd at, plus a payload tuned to a clean
  shell session (cmd/unix/interact for in-band shells,
  cmd/unix/reverse* with deterministic LPORTs for reverse shells).

exploits/modules.py
  + select_module(catalog, host_id, slot, episode_index) — same
    SHA-256-keyed deterministic selection shape SampleManifest uses
    for samples. Two hosts at the same slot/episode hash to
    different modules; one host walks the full catalog within
    ~len(catalog) episodes.
  + module_target_port() — pulls RPORT off the module config so
    the fleet can plumb the launcher's hostfwd at the right service.

orchestrator/fleet.py
  - _run_slot now decides Tier 3 vs Tier 2 from msfrpcd reachability
    + module-catalog populated. Default is Tier 3 when both are true;
    Tier 2 fallback when not (logged + recorded in SlotResult.tier
    so trainers can filter no-exploit episodes).
  - Per-slot module via select_module() — each concurrent slot in a
    wave gets a different vector AND a different sample.
  - PORT_BASE per slot (target_port + slot * 1000) so concurrent
    Tier-3 targets don't collide on the host-side hostfwd port.
  - _msfrpcd_available() probe gates the dispatch.
  - Fleet-side log line records (slot, ep, tier, sample, module,
    run_dir) so the operator can see at a glance what each wave is
    exercising.
  - SlotResult grows tier + module_name fields; FleetConfig grows
    modules + force_tier2 + msfrpcd_{host,port} fields.

orchestrator/episode.py
  + EpisodeConfig.exploit_meta — plain dict the runner stamps into
    meta.exploit so every Tier-3 episode records {framework,
    module path, module type, payload, RPORT, RHOSTS template}.
    Trainers join on meta.exploit.module_name to stratify by entry
    vector; meta.sample.name to stratify by post-infection family.

tools/run_tier3_demo.py
  + Builds exploit_meta from the loaded ModuleConfig and passes it
    to EpisodeConfig. Sample is now also passed (was missing).

tools/run_fleet.py
  + --modules-dir (default exploits/modules/) — load module catalog
    on startup; pass to FleetConfig.
  + --force-tier2 — escape hatch for dev / smoke tests.
  + JSON output now includes per-slot {tier, module} so the operator
    can see at a glance what each slot ran without grepping logs.

Tests: 129 (was 119). New cases:
  test_exploits.py +6
    - catalog has at least the five canonical Metasploitable2 vectors
    - select_module is deterministic per (host, slot, ep)
    - select_module diversifies across hosts
    - select_module walks the full catalog over many episodes
    - module_target_port pulls RPORT for each shipped TOML
  test_fleet.py +4
    - _run_slot dispatches to run_tier3_demo.py when msfrpcd up
    - falls back to run_real_vm_demo.py when msfrpcd unreachable
    - falls back when module catalog empty
    - --force-tier2 overrides msfrpcd availability
    - PORT_BASE is unique per concurrent slot (no hostfwd collision)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:22:49 -05:00
max
8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run
Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:

  1. orchestrator/fleet.py::_run_slot set
     env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
  2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
     (NO slot suffix), then UNCONDITIONALLY overwrote the env's
     RUN_DIR with that flag's value before exec'ing the launcher.
  3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
     collided on the same socket dir.
  4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
     rmtree literally deleted slot 0's pidfile + sockets mid-boot.
  5. Net effect: one VM survives per wave on a multi-core host that
     should be running ~cores-1 in parallel. Throughput collapses
     to 1/N.

Fix:

  tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
    --run-dir default cascade —
      1) explicit CLI flag
      2) RUN_DIR env (set by fleet runner)
      3) /tmp/cis490-vm-<SLOT>  (SLOT from env, default 0)
    Same change in both runners so Tier-2 + Tier-3 fleet waves
    parallelize cleanly.

  orchestrator/fleet.py::_run_slot:
    Pass --run-dir explicitly to the subprocess so the per-slot path
    is audit-visible in the fleet log instead of buried in env.
    Also flip the subprocess interpreter to repo_root/.venv/bin/python
    when present (was /usr/bin/env python3 — worked by luck because
    the orchestrator path doesn't import msgpack/httpx, but a Tier-3
    fleet wave would have died at import-time on a host without those
    in system Python).

  etc/cis490-orchestrator.service:
    Removed the duplicate [Service] hardening block at the bottom of
    the file that was silently overriding the AmbientCapabilities
    grant (NoNewPrivileges=true at the bottom flipped the
    NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_
    ADMIN + CAP_PERFMON before per-episode subprocesses inherit
    them). Sources 3 + 4 would have failed silently inside the
    sandbox.
    Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:55:56 -05:00
max
bdcd2ecbef Close out the open issues: bridge pcap wiring, perf collector, Tier-4
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.

Closes spectral/CIS490#4, #5, #6.

== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
  bridge_iface: str | None        # e.g. "br-malware"
  bridge_ip:    str = "10.200.0.1"
  pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.

vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.

run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.

== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.

EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.

== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.

samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.

exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.

exploits/driver.py: dispatch order is now:
  1) sample.kind == "real" + binary staged at sample_store_root
     → real_binary_workload (Tier 4)
  2) profile mimic from workloads.workload_for() (Tier 3 v2)
  3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.

samples/store/.gitkeep added (binaries themselves are gitignored).

Tests: 102 pass (was 86). New suites:
  tests/test_perf_qemu.py — parser + builder + perf-missing fallback
  tests/test_tier4.py     — real_binary_workload base64 round-trip,
                            stop-cmd kills pidfile, per-profile path
                            isolation, driver dispatch chooses real vs
                            mimic correctly, fetcher input validation
                            and cached-fast-path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:17:49 -05:00
max
b80986d99c Driver v2: sample-profile-driven workloads (Tier-2 + Tier-3)
The v1 driver ran ``yes > /dev/null`` for every sample, which
produced the same envelope shape regardless of which malware family
the orchestrator claimed to be running. That's a poor training
signal: the model sees identical /proc + QMP traces tagged
"cryptominer" / "ransomware" / "RAT" with no distinguishing
features. v2 fixes this.

What landed:

  exploits/workloads.py — six ``Workload`` profiles, each producing
    a distinct in-session shell command pair (start_cmd / stop_cmd)
    that backgrounds a profile-shaped loop:

      cpu-saturate    — sustained 1-vCPU saturation (XMRig shape)
      scan-and-dial   — periodic SYN-style probes across 10.200.0.0/24
                        + dial-home to gateway (Mirai shape)
      io-walk         — fs traversal + 4 KiB urandom writes, periodic
                        re-read (ransomware shape)
      bursty-c2       — long idle, periodic 3-packet TCP egress burst
                        (Dridex C2 beacon shape)
      low-and-slow    — minimal CPU + periodic awk-driven memory churn
                        (Kovter / fileless shape)
      shell-resident  — single long-lived TCP socket pinned to gateway
                        with periodic 6-byte command ticks (RAT shape)

  Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so
  the stop_cmd can cleanly kill the loop and its descendants.

  exploits/driver.py — MSFExploitDriver now accepts an optional
    ``Sample``. With one supplied, ``infected_running`` dispatches to
    the matching workload via exploits.workloads.workload_for(); the
    ``sample_executed`` event records profile + sample name + sample
    kind so the trainer can join cleanly. Without a sample, the v1
    yes-loop path remains unchanged (backwards compat).

  tools/vm_load_controller.py — the same dispatch on the Tier-2 path
    (no exploit, real Alpine guest driven over the serial console).
    A fleet wave now produces six visually distinct envelopes per
    wave whether the underlying mode is Tier 2 or Tier 3.

  tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or
    SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent
    sockets into the EpisodeConfig so all three new collectors
    (sources 2, 4, 5) run alongside source 1 by default.

  tools/run_tier3_demo.py — same ``--sample`` plumbing for the
    exploit-driven path.

Tests: 86 pass (was 82). New v2 cases:
  - profile dispatch routes infected_running to the workload's
    start_cmd (NOT the v1 yes-loop) when a Sample is set
  - all six profiles produce distinct start_cmds (the property the
    ML model needs)
  - unknown profile string falls back to cpu-saturate with a warning
  - v1 path (no Sample) still uses yes-loop (backwards compat)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:06:15 -05:00
max
613c6fa223 Tier 3: msfrpc-driven exploit driver + first module config
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.

What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
  module.execute, job/session lifecycle) so we don't depend on a
  third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
  session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
  TOML module configs with {{ target_ip }} placeholders, replacing the
  imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
  intentionally-vulnerable target VM (host can reach guest via
  hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
  including an integration test that drives a real EpisodeRunner.

Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
  share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
  before run() (driver_setup fires pre-schedule).

Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:52 -05:00