49 commits

Author SHA1 Message Date
Max
308140c6ce training: lambda-cloud one-shot training integration
External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.

Three scripts + one ingest tool:

  scripts/build-lambda-bundle.sh
    Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
      - the repo (sans .git, sans data/, sans artifacts*)
      - data/processed/{validation_v1,features_window_v1}.parquet
      - data/processed/feature_schema_v1.json
      - data/processed/tensor_window_v1/   (npz shards)
      - bootstrap.sh (entrypoint)
      - training_manifest.toml (the canonical job list)
      - BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
    Verifies all four data inputs exist BEFORE compressing 5+ GB.

  scripts/run-on-lambda.sh ubuntu@<ip>
    rsync bundle up → ssh + run bootstrap → rsync artifacts +
    reports/eval back to artifacts-lambda/ + reports/lambda/.
    Resumable rsync; sha256-verified.

  scripts/lambda-bootstrap.sh   (runs ON the Lambda instance)
    Creates .venv with cu121 torch + xgboost + the [training] deps,
    iterates the manifest's job list in priority order (highest first),
    runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
    skips jobs whose .ckpt.json already exists (idempotent on re-run),
    writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
    stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.

  tools/ingest_lambda_artifacts.py
    Bundles each (ckpt.json + sidecar + train.json) trio into a
    .tar.zst, sha256, PUTs to the local trainer-receiver's
    /v1/model/{job_id}, marks the job complete. Maps (model, mode) →
    job_id by re-reading the canonical manifest. Handles the queue
    state churn (requeue if completed, claim if pending, fail-back
    on race losses).
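
A minimal sketch of the idempotent job walk described for lambda-bootstrap.sh, written in Python for illustration (the real entrypoint is a shell script). The `priority` key, the `<model>_<mode>.ckpt.json` filename, and the `plan_jobs` helper are assumptions made for this example, not names confirmed by the repo.

```python
from pathlib import Path


def plan_jobs(jobs, artifacts_dir):
    """Split manifest jobs into (to_run, skipped), highest priority first.

    A job is skipped when its checkpoint sidecar already exists, which is
    what makes a re-run of the bootstrap idempotent.
    """
    artifacts_dir = Path(artifacts_dir)
    to_run, skipped = [], []
    for job in sorted(jobs, key=lambda j: j["priority"], reverse=True):
        # Hypothetical layout: one <model>_<mode>.ckpt.json per finished job.
        ckpt = artifacts_dir / f"{job['model']}_{job['mode']}.ckpt.json"
        (skipped if ckpt.exists() else to_run).append(job)
    return to_run, skipped
```

Keeping the skip decision in a pure function makes the re-run behaviour testable without touching a GPU.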

End-to-end smoke verified on the A100 instance just provisioned:
  - SSH from Pi via ed25519 keypair (cis490-trainer-pi)
  - GPU: A100-SXM4-40GB, driver 580.105.08
  - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
  - 464 GB ephemeral disk available

Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:04 -05:00
Max
8643192a71 training/fleet: distributed multi-host trainer with capability gating
Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.

Operator-control surface is etc/training_manifest.toml.example —
single canonical file declaring (a) per-host capability + per-model
allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).

Components:

  capability.py — self-detection: hostname, cores, RAM, CUDA presence,
    VRAM, torch version, git commit. Used by workers to filter
    eligible jobs before claiming.

  manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable
    sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
    so manifest reload is idempotent: existing rows keep their status,
    new jobs become claimable, removed jobs stay until cancelled.

  queue.py — SQLite job queue (training_jobs.db) with statuses
    pending|claimed|running|completed|failed|cancelled. Atomic
    claim_next via single UPDATE WHERE status='pending'. Heartbeat,
    complete, fail. Stale-claim sweep (stale_after_s=600) that marks a
    job failed once it hits the max_attempts cutoff.

  store.py — model artifact store mirroring receiver/store.py.
    Artifact ID is the sha256 of the uploaded tarball; bit-identical
    re-runs deduplicate.

  receiver.py — Starlette app exposing 11 endpoints:
    POST /v1/job/claim          (worker)
    POST /v1/job/{id}/heartbeat (worker)
    POST /v1/job/{id}/complete  (worker)
    POST /v1/job/{id}/fail      (worker)
    PUT  /v1/model/{id}         (worker — uploads tarball)
    GET  /v1/jobs               (anyone)
    GET  /v1/workers            (anyone)
    POST /v1/job/{id}/cancel    (operator: X-Operator-Token)
    POST /v1/job/{id}/requeue   (operator)
    POST /v1/manifest/reload    (operator)
    GET  /v1/health             (anyone)
    Runs as cis490-trainer-receiver.service on the Pi alongside the
    existing receiver, on a separate port.

  client.py — stdlib HTTP client (urllib only, no new deps).

  worker.py — long-running daemon. Loop: detect capability → claim →
    spawn training/trainer/run.py subprocess → heartbeat every 30s →
    tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
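
The stable job ID described for manifest.py could look roughly like the sketch below. The canonical-JSON serialisation and the sorting of `train_hosts` are assumptions; the commit only states that the ID is a sha256 over (model, mode, hyper, split_recipe, train_hosts, seed).

```python
import hashlib
import json


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Derive a deterministic job ID from the fields that define a job."""
    payload = {
        "model": model,
        "mode": mode,
        "hyper": hyper,
        "split_recipe": split_recipe,
        # Assumption: host order is not meaningful, so sort before hashing.
        "train_hosts": sorted(train_hosts),
        "seed": seed,
    }
    # sort_keys + fixed separators give one byte string per logical payload.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the ID depends only on the job definition, reloading an unchanged manifest yields the same IDs, so existing queue rows can keep their status.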

Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.

Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.

End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.
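
One way to get the atomic claim that queue.py is described as doing is a guarded UPDATE whose row count tells the worker whether it won. This is a sketch under assumptions: the table layout, column names, and the select-then-update retry loop are illustrative and not taken from training_jobs.db.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open a queue database in autocommit mode with a minimal jobs table."""
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " priority INTEGER NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " claimed_at REAL)"
    )
    return db


def claim_next(db, worker):
    """Claim the highest-priority pending job, or return None if none remain."""
    while True:
        row = db.execute(
            "SELECT job_id FROM jobs WHERE status='pending' "
            "ORDER BY priority DESC, job_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes this single UPDATE the atomic step: if another
        # worker claimed the row first, zero rows change and we try again.
        cur = db.execute(
            "UPDATE jobs SET status='claimed', worker=?, claimed_at=? "
            "WHERE job_id=? AND status='pending'",
            (worker, time.time(), row[0]),
        )
        if cur.rowcount == 1:
            return row[0]
```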

21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.

Open limitations surfaced inline:
  - Hyper-key drift between manifest and run.py fails at training
    time, not at manifest reload (worth tightening to argparse
    introspection later).
  - mTLS not yet wired through Caddy for the trainer-receiver port —
    listens loopback-only until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:20:20 -05:00
Max
1fabd4a246 training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers
The model layer of the project, built honestly:

  - tools/dataset_validate.py — full-sweep validator over the receiver
    store (sha256, schema, monotonic labels, telemetry-row gate). On the
    current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
    7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
    is committed as the per-episode acceptance index.

  - training/_features.py — channel registry (46 channels across
    proc/guest/qmp/netflow), summary-stat windowing AND channel×time
    tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
    (Unix ns) — tested fix for a real netflow-vs-host clock-base
    inconsistency that was silently dropping every netflow channel.

  - training/_split.py — three held-out recipes (host / sample / time)
    with profile-stratification assertions. held_out_host carries
    untested_profiles for cases like scan-and-dial absent from the test
    host (5 of 6 profiles tested cross-device, never silently averaged).

  - training/models/ — 6 architectures behind a common BaseModel
    interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
    trained twice (realistic / oracle) per the deployment threat model.
    Schema-hashed checkpoints refuse to load if _features.py changed
    since training (silent-input-drift protection, tested).

  - training/trainer/ — unified training loop: class-weighted CE, LR
    warmup + cosine, gradient clipping, mixed precision when CUDA,
    early stopping on val macro F1, best-on-val checkpoint. Same loop
    runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
    early_stopping_rounds on val mlogloss.

  - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
    per-profile and per-host breakdown, paired-bootstrap significance
    for model-vs-model gap. Confusion matrix uses union of seen labels.

  - training/dashboard/producers/ — replay/metrics/perf/profiles
    emitting the six event types the dashboard's awaiting scenes
    consume; on-demand tensor extraction so the Pi can run live
    inference without 65 GB of shards.

  - 17 unit tests (split coverage, features round-trip, schema mismatch,
    determinism, time-base alignment regression).
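
The schema-hashed checkpoint guard can be sketched as follows. What the real hash covers is not stated in this message; the sketch assumes it covers the channel list plus the window and stride lengths, and `schema_hash`, `check_checkpoint`, and `SchemaMismatchError` are hypothetical names.

```python
import hashlib
import json


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(channels, window_s, stride_s):
    """Hash the parts of the feature definition a model's inputs depend on."""
    payload = json.dumps(
        {"channels": list(channels), "window_s": window_s, "stride_s": stride_s},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_checkpoint(ckpt_meta, current_hash):
    """Refuse to load a checkpoint whose recorded schema hash has drifted."""
    saved = ckpt_meta.get("schema_hash")
    if saved != current_hash:
        raise SchemaMismatchError(
            f"checkpoint schema {saved!r} != current schema {current_hash!r}"
        )
```

Failing loudly at load time is the point: a reordered or renamed channel would otherwise feed the model silently shifted inputs.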

End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:19:00 -05:00
b73f5559dc Tier-3 fixes: b'' probe false-positive, requires_bridge, msgpack
Bug 10: _wait_for_tcp returned on recv()→b'' (connection closed by peer),
falsely signalling service-ready. Only socket.timeout or non-empty data
are genuine ready signals; b'' now retries.

Bug 11: distccd_command_exec and unreal_ircd_3281_backdoor incorrectly
had requires_bridge=true. bind_perl payloads connect inward (host→guest
via hostfwd), not outward — no bridge egress needed. Both modules now
run on SLIRP-only fleet slots.

Bug 12: msgpack.unpackb crashed on integer session IDs from msfrpcd 6.x
(strict_map_key=True default). Added strict_map_key=False.

Bug 13 (documented): samba_usermap_script removed from catalog (NoReply
on every fire — already handled in dca6144 on origin/main).
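
The Bug 10 readiness rule can be sketched as below: a recv timeout or real bytes mean ready, while `b''` (peer closed) or a connection error means retry. The function name, parameters, and defaults are assumptions for illustration, not the signature of the repo's `_wait_for_tcp`.

```python
import socket
import time


def wait_for_tcp(host, port, deadline_s=60.0, recv_timeout_s=3.0, retry_s=0.5):
    """Return True once the service looks ready, False if the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=recv_timeout_s) as s:
                s.settimeout(recv_timeout_s)
                try:
                    data = s.recv(1)
                except socket.timeout:
                    # Connected and the server is waiting for us to speak.
                    return True
                if data:
                    # The server sent a banner byte.
                    return True
                # b'' means the peer closed the connection: not ready yet.
        except OSError:
            # Refused, reset, or connect timeout: port not serving yet.
            pass
        time.sleep(retry_s)
    return False
```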

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 15:15:18 -06:00
Max Gorog
3d4f282e9c Tier-2 episodes use clean-only schedule; .gitignore VERSION
Two correctness fixes that the §4.5 event-driven labeller surfaced:

1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
   (clean → armed → infecting → infected_running → ...) for episodes
   with no exploit firing. Pre-§4.5 those episodes wrote dishonest
   `infected_running` labels from the schedule clock — exactly the §3
   evidence pattern. Post-§4.5 they write `failed` at the infecting
   transition (the justifying exploit_fire never arrives), which is
   honest about what happened but useless for training.

   The honest fix: Tier-2 episodes have a clean-only schedule. All
   telemetry tagged `clean` because nothing infected anything. The
   total duration matches the canonical Tier-3 schedule so episode
   lengths are comparable across tiers — no length-bias in the
   dataset (§10).

   Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
   derives `[("clean", total_seconds)]` from the canonical schedule.
   `tier3_schedule_from(schedule)` renders the legacy
   `[(name, seconds)]` shape EpisodeConfig still expects.

   Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
   Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
   Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
   canonical manifest is the single source of truth (§4.1).

2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
   writes /opt/cis490/VERSION so episodes can record code provenance
   without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
   typically a git checkout on lab hosts (auto-update.sh pulls into
   it), so writing VERSION leaves the working tree dirty and every
   episode records meta.code_version.dirty=true. Rule 4 of the
   PIPELINE.md §4.6 acceptance gate would then reject every episode
   unless CIS490_ALLOW_DIRTY=1 is set, which would break the data flow.

   Now VERSION is .gitignored: install-lab-host.sh stamps it, git
   status doesn't see it, dirty=false, gate rule 4 passes naturally.
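
The two schedule helpers from item 1 can be sketched as below. The output shapes follow the message; the input shape (a list of mappings with `name` and `seconds` keys) is an assumption about how the canonical manifest represents phases.

```python
def tier3_schedule_from(schedule):
    """Render the canonical schedule as the legacy [(name, seconds)] list."""
    return [(phase["name"], phase["seconds"]) for phase in schedule]


def tier2_schedule_from(schedule):
    """Collapse the canonical schedule into one clean phase of equal length."""
    total_seconds = sum(seconds for _, seconds in tier3_schedule_from(schedule))
    return [("clean", total_seconds)]
```

Deriving both shapes from the same input is what keeps Tier-2 and Tier-3 episode lengths equal by construction.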

These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.

286 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:55:37 -05:00
Max Gorog
22269e175d PIPELINE §5 step 4: catalog admission verifier (§4.3)
tools/verify_catalog.py runs the §4.3 end-to-end verification flow
against every entry in manifest.toml's [catalog].modules (or a single
named module). The flow follows §4.3 exactly:

  1. Load the module config + the verified-against target spec.
  2. Resolve the published image path; fail loudly if absent.
  3. Boot the target VM under §4.13 containment (restrict=on, snapshot=on,
     no shared FS, unprivileged QEMU — same posture as verify.sh).
  4. Wait for the service on the spec'd port.
  5. Login to msfrpcd, snapshot the existing session set, fire the
     module against `127.0.0.1:<host_port>` (the SLIRP hostfwd to the
     guest's promised service port).
  6. Wait for `session_open` — NOT session_open_timeout, which is the
     §4.5 failed-label outcome.
  7. Round-trip a shell command (`id`); confirm uid= shape.
  8. Confirm a guest-side artifact (touch marker; ls + echo VERIFY_OK).

Per-module exit code is 0 only when EVERY step passes. CLI exit is 0
only when EVERY requested module passes — partial credit isn't an
option (§1 default-to-removal: a module that can't pass shouldn't be
in the catalog).

Structured JSON output with per-step timings + detail strings, written
to stdout or --out <path>. Operator pulls this into a successful CI
run + signs off on the manifest.toml [[catalog.modules]] amendment
with a fresh `last_verified = <commit_sha>` per §15.

Tests (tests/test_verify_catalog.py, 8 cases): exercise the flow with
a mocked MSFRpcClient + mocked qemu boot. Cover happy path, every
short-circuit failure mode (image missing, service never up, session
timeout, shell round-trip wrong, guest artifact missing), and
spec-load errors. Real verification needs lab hardware; the mocked
flow proves the orchestration contract.

269 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:35:32 -05:00
Max Gorog
4d29b7236d PIPELINE §5 step 3: target VM build infrastructure + containment posture
§4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands
every target ship the same isolation posture (no upstream egress, no
host-shared FS, unprivileged QEMU, fresh snapshot per episode). This
commit lands the infrastructure for both.

New surface:
  * orchestrator/target_spec.py
      Loads + validates `vm/targets/<name>/spec.toml`. Containment
      fields are not knobs — each has exactly ONE safe value, and a
      spec asserting the unsafe value is rejected at load time. There's
      no `--containment-override`; weakening §4.13 requires amending
      PIPELINE.md and operator sign-off.

  * tools/build_target.py
      Orchestrates build → verify → publish for a single target. Spec
      invalid → exit 78 (sysadmin error). build.sh failure → image not
      published. verify.sh failure → image discarded; that's the §4.2
      acceptance gate. Publishes sha256 + the manifest.toml stanza the
      operator copies in to admit the image (§16 substantive amendment
      with sign-off per §15).

  * vm/targets/<name>/{spec.toml,build.sh,verify.sh}
      Template structure. spec.toml is the contract; build.sh produces
      $OUT_PATH; verify.sh boots the produced image under the §4.13
      containment posture and asserts every promise.

  * vm/targets/shellshock/
      First real working target. CVE-2014-6271 (Apache mod_cgi + bash
      4.2 mis-parsing function-export environment values). Replaces
      the SourceForge Metasploitable2 path that §3 evidence proved
      unverifiable. Bash 4.2 is built from sha256-pinned GNU source
      inside an Alpine 3.21 cloudinit guest; the build script asserts
      the produced bash actually triggers shellshock; the verifier
      re-asserts it under restrict=on with a real CVE-2014-6271 probe.

  * vm/targets/README.md
      How operators add a target. Walks the spec → build → verify →
      manifest amendment loop.

Containment regression tests (tests/test_containment.py) — 20 new
assertions, parameterized over every target with a build/verify trio:

  * verify.sh MUST contain `restrict=on` on its netdev (§4.13)
  * verify.sh MUST contain `snapshot=on` on the boot drive (§4.13)
  * verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs
  * verify.sh + build.sh MUST NOT wrap qemu-system in `sudo`
  * Every target must ship the complete spec.toml + build.sh + verify.sh
    trio — no half-built targets (§1 default-to-removal)
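
A rough sketch of the source-grep style of check listed above. It works on script text passed in as strings and uses plain substring matching, which is looser than asserting the flag sits on a specific `-netdev` or `-drive` argument; `containment_violations` is a hypothetical helper, not the test module's API.

```python
import re

# Tokens that indicate a host-shared filesystem.
FORBIDDEN_TOKENS = ("-virtfs", "-fsdev", "9pfs")


def containment_violations(verify_sh, build_sh):
    """Return human-readable containment problems found in the two scripts."""
    problems = []
    if "restrict=on" not in verify_sh:
        problems.append("verify.sh: missing restrict=on")
    if "snapshot=on" not in verify_sh:
        problems.append("verify.sh: missing snapshot=on")
    for name, text in (("verify.sh", verify_sh), ("build.sh", build_sh)):
        for token in FORBIDDEN_TOKENS:
            if token in text:
                problems.append(f"{name}: forbidden token {token}")
        # Flag sudo followed by qemu-system on the same line.
        if re.search(r"\bsudo\b[^\n]*qemu-system", text):
            problems.append(f"{name}: qemu-system wrapped in sudo")
    return problems
```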

Spec validation tests (tests/test_target_spec.py): 13 new tests over
spec parse, name/dir mismatch, missing fields, out-of-range port, and
the §4.13 containment field validators (each unsafe value rejected
with a clear error).

The shellshock target's image is NOT yet published to manifest.toml's
[[targets.images]] — that's the §15 sign-off amendment that lands
after a successful operator-driven build_target.py run on a lab host
with KVM. Building takes ~10 min on x86_64; cannot run on the Pi
under TCG. Operator drives the first build, verifies the sha256, then
amends manifest.toml in a follow-up commit.

261 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:31:40 -05:00
Max Gorog
207a902c3e PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.

Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
  * --manifest, --modules-dir       (paths now derived)
  * --ram-per-vm-mib                (in manifest.experiment)
  * --max-concurrent                (manifest.experiment.fleet.max_concurrent_ceiling)
  * --max-tier3-slots               (manifest.experiment.fleet.max_tier3_slots)
  * --force-tier2                   (not a §14 sanctioned override knob —
                                     ship empty catalog to disable Tier-3)
  * --require-real-samples          (sample-side concern; out of fleet scope)
  * tools/run_*_demo.py --manifest  (samples path now from canonical)

New surface:
  * manifest.toml                   — the single source of truth
  * orchestrator/manifest.py        — load_canonical() + Manifest dataclass
                                      with strict validation, raises
                                      ManifestError on any failure
  * EpisodeConfig.experiment_meta   — populated by run_*_demo.py from
                                      the canonical manifest; stamped
                                      into every episode's meta.json
                                      under "experiment" key for
                                      provenance
  * cis490-orchestrator.service     — RestartPreventExitStatus=78 so
                                      manifest-load failures stay
                                      stuck-and-loud (§9, §4.7)
  * install-lab-host.sh             — validates manifest.toml at
                                      install time; missing or invalid
                                      = die with clear message

Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.

Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.

12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.

Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:25:01 -05:00
Max Gorog
ac7b85ff8d PIPELINE §5 step 1 follow-up: enable perf in production launchers
The §5 step 1 fixes correct the perf collector's stdout/stderr +
event-name parser bugs, but the launchers
(run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so
production episodes still ship with rows_perf=0 — silently disabled
collector, which is exactly the §1 / §4.4 pattern.

Turn it on in both launchers. Failure modes (perf binary missing,
paranoid level too high) are logged as warnings + return 0 rows
visibly, not silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:40:37 -05:00
Max Gorog
4ab5477226 PIPELINE §5 step 1: fix four root-cause defects
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
  - perf stat -j writes to stderr by default with -p; we read stdout.
    Add --log-fd 1 so JSON reaches stdout where the parser sees it.
  - Event names come back annotated with the privilege scope perf
    actually measured ("cycles:u" under perf_event_paranoid=2). Strip
    the suffix so _build_row's plain-name lookups hit. Without this
    every metric was None even when perf reported real numbers.
  - tests/test_collectors_emit.py covers the regression with a real
    busy-loop fixture; emit-test discipline per §4.4.
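
The suffix stripping described above might look like this sketch. It handles only the privilege-scope letters u, k, and h, and it assumes each parsed row carries its name under an `event` key; both are assumptions for illustration, and `normalize_event` / `index_by_event` are hypothetical names.

```python
import re

# Matches "<name>:<scope>" where scope is made only of u/k/h modifier letters.
_SCOPED = re.compile(r"(?P<name>.+):[ukh]+")


def normalize_event(event):
    """Drop a trailing privilege-scope modifier such as ':u' from an event name."""
    match = _SCOPED.fullmatch(event)
    return match.group("name") if match else event


def index_by_event(rows):
    """Key parsed perf rows by plain event name so plain-name lookups hit."""
    return {normalize_event(row["event"]): row for row in rows}
```

Tracepoint-style names such as `sched:sched_switch` are left intact because their suffix is not purely modifier letters.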

guest-agent collector (rows_guest=0 on 100% of episodes):
  - Alpine cloud image doesn't ship python3, so the in-guest agent's
    `#!/usr/bin/env python3` shebang silently fails. Add packages:
    [python3] to cidata user-data so cloud-init installs it before
    the OpenRC service starts.
  - Guest agent now exits nonzero (was: silent stdout fallback) when
    /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
    reports the failure to /var/log/cis490-agent.log instead of the
    bytes vanishing into the void. Refs §1.
  - Host-side collector emits guest_agent_connected /
    guest_agent_first_byte / guest_agent_silent_window into the
    orchestrator's events.jsonl. Future episodes show the in-guest
    failure mode per-episode instead of inferring from rows_guest=0.

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
  Tier-3 episodes — was misclassified as host divergence):
  - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
    qmp_socket / guest_agent_socket / bridge_iface — even though
    launch_target.sh creates the underlying chardevs and BRIDGE
    supplies the iface. tools/run_real_vm_demo.py wires them
    correctly; Tier-3 had a copy-paste gap.
  - tests/test_collectors_emit.py adds a source-grep regression so
    the wiring stays honest.

samba_usermap_script never lands session (0/67 in §3 probe):
  - Bind handler default WfsDelay (~5s) gives up before bind_perl on
    Metasploitable2 has finished forking + binding LPORT under
    SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
    exploits/driver.py so framework + driver agree on the wait
    budget. Add ConnectTimeout=15 so the handler's bind connect has
    retry budget instead of one-shot.

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
  unconditional, so:
  - With BRIDGE set, requires_bridge modules were still being
    dropped — picker only ever returned samba_usermap_script across
    every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
    failure on HEAD).
  - env.pop("BRIDGE") fired even when BRIDGE was the operator's
    explicit setup, breaking modules that need bridge mode (vsftpd
    backdoor on hardcoded port 6200, distccd, etc.).
  Both made conditional on bridge_set so the picker walks the full
  catalog under bridge mode and SLIRP-only modules still get a
  clean SLIRP env when BRIDGE is unset.

receiver/app.py: half-applied v2 schema state in HEAD. app.py called
  store.ingest_stream(episode_type=..., benign_profile=...) with
  kwargs whose matching store.py change was still in the WIP stash.
  Removed v2 awareness from app.py so v1 episodes (what the producer
  ships today) get accepted again. SCHEMA_VERSION default reset to 1
  to match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-applied v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:05:25 -05:00
max
05bf785f0a fleet-health: exit 0 when alerts found (don't mark unit failed)
The detector previously returned 1 on alerts, which made systemd
mark cis490-fleet-health.service as 'failed' every tick that found
a sick host. That's the wrong UX — a detector finding a fault is
working correctly, not crashing. The alert is the signal (via
WARNING log + alerts.jsonl); the unit's success state should mean
"the detector itself ran cleanly." Test added.

Caught while live-deploying on the Pi: the first run found
elliott-thinkpad fatal-only at 943×4xx + 1425×5xx and correctly
emitted the alert, but systemd showed the unit red, which would
have sent operators chasing the wrong problem.

Side note: the same first run also caught a real bug — pycache for
receiver.store on /opt/cis490 was stale after I deployed the new
app.py + store.py from main, causing 1464 × 500 responses. Cleared
the pycache and the index immediately resumed growing (4465 →
4515 in 30 seconds). The detector earned its keep on the very
first cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:51:20 -05:00
max
49eba2fd60 fleet-health: proactive alerts on the Pi + per-host doctor reports
Two pieces of self-monitoring so the maintainer isn't the alarm:

(2) Receiver-side fleet health monitor

cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):

  silent      — host shipped in last 24h but has been quiet >30 min
  fatal-only  — actively shipping but every PUT 4xx
  unstamped   — shipping without X-Cis490-Code-Commit header

Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
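
The hour-bucket dedup can be sketched as below. The in-memory set is an assumption made to keep the example short (the real monitor presumably persists state between timer runs), and `AlertDeduper` is a hypothetical name.

```python
def dedup_key(host, symptom, ts):
    """Bucket a Unix timestamp into its clock hour for deduplication."""
    return (host, symptom, int(ts // 3600))


class AlertDeduper:
    """Allow each (host, symptom) pair to fire at most once per hour bucket."""

    def __init__(self):
        self._seen = set()

    def should_fire(self, host, symptom, ts):
        key = dedup_key(host, symptom, ts)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With a 5-minute timer, a sustained fault produces one alert per clock hour rather than twelve.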

(3) Per-host doctor snapshots

Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:

  PUT /v1/host-health/<host>   →  /var/lib/cis490/host-health/<host>.json
  GET /v1/host-health          →  aggregate across all hosts

Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.

ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.

Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.

Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:48:31 -05:00
1bac2f0135 run_tier3_demo: replace serial probe with min-wait + TCP probe
The serial console approach failed: Metasploitable2's kernel is not
configured with console=ttyS0, so only GRUB output reaches the QEMU
serial socket; the OS boot and login prompt never appear there.

New approach:
1. Sleep _METASPLOITABLE2_MIN_BOOT_S (65 s) after QEMU writes its
   pidfile. By this point the guest kernel and init are always up.
2. Call _wait_for_tcp with a 3 s recv timeout. Post-floor, SLIRP has
   forwarded the connection to the guest TCP stack, so:
   - socket.timeout → service listening, waiting for client data ✓
   - OSError/RST    → port still closed (service not ready); retry ✓
   Eliminates the early-boot false-positive that caused exploits to
   fire ~60 s before Samba was actually listening.

Also update TIER3-BRINGUP.md bug 6 to reflect the correct final fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:38:22 -06:00
667f042707 Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01)
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:

1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
   instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.

2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
   modules selected on SLIRP runs; fix: always filter requires_bridge.

3. cmd/unix/interact creates no session.list entry → session_open_timeout
   every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.

4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
   fix: extra_host_port:extra_host_port mapping so guest binds the
   per-slot LPORT directly.

5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
   fix: requires_bridge=true filters it from SLIRP fleet runs.

6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
   boots (~60 s too early); fix: replace TCP probe with serial console
   _wait_for_serial_login that waits for actual "login:" prompt.

7. Stale QEMU survives orchestrator restart (start_new_session=True) →
   holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
   old pidfile before rmtree.

8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.

9. msfrpcd 6.x returns bytes for all string values even with raw=False;
   fix: MSFRpcClient._str() recursive decoder applied to all responses.
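
A recursive decoder in the spirit of fix 9 might look like the sketch below. `decode_strs` is a hypothetical stand-in for `MSFRpcClient._str()`, and the UTF-8 decode with replacement of undecodable bytes is an assumption.

```python
def decode_strs(value, encoding="utf-8"):
    """Recursively turn bytes into str inside dicts, lists, and tuples."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    if isinstance(value, dict):
        # Keys may be bytes too; non-bytes keys (e.g. ints) pass through.
        return {
            decode_strs(k, encoding): decode_strs(v, encoding)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [decode_strs(item, encoding) for item in value]
    return value
```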

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:26:19 -06:00
86bd9e21d7 Merge remote-tracking branch 'origin/main' into Dev_REL1_043026
2026-05-02 12:24:08 -06:00
max
f9b2e5c4e6 shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors
Three robustness items off the future-work list:

1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
   daemon sends READY=1 after queue construction and WATCHDOG=1 once
   per scan pass via a heartbeat callback wired into run_forever.
   Restart=on-failure only catches process death; silent stalls
   (deadlock, hung tar subprocess, blocked I/O past timeout) used to
   leave a hung daemon running while the data backlog grew. Now systemd
   kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.

   Verified end-to-end against systemd via `systemd-run --transient
   --property=Type=notify --property=WatchdogSec=10`: unit transitions
   to active on READY=1; SIGSTOP'ing the process triggers
   `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
   exactly t+10s, then unit goes failed → restart cycle.

2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
   forever as fatal episodes piled up. New ShipperConfig fields:
     quarantine_keep_days = 30           # opt-out: 0 disables
     quarantine_cleanup_interval_s = 3600 # gate so 5s tick doesn't
                                          # statx() the whole tree
   Cleanup runs at the start of run_once() but is gated to once per
   hour. Removed entries logged.

3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
   journal and surfaces 412/400/transient patterns as red/yellow rows
   with the canonical fix command. An on-device agent running
   cis490_doctor.py now sees one line ("12 ship(s) rejected as
   out-of-window") instead of needing to grep the journal.
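
For item 1, the notify protocol itself is small: systemd passes a
datagram socket path in NOTIFY_SOCKET and the daemon writes state
strings to it. A rough stdlib-only sketch, assuming no libsystemd
binding is used; the helper name `sd_notify` here is illustrative and
may not match the shipper's actual code:

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send one state string (e.g. "READY=1", "WATCHDOG=1") to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process is not
    running under a Type=notify unit.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("ascii"), path)
    return True
```

Calling `sd_notify("READY=1")` once after queue construction and
`sd_notify("WATCHDOG=1")` from the per-scan heartbeat matches the
sequence described in item 1.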

Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; when both error classes are present, 412 is
prioritised (more actionable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:02:59 -05:00
max
ed5e6b0581 docs+doctor: surface VERSION-stamp + fallback wiring
receiver.toml.example: the local_repo_path comment was wrong about
when it kicks in. With the new fallback path, it's used both when
forgejo_url is unset (sole backend) AND when forgejo is unreachable
(failover). Document that, plus the auto-detect of /opt/cis490/.git.

cis490_doctor: add a VERSION-stamp check for lab-host role. If
/opt/cis490/VERSION is missing or malformed, the orchestrator stamps
"unknown" → receiver gate rejects every PUT → quarantine. Surface
this as a red row with the canonical fix (re-run install-lab-host.sh)
so an on-device agent doesn't have to grep journal logs to figure it
out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:54:36 -05:00
max
eda6164897 fix: lab-host install loop after commit-gate cutover
Why services weren't starting after the gate went live:

1. install-lab-host.sh self-copy. The receiver's 400 remediation tells
   the agent to `cd /opt/cis490 && git pull && sudo
   ./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT
   and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same
   file"; `set -e` aborts before the systemd units install or anything
   restarts. Detect the same-dir case and skip the cp; chown still
   runs.

2. Services never restart. install-lab-host.sh and install-tier-3-4.sh
   both ended by *telling the operator* to restart, then exiting. The
   running shipper/orchestrator kept executing pre-gate code from the
   old module objects, so new `code_version` stamping never reached an
   episode. Both scripts now `systemctl restart` the units they own
   when those units are enabled.

3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't
   move the episode out of `data/episodes/`. Next scan re-tarred and
   re-PUT the same dir, getting 400 again. With 4465+ pre-stamp
   episodes on k-gamingcom this burned ~1 PUT/sec across 5+ hours of
   receiver log. Fatal episodes now move to data/quarantine/<id>/ with
   a quarantine_reason.json beside them; the outbox tarball is
   deleted.

4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a
   one-shot that scans data/episodes/ and quarantines anything without
   a 40-char-hex code_version.commit. Wired into install-lab-host.sh
   step 9 so a re-install drains the queue automatically. Idempotent;
   safe to run while the shipper is active.
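
The stamp test in item 4 can be sketched as below. This assumes
meta.json has already been parsed into a dict and that hashes are
lowercase hex; the function name `has_valid_commit` is hypothetical,
not taken from tools/quarantine_unstamped.py:

```python
import re

# Assumption: a full git SHA-1, lowercase hex, exactly 40 characters
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def has_valid_commit(meta: dict) -> bool:
    """True when meta carries a 40-char-hex code_version.commit."""
    code_version = meta.get("code_version")
    if not isinstance(code_version, dict):
        return False  # missing or malformed code_version block
    commit = code_version.get("commit")
    return isinstance(commit, str) and _COMMIT_RE.fullmatch(commit) is not None
```

An episode stamped "unknown" fails this check the same way a missing
stamp does, which is the case the drain targets.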

Tests cover the queue's new fatal-quarantine path and every drain
behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:36:21 -05:00
5568d77df8 Merge remote-tracking branch 'origin/main' into Dev_REL1_043026 2026-05-01 07:51:34 -06:00
max
e2bb76144f tools/verify_tier3_local.py: Pi-runnable Tier-3 verifier
Closes the "have you tested it" gap as much as we can without x86 KVM.
The Pi is ARM64 — can't boot Metasploitable2 or run KVM-accelerated
guests. But most of the Tier-3 chain doesn't need x86:

  * chunked_real_binary_upload is just shell commands over a pipe
  * exploit module TOMLs and the deterministic selector are pure Python
  * manifest loading + sample selection are pure Python
  * msfrpcd itself runs on ARM (Ruby + Java)
  * the receiver's commit gate is the same on any arch

verify_tier3_local.py exercises each of those end-to-end, in process,
on this Pi:

  PASS  exploits/modules/*.toml parse + selector deterministic
  PASS  manifest loads + selector covers every sample
  PASS  chunked binary upload survives a real /bin/sh round-trip
        (150 KB binary, 26 chunks, sha256-verified end to end)
  PASS  staged samples are Linux i386 ELF (when staged)
  PASS  msfrpcd round-trips core.version (when listening)
  PASS  receiver /v1/health + gate enforces commit allow-list

Live result on this Pi today: 5 PASS, 1 SKIP (msfrpcd not installed
on the Pi, which is correct — the Pi is the receiver, not a lab
host). When run on a lab host after install-tier-3-4.sh, all 6
PASS gives full Tier-3 readiness.

What this script does NOT verify (still needs x86 KVM on a lab
host, covered by install-tier-3-4.sh's verify step):

  * Metasploitable2 boots under QEMU/KVM
  * vsftpd_234_backdoor lands a session against it
  * the chunked-upload binary actually executes inside that session

But the chunked-upload step proves every byte of the upload path
(printf '%s', heredoc-free path, base64 decode, sha256 verify,
chmod, exec scaffold) works against a real POSIX shell. An msfrpc
session presents the same shell interface, so a passing local-sh
test is strong evidence the production path will work.
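
A hedged sketch of how the chunked upload could be expressed as a list
of shell commands. `build_upload_commands`, the `.b64` staging suffix,
the chunk size, and the availability of `base64 -d` and
`sha256sum -c` on the target are all assumptions for illustration; the
real chunked_real_binary_upload may differ:

```python
import base64
import hashlib
import shlex


def build_upload_commands(payload: bytes, remote_path: str,
                          chunk_chars: int = 8000) -> list[str]:
    """Return shell commands that rebuild `payload` at `remote_path`."""
    encoded = base64.b64encode(payload).decode("ascii")
    digest = hashlib.sha256(payload).hexdigest()
    staging = shlex.quote(remote_path + ".b64")
    dest = shlex.quote(remote_path)

    commands = [f": > {staging}"]  # truncate leftovers from an earlier attempt
    for start in range(0, len(encoded), chunk_chars):
        chunk = encoded[start:start + chunk_chars]
        # The base64 alphabet has no single quotes, so single-quoting is safe
        commands.append(f"printf '%s' '{chunk}' >> {staging}")
    commands.append(f"base64 -d {staging} > {dest}")
    # sha256sum -c reads "<digest>  <path>" from stdin; fails on mismatch
    commands.append(f"echo '{digest}  '{dest} | sha256sum -c -")
    commands.append(f"chmod +x {dest}")
    return commands
```

Because each command is a plain one-liner with no heredoc, the same
list can be fed to a local /bin/sh pipe or to a remote shell session.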

tests/test_tier3_local_verify.py wraps the deterministic steps
(module parse, manifest, chunked upload) so pytest catches
regressions automatically. 174/174 total.

Operator workflow: ssh into Pi (or lab host), run:
  /opt/cis490/.venv/bin/python tools/verify_tier3_local.py
Each step prints PASS/FAIL/SKIP with detail. Exit 1 if any FAIL.
2026-05-01 03:41:21 -05:00
max
b809e1e26e auto_fetch_samples: pick Linux i386 ELF; manifest matches theZoo
User caught it: I shipped the theZoo path without running it
end-to-end. A real fetch on the Pi exposed two bugs:

1. Family-name matcher was substring-strict. "Cryptolocker-class"
   wouldn't match the dir "CryptoLocker_22Jan2014" because "-class"
   isn't in the dir name. Now expands to a sequence of tokens
   (full, head-of-dash, head-of-dot, head-of-underscore) and tries
   each. First match wins.

2. Extraction picker was "largest non-text" — a bad heuristic for
   theZoo, where each Linux.* zip often contains MULTIPLE binaries
   for different platforms (Linux i386, x86-64, ARM, FreeBSD, sometimes
   even Windows PE). The largest is rarely the i386 Linux ELF that
   would actually run on Metasploitable2. Now sniffs ELF magic bytes
   in stdlib and tiers:
     1. Linux i386 ELF (largest first)
     2. any other ELF (best-effort, may not execute)
     3. largest non-text (Wine fallback)
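
The ELF sniff in item 2 needs only the first 20 bytes of a file. A
sketch using the standard ELF header layout; the function names are
illustrative, and tier 3 is simplified to "anything else" rather than
the "non-text" test the real picker applies:

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELFCLASS32 = 1       # e_ident[EI_CLASS]: 32-bit objects
ELFDATA2LSB = 1      # e_ident[EI_DATA]: little-endian
ELFOSABI_SYSV = 0    # e_ident[EI_OSABI]: what most Linux binaries carry
ELFOSABI_LINUX = 3   # e_ident[EI_OSABI]: GNU/Linux
EM_386 = 3           # e_machine: Intel 80386


def is_linux_i386_elf(header: bytes) -> bool:
    """Check the leading bytes of a file for a 32-bit x86 Linux ELF."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return False
    ei_class, ei_data, osabi = header[4], header[5], header[7]
    if ei_class != ELFCLASS32 or ei_data != ELFDATA2LSB:
        return False
    if osabi not in (ELFOSABI_SYSV, ELFOSABI_LINUX):
        return False  # e.g. FreeBSD binaries set EI_OSABI to 9
    (e_machine,) = struct.unpack_from("<H", header, 18)
    return e_machine == EM_386


def pick_tier(header: bytes) -> int:
    """1 = Linux i386 ELF, 2 = any other ELF, 3 = everything else."""
    if is_linux_i386_elf(header):
        return 1
    return 2 if header[:4] == ELF_MAGIC else 3
```

Sorting candidates by (tier, -size) then reproduces the "largest first
within the best tier" ordering.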

Verified end-to-end on the Pi against a real theZoo clone (~500 MB,
263 family dirs, 2026-05-01 fresh pull):

  linux-encoder-ransomware  → ELF 32-bit Intel i386 SYSV (278 KB)
  linux-wirenet-rat         → ELF 32-bit Intel i386 SYSV (64 KB)
  linux-rex-ransomware      → ELF 32-bit Intel i386 SYSV Go (7.6 MB)
  linux-neurevt-bot         → ELF 32-bit Intel i386 SYSV (3.0 MB)
  linux-earthkrahang-apt    → ELF 32-bit Intel i386 GNU/Linux (5.8 MB)

5/5 picks are runnable Linux i386 ELFs. The in-place manifest rewrite
adds source/sha256/url; meta.sample.kind goes to "real" automatically.

Manifest rewritten:
  - Old families (XMRig, Mirai, Cryptolocker-class, Dridex, Kovter,
    Reverse-Shell) → mostly absent from theZoo's Linux catalog or
    matched the wrong arch.
  - New families chosen against a verified theZoo presence list:
    Linux.Encoder, Linux.Wirenet, Ransomware.Rex, Neurevt,
    EarthKrahang.
  - XMRig + Kovter remain as mimic-only fallbacks (theZoo lacks a
    runnable Linux i386 binary for these; orchestrator falls back
    to the mimic profile).

Tests added (tests/test_auto_fetch_samples.py): 13 cases covering
ELF magic detection (i386 accepted, FreeBSD/x86-64/ARM/PE32/text
all rejected), family-token expansion (the "-class" suffix bug),
extraction picker (prefers Linux i386 over larger non-Linux ELFs),
manifest in-place rewrite preserves mode + skips entries that
already have sha256.

What's still NOT verified end-to-end (requires a lab host with
KVM x86):
  - Metasploitable2 boot under QEMU
  - vsftpd_234_backdoor exploit fire via msfrpcd
  - chunked binary upload through a real shell session
  - real binary executing inside a Metasploitable2 guest

The Pi is ARM64 — can't run Metasploitable2. install-tier-3-4.sh's
verify step (run_tier3_demo.py) covers all four on a real lab host;
deploy verifies on first run there.

171/171 tests pass.
2026-05-01 03:28:26 -05:00
max
f8ad02b2d7 Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed)
Stops out-of-date lab hosts from polluting the dataset with episodes
generated by buggy code. The valid-commits set mirrors the maintainer's
working clone on the Pi automatically — when the maintainer pulls or
pushes a new commit, the receiver picks it up within the 5-second
cache TTL with no service restart.

Receiver changes:

- receiver/version_gate.py (new): VersionGate(repo_path, window).
  Each check() consults a frozenset of the last `window` commit
  hashes from `git -C <repo> log --format=%H -n <window>`, refreshed
  every 5s under a lock. Resilient to transient git failure (keeps
  prior cache so a flaky `git` doesn't lock out every shipper).

- receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check()
  before ingest. Rejects with:
    400 + remediation if header missing or malformed
    412 + remediation + your_commit + head_commit if not in window
  Remediation block is verbatim copy-pasteable into the lab-host
  shell:
    cd /opt/cis490 && sudo -u cis490 git pull origin main
    sudo /opt/cis490/scripts/install-lab-host.sh
    sudo systemctl restart cis490-orchestrator

- receiver/store.py: ingest_stream takes commit kwarg, stamps it on
  the index.jsonl row (new optional field). Backfilled rows from
  index_backfill.py also pull commit out of meta.json.

- receiver/config.py + etc/receiver.toml.example: new [version_gate]
  section. enabled=true, repo_path=/home/max/cis490, window=100 by
  default. Enabled toggle exists for emergency disable-and-collect.
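
The caching behaviour described for receiver/version_gate.py above
(TTL refresh under a lock, prior cache kept on failure) can be
sketched as follows. `CachedCommitSet`, the injectable `load`/`clock`
parameters, and `git_loader` are illustrative assumptions, not the
actual VersionGate implementation:

```python
import subprocess
import threading
import time
from typing import Callable, FrozenSet, Iterable, Optional


class CachedCommitSet:
    """Allow-list of commit hashes, refreshed at most once per TTL."""

    def __init__(self, load: Callable[[], Iterable[str]], ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._commits: FrozenSet[str] = frozenset()
        self._loaded_at: Optional[float] = None

    def check(self, commit: str) -> bool:
        with self._lock:
            now = self._clock()
            if self._loaded_at is None or now - self._loaded_at >= self._ttl_s:
                try:
                    self._commits = frozenset(self._load())
                except Exception:
                    pass  # transient failure: keep serving the prior cache
                self._loaded_at = now  # retry only after the next TTL
            return commit in self._commits


def git_loader(repo_path: str, window: int) -> Callable[[], Iterable[str]]:
    """Build a loader that lists the last `window` hashes of a repo."""
    def load() -> Iterable[str]:
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "--format=%H", "-n", str(window)],
            check=True, capture_output=True, text=True,
        ).stdout
        return out.split()
    return load
```

Injecting the loader and clock keeps the refresh logic testable
without a real repository or real sleeps.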

Shipper changes:

- shipper/transport.py: ship_tarball() takes commit kwarg, sends
  X-Cis490-Code-Commit header. 412 maps to status='fatal' so the
  queue doesn't infinite-retry — operator must pull and reinstall
  before the next ship will succeed.

- shipper/queue.py: reads meta.json::code_version.commit per
  episode, passes through. On 412, logs the receiver's full
  remediation block at ERROR level so journalctl on the lab host
  shows exactly what to run.

Tests: 9 in test_version_gate (including 2 end-to-end via
starlette.testclient), 2 cover the boundary where new commits land
mid-cache and where missing-repo gracefully keeps prior cache.
157/157 total.

Index schema: existing rows stay valid (commit field is optional
on read). New rows from receiver-direct AND from index_backfill.py
include commit.
2026-05-01 01:38:50 -05:00
max
265f3ad313 Tier-4 sample source: theZoo (no auth, no operator action)
Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo).
theZoo is a public security-research repo with hundreds of malware
samples organized by family, password-protected with the well-known
'infected'. No API key, no signup, nothing for an operator to do —
which is what zero-touch tier-4 actually means.

Changes:

- tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB)
  to /var/lib/cis490/theZoo on first run, then for each manifest
  family without a sha256 it locates a matching Binaries/<Name>
  dir, extracts the .zip with password 'infected', picks the largest
  non-text payload as the binary, sha256s it, stages at
  samples/store/<sha256>, and rewrites manifest.toml in place
  (atomic tempfile + os.replace, stat preserved). Mandatory exit
  semantic: non-zero if no real samples landed.

- scripts/install-tier-3-4.sh: dropped the MB-key resolution chain
  (env var → local file → bootstrap.wg fetch). Now just runs
  auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4
  remains as the explicit override but is documented as defeating
  the project.

- bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service:
  removed the /v1/secret/<name> endpoint and the --secrets-root flag.
  Dead code now that no API key needs distributing. Live-rolled
  back on the Pi (404 verified post-restart, stale /etc/cis490/secrets
  dir removed).

- scripts/set-malwarebazaar-key.sh: deleted. No MB key means no
  one-time operator step.

- tests/test_bootstrap_secrets.py: deleted (route removed).

- AGENTS.md: rewrote tier-4 section to reflect zero-operator model.

148/148 tests pass. Bootstrap service rolled back live.
2026-05-01 01:17:50 -05:00
max
5d0e8e33a9 Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:

1. bootstrap.wg /v1/secret/<name> endpoint
   - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
     over the same trust boundary as the cert endpoint (WG mesh,
     iptmonads-gated). Strict allow-list — only `malwarebazaar`
     resolves; everything else 404s. Secret returned as bare text
     with Cache-Control: no-store. Live-verified on the Pi.
   - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
     200 with token, 404 unknown name, 500 on empty file.

2. install-tier-3-4.sh: Tier 4 is no longer optional
   - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
     → https://bootstrap.wg/v1/secret/malwarebazaar.
   - Caches the bootstrap-fetched key locally so re-runs are offline.
   - If all three resolution paths fail, dies with the exact
     remediation command for the operator (one-time set-malwarebazaar-key.sh
     on the Pi).
   - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
     works for emergency overrides but logs a warning that the host
     will produce only mimics). Deploy fails if zero binaries land
     in samples/store/ — no silent mimic-only fallback.
   - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.

3. scripts/set-malwarebazaar-key.sh
   - Pi-side helper: one operator command per fleet, ever. Accepts
     key via env or stdin, validates length, drops at the right
     path with the right perms. Lab hosts pull the rest automatically.

AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.

152/152 tests pass. Bootstrap service updated live on the Pi.
2026-05-01 00:44:41 -05:00
max
683bfe9ce6 Tier 3 + Tier 4 auto-deploy: zero operator interaction
Replaces the manual runbook with scripts that just work. install-lab-host.sh
now runs the full Tier-3 deploy automatically as its 8th step (after the
mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set.

Changes:

- install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when
  metasploit-framework isn't present (was: bail with "install manually").
  apt-get and dnf paths both go through the same omnibus script with
  DEBIAN_FRONTEND=noninteractive. Idempotent.

- fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL
  (https://downloads.sourceforge.net/project/metasploitable/...) so no
  operator URL is required. sha256 is now optional and TOFU-pinned —
  first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256;
  subsequent runs verify against that. Skips if qcow2 already present.

- scripts/install-tier-3-4.sh (new): orchestrates the four steps
  (msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional
  Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4
  env knobs for partial deploys.

- tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set,
  queries MB by each manifest entry's `family` (signature match), pulls
  the first match via fetch_sample.py, and rewrites manifest.toml in
  place (atomic tempfile + os.replace, preserving stat). Skips entries
  that already have sha256.

- install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh
  automatically when mTLS certs are on disk. --skip-tier3 flag for
  operators who want Tier 2 only. Skipped silently before certs land
  so first-pass install (host_id=REPLACE_ME) still works.

- AGENTS.md: rewrote the Tier-3 section to point at the one-shot
  script. Removed the old multi-command runbook so on-device agents
  can't accidentally follow stale steps.

Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key
present) from a single sudo invocation. No operator picks for image
URLs, no manual metasploit installs, no manual manifest edits.
2026-04-30 23:12:08 -05:00
max
321ea63803 Multi-signal prune classifier: rescue valid episodes /proc misses
A laptop-class lab host (elliott-thinkpad) running 14 parallel fleet
slots can't deliver host /proc CPU% signal for the bursty profiles —
the per-VM share gets buried under contention. But the workloads ARE
running: qmp blockstats record 90+ MB written during infected_running
for io-walk episodes, netflow shows real packet bursts for
scan-and-dial, and the in-guest agent (when alive) shows load_1m
deltas the host can't see.

The classifier now cross-checks four sources before flagging an
episode:
  - /proc CPU% medians (host-side qemu)
  - netflow byte totals (bridge_pcap)
  - qmp blockstats per-phase DELTA (cumulative counters; deltas
    matter, not raw values)
  - guest-agent load_1m

An episode flags only if every available source agrees there is no
inter-phase signal. Missing sources are "unknown", not "flat".
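
The agreement rule can be sketched as a three-valued check. The
function name, the source keys, and the choice not to flag when every
source is unknown are assumptions for illustration; the real
classifier derives each source's verdict from telemetry first:

```python
from typing import Mapping, Optional


def all_sources_flat(signals: Mapping[str, Optional[bool]]) -> bool:
    """Decide whether an episode should be flagged as signal-free.

    Each value is True (source shows inter-phase signal), False (source
    is flat), or None (source missing, i.e. unknown).
    """
    known = [verdict for verdict in signals.values() if verdict is not None]
    # Assumption: with no evidence at all, do not flag the episode
    if not known:
        return False
    # Flag only when every source that reported agrees there is no signal
    return not any(known)
```

A single source reporting real signal is enough to rescue an episode,
which is how flat /proc CPU% gets overruled by netflow or blockstats.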

Time-base bug also fixed: phase mapping now uses t_wall_ns (which
all sources stamp from CLOCK_REALTIME) rather than t_mono_ns —
netflow uses qemu boot-monotonic, /proc uses orchestrator-relative,
they don't share a number line.

Result on the live receiver:
  - 1067 active episodes, 100% kept under the new logic
  - 143 episodes rescued from a previous false-positive archive
  - Only the 9 genuinely-broken pre-Sample-propagation elliott-lab
    episodes remain archived (no-sample + no-workload-events)

Two new tests (test_flat_proc_rescued_by_netflow,
test_flat_everywhere_still_flags) pin the boundary so a future
regression surfaces immediately.

AGENTS.md gains a "classifier is multi-source" section explaining
the cross-check and the t_wall_ns invariant.
2026-04-30 19:10:01 -05:00
3d4936a227 Merge remote-tracking branch 'origin/main' into Dev_REL1_043026 2026-04-30 16:34:01 -06:00
max
2707709299 Fix workload-silent false-positive on Alpine busybox guests (closes #15)
On-device agent (k-gamingcom) ran the diagnostic probe sequence and
proved the workload IS running on Alpine — yes saturating the vCPU,
loadavg=1.05, three yes PIDs visible — but two busybox incompatibilities
made every episode look silent:

1. _probe() used `pgrep -c yes`. The -c flag is a procps-ng extension
   that busybox pgrep lacks. busybox pgrep exits 1 with a usage banner; the
   `|| echo 0` fallback then reported yes=0 every time. Switched to
   `pgrep yes | wc -l` which both pgrep variants support.

2. _wrap_loop appended `disown` after the nohup-backgrounded script.
   busybox sh / ash have no disown builtin, so each infected_running
   phase printed `sh: disown: not found` into run()'s captured output.
   The script kept running (nohup gives SIGHUP immunity, which is
   what disown was for), but the spurious error is now gone.

Cross-validation in the classifier:
- prune_episodes.py: workload-silent now requires the probe AND
  host-side /proc CPU envelope (flat-cpu) to AGREE. A probe-only zero
  is treated as the busybox false-positive and dropped. This means
  the 244 already-on-disk episodes from elliott-thinkpad and
  k-gamingcom are correctly classified without re-collecting.

Test coverage:
- test_workload_silent_flag updated to require both signals
- test_workload_silent_suppressed_when_host_cpu_real new regression
  for the busybox false-positive

AGENTS.md gains a "Don't trust the in-guest probe alone" section with
the busybox-vs-procps gotcha + a list of busybox-incompatible patterns
to avoid in any new in-guest diagnostic.
2026-04-30 17:28:48 -05:00
b42d073669 Merge remote-tracking branch 'origin/main' into Dev_REL1_043026 2026-04-30 15:48:23 -06:00
max
8d2d0d2e99 prune+receiver: preserve index ownership and add a backfill helper (closes #13)
Root cause of #13 (PUT 500s on first ship, retries return already-present):
my earlier prune-tool session ran as root and rewrote the live index via
os.replace(), which drops the original ownership/mode. The new file was
root:root and the cis490 service user couldn't append to it. Every fresh
PUT 500'd on _append_index after the tarball had already landed via
os.replace, so retries always saw "already-present" and never recovered
the missing index row.

Two fixes:

- tools/prune_episodes.py: snapshot the index's stat before the rename
  and restore uid/gid/mode after. Best-effort chown so non-root prune
  runs (where chown would EPERM) still succeed; non-root callers
  matched the original owner anyway.

- tools/index_backfill.py: new tool. Walks episodes/<host>/*.tar.zst,
  computes sha256+size, and appends rows for episodes missing from
  the index. Marks each row "backfilled: true" so trainers can distinguish
  reconstructed rows. Always opens the index in append mode (never
  replaces), so it cannot reproduce the ownership bug it's recovering
  from.
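
A sketch of an ownership-preserving atomic rewrite in the spirit of the
first fix. `atomic_rewrite` is a hypothetical helper, and it applies
mode and owner to the temp file before the rename rather than after
it, which is a variation on what the commit describes:

```python
import os
import stat
import tempfile


def atomic_rewrite(path: str, data: bytes) -> None:
    """Replace `path` with `data`, keeping its mode and (if allowed) owner."""
    original = os.stat(path)  # snapshot before anything is replaced
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, stat.S_IMODE(original.st_mode))
        try:
            os.chown(tmp_path, original.st_uid, original.st_gid)
        except PermissionError:
            pass  # best-effort: a non-root caller cannot change ownership
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise
```

mkstemp creates files as 0600 owned by the caller, which is exactly how
a root-run rewrite ends up unreadable to the service user unless the
original stat is restored.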

Regression test: tests/test_prune.py::test_archive_preserves_index_mode.

Operator note for the live receiver: ran the chown fix manually
(chown cis490:cis490 /var/lib/cis490/index.jsonl) and ran the
backfill once to recover 140 elliott-thinkpad rows that 500'd before
the chown landed.
2026-04-30 16:36:05 -05:00
7c35bf7d49 Merge commit '86a088c' into Dev_REL1_043026 2026-04-30 15:16:41 -06:00
7683b64929 Merge origin/main into Dev_REL1_043026; accept main's service files
Cherry-picks all upstream additions (fleet runner, full collector suite,
shipper module, exploit driver, samples, scripts/, cis490_doctor, etc.)
and resolves the two service-file conflicts by accepting main's production
versions over the stubs we wrote on Day 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:05:51 -06:00
95ac56a382 fix: three install-time bugs found during first lab-host bring-up on k-gamingcom
1. pyproject.toml — move pycdlib to main deps (was dev-only; cidata build
   fails on first install because the venv doesn't include dev extras).

2. scripts/install-lab-host.sh — create vm/images/ dir before symlinking
   alpine-baseline.qcow2 and cidata.iso into INSTALL_ROOT. Without the
   mkdir the ln -sf silently fails (|| true), leaving the launchers unable
   to find the images and causing every episode to fail within 15 s.

3. tools/cis490_doctor.py — two fixes:
   a. Insert repo_root into sys.path at doctor startup so the inline
      `from exploits.modules import ...` succeeds when running from /opt/cis490
      (package = false means nothing is installed into site-packages).
   b. Pass cwd=/opt/cis490 to the shipper --ping subprocess so python -m
      shipper resolves the module correctly regardless of the caller's CWD.

Tested on k-gamingcom: install script now builds cidata.iso on first run,
7-slot fleet wave completes with rc=0, doctor shows 13 ok / 4 warn / 2 fail
(remaining failures are mTLS certs + collector.wg DNS — both need Pi-side
action, not code changes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:05:00 -06:00
842918556b Add automated campaign runner, shipper, and systemd units
Implements the unattended episode loop described in docs/deploy.md but not
yet built. run_campaign.py boots a fresh VM per episode, drives the full
phase schedule via the existing EpisodeRunner/VMLoadController stack, writes
campaign.json atomically after each episode, and signals completion with
campaign_done.marker. shipper.py watches data/episodes/ for done.marker
files, tar+zstd-compresses each, and PUTs them to the receiver with
exponential backoff on failure. Both support SIGTERM gracefully, finishing
the current episode/scan before exiting.
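
For illustration, a capped exponential backoff schedule of the kind
the shipper description implies. The base, cap, and factor values are
assumptions, not the shipper's real settings:

```python
import itertools
from typing import Iterator


def backoff_delays(base_s: float = 1.0, cap_s: float = 300.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays that grow geometrically up to a ceiling."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)  # clamp so the value stays bounded


# First six delays with a 1 s base and an 8 s cap
first_six = list(itertools.islice(backoff_delays(1.0, 8.0), 6))
```

The cap keeps a long receiver outage from pushing the retry interval
out indefinitely.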

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:53:40 -06:00
max
a61fa05980 cis490-prune: retroactively filter low-quality episodes from the dataset
Without a prune step, every fix we land before elliott-lab pulls
leaves a residue of pre-fix episodes in /var/lib/cis490/episodes/.
Trainers either filter at training time (processing the bad data
anyway) or — worse — train on it. This tool walks the receiver's
index, classifies each episode against five quality signals, and
either prints a dry-run summary, archives flagged episodes to
/var/lib/cis490/episodes-archive/, or deletes them outright (with
the index rewritten atomically).

Quality signals (each independent; a bad episode can hit several):

  no-sample           meta.sample is null. Pre-Sample-propagation code
                      ran the v1 yes-loop fallback regardless of fleet
                      selection, so the post-infection family isn't
                      recorded.

  no-workload-events  events.jsonl has zero workload_* rows. Pre-audit-
                      trail code (before VMLoadController emits) — we
                      can't tell whether the workload actually fired.

  workload-failed     events.jsonl contains workload_failed. SerialClient
                      raised mid-phase; labels and telemetry don't match
                      what the orchestrator was supposed to be doing.

  workload-silent     workload_killed event during dormant has
                      pre_kill_probe.yes == "0". The schedule walked
                      but the in-guest workload never started — the
                      elliott-lab fingerprint.

  flat-cpu            /proc CPU% medians spread <5pp across phases.
                      A model can't learn to distinguish phases from
                      this; pure noise to the trainer.
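
The flat-cpu test reduces to a spread over per-phase medians. A
sketch, assuming CPU% samples are already grouped by phase; the
function name and the "fewer than two phases is not flat" choice are
illustrative:

```python
from statistics import median
from typing import Mapping, Sequence


def is_flat_cpu(cpu_by_phase: Mapping[str, Sequence[float]],
                threshold_pp: float = 5.0) -> bool:
    """True when per-phase median CPU% values span < threshold_pp points."""
    medians = [median(samples) for samples in cpu_by_phase.values() if samples]
    if len(medians) < 2:
        return False  # assumption: too few phases to judge a spread
    return max(medians) - min(medians) < threshold_pp
```

Medians rather than means keep one short burst from masking an
otherwise flat phase.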

CLI:
  cis490-prune                      # dry-run summary
  cis490-prune --reason no-sample   # restrict to one signal (repeatable)
  cis490-prune --host elliott-lab   # scope to one lab host
  cis490-prune --archive            # mv flagged → episodes-archive/
  cis490-prune --delete             # rm flagged + drop index rows
  cis490-prune --json               # machine-readable

Index rewrite is atomic: tempfile + os.replace, so a crash mid-write
leaves the live index intact.

Tests: 143 (was 132). New cases (tests/test_prune.py):
  - one healthy synthetic episode produces zero reasons
  - five tests covering each individual reason flag
  - dry-run leaves disk + index untouched
  - --archive moves tarballs and rewrites index
  - --delete removes tarballs and rewrites index
  - --host filter scopes correctly (no-match → exit 0)
  - multi-reason episodes report all matching reasons

Live state when this commit lands: 9 elliott-lab episodes from the
pre-fix code path, all flagged. Operator can clear them with one
command before elliott-lab re-ships under main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:41:10 -05:00
max
642f7a94d6 runners: take savevm baseline-v1 after boot so revert_at_* actually works
EpisodeConfig.revert_at_start / revert_at_end have been issuing
loadvm "baseline-v1" via QMP since the snapshot/revert wiring landed,
but no part of the system was running savevm — so loadvm targeted a
snapshot that didn't exist and silently emitted snapshot_revert_failed
every time. The reverted-baseline mode was, in effect, dead code.

Both runners now take a savevm immediately after the guest is up
and reachable, before any workload runs:

  run_real_vm_demo.py — after SerialClient.login() succeeds (Tier 2)
  run_tier3_demo.py   — after _wait_for_tcp on the vulnerable port
                        (Tier 3, before the exploit fires)

Both call qmp.QMPClient.savevm("baseline-v1"). Best-effort: if savevm
fails (older qemu, non-qcow2 disk, KVM nesting issue), we log a
warning and run the episode anyway — just without revert support.

The snapshot_name in EpisodeConfig is unified to "baseline-v1" across
both runners (Tier 3 was previously stamping "qcow2-snapshot-on" into
meta, which didn't match what loadvm would target).

Why both runners take savevm individually instead of a unified path:
the two runners boot different launchers (launch_demo.sh for the
Alpine cidata image, launch_target.sh for the vulnerable target).
Each is responsible for its own QMP socket lifecycle. A shared
savevm helper module would just be a one-line wrapper around the
existing qmp.QMPClient.savevm; not worth the indirection.

Existing test coverage: tests/test_qmp.py exercises
QMPClient.savevm/loadvm against a fake server (HMP wrapper, error
path). The runner-side call is exercised in production but not in
unit tests — would need a fake launcher subprocess, which is outside
this commit's scope.

132/132 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:37:05 -05:00
max
507eac617b Solvable Tier-3 holes: callback payloads, busybox workloads, bridge by default
Closes the next batch of issues from the post-mortem. The previous
"each run uses a different vulnerability" commit shipped 5 modules
but 3 of them couldn't actually fire under SLIRP+restrict=on:
their reverse-shell payloads needed a callback channel the launcher
didn't provide, AND their LHOST options were set to {{ target_ip }}
(the target's IP, not the attacker's — copy-paste from RHOSTS).
At the same time, the workloads.py shell commands used bash-only /dev/tcp
redirects that silently no-op'd in the busybox shell sessions
Metasploitable2 returns. Net effect: episodes that selected those
modules would have produced session_open_timeout + dead workloads.

Module configs (the three callback ones):
  exploits/modules/distccd_command_exec.toml
  exploits/modules/php_cgi_arg_injection.toml
  exploits/modules/unreal_ircd_3281_backdoor.toml
    - Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl
      so the target listens on a known port; msfrpcd connects to it
      via the host's hostfwd (no callback path required).
    - Drop the bogus LHOST = "{{ target_ip }}" — bind shells don't
      use LHOST.
    - Add [runtime] table:
        requires_bridge = true
        extra_target_ports = [<bind_lport>]
      Both fields are honored by the loader (ModuleConfig.requires_bridge)
      and the launcher (TARGET_PORTS gets the extra port hostfwd'd
      when BRIDGE mode is active).

orchestrator/fleet.py
  When BRIDGE is unset in env, _run_slot filters the module catalog
  down to modules where requires_bridge=False before calling
  select_module. Two same-socket-shell modules (vsftpd_234_backdoor +
  samba_usermap_script) survive — fleet still has variety; just
  doesn't pick modules whose payloads can't land. With BRIDGE set,
  the full catalog rotates as before, AND BRIDGE is propagated to
  the per-slot subprocess env so launch_target.sh enters tap+bridge
  mode.
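
  The catalog filter amounts to a few lines. This sketch uses a
  stand-in ModuleConfig with only the fields needed here and treats an
  empty BRIDGE value as unset; both are assumptions, and the real
  _run_slot also performs selection and env propagation:

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class ModuleConfig:
    # Stand-in for the loader's ModuleConfig; only the fields used here
    name: str
    requires_bridge: bool = False


def eligible_modules(catalog: Sequence[ModuleConfig],
                     env: Mapping[str, str] = os.environ) -> List[ModuleConfig]:
    """Drop bridge-only modules when no BRIDGE is configured."""
    if env.get("BRIDGE"):
        return list(catalog)  # bridge available: full catalog rotates
    return [m for m in catalog if not m.requires_bridge]
```

  Filtering before selection means the selector never has to know why
  a module is unavailable on a given host.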

exploits/workloads.py
  Replaced bash-only constructs in three profiles:
    scan-and-dial  /dev/tcp/HOST/PORT redirects → nc -z -w 1
    bursty-c2      same fix
    shell-resident exec 3<>/dev/tcp/...  → piping into nc -w
  All three now run cleanly in busybox / dash / Metasploitable2's
  default shell. The remaining three profiles (cpu-saturate, io-walk,
  low-and-slow) were already busybox-portable.

scripts/install-lab-host.sh
  - lab-host.env now defaults BRIDGE=br-malware (was commented out).
    Operator opt-out is to comment the line out again.
  - New step 6b: provisions br-malware via vm/setup_bridge.sh AND
    pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo,
    cis490target0..7 for Tier-3 target) all attached to br-malware
    and brought up. Launchers reference these by SLOT — no sudo
    needed at episode time.
  - On bridge-setup failure, the script auto-comments BRIDGE in the
    env file with a "auto-disabled: bridge setup failed" note so
    the fleet falls back to same-socket modules + Tier-2 cleanly.

tools/cis490_doctor.py
  Three new checks for the lab-host role:
    bridge: br-malware exists / up
    tier3: msfrpcd listening on 127.0.0.1:55553
    tier3: module catalog parses (counts same-socket vs requires_bridge)
  All three are warn-level — they don't fail an otherwise-healthy
  Tier-2-only setup; they tell the operator what's missing for full
  Tier-3 + source 4 coverage.

Tests: 132 (was 129). New cases:
  test_fleet.py +3
    - fleet skips requires_bridge modules when BRIDGE unset (asserted
      across 20 episodes; never picks a callback module)
    - fleet uses the full catalog when BRIDGE is set
    - BRIDGE env propagates to per-slot subprocess

What's still untested live: the bind_perl payloads against a real
Metasploitable2 in the bridge-enabled launcher path. That's a
deployment validation, not a code change. The unit tests confirm
the dispatch / filter logic; the live test is the next operator
action.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:32:52 -05:00
max
a193d17ead fleet: rotate exploit modules per (host, slot, ep); Tier 3 by default
Closes the "every run hits the same vulnerability" gap. Before this
commit, the fleet shipped Tier-2 episodes (no exploit at all) with
only the post-infection sample varying. Tier-3 had a single canned
module — vsftpd_234_backdoor — so even when exploit fire was
exercised, the entry vector never changed. Trainer would see one
shape of `armed → infecting` and learn nothing about how varied
real exploits look on the wire / in /proc.

What landed:

exploits/modules/
  + samba_usermap_script.toml          CVE-2007-2447, SMB:139
  + distccd_command_exec.toml          CVE-2004-2687, distcc:3632
  + php_cgi_arg_injection.toml         CVE-2012-1823, http:80
  + unreal_ircd_3281_backdoor.toml     CVE-2010-2075, ircd:6667
  (vsftpd_234_backdoor.toml unchanged)
  All five are canonical Metasploitable2 vectors with stable
  Metasploit modules. Each TOML carries the RPORT the launcher
  needs to wire its hostfwd at, plus a payload tuned to a clean
  shell session (cmd/unix/interact for in-band shells,
  cmd/unix/reverse* with deterministic LPORTs for reverse shells).

exploits/modules.py
  + select_module(catalog, host_id, slot, episode_index) — same
    SHA-256-keyed deterministic selection shape SampleManifest uses
    for samples. Two hosts at the same slot/episode hash to
    different modules; one host walks the full catalog within
    ~len(catalog) episodes.
  + module_target_port() — pulls RPORT off the module config so
    the fleet can plumb the launcher's hostfwd at the right service.
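
One way such a SHA-256-keyed selection could look, as a sketch only. The key
format ("host:slot") and the rotate-by-episode scheme are assumptions chosen
so that a host walks the catalog in exactly len(catalog) episodes; the real
select_module may derive its index differently.

```python
import hashlib


def select_module(catalog, host_id, slot, episode_index):
    """Pick one catalog entry deterministically from (host, slot, episode)."""
    if not catalog:
        raise ValueError("empty module catalog")
    # Hash (host, slot) to a stable starting offset, then rotate by episode.
    digest = hashlib.sha256(f"{host_id}:{slot}".encode()).digest()
    offset = int.from_bytes(digest[:8], "big")
    return catalog[(offset + episode_index) % len(catalog)]


CATALOG = [
    "vsftpd_234_backdoor",
    "samba_usermap_script",
    "distccd_command_exec",
    "php_cgi_arg_injection",
    "unreal_ircd_3281_backdoor",
]

# Five consecutive episodes on one (host, slot) cover every module once.
walk = [select_module(CATALOG, "lab-a", 0, ep) for ep in range(len(CATALOG))]
```

Because the choice depends only on its inputs, no coordinator is needed and
a given episode can be re-derived later from its metadata.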

orchestrator/fleet.py
  - _run_slot now decides Tier 3 vs Tier 2 from msfrpcd reachability
    + module-catalog populated. Default is Tier 3 when both are true;
    Tier 2 fallback when not (logged + recorded in SlotResult.tier
    so trainers can filter no-exploit episodes).
  - Per-slot module via select_module() — each concurrent slot in a
    wave gets a different vector AND a different sample.
  - PORT_BASE per slot (target_port + slot * 1000) so concurrent
    Tier-3 targets don't collide on the host-side hostfwd port.
  - _msfrpcd_available() probe gates the dispatch.
  - Fleet-side log line records (slot, ep, tier, sample, module,
    run_dir) so the operator can see at a glance what each wave is
    exercising.
  - SlotResult grows tier + module_name fields; FleetConfig grows
    modules + force_tier2 + msfrpcd_{host,port} fields.
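
The PORT_BASE arithmetic above, as a small sketch. port_base is a
hypothetical helper name and the range check is an addition of this sketch,
not something the commit describes.

```python
def port_base(target_port: int, slot: int) -> int:
    """Host-side hostfwd base for a slot: target_port + slot * 1000."""
    base = target_port + slot * 1000
    if not 0 < base < 65536:
        raise ValueError(f"port {base} out of range for slot {slot}")
    return base


# Eight concurrent slots forwarding a service on port 21 get distinct ports.
ports = [port_base(21, slot) for slot in range(8)]
```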

orchestrator/episode.py
  + EpisodeConfig.exploit_meta — plain dict the runner stamps into
    meta.exploit so every Tier-3 episode records {framework,
    module path, module type, payload, RPORT, RHOSTS template}.
    Trainers join on meta.exploit.module_name to stratify by entry
    vector; meta.sample.name to stratify by post-infection family.

tools/run_tier3_demo.py
  + Builds exploit_meta from the loaded ModuleConfig and passes it
    to EpisodeConfig. Sample is now also passed (was missing).

tools/run_fleet.py
  + --modules-dir (default exploits/modules/) — load module catalog
    on startup; pass to FleetConfig.
  + --force-tier2 — escape hatch for dev / smoke tests.
  + JSON output now includes per-slot {tier, module} so the operator
    can see at a glance what each slot ran without grepping logs.

Tests: 129 (was 119). New cases:
  test_exploits.py +5
    - catalog has at least the five canonical Metasploitable2 vectors
    - select_module is deterministic per (host, slot, ep)
    - select_module diversifies across hosts
    - select_module walks the full catalog over many episodes
    - module_target_port pulls RPORT for each shipped TOML
  test_fleet.py +5
    - _run_slot dispatches to run_tier3_demo.py when msfrpcd up
    - falls back to run_real_vm_demo.py when msfrpcd unreachable
    - falls back when module catalog empty
    - --force-tier2 overrides msfrpcd availability
    - PORT_BASE is unique per concurrent slot (no hostfwd collision)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:22:49 -05:00
max
d86502d950 workload audit trail: meta.sample + per-phase events + pre-kill probe
The elliott-lab episode showed a median of 20% CPU in every phase because
the in-guest workload silently never fired — and there was no signal
in events.jsonl to detect that from outside, so a trainer would
treat the labels as ground truth and learn "all phases look identical".
This commit closes the audit gap so the failure is visible in meta:

orchestrator/episode.py
  EpisodeConfig.sample: Sample | None — the manifest entry that
  drove this episode's workload selection. Stamped into meta.sample
  as {name, family, category, profile, kind, sha256} so trainers
  can join cleanly without re-deriving from events. None means the
  v1 yes-loop fallback path ran (and the trainer should treat the
  episode with appropriate skepticism).

tools/vm_load_controller.py
  VMLoadController gains an emit_event callable. Every phase now
  emits a workload_* event into the runner's events.jsonl:
    workload_setup        login + initial cleanup OK
    workload_killed       clean / dormant. Dormant carries a
                          `pre_kill_probe` dict from inside the
                          guest (`pgrep -c yes`, `pgrep -c sh`,
                          /proc/loadavg) so the trainer can detect
                          the elliott-lab failure mode where the
                          workload never actually ran.
    workload_armed        armed handshake fired
    workload_infecting    dd urandom / payload write fired
    workload_started      infected_running command sent
    workload_failed       any of the above raised inside SerialClient
                          (timeout, EOF, partial login). The runner
                          would have silently swallowed the
                          exception via its on_phase try/except;
                          the audit row makes the failure detectable.
  Exceptions in shell calls surface as workload_failed events but
  do NOT propagate, matching the runner's existing on_phase
  contract.
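
The "surface as an event, do not propagate" contract can be sketched like
this. run_phase and its signature are hypothetical; the event names are the
ones listed above, and the optional callback mirrors the back-compat no-op
behaviour.

```python
def run_phase(event_name, action, emit_event=None):
    """Run one phase action and report the outcome as an audit event.

    Returns True on success. On any exception a workload_failed event is
    emitted and False is returned; the exception itself is not re-raised.
    """
    emit = emit_event or (lambda name, **fields: None)  # missing callback is a no-op
    try:
        action()
    except Exception as exc:
        emit("workload_failed", phase_event=event_name, error=repr(exc))
        return False
    emit(event_name)
    return True


events = []


def record(name, **fields):
    events.append((name, fields))


def broken_serial_call():
    raise TimeoutError("serial login timed out")


ok = run_phase("workload_armed", lambda: None, emit_event=record)
failed = run_phase("workload_started", broken_serial_call, emit_event=record)
```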

tools/run_real_vm_demo.py
  Wires the controller's emit_event to the runner's emit_event via
  a small forward-reference closure (controller is built before
  runner; runner.emit_event needs to be the sink). Sample also
  flows into EpisodeConfig.sample so meta.sample matches what the
  controller actually ran.
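
A sketch of the forward-reference closure pattern, assuming nothing about
the real classes beyond "the controller is built first and needs a sink
that only exists once the runner is built". Controller, Runner and the
holder dict are illustrative stand-ins.

```python
class Controller:
    def __init__(self, emit_event):
        self.emit_event = emit_event


class Runner:
    def __init__(self):
        self.events = []

    def emit_event(self, name, **fields):
        self.events.append((name, fields))


holder = {}  # filled in once the runner exists


def forward(name, **fields):
    """Relay to the runner if it has been built; drop the event otherwise."""
    runner = holder.get("runner")
    if runner is not None:
        runner.emit_event(name, **fields)


controller = Controller(emit_event=forward)  # built before the runner
controller.emit_event("too_early")           # no runner yet, silently dropped
runner = Runner()
holder["runner"] = runner
controller.emit_event("workload_setup", ok=True)
```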

Tests: 119 (was 106). New cases:
  tests/test_vm_load_controller.py  (11 tests against a FakeSerial)
    - setup emits workload_setup
    - infected_running runs the v1 yes-loop AND emits workload_started
    - dormant probes BEFORE killing and stamps pre_kill_probe
    - dormant probe records "yes=0" (the elliott-lab fingerprint)
    - clean / armed / infecting all emit their respective events
    - serial.run() exception → workload_failed event, no propagation
    - sample-with-profile dispatches to exploits.workloads command
      (NOT the v1 yes-loop)
    - missing emit_event callback is a no-op (back-compat)
  tests/test_episode.py  (2 new)
    - meta.sample carries name/family/category/profile/kind/sha256
      when EpisodeConfig.sample is set
    - meta.sample stays null in the v1 fallback path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:12:34 -05:00
max
8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run
Root cause of the "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:

  1. orchestrator/fleet.py::_run_slot set
     env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
  2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
     (NO slot suffix), then UNCONDITIONALLY overwrote the env's
     RUN_DIR with that flag's value before exec'ing the launcher.
  3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
     collided on the same socket dir.
  4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
     rmtree literally deleted slot 0's pidfile + sockets mid-boot.
  5. Net effect: one VM survives per wave on a multi-core host that
     should be running ~cores-1 in parallel. Throughput collapses
     to 1/N.

Fix:

  tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
    --run-dir default cascade —
      1) explicit CLI flag
      2) RUN_DIR env (set by fleet runner)
      3) /tmp/cis490-vm-<SLOT>  (SLOT from env, default 0)
    Same change in both runners so Tier-2 + Tier-3 fleet waves
    parallelize cleanly.
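
The three-step cascade can be sketched as a single resolver.
resolve_run_dir is a hypothetical name; the precedence and the
/tmp/cis490-vm-<SLOT> fallback follow the list above.

```python
import os


def resolve_run_dir(cli_value=None, env=None):
    """Resolve the run dir: CLI flag, then RUN_DIR env, then a per-slot default."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    if env.get("RUN_DIR"):
        return env["RUN_DIR"]
    return f"/tmp/cis490-vm-{env.get('SLOT', '0')}"
```

The important property is that no branch returns a path shared between
slots unless the operator asked for it explicitly.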

  orchestrator/fleet.py::_run_slot:
    Pass --run-dir explicitly to the subprocess so the per-slot path
    is audit-visible in the fleet log instead of buried in env.
    Also flip the subprocess interpreter to repo_root/.venv/bin/python
    when present (was /usr/bin/env python3 — worked by luck because
    the orchestrator path doesn't import msgpack/httpx, but a Tier-3
    fleet wave would have died at import-time on a host without those
    in system Python).

  etc/cis490-orchestrator.service:
    Removed the duplicate [Service] hardening block at the bottom of
    the file that was silently overriding the AmbientCapabilities
    grant (NoNewPrivileges=true at the bottom flipped the
    NoNewPrivileges=false at the top, dropping CAP_NET_RAW +
    CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit
    them). Sources 3 + 4 would have failed silently inside the
    sandbox.
    Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:55:56 -05:00
max
6f8b744c33 cis490-doctor + AGENTS.md operator runbook + louder install script
Adds the missing diagnostic + onboarding tools so an agent (AI or
human) handed a fresh lab host can get to "shipping data" without
re-deriving every step from logs.

tools/cis490_doctor.py — one-shot health check that walks the full
stack from the bottom up. Each row is green/yellow/red with an
exact fix command for the red rows. Checks:
  - repo: branch, tree-clean, distance from origin/main
  - install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml,
    /etc/cis490/lab-host.env
  - mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify
  - systemd: cis490-{shipper,orchestrator,receiver} active state
  - net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg
  - vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2,
    cidata.iso
  - tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level)
  - end-to-end: cis490-shipper --ping
Modes: --role {lab-host,receiver}, --json (machine-readable),
--no-tier3 (skip optional checks). Exits non-zero on any red row.
ANSI color (auto-disabled on non-tty / NO_COLOR).

AGENTS.md gains a "How a lab host gets to shipping data" canonical
flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh →
install-lab-host.sh → cis490-doctor → systemctl enable. Plus an
"on-demand episode" recipe + a "smallest E2E test" snippet for
agents that need to verify the pipe without waiting on the timer.
The strict "cloning the repo by itself does nothing" callout makes
the failure mode mu and elliott-lab hit explicit.

scripts/install-lab-host.sh prints a 5-step banner on first install
that points at cis490_doctor.py + the deploy-cis490-cert.sh flow,
plus an always-printed footer warning that "cloning + running
launchers manually is NOT enough." Same message the AGENTS.md
section reinforces.

Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime"
claim that turned out to be untrue because no cis490-shipper
service was running on elliott-lab — exactly the case this
diagnostic tool targets).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:11:57 -05:00
max
a88ac83db0 Close out the deployment-readiness gaps
Wraps the gaps surfaced in the "what is not implemented" audit so the
fleet really is shippable end-to-end. Verified live on the Pi:
  - cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the
    new wg-pki client CA leaf
  - real episode dir → tar+zstd → PUT → HTTP 201 stored
  - re-ship same bytes → 200 (idempotent)
  - re-ship different bytes under same id → 409 (conflict)
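
The 201 / 200 / 409 behaviour verified above can be modelled in a few lines.
This is a sketch of the semantics only: put_episode and the in-memory dict
are hypothetical, and comparing payloads by sha256 is an assumption about
how the receiver decides "same bytes".

```python
import hashlib


def put_episode(store, episode_id, payload):
    """Return the HTTP status an idempotent episode PUT would produce."""
    digest = hashlib.sha256(payload).hexdigest()
    existing = store.get(episode_id)
    if existing is None:
        store[episode_id] = digest
        return 201  # first upload: stored
    if existing == digest:
        return 200  # same bytes again: idempotent success
    return 409      # different bytes under the same id: conflict


store = {}
first = put_episode(store, "ep-0001", b"tarball-bytes")
again = put_episode(store, "ep-0001", b"tarball-bytes")
clash = put_episode(store, "ep-0001", b"other-bytes")
```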

Changes:

orchestrator/episode.py
  - EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/
    revert per docs/architecture.md). When set + qmp_socket present,
    EpisodeRunner issues loadvm <snapshot_name> and emits
    snapshot_revert / snapshot_revert_failed events on the same
    monotonic clock as everything else.

collectors/qmp.py
  - savevm() / loadvm() helpers using human-monitor-command, plus a
    test against the fake QMP server.

exploits/workloads.py
  - chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB
    base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer-
    busting payload. Includes a finalize step that sha256-verifies on
    the guest before exec.
  - real_binary_workload() now wraps the chunked plan for backwards
    compat with single-shot callers.
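
The chunk arithmetic works out because 6144 raw bytes encode to exactly
8192 base64 characters. A minimal sketch of the planning step follows;
plan_chunks is a hypothetical helper, not the ChunkedUpload API itself.

```python
import base64
import hashlib

CHUNK_B64 = 8 * 1024            # base64 characters per chunk (8 KiB)
CHUNK_RAW = CHUNK_B64 * 3 // 4  # 6144 raw bytes per chunk (6 KiB)


def plan_chunks(binary: bytes):
    """Split a binary into base64 chunks plus the sha256 to verify on the guest."""
    chunks = [
        base64.b64encode(binary[i:i + CHUNK_RAW]).decode("ascii")
        for i in range(0, len(binary), CHUNK_RAW)
    ]
    return chunks, hashlib.sha256(binary).hexdigest()


payload = bytes(range(256)) * 100  # 25600 bytes of sample data
chunks, expected_sha = plan_chunks(payload)

# Decoding each chunk independently and concatenating restores the input,
# which is what the guest-side finalize step would hash and compare.
restored = b"".join(base64.b64decode(c) for c in chunks)
```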

exploits/driver.py
  - Tier-4 dispatch walks the chunked plan in MSFExploitDriver:
    each chunk is a separate session_shell_write; finalize verifies;
    exec only runs on sha-ok. New events: real_binary_upload_begin,
    real_binary_verify, real_binary_aborted.

etc/cis490-orchestrator.service
  - Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE).
  - Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) +
    CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors
    work under hardening.

scripts/install-lab-host.sh
  - Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID
    defaulting to `hostname -s`.
  - Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and
    builds cidata.iso with the in-guest agent embedded; symlinks both
    into /opt/cis490/vm/images/ so launchers find them.

scripts/fetch-alpine-baseline.sh
  - Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2
    matching the sha512 in docs/sources.md.

tools/plot_envelope.py
  - Rebuilt to render whatever telemetry the episode dir contains:
    proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs →
    guest agent load/mem. Missing sources are silently skipped.

tools/index_reader.py
  - cis490-index CLI: filter receiver's index.jsonl by host / sample
    / time range, sort, count-by group. Closest thing to a query
    interface until we stand up Postgres/Timescale.

samples/README.md
  - Rewritten to match the new manifest schema, the kind=real vs mimic
    split, the per-(host, slot, ep) selection mechanic, and the
    chunked-upload safety story.

Tests: 106 pass (was 102). New cases:
  - test_qmp.py — savevm + loadvm (HMP wrapper + error path)
  - test_tier4.py — chunked plan splitting, sha-pinned finalize,
    end-to-end driver walks all chunks + verify + exec via the fake
    msfrpc client

Closes the "what is not implemented" punch list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:31:55 -05:00
max
bdcd2ecbef Close out the open issues: bridge pcap wiring, perf collector, Tier-4
Wraps the three remaining 🚧 items from the README so every collector
the threat-model promises is actually live, and the Tier-4 path
(real-malware fetch + upload + exec) works end-to-end as soon as a
sha256 lands in samples/store/.

Closes spectral/CIS490#4, #5, #6.

== #6 — Bridge pcap wiring ==
EpisodeConfig grows three optional fields:
  bridge_iface: str | None        # e.g. "br-malware"
  bridge_ip:    str = "10.200.0.1"
  pcap_snaplen: int = 256
When bridge_iface is set, EpisodeRunner spawns tcpdump for the duration
of the schedule (network.pcap), stops it cleanly on episode end, and
runs collectors.pcap.bucketize() to produce netflow.jsonl per the
100-ms schema in docs/data-model.md. EpisodeResult + meta.result
gain rows_netflow + pcap_bytes counters.
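
A sketch of 100 ms bucketing, under stated assumptions: packets arrive as
(timestamp in microseconds, length in bytes) pairs, and the output field
names t_ms / pkts / bytes are illustrative. The actual netflow.jsonl schema
lives in docs/data-model.md and is not reproduced here.

```python
from collections import defaultdict


def bucketize(packets, bucket_ms=100):
    """Aggregate (ts_us, length) packets into fixed-width time buckets."""
    width_us = bucket_ms * 1000
    buckets = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for ts_us, length in packets:
        row = buckets[ts_us // width_us]
        row["pkts"] += 1
        row["bytes"] += length
    # Emit rows in time order; buckets with no traffic are simply absent.
    return [
        {"t_ms": index * bucket_ms, **row}
        for index, row in sorted(buckets.items())
    ]


rows = bucketize([(10_000, 60), (50_000, 40), (120_000, 100), (310_000, 1500)])
```

Integer microsecond timestamps avoid floating-point edge cases at bucket
boundaries.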

vm/launch_demo.sh + launch_target.sh now switch between SLIRP usermode
and tap+bridge based on $BRIDGE — operator pre-creates the tap as a
bridge member, no sudo from the launcher.

run_real_vm_demo.py picks BRIDGE up from env so the fleet runner can
opt entire waves into pcap mode by exporting BRIDGE before invocation.

== #5 — Source 3 perf collector ==
collectors/perf_qemu.py shells out to ``perf stat -p <pid> -I 100 -j``
and parses the per-event JSON stream. Aggregates one row per interval
across the canonical event set (cycles/instructions/cache-{refs,misses}/
branches/branch-misses/page-faults/context-switches), computes IPC +
cache-miss rate. Tolerates missing events (``<not counted>`` /
``<not supported>``) without dropping the row, and skips cleanly when
``perf`` isn't on PATH or the process can't be attached.

EpisodeConfig.enable_perf=True opts into the collector — off by default
because perf needs CAP_SYS_ADMIN or perf_event_paranoid <= 1. When
enabled, runs as a parallel thread alongside the other collectors;
EpisodeResult.rows_perf records the count.
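
The derived metrics can be sketched as below. derive_metrics is a
hypothetical helper; it assumes the parser has already mapped each perf
event name to an integer count, or to None when perf reported the event as
not counted or not supported.

```python
def derive_metrics(counts):
    """Compute IPC and cache-miss rate, tolerating missing counters."""

    def ratio(numerator, denominator):
        n, d = counts.get(numerator), counts.get(denominator)
        if n is None or not d:  # missing value or zero denominator
            return None
        return n / d

    return {
        "ipc": ratio("instructions", "cycles"),
        "cache_miss_rate": ratio("cache-misses", "cache-references"),
    }


full = derive_metrics({
    "instructions": 200, "cycles": 100,
    "cache-misses": 5, "cache-references": 50,
})
partial = derive_metrics({"instructions": 200, "cycles": 100, "cache-misses": None})
```

Returning None for an unavailable ratio keeps the row in place, matching
the "without dropping the row" behaviour described above.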

== #4 — Tier 4 (real-malware fetch + upload + exec) ==
tools/fetch_sample.py: pulls a sample by sha256 from MalwareBazaar
(API key from env or samples/.bazaar.token), unzips with the standard
"infected" password, verifies the resulting binary's sha256, lands at
samples/store/<sha256>. Idempotent — already-staged correct binaries
return immediately.

samples/manifest.py: Sample.binary_path(store_root) resolves to the
staged binary path, or None for mimics / not-yet-fetched real samples.

exploits/workloads.py: real_binary_workload(bytes, sample) builds a
Workload that base64-uploads the binary into the shell session via a
heredoc, decodes + chmods + execs it in the background, captures the
PID for clean stop on dormant. Per-profile pid/bin paths so concurrent
samples in the same guest don't collide.

exploits/driver.py: dispatch order is now:
  1) sample.kind == "real" + binary staged at sample_store_root
     → real_binary_workload (Tier 4)
  2) profile mimic from workloads.workload_for() (Tier 3 v2)
  3) None → driver v1 fallback yes-loop
DriverConfig.sample_store_root is the new field; run_tier3_demo.py
wires it to repo_root/samples/store. driver_setup event records
sample_sha256 so trainers can join Tier-4 episodes against the
manifest by hash.
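
The dispatch order can be sketched as a single decision function. The
Sample dataclass here is a cut-down stand-in, choose_workload and its
string return values are hypothetical, and the staged-binary check assumes
the samples/store/<sha256> layout described above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Sample:
    # Cut-down stand-in for the manifest entry.
    name: str
    kind: str                      # "real" or "mimic"
    profile: Optional[str] = None
    sha256: Optional[str] = None


def choose_workload(sample, store_root):
    """Return a label for the workload path the driver would take."""
    if sample is None:
        return "v1-yes-loop"
    staged = (
        sample.kind == "real"
        and sample.sha256 is not None
        and (Path(store_root) / sample.sha256).is_file()
    )
    if staged:
        return "tier4-real-binary"         # 1) real + staged binary
    if sample.profile:
        return f"mimic:{sample.profile}"   # 2) profile mimic
    return "v1-yes-loop"                   # 3) fallback
```

A real sample whose binary has not been fetched yet falls through to its
profile mimic, so an episode never blocks on a missing download.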

samples/store/.gitkeep added (binaries themselves are gitignored).

Tests: 102 pass (was 86). New suites:
  tests/test_perf_qemu.py — parser + builder + perf-missing fallback
  tests/test_tier4.py     — real_binary_workload base64 round-trip,
                            stop-cmd kills pidfile, per-profile path
                            isolation, driver dispatch chooses real vs
                            mimic correctly, fetcher input validation
                            and cached-fast-path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:17:49 -05:00
max
b80986d99c Driver v2: sample-profile-driven workloads (Tier-2 + Tier-3)
The v1 driver ran ``yes > /dev/null`` for every sample, which
produced the same envelope shape regardless of which malware family
the orchestrator claimed to be running. That's a poor training
signal: the model sees identical /proc + QMP traces tagged
"cryptominer" / "ransomware" / "RAT" with no distinguishing
features. v2 fixes this.

What landed:

  exploits/workloads.py — six ``Workload`` profiles, each producing
    a distinct in-session shell command pair (start_cmd / stop_cmd)
    that backgrounds a profile-shaped loop:

      cpu-saturate    — sustained 1-vCPU saturation (XMRig shape)
      scan-and-dial   — periodic SYN-style probes across 10.200.0.0/24
                        + dial-home to gateway (Mirai shape)
      io-walk         — fs traversal + 4 KiB urandom writes, periodic
                        re-read (ransomware shape)
      bursty-c2       — long idle, periodic 3-packet TCP egress burst
                        (Dridex C2 beacon shape)
      low-and-slow    — minimal CPU + periodic awk-driven memory churn
                        (Kovter / fileless shape)
      shell-resident  — single long-lived TCP socket pinned to gateway
                        with periodic 6-byte command ticks (RAT shape)

  Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so
  the stop_cmd can cleanly kill the loop and its descendants.

  exploits/driver.py — MSFExploitDriver now accepts an optional
    ``Sample``. With one supplied, ``infected_running`` dispatches to
    the matching workload via exploits.workloads.workload_for(); the
    ``sample_executed`` event records profile + sample name + sample
    kind so the trainer can join cleanly. Without a sample, the v1
    yes-loop path remains unchanged (backwards compat).

  tools/vm_load_controller.py — the same dispatch on the Tier-2 path
    (no exploit, real Alpine guest driven over the serial console).
    A fleet wave now produces six visually distinct envelopes per
    wave whether the underlying mode is Tier 2 or Tier 3.

  tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or
    SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent
    sockets into the EpisodeConfig so all three new collectors
    (sources 2, 4, 5) run alongside source 1 by default.

  tools/run_tier3_demo.py — same ``--sample`` plumbing for the
    exploit-driven path.

Tests: 86 pass (was 82). New v2 cases:
  - profile dispatch routes infected_running to the workload's
    start_cmd (NOT the v1 yes-loop) when a Sample is set
  - all six profiles produce distinct start_cmds (the property the
    ML model needs)
  - unknown profile string falls back to cpu-saturate with a warning
  - v1 path (no Sample) still uses yes-loop (backwards compat)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:06:15 -05:00
max
1b6c7b2f4a Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts
This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py          — source 2 (oracle). Tiny synchronous QMP
                               client + row builder + run loop. Tolerates
                               older qemu without query-stats.
  collectors/guest_agent.py  — source 5 (deployable). Reads the
                               virtio-serial host-side socket, parses
                               agent JSON-lines, re-stamps to the host
                               monotonic clock, persists.
  collectors/pcap.py         — source 4 (deployable). tcpdump capture
                               + pure-Python pcap reader + 100 ms
                               netflow.jsonl bucketizer. Decodes
                               Ethernet/IPv4/TCP/UDP enough for the
                               schema in docs/data-model.md.

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py       — fake QMP server, capability handshake,
                            blockstats, async-event interleaving,
                            5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
                              re-stamp, malformed-line tolerance
  tests/test_pcap.py      — synthetic pcap with TCP/UDP/ARP frames,
                            bucketize correctness across windows
  tests/test_fleet.py     — capacity math (8-core idle / low-RAM /
                            high-load / Pi5 / 1-core box), manifest
                            selection determinism + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:02:27 -05:00
max
613c6fa223 Tier 3: msfrpc-driven exploit driver + first module config
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.

What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
  module.execute, job/session lifecycle) so we don't depend on a
  third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
  session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
  TOML module configs with {{ target_ip }} placeholders, replacing the
  imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
  intentionally-vulnerable target VM (host can reach guest via
  hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
  including an integration test that drives a real EpisodeRunner.

Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
  share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
  before run() (driver_setup fires pre-schedule).

Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:52 -05:00
Maximus Gorog
7216ec09bd Tier 2: real Alpine VM, real workload, real envelope
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.

Components:

  vm/launch_demo.sh              now boots Alpine 3.21 nocloud-cloudinit
                                 (Cirros 0.6.x's cirros-init blocks on the
                                 EC2 metadata service for ~17 min before
                                 falling through to NoCloud — abandoned).
                                 Mounts a cidata ISO as a second drive.

  tools/build_cidata.py          pure-Python NoCloud ISO builder (pycdlib).
                                 Sets root password and ssh_pwauth via
                                 runcmd so we don't depend on a specific
                                 cloud-init version's plain_text_passwd
                                 handling.

  tools/vm_serial.py             serial-console client (stdlib socket).
                                 Idempotent login (detects already-in-shell
                                 state), sentinel-bracketed run() that
                                 distinguishes shell output from the TTY
                                 echo of input by requiring a leading
                                 \r\n boundary on the marker.

  tools/vm_load_controller.py    in-guest load controller. set_phase()
                                 dispatches the per-phase shell command
                                 over the serial connection.

  tools/run_real_vm_demo.py      ties it all together: boot VM, wait for
                                 cloud-init runcmd, log in, run the
                                 EpisodeRunner with on_phase=controller,
                                 shut down VM.

Deps: paramiko, pycdlib added.

docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.

README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped in the tier table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:38:53 -06:00
Maximus Gorog
cc37fc6c4d Interactive envelope plot via WebAgg (browser-based)
plot_envelope.py grows a --show flag. With it, matplotlib's WebAgg backend
spins up a localhost server with a real interactive figure (zoom, pan,
hover, axes lock), equivalent to a MATLAB plot window without needing
tkinter or Qt locally.

tools/show_envelope.sh is a NixOS-aware wrapper: it locates libstdc++.so.6
in /nix/store (numpy's prebuilt wheel needs it on LD_LIBRARY_PATH) and then
execs the Python script with --show. Default port 8988, override via
--port. Bound to 0.0.0.0 so the figure is reachable over WG too.

tornado is added to dev deps because WebAgg requires it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:06:22 -06:00
Maximus Gorog
970698af83 Synthetic envelope demo: phase-driven load mimic + plotter
End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.

Components:
- tools/load_mimic.py        phase-driven load generator. Reads phase
                             commands on stdin; CPU/IO behavior matches
                             the named phase (clean=idle, armed=light burst,
                             infecting=disk burst+CPU, infected_running=
                             CPU saturation+stratum-shaped writes,
                             dormant=quieter than clean).
- tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with
                             a default 85s schedule that includes the
                             classic infected_running → dormant → re-entry
                             pattern.
- tools/plot_envelope.py     reads telemetry + labels from an episode dir,
                             writes envelope.png with colored phase bands.
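
Turning labels.jsonl into phase bands for the plot could be sketched as
follows. The record schema here is an assumption made for illustration
(one JSON object per line with a "t" offset in seconds and a "phase"
name); the real labels.jsonl layout and the phase_spans helper are not
taken from the repo.

```python
import json
from typing import Iterable, List, Tuple


def phase_spans(
    label_lines: Iterable[str], episode_end: float
) -> List[Tuple[float, float, str]]:
    """Convert transition labels into (start, end, phase) intervals.

    Each label marks the moment a phase begins; the phase lasts until
    the next label, or until episode_end for the final one.
    """
    labels = [json.loads(line) for line in label_lines if line.strip()]
    labels.sort(key=lambda rec: rec["t"])
    spans = []
    for i, rec in enumerate(labels):
        end = labels[i + 1]["t"] if i + 1 < len(labels) else episode_end
        spans.append((rec["t"], end, rec["phase"]))
    return spans
```

Each resulting span maps onto one colored band, for example via
matplotlib's Axes.axvspan(start, end, ...) on every panel.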

orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.
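
The schedule walk can be pictured with the sketch below. It is not the
real EpisodeRunner: walk_schedule, the (phase, duration) tuple shape and
the label dict are hypothetical, and the injectable sleep/clock
parameters exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def walk_schedule(
    schedule: Sequence[Tuple[str, float]],
    on_phase: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[dict]:
    """Walk (phase, duration_s) pairs, emitting one label per transition."""
    labels = []
    t0 = clock()
    for phase, duration_s in schedule:
        # Record the transition first so the label timestamp precedes
        # any latency introduced by the callback.
        labels.append({"t": round(clock() - t0, 3), "phase": phase})
        if on_phase is not None:
            on_phase(phase)  # e.g. tell the load generator to switch
        sleep(duration_s)
    return labels
```

Leaving on_phase as None gives the label-only behavior, which is one way
a single-phase caller could stay unaffected by the new argument.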

Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:53:20 -06:00