CIS490

Author	SHA1	Message	Date
Max	c2a71de4b2	scene 9 bars: paint full zoo + 0–1 visible scale - multi_model_metrics: publish gbt / mlp / cnn / knn_semi / gru / lstm / bert (knn handled by knn streamer); read both _train.json and _eval.json with macro_f1.point fallback - dashboard.css: add palette gradients for the four non-canonical names so the bars render with a fill colour - dashboard.js: open the bar's visible scale to the full 0–1 range so honest-low cross-host F1s show as a bar instead of clamping to 0% - ship lambda-live-detection-loop.py + dashboard request docs (scenes 7/8/12, sticky cache, lambda-inference-demo) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:18:00 -05:00
Max	4b2863ea99	producers/multi_model_metrics + scripts/rsync-from-lambda Pi-safe replacement for the original metrics.py + perf.py producers which load every checkpoint into memory and score the test set on each cycle. That pattern crashed the Pi during this project (300 MB knn pickles × 6 variants + 226 MB test set in memory at peak ≈ OOM). The new producer: - reads reports/eval/<model>_<mode>_train.json files (already contain the test_macro_f1 each trainer wrote) - publishes one model_metric event per file - publishes one model_perf event per file with a hardcoded per-architecture latency estimate (gbt 250 µs, knn 3500, mlp 50, cnn 500, gru 1500, lstm 2000, transformer 800, transformer_ssl 1000). These are family-level order-of-magnitude figures; proper benchmarks need to run on the deployment hardware (which is the A100, not the Pi). - re-publishes on a tick (default 30 s) for refresh-resilience. - NO model loading. Pi-safe. scripts/rsync-from-lambda.sh — pulls Lambda's artifacts/ + reports/eval/ to the Pi every 30 s. As Lambda finishes each model and writes its train.json, the Pi sees the new file within a cycle and the publisher broadcasts the metric on its next tick. Live multi-model dashboard during training, with no Pi-side inference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:04:15 -05:00
Max	3413a7c405	scripts/lambda-bootstrap.sh: also fix eval invocation paths The eval suite at the end of the bootstrap was using ../artifacts and ../data/* paths because they were originally invoked from inside repo/. Now that we no longer cd into repo, drop the ../ prefix. Same root cause as the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:25:40 -05:00
Max	ed7e3db035	scripts/lambda-bootstrap.sh — drop the cd-into-repo / before launching trainer The previous version did `(cd repo && "${cmd[@]}")` to "cd into repo for module imports." But PYTHONPATH was already set to $PWD/repo at the top of the script — so the cd was redundant for imports AND broke relative paths: the trainer expects to find data/processed/validation_v1.parquet from $HOME/cis490, not from $HOME/cis490/repo/. Symptom: every training job failed immediately with FileNotFoundError: data/processed/validation_v1.parquet Drop the cd; PYTHONPATH already does the import work. Found while running on the A100 today; trainer relaunched manually in-place via a stand-in bootstrap2.sh; this commit makes the next bundle clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:25:40 -05:00
Max	c1c8e98180	scripts/train-pi-cpu-models.sh — sequential Pi-side trainer chain Pi has 4 cores; only KNN and tree-based models are realistic to train here without GPU. While Lambda runs the full 16-job manifest in parallel (~1.7h), this chain trains the CPU-friendly subset on the Pi (~30 min) so scenes 8 & 12 populate with multi-model numbers within minutes instead of waiting on Lambda's full cycle. Order: gbt-realistic, knn-realistic, knn-oracle, knn_semi-realistic, knn_semi-oracle. Skips models whose .ckpt.json already exists (idempotent restart). Each is a subprocess of training/trainer/run.py so XGBoost/numpy/sklearn don't fight each other for cores. Caller is expected to start gbt-oracle separately (it's the longest single training and we kicked it off before invoking this script). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:12:34 -05:00
Max	69c563275a	training: parallelize lambda bootstrap (2 jobs at a time on the A100) At our model sizes (max ~250 K params, max batch 512), each training process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with two concurrent jobs. Bounded-concurrency rolling launcher cuts sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest. PARALLEL=2 (default) — override via env var if running on a smaller GPU or testing the queue logic. Per-job logs still land at logs/<model>_<mode>.log; failure reporting is the same. Idempotent: skipping already-present checkpoints unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:37:03 -05:00
Max	308140c6ce	training: lambda-cloud one-shot training integration External-GPU path for the time-pressured first round, before the Windows desktop joins the WG fleet. Lambda is treated as an "external worker" whose output lands in the same /var/lib/cis490/models/ tree the receiver-coordinated fleet uses, so cis490-jobs status reflects Lambda runs identically to fleet runs. Three scripts + one ingest tool: scripts/build-lambda-bundle.sh Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with: - the repo (sans .git, sans data/, sans artifacts*) - data/processed/{validation_v1,features_window_v1}.parquet - data/processed/feature_schema_v1.json - data/processed/tensor_window_v1/ (npz shards) - bootstrap.sh (entrypoint) - training_manifest.toml (the canonical job list) - BUNDLE_MANIFEST.json (commit hash + counts + build stamp) Verifies all four data inputs exist BEFORE compressing 5+ GB. scripts/run-on-lambda.sh ubuntu@<ip> rsync bundle up → ssh + run bootstrap → rsync artifacts + reports/eval back to artifacts-lambda/ + reports/lambda/. Resumable rsync; sha256-verified. scripts/lambda-bootstrap.sh (runs ON the Lambda instance) Creates .venv with cu121 torch + xgboost + the [training] deps, iterates the manifest's job list in priority order (highest first), runs trainer/run.py (or run_ssl.py for transformer_ssl) per job, skips jobs whose .ckpt.json already exists (idempotent on re-run), writes per-job logs/<model>_<mode>.log, runs eval suite at the end, stamps artifacts/RUN_SUMMARY.json with counts + failed-job list. tools/ingest_lambda_artifacts.py Bundles each (ckpt.json + sidecar + train.json) trio into a .tar.zst, sha256, PUTs to the local trainer-receiver's /v1/model/{job_id}, marks the job complete. Maps (model, mode) → job_id by re-reading the canonical manifest. Handles the queue state churn (requeue if completed, claim if pending, fail-back on race losses). End-to-end smoke verified on the A100 instance just provisioned: - SSH from Pi via ed25519 keypair (cis490-trainer-pi) - GPU: A100-SXM4-40GB, driver 580.105.08 - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0 - 464 GB ephemeral disk available Pi-side feature build (build_features.py + build_tensors.py against all 72,952 accepted+degraded episodes) is in progress; bundle build gates on its completion. Estimated wall-clock for the full Lambda training run on A100: ~2.5 hours for 12 supervised + 2 SSL models + eval suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:32:04 -05:00
Max	8643192a71	training/fleet: distributed multi-host trainer with capability gating Symmetric companion to the collection fleet (orchestrator/fleet.py) but for training. Collection is embarrassingly parallel; training is not (a model is trained at most once across the fleet), so the receiver coordinates which worker gets which job. Operator-control surface is etc/training_manifest.toml.example — single canonical file declaring (a) per-host capability + per-model allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper) with capability constraints (require_cuda, prefer_cuda, min_vram_gib, min_ram_gib, allowed_hosts). Components: capability.py — self-detection: hostname, cores, RAM, CUDA presence, VRAM, torch version, git commit. Used by workers to filter eligible jobs before claiming. manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable sha256 of (model, mode, hyper, split_recipe, train_hosts, seed) so manifest reload is idempotent: existing rows keep their status, new jobs become claimable, removed jobs stay until cancelled. queue.py — SQLite job queue (training_jobs.db) with statuses pending\|claimed\|running\|completed\|failed\|cancelled. Atomic claim_next via single UPDATE WHERE status='pending'. Heartbeat, complete, fail. Stale-claim sweep (stale_after_s=600s) with max_attempts cutoff to failed. store.py — model artifact store mirroring receiver/store.py. Artifact ID is the sha256 of the uploaded tarball; bit-identical re-runs deduplicate. receiver.py — Starlette app exposing 11 endpoints: POST /v1/job/claim (worker) POST /v1/job/{id}/heartbeat (worker) POST /v1/job/{id}/complete (worker) POST /v1/job/{id}/fail (worker) PUT /v1/model/{id} (worker — uploads tarball) GET /v1/jobs (anyone) GET /v1/workers (anyone) POST /v1/job/{id}/cancel (operator: X-Operator-Token) POST /v1/job/{id}/requeue (operator) POST /v1/manifest/reload (operator) GET /v1/health (anyone) Runs as cis490-trainer-receiver.service on the Pi alongside the existing receiver, on a separate port. client.py — stdlib HTTP client (urllib only, no new deps). worker.py — long-running daemon. Loop: detect capability → claim → spawn training/trainer/run.py subprocess → heartbeat every 30s → tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe. Operator CLI (tools/cis490_jobs.py): status / list / show / cancel / requeue / reload / workers. Cancel and requeue require $CIS490_OPERATOR_TOKEN matching the receiver's configured value. Bootstrap: scripts/install-training-worker.sh (Linux systemd) and scripts/install-training-worker-windows.ps1 (Windows Scheduled Task) let the operator enroll a new host with one command after cloning the repo and setting up the venv. Worker self-tests capability before registering. End-to-end smoke verified on the Pi: receiver up, manifest synced, 14 jobs queued, worker registered, claimed 4 CPU-eligible jobs (allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle, mlp-oracle), 1 failed with the actual error visible via cis490-jobs status, 3 artifacts uploaded to /var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with proper index.jsonl row. 21 unit tests (manifest validation: 8; queue lifecycle + eligibility: 13). All pass alongside the prior 17 training tests = 38 green. Open limitations surfaced inline: - Hyper-key drift between manifest and run.py fails at training time, not at manifest reload (worth tightening to argparse introspection later). - mTLS not yet wired through Caddy for the trainer-receiver port — listens loopback-only until that lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 01:20:20 -05:00
Max	1fabd4a246	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers The model layer of the project, built honestly: - tools/dataset_validate.py — full-sweep validator over the receiver store (sha256, schema, monotonic labels, telemetry-row gate). On the current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected + 7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet is committed as the per-episode acceptance index. - training/_features.py — channel registry (46 channels across proc/guest/qmp/netflow), summary-stat windowing AND channel×time tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns (Unix ns) — tested fix for a real netflow-vs-host clock-base inconsistency that was silently dropping every netflow channel. - training/_split.py — three held-out recipes (host / sample / time) with profile-stratification assertions. held_out_host carries untested_profiles for cases like scan-and-dial absent from the test host (5 of 6 profiles tested cross-device, never silently averaged). - training/models/ — 6 architectures behind a common BaseModel interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each trained twice (realistic / oracle) per the deployment threat model. Schema-hashed checkpoints refuse to load if _features.py changed since training (silent-input-drift protection, tested). - training/trainer/ — unified training loop: class-weighted CE, LR warmup + cosine, gradient clipping, mixed precision when CUDA, early stopping on val macro F1, best-on-val checkpoint. Same loop runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost early_stopping_rounds on val mlogloss. - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1, per-profile and per-host breakdown, paired-bootstrap significance for model-vs-model gap. Confusion matrix uses union of seen labels. - training/dashboard/producers/ — replay/metrics/perf/profiles emitting the six event types the dashboard's awaiting scenes consume; on-demand tensor extraction so the Pi can run live inference without 65 GB of shards. - 17 unit tests (split coverage, features round-trip, schema mismatch, determinism, time-base alignment regression). End-to-end smoke-trained all six on a 567-episode subset; held-out test macro F1 reported with paired-bootstrap significance. The methodology now reports honest cross-device generalization, not in-distribution validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 01:19:00 -05:00
Elliott Kolden	b29d30a1b2	Tier-3: fix QEMU boot, catalog admission, verify module Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc (i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci) caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1. Services never started; the b'' probe fix (Bug 10) then correctly waited out the full timeout with no result. Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor which is requires_bridge=true and has a hardcoded port-6200 backdoor. Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444". manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest egress, SLIRP-safe). distccd returns a valid protocol response so MSF's handler runs and session_open fires. Verified against Metasploitable2 sourceforge image sha256 a8c019c3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 16:41:41 -06:00
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file — manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob — ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run__demo.py --manifest (samples path now from canonical) New surface: manifest.toml — the single source of truth * orchestrator/manifest.py — load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta — populated by run__demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance cis490-orchestrator.service — RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh — validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
Max Gorog	c41763bd28	install-lab-host: auto-install perf + tcpdump on Arch / Debian / RHEL PIPELINE.md §4.4 requires every collector in the active set to actually work end-to-end. On k-gamingcom (commit `dac03d2` episode at 02:21Z) the new perf_unavailable lifecycle event surfaced a concrete cause: `reason: binary_not_on_path` — perf is enabled but the binary isn't installed. Same story with tcpdump on k-gamingcom (pcap_unavailable events with `error: tcpdump not found`). The canonical install script is the right place to ensure the deps are present. detect_os reads /etc/os-release; ensure_collector_packages installs `perf` (Arch / RHEL) or `linux-perf` + `linux-tools-generic` (Debian/Ubuntu) plus `tcpdump`. After the install attempt the script re-checks `command -v` and dies loudly if either is still missing — silent silent silent forbidden per §1, so install failure has to be observable. Idempotent (`--needed` / equivalent skips already-installed packages). Operator owns full system upgrades; this only does targeted package install. On unknown distros logs a warning and dies on the followup check, with a clear pointer to install perf/tcpdump by hand. The next autoupdate tick on k-gamingcom should pull this and self-install perf + tcpdump, after which rows_perf > 0 and pcap should start producing bytes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 21:26:28 -05:00
max	49eba2fd60	fleet-health: proactive alerts on the Pi + per-host doctor reports Two pieces of self-monitoring so the maintainer isn't the alarm: (2) Receiver-side fleet health monitor cis490-fleet-health.timer runs check_fleet_health.py every 5 min. Detects three symptoms and writes them to /var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy to forward to a notifier): silent — host shipped in last 24h but has been quiet >30 min fatal-only — actively shipping but every PUT 4xx unstamped — shipping without X-Cis490-Code-Commit header Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault fires once per hour, not every 5 min. 15 unit tests cover the index parser, three detectors, and dedup. (3) Per-host doctor snapshots Lab hosts run cis490-doctor-check.timer once a day (10 min after boot, then daily with 30-min jitter). The timer runs cis490_doctor.py --json and PUTs the result to a new endpoint: PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json GET /v1/host-health → aggregate across all hosts Endpoint is NOT gated by version_gate — sick hosts running stale code MUST still be able to report sickness. 11 unit tests cover PUT/GET, atomic-write semantics, bearer auth, and the not-gated-by-version-gate property. ship_health_check.py reuses the existing shipper transport (mTLS + bearer + receiver URL from lab-host.toml) so we don't reimplement auth. Both timers wired into install-lab-host.sh — the loop also enables the previously-added autoupdate + cert-fetch timers, so a single install run gives a host all four self-healing mechanisms. Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2 pre-existing test_fleet.py failures from the elliott-ThinkPad merge (`667f042`) are unrelated to this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:48:31 -05:00
max	3180f7b5ac	lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry k-gamingcom symptom (2026-05-02): the on-device agent successfully finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS material" because the cert auto-fetch step in install-lab-host.sh either ran with host_id still REPLACE_ME, or hit a transient bootstrap.wg failure, and there's no automatic retry. The Pi-side cert IS minted and the bootstrap endpoint serves it — the failure mode is purely "lab-host hasn't pulled it down." Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh (idempotent, no-op when certs are already on disk, no-op when host_id is unset, exit-0 on transient network failure so the unit doesn't get pinned as failed), and run it from a 5-minute systemd timer. The timer handles all three "stuck waiting on mTLS" cases without operator action: - operator edited host_id post-install but didn't re-run install - bootstrap.wg was briefly unreachable during install - lab host was offline when install ran but came up later The script `try-restart`s cis490-shipper after a successful fetch so the daemon picks up the new cert immediately instead of waiting for its lazy retry. install-lab-host.sh still calls the script on install for fast first-time bring-up — the timer is the safety net. Tarball extract is staged through a temp dir + atomic rename so a mid-extract crash never leaves us with a mismatched cert/key pair. AGENTS.md row 4 updated: "waiting on mTLS material" remediation now points at the timer, with the exact `systemctl start cis490-cert-fetch.service` command to force an immediate retry. Tests: 267/267 unchanged. The fetch script is idempotent + has all its happy/error paths handled inline; a unit test would mostly be testing systemd's behaviour. The integration test path is the timer running on a real lab host, which is the actual production case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:30:16 -05:00
Elliott Kolden	667f042707	Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01) Root causes and fixes documented in TIER3-BRINGUP.md. Summary: 1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot. 2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring modules selected on SLIRP runs; fix: always filter requires_bridge. 3. cmd/unix/interact creates no session.list entry → session_open_timeout every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl. 4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444); fix: extra_host_port:extra_host_port mapping so guest binds the per-slot LPORT directly. 5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots; fix: requires_bridge=true filters it from SLIRP fleet runs. 6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba boots (~60 s too early); fix: replace TCP probe with serial console _wait_for_serial_login that waits for actual "login:" prompt. 7. Stale QEMU survives orchestrator restart (start_new_session=True) → holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from old pidfile before rmtree. 8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100. 9. msfrpcd 6.x returns bytes for all string values even with raw=False; fix: MSFRpcClient._str() recursive decoder applied to all responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:26:19 -06:00
max	98dcd4f9f8	lab-host: cis490-autoupdate.timer for self-healing on push Today's incident: post-cutover, k-gamingcom went silent and elliott-thinkpad kept shipping pre-stamp episodes that the receiver gate 400'd in a 2300+ PUT loop. Both required `git pull && install- lab-host.sh` on the host — neither the on-device AI agent nor the operator pulled in time, and from the receiver Pi I cannot reach in (sshd off on the lab hosts). Fix the recurrence directly: a 30-min systemd timer that does git fetch + (if behind) ff-only pull + re-run install-lab-host.sh. Hosts catch up on the next tick on their own — no human or agent action required. Mechanics: - scripts/auto-update.sh runs as root, drops to cis490 for git ops to satisfy /opt/cis490 ownership ("dubious ownership" guard). - Refuses ff if local HEAD isn't an ancestor of origin/main — protects operator hand-edits from silent overwrite. - Network failures exit 0 (offline is normal, don't pin a unit failure); divergence + install failures exit non-zero so the journal records what broke. - RandomizedDelaySec=10min on the timer prevents thundering-herd when several hosts boot together. - Hands off to install-lab-host.sh via exec — exactly one path through bring-up; no special "auto" flow. The version-gate provides the quality boundary, so even if origin/ main moves forward unsafely, the receiver's allow-list still controls what lands in the index. install-lab-host.sh enables cis490-autoupdate.timer on every run, idempotent — existing hosts pick it up the next time they pull manually. Filed Forgejo #18 with the canonical command for elliott-thinkpad + k-gamingcom to bootstrap themselves out of the current incident (auto-update doesn't help them retroactively — it has to be running before the cutover to catch the next one). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:59:31 -05:00
max	eda6164897	fix: lab-host install loop after commit-gate cutover Why services weren't starting after the gate went live: 1. install-lab-host.sh self-copy. The receiver's 400 remediation tells the agent to `cd /opt/cis490 && git pull && sudo ./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same file"; `set -e` aborts before the systemd units install or anything restarts. Detect the same-dir case and skip the cp; chown still runs. 2. Services never restart. install-lab-host.sh and install-tier-3-4.sh both ended by telling the operator to restart, then exiting. The running shipper/orchestrator kept executing pre-gate code from the old module objects, so new `code_version` stamping never reached an episode. Both scripts now `systemctl restart` the units they own when those units are enabled. 3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't move the episode out of `data/episodes/`. Next scan re-tarred and re-PUT the same dir, getting 400 again. With 4465+ pre-stamp episodes on k-gamingcom this burned ~1 PUT/sec for 5+ hours of receiver log. Fatal episodes now move to data/quarantine/<id>/ with a quarantine_reason.json beside them; the outbox tarball is deleted. 4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a one-shot that scans data/episodes/ and quarantines anything without a 40-char-hex code_version.commit. Wired into install-lab-host.sh step 9 so a re-install drains the queue automatically. Idempotent; safe to run while the shipper is active. Tests cover the queue's new fatal-quarantine path and every drain behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:36:21 -05:00
max	5c0bc9af8e	meta.json: stamp code_version (commit, branch, dirty) per episode Closes a real reproducibility gap. Three weeks of bug fixes have shipped (probe fix in `2707709`, multi-signal classifier in `321ea63`, mandatory tier-4 in `265f3ad`, etc.); without a per-episode code_version, trainers can't tell which episodes came from buggy pre-fix code and have to scan every tarball to guess. Resolution priority (cached across episodes): 1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes it at install time since /opt/cis490 is a flat copy with no .git) 2. git rev-parse HEAD from the repo root (dev clones) 3. {"commit": "unknown", source: "unknown"} so the field is always present (filterable) Output shape, always present in meta.json: "code_version": { "commit": "<40-hex>" \| "unknown", "branch": "<name>" \| null, "dirty": bool \| null, "source": "VERSION-file" \| "git" \| "unknown" } install-lab-host.sh writes VERSION at install time with the source repo's git rev-parse HEAD + branch + clean-tree flag + install timestamp. Lab-host agents that pull main + re-run install-lab-host.sh get a fresh stamp automatically. 148/148 tests pass; test_episode_against_self_pid_produces_full_directory asserts the field's presence + valid `source` value.	2026-05-01 01:29:01 -05:00
max	265f3ad313	Tier-4 sample source: theZoo (no auth, no operator action) Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo). theZoo is a public security-research repo with hundreds of malware samples organized by family, password-protected with the well-known 'infected'. No API key, no signup, nothing for an operator to do — which is what zero-touch tier-4 actually means. Changes: - tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB) to /var/lib/cis490/theZoo on first run, then for each manifest family without a sha256 it locates a matching Binaries/<Name> dir, extracts the .zip with password 'infected', picks the largest non-text payload as the binary, sha256s it, stages at samples/store/<sha256>, and rewrites manifest.toml in place (atomic tempfile + os.replace, stat preserved). Mandatory exit semantic: non-zero if no real samples landed. - scripts/install-tier-3-4.sh: dropped the MB-key resolution chain (env var → local file → bootstrap.wg fetch). Now just runs auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4 remains as the explicit override but is documented as defeating the project. - bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service: removed the /v1/secret/<name> endpoint and the --secrets-root flag. Dead code now that no API key needs distributing. Live-rolled back on the Pi (404 verified post-restart, stale /etc/cis490/secrets dir removed). - scripts/set-malwarebazaar-key.sh: deleted. No MB key means no one-time operator step. - tests/test_bootstrap_secrets.py: deleted (route removed). - AGENTS.md: rewrote tier-4 section to reflect zero-operator model. 148/148 tests pass. Bootstrap service rolled back live.	2026-05-01 01:17:50 -05:00
max	5d0e8e33a9	Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key User: 'we don't want it to be optional, this real malware IS the data we want.' Acknowledged. Three changes make Tier 4 actually mandatory without forcing per-host operator action: 1. bootstrap.wg /v1/secret/<name> endpoint - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts over the same trust boundary as the cert endpoint (WG mesh, iptmonads-gated). Strict allow-list — only `malwarebazaar` resolves; everything else 404s. Secret returned as bare text with Cache-Control: no-store. Live-verified on the Pi. - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned, 200 with token, 404 unknown name, 500 on empty file. 2. install-tier-3-4.sh: Tier 4 is no longer optional - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token → https://bootstrap.wg/v1/secret/malwarebazaar. - Caches the bootstrap-fetched key locally so re-runs are offline. - If all three resolution paths fail, dies with the exact remediation command for the operator (one-time set-malwarebazaar-key.sh on the Pi). - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still works for emergency overrides but logs a warning that the host will produce only mimics). Deploy fails if zero binaries land in samples/store/ — no silent mimic-only fallback. - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'. 3. scripts/set-malwarebazaar-key.sh - Pi-side helper: one operator command per fleet, ever. Accepts key via env or stdin, validates length, drops at the right path with the right perms. Lab hosts pull the rest automatically. AGENTS.md: rewrote the Tier-4 section to reflect mandatory status + the one-time-on-Pi distribution model. 152/152 tests pass. Bootstrap service updated live on the Pi.	2026-05-01 00:44:41 -05:00
max	683bfe9ce6	Tier 3 + Tier 4 auto-deploy: zero operator interaction Replaces the manual runbook with scripts that just work. install-lab-host.sh now runs the full Tier-3 deploy automatically as its 8th step (after the mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set. Changes: - install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when metasploit-framework isn't present (was: bail with "install manually"). apt-get and dnf paths both go through the same omnibus script with DEBIAN_FRONTEND=noninteractive. Idempotent. - fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL (https://downloads.sourceforge.net/project/metasploitable/...) so no operator URL is required. sha256 is now optional and TOFU-pinned — first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256; subsequent runs verify against that. Skips if qcow2 already present. - scripts/install-tier-3-4.sh (new): orchestrates the four steps (msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4 env knobs for partial deploys. - tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set, queries MB by each manifest entry's `family` (signature match), pulls the first match via fetch_sample.py, and rewrites manifest.toml in place (atomic tempfile + os.replace, preserving stat). Skips entries that already have sha256. - install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh automatically when mTLS certs are on disk. --skip-tier3 flag for operators who want Tier 2 only. Skipped silently before certs land so first-pass install (host_id=REPLACE_ME) still works. - AGENTS.md: rewrote the Tier-3 section to point at the one-shot script. Removed the old multi-command runbook so on-device agents can't accidentally follow stale steps. Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key present) from a single sudo invocation. No operator picks for image URLs, no manual metasploit installs, no manual manifest edits.	2026-04-30 23:12:08 -05:00
max	f6d7d07837	Make mTLS bring-up unmistakable for on-device agents Sysadmin observed lab-host agents still trying to "secure the connection" — minting certs, generating CSRs, or otherwise reinventing a cert-delivery flow that's already automated through bootstrap.wg. Three reinforcements so an agent reading any of the three surfaces (AGENTS.md, install script output, journalctl) gets the same message: - AGENTS.md gains a top-of-file "do not mint your own certs" callout + a dedicated "Securing the connection (mTLS)" section with the one fix (re-run install-lab-host.sh after setting host_id) and an explicit "what NOT to do" list (no openssl, no copy from another host, no verify_tls=false). - install-lab-host.sh's FIRST-INSTALL NEXT STEPS now spells out that the cert auto-fetch is silently skipped while host_id is REPLACE_ME, and that the operator MUST re-run the script after editing host_id. Step 2 is now "RE-RUN THIS SCRIPT" with a DO NOT openssl warning. - The shipper's "waiting on mTLS material" warning now embeds the exact remediation command + a pointer to AGENTS.md, so an agent reading journalctl without ever opening the repo still gets it. Tests: 12/12 in test_shipper still pass; warning string change is not asserted on (only the dataclass error field).	2026-04-30 16:23:44 -05:00
elliott	95ac56a382	fix: three install-time bugs found during first lab-host bring-up on k-gamingcom 1. pyproject.toml — move pycdlib to main deps (was dev-only; cidata build fails on first install because the venv doesn't include dev extras). 2. scripts/install-lab-host.sh — create vm/images/ dir before symlinking alpine-baseline.qcow2 and cidata.iso into INSTALL_ROOT. Without the mkdir the ln -sf silently fails (\|\| true), leaving the launchers unable to find the images and causing every episode to fail within 15 s. 3. tools/cis490_doctor.py — two fixes: a. Insert repo_root into sys.path at doctor startup so the inline `from exploits.modules import ...` succeeds when running from /opt/cis490 (package = false means nothing is installed into site-packages). b. Pass cwd=/opt/cis490 to the shipper --ping subprocess so python -m shipper resolves the module correctly regardless of the caller's CWD. Tested on k-gamingcom: install script now builds cidata.iso on first run, 7-slot fleet wave completes with rc=0, doctor shows 13 ok / 4 warn / 2 fail (remaining failures are mTLS certs + collector.wg DNS — both need Pi-side action, not code changes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 15:05:00 -06:00
max	507eac617b	Solvable Tier-3 holes: callback payloads, busybox workloads, bridge by default Closes the next batch of issues from the post-mortem. The previous "each run uses a different vulnerability" commit shipped 5 modules but 3 of them couldn't actually fire under SLIRP+restrict=on: their reverse-shell payloads needed a callback channel the launcher didn't provide, AND their LHOST options were set to {{ target_ip }} (the target's IP, not the attacker's — copy-paste from RHOSTS). Same time, the workloads.py shell commands used bash-only /dev/tcp redirects that silently no-op'd in the busybox shell sessions Metasploitable2 returns. Net effect: episodes that selected those modules would have produced session_open_timeout + dead workloads. Module configs (the three callback ones): exploits/modules/distccd_command_exec.toml exploits/modules/php_cgi_arg_injection.toml exploits/modules/unreal_ircd_3281_backdoor.toml - Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl so the target listens on a known port; msfrpcd connects to it via the host's hostfwd (no callback path required). - Drop the bogus LHOST = "{{ target_ip }}" — bind shells don't use LHOST. - Add [runtime] table: requires_bridge = true extra_target_ports = [<bind_lport>] Both fields are honored by the loader (ModuleConfig.requires_bridge) and the launcher (TARGET_PORTS gets the extra port hostfwd'd when BRIDGE mode is active). orchestrator/fleet.py When BRIDGE is unset in env, _run_slot filters the module catalog down to modules where requires_bridge=False before calling select_module. Two same-socket-shell modules (vsftpd_234_backdoor + samba_usermap_script) survive — fleet still has variety; just doesn't pick modules whose payloads can't land. With BRIDGE set, the full catalog rotates as before, AND BRIDGE is propagated to the per-slot subprocess env so launch_target.sh enters tap+bridge mode. exploits/workloads.py Replaced bash-only constructs in three profiles: scan-and-dial /dev/tcp/HOST/PORT redirects → nc -z -w 1 bursty-c2 same fix shell-resident exec 3<>/dev/tcp/... → piping into nc -w All three now run cleanly in busybox / dash / Metasploitable2's default shell. The remaining three profiles (cpu-saturate, io-walk, low-and-slow) were already busybox-portable. scripts/install-lab-host.sh - lab-host.env now defaults BRIDGE=br-malware (was commented out). Operator opt-out is to comment the line back in. - New step 6b: provisions br-malware via vm/setup_bridge.sh AND pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo, cis490target0..7 for Tier-3 target) all attached to br-malware and brought up. Launchers reference these by SLOT — no sudo needed at episode time. - On bridge-setup failure, the script auto-comments BRIDGE in the env file with a "auto-disabled: bridge setup failed" note so the fleet falls back to same-socket modules + Tier-2 cleanly. tools/cis490_doctor.py Two new checks for the lab-host role: bridge: br-malware exists / up tier3: msfrpcd listening on 127.0.0.1:55553 tier3: module catalog parses (counts same-socket vs requires_bridge) All three are warn-level — they don't fail an otherwise-healthy Tier-2-only setup; they tell the operator what's missing for full Tier-3 + source 4 coverage. Tests: 132 (was 129). New cases: test_fleet.py +3 - fleet skips requires_bridge modules when BRIDGE unset (asserted across 20 episodes; never picks a callback module) - fleet uses the full catalog when BRIDGE is set - BRIDGE env propagates to per-slot subprocess What's still untested live: the bind_perl payloads against a real Metasploitable2 in the bridge-enabled launcher path. That's a deployment validation, not a code change. The unit tests confirm the dispatch / filter logic; the live test is the next operator action. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:32:52 -05:00
max	a93a3ff221	bootstrap: auto-issue mTLS leaves to enrolled lab hosts (closes #9 , refs #3 ) Adds a pull-based cert distribution path so install-lab-host.sh can fetch its own leaf cert without operator intervention. Removes the ssh-from-Pi requirement that blocked elliott-lab. How the chicken-and-egg gets solved: a freshly wg-enrolled lab host already has WG access (gate kept by iptmonads at L4) and trusts the Caddy local CA (bundled in this repo at etc/caddy-root.crt). It makes a single TLS call to https://bootstrap.wg/v1/cert/<host_id> — no mTLS — gets back a tar of {ca.crt, leaf.pem, leaf.key}, extracts to /etc/cis490/certs/, and the shipper unblocks. Trust boundary is "reached :443 over WG"; no operator action needed. bootstrap/ app.py Starlette: GET /v1/cert/{host_id}, GET /v1/health. Validates host_id charset, rate-limits per source IP, logs every mint with the X-Real-IP Caddy injects. __main__.py uvicorn launcher; runs as root because the wg-pki CA private key is root-only. etc/cis490-bootstrap.service systemd unit on 127.0.0.1:8446 with ProtectSystem=strict + narrow ReadWritePaths=/var/lib/wg-pki. ProtectHome=no because systemd's read-only mode hides /home contents (the issuer script the wrapper exec's lives there). scripts/issue-cis490-client-cert-wrapper.sh Adapter the bootstrap service shells out to. Resolves the actual wg-pki issuer script across the three plausible install layouts (/opt/wg-pki, /home/max/wg-pki, /home/max/.env/wg-pki) so a single copy of the unit file works on any operator's box. Forces --out-dir to /var/lib/wg-pki/issued so writes stay inside the service's narrow ReadWritePaths. scripts/install-lab-host.sh After scaffolding lab-host.toml, if /etc/cis490/certs/lab-host.pem is absent, curls bootstrap.wg with --cacert etc/caddy-root.crt (no chicken-and-egg), extracts, chowns/chmods. Skips silently if bootstrap.wg is unreachable so manual hand-carry remains possible. scripts/install-receiver.sh Drops cis490-bootstrap.service alongside cis490-receiver and prints both as "enable --now" candidates. cis490-bootstrap is the thing that makes lab hosts self-provisioning. etc/caddy-root.crt Bundled copy of wg-pki's published Caddy local CA root, so the bootstrap fetch can verify TLS without depending on a wg-pki clone that may or may not be on the lab host yet. Verified live on the Pi: $ curl --cacert etc/caddy-root.crt https://bootstrap.wg/v1/cert/elliott-lab -o /tmp/x.tar HTTP 200 size=10240 $ tar tf /tmp/x.tar ca.crt elliott-lab.key elliott-lab.pem $ openssl verify -CAfile … elliott-lab.pem /tmp/.../elliott-lab.pem: OK $ openssl x509 -subject … -noout subject=CN=elliott-lab Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:30:29 -05:00
max	6f8b744c33	cis490-doctor + AGENTS.md operator runbook + louder install script Adds the missing diagnostic + onboarding tools so an agent (AI or human) handed a fresh lab host can get to "shipping data" without re-deriving every step from logs. tools/cis490_doctor.py — one-shot health check that walks the full stack from the bottom up. Each row is green/yellow/red with an exact fix command for the red rows. Checks: - repo: branch, tree-clean, distance from origin/main - install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml, /etc/cis490/lab-host.env - mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify - systemd: cis490-{shipper,orchestrator,receiver} active state - net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg - vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2, cidata.iso - tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level) - end-to-end: cis490-shipper --ping Modes: --role {lab-host,receiver}, --json (machine-readable), --no-tier3 (skip optional checks). Exits non-zero on any red row. ANSI color (auto-disabled on non-tty / NO_COLOR). AGENTS.md gains a "How a lab host gets to shipping data" canonical flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh → install-lab-host.sh → cis490-doctor → systemctl enable. Plus an "on-demand episode" recipe + a "smallest E2E test" snippet for agents that need to verify the pipe without waiting on the timer. The strict "cloning the repo by itself does nothing" callout makes the failure mode mu and elliott-lab hit explicit. scripts/install-lab-host.sh prints a 5-step banner on first install that points at cis490_doctor.py + the deploy-cis490-cert.sh flow, plus an always-printed footer warning that "cloning + running launchers manually is NOT enough." Same message the AGENTS.md section reinforces. Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime" claim that turned out to be untrue because no cis490-shipper service was running on elliott-lab — exactly the case this diagnostic tool targets). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:11:57 -05:00
max	a88ac83db0	Close out the deployment-readiness gaps Wraps the gaps surfaced in the "what is not implemented" audit so the fleet really is shippable end-to-end. Verified live on the Pi: - cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the new wg-pki client CA leaf - real episode dir → tar+zstd → PUT → HTTP 201 stored - re-ship same bytes → 200 (idempotent) - re-ship different bytes under same id → 409 (conflict) Changes: orchestrator/episode.py - EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/ revert per docs/architecture.md). When set + qmp_socket present, EpisodeRunner issues loadvm <snapshot_name> and emits snapshot_revert / snapshot_revert_failed events on the same monotonic clock as everything else. collectors/qmp.py - savevm() / loadvm() helpers using human-monitor-command, plus a test against the fake QMP server. exploits/workloads.py - chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer- busting payload. Includes a finalize step that sha256-verifies on the guest before exec. - real_binary_workload() now wraps the chunked plan for backwards compat with single-shot callers. exploits/driver.py - Tier-4 dispatch walks the chunked plan in MSFExploitDriver: each chunk is a separate session_shell_write; finalize verifies; exec only runs on sha-ok. New events: real_binary_upload_begin, real_binary_verify, real_binary_aborted. etc/cis490-orchestrator.service - Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE). - Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) + CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors work under hardening. scripts/install-lab-host.sh - Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID defaulting to `hostname -s`. - Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and builds cidata.iso with the in-guest agent embedded; symlinks both into /opt/cis490/vm/images/ so launchers find them. scripts/fetch-alpine-baseline.sh - Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2 matching the sha512 in docs/sources.md. tools/plot_envelope.py - Rebuilt to render whatever telemetry the episode dir contains: proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs → guest agent load/mem. Missing sources are silently skipped. tools/index_reader.py - cis490-index CLI: filter receiver's index.jsonl by host / sample / time range, sort, count-by group. Closest thing to a query interface until we stand up Postgres/Timescale. samples/README.md - Rewritten to match the new manifest schema, the kind=real vs mimic split, the per-(host, slot, ep) selection mechanic, and the chunked-upload safety story. Tests: 106 pass (was 102). New cases: - test_qmp.py — savevm + loadvm (HMP wrapper + error path) - test_tier4.py — chunked plan splitting, sha-pinned finalize, end-to-end driver walks all chunks + verify + exec via the fake msfrpc client Closes the "what is not implemented" punch list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:31:55 -05:00
max	1b6c7b2f4a	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts This is the chunk that makes "real data" actually flow on multiple hosts in parallel. End-to-end pipe was up at `613c6fa` / 2579683; now the lab-host side has the diversity + concurrency it needs. Collectors landed: collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP client + row builder + run loop. Tolerates older qemu without query-stats. collectors/guest_agent.py — source 5 (deployable). Reads the virtio-serial host-side socket, parses agent JSON-lines, re-stamps to the host monotonic clock, persists. collectors/pcap.py — source 4 (deployable). tcpdump capture + pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md. In-guest agent: vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs, thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent. tools/build_cidata.py — embeds the agent + an OpenRC service into user-data so first boot of the Alpine cidata image auto-starts it. Launchers: vm/launch_demo.sh / launch_target.sh — second virtio-serial port for the agent socket; SLOT env support so multiple VMs run without socket / port collisions; PORT_BASE on launch_target so multiple target VMs hostfwd different host ports. vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24, no NAT). Idempotent. Fleet: orchestrator/fleet.py — capacity detector (cores / RAM / load headroom) + concurrent-slot runner. Per-slot ENV selects the sample. FleetCapacity dataclass round-trips into meta.json so "this episode ran with 6 concurrent VMs" is auditable post-hoc. tools/run_fleet.py — CLI: --capacity report; --waves N runs N waves of (max_concurrent) episodes each, every slot with a different sample. etc/cis490-orchestrator.service — now drives the fleet runner with Restart=always so each invocation runs one wave and respawns, giving a continuous stream. Samples: samples/manifest.toml — six profiles spanning the five major behaviour shapes. Each entry is real OR mimic (sha256 distinguishes). samples/manifest.py — strict TOML loader (rejects dups, unknown categories) + deterministic select(host_id, slot, episode_index) so different hosts on the network walk the catalog in different orders without any coordinator. EpisodeRunner: orchestrator/episode.py — optional qmp_socket + guest_agent_socket fields on EpisodeConfig; when set, additional collector threads run alongside proc_qemu. EpisodeResult now carries rows_qmp + rows_guest counters. Tier-3 setup automation: scripts/install-msfrpcd.sh — installs metasploit-framework where the package manager has it, generates a strong password into /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to 127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch once MSFRPC_PASSWORD is sourced. scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256 from the operator (Rapid7 download is registration-walled), pulls, verifies, converts vmdk → qcow2, lands at vm/images/. Tests: 82 pass (was 51). New suites: tests/test_qmp.py — fake QMP server, capability handshake, blockstats, async-event interleaving, 5-failure backoff tests/test_guest_agent.py — fake virtio socket, JSON-lines read + re-stamp, malformed-line tolerance tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames, bucketize correctness across windows tests/test_fleet.py — capacity math (8-core idle / low-RAM / high-load / Pi5 / 1-core box), manifest selection determinism + diversity What's queued for the next commit (already discussed in convo): - MSFExploitDriver v2: map sample.profile → distinct in-session workload so Tier-3 episodes don't all produce the same yes-loop envelope. Critical for ML to learn varied malware shapes. - Real-sample fetch from MalwareBazaar by sha256. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:02:27 -05:00
max	7c9f9582ca	Lab-host shipper + receiver /v1/ping + install scripts Implements the deployment loop end-to-end on the CIS490 side: shipper/ config.py ShipperConfig (host_id, paths, receiver endpoint, mTLS) transport.py httpx-based PUT + ping with mTLS + bearer support queue.py scan data/episodes/, tar+zstd via system zstd, ship, retire to data/shipped/. Idempotent across crashes per the state machine in docs/transport.md. __main__.py CLI: --ping (smoke test), --once (one pass), or daemon receiver/app.py: new POST /v1/ping that requires the same auth as PUT /v1/episodes but writes nothing. Used by `cis490-shipper --ping` during lab-host bring-up to verify the WG/Caddy/mTLS path before shipping any real bytes. etc/ cis490-shipper.service systemd unit for the lab-host shipper cis490-orchestrator.service systemd unit for the lab-host queue (kept disabled by default until queue mode lands) lab-host.toml.example config template scripts/ install-lab-host.sh idempotent installer; verifies prereqs, creates cis490 service user, syncs repo to /opt/cis490, builds venv, drops systemd units and config template install-receiver.sh same, for the receiver role on the central WG node (Pi5 in our setup) tests/test_shipper.py 11 end-to-end tests against a real Uvicorn server hosting the receiver app. Exercises ping, tar+ship, idempotent re-ship, 409 conflict, transient (receiver down), tarball round-trip via system zstd. AGENTS.md guidance for AI agents working on this and sibling repos. Headline: when you hit an issue you can't fully fix in scope, file a Forgejo issue rather than leaving a TODO. 51/51 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:41:32 -05:00

29 commits