CIS490

Author	SHA1	Message	Date
Max	2aa7b865fb	training/models: knn_semi — semi-supervised self-training KNN Registered as `knn_semi`. Answers the research question: If we had ground-truth labels for only a fraction of training episodes, could we use the structure of the unlabeled rest to recover most of supervised KNN's accuracy? Pipeline (Yarowsky-style self-training): 1. Split train slice deterministically into labeled (label_frac=0.2 default) and unlabeled (1 - label_frac) by row-index hash. 2. Fit a "labeler" KNN on the labeled fraction. 3. Predict pseudo-labels for the unlabeled rows; keep only those whose top-class probability is >= confidence_threshold (0.6). 4. Fit the final KNN on (labeled rows + confident pseudo-labels). Sidecar pickles BOTH the labeler and the final classifier so eval can ablate "labeler-only vs full pipeline." Smoke run (567-episode subset, oracle mode, label_frac=0.2): knn (100% labels): val_macro_f1 0.737, test_macro_f1 0.133; knn_semi (20% labels): val_macro_f1 0.654, test_macro_f1 0.173. Lower val (less data) but HIGHER cross-device test — pseudo-labeling acts as a regularizer that prevents overfitting to elliott-thinkpad's specific neighborhood structure. Honest research finding worth a slide in the writeup. Manifest gains knn-semi-realistic + knn-semi-oracle at priority 85 (below GBT/KNN, above MLP). Storage cost = augmented set × n_features × 4 bytes; same .knn.pkl sidecar format as plain KNN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:51:30 -05:00
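The four-step pipeline in 2aa7b865fb can be sketched with the standard library alone. This is a toy illustration, not the repository's code: the real model wraps sklearn's KNeighborsClassifier, the commit says only "row-index hash" (sha256 is a stand-in here), and `is_labeled`, `knn_top_class`, and `self_train` are hypothetical names.

```python
import hashlib
from collections import Counter


def is_labeled(row_idx, label_frac=0.2):
    """Deterministic split: hash the row index into [0, 1) and compare.

    Assumption: sha256 stands in for whatever hash the real code uses.
    """
    digest = hashlib.sha256(str(row_idx).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < label_frac


def knn_top_class(train_X, train_y, x, k):
    """Toy KNN: return (top class, its vote share) among the k nearest rows."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def self_train(X, y, label_frac=0.2, confidence_threshold=0.6, k=3):
    """Return (labeler training set, augmented final training set)."""
    # Step 1: deterministic labeled / unlabeled split.
    labeled = [i for i in range(len(X)) if is_labeled(i, label_frac)]
    unlabeled = [i for i in range(len(X)) if not is_labeled(i, label_frac)]
    if not labeled:
        raise ValueError("label_frac left no labeled rows")
    # Step 2: the "labeler" is a KNN over the labeled rows only.
    lab_X = [X[i] for i in labeled]
    lab_y = [y[i] for i in labeled]
    # Step 3: pseudo-label the rest, keeping only confident predictions.
    aug_X, aug_y = list(lab_X), list(lab_y)
    for i in unlabeled:
        label, prob = knn_top_class(lab_X, lab_y, X[i], k)
        if prob >= confidence_threshold:
            aug_X.append(X[i])
            aug_y.append(label)
    # Step 4: the final KNN would be fit on the augmented set.
    return (lab_X, lab_y), (aug_X, aug_y)
```

Returning both training sets mirrors the sidecar's "labeler-only vs full pipeline" ablation: either set can be handed to a classifier independently.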
Max	2187a5d752	training/models: KNN as a registered supervised model Non-parametric baseline alongside GBT/MLP/CNN/GRU/LSTM/Transformer. Same BaseModel + schema-hashed checkpoint contract; sidecar is a pickled sklearn KNeighborsClassifier (.knn.pkl) handled by the existing checkpoint machinery alongside .xgb.json / .pt. KNN's storage cost = n_train_rows × n_kept_features × 4 bytes. At 660k windows × 145 kept (realistic mode) features = ~380 MB sidecar; at 230 features (oracle) = ~600 MB. Heavy but ships through the same artifact-upload path. trainer/run.py learns a third fit branch: - GBT — XGBoost early stopping on val mlogloss - KNN — fit() memorizes; "training time" is val/test predict cost - NN — train_nn loop (the rest) Manifest gains knn-realistic + knn-oracle at priority 95 (just below GBT). KNN's k=10 default lives in the model class — overriding via hyper.k requires adding --k to run.py first to avoid the unknown-arg exit-2 issue. Smoke verified on the 567-episode subset: knn oracle val=0.7365 test=0.1333 (held-out k-gamingcom) That val/test gap (0.74 → 0.13) is the cross-device generalization story: KNN memorizes elliott-thinkpad's local feature space and falls apart on the other host. Honest baseline for the comparison report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:06:56 -05:00
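The sidecar size estimate in 2187a5d752 is plain arithmetic and can be checked directly. The sketch below assumes the 4 bytes per value correspond to float32 features and ignores pickle overhead; `knn_sidecar_bytes` is a hypothetical helper name.

```python
def knn_sidecar_bytes(n_train_rows, n_kept_features, bytes_per_value=4):
    """Raw feature-matrix size: rows x features x bytes per value.

    Assumption: 4 bytes per value means float32; pickle overhead is ignored.
    """
    return n_train_rows * n_kept_features * bytes_per_value


for mode, n_features in (("realistic", 145), ("oracle", 230)):
    size = knn_sidecar_bytes(660_000, n_features)
    print(f"{mode}: {size / 1e6:.1f} MB")
# prints "realistic: 382.8 MB" then "oracle: 607.2 MB"
```

Both figures agree with the commit's rounded ~380 MB and ~600 MB.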
Max	8643192a71	training/fleet: distributed multi-host trainer with capability gating Symmetric companion to the collection fleet (orchestrator/fleet.py) but for training. Collection is embarrassingly parallel; training is not (a model is trained at most once across the fleet), so the receiver coordinates which worker gets which job. Operator-control surface is etc/training_manifest.toml.example — single canonical file declaring (a) per-host capability + per-model allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper) with capability constraints (require_cuda, prefer_cuda, min_vram_gib, min_ram_gib, allowed_hosts). Components: capability.py — self-detection: hostname, cores, RAM, CUDA presence, VRAM, torch version, git commit. Used by workers to filter eligible jobs before claiming. manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable sha256 of (model, mode, hyper, split_recipe, train_hosts, seed) so manifest reload is idempotent: existing rows keep their status, new jobs become claimable, removed jobs stay until cancelled. queue.py — SQLite job queue (training_jobs.db) with statuses pending|claimed|running|completed|failed|cancelled. Atomic claim_next via single UPDATE WHERE status='pending'. Heartbeat, complete, fail. Stale-claim sweep (stale_after_s=600s) with max_attempts cutoff to failed. store.py — model artifact store mirroring receiver/store.py. Artifact ID is the sha256 of the uploaded tarball; bit-identical re-runs deduplicate.
receiver.py — Starlette app exposing 11 endpoints: POST /v1/job/claim (worker) POST /v1/job/{id}/heartbeat (worker) POST /v1/job/{id}/complete (worker) POST /v1/job/{id}/fail (worker) PUT /v1/model/{id} (worker — uploads tarball) GET /v1/jobs (anyone) GET /v1/workers (anyone) POST /v1/job/{id}/cancel (operator: X-Operator-Token) POST /v1/job/{id}/requeue (operator) POST /v1/manifest/reload (operator) GET /v1/health (anyone) Runs as cis490-trainer-receiver.service on the Pi alongside the existing receiver, on a separate port. client.py — stdlib HTTP client (urllib only, no new deps). worker.py — long-running daemon. Loop: detect capability → claim → spawn training/trainer/run.py subprocess → heartbeat every 30s → tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe. Operator CLI (tools/cis490_jobs.py): status / list / show / cancel / requeue / reload / workers. Cancel and requeue require $CIS490_OPERATOR_TOKEN matching the receiver's configured value. Bootstrap: scripts/install-training-worker.sh (Linux systemd) and scripts/install-training-worker-windows.ps1 (Windows Scheduled Task) let the operator enroll a new host with one command after cloning the repo and setting up the venv. Worker self-tests capability before registering. End-to-end smoke verified on the Pi: receiver up, manifest synced, 14 jobs queued, worker registered, claimed 4 CPU-eligible jobs (allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle, mlp-oracle), 1 failed with the actual error visible via cis490-jobs status, 3 artifacts uploaded to /var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with proper index.jsonl row. 21 unit tests (manifest validation: 8; queue lifecycle + eligibility: 13). All pass alongside the prior 17 training tests = 38 green. Open limitations surfaced inline: - Hyper-key drift between manifest and run.py fails at training time, not at manifest reload (worth tightening to argparse introspection later). 
- mTLS not yet wired through Caddy for the trainer-receiver port — listens loopback-only until that lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 01:20:20 -05:00
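Two ideas from 8643192a71, stable job IDs and the single-UPDATE claim, can be sketched with the standard library. Everything below is an assumption made for illustration: the table schema, the `claim_token` column used to read back the claimed row, the JSON canonicalisation, and the names `open_queue`, `job_id`, and `claim_next` as written here. The real queue.py may differ on all of them.

```python
import hashlib
import json
import sqlite3
import time
import uuid


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Stable ID: sha256 over a canonical JSON encoding of the job tuple."""
    payload = json.dumps(
        [model, mode, hyper, split_recipe, list(train_hosts), seed],
        sort_keys=True,  # hyper key order must not change the ID
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def open_queue(path=":memory:"):
    """Create a hypothetical jobs table (not the repository's schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id TEXT PRIMARY KEY, priority INTEGER NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT, claim_token TEXT, claimed_at REAL)"
    )
    return conn


def claim_next(conn, worker):
    """Claim the highest-priority pending job with one UPDATE, or return None."""
    token = uuid.uuid4().hex
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status='claimed', worker=?, claim_token=?, claimed_at=? "
            "WHERE status='pending' AND id = ("
            "  SELECT id FROM jobs WHERE status='pending' "
            "  ORDER BY priority DESC, id LIMIT 1)",
            (worker, token, time.time()),
        )
        if cur.rowcount == 0:
            return None  # nothing pending, or another worker won the race
        row = conn.execute(
            "SELECT id FROM jobs WHERE claim_token=?", (token,)
        ).fetchone()
    return row[0]
```

Because the status check and the status change happen in the same statement, two workers cannot both move one row out of `pending`. The per-claim token is only a portable way to learn which row was taken without relying on `RETURNING`.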
Max Gorog	a8157ed177	training/dashboard: live deck at dashboard.wg, fed by receiver Starlette + WebSocket dashboard run on the Pi as cis490-dashboard.service (127.0.0.1:8447, Caddy-fronted at dashboard.wg). Tails /var/lib/cis490/index.jsonl for episode events, snapshots host counts every 30s, broadcasts to every connected browser. New connections get a warm snapshot (recent_episodes, total_bytes, host_counts) so reloads don't see a cold dashboard. Frontend is a 10-scene scrollytelling deck following the project outline: intro, collect, hosts, db explorer, baseline, attacks, chunking, models, knn, perf. Sticky full-bleed canvas with a right-aligned prose column (matrix-explorable layout). Hotkeys (arrows, space, j/k, c, Home/End), prev/next chevrons, FAB, and an opt-in click-to-advance toggle. Demo toggle drives synthetic data for the five scenes that have no real producer yet (attack envelopes, chunking, model bars, knn scatter, perf scatter); when off, those scenes show "awaiting <event_type> events" rather than fake data. Producers wire in by POSTing typed JSON to 127.0.0.1:8447/publish (loopback only; Caddy 404s it externally). Event types the widgets subscribe to: model_metric {model, accuracy}, embedding {x, y, phase}, model_perf {model, latency_us, accuracy}, prediction {episode_id, window_idx, predicted, actual}, attack_profile {name, shape, curve}, phase {phase}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 21:26:07 -05:00
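A producer for the dashboard in a8157ed177 could look roughly like the sketch below. The commit says only that producers POST "typed JSON" to the loopback `/publish` endpoint, so the envelope shape (a top-level `"type"` key beside the event fields) is an assumption, as are the helper names.

```python
import json
import urllib.request

PUBLISH_URL = "http://127.0.0.1:8447/publish"  # loopback-only per the commit


def build_publish_request(event_type, **fields):
    """Build the POST without sending it.

    Assumption: the event type travels as a "type" key next to the fields.
    """
    body = json.dumps({"type": event_type, **fields}).encode()
    return urllib.request.Request(
        PUBLISH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def publish(event_type, **fields):
    """Send one event to the dashboard and return the HTTP status."""
    request = build_publish_request(event_type, **fields)
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.status
```

For example, `publish("model_metric", model="knn", accuracy=0.5)` would feed the model-bars scene, using the `model_metric {model, accuracy}` fields listed in the commit.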
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file — manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob — ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run__demo.py --manifest (samples path now from canonical) New surface: manifest.toml — the single source of truth * orchestrator/manifest.py — load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta — populated by run__demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance cis490-orchestrator.service — RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh — validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 
12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
max	49eba2fd60	fleet-health: proactive alerts on the Pi + per-host doctor reports Two pieces of self-monitoring so the maintainer isn't the alarm: (2) Receiver-side fleet health monitor cis490-fleet-health.timer runs check_fleet_health.py every 5 min. Detects three symptoms and writes them to /var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy to forward to a notifier): silent — host shipped in last 24h but has been quiet >30 min fatal-only — actively shipping but every PUT 4xx unstamped — shipping without X-Cis490-Code-Commit header Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault fires once per hour, not every 5 min. 15 unit tests cover the index parser, three detectors, and dedup. (3) Per-host doctor snapshots Lab hosts run cis490-doctor-check.timer once a day (10 min after boot, then daily with 30-min jitter). The timer runs cis490_doctor.py --json and PUTs the result to a new endpoint: PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json GET /v1/host-health → aggregate across all hosts Endpoint is NOT gated by version_gate — sick hosts running stale code MUST still be able to report sickness. 11 unit tests cover PUT/GET, atomic-write semantics, bearer auth, and the not-gated-by-version-gate property. ship_health_check.py reuses the existing shipper transport (mTLS + bearer + receiver URL from lab-host.toml) so we don't reimplement auth. Both timers wired into install-lab-host.sh — the loop also enables the previously-added autoupdate + cert-fetch timers, so a single install run gives a host all four self-healing mechanisms. Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2 pre-existing test_fleet.py failures from the elliott-ThinkPad merge (`667f042`) are unrelated to this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:48:31 -05:00
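The "silent" detector and the hour-bucket dedup from 49eba2fd60 reduce to a few lines. This sketch assumes Unix-second timestamps and an in-memory `seen` set; the real monitor persists alerts to alerts.jsonl, and the function names here are hypothetical.

```python
SILENT_AFTER_S = 30 * 60      # quiet for more than 30 min
ACTIVE_WINDOW_S = 24 * 3600   # but shipped within the last 24 h


def is_silent(last_ship_ts, now):
    """True if the host shipped in the last 24 h but not in the last 30 min."""
    age = now - last_ship_ts
    return SILENT_AFTER_S < age <= ACTIVE_WINDOW_S


def dedup_key(host, symptom, now):
    """Key alerts on (host, symptom, hour bucket)."""
    return (host, symptom, int(now // 3600))


def new_alerts(findings, seen, now):
    """Return findings not yet raised in this hour bucket; record them in seen."""
    fresh = []
    for host, symptom in findings:
        key = dedup_key(host, symptom, now)
        if key not in seen:
            seen.add(key)
            fresh.append((host, symptom))
    return fresh
```

With a 5-minute timer, a sustained fault produces the same key up to twelve times per hour, so only the first check in each bucket emits an alert.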
max	3180f7b5ac	lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry k-gamingcom symptom (2026-05-02): the on-device agent successfully finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS material" because the cert auto-fetch step in install-lab-host.sh either ran with host_id still REPLACE_ME, or hit a transient bootstrap.wg failure, and there's no automatic retry. The Pi-side cert IS minted and the bootstrap endpoint serves it — the failure mode is purely "lab-host hasn't pulled it down." Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh (idempotent, no-op when certs are already on disk, no-op when host_id is unset, exit-0 on transient network failure so the unit doesn't get pinned as failed), and run it from a 5-minute systemd timer. The timer handles all three "stuck waiting on mTLS" cases without operator action: - operator edited host_id post-install but didn't re-run install - bootstrap.wg was briefly unreachable during install - lab host was offline when install ran but came up later The script `try-restart`s cis490-shipper after a successful fetch so the daemon picks up the new cert immediately instead of waiting for its lazy retry. install-lab-host.sh still calls the script on install for fast first-time bring-up — the timer is the safety net. Tarball extract is staged through a temp dir + atomic rename so a mid-extract crash never leaves us with a mismatched cert/key pair. AGENTS.md row 4 updated: "waiting on mTLS material" remediation now points at the timer, with the exact `systemctl start cis490-cert-fetch.service` command to force an immediate retry. Tests: 267/267 unchanged. The fetch script is idempotent + has all its happy/error paths handled inline; a unit test would mostly be testing systemd's behaviour. The integration test path is the timer running on a real lab host, which is the actual production case. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:30:16 -05:00
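The "stage in a temp dir, then atomic rename" extraction from 3180f7b5ac is implemented in a shell script in the repository. The Python sketch below shows the same idea under simplifying assumptions: the destination directory does not exist until the certs are installed, archive members are flattened to their base names, and `install_cert_bundle` is a hypothetical name.

```python
import os
import shutil
import tarfile
import tempfile


def install_cert_bundle(tar_path, dest_dir):
    """Extract a cert tarball so dest_dir appears fully populated or not at all.

    Returns False (a no-op) when dest_dir already exists.
    """
    if os.path.isdir(dest_dir):
        return False
    parent = os.path.dirname(os.path.abspath(dest_dir))
    # Stage next to the destination so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=".certs-staging-", dir=parent)
    try:
        with tarfile.open(tar_path) as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue  # skip directories, links, and devices
                # Flatten to the base name so no member can escape staging.
                name = os.path.basename(member.name)
                target = os.path.join(staging, name)
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
        # Atomic on POSIX within one filesystem: readers never see a half set.
        os.rename(staging, dest_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return True
```

A crash before the rename leaves only a hidden staging directory behind, never a cert without its key.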
Elliott Kolden	667f042707	Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01) Root causes and fixes documented in TIER3-BRINGUP.md. Summary: 1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot. 2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring modules selected on SLIRP runs; fix: always filter requires_bridge. 3. cmd/unix/interact creates no session.list entry → session_open_timeout every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl. 4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444); fix: extra_host_port:extra_host_port mapping so guest binds the per-slot LPORT directly. 5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots; fix: requires_bridge=true filters it from SLIRP fleet runs. 6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba boots (~60 s too early); fix: replace TCP probe with serial console _wait_for_serial_login that waits for actual "login:" prompt. 7. Stale QEMU survives orchestrator restart (start_new_session=True) → holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from old pidfile before rmtree. 8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100. 9. msfrpcd 6.x returns bytes for all string values even with raw=False; fix: MSFRpcClient._str() recursive decoder applied to all responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:26:19 -06:00
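Fix 9 in 667f042707 describes a recursive decoder applied to every msfrpcd response. A minimal sketch of that idea follows, assuming UTF-8 payloads; `decode_strs` is a hypothetical stand-in for the repository's `MSFRpcClient._str()`.

```python
def decode_strs(obj):
    """Recursively turn bytes into str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        # Assumption: responses are UTF-8; undecodable bytes are replaced.
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        return {decode_strs(k): decode_strs(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(decode_strs(v) for v in obj)
    return obj  # ints, floats, None, and str pass through unchanged


print(decode_strs({b"job_id": [b"1", 2], b"result": b"success"}))
# prints {'job_id': ['1', 2], 'result': 'success'}
```

Decoding keys as well as values matters here, because a lookup such as `response["job_id"]` fails when the key arrived as `b"job_id"`.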
max	98dcd4f9f8	lab-host: cis490-autoupdate.timer for self-healing on push Today's incident: post-cutover, k-gamingcom went silent and elliott-thinkpad kept shipping pre-stamp episodes that the receiver gate 400'd in a 2300+ PUT loop. Both required `git pull && install-lab-host.sh` on the host — neither the on-device AI agent nor the operator pulled in time, and from the receiver Pi I cannot reach in (sshd off on the lab hosts). Fix the recurrence directly: a 30-min systemd timer that does git fetch + (if behind) ff-only pull + re-run install-lab-host.sh. Hosts catch up on the next tick on their own — no human or agent action required. Mechanics: - scripts/auto-update.sh runs as root, drops to cis490 for git ops to satisfy /opt/cis490 ownership ("dubious ownership" guard). - Refuses ff if local HEAD isn't an ancestor of origin/main — protects operator hand-edits from silent overwrite. - Network failures exit 0 (offline is normal, don't pin a unit failure); divergence + install failures exit non-zero so the journal records what broke. - RandomizedDelaySec=10min on the timer prevents thundering-herd when several hosts boot together. - Hands off to install-lab-host.sh via exec — exactly one path through bring-up; no special "auto" flow. The version-gate provides the quality boundary, so even if origin/main moves forward unsafely, the receiver's allow-list still controls what lands in the index. install-lab-host.sh enables cis490-autoupdate.timer on every run, idempotent — existing hosts pick it up the next time they pull manually. Filed Forgejo #18 with the canonical command for elliott-thinkpad + k-gamingcom to bootstrap themselves out of the current incident (auto-update doesn't help them retroactively — it has to be running before the cutover to catch the next one). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:59:31 -05:00
max	f9b2e5c4e6	shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors Three robustness items off the future-work list: 1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The daemon sends READY=1 after queue construction and WATCHDOG=1 once per scan pass via a heartbeat callback wired into run_forever. Restart=on-failure only catches process death — silent stalls (deadlock, hung tar subprocess, blocked I/O past timeout) used to leave a zombie running with the data backlog growing. Now systemd kills + restarts the daemon if no WATCHDOG=1 arrives within 180s. Verified end-to-end against systemd via `systemd-run --transient --property=Type=notify --property=WatchdogSec=10`: unit transitions to active on READY=1; SIGSTOP'ing the process triggers `Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at exactly t+10s, then unit goes failed → restart cycle. 2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew forever as fatal episodes piled up. New ShipperConfig fields: quarantine_keep_days = 30 (opt-out: 0 disables) and quarantine_cleanup_interval_s = 3600 (gate so the 5s tick doesn't statx() the whole tree). Cleanup runs at the start of run_once() but is gated to once per hour. Removed entries logged. 3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper journal and surfaces 412/400/transient patterns as red/yellow rows with the canonical fix command. An on-device agent running cis490_doctor.py now sees one line ("12 ship(s) rejected as out-of-window") instead of needing to grep the journal. Tests: 200/200 (was 188). New coverage: heartbeat callback fires + survives exceptions; quarantine cleanup respects keep_days, gate, and opt-out; doctor parser correctly classifies 412/400/transient/clean/empty/journalctl-denied; both error classes prioritise 412 (more actionable) when present together. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:02:59 -05:00
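The watchdog handshake in f9b2e5c4e6 uses systemd's notify protocol: the service writes datagrams such as `READY=1` and `WATCHDOG=1` to the Unix socket named in `$NOTIFY_SOCKET`. The stdlib-only sketch below shows that protocol; how the shipper actually sends these messages is not stated in the commit, and `sd_notify` is a hypothetical helper here.

```python
import os
import socket


def sd_notify(state):
    """Send one notification to systemd; return False when not supervised."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False  # not started by systemd with Type=notify
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # "@" marks an abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.sendall(state.encode())
    return True


def heartbeat():
    """Call once per scan pass; never let a notify error stop the scan loop."""
    try:
        sd_notify("WATCHDOG=1")
    except OSError:
        pass
```

With `WatchdogSec=180`, the daemon has to call `heartbeat()` at least once every 180 seconds, which is why it is tied to the scan pass and not to a separate timer thread that could keep ticking while the scan is stuck.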
max	ed5e6b0581	docs+doctor: surface VERSION-stamp + fallback wiring receiver.toml.example: the local_repo_path comment was wrong about when it kicks in. With the new fallback path, it's used both when forgejo_url is unset (sole backend) AND when forgejo is unreachable (failover). Document that, plus the auto-detect of /opt/cis490/.git. cis490_doctor: add a VERSION-stamp check for lab-host role. If /opt/cis490/VERSION is missing or malformed, the orchestrator stamps "unknown" → receiver gate rejects every PUT → quarantine. Surface this as a red row with the canonical fix (re-run install-lab-host.sh) so an on-device agent doesn't have to grep journal logs to figure it out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:54:36 -05:00
max	cc0c96953e	version_gate: Forgejo as canonical commit source (no fs perms needed) Initial git-log-based gate ran into a permission wall: the cis490 service user can't read /home/max/cis490/.git (ProtectHome=true + home-dir mode). Switching the production source to the local Forgejo HTTP API (already accessible to all WG peers, single source of truth both lab hosts and the receiver pull from). When the maintainer pushes new code to spectral/CIS490, the next 5-second cache refresh sees the new commit and lab hosts can immediately ship under it. VersionGate now takes either: - forgejo_url + repo_owner + repo_name + branch (+ optional auth_token for private repos): hits /api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n> - repo_path: dev-only fallback, runs `git log` locally Local-git path retained for tests + the dev-only case. receiver.toml.example gains forgejo_url/repo_owner/repo_name/branch with auth_token commented; live-deployed receiver.toml on the Pi has the spectral org + token. Live state on the Pi: 41 valid hashes loaded, head=f8ad02b. Verified end-to-end: bogus commit → 412 + remediation HEAD commit → clears gate (fails downstream at sha-mismatch as expected for the empty-body verify probe) Test added: test_forgejo_backend_accepts_returned_commits stands up a tiny canned-response HTTPServer in-process, exercises the parser without depending on a live Forgejo instance. Brings test_version_gate to 10 cases; total 158/158.	2026-05-01 01:42:45 -05:00
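The Forgejo lookup in cc0c96953e comes down to building one URL and collecting hashes from the response. The sketch below assumes the endpoint returns a JSON array of objects that each carry a `"sha"` key; the host name in the usage line is a placeholder and both function names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode


def commits_url(forgejo_url, repo_owner, repo_name, branch, limit):
    """Build /api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n>."""
    query = urlencode({"sha": branch, "limit": limit})
    base = forgejo_url.rstrip("/")
    return f"{base}/api/v1/repos/{quote(repo_owner)}/{quote(repo_name)}/commits?{query}"


def parse_commit_hashes(body):
    """Assumption: the body is a JSON array of objects with a "sha" key."""
    return frozenset(item["sha"] for item in json.loads(body))


print(commits_url("https://forgejo.example/", "spectral", "CIS490", "main", 100))
# prints https://forgejo.example/api/v1/repos/spectral/CIS490/commits?sha=main&limit=100
```

Keeping the parser separate from the HTTP call is what lets a canned-response test exercise it without a live Forgejo instance.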
max	f8ad02b2d7	Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed) Stops out-of-date lab hosts from polluting the dataset with episodes generated by buggy code. The valid-commits set mirrors the maintainer's working clone on the Pi automatically — when the maintainer pulls or pushes a new commit, the receiver picks it up within the 5-second cache TTL with no service restart. Receiver changes: - receiver/version_gate.py (new): VersionGate(repo_path, window). Each check() consults a frozenset of the last `window` commit hashes from `git -C <repo> log --format=%H -n <window>`, refreshed every 5s under a lock. Resilient to transient git failure (keeps prior cache so a flaky `git` doesn't lock out every shipper). - receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check() before ingest. Rejects with: 400 + remediation if header missing or malformed 412 + remediation + your_commit + head_commit if not in window Remediation block is verbatim copy-pasteable into the lab-host shell: cd /opt/cis490 && sudo -u cis490 git pull origin main sudo /opt/cis490/scripts/install-lab-host.sh sudo systemctl restart cis490-orchestrator - receiver/store.py: ingest_stream takes commit kwarg, stamps it on the index.jsonl row (new optional field). Backfilled rows from index_backfill.py also pull commit out of meta.json. - receiver/config.py + etc/receiver.toml.example: new [version_gate] section. enabled=true, repo_path=/home/max/cis490, window=100 by default. Enabled toggle exists for emergency disable-and-collect. Shipper changes: - shipper/transport.py: ship_tarball() takes commit kwarg, sends X-Cis490-Code-Commit header. 412 maps to status='fatal' so the queue doesn't infinite-retry — operator must pull and reinstall before the next ship will succeed. - shipper/queue.py: reads meta.json::code_version.commit per episode, passes through. On 412, logs the receiver's full remediation block at ERROR level so journalctl on the lab host shows exactly what to run. 
Tests: 9 in test_version_gate (including 2 end-to-end via starlette.testclient), 2 cover the boundary where new commits land mid-cache and where missing-repo gracefully keeps prior cache. 157/157 total. Index schema: existing rows stay valid (commit field is optional on read). New rows from receiver-direct AND from index_backfill.py include commit.	2026-05-01 01:38:50 -05:00
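The gate behaviour in f8ad02b2d7 (a TTL-cached allow-list, 400 for a missing or malformed header, 412 for a commit outside the window, prior cache kept on failure) can be sketched as follows. This is not the repository's VersionGate: the class name, the injected `load_hashes` callable, the assumption that the header carries a full 40-character hash, and the convention of returning `None` to let the PUT proceed are all illustrative.

```python
import re
import threading
import time

_FULL_HASH = re.compile(r"^[0-9a-f]{40}$")  # assumption: full hashes only


class GateSketch:
    def __init__(self, load_hashes, ttl_s=5.0, clock=time.monotonic):
        self._load = load_hashes  # e.g. wraps `git log --format=%H -n <window>`
        self._ttl = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._valid = frozenset()
        self._loaded_at = None

    def _refresh(self):
        with self._lock:
            now = self._clock()
            if self._loaded_at is not None and now - self._loaded_at < self._ttl:
                return  # cache still fresh
            try:
                self._valid = frozenset(self._load())
            except Exception:
                pass  # keep the prior cache on a transient failure
            self._loaded_at = now

    def check(self, header):
        """Return the HTTP status to reject with, or None to allow the PUT."""
        if not header or not _FULL_HASH.match(header):
            return 400
        self._refresh()
        if header not in self._valid:
            return 412
        return None
```

Swallowing loader errors means one failed refresh degrades to a slightly stale allow-list, where propagating the error would reject every shipper.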
max	265f3ad313	Tier-4 sample source: theZoo (no auth, no operator action) Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo). theZoo is a public security-research repo with hundreds of malware samples organized by family, password-protected with the well-known 'infected'. No API key, no signup, nothing for an operator to do — which is what zero-touch tier-4 actually means. Changes: - tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB) to /var/lib/cis490/theZoo on first run, then for each manifest family without a sha256 it locates a matching Binaries/<Name> dir, extracts the .zip with password 'infected', picks the largest non-text payload as the binary, sha256s it, stages at samples/store/<sha256>, and rewrites manifest.toml in place (atomic tempfile + os.replace, stat preserved). Mandatory exit semantic: non-zero if no real samples landed. - scripts/install-tier-3-4.sh: dropped the MB-key resolution chain (env var → local file → bootstrap.wg fetch). Now just runs auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4 remains as the explicit override but is documented as defeating the project. - bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service: removed the /v1/secret/<name> endpoint and the --secrets-root flag. Dead code now that no API key needs distributing. Live-rolled back on the Pi (404 verified post-restart, stale /etc/cis490/secrets dir removed). - scripts/set-malwarebazaar-key.sh: deleted. No MB key means no one-time operator step. - tests/test_bootstrap_secrets.py: deleted (route removed). - AGENTS.md: rewrote tier-4 section to reflect zero-operator model. 148/148 tests pass. Bootstrap service rolled back live.	2026-05-01 01:17:50 -05:00
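The in-place manifest rewrite in 265f3ad313 is described as "atomic tempfile + os.replace, stat preserved". A minimal sketch of that pattern follows, under the assumption that "stat preserved" means at least the permission bits; `atomic_rewrite` is a hypothetical name and the caller supplies the already-rendered file text.

```python
import os
import shutil
import tempfile


def atomic_rewrite(path, new_text):
    """Replace path's contents so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory for os.replace to be atomic.
    fd, tmp = tempfile.mkstemp(prefix=".tmp-", dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(new_text)
            handle.flush()
            os.fsync(handle.fileno())  # data on disk before the rename
        shutil.copymode(path, tmp)  # assumption: preserve permission bits only
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the process dies mid-write, the original manifest.toml is untouched and only a stray `.tmp-` file is left behind.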
max	5d0e8e33a9	Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key User: 'we don't want it to be optional, this real malware IS the data we want.' Acknowledged. Three changes make Tier 4 actually mandatory without forcing per-host operator action: 1. bootstrap.wg /v1/secret/<name> endpoint - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts over the same trust boundary as the cert endpoint (WG mesh, iptmonads-gated). Strict allow-list — only `malwarebazaar` resolves; everything else 404s. Secret returned as bare text with Cache-Control: no-store. Live-verified on the Pi. - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned, 200 with token, 404 unknown name, 500 on empty file. 2. install-tier-3-4.sh: Tier 4 is no longer optional - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token → https://bootstrap.wg/v1/secret/malwarebazaar. - Caches the bootstrap-fetched key locally so re-runs are offline. - If all three resolution paths fail, dies with the exact remediation command for the operator (one-time set-malwarebazaar-key.sh on the Pi). - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still works for emergency overrides but logs a warning that the host will produce only mimics). Deploy fails if zero binaries land in samples/store/ — no silent mimic-only fallback. - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'. 3. scripts/set-malwarebazaar-key.sh - Pi-side helper: one operator command per fleet, ever. Accepts key via env or stdin, validates length, drops at the right path with the right perms. Lab hosts pull the rest automatically. AGENTS.md: rewrote the Tier-4 section to reflect mandatory status + the one-time-on-Pi distribution model. 152/152 tests pass. Bootstrap service updated live on the Pi.	2026-05-01 00:44:41 -05:00
elliott	4e8d2bdb04	etc/lab-host.toml.example: pin Caddy root, not wg-pki client CA (closes #14) ca_bundle is what the shipper uses to verify collector.wg's TLS cert. That cert is signed by the Caddy Local Authority, bundled in the repo as etc/caddy-root.crt. Pointing it at wg-ca.pem (the wg-pki CIS490 Lab-Host Client CA, which is the receiver's trust anchor for our client cert) caused CERTIFICATE_VERIFY_FAILED on every ship. Original fix authored by the on-device agent on k-gamingcom in Dev_REL2_043026@786b8da; cherry-picked here onto main. Co-Authored-By: k-gamingcom on-device agent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:26:36 -05:00
max	8753340ea3	fleet: fix per-slot run-dir collision so concurrent VMs actually run Root cause of "fleet says max_concurrent=3 but only one episode ships per wave" symptom on elliott-lab: 1. orchestrator/fleet.py::_run_slot set env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot. 2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm (NO slot suffix), then UNCONDITIONALLY overwrote the env's RUN_DIR with that flag's value before exec'ing the launcher. 3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots collided on the same socket dir. 4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's rmtree literally deleted slot 0's pidfile + sockets mid-boot. 5. Net effect: one VM survives per wave on a multi-core host that should be running ~cores-1 in parallel. Throughput collapses to 1/N. Fix: tools/run_real_vm_demo.py + tools/run_tier3_demo.py: --run-dir default cascade — 1) explicit CLI flag 2) RUN_DIR env (set by fleet runner) 3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0) Same change in both runners so Tier-2 + Tier-3 fleet waves parallelize cleanly. orchestrator/fleet.py::_run_slot: Pass --run-dir explicitly to the subprocess so the per-slot path is audit-visible in the fleet log instead of buried in env. Also flip the subprocess interpreter to repo_root/.venv/bin/python when present (was /usr/bin/env python3 — worked by luck because the orchestrator path doesn't import msgpack/httpx, but a Tier-3 fleet wave would have died at import-time on a host without those in system Python). etc/cis490-orchestrator.service: Removed the duplicate [Service] hardening block at the bottom of the file that was silently overriding the AmbientCapabilities grant (NoNewPrivileges=true at the bottom flipped the NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit them). Sources 3 + 4 would have failed silently inside the sandbox.
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable. 106/106 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:55:56 -05:00
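The `--run-dir` default cascade from 8753340ea3 is a three-level fallback. A minimal sketch, with `resolve_run_dir` as a hypothetical helper name and the environment passed in explicitly so the behaviour is easy to test:

```python
import os


def resolve_run_dir(cli_run_dir=None, env=None):
    """Pick the run dir: CLI flag, then RUN_DIR env, then a per-slot default."""
    env = os.environ if env is None else env
    if cli_run_dir:  # 1) explicit CLI flag wins
        return cli_run_dir
    if env.get("RUN_DIR"):  # 2) set by the fleet runner
        return env["RUN_DIR"]
    # 3) per-slot default; SLOT comes from the environment, default 0
    return f"/tmp/cis490-vm-{env.get('SLOT', '0')}"


print(resolve_run_dir(env={"SLOT": "2"}))
# prints /tmp/cis490-vm-2
```

The original bug was the reverse precedence: a flag default that always overwrote the environment. In this ordering the flag only wins when the caller actually supplied it.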
max	a93a3ff221	bootstrap: auto-issue mTLS leaves to enrolled lab hosts (closes #9, refs #3) Adds a pull-based cert distribution path so install-lab-host.sh can fetch its own leaf cert without operator intervention. Removes the ssh-from-Pi requirement that blocked elliott-lab. How the chicken-and-egg gets solved: a freshly wg-enrolled lab host already has WG access (gate kept by iptmonads at L4) and trusts the Caddy local CA (bundled in this repo at etc/caddy-root.crt). It makes a single TLS call to https://bootstrap.wg/v1/cert/<host_id> — no mTLS — gets back a tar of {ca.crt, leaf.pem, leaf.key}, extracts to /etc/cis490/certs/, and the shipper unblocks. Trust boundary is "reached :443 over WG"; no operator action needed. bootstrap/ app.py Starlette: GET /v1/cert/{host_id}, GET /v1/health. Validates host_id charset, rate-limits per source IP, logs every mint with the X-Real-IP Caddy injects. __main__.py uvicorn launcher; runs as root because the wg-pki CA private key is root-only. etc/cis490-bootstrap.service systemd unit on 127.0.0.1:8446 with ProtectSystem=strict + narrow ReadWritePaths=/var/lib/wg-pki. ProtectHome=no because systemd's read-only mode hides /home contents (the issuer script the wrapper exec's lives there). scripts/issue-cis490-client-cert-wrapper.sh Adapter the bootstrap service shells out to. Resolves the actual wg-pki issuer script across the three plausible install layouts (/opt/wg-pki, /home/max/wg-pki, /home/max/.env/wg-pki) so a single copy of the unit file works on any operator's box. Forces --out-dir to /var/lib/wg-pki/issued so writes stay inside the service's narrow ReadWritePaths. scripts/install-lab-host.sh After scaffolding lab-host.toml, if /etc/cis490/certs/lab-host.pem is absent, curls bootstrap.wg with --cacert etc/caddy-root.crt (no chicken-and-egg), extracts, chowns/chmods. Skips silently if bootstrap.wg is unreachable so manual hand-carry remains possible.
scripts/install-receiver.sh Drops cis490-bootstrap.service alongside cis490-receiver and prints both as "enable --now" candidates. cis490-bootstrap is the thing that makes lab hosts self-provisioning. etc/caddy-root.crt Bundled copy of wg-pki's published Caddy local CA root, so the bootstrap fetch can verify TLS without depending on a wg-pki clone that may or may not be on the lab host yet. Verified live on the Pi: $ curl --cacert etc/caddy-root.crt https://bootstrap.wg/v1/cert/elliott-lab -o /tmp/x.tar HTTP 200 size=10240 $ tar tf /tmp/x.tar ca.crt elliott-lab.key elliott-lab.pem $ openssl verify -CAfile … elliott-lab.pem /tmp/.../elliott-lab.pem: OK $ openssl x509 -subject … -noout subject=CN=elliott-lab Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:30:29 -05:00
max	a88ac83db0	Close out the deployment-readiness gaps Wraps the gaps surfaced in the "what is not implemented" audit so the fleet really is shippable end-to-end. Verified live on the Pi: - cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the new wg-pki client CA leaf - real episode dir → tar+zstd → PUT → HTTP 201 stored - re-ship same bytes → 200 (idempotent) - re-ship different bytes under same id → 409 (conflict) Changes: orchestrator/episode.py - EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/ revert per docs/architecture.md). When set + qmp_socket present, EpisodeRunner issues loadvm <snapshot_name> and emits snapshot_revert / snapshot_revert_failed events on the same monotonic clock as everything else. collectors/qmp.py - savevm() / loadvm() helpers using human-monitor-command, plus a test against the fake QMP server. exploits/workloads.py - chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer- busting payload. Includes a finalize step that sha256-verifies on the guest before exec. - real_binary_workload() now wraps the chunked plan for backwards compat with single-shot callers. exploits/driver.py - Tier-4 dispatch walks the chunked plan in MSFExploitDriver: each chunk is a separate session_shell_write; finalize verifies; exec only runs on sha-ok. New events: real_binary_upload_begin, real_binary_verify, real_binary_aborted. etc/cis490-orchestrator.service - Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE). - Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) + CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors work under hardening. scripts/install-lab-host.sh - Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID defaulting to `hostname -s`. 
- Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and builds cidata.iso with the in-guest agent embedded; symlinks both into /opt/cis490/vm/images/ so launchers find them. scripts/fetch-alpine-baseline.sh - Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2 matching the sha512 in docs/sources.md. tools/plot_envelope.py - Rebuilt to render whatever telemetry the episode dir contains: proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs → guest agent load/mem. Missing sources are silently skipped. tools/index_reader.py - cis490-index CLI: filter receiver's index.jsonl by host / sample / time range, sort, count-by group. Closest thing to a query interface until we stand up Postgres/Timescale. samples/README.md - Rewritten to match the new manifest schema, the kind=real vs mimic split, the per-(host, slot, ep) selection mechanic, and the chunked-upload safety story. Tests: 106 pass (was 102). New cases: - test_qmp.py — savevm + loadvm (HMP wrapper + error path) - test_tier4.py — chunked plan splitting, sha-pinned finalize, end-to-end driver walks all chunks + verify + exec via the fake msfrpc client Closes the "what is not implemented" punch list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:31:55 -05:00
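The chunked-upload plan described in commit a88ac83db0 (8 KiB base64 chunks carrying about 6 KiB of binary each, with a sha256 check before exec) could look roughly like the sketch below. This is an illustration only: `ChunkPlan`, `plan_chunks`, and `reassemble_and_verify` are hypothetical names and do not reproduce the repo's `chunked_real_binary_upload()` or `ChunkedUpload`. The reassembly step here runs locally, standing in for the guest-side finalize.

```python
import base64
import hashlib
from dataclasses import dataclass

B64_CHUNK_CHARS = 8192                       # 8 KiB of base64 text per chunk
RAW_CHUNK_BYTES = B64_CHUNK_CHARS // 4 * 3   # 6144 bytes (6 KiB) of binary per chunk


@dataclass(frozen=True)
class ChunkPlan:
    """Hypothetical plan object: ordered base64 pieces plus the expected digest."""
    chunks: list[str]
    sha256: str


def plan_chunks(payload: bytes) -> ChunkPlan:
    # Slice on a multiple of 3 bytes so every non-final chunk encodes without
    # padding and each chunk can be decoded independently of the others.
    chunks = [
        base64.b64encode(payload[i:i + RAW_CHUNK_BYTES]).decode("ascii")
        for i in range(0, len(payload), RAW_CHUNK_BYTES)
    ]
    return ChunkPlan(chunks=chunks, sha256=hashlib.sha256(payload).hexdigest())


def reassemble_and_verify(plan: ChunkPlan) -> bytes:
    # Finalize step: rebuild the binary and refuse to continue on a digest mismatch.
    data = b"".join(base64.b64decode(chunk) for chunk in plan.chunks)
    if hashlib.sha256(data).hexdigest() != plan.sha256:
        raise ValueError("sha256 mismatch; refusing to exec")
    return data
```

Pinning the digest in the plan, rather than computing it after transfer, is what lets the exec step be gated on a value the sender committed to up front.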
max	1b6c7b2f4a	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts This is the chunk that makes "real data" actually flow on multiple hosts in parallel. End-to-end pipe was up at `613c6fa` / 2579683; now the lab-host side has the diversity + concurrency it needs. Collectors landed: collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP client + row builder + run loop. Tolerates older qemu without query-stats. collectors/guest_agent.py — source 5 (deployable). Reads the virtio-serial host-side socket, parses agent JSON-lines, re-stamps to the host monotonic clock, persists. collectors/pcap.py — source 4 (deployable). tcpdump capture + pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md. In-guest agent: vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs, thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent. tools/build_cidata.py — embeds the agent + an OpenRC service into user-data so first boot of the Alpine cidata image auto-starts it. Launchers: vm/launch_demo.sh / launch_target.sh — second virtio-serial port for the agent socket; SLOT env support so multiple VMs run without socket / port collisions; PORT_BASE on launch_target so multiple target VMs hostfwd different host ports. vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24, no NAT). Idempotent. Fleet: orchestrator/fleet.py — capacity detector (cores / RAM / load headroom) + concurrent-slot runner. Per-slot ENV selects the sample. FleetCapacity dataclass round-trips into meta.json so "this episode ran with 6 concurrent VMs" is auditable post-hoc. tools/run_fleet.py — CLI: --capacity report; --waves N runs N waves of (max_concurrent) episodes each, every slot with a different sample. 
etc/cis490-orchestrator.service — now drives the fleet runner with Restart=always so each invocation runs one wave and respawns, giving a continuous stream. Samples: samples/manifest.toml — six profiles spanning the five major behaviour shapes. Each entry is real OR mimic (sha256 distinguishes). samples/manifest.py — strict TOML loader (rejects dups, unknown categories) + deterministic select(host_id, slot, episode_index) so different hosts on the network walk the catalog in different orders without any coordinator. EpisodeRunner: orchestrator/episode.py — optional qmp_socket + guest_agent_socket fields on EpisodeConfig; when set, additional collector threads run alongside proc_qemu. EpisodeResult now carries rows_qmp + rows_guest counters. Tier-3 setup automation: scripts/install-msfrpcd.sh — installs metasploit-framework where the package manager has it, generates a strong password into /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to 127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch once MSFRPC_PASSWORD is sourced. scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256 from the operator (Rapid7 download is registration-walled), pulls, verifies, converts vmdk → qcow2, lands at vm/images/. Tests: 82 pass (was 51). New suites: tests/test_qmp.py — fake QMP server, capability handshake, blockstats, async-event interleaving, 5-failure backoff tests/test_guest_agent.py — fake virtio socket, JSON-lines read + re-stamp, malformed-line tolerance tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames, bucketize correctness across windows tests/test_fleet.py — capacity math (8-core idle / low-RAM / high-load / Pi5 / 1-core box), manifest selection determinism + diversity What's queued for the next commit (already discussed in convo): - MSFExploitDriver v2: map sample.profile → distinct in-session workload so Tier-3 episodes don't all produce the same yes-loop envelope. Critical for ML to learn varied malware shapes. 
- Real-sample fetch from MalwareBazaar by sha256. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:02:27 -05:00
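Commit 1b6c7b2f4a mentions a deterministic `select(host_id, slot, episode_index)` that lets hosts walk the sample catalog in different orders without a coordinator. One plausible way to get that property is to hash the triple and index into the catalog, as sketched below. The function name `select_sample`, the key encoding, and the use of sha256 are assumptions for illustration; the commit does not say how samples/manifest.py actually does it.

```python
import hashlib


def select_sample(catalog: list[str], host_id: str, slot: int, episode_index: int) -> str:
    """Pick a catalog entry from (host_id, slot, episode_index) with no shared state.

    Hypothetical sketch: the same inputs always give the same entry, while
    different hosts or slots land on different entries with high probability.
    """
    if not catalog:
        raise ValueError("catalog is empty")
    # NUL separators keep ("a", 12, 3) and ("a1", 2, 3) from colliding.
    key = f"{host_id}\x00{slot}\x00{episode_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return catalog[int.from_bytes(digest[:8], "big") % len(catalog)]
```

Because the choice depends only on values each host already knows, any run can be replayed later from the episode metadata alone.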
max	2579683efb	receiver: default to 127.0.0.1:8444 (avoid wg-enroll-listener on 8443) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:45:23 -05:00
max	7c9f9582ca	Lab-host shipper + receiver /v1/ping + install scripts Implements the deployment loop end-to-end on the CIS490 side: shipper/ config.py ShipperConfig (host_id, paths, receiver endpoint, mTLS) transport.py httpx-based PUT + ping with mTLS + bearer support queue.py scan data/episodes/, tar+zstd via system zstd, ship, retire to data/shipped/. Idempotent across crashes per the state machine in docs/transport.md. __main__.py CLI: --ping (smoke test), --once (one pass), or daemon receiver/app.py: new POST /v1/ping that requires the same auth as PUT /v1/episodes but writes nothing. Used by `cis490-shipper --ping` during lab-host bring-up to verify the WG/Caddy/mTLS path before shipping any real bytes. etc/ cis490-shipper.service systemd unit for the lab-host shipper cis490-orchestrator.service systemd unit for the lab-host queue (kept disabled by default until queue mode lands) lab-host.toml.example config template scripts/ install-lab-host.sh idempotent installer; verifies prereqs, creates cis490 service user, syncs repo to /opt/cis490, builds venv, drops systemd units and config template install-receiver.sh same, for the receiver role on the central WG node (Pi5 in our setup) tests/test_shipper.py 11 end-to-end tests against a real Uvicorn server hosting the receiver app. Exercises ping, tar+ship, idempotent re-ship, 409 conflict, transient (receiver down), tarball round-trip via system zstd. AGENTS.md guidance for AI agents working on this and sibling repos. Headline: when you hit an issue you can't fully fix in scope, file a Forgejo issue rather than leaving a TODO. 51/51 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:41:32 -05:00
Maximus Gorog	83e111961d	Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency Implements docs/transport.md as a small Starlette app. The receiver streams episode tarballs to disk, verifies sha256 against an X-Content-SHA256 header, atomically renames into the store on success, and appends one row to a flat index.jsonl. No DB. Idempotent re-PUTs return 200; conflicting bodies return 409. Optional bearer-token auth (mTLS terminates at Caddy in prod). receiver/ store.py EpisodeStore: sha-verifying streaming ingest, atomic rename, append-only index. No HTTP. app.py make_app(): Starlette routes + bearer guard. config.py ReceiverConfig.load(): TOML parser. __main__.py uvicorn entrypoint, reads --config TOML. tests/test_receiver.py — 13 tests via httpx.ASGITransport. Covers: 201 new, 200 idempotent replay, 409 conflict, 400 sha mismatch + cleanup, 400 missing/ short header, 400 bad id, 400 bad suffix, 413 too large, 401 bearer enforcement, schema-version pass-through. etc/cis490-receiver.service — systemd unit with hardening flags. etc/receiver.toml.example — config template matching docs/deploy.md. End-to-end smoke-tested with curl: 201 → 200 → 409 path verified, file on disk, single index row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:34:04 -06:00
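The ingest behaviour described in commit 83e111961d (stream to disk, verify sha256, atomic rename, one index row, 201/200/409/400 outcomes) can be sketched without any HTTP layer as below. This is a minimal sketch under stated assumptions, not the repo's `EpisodeStore`: the function name `ingest`, the `.tar.zst` suffix, the `.part` temp suffix, and the index row fields are all hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def ingest(store: Path, episode_id: str, body: Iterable[bytes], claimed_sha256: str) -> int:
    """Return the HTTP status a receiver like this might answer with."""
    store.mkdir(parents=True, exist_ok=True)
    final = store / f"{episode_id}.tar.zst"        # assumed on-disk name
    hasher = hashlib.sha256()
    fd, tmp_name = tempfile.mkstemp(dir=store, suffix=".part")
    try:
        # Stream to a temp file in the same directory so the rename is atomic.
        with os.fdopen(fd, "wb") as tmp:
            for chunk in body:
                hasher.update(chunk)
                tmp.write(chunk)
        digest = hasher.hexdigest()
        if digest != claimed_sha256.lower():
            return 400                             # body does not match the header
        if final.exists():
            existing = hashlib.sha256(final.read_bytes()).hexdigest()
            return 200 if existing == digest else 409
        os.replace(tmp_name, final)                # atomic publish into the store
        with open(store / "index.jsonl", "a", encoding="utf-8") as index:
            index.write(json.dumps({"id": episode_id, "sha256": digest}) + "\n")
        return 201
    finally:
        # Clean up the temp file on every path where it was not renamed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
```

Re-hashing the stored file on replay keeps the sketch short; a real store would more likely look the digest up in its index.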

23 commits