CIS490

Author	SHA1	Message	Date
Max Gorog	207a902c3e	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml The experiment is now defined by a single version-pinned file — manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every lab host loads THIS exact file; per-host overrides of experiment shape are forbidden. Drops the following per-host CLI overrides that previously violated the canonical-manifest principle: * --manifest, --modules-dir (paths now derived) * --ram-per-vm-mib (in manifest.experiment) * --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling) * --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots) * --force-tier2 (not a §14 sanctioned override knob — ship empty catalog to disable Tier-3) * --require-real-samples (sample-side concern; out of fleet scope) * tools/run__demo.py --manifest (samples path now from canonical) New surface: manifest.toml — the single source of truth * orchestrator/manifest.py — load_canonical() + Manifest dataclass with strict validation, raises ManifestError on any failure * EpisodeConfig.experiment_meta — populated by run__demo.py from the canonical manifest; stamped into every episode's meta.json under "experiment" key for provenance cis490-orchestrator.service — RestartPreventExitStatus=78 so manifest-load failures stay stuck-and-loud (§9, §4.7) * install-lab-host.sh — validates manifest.toml at install time; missing or invalid = die with clear message Catalog admission semantics: only modules whose name appears in manifest.catalog get loaded into the runtime catalog (§4.3 in miniature, will tighten further in step 4 when verified_against / last_verified actually gate admission). Missing toml for an admitted name is a sysadmin error → exit 78. Renames cfg.manifest → cfg.samples + adds cfg.experiment to disambiguate sample-manifest from experiment-manifest. Rewrites test_fleet.py fixture to construct synthetic Manifest objects so test outcomes don't depend on the on-disk manifest.toml content. 
12 new tests in tests/test_manifest.py: schema-version mismatch, unknown collector, duplicate collector, unknown phase, negative phase seconds, negative ram, missing catalog fields, json round-trip. Local run: `python tools/run_fleet.py --capacity` correctly logs the loaded manifest and prints capacity. 241 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:25:01 -05:00
Elliott Kolden	667f042707	Tier-3 bring-up: 9 bugs fixed on elliott-ThinkPad (2026-05-01) Root causes and fixes documented in TIER3-BRINGUP.md. Summary: 1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot. 2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring modules selected on SLIRP runs; fix: always filter requires_bridge. 3. cmd/unix/interact creates no session.list entry → session_open_timeout every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl. 4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444); fix: extra_host_port:extra_host_port mapping so guest binds the per-slot LPORT directly. 5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots; fix: requires_bridge=true filters it from SLIRP fleet runs. 6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba boots (~60 s too early); fix: replace TCP probe with serial console _wait_for_serial_login that waits for actual "login:" prompt. 7. Stale QEMU survives orchestrator restart (start_new_session=True) → holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from old pidfile before rmtree. 8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100. 9. msfrpcd 6.x returns bytes for all string values even with raw=False; fix: MSFRpcClient._str() recursive decoder applied to all responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:26:19 -06:00
max	8753340ea3	fleet: fix per-slot run-dir collision so concurrent VMs actually run Root cause of "fleet says max_concurrent=3 but only one episode ships per wave" symptom on elliott-lab: 1. orchestrator/fleet.py::_run_slot set env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot. 2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm (NO slot suffix), then UNCONDITIONALLY overwrote the env's RUN_DIR with that flag's value before exec'ing the launcher. 3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots collided on the same socket dir. 4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's rmtree literally deleted slot 0's pidfile + sockets mid-boot. 5. Net effect: one VM survives per wave on a multi-core host that should be running ~cores-1 in parallel. Throughput collapses to 1/N. Fix: tools/run_real_vm_demo.py + tools/run_tier3_demo.py: --run-dir default cascade — 1) explicit CLI flag 2) RUN_DIR env (set by fleet runner) 3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0) Same change in both runners so Tier-2 + Tier-3 fleet waves parallelize cleanly. orchestrator/fleet.py::_run_slot: Pass --run-dir explicitly to the subprocess so the per-slot path is audit-visible in the fleet log instead of buried in env. Also flip the subprocess interpreter to repo_root/.venv/bin/python when present (was /usr/bin/env python3 — worked by luck because the orchestrator path doesn't import msgpack/httpx, but a Tier-3 fleet wave would have died at import-time on a host without those in system Python). etc/cis490-orchestrator.service: Removed the duplicate [Service] hardening block at the bottom of the file that was silently overriding the AmbientCapabilities grant (NoNewPrivileges=true at the bottom flipped the NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit them). Sources 3 + 4 would have failed silently inside the sandbox. 
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable. 106/106 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:55:56 -05:00
max	a88ac83db0	Close out the deployment-readiness gaps Wraps the gaps surfaced in the "what is not implemented" audit so the fleet really is shippable end-to-end. Verified live on the Pi: - cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the new wg-pki client CA leaf - real episode dir → tar+zstd → PUT → HTTP 201 stored - re-ship same bytes → 200 (idempotent) - re-ship different bytes under same id → 409 (conflict) Changes: orchestrator/episode.py - EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/revert per docs/architecture.md). When set + qmp_socket present, EpisodeRunner issues loadvm <snapshot_name> and emits snapshot_revert / snapshot_revert_failed events on the same monotonic clock as everything else. collectors/qmp.py - savevm() / loadvm() helpers using human-monitor-command, plus a test against the fake QMP server. exploits/workloads.py - chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer-busting payload. Includes a finalize step that sha256-verifies on the guest before exec. - real_binary_workload() now wraps the chunked plan for backwards compat with single-shot callers. exploits/driver.py - Tier-4 dispatch walks the chunked plan in MSFExploitDriver: each chunk is a separate session_shell_write; finalize verifies; exec only runs on sha-ok. New events: real_binary_upload_begin, real_binary_verify, real_binary_aborted. etc/cis490-orchestrator.service - Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE). - Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) + CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors work under hardening. scripts/install-lab-host.sh - Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID defaulting to `hostname -s`. 
- Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and builds cidata.iso with the in-guest agent embedded; symlinks both into /opt/cis490/vm/images/ so launchers find them. scripts/fetch-alpine-baseline.sh - Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2 matching the sha512 in docs/sources.md. tools/plot_envelope.py - Rebuilt to render whatever telemetry the episode dir contains: proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs → guest agent load/mem. Missing sources are silently skipped. tools/index_reader.py - cis490-index CLI: filter receiver's index.jsonl by host / sample / time range, sort, count-by group. Closest thing to a query interface until we stand up Postgres/Timescale. samples/README.md - Rewritten to match the new manifest schema, the kind=real vs mimic split, the per-(host, slot, ep) selection mechanic, and the chunked-upload safety story. Tests: 106 pass (was 102). New cases: - test_qmp.py — savevm + loadvm (HMP wrapper + error path) - test_tier4.py — chunked plan splitting, sha-pinned finalize, end-to-end driver walks all chunks + verify + exec via the fake msfrpc client Closes the "what is not implemented" punch list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:31:55 -05:00
max	1b6c7b2f4a	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts This is the chunk that makes "real data" actually flow on multiple hosts in parallel. End-to-end pipe was up at `613c6fa` / 2579683; now the lab-host side has the diversity + concurrency it needs. Collectors landed: collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP client + row builder + run loop. Tolerates older qemu without query-stats. collectors/guest_agent.py — source 5 (deployable). Reads the virtio-serial host-side socket, parses agent JSON-lines, re-stamps to the host monotonic clock, persists. collectors/pcap.py — source 4 (deployable). tcpdump capture + pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md. In-guest agent: vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs, thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent. tools/build_cidata.py — embeds the agent + an OpenRC service into user-data so first boot of the Alpine cidata image auto-starts it. Launchers: vm/launch_demo.sh / launch_target.sh — second virtio-serial port for the agent socket; SLOT env support so multiple VMs run without socket / port collisions; PORT_BASE on launch_target so multiple target VMs hostfwd different host ports. vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24, no NAT). Idempotent. Fleet: orchestrator/fleet.py — capacity detector (cores / RAM / load headroom) + concurrent-slot runner. Per-slot ENV selects the sample. FleetCapacity dataclass round-trips into meta.json so "this episode ran with 6 concurrent VMs" is auditable post-hoc. tools/run_fleet.py — CLI: --capacity report; --waves N runs N waves of (max_concurrent) episodes each, every slot with a different sample. 
etc/cis490-orchestrator.service — now drives the fleet runner with Restart=always so each invocation runs one wave and respawns, giving a continuous stream. Samples: samples/manifest.toml — six profiles spanning the five major behaviour shapes. Each entry is real OR mimic (sha256 distinguishes). samples/manifest.py — strict TOML loader (rejects dups, unknown categories) + deterministic select(host_id, slot, episode_index) so different hosts on the network walk the catalog in different orders without any coordinator. EpisodeRunner: orchestrator/episode.py — optional qmp_socket + guest_agent_socket fields on EpisodeConfig; when set, additional collector threads run alongside proc_qemu. EpisodeResult now carries rows_qmp + rows_guest counters. Tier-3 setup automation: scripts/install-msfrpcd.sh — installs metasploit-framework where the package manager has it, generates a strong password into /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to 127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch once MSFRPC_PASSWORD is sourced. scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256 from the operator (Rapid7 download is registration-walled), pulls, verifies, converts vmdk → qcow2, lands at vm/images/. Tests: 82 pass (was 51). New suites: tests/test_qmp.py — fake QMP server, capability handshake, blockstats, async-event interleaving, 5-failure backoff tests/test_guest_agent.py — fake virtio socket, JSON-lines read + re-stamp, malformed-line tolerance tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames, bucketize correctness across windows tests/test_fleet.py — capacity math (8-core idle / low-RAM / high-load / Pi5 / 1-core box), manifest selection determinism + diversity What's queued for the next commit (already discussed in convo): - MSFExploitDriver v2: map sample.profile → distinct in-session workload so Tier-3 episodes don't all produce the same yes-loop envelope. Critical for ML to learn varied malware shapes. 
- Real-sample fetch from MalwareBazaar by sha256. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:02:27 -05:00
max	7c9f9582ca	Lab-host shipper + receiver /v1/ping + install scripts Implements the deployment loop end-to-end on the CIS490 side: shipper/ config.py ShipperConfig (host_id, paths, receiver endpoint, mTLS) transport.py httpx-based PUT + ping with mTLS + bearer support queue.py scan data/episodes/, tar+zstd via system zstd, ship, retire to data/shipped/. Idempotent across crashes per the state machine in docs/transport.md. __main__.py CLI: --ping (smoke test), --once (one pass), or daemon receiver/app.py: new POST /v1/ping that requires the same auth as PUT /v1/episodes but writes nothing. Used by `cis490-shipper --ping` during lab-host bring-up to verify the WG/Caddy/mTLS path before shipping any real bytes. etc/ cis490-shipper.service systemd unit for the lab-host shipper cis490-orchestrator.service systemd unit for the lab-host queue (kept disabled by default until queue mode lands) lab-host.toml.example config template scripts/ install-lab-host.sh idempotent installer; verifies prereqs, creates cis490 service user, syncs repo to /opt/cis490, builds venv, drops systemd units and config template install-receiver.sh same, for the receiver role on the central WG node (Pi5 in our setup) tests/test_shipper.py 11 end-to-end tests against a real Uvicorn server hosting the receiver app. Exercises ping, tar+ship, idempotent re-ship, 409 conflict, transient (receiver down), tarball round-trip via system zstd. AGENTS.md guidance for AI agents working on this and sibling repos. Headline: when you hit an issue you can't fully fix in scope, file a Forgejo issue rather than leaving a TODO. 51/51 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:41:32 -05:00
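
Commit 8753340ea3 describes a three-step default cascade for --run-dir: explicit CLI flag, then the RUN_DIR environment variable, then /tmp/cis490-vm-<SLOT>. The following is a minimal sketch of how such a cascade could be written, not the code from the repository. The function names resolve_run_dir and parse_run_dir are hypothetical, and the use of argparse is an assumption.

```python
import argparse
from typing import Mapping, Optional, Sequence


def resolve_run_dir(cli_value: Optional[str], env: Mapping[str, str]) -> str:
    """Hypothetical helper: pick the run directory via the cascade
    described in the commit message."""
    if cli_value:                    # 1) explicit CLI flag wins
        return cli_value
    if env.get("RUN_DIR"):           # 2) RUN_DIR env (set by the fleet runner)
        return env["RUN_DIR"]
    slot = env.get("SLOT", "0")      # 3) per-slot default, SLOT defaults to 0
    return f"/tmp/cis490-vm-{slot}"


def parse_run_dir(argv: Sequence[str], env: Mapping[str, str]) -> str:
    """Hypothetical CLI wrapper. The flag default is None (assumption) so
    that a missing flag can be told apart from an explicit one."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-dir", default=None)
    args, _unknown = parser.parse_known_args(argv)
    return resolve_run_dir(args.run_dir, env)
```

The point of the sketch is that the flag default is None rather than a fixed path, so the environment value set per slot is never overwritten unconditionally. In real use the env argument would typically be os.environ.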

6 commits
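
Item 9 of commit 667f042707 mentions a recursive decoder, MSFRpcClient._str(), applied to all msfrpcd responses because string values come back as bytes. The actual method body is not shown in the log. The following is only a sketch of what a recursive bytes-to-str decoder of that kind might look like. The standalone function name decode_bytes, the UTF-8 encoding, and the errors="replace" policy are all assumptions.

```python
from typing import Any


def decode_bytes(value: Any, encoding: str = "utf-8") -> Any:
    """Hypothetical helper: recursively turn bytes into str inside nested
    dicts, lists, and tuples. Other types pass through unchanged."""
    if isinstance(value, bytes):
        # Assumption: undecodable bytes are replaced rather than raising.
        return value.decode(encoding, errors="replace")
    if isinstance(value, dict):
        # Decode keys as well as values, since both may arrive as bytes.
        return {
            decode_bytes(k, encoding): decode_bytes(v, encoding)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [decode_bytes(v, encoding) for v in value]
    if isinstance(value, tuple):
        return tuple(decode_bytes(v, encoding) for v in value)
    return value
```

Applying one decoder at the response boundary keeps the rest of the client free of bytes-versus-str checks, which seems to be the intent of the fix described in the commit.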