History

max 8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run

Root cause of "fleet says max_concurrent=3 but only one episode ships per wave" symptom on elliott-lab:

1. orchestrator/fleet.py::_run_slot set env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm (NO slot suffix), then UNCONDITIONALLY overwrote the env's RUN_DIR with that flag's value before exec'ing the launcher.
3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots collided on the same socket dir.
4. run_real_vm_demo.py also rmtree(run_dir) on entry; slot 1's rmtree literally deleted slot 0's pidfile + sockets mid-boot.
5. Net effect: one VM survives per wave on a multi-core host that should be running ~cores-1 in parallel. Throughput collapses to 1/N.

Fix:

tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
--run-dir default cascade:
1) explicit CLI flag
2) RUN_DIR env (set by fleet runner)
3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0)
Same change in both runners so Tier-2 + Tier-3 fleet waves parallelize cleanly.

orchestrator/fleet.py::_run_slot:
Pass --run-dir explicitly to the subprocess so the per-slot path is audit-visible in the fleet log instead of buried in env. Also flip the subprocess interpreter to repo_root/.venv/bin/python when present (was /usr/bin/env python3; worked by luck because the orchestrator path doesn't import msgpack/httpx, but a Tier-3 fleet wave would have died at import-time on a host without those in system Python).

etc/cis490-orchestrator.service:
Removed the duplicate [Service] hardening block at the bottom of the file that was silently overriding the AmbientCapabilities grant (NoNewPrivileges=true at the bottom flipped the NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit them). Sources 3 + 4 would have failed silently inside the sandbox. Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 01:55:56 -05:00
__init__.py	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
__main__.py	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
episode.py	orchestrator: emit snapshot_load before _write_meta to keep t_mono ~0	2026-04-30 00:49:50 -05:00
fleet.py	fleet: fix per-slot run-dir collision so concurrent VMs actually run	2026-04-30 01:55:56 -05:00
README.md	Scaffold project: docs, repo skeleton, transport + deploy design	2026-04-28 23:21:00 -06:00
ulid.py	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00

README.md

orchestrator/

The state machine that drives a single episode:

snapshot_load → clean → armed → infecting → infected_running → dormant → reverting
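
A minimal sketch of how this phase sequence could be modeled, assuming the transitions are strictly linear as drawn above. The names `Phase`, `next_phase`, and `advance` are hypothetical and are not taken from episode.py; the real orchestrator may also allow early jumps to reverting on failure, which this sketch does not cover.

```python
from enum import Enum
from typing import Optional


class Phase(str, Enum):
    """Episode phases, in the order shown in the diagram above."""
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


# Enum iteration preserves definition order, which gives the linear sequence.
ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate successor of `current`.

    Assumption: no phase may be skipped. Any other transition raises.
    """
    if next_phase(current) is not target:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Under these assumptions, `advance(Phase.CLEAN, Phase.ARMED)` succeeds, while `advance(Phase.CLEAN, Phase.DORMANT)` raises `ValueError` because it skips phases.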

Responsibilities:

Bring up the host-only bridge and verify isolation before the guest starts.
Boot the guest from a named snapshot.
Spawn the five telemetry collectors (collectors/) with a shared episode id and shared monotonic clock origin.
Drive the Metasploit Framework over RPC to fire the configured exploit module.
Upload and execute the configured malware sample once a session is open.
Emit phase transitions to labels.jsonl at the moment the action is taken.
Revert the snapshot at episode end.
Write meta.json with the result summary.

Implementation lives in this directory and is imported as orchestrator.*.
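
As a rough illustration of the labeling responsibility listed above, the sketch below appends one JSON line per phase transition, timestamped relative to a shared monotonic clock origin. The `LabelWriter` class and the record fields are assumptions made for illustration. Only the `labels.jsonl` file name and the `t_mono` term appear in this repository's text, and the actual record schema may differ.

```python
import json
import time
from pathlib import Path


class LabelWriter:
    """Hypothetical helper: appends phase-transition records to a JSONL file."""

    def __init__(self, path, episode_id, t0=None):
        self.path = Path(path)
        self.episode_id = episode_id
        # Shared clock origin. Pass the same t0 to every collector so all
        # timestamps are comparable; defaults to "now" if none is given.
        self.t0 = time.monotonic() if t0 is None else t0

    def emit(self, phase):
        """Write one record at the moment the transition is taken."""
        record = {
            "episode_id": self.episode_id,  # assumed field name
            "phase": phase,                 # assumed field name
            "t_mono": time.monotonic() - self.t0,  # seconds since origin
        }
        # Append mode keeps earlier transitions; one JSON object per line.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record
```

Calling `emit("snapshot_load")` immediately after constructing the writer keeps the first timestamp close to zero, and each later call appends a record with a larger or equal `t_mono`.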