CIS490/orchestrator
max 8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run
Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:

  1. orchestrator/fleet.py::_run_slot set
     env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
  2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
     (NO slot suffix), then UNCONDITIONALLY overwrote the env's
     RUN_DIR with that flag's value before exec'ing the launcher.
  3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
     collided on the same socket dir.
  4. run_real_vm_demo.py also called rmtree(run_dir) on entry, so
     slot 1's rmtree deleted slot 0's pidfile + sockets mid-boot.
  5. Net effect: one VM survives per wave on a multi-core host that
     should be running ~cores-1 in parallel. Throughput collapses
     to 1/N.
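
The collision in steps 1 to 3 can be reproduced in a few lines. This is an illustrative sketch, not the code from tools/run_real_vm_demo.py: `old_launcher_env` is a hypothetical name, and only the default path and the RUN_DIR variable are taken from the description above.

```python
# Hypothetical reconstruction of the pre-fix behaviour (names are illustrative).
OLD_DEFAULT_RUN_DIR = "/tmp/cis490-vm"  # flag default, no slot suffix


def old_launcher_env(fleet_env, run_dir_flag=OLD_DEFAULT_RUN_DIR):
    """Return the env handed to the launcher under the old logic."""
    env = dict(fleet_env)
    # The bug: the flag value (usually its default) always wins,
    # discarding the per-slot RUN_DIR that the fleet runner set.
    env["RUN_DIR"] = run_dir_flag
    return env


# Three slots, each given a distinct RUN_DIR by the fleet runner.
fleet_envs = [{"RUN_DIR": f"/tmp/cis490-vm-fleet-{slot}"} for slot in range(3)]
seen = {old_launcher_env(env)["RUN_DIR"] for env in fleet_envs}
print(seen)  # a single shared directory for all three slots
```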

Fix:

  tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
    --run-dir default cascade —
      1) explicit CLI flag
      2) RUN_DIR env (set by fleet runner)
      3) /tmp/cis490-vm-<SLOT>  (SLOT from env, default 0)
    Same change in both runners so Tier-2 + Tier-3 fleet waves
    parallelize cleanly.
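
The cascade could be sketched as below. This is a minimal illustration under assumptions, not the code in the two runners: `resolve_run_dir` and `parse_run_dir` are hypothetical names, and the environment is passed in explicitly so the logic is easy to test.

```python
import argparse


def resolve_run_dir(cli_value, env):
    """Pick the run dir: CLI flag, then RUN_DIR env, then per-slot default."""
    if cli_value:                  # 1) explicit CLI flag
        return cli_value
    if env.get("RUN_DIR"):         # 2) RUN_DIR set by the fleet runner
        return env["RUN_DIR"]
    slot = env.get("SLOT", "0")    # 3) per-slot fallback, SLOT defaults to 0
    return f"/tmp/cis490-vm-{slot}"


def parse_run_dir(argv, env):
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag omitted" from "flag given", which is
    # what stops the flag's default from overwriting the environment.
    parser.add_argument("--run-dir", default=None)
    args = parser.parse_args(argv)
    return resolve_run_dir(args.run_dir, env)
```

In a real runner the second argument would be `os.environ`. With no flag and only SLOT=2 in the environment, the sketch resolves to /tmp/cis490-vm-2.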

  orchestrator/fleet.py::_run_slot:
    Pass --run-dir explicitly to the subprocess so the per-slot path
    is audit-visible in the fleet log instead of buried in env.
    Also flip the subprocess interpreter to repo_root/.venv/bin/python
    when present (was /usr/bin/env python3 — worked by luck because
    the orchestrator path doesn't import msgpack/httpx, but a Tier-3
    fleet wave would have died at import-time on a host without those
    in system Python).
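
A rough sketch of how the slot command might be assembled follows. `pick_interpreter` and `build_slot_command` are hypothetical helpers, not the contents of `_run_slot`, and falling back to `sys.executable` when no virtualenv exists is an assumption made for this example (the original fallback was /usr/bin/env python3).

```python
import sys
from pathlib import Path


def pick_interpreter(repo_root):
    """Prefer the repo's virtualenv python; fall back to the current one."""
    venv_python = Path(repo_root) / ".venv" / "bin" / "python"
    return str(venv_python) if venv_python.is_file() else sys.executable


def build_slot_command(repo_root, slot, runner="tools/run_real_vm_demo.py"):
    """Return (argv, run_dir) for one fleet slot."""
    run_dir = f"/tmp/cis490-vm-fleet-{slot}"
    argv = [
        pick_interpreter(repo_root),
        str(Path(repo_root) / runner),
        "--run-dir", run_dir,  # explicit, so the path appears in the fleet log
    ]
    return argv, run_dir
```

Logging `argv` before handing it to `subprocess` makes the per-slot directory visible without having to inspect the child's environment.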

  etc/cis490-orchestrator.service:
    Removed the duplicate [Service] hardening block at the bottom of
    the file that was silently overriding the AmbientCapabilities
    grant (NoNewPrivileges=true at the bottom flipped the
    NoNewPrivileges=false at the top, dropping CAP_NET_RAW +
    CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses
    inherit them). Sources 3 + 4 would have failed silently inside
    the sandbox.
    Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.
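
A small lint along these lines could catch this class of override before deploy. It is a sketch under assumptions: `find_overrides` is a hypothetical helper, the sample unit text is invented for illustration and is not the real service file, and the parser handles only plain `Key=Value` lines.

```python
def find_overrides(unit_text):
    """Map 'Section.Key' -> values, for keys assigned conflicting values."""
    section = None
    values = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # a repeated header re-enters the same section
            continue
        if "=" in line and section is not None:
            key, _, value = line.partition("=")
            values.setdefault(f"{section}.{key.strip()}", []).append(value.strip())
    # Keep only keys that were given more than one distinct value.
    return {k: v for k, v in values.items() if len(set(v)) > 1}


# Invented sample, shaped like the problem described above.
SAMPLE_UNIT = """\
[Service]
NoNewPrivileges=false
AmbientCapabilities=CAP_NET_RAW CAP_SYS_ADMIN CAP_PERFMON

[Service]
NoNewPrivileges=true
"""

print(find_overrides(SAMPLE_UNIT))
# {'Service.NoNewPrivileges': ['false', 'true']}
```

List-valued settings that are legitimately repeated with different values would also be reported, so the output is a prompt for review rather than a verdict.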

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:55:56 -05:00
File         Last commit                                                            Date
__init__.py  Add v0 orchestrator + first oracle collector (host /proc)              2026-04-28 23:40:25 -06:00
__main__.py  Add v0 orchestrator + first oracle collector (host /proc)              2026-04-28 23:40:25 -06:00
episode.py   orchestrator: emit snapshot_load before _write_meta to keep t_mono ~0  2026-04-30 00:49:50 -05:00
fleet.py     fleet: fix per-slot run-dir collision so concurrent VMs actually run   2026-04-30 01:55:56 -05:00
README.md    Scaffold project: docs, repo skeleton, transport + deploy design       2026-04-28 23:21:00 -06:00
ulid.py      Add v0 orchestrator + first oracle collector (host /proc)              2026-04-28 23:40:25 -06:00

orchestrator/

The state machine that drives a single episode:

snapshot_load → clean → armed → infecting → infected_running → dormant → reverting
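
One way to model this sequence is an ordered enum with a helper that returns the next phase. This is a sketch, not the code in episode.py: the `Phase` and `next_phase` names are hypothetical, and it assumes the phases are strictly linear with no error or abort transitions.

```python
from enum import Enum


class Phase(str, Enum):
    # Members are declared in the order the episode moves through them.
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


ORDER = list(Phase)  # Enum iteration preserves definition order


def next_phase(current):
    """Return the phase after `current`, or None once reverting is reached."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None
```

For example, `next_phase(Phase.ARMED)` gives `Phase.INFECTING`, and `next_phase(Phase.REVERTING)` gives `None`.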

Responsibilities:

  • Bring up the host-only bridge and verify isolation before the guest starts.
  • Boot the guest from a named snapshot.
  • Spawn the five telemetry collectors (collectors/) with a shared episode id and shared monotonic clock origin.
  • Drive the Metasploit Framework over RPC to fire the configured exploit module.
  • Upload + execute the configured malware sample once a session is open.
  • Emit phase transitions to labels.jsonl at the moment the action is taken.
  • Revert the snapshot at episode end.
  • Write meta.json with the result summary.
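
The labels.jsonl responsibility, together with the shared monotonic clock origin, might look roughly like the sketch below. `LabelWriter` and the `episode_id` and `phase` field names are assumptions made for illustration; only the file name and the `t_mono` term appear elsewhere on this page, and the actual record schema is not shown here.

```python
import io
import json
import time


class LabelWriter:
    """Append one JSON line per phase transition, timed from a shared origin."""

    def __init__(self, stream, episode_id, t0=None, clock=time.monotonic):
        self._stream = stream
        self._episode_id = episode_id
        self._clock = clock
        # t0 is the shared origin; collectors would be handed the same value.
        self._t0 = clock() if t0 is None else t0

    def emit(self, phase):
        record = {
            "episode_id": self._episode_id,
            "phase": phase,
            "t_mono": self._clock() - self._t0,  # seconds since the origin
        }
        self._stream.write(json.dumps(record) + "\n")
        self._stream.flush()  # write at the moment the action is taken
        return record


# Usage with an in-memory stream; a real run would open labels.jsonl instead.
buf = io.StringIO()
writer = LabelWriter(buf, episode_id="example-episode")
writer.emit("snapshot_load")
writer.emit("clean")
```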

Implementation lives in this directory and is imported as orchestrator.*.