CIS490/etc
max 8753340ea3 fleet: fix per-slot run-dir collision so concurrent VMs actually run
Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:

  1. orchestrator/fleet.py::_run_slot set
     env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
  2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
     (NO slot suffix), then UNCONDITIONALLY overwrote the env's
     RUN_DIR with that flag's value before exec'ing the launcher.
  3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
     collided on the same socket dir.
  4. run_real_vm_demo.py also rmtree(run_dir) on entry — slot 1's
     rmtree literally deleted slot 0's pidfile + sockets mid-boot.
  5. Net effect: one VM survives per wave on a multi-core host that
     should be running ~cores-1 in parallel. Throughput collapses
     to 1/N.

Fix:

  tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
    --run-dir default cascade —
      1) explicit CLI flag
      2) RUN_DIR env (set by fleet runner)
      3) /tmp/cis490-vm-<SLOT>  (SLOT from env, default 0)
    Same change in both runners so Tier-2 + Tier-3 fleet waves
    parallelize cleanly.

  orchestrator/fleet.py::_run_slot:
    Pass --run-dir explicitly to the subprocess so the per-slot path
    is audit-visible in the fleet log instead of buried in env.
    Also flip the subprocess interpreter to repo_root/.venv/bin/python
    when present (was /usr/bin/env python3 — worked by luck because
    the orchestrator path doesn't import msgpack/httpx, but a Tier-3
    fleet wave would have died at import-time on a host without those
    in system Python).

  etc/cis490-orchestrator.service:
    Removed the duplicate [Service] hardening block at the bottom of
    the file that was silently overriding the AmbientCapabilities
    grant (NoNewPrivileges=true at the bottom flipped the
    NoNewPrivileges=false at the top, dropping CAP_NET_RAW + CAP_SYS_
    ADMIN + CAP_PERFMON before per-episode subprocesses inherit
    them). Sources 3 + 4 would have failed silently inside the
    sandbox.
    Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.

106/106 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:55:56 -05:00
..
caddy-root.crt bootstrap: auto-issue mTLS leaves to enrolled lab hosts (closes #9, refs #3) 2026-04-30 01:30:29 -05:00
cis490-bootstrap.service bootstrap: auto-issue mTLS leaves to enrolled lab hosts (closes #9, refs #3) 2026-04-30 01:30:29 -05:00
cis490-orchestrator.service fleet: fix per-slot run-dir collision so concurrent VMs actually run 2026-04-30 01:55:56 -05:00
cis490-receiver.service Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency 2026-04-28 23:34:04 -06:00
cis490-shipper.service Lab-host shipper + receiver /v1/ping + install scripts 2026-04-29 23:41:32 -05:00
lab-host.toml.example Lab-host shipper + receiver /v1/ping + install scripts 2026-04-29 23:41:32 -05:00
README.md Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency 2026-04-28 23:34:04 -06:00
receiver.toml.example receiver: default to 127.0.0.1:8444 (avoid wg-enroll-listener on 8443) 2026-04-29 23:45:23 -05:00

etc/

Templates for system-level files installed by scripts/install-*.sh:

  • cis490-receiver.service — systemd unit for the receiver
  • receiver.toml.example — config template for the receiver
  • cis490-orchestrator.service (TODO) — systemd unit for the orchestrator
  • cis490-shipper.service (TODO) — systemd unit for the shipper
  • lab-host.toml.example (TODO) — config template for the lab host

See docs/deploy.md for the install flow.