Root cause of "fleet says max_concurrent=3 but only one episode ships
per wave" symptom on elliott-lab:
1. orchestrator/fleet.py::_run_slot set
env["RUN_DIR"]=/tmp/cis490-vm-fleet-{slot} per slot.
2. tools/run_real_vm_demo.py defaulted --run-dir to /tmp/cis490-vm
(NO slot suffix), then UNCONDITIONALLY overwrote the env's
RUN_DIR with that flag's value before exec'ing the launcher.
3. So every slot's launcher saw RUN_DIR=/tmp/cis490-vm. All slots
collided on the same socket dir.
4. run_real_vm_demo.py also called rmtree(run_dir) on entry, so slot 1's
rmtree deleted slot 0's pidfile + sockets mid-boot.
5. Net effect: one VM survives per wave on a multi-core host that
should be running ~cores-1 in parallel. Throughput collapses
to 1/N.
Fix:
tools/run_real_vm_demo.py + tools/run_tier3_demo.py:
--run-dir default cascade —
1) explicit CLI flag
2) RUN_DIR env (set by fleet runner)
3) /tmp/cis490-vm-<SLOT> (SLOT from env, default 0)
Same change in both runners so Tier-2 + Tier-3 fleet waves
parallelize cleanly.
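A minimal sketch of the cascade above, assuming argparse is used for the flag. The helper name resolve_run_dir and the injectable environ parameter are hypothetical, added for illustration and testability; the real runners may be structured differently.

```python
import argparse
import os


def resolve_run_dir(argv=None, environ=None):
    """Pick the run dir: CLI flag, then RUN_DIR env, then a per-slot default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag omitted" apart from an explicit value.
    parser.add_argument("--run-dir", default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.run_dir:                     # 1) explicit CLI flag
        return args.run_dir
    if environ.get("RUN_DIR"):           # 2) RUN_DIR env (set by fleet runner)
        return environ["RUN_DIR"]
    slot = environ.get("SLOT", "0")      # 3) per-slot fallback, SLOT defaults to 0
    return f"/tmp/cis490-vm-{slot}"
```

Because the flag defaults to None rather than to a fixed path, an omitted flag can no longer clobber the RUN_DIR that the fleet runner exported.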
orchestrator/fleet.py::_run_slot:
Pass --run-dir explicitly to the subprocess so the per-slot path
is audit-visible in the fleet log instead of buried in env.
Also flip the subprocess interpreter to repo_root/.venv/bin/python
when present (was /usr/bin/env python3 — worked by luck because
the orchestrator path doesn't import msgpack/httpx, but a Tier-3
fleet wave would have died at import-time on a host without those
in system Python).
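A rough sketch of how the per-slot command could be assembled. The function names pick_interpreter and build_slot_command are hypothetical, and falling back to sys.executable when no .venv exists is an assumption of this sketch, not necessarily what _run_slot does.

```python
import sys
from pathlib import Path


def pick_interpreter(repo_root):
    """Prefer the repo's virtualenv Python when it exists."""
    venv_python = Path(repo_root) / ".venv" / "bin" / "python"
    # Assumption: fall back to the interpreter running this code.
    return str(venv_python) if venv_python.is_file() else sys.executable


def build_slot_command(repo_root, slot, script="tools/run_real_vm_demo.py"):
    """Build argv for one slot with the per-slot run dir passed as a flag."""
    run_dir = f"/tmp/cis490-vm-fleet-{slot}"
    return [
        pick_interpreter(repo_root),
        str(Path(repo_root) / script),
        "--run-dir", run_dir,  # visible in any log of the command line
    ]
```

Logging the returned list before spawning the subprocess is what would make the per-slot path audit-visible.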
etc/cis490-orchestrator.service:
Removed the duplicate [Service] hardening block at the bottom of
the file that was silently overriding the AmbientCapabilities
grant (NoNewPrivileges=true at the bottom flipped the
NoNewPrivileges=false at the top, dropping CAP_NET_RAW +
CAP_SYS_ADMIN + CAP_PERFMON before per-episode subprocesses inherit
them). Sources 3 + 4 would have failed silently inside the
sandbox.
Added /tmp to ReadWritePaths so per-slot RUN_DIRs are writable.
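One way to catch this class of override early is a small lint over the unit file. The sketch below is illustrative only: find_repeated_keys is a hypothetical helper, it ignores backslash line continuations, and it reports every repeated key, including list-type settings where repetition is legitimate because values accumulate.

```python
from collections import defaultdict


def find_repeated_keys(unit_text):
    """Return {(section, key): [values]} for keys assigned more than once."""
    values = defaultdict(list)
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            # A repeated [Service] header continues the same section.
            section = line[1:-1]
            continue
        if "=" in line and section is not None:
            key, _, value = line.partition("=")
            values[(section, key.strip())].append(value.strip())
    return {k: v for k, v in values.items() if len(v) > 1}
```

For a single-valued setting such as NoNewPrivileges, the last value in the returned list is the one that takes effect, so a hit like ["false", "true"] deserves a manual look.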
106/106 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
etc/
Templates for system-level files installed by scripts/install-*.sh:
- cis490-receiver.service: systemd unit for the receiver
- receiver.toml.example: config template for the receiver
- cis490-orchestrator.service (TODO): systemd unit for the orchestrator
- cis490-shipper.service (TODO): systemd unit for the shipper
- lab-host.toml.example (TODO): config template for the lab host
See docs/deploy.md for the install flow.