CIS490/etc/cis490-orchestrator.service
Max Gorog 207a902c3e PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.

Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
  * --manifest, --modules-dir       (paths now derived)
  * --ram-per-vm-mib                (in manifest.experiment)
  * --max-concurrent                (manifest.experiment.fleet.max_concurrent_ceiling)
  * --max-tier3-slots               (manifest.experiment.fleet.max_tier3_slots)
  * --force-tier2                   (not a §14 sanctioned override knob —
                                     ship empty catalog to disable Tier-3)
  * --require-real-samples          (sample-side concern; out of fleet scope)
  * tools/run_*_demo.py --manifest  (samples path now from canonical)

New surface:
  * manifest.toml                   — the single source of truth
  * orchestrator/manifest.py        — load_canonical() + Manifest dataclass
                                      with strict validation, raises
                                      ManifestError on any failure
  * EpisodeConfig.experiment_meta   — populated by run_*_demo.py from
                                      the canonical manifest; stamped
                                      into every episode's meta.json
                                      under "experiment" key for
                                      provenance
  * cis490-orchestrator.service     — RestartPreventExitStatus=78 so
                                      manifest-load failures stay
                                      stuck-and-loud (§9, §4.7)
  * install-lab-host.sh             — validates manifest.toml at
                                      install time; missing or invalid
                                      = die with clear message

Catalog admission semantics: only modules whose names appear in
manifest.catalog are loaded into the runtime catalog (§4.3 in
miniature; this will tighten further in step 4, when verified_against /
last_verified actually gate admission). A missing TOML file for an
admitted name is a sysadmin error → exit 78.
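
The admission rule above can be sketched as follows. This is an
illustration under stated assumptions, not the orchestrator's code: the
<modules_dir>/<name>.toml layout, the CatalogError class, and the
admit_modules() and run() helpers are hypothetical. Exit status 78 is
the sysexits.h value for a configuration error.

```python
import sys
from pathlib import Path

EX_CONFIG = 78  # sysexits.h configuration error; the unit treats it as fatal


class CatalogError(Exception):
    """An admitted module has no definition file on disk."""


def admit_modules(admitted_names: list[str], modules_dir: Path) -> dict[str, Path]:
    """Return name -> path for admitted modules only.

    Files in modules_dir that the manifest does not name are ignored.
    Assumes one <name>.toml file per module (hypothetical layout).
    """
    admitted: dict[str, Path] = {}
    missing: list[str] = []
    for name in admitted_names:
        path = modules_dir / f"{name}.toml"
        if path.is_file():
            admitted[name] = path
        else:
            missing.append(name)
    if missing:
        raise CatalogError(f"no module file for admitted name(s): {', '.join(missing)}")
    return admitted


def run(admitted_names: list[str], modules_dir: Path) -> int:
    """Map an admission failure to exit status 78 instead of a traceback."""
    try:
        catalog = admit_modules(admitted_names, modules_dir)
    except CatalogError as exc:
        print(f"fatal: {exc}", file=sys.stderr)
        return EX_CONFIG
    print(f"admitted {len(catalog)} module(s)")
    return 0
```

A caller would finish with sys.exit(run(names, modules_dir)), so that
systemd sees status 78 and RestartPreventExitStatus=78 stops the
respawn loop.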

Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
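
For illustration, the provenance stamp that cfg.experiment feeds could
be written roughly like this. Only experiment_meta, samples, and the
"experiment" key in meta.json come from this commit message; the
episode_id field, the write_episode_meta() helper, and the shape of the
metadata dict are assumptions.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EpisodeConfig:
    episode_id: str  # hypothetical field
    samples: Path  # sample manifest path (formerly cfg.manifest)
    experiment_meta: dict = field(default_factory=dict)


def write_episode_meta(cfg: EpisodeConfig, episode_dir: Path) -> Path:
    """Write meta.json with the experiment metadata under "experiment"."""
    meta = {
        "episode_id": cfg.episode_id,
        "samples": str(cfg.samples),
        "experiment": cfg.experiment_meta,
    }
    episode_dir.mkdir(parents=True, exist_ok=True)
    path = episode_dir / "meta.json"
    # sort_keys keeps the output stable across runs, which helps diffing.
    path.write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

Because the same dict is stamped into every episode, two episodes can
be compared for experiment shape by reading their meta.json files
alone.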

12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.

Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:25:01 -05:00


[Unit]
Description=CIS490 lab-host episode orchestrator (fleet mode)
Documentation=https://maxgit.wg/spectral/CIS490
# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
# by cis490-msfrpcd.service when installed.
After=network-online.target wg-quick@wg0.service
Wants=network-online.target

[Service]
Type=simple
User=cis490
Group=cis490
WorkingDirectory=/opt/cis490
# /etc/cis490/lab-host.env is written by scripts/install-lab-host.sh;
# carries FLEET_HOST_ID, BRIDGE, and any operator-supplied overrides.
EnvironmentFile=/etc/cis490/lab-host.env
# msfrpc credentials (written by install-msfrpcd.sh). Optional (-) so the
# unit still starts on Tier-2-only hosts where msfrpcd isn't installed.
EnvironmentFile=-/etc/cis490/msfrpc.env
# Fleet mode: detect host capacity, run that many concurrent episodes
# per wave with samples + experiment shape drawn from the canonical
# manifest at /opt/cis490/manifest.toml. Each invocation runs one wave
# and exits; systemd respawns per Restart= below.
#
# Per PIPELINE.md §4.1 there are no --manifest, --max-tier3-slots,
# --ram-per-vm-mib, --max-concurrent, --force-tier2, or
# --require-real-samples flags. Experiment-shape parameters live in
# manifest.toml. Per-host overrides are forbidden.
#
# Exit 78 (sysadmin error) when the canonical manifest fails to load
# or when the host can't run the experiment. RestartPreventExitStatus=78
# keeps the unit stuck-and-loud rather than respawning into the same
# broken state — operator notices and fixes.
ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
    --data-root /var/lib/cis490/data \
    --waves 1
Restart=always
RestartSec=15
RestartPreventExitStatus=78
# Hardening — explicitly grant CAP_NET_RAW for tcpdump (source 4) and
# CAP_SYS_ADMIN / CAP_PERFMON for perf (source 3) when the operator
# enables those. Both are inherited by per-episode subprocesses.
# NoNewPrivileges=false is required because AmbientCapabilities only
# survives across exec() if NNP is off.
NoNewPrivileges=false
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# /tmp is needed for per-slot RUN_DIR (cis490-vm-fleet-<slot>) — the
# fleet runner stages QEMU's sockets + pidfile there.
ReadWritePaths=/var/lib/cis490 /tmp
SupplementaryGroups=kvm
AmbientCapabilities=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON
CapabilityBoundingSet=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON CAP_DAC_READ_SEARCH

[Install]
WantedBy=multi-user.target