CIS490/etc
Max 8643192a71 training/fleet: distributed multi-host trainer with capability gating
Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.

The operator-control surface is etc/training_manifest.toml.example, a
single canonical file declaring (a) per-host capability and per-model
allow/deny policy, and (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).
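
As a rough illustration of how these constraints might gate a job against a host, here is a minimal sketch. The class names, field names on the host side, and the rule that an empty allow-list means "no restriction" are assumptions for illustration, not taken from capability.py or manifest.py. prefer_cuda is treated here as a ranking hint only, so it does not affect eligibility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCapability:
    # Hypothetical shape of a host's self-detected capability plus policy.
    hostname: str
    has_cuda: bool
    vram_gib: float
    ram_gib: float
    allow_jobs: tuple[str, ...] = ()  # assumption: empty means "allow any model"
    deny_jobs: tuple[str, ...] = ()


@dataclass(frozen=True)
class JobConstraints:
    # Constraint names mirror the manifest keys listed above.
    model: str
    require_cuda: bool = False
    prefer_cuda: bool = False  # soft preference; ignored by is_eligible
    min_vram_gib: float = 0.0
    min_ram_gib: float = 0.0
    allowed_hosts: tuple[str, ...] = ()  # assumption: empty means "any host"


def is_eligible(host: HostCapability, job: JobConstraints) -> bool:
    """Return True if the host satisfies every hard constraint of the job."""
    if job.allowed_hosts and host.hostname not in job.allowed_hosts:
        return False
    if host.allow_jobs and job.model not in host.allow_jobs:
        return False
    if job.model in host.deny_jobs:
        return False
    if job.require_cuda and not host.has_cuda:
        return False
    if host.vram_gib < job.min_vram_gib:
        return False
    if host.ram_gib < job.min_ram_gib:
        return False
    return True
```

With a CPU-only host whose allow_jobs is ("gbt", "mlp"), a plain gbt job passes while any job with require_cuda set is filtered out before a claim is attempted.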

Components:

  capability.py — self-detection: hostname, cores, RAM, CUDA presence,
    VRAM, torch version, git commit. Used by workers to filter
    eligible jobs before claiming.

  manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable
    sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
    so manifest reload is idempotent: existing rows keep their status,
    new jobs become claimable, removed jobs stay until cancelled.
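
One way a stable job ID of this kind could be derived is sketched below. The canonical JSON encoding and the sorting of train_hosts are assumptions made so the hash does not depend on key or host order; the real manifest.py may serialize differently.

```python
import hashlib
import json


def job_id(model, mode, hyper, split_recipe, train_hosts, seed) -> str:
    """Hash the identifying fields of a job into a stable hex digest."""
    payload = {
        "model": model,
        "mode": mode,
        "hyper": hyper,
        "split_recipe": split_recipe,
        # Assumption: host order is not significant, so sort before hashing.
        "train_hosts": sorted(train_hosts),
        "seed": seed,
    }
    # sort_keys plus fixed separators give one canonical byte string per job.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the ID depends only on the job's content, reloading an unchanged manifest yields the same IDs, which is what lets existing rows keep their status.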

  queue.py — SQLite job queue (training_jobs.db) with statuses
    pending|claimed|running|completed|failed|cancelled. Atomic
    claim_next via single UPDATE WHERE status='pending'. Heartbeat,
    complete, fail. Stale-claim sweep (stale_after_s=600) returns
    abandoned claims to the queue; max_attempts caps retries, after
    which the job is marked failed.
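
The claim and sweep steps can be sketched against SQLite as follows. The table layout, the claim_token column, the oldest-first ordering, and max_attempts=3 are all assumptions for illustration; only the statuses and the 600-second default come from the description above.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    status       TEXT NOT NULL DEFAULT 'pending',
    claimed_by   TEXT,
    claim_token  TEXT,
    heartbeat_at REAL,
    attempts     INTEGER NOT NULL DEFAULT 0
)
"""


def claim_next(conn: sqlite3.Connection, worker: str, now: float):
    """Claim the oldest pending job with one UPDATE; return its ID or None."""
    token = uuid.uuid4().hex
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE jobs
               SET status = 'claimed', claimed_by = ?, claim_token = ?,
                   heartbeat_at = ?, attempts = attempts + 1
             WHERE status = 'pending'
               AND job_id = (SELECT job_id FROM jobs
                              WHERE status = 'pending'
                              ORDER BY rowid LIMIT 1)
            """,
            (worker, token, now),
        )
    if cur.rowcount == 0:
        return None  # nothing pending, or another worker won the race
    row = conn.execute(
        "SELECT job_id FROM jobs WHERE claim_token = ?", (token,)
    ).fetchone()
    return row[0]


def sweep_stale(conn, now, stale_after_s=600.0, max_attempts=3):
    """Fail stale jobs that are out of attempts; requeue the other stale ones."""
    cutoff = now - stale_after_s
    with conn:
        failed = conn.execute(
            "UPDATE jobs SET status = 'failed' "
            "WHERE status IN ('claimed', 'running') "
            "AND heartbeat_at < ? AND attempts >= ?",
            (cutoff, max_attempts),
        ).rowcount
        # Rows just marked failed no longer match, so only the rest requeue.
        requeued = conn.execute(
            "UPDATE jobs SET status = 'pending', claimed_by = NULL, "
            "claim_token = NULL "
            "WHERE status IN ('claimed', 'running') AND heartbeat_at < ?",
            (cutoff,),
        ).rowcount
    return requeued, failed
```

The per-claim token lets the caller look up which row it won without relying on RETURNING, which older SQLite builds lack.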

  store.py — model artifact store mirroring receiver/store.py.
    Artifact ID is the sha256 of the uploaded tarball; bit-identical
    re-runs deduplicate.
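
A content-addressed artifact ID of this kind could be computed as in the sketch below. The function names and chunk size are hypothetical; the directory layout follows the path reported in the smoke test further down.

```python
import hashlib
from pathlib import Path


def artifact_id(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a tarball through sha256 without loading it all into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_path(root, model: str, mode: str, digest: str) -> Path:
    """Build <root>/<model>_<mode>/<sha256>/bundle.tar.zst for an artifact."""
    return Path(root) / f"{model}_{mode}" / digest / "bundle.tar.zst"
```

Since the path is a pure function of the bytes, uploading an identical tarball a second time resolves to the same location and can be skipped.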

  receiver.py — Starlette app exposing 11 endpoints:
    POST /v1/job/claim          (worker)
    POST /v1/job/{id}/heartbeat (worker)
    POST /v1/job/{id}/complete  (worker)
    POST /v1/job/{id}/fail      (worker)
    PUT  /v1/model/{id}         (worker — uploads tarball)
    GET  /v1/jobs               (anyone)
    GET  /v1/workers            (anyone)
    POST /v1/job/{id}/cancel    (operator: X-Operator-Token)
    POST /v1/job/{id}/requeue   (operator)
    POST /v1/manifest/reload    (operator)
    GET  /v1/health             (anyone)
    Runs as cis490-trainer-receiver.service on the Pi alongside the
    existing receiver, on a separate port.

  client.py — stdlib HTTP client (urllib only, no new deps).

  worker.py — long-running daemon. Loop: detect capability → claim →
    spawn training/trainer/run.py subprocess → heartbeat every 30s →
    tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
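
The heartbeat-while-training part of the worker loop might look roughly like this. The function name and the send_heartbeat callback are hypothetical, and the sketch only cleans up the child on an exception in the parent; it does not install the SIGTERM handler the real worker would need.

```python
import subprocess


def run_with_heartbeat(cmd, send_heartbeat, interval_s: float = 30.0) -> int:
    """Run cmd as a subprocess, calling send_heartbeat() every interval_s.

    Returns the child's exit code once it finishes.
    """
    proc = subprocess.Popen(cmd)
    try:
        while True:
            send_heartbeat()
            try:
                # Block for up to one interval; returns as soon as the child exits.
                return proc.wait(timeout=interval_s)
            except subprocess.TimeoutExpired:
                continue  # child still training: heartbeat again
    finally:
        # If we leave early (for example on KeyboardInterrupt), stop the child.
        if proc.poll() is None:
            proc.terminate()
            proc.wait()
```

Using wait(timeout=...) as the sleep means a job that finishes early is reported immediately rather than after the next 30-second tick.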

Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
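
On the receiver side, a token check of this kind could be sketched as below. The function name is hypothetical and the receiver's actual comparison is not shown in this commit; the header name comes from the endpoint list above. A plain dict is case-sensitive, so a real handler would read from the framework's case-insensitive header mapping instead.

```python
import hmac


def operator_authorized(headers, configured_token: str) -> bool:
    """Check the X-Operator-Token header against the configured secret."""
    if not configured_token:
        # Assumption: with no token configured, operator actions are refused.
        return False
    supplied = headers.get("X-Operator-Token", "")
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(
        supplied.encode("utf-8"), configured_token.encode("utf-8")
    )
```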

Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.

End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.

21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.

Open limitations surfaced inline:
  - Hyper-key drift between manifest and run.py fails at training
    time, not at manifest reload (worth tightening to argparse
    introspection later).
  - mTLS not yet wired through Caddy for the trainer-receiver port —
    listens loopback-only until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:20:20 -05:00
File | Last commit | Date
caddy-root.crt | bootstrap: auto-issue mTLS leaves to enrolled lab hosts (closes #9, refs #3) | 2026-04-30 01:30:29 -05:00
cis490-autoupdate.service | lab-host: cis490-autoupdate.timer for self-healing on push | 2026-05-01 16:59:31 -05:00
cis490-autoupdate.timer | lab-host: cis490-autoupdate.timer for self-healing on push | 2026-05-01 16:59:31 -05:00
cis490-bootstrap.service | Tier-4 sample source: theZoo (no auth, no operator action) | 2026-05-01 01:17:50 -05:00
cis490-cert-fetch.service | lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry | 2026-05-02 13:30:16 -05:00
cis490-cert-fetch.timer | lab-host: cis490-cert-fetch.timer for automatic mTLS bootstrap retry | 2026-05-02 13:30:16 -05:00
cis490-dashboard.service | training/dashboard: live deck at dashboard.wg, fed by receiver | 2026-05-07 21:26:07 -05:00
cis490-doctor-check.service | fleet-health: proactive alerts on the Pi + per-host doctor reports | 2026-05-02 13:48:31 -05:00
cis490-doctor-check.timer | fleet-health: proactive alerts on the Pi + per-host doctor reports | 2026-05-02 13:48:31 -05:00
cis490-fleet-health.service | fleet-health: proactive alerts on the Pi + per-host doctor reports | 2026-05-02 13:48:31 -05:00
cis490-fleet-health.timer | fleet-health: proactive alerts on the Pi + per-host doctor reports | 2026-05-02 13:48:31 -05:00
cis490-orchestrator.service | PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml | 2026-05-04 01:25:01 -05:00
cis490-receiver.service | Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency | 2026-04-28 23:34:04 -06:00
cis490-shipper.service | shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors | 2026-05-01 12:02:59 -05:00
cis490-trainer-receiver.service | training/fleet: distributed multi-host trainer with capability gating | 2026-05-08 01:20:20 -05:00
cis490-trainer-worker.service | training/fleet: distributed multi-host trainer with capability gating | 2026-05-08 01:20:20 -05:00
lab-host.toml.example | etc/lab-host.toml.example: pin Caddy root, not wg-pki client CA (closes #14) | 2026-04-30 17:26:36 -05:00
README.md | training/dashboard: live deck at dashboard.wg, fed by receiver | 2026-05-07 21:26:07 -05:00
receiver.toml.example | docs+doctor: surface VERSION-stamp + fallback wiring | 2026-05-01 11:54:36 -05:00
training_manifest.toml.example | training/fleet: distributed multi-host trainer with capability gating | 2026-05-08 01:20:20 -05:00

etc/

Templates for system-level files installed by scripts/install-*.sh:

  • cis490-receiver.service — systemd unit for the receiver
  • cis490-dashboard.service — systemd unit for the dashboard.wg live display
  • receiver.toml.example — config template for the receiver
  • cis490-orchestrator.service (TODO) — systemd unit for the orchestrator
  • cis490-shipper.service (TODO) — systemd unit for the shipper
  • lab-host.toml.example (TODO) — config template for the lab host

See docs/deploy.md for the install flow.