# training/fleet/ — distributed training across multiple hosts Symmetric to the *collection* fleet (`orchestrator/fleet.py`), but for *training* the models. The collection fleet is embarrassingly parallel (every lab host runs the same manifest and produces independent data). The training fleet is the opposite: each `(model, mode, hyper)` job is trained at most once, so the receiver coordinates which worker gets which job. ## Roles | Component | Where it runs | Responsibility | |---|---|---| | `cis490-trainer-receiver.service` | Pi (`10.100.0.1`) | Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest | | `cis490-trainer-worker.service` | every training host | Self-detect capability → claim eligible job → run trainer → ship artifact → repeat | | `etc/training_manifest.toml` | Pi `/etc/cis490/` | Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints | | `cis490-jobs` (`tools/cis490_jobs.py`) | anywhere | Operator CLI: status, list, show, cancel, requeue, reload | ## How the operator controls it **Edit the manifest** (`/etc/cis490/training_manifest.toml`): - Add or remove `[[jobs]]` entries - Change priorities, hyperparameters, capability constraints - Add a new host under `[hosts.]` with allow_jobs / deny_jobs / priority **Reload**: ```sh cis490-jobs reload # or: systemctl reload cis490-trainer-receiver.service # or: sudo kill -HUP $(pgrep -f training.fleet.receiver) ``` The reload is idempotent. Existing rows keep their status; new jobs become claimable; jobs the operator removes from the manifest **stay** in the queue (use `cis490-jobs cancel ` to mark them `cancelled`). **Status**: ```sh cis490-jobs status cis490-jobs list --status running cis490-jobs show transformer-oracle cis490-jobs workers ``` **Override a stuck job**: ```sh cis490-jobs requeue # force back to pending from any state cis490-jobs cancel ``` Note: `requeue` requires `$CIS490_OPERATOR_TOKEN` to match the receiver's configured operator token. ## Adding a new training host ### Linux (Pi, GPU box, anything that can run torch) ```sh # On the host you want to enroll, as root: git clone http://maxgit.wg/spectral/CIS490 /opt/cis490 cd /opt/cis490 python3 -m venv .venv && .venv/bin/pip install -e '.[training]' sudo /opt/cis490/scripts/install-training-worker.sh ``` The script: 1. Verifies the WG mesh + receiver reachability 2. Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM) 3. Drops `/etc/cis490/trainer-worker.env` with the receiver URL 4. Installs and starts `cis490-trainer-worker.service` 5. Tails the journal so you see the worker claim its first job ### Windows (e.g., the operator's desktop with the GPU) ```powershell # As Administrator in PowerShell: git clone http://maxgit.wg/spectral/CIS490 C:\cis490 cd C:\cis490 py -3.11 -m venv .venv .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121 .\.venv\Scripts\pip install -e . powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1 ``` Registers a Scheduled Task that runs the worker at startup + restarts it if it stops. Logs to `C:\cis490\logs\trainer-worker.log`. ### After enrollment The new host appears in `cis490-jobs workers` within ~15 s. The receiver sees its capability and starts handing it eligible jobs. **You did not need to coordinate with anyone** — the operator-defined manifest already described what jobs are out there; the new host just claimed the ones its CUDA capacity unblocked. ## Capability gating Each job declares constraints; each worker self-reports capability. The receiver computes eligibility and only hands a job to a worker that can run it. ``` require_cuda prefer_cuda min_vram_gib Pi desktop GPU gbt no - 0 ✓ ✓ mlp no - 0 ✓ ✓ cnn no yes 1 ✓ (after ✓ 5min grace) gru / lstm yes - 2 - ✓ transformer yes - 4 - ✓ transformer_ssl yes - 4 - ✓ ``` `prefer_cuda` jobs wait `prefer_cuda_grace_s` (default 300 s) before a CPU worker is allowed to claim them — so a GPU worker has a chance even if a CPU worker is idle. ## Per-host policy In the manifest: ```toml [hosts.office-print] allow_jobs = ["gbt", "mlp"] # whitelist; absent or empty = all allowed deny_jobs = [] priority = 0 ``` A worker matching `office-print` will only claim jobs whose `model` is in `allow_jobs`. Useful for "I want the Pi to never train the Transformer even if I happened to put pytorch-cuda on it." ## Architecture notes ### Atomic claim `JobQueue.claim_next` runs the eligibility filter in Python, then the state transition is a single `UPDATE … WHERE status='pending'` — exactly one of N racing workers wins. ### Stale-claim recovery Workers heartbeat every 30 s. The receiver periodically sweeps for claimed/running rows whose last heartbeat is older than 600 s and returns them to pending (or marks failed if attempts ≥ max_attempts). A worker crash never permanently strands a job. ### Artifact deduplication The artifact_id is the sha256 of the uploaded tarball. Re-running a job with bit-identical output (same code, same data, same hyper, same seed) → already-present, no re-upload. ### Schema continuity with the supervised pipeline The receiver's queue rows reference job_ids that hash the SAME spec fields the trainer uses, so re-syncing a manifest after a code change that doesn't affect the trained-model identity is a no-op. Changing `hyper.lr` produces a NEW job_id — the queue treats it as a new job and the old artifact stays around for comparison. ## Endpoints (reference) ``` POST /v1/job/claim (worker) POST /v1/job/{id}/heartbeat (worker) POST /v1/job/{id}/complete (worker) POST /v1/job/{id}/fail (worker) PUT /v1/model/{id} (worker — uploads tarball) GET /v1/jobs[?status=...] (anyone) GET /v1/workers (anyone) POST /v1/job/{id}/cancel (operator: X-Operator-Token) POST /v1/job/{id}/requeue (operator) POST /v1/manifest/reload (operator) GET /v1/health (anyone) ``` ## Files - `capability.py` — self-detection - `manifest.py` — TOML loader + JobSpec / HostSpec - `queue.py` — SQLite queue with atomic claim - `store.py` — model-artifact store on the Pi - `receiver.py` — Starlette app exposing the endpoints above - `client.py` — stdlib HTTP client (no extra deps) - `worker.py` — long-running worker daemon - `__main__.py` not needed; each module has its own `main()`