Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.
Operator-control surface is etc/training_manifest.toml.example —
single canonical file declaring (a) per-host capability + per-model
allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).
Components:
capability.py — self-detection: hostname, cores, RAM, CUDA presence,
VRAM, torch version, git commit. Used by workers to filter
eligible jobs before claiming.
manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable
sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
so manifest reload is idempotent: existing rows keep their status,
new jobs become claimable, removed jobs stay until cancelled.
queue.py — SQLite job queue (training_jobs.db) with statuses
pending|claimed|running|completed|failed|cancelled. Atomic
claim_next via single UPDATE WHERE status='pending'. Heartbeat,
complete, fail. Stale-claim sweep (stale_after_s=600s) with
max_attempts cutoff to failed.
store.py — model artifact store mirroring receiver/store.py.
Artifact ID is the sha256 of the uploaded tarball; bit-identical
re-runs deduplicate.
receiver.py — Starlette app exposing 11 endpoints:
POST /v1/job/claim (worker)
POST /v1/job/{id}/heartbeat (worker)
POST /v1/job/{id}/complete (worker)
POST /v1/job/{id}/fail (worker)
PUT /v1/model/{id} (worker — uploads tarball)
GET /v1/jobs (anyone)
GET /v1/workers (anyone)
POST /v1/job/{id}/cancel (operator: X-Operator-Token)
POST /v1/job/{id}/requeue (operator)
POST /v1/manifest/reload (operator)
GET /v1/health (anyone)
Runs as cis490-trainer-receiver.service on the Pi alongside the
existing receiver, on a separate port.
client.py — stdlib HTTP client (urllib only, no new deps).
worker.py — long-running daemon. Loop: detect capability → claim →
spawn training/trainer/run.py subprocess → heartbeat every 30s →
tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.
End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.
21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.
Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training
time, not at manifest reload (worth tightening to argparse
introspection later).
- mTLS not yet wired through Caddy for the trainer-receiver port —
listens loopback-only until that lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
182 lines
6.8 KiB
Markdown
# training/fleet/ — distributed training across multiple hosts

Symmetric to the *collection* fleet (`orchestrator/fleet.py`), but for
*training* the models. The collection fleet is embarrassingly parallel
(every lab host runs the same manifest and produces independent data).
The training fleet is the opposite: each `(model, mode, hyper)` job is
trained at most once, so the receiver coordinates which worker gets
which job.

## Roles

| Component | Where it runs | Responsibility |
|---|---|---|
| `cis490-trainer-receiver.service` | Pi (`10.100.0.1`) | Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest |
| `cis490-trainer-worker.service` | every training host | Self-detect capability → claim eligible job → run trainer → ship artifact → repeat |
| `etc/training_manifest.toml` | Pi `/etc/cis490/` | Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints |
| `cis490-jobs` (`tools/cis490_jobs.py`) | anywhere | Operator CLI: status, list, show, cancel, requeue, reload, workers |

## How the operator controls it

**Edit the manifest** (`/etc/cis490/training_manifest.toml`):
- Add or remove `[[jobs]]` entries
- Change priorities, hyperparameters, capability constraints
- Add a new host under `[hosts.<name>]` with `allow_jobs` / `deny_jobs` / `priority`

**Reload**:
```sh
cis490-jobs reload
# or: systemctl reload cis490-trainer-receiver.service
# or: sudo kill -HUP $(pgrep -f training.fleet.receiver)
```
The reload is idempotent. Existing rows keep their status; new jobs become
claimable; jobs the operator removes from the manifest **stay** in the
queue (use `cis490-jobs cancel <id>` to mark them `cancelled`).

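
The idempotent reload described above can be sketched with SQLite's
`INSERT OR IGNORE`. This is a minimal illustration, not the code in
`queue.py`: the two-column `jobs` table and the `sync_manifest` helper are
assumptions made for the example.

```python
import sqlite3


def sync_manifest(conn: sqlite3.Connection, job_ids: list[str]) -> int:
    """Queue manifest jobs that are not in the table yet; return how many were new."""
    before = conn.total_changes
    # Rows whose job_id already exists are skipped, so their status is untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (job_id, status) VALUES (?, 'pending')",
        [(job_id,) for job_id in job_ids],
    )
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")

sync_manifest(conn, ["job-a", "job-b"])
conn.execute("UPDATE jobs SET status = 'completed' WHERE job_id = 'job-a'")

# Reload with job-b removed from the manifest and job-c added.
added = sync_manifest(conn, ["job-a", "job-c"])
rows = dict(conn.execute("SELECT job_id, status FROM jobs"))
```

After the second sync, `job-a` is still `completed`, `job-c` is newly
`pending`, and the removed `job-b` remains queued until it is cancelled.
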
**Status**:
```sh
cis490-jobs status
cis490-jobs list --status running
cis490-jobs show transformer-oracle
cis490-jobs workers
```

**Override a stuck job**:
```sh
cis490-jobs requeue <job_id>     # force back to pending from any state
cis490-jobs cancel <job_id>
```
Note: `requeue` and `cancel` require `$CIS490_OPERATOR_TOKEN` to match the
receiver's configured operator token.

## Adding a new training host

### Linux (Pi, GPU box, anything that can run torch)

```sh
# On the host you want to enroll, as root:
git clone http://maxgit.wg/spectral/CIS490 /opt/cis490
cd /opt/cis490
python3 -m venv .venv && .venv/bin/pip install -e '.[training]'
sudo /opt/cis490/scripts/install-training-worker.sh
```

The script:
1. Verifies the WG mesh + receiver reachability
2. Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM)
3. Drops `/etc/cis490/trainer-worker.env` with the receiver URL
4. Installs and starts `cis490-trainer-worker.service`
5. Tails the journal so you see the worker claim its first job

### Windows (e.g., the operator's desktop with the GPU)

```powershell
# As Administrator in PowerShell:
git clone http://maxgit.wg/spectral/CIS490 C:\cis490
cd C:\cis490
py -3.11 -m venv .venv
.\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
.\.venv\Scripts\pip install -e .

powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1
```

The script registers a Scheduled Task that runs the worker at startup and
restarts it if it stops. Logs go to `C:\cis490\logs\trainer-worker.log`.

### After enrollment

The new host appears in `cis490-jobs workers` within ~15 s. The receiver
sees its capability and starts handing it eligible jobs. **You did not
need to coordinate with anyone**: the operator-defined manifest already
described what jobs are out there; the new host just claimed the ones
its capability (for example, its CUDA capacity) made eligible.

## Capability gating

Each job declares constraints; each worker self-reports capability. The
receiver computes eligibility and only hands a job to a worker that
can run it.

```
                 require_cuda  prefer_cuda  min_vram_gib  Pi                     desktop GPU
gbt              no            -            0             ✓                      ✓
mlp              no            -            0             ✓                      ✓
cnn              no            yes          1             ✓ (after 5 min grace)  ✓
gru / lstm       yes           -            2             -                      ✓
transformer      yes           -            4             -                      ✓
transformer_ssl  yes           -            4             -                      ✓
```

`prefer_cuda` jobs wait `prefer_cuda_grace_s` (default 300 s) before a
CPU worker is allowed to claim them, so a GPU worker has a chance even
if a CPU worker is idle.

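
The gating rules in the table can be read as a small predicate. The sketch
below is one possible reading, not the receiver's actual code: the `Job` and
`Worker` classes and the `is_eligible` function are hypothetical, and it
assumes `min_vram_gib` only constrains CUDA workers (which is what lets the
Pi take `cnn` after the grace period).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0


@dataclass(frozen=True)
class Worker:
    has_cuda: bool
    vram_gib: float = 0.0


def is_eligible(job: Job, worker: Worker, pending_for_s: float,
                prefer_cuda_grace_s: float = 300.0) -> bool:
    """Return True if this worker may claim this job right now."""
    if worker.has_cuda:
        # Assumption: the VRAM floor is only checked on CUDA workers.
        return worker.vram_gib >= job.min_vram_gib
    if job.require_cuda:
        return False
    if job.prefer_cuda:
        # CPU workers must wait out the grace period first.
        return pending_for_s >= prefer_cuda_grace_s
    return True


pi = Worker(has_cuda=False)
gpu = Worker(has_cuda=True, vram_gib=8.0)
cnn = Job(prefer_cuda=True, min_vram_gib=1.0)
transformer = Job(require_cuda=True, min_vram_gib=4.0)
```

With these definitions, `is_eligible(cnn, pi, pending_for_s=60)` is false and
becomes true once the job has been pending for 300 s, while `transformer` is
never eligible on `pi`.
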
## Per-host policy

In the manifest:

```toml
[hosts.office-print]
allow_jobs = ["gbt", "mlp"]   # whitelist; absent or empty = all allowed
deny_jobs = []
priority = 0
```

A worker matching `office-print` will only claim jobs whose `model` is in
`allow_jobs`. Useful for "I want the Pi to never train the Transformer
even if I happened to put pytorch-cuda on it."

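
As a rough sketch of how such a policy might be evaluated, the hypothetical
`host_allows` helper below treats an absent or empty `allow_jobs` as "all
allowed", as the manifest comment says. Letting `deny_jobs` win over
`allow_jobs` is an assumption of this example, not something the manifest
documents.

```python
def host_allows(model: str, allow_jobs=(), deny_jobs=()) -> bool:
    """Return True if the host policy permits training this model."""
    if model in deny_jobs:
        # Assumption: an explicit deny takes precedence over allow_jobs.
        return False
    # An empty allow list means every model is allowed.
    return not allow_jobs or model in allow_jobs


# Policy equivalent to the [hosts.office-print] table above.
office_print = {"allow_jobs": ["gbt", "mlp"], "deny_jobs": []}

can_train_gbt = host_allows("gbt", **office_print)
can_train_transformer = host_allows("transformer", **office_print)
```
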
## Architecture notes

### Atomic claim
`JobQueue.claim_next` runs the eligibility filter in Python, then the
state transition is a single `UPDATE … WHERE status='pending'`, so exactly
one of N racing workers wins.

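
The claim step can be sketched with the standard `sqlite3` module. This is
an illustration under assumed names (the `jobs` table layout and the `claim`
helper are not taken from `queue.py`); the point is that the guard
`status = 'pending'` lives inside the `UPDATE` itself, so a second claimant
matches zero rows.

```python
import sqlite3


def claim(conn: sqlite3.Connection, job_id: str, worker: str) -> bool:
    """Try to move one job from pending to claimed; True if this caller won."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'claimed', worker = ? "
        "WHERE job_id = ? AND status = 'pending'",
        (worker, job_id),
    )
    conn.commit()
    # rowcount is 1 only for the caller whose UPDATE saw status = 'pending'.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL, worker TEXT)"
)
conn.execute("INSERT INTO jobs (job_id, status) VALUES ('gbt-oracle', 'pending')")
conn.commit()

first = claim(conn, "gbt-oracle", "worker-a")
second = claim(conn, "gbt-oracle", "worker-b")
owner = conn.execute(
    "SELECT worker FROM jobs WHERE job_id = 'gbt-oracle'"
).fetchone()[0]
```

Here the first call wins and the second returns `False`, leaving `worker-a`
as the owner.
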
### Stale-claim recovery
Workers heartbeat every 30 s. The receiver periodically sweeps for
claimed/running rows whose last heartbeat is older than 600 s and
returns them to pending (or marks failed if attempts ≥ max_attempts).
A worker crash never permanently strands a job.

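
A sweep of this kind might look like the following sketch. The column names,
the `sweep_stale` helper, and the default `max_attempts=3` are assumptions
for illustration; only the 600 s threshold comes from the text above.

```python
import sqlite3

# Rows that are held by a worker but have not heartbeated since the cutoff.
STALE = "status IN ('claimed', 'running') AND last_heartbeat < ?"


def sweep_stale(conn, now, stale_after_s=600.0, max_attempts=3):
    """Fail stale jobs that are out of attempts; return the rest to pending."""
    cutoff = now - stale_after_s
    # First give up on jobs that have already used all their attempts.
    conn.execute(
        f"UPDATE jobs SET status = 'failed' WHERE {STALE} AND attempts >= ?",
        (cutoff, max_attempts),
    )
    # Whatever is still stale after that goes back to the pending pool.
    conn.execute(
        f"UPDATE jobs SET status = 'pending', worker = NULL WHERE {STALE}",
        (cutoff,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, worker TEXT,"
    " last_heartbeat REAL, attempts INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("crashed-once", "running", "w1", 0.0, 1),
        ("crashed-often", "claimed", "w2", 0.0, 3),
        ("healthy", "running", "w3", 900.0, 1),
    ],
)
sweep_stale(conn, now=1000.0)
statuses = dict(conn.execute("SELECT job_id, status FROM jobs"))
```
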
### Artifact deduplication
The artifact_id is the sha256 of the uploaded tarball. Re-running a
job with bit-identical output (same code, same data, same hyper, same
seed) yields the same artifact_id, so the store reports the artifact as
already present and nothing is uploaded again.

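
Content addressing of this kind can be sketched in a few lines. The
`store_artifact` helper is hypothetical and covers only the store side; the
`<model>_<mode>/<sha256>/bundle.tar.zst` layout follows the path reported in
the smoke test, and everything else is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def store_artifact(root: Path, name: str, tarball: bytes) -> tuple[str, bool]:
    """Store a tarball under its sha256; return (artifact_id, was_new)."""
    artifact_id = hashlib.sha256(tarball).hexdigest()
    dest = root / name / artifact_id / "bundle.tar.zst"
    if dest.exists():
        # Same bytes were stored before, so there is nothing to write.
        return artifact_id, False
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(tarball)
    return artifact_id, True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first_id, first_new = store_artifact(root, "gbt_oracle", b"model bytes")
    second_id, second_new = store_artifact(root, "gbt_oracle", b"model bytes")
```

The second call returns the same ID with `was_new` set to `False`, which is
the "already present" case described above.
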
### Schema continuity with the supervised pipeline
The receiver's queue rows reference job_ids that hash the SAME spec
fields the trainer uses, so re-syncing a manifest after a code change
that doesn't affect the trained-model identity is a no-op. Changing
`hyper.lr` produces a NEW job_id; the queue treats it as a new job
and the old artifact stays around for comparison.

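
One way to derive such a stable ID is to hash a canonical serialization of
the spec fields (model, mode, hyper, split_recipe, train_hosts, seed). The
sketch below assumes sorted-key JSON as the canonical form; the actual
serialization in `manifest.py` may differ, and the `job_id` function here is
illustrative only.

```python
import hashlib
import json


def job_id(model, mode, hyper, split_recipe, train_hosts, seed) -> str:
    """Hash the identity-defining spec fields into a hex sha256 digest."""
    spec = {
        "model": model,
        "mode": mode,
        "hyper": hyper,
        "split_recipe": split_recipe,
        "train_hosts": train_hosts,
        "seed": seed,
    }
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = job_id("mlp", "oracle", {"lr": 0.001, "epochs": 10}, "default", ["lab-1"], 0)
reordered = job_id("mlp", "oracle", {"epochs": 10, "lr": 0.001}, "default", ["lab-1"], 0)
new_lr = job_id("mlp", "oracle", {"lr": 0.01, "epochs": 10}, "default", ["lab-1"], 0)
```

`base` and `reordered` are equal, so re-syncing an unchanged manifest maps to
the same row, while `new_lr` differs and would be queued as a new job.
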
## Endpoints (reference)

```
POST   /v1/job/claim                (worker)
POST   /v1/job/{id}/heartbeat       (worker)
POST   /v1/job/{id}/complete        (worker)
POST   /v1/job/{id}/fail            (worker)
PUT    /v1/model/{id}               (worker; uploads tarball)

GET    /v1/jobs[?status=...]        (anyone)
GET    /v1/workers                  (anyone)
POST   /v1/job/{id}/cancel          (operator: X-Operator-Token)
POST   /v1/job/{id}/requeue         (operator)
POST   /v1/manifest/reload          (operator)
GET    /v1/health                   (anyone)
```

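
For orientation, a worker-side claim request could be built with the standard
library alone, in the spirit of `client.py`. The request body shape, the
`build_claim_request` helper, and the port in the example URL are all
assumptions; the receiver's real JSON schema is not documented here.

```python
import json
import urllib.request


def build_claim_request(base_url: str, capability: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/job/claim request."""
    body = json.dumps(capability).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/job/claim",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical receiver address and capability payload.
req = build_claim_request(
    "http://127.0.0.1:8081/",
    {"hostname": "office-print", "cuda": False},
)
```

Sending it would be a matter of `urllib.request.urlopen(req, timeout=10)`
against a running receiver.
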
## Files

- `capability.py`: self-detection
- `manifest.py`: TOML loader + JobSpec / HostSpec
- `queue.py`: SQLite queue with atomic claim
- `store.py`: model-artifact store on the Pi
- `receiver.py`: Starlette app exposing the endpoints above
- `client.py`: stdlib HTTP client (no extra deps)
- `worker.py`: long-running worker daemon
- `__main__.py` not needed; each module has its own `main()`