Validator's allowed-models frozenset was missing knn and knn_semi
even though the manifest gained those jobs and the model registry
registered the classes. Lambda bootstrap blocked at:
TrainingManifestError: job 'knn-realistic': model 'knn' not in
['cnn', 'gbt', 'gru', 'lstm', 'mlp', 'transformer', 'transformer_ssl']
Now {gbt, knn, knn_semi, mlp, cnn, gru, lstm, transformer, transformer_ssl}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
training/fleet/ — distributed training across multiple hosts
Symmetric to the collection fleet (orchestrator/fleet.py), but for
training the models. The collection fleet is embarrassingly parallel
(every lab host runs the same manifest and produces independent data).
The training fleet is the opposite: each (model, mode, hyper) job is
trained at most once, so the receiver coordinates which worker gets
which job.
Roles
| Component | Where it runs | Responsibility |
|---|---|---|
| cis490-trainer-receiver.service | Pi (10.100.0.1) | Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest |
| cis490-trainer-worker.service | every training host | Self-detect capability → claim eligible job → run trainer → ship artifact → repeat |
| etc/training_manifest.toml | Pi /etc/cis490/ | Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints |
| cis490-jobs (tools/cis490_jobs.py) | anywhere | Operator CLI: status, list, show, cancel, requeue, reload |
How the operator controls it
Edit the manifest (/etc/cis490/training_manifest.toml):
- Add or remove [[jobs]] entries
- Change priorities, hyperparameters, capability constraints
- Add a new host under [hosts.<name>] with allow_jobs / deny_jobs / priority
Reload:
cis490-jobs reload
# or: systemctl reload cis490-trainer-receiver.service
# or: sudo kill -HUP $(pgrep -f training.fleet.receiver)
The reload is idempotent. Existing rows keep their status; new jobs become
claimable; jobs the operator removes from the manifest stay in the
queue (use cis490-jobs cancel <id> to mark them cancelled).
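The idempotent reload described above can be pictured as an insert-if-absent sync against the queue table. This is a minimal sketch, not the receiver's actual code: the table layout `jobs(job_id, status)` and the function name `sync_manifest` are assumptions made for illustration.

```python
from __future__ import annotations

import sqlite3


def sync_manifest(conn: sqlite3.Connection, manifest_job_ids: list[str]) -> int:
    """Queue manifest jobs that are not yet present; never touch existing rows.

    Assumes a hypothetical table: jobs(job_id TEXT PRIMARY KEY, status TEXT).
    Returns the number of newly added (now claimable) jobs.
    """
    added = 0
    with conn:  # one transaction for the whole sync
        for job_id in manifest_job_ids:
            cur = conn.execute(
                "INSERT OR IGNORE INTO jobs (job_id, status) VALUES (?, 'pending')",
                (job_id,),
            )
            added += cur.rowcount  # 1 if inserted, 0 if the row already existed
    # Jobs missing from the manifest are deliberately left alone, matching the
    # "removed jobs stay in the queue" behaviour described above.
    return added
```

Because existing rows are skipped rather than overwritten, running the sync twice with the same manifest changes nothing the second time.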
Status:
cis490-jobs status
cis490-jobs list --status running
cis490-jobs show transformer-oracle
cis490-jobs workers
Override a stuck job:
cis490-jobs requeue <job_id> # force back to pending from any state
cis490-jobs cancel <job_id>
Note: requeue requires $CIS490_OPERATOR_TOKEN to match the receiver's
configured operator token.
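For scripting, the same requeue can be issued directly against the endpoint listed in the reference below. The sketch assumes the requeue endpoint accepts the same `X-Operator-Token` header that the cancel endpoint documents; the base URL (including any port) is a placeholder you must supply, and the function names are hypothetical.

```python
import urllib.request


def build_requeue_request(base_url: str, job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an operator requeue call.

    Assumption: requeue authenticates with the X-Operator-Token header,
    as the cancel endpoint does.
    """
    return urllib.request.Request(
        url=f"{base_url}/v1/job/{job_id}/requeue",
        data=b"",  # empty body; the job is identified by the URL path
        headers={"X-Operator-Token": token},
        method="POST",
    )


def requeue(base_url: str, job_id: str, token: str) -> int:
    """Send the request and return the HTTP status code."""
    req = build_requeue_request(base_url, job_id, token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

In practice the token would come from `os.environ["CIS490_OPERATOR_TOKEN"]` rather than being hard-coded.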
Adding a new training host
Linux (Pi, GPU box, anything that can run torch)
# On the host you want to enroll, as root:
git clone http://maxgit.wg/spectral/CIS490 /opt/cis490
cd /opt/cis490
python3 -m venv .venv && .venv/bin/pip install -e '.[training]'
sudo /opt/cis490/scripts/install-training-worker.sh
The script:
- Verifies the WG mesh + receiver reachability
- Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM)
- Drops /etc/cis490/trainer-worker.env with the receiver URL
- Installs and starts cis490-trainer-worker.service
- Tails the journal so you see the worker claim its first job
Windows (e.g., the operator's desktop with the GPU)
# As Administrator in PowerShell:
git clone http://maxgit.wg/spectral/CIS490 C:\cis490
cd C:\cis490
py -3.11 -m venv .venv
.\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
.\.venv\Scripts\pip install -e .
powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1
The script registers a Scheduled Task that runs the worker at startup and
restarts it if it stops. Logs go to C:\cis490\logs\trainer-worker.log.
After enrollment
The new host appears in cis490-jobs workers within ~15 s. The receiver
sees its capability and starts handing it eligible jobs. No coordination
with anyone is needed: the operator-defined manifest already describes
which jobs exist, and the new host simply claims the ones its capability
(for example, CUDA) unblocks.
Capability gating
Each job declares constraints; each worker self-reports capability. The receiver computes eligibility and only hands a job to a worker that can run it.
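| model | require_cuda | prefer_cuda | min_vram_gib | Pi | desktop GPU |
|---|---|---|---|---|---|
| gbt | no | - | 0 | ✓ | ✓ |
| mlp | no | - | 0 | ✓ | ✓ |
| cnn | no | yes | 1 | ✓ (after 5 min grace) | ✓ |
| gru / lstm | yes | - | 2 | - | ✓ |
| transformer | yes | - | 4 | - | ✓ |
| transformer_ssl | yes | - | 4 | - | ✓ |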
prefer_cuda jobs wait prefer_cuda_grace_s (default 300 s) before a
CPU worker is allowed to claim them — so a GPU worker has a chance even
if a CPU worker is idle.
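The gating rules in the table can be sketched as a single predicate. This is an illustration under stated assumptions, not the receiver's code: the dataclass and field names are hypothetical, and it assumes min_vram_gib is only checked on CUDA workers (which is what lets the Pi run cnn after the grace period).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JobConstraints:
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0


@dataclass(frozen=True)
class WorkerCapability:
    has_cuda: bool
    vram_gib: float = 0.0


def is_eligible(
    job: JobConstraints,
    worker: WorkerCapability,
    pending_for_s: float,
    prefer_cuda_grace_s: float = 300.0,
) -> bool:
    """Return True if `worker` may claim `job` right now."""
    if worker.has_cuda:
        # Assumption: the VRAM floor only applies to CUDA workers.
        return worker.vram_gib >= job.min_vram_gib
    if job.require_cuda:
        return False  # a CPU-only worker can never take this job
    if job.prefer_cuda:
        # A CPU worker must wait out the grace period so a GPU worker gets first pick.
        return pending_for_s >= prefer_cuda_grace_s
    return True
```

With these definitions, a cnn job (`prefer_cuda=True, min_vram_gib=1`) is refused to a CPU-only worker at 10 s and allowed at 300 s, while a CUDA worker with enough VRAM can take it immediately.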
Per-host policy
In the manifest:
[hosts.office-print]
allow_jobs = ["gbt", "mlp"] # whitelist; absent or empty = all allowed
deny_jobs = []
priority = 0
A worker matching office-print will only claim jobs whose model is in
allow_jobs. Useful for "I want the Pi to never train the Transformer
even if I happened to put pytorch-cuda on it."
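The allow/deny semantics can be sketched as a small filter. The function name is hypothetical, and the precedence of deny_jobs over allow_jobs is an assumption; the README only states that an absent or empty allow_jobs means "all allowed".

```python
from __future__ import annotations


def host_allows(model: str, allow_jobs: list[str], deny_jobs: list[str]) -> bool:
    """Return True if a host's policy lets it claim jobs for `model`."""
    if model in deny_jobs:
        return False  # assumption: an explicit deny always wins
    # An empty (or absent) whitelist means every model is allowed.
    return not allow_jobs or model in allow_jobs
```

For the office-print example above, `host_allows("gbt", ["gbt", "mlp"], [])` is true and `host_allows("transformer", ["gbt", "mlp"], [])` is false.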
Architecture notes
Atomic claim
JobQueue.claim_next runs the eligibility filter in Python, then the
state transition is a single UPDATE … WHERE status='pending' — exactly
one of N racing workers wins.
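The state transition can be sketched as a conditional UPDATE whose row count tells the caller whether it won the race. This is a minimal illustration, not the body of JobQueue.claim_next; the table and column names are assumptions.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: str, worker_id: str) -> bool:
    """Atomically move one job from 'pending' to 'claimed'.

    Assumes a hypothetical table: jobs(job_id, status, worker_id).
    Returns True only for the single caller whose UPDATE matched the row.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'claimed', worker_id = ? "
            "WHERE job_id = ? AND status = 'pending'",
            (worker_id, job_id),
        )
    # Once one worker has flipped the status, the WHERE clause no longer
    # matches, so every later attempt updates zero rows.
    return cur.rowcount == 1
```

The Python-side eligibility filter only picks a candidate job; the `status = 'pending'` guard in the UPDATE is what makes the claim safe under concurrency.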
Stale-claim recovery
Workers heartbeat every 30 s. The receiver periodically sweeps for claimed/running rows whose last heartbeat is older than 600 s and returns them to pending (or marks failed if attempts ≥ max_attempts). A worker crash never permanently strands a job.
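The sweep can be sketched as two UPDATE statements in one transaction. The column names and the default max_attempts value of 3 are assumptions for illustration; only the 600 s staleness window comes from the description above.

```python
from __future__ import annotations

import sqlite3


def sweep_stale(
    conn: sqlite3.Connection,
    now: float,
    stale_after_s: float = 600.0,
    max_attempts: int = 3,  # hypothetical default
) -> tuple[int, int]:
    """Recover jobs whose worker stopped heartbeating.

    Assumes a hypothetical table:
    jobs(job_id, status, worker_id, last_heartbeat, attempts).
    Returns (requeued, failed) row counts.
    """
    cutoff = now - stale_after_s
    with conn:
        # First, give up on stale jobs that have exhausted their attempts.
        failed = conn.execute(
            "UPDATE jobs SET status = 'failed' "
            "WHERE status IN ('claimed', 'running') "
            "AND last_heartbeat < ? AND attempts >= ?",
            (cutoff, max_attempts),
        ).rowcount
        # Whatever is still stale has attempts left: return it to the queue.
        requeued = conn.execute(
            "UPDATE jobs SET status = 'pending', worker_id = NULL "
            "WHERE status IN ('claimed', 'running') AND last_heartbeat < ?",
            (cutoff,),
        ).rowcount
    return requeued, failed
```

Running the failure update first means the second statement only sees rows that are still eligible for another attempt.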
Artifact deduplication
The artifact_id is the sha256 of the uploaded tarball. Re-running a job with bit-identical output (same code, same data, same hyper, same seed) produces the same artifact_id, so the artifact counts as already-present and is not re-uploaded.
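Computing such a content-addressed ID needs only the standard library. The helper name below is hypothetical, and the sketch streams the file in chunks so large tarballs need not fit in memory.

```python
import hashlib
from pathlib import Path


def artifact_id_for(tarball: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 digest of the tarball's bytes."""
    digest = hashlib.sha256()
    with tarball.open("rb") as f:
        # Read fixed-size chunks until read() returns the empty bytes sentinel.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two tarballs get the same ID only if every byte matches, which is why the dedup applies to bit-identical output and not merely to equivalent models.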
Schema continuity with the supervised pipeline
The receiver's queue rows reference job_ids that hash the SAME spec
fields the trainer uses, so re-syncing a manifest after a code change
that doesn't affect the trained-model identity is a no-op. Changing
hyper.lr produces a NEW job_id — the queue treats it as a new job
and the old artifact stays around for comparison.
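One way to derive a stable ID from spec fields is to hash a canonical JSON encoding. This is only a sketch of the idea: the exact fields (here model, mode, and hyper), the encoding, and the 16-character truncation are assumptions, not the project's actual scheme.

```python
from __future__ import annotations

import hashlib
import json


def job_id_for(model: str, mode: str, hyper: dict) -> str:
    """Hash the identity-defining spec fields into a short hex ID."""
    canonical = json.dumps(
        {"model": model, "mode": mode, "hyper": hyper},
        sort_keys=True,          # key order in the manifest must not matter
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Anything left out of the hashed dictionary (comments, priority, code changes elsewhere) cannot change the ID, while editing a hashed value such as `hyper["lr"]` always yields a new one.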
Endpoints (reference)
POST /v1/job/claim (worker)
POST /v1/job/{id}/heartbeat (worker)
POST /v1/job/{id}/complete (worker)
POST /v1/job/{id}/fail (worker)
PUT /v1/model/{id} (worker — uploads tarball)
GET /v1/jobs[?status=...] (anyone)
GET /v1/workers (anyone)
POST /v1/job/{id}/cancel (operator: X-Operator-Token)
POST /v1/job/{id}/requeue (operator)
POST /v1/manifest/reload (operator)
GET /v1/health (anyone)
Files
- capability.py: self-detection
- manifest.py: TOML loader + JobSpec / HostSpec
- queue.py: SQLite queue with atomic claim
- store.py: model-artifact store on the Pi
- receiver.py: Starlette app exposing the endpoints above
- client.py: stdlib HTTP client (no extra deps)
- worker.py: long-running worker daemon
- __main__.py not needed; each module has its own main()