CIS490/training/fleet
Max c42bf033e5 training/fleet/manifest: accept knn + knn_semi in _ALLOWED_MODELS
Validator's allowed-models frozenset was missing knn and knn_semi
even though the manifest gained those jobs and the model registry
registered the classes. Lambda bootstrap blocked at:
  TrainingManifestError: job 'knn-realistic': model 'knn' not in
    ['cnn', 'gbt', 'gru', 'lstm', 'mlp', 'transformer', 'transformer_ssl']

Now {gbt, knn, knn_semi, mlp, cnn, gru, lstm, transformer, transformer_ssl}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:46:33 -05:00
..
__init__.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
capability.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
client.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
manifest.py training/fleet/manifest: accept knn + knn_semi in _ALLOWED_MODELS 2026-05-08 14:46:33 -05:00
queue.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
README.md training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
receiver.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
store.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
worker.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00

training/fleet/ — distributed training across multiple hosts

Symmetric to the collection fleet (orchestrator/fleet.py), but for training the models. The collection fleet is embarrassingly parallel (every lab host runs the same manifest and produces independent data). The training fleet is the opposite: each (model, mode, hyper) job is trained at most once, so the receiver coordinates which worker gets which job.

Roles

Component Where it runs Responsibility
cis490-trainer-receiver.service Pi (10.100.0.1) Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest
cis490-trainer-worker.service every training host Self-detect capability → claim eligible job → run trainer → ship artifact → repeat
etc/training_manifest.toml Pi /etc/cis490/ Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints
cis490-jobs (tools/cis490_jobs.py) anywhere Operator CLI: status, list, show, cancel, requeue, reload

How the operator controls it

Edit the manifest (/etc/cis490/training_manifest.toml):

  • Add or remove [[jobs]] entries
  • Change priorities, hyperparameters, capability constraints
  • Add a new host under [hosts.<name>] with allow_jobs / deny_jobs / priority

Reload:

cis490-jobs reload
# or:  systemctl reload cis490-trainer-receiver.service
# or:  sudo kill -HUP $(pgrep -f training.fleet.receiver)

The reload is idempotent. Existing rows keep their status; new jobs become claimable; jobs the operator removes from the manifest stay in the queue (use cis490-jobs cancel <id> to mark them cancelled).

Status:

cis490-jobs status
cis490-jobs list --status running
cis490-jobs show transformer-oracle
cis490-jobs workers

Override a stuck job:

cis490-jobs requeue <job_id>   # force back to pending from any state
cis490-jobs cancel <job_id>

Note: requeue requires $CIS490_OPERATOR_TOKEN to match the receiver's configured operator token.

Adding a new training host

Linux (Pi, GPU box, anything that can run torch)

# On the host you want to enroll, as root:
git clone http://maxgit.wg/spectral/CIS490 /opt/cis490
cd /opt/cis490
python3 -m venv .venv && .venv/bin/pip install -e '.[training]'
sudo /opt/cis490/scripts/install-training-worker.sh

The script:

  1. Verifies the WG mesh + receiver reachability
  2. Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM)
  3. Drops /etc/cis490/trainer-worker.env with the receiver URL
  4. Installs and starts cis490-trainer-worker.service
  5. Tails the journal so you see the worker claim its first job

Windows (e.g., the operator's desktop with the GPU)

# As Administrator in PowerShell:
git clone http://maxgit.wg/spectral/CIS490 C:\cis490
cd C:\cis490
py -3.11 -m venv .venv
.\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
.\.venv\Scripts\pip install -e .

powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1

Registers a Scheduled Task that runs the worker at startup + restarts it if it stops. Logs to C:\cis490\logs\trainer-worker.log.

After enrollment

The new host appears in cis490-jobs workers within ~15 s. The receiver sees its capability and starts handing it eligible jobs. You did not need to coordinate with anyone — the operator-defined manifest already described what jobs are out there; the new host just claimed the ones its CUDA capacity unblocked.

Capability gating

Each job declares constraints; each worker self-reports capability. The receiver computes eligibility and only hands a job to a worker that can run it.

                   require_cuda  prefer_cuda  min_vram_gib  Pi  desktop GPU
gbt                  no            -             0           ✓        ✓
mlp                  no            -             0           ✓        ✓
cnn                  no            yes           1           ✓ (after  ✓
                                                                 5min grace)
gru / lstm           yes           -             2           -        ✓
transformer          yes           -             4           -        ✓
transformer_ssl      yes           -             4           -        ✓

prefer_cuda jobs wait prefer_cuda_grace_s (default 300 s) before a CPU worker is allowed to claim them — so a GPU worker has a chance even if a CPU worker is idle.

Per-host policy

In the manifest:

[hosts.office-print]
allow_jobs = ["gbt", "mlp"]    # whitelist; absent or empty = all allowed
deny_jobs = []
priority   = 0

A worker matching office-print will only claim jobs whose model is in allow_jobs. Useful for "I want the Pi to never train the Transformer even if I happened to put pytorch-cuda on it."

Architecture notes

Atomic claim

JobQueue.claim_next runs the eligibility filter in Python, then the state transition is a single UPDATE … WHERE status='pending' — exactly one of N racing workers wins.

Stale-claim recovery

Workers heartbeat every 30 s. The receiver periodically sweeps for claimed/running rows whose last heartbeat is older than 600 s and returns them to pending (or marks failed if attempts ≥ max_attempts). A worker crash never permanently strands a job.

Artifact deduplication

The artifact_id is the sha256 of the uploaded tarball. Re-running a job with bit-identical output (same code, same data, same hyper, same seed) → already-present, no re-upload.

Schema continuity with the supervised pipeline

The receiver's queue rows reference job_ids that hash the SAME spec fields the trainer uses, so re-syncing a manifest after a code change that doesn't affect the trained-model identity is a no-op. Changing hyper.lr produces a NEW job_id — the queue treats it as a new job and the old artifact stays around for comparison.

Endpoints (reference)

POST /v1/job/claim                (worker)
POST /v1/job/{id}/heartbeat       (worker)
POST /v1/job/{id}/complete        (worker)
POST /v1/job/{id}/fail            (worker)
PUT  /v1/model/{id}               (worker — uploads tarball)

GET  /v1/jobs[?status=...]        (anyone)
GET  /v1/workers                  (anyone)
POST /v1/job/{id}/cancel          (operator: X-Operator-Token)
POST /v1/job/{id}/requeue         (operator)
POST /v1/manifest/reload          (operator)
GET  /v1/health                   (anyone)

Files

  • capability.py — self-detection
  • manifest.py — TOML loader + JobSpec / HostSpec
  • queue.py — SQLite queue with atomic claim
  • store.py — model-artifact store on the Pi
  • receiver.py — Starlette app exposing the endpoints above
  • client.py — stdlib HTTP client (no extra deps)
  • worker.py — long-running worker daemon
  • __main__.py not needed; each module has its own main()