# training/fleet/ — distributed training across multiple hosts

This is the training counterpart to the *collection* fleet
(`orchestrator/fleet.py`). The collection fleet is embarrassingly parallel:
every lab host runs the same manifest and produces independent data.
The training fleet is the opposite: each `(model, mode, hyper)` job is
trained at most once, so the receiver coordinates which worker gets
which job.

## Roles

| Component | Where it runs | Responsibility |
|---|---|---|
| `cis490-trainer-receiver.service` | Pi (`10.100.0.1`) | Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest |
| `cis490-trainer-worker.service` | every training host | Self-detect capability → claim eligible job → run trainer → ship artifact → repeat |
| `etc/training_manifest.toml` | Pi `/etc/cis490/` | Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints |
| `cis490-jobs` (`tools/cis490_jobs.py`) | anywhere | Operator CLI: status, list, show, cancel, requeue, reload |

## How the operator controls it

**Edit the manifest** (`/etc/cis490/training_manifest.toml`):
- Add or remove `[[jobs]]` entries
- Change priorities, hyperparameters, capability constraints
- Add a new host under `[hosts.<name>]` with `allow_jobs`, `deny_jobs`, and `priority`

**Reload**:
```sh
cis490-jobs reload
# or:  systemctl reload cis490-trainer-receiver.service
# or:  sudo kill -HUP $(pgrep -f training.fleet.receiver)
```
The reload is idempotent. Existing rows keep their status; new jobs become
claimable; jobs the operator removes from the manifest **stay** in the
queue (use `cis490-jobs cancel <id>` to mark them `cancelled`).
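
A minimal sketch of why such a reload can be idempotent, assuming the queue is a SQLite table keyed by job id. The table layout and the `sync_manifest` helper are hypothetical illustrations, not the receiver's actual code: new ids are inserted as `pending`, existing rows are left alone, and rows missing from the manifest are never deleted.

```python
import sqlite3


def sync_manifest(conn: sqlite3.Connection, job_ids: list[str]) -> int:
    """Insert manifest jobs that are not queued yet; return how many were added."""
    added = 0
    with conn:  # one transaction for the whole sync
        for job_id in job_ids:
            cur = conn.execute(
                "INSERT OR IGNORE INTO jobs (job_id, status) VALUES (?, 'pending')",
                (job_id,),
            )
            added += cur.rowcount  # 1 if inserted, 0 if the row already existed
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")

sync_manifest(conn, ["a", "b"])
conn.execute("UPDATE jobs SET status = 'running' WHERE job_id = 'a'")

# Second sync: 'a' was removed from the manifest, 'c' is new.
added = sync_manifest(conn, ["b", "c"])
rows = dict(conn.execute("SELECT job_id, status FROM jobs"))
```

After the second sync, `a` is still present and still `running`, `b` is untouched, and only `c` was added.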

**Status**:
```sh
cis490-jobs status
cis490-jobs list --status running
cis490-jobs show transformer-oracle
cis490-jobs workers
```

**Override a stuck job**:
```sh
cis490-jobs requeue <job_id>   # force back to pending from any state
cis490-jobs cancel <job_id>
```
Note: `requeue` and `cancel` call operator endpoints, so
`$CIS490_OPERATOR_TOKEN` must match the receiver's configured operator token.

## Adding a new training host

### Linux (Pi, GPU box, anything that can run torch)

```sh
# On the host you want to enroll, as root:
git clone http://maxgit.wg/spectral/CIS490 /opt/cis490
cd /opt/cis490
python3 -m venv .venv && .venv/bin/pip install -e '.[training]'
sudo /opt/cis490/scripts/install-training-worker.sh
```

The script:
1. Verifies the WG mesh + receiver reachability
2. Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM)
3. Drops `/etc/cis490/trainer-worker.env` with the receiver URL
4. Installs and starts `cis490-trainer-worker.service`
5. Tails the journal so you see the worker claim its first job

### Windows (e.g., the operator's desktop with the GPU)

```powershell
# As Administrator in PowerShell:
git clone http://maxgit.wg/spectral/CIS490 C:\cis490
cd C:\cis490
py -3.11 -m venv .venv
.\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
.\.venv\Scripts\pip install -e .

powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1
```

The script registers a Scheduled Task that runs the worker at startup and
restarts it if it stops. The worker logs to `C:\cis490\logs\trainer-worker.log`.

### After enrollment

The new host appears in `cis490-jobs workers` within ~15 s. The receiver
sees its capability and starts handing it eligible jobs. **You do not
need to coordinate with anyone**: the operator-defined manifest already
describes which jobs exist, and the new host simply claims the ones its
capability makes it eligible for.

## Capability gating

Each job declares constraints; each worker self-reports capability. The
receiver computes eligibility and only hands a job to a worker that
can run it.

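| Model | `require_cuda` | `prefer_cuda` | `min_vram_gib` | Pi | Desktop GPU |
|---|---|---|---|---|---|
| `gbt` | no | - | 0 | ✓ | ✓ |
| `mlp` | no | - | 0 | ✓ | ✓ |
| `cnn` | no | yes | 1 | ✓ (after 5 min grace) | ✓ |
| `gru` / `lstm` | yes | - | 2 | - | ✓ |
| `transformer` | yes | - | 4 | - | ✓ |
| `transformer_ssl` | yes | - | 4 | - | ✓ |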

`prefer_cuda` jobs wait `prefer_cuda_grace_s` (default 300 s) before a
CPU worker is allowed to claim them — so a GPU worker has a chance even
if a CPU worker is idle.
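
The gating rules can be sketched as a single predicate. This is an illustration under stated assumptions, not the receiver's actual filter: the class names, the `eligible` function, and the `pending_for_s` argument are hypothetical, and the sketch assumes `min_vram_gib` is only checked on CUDA workers (which is what the `cnn` row suggests, since the Pi can run it after the grace period).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobConstraints:
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0


@dataclass(frozen=True)
class WorkerCapability:
    has_cuda: bool
    vram_gib: float = 0.0


def eligible(
    job: JobConstraints,
    worker: WorkerCapability,
    pending_for_s: float,
    prefer_cuda_grace_s: float = 300.0,
) -> bool:
    """Return True if this worker may claim the job right now."""
    if worker.has_cuda:
        # Assumption: the VRAM floor only applies to CUDA workers.
        return worker.vram_gib >= job.min_vram_gib
    if job.require_cuda:
        return False
    if job.prefer_cuda:
        # CPU workers must wait out the grace period first.
        return pending_for_s >= prefer_cuda_grace_s
    return True


pi = WorkerCapability(has_cuda=False)
desktop = WorkerCapability(has_cuda=True, vram_gib=8.0)
cnn = JobConstraints(prefer_cuda=True, min_vram_gib=1.0)
transformer = JobConstraints(require_cuda=True, min_vram_gib=4.0)
```

With these values, `eligible(cnn, pi, pending_for_s=10)` is false, `eligible(cnn, pi, pending_for_s=300)` is true, and the transformer job is never eligible on the Pi.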

## Per-host policy

In the manifest:

```toml
[hosts.office-print]
allow_jobs = ["gbt", "mlp"]    # whitelist; absent or empty = all allowed
deny_jobs = []
priority   = 0
```

A worker matching `office-print` will only claim jobs whose `model` is in
`allow_jobs`. Useful for "I want the Pi to never train the Transformer
even if I happened to put pytorch-cuda on it."
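
As a rough sketch of how such a policy could be evaluated: the function name is hypothetical, and letting `deny_jobs` take precedence over `allow_jobs` is an assumption, since the manifest semantics for a model listed in both are not stated here.

```python
def allowed_by_host_policy(
    model: str, allow_jobs: list[str], deny_jobs: list[str]
) -> bool:
    """Apply a host's allow/deny lists to a job's model name."""
    if model in deny_jobs:
        return False  # assumption: an explicit deny always wins
    # An absent or empty allow list means every model is allowed.
    return not allow_jobs or model in allow_jobs


# The office-print policy from the manifest snippet above.
office_print = {"allow_jobs": ["gbt", "mlp"], "deny_jobs": []}

can_train_gbt = allowed_by_host_policy("gbt", **office_print)
can_train_transformer = allowed_by_host_policy("transformer", **office_print)
```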

## Architecture notes

### Atomic claim
`JobQueue.claim_next` runs the eligibility filter in Python, then the
state transition is a single `UPDATE … WHERE status='pending'` — exactly
one of N racing workers wins.
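
A minimal sketch of the conditional-update pattern, assuming a simplified `jobs` table. The schema and the `try_claim` helper are hypothetical and stand in for whatever `JobQueue.claim_next` does internally. Because the `WHERE` clause re-checks `status = 'pending'`, only the first caller changes the row, and the affected-row count tells each caller whether it won.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: str, worker: str) -> bool:
    """Attempt the pending -> claimed transition; True only for the winner."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'claimed', worker = ? "
            "WHERE job_id = ? AND status = 'pending'",
            (worker, job_id),
        )
    # rowcount is 1 if this call flipped the row, 0 if it was already claimed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, worker TEXT)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending', NULL)")

first = try_claim(conn, "job-1", "gpu-box")   # wins the claim
second = try_claim(conn, "job-1", "pi")       # loses: row is no longer pending
owner = conn.execute("SELECT worker FROM jobs WHERE job_id = 'job-1'").fetchone()[0]
```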

### Stale-claim recovery
Workers heartbeat every 30 s. The receiver periodically sweeps for
claimed/running rows whose last heartbeat is older than 600 s and
returns them to pending (or marks failed if attempts ≥ max_attempts).
A worker crash never permanently strands a job.
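
The sweep can be sketched as one SQL statement. The column names (`last_heartbeat`, `attempts`, `max_attempts`) and the `sweep_stale` helper are assumptions for illustration; only the 600 s threshold and the pending-or-failed outcome come from the description above.

```python
import sqlite3


def sweep_stale(conn: sqlite3.Connection, now: float, stale_after_s: float = 600.0) -> int:
    """Recover claimed/running jobs whose heartbeat is too old; return the count."""
    cutoff = now - stale_after_s
    with conn:
        cur = conn.execute(
            """
            UPDATE jobs
               SET status = CASE WHEN attempts >= max_attempts
                                 THEN 'failed' ELSE 'pending' END
             WHERE status IN ('claimed', 'running')
               AND last_heartbeat < ?
            """,
            (cutoff,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, "
    "last_heartbeat REAL, attempts INTEGER, max_attempts INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("stale-retry", "running", 100.0, 1, 3),  # stale, attempts left
        ("stale-spent", "claimed", 100.0, 3, 3),  # stale, attempts exhausted
        ("healthy", "running", 950.0, 1, 3),      # heartbeat is recent
    ],
)

recovered = sweep_stale(conn, now=1000.0)
statuses = dict(conn.execute("SELECT job_id, status FROM jobs"))
```

Here the first row returns to `pending`, the second becomes `failed`, and the healthy row is left alone.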

### Artifact deduplication
The `artifact_id` is the SHA-256 of the uploaded tarball. Re-running a
job with bit-identical output (same code, same data, same hyperparameters,
same seed) yields the same `artifact_id`, so the store reports the artifact
as already present and nothing is re-uploaded.
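
Content addressing of this kind can be sketched with an in-memory stand-in for the store. The `ArtifactStore` class below is hypothetical and is not the API of `store.py`; it only shows that hashing the tarball bytes makes a repeated upload detectable.

```python
import hashlib


class ArtifactStore:
    """Toy content-addressed store keyed by the SHA-256 of the tarball bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, tarball: bytes) -> tuple[str, bool]:
        """Store the tarball; return (artifact_id, created)."""
        artifact_id = hashlib.sha256(tarball).hexdigest()
        if artifact_id in self._blobs:
            return artifact_id, False  # already present, nothing to store
        self._blobs[artifact_id] = tarball
        return artifact_id, True


store = ArtifactStore()
first_id, first_created = store.put(b"model weights, run 1")
second_id, second_created = store.put(b"model weights, run 1")  # bit-identical
other_id, other_created = store.put(b"model weights, run 2")
```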

### Schema continuity with the supervised pipeline
The receiver's queue rows reference job_ids that hash the SAME spec
fields the trainer uses, so re-syncing a manifest after a code change
that doesn't affect the trained-model identity is a no-op. Changing
`hyper.lr` produces a NEW job_id — the queue treats it as a new job
and the old artifact stays around for comparison.
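
One way to derive a stable id from spec fields is to hash a canonical serialization. The sketch below is an assumption-laden illustration: which fields the real trainer hashes, the JSON encoding, and the 16-character truncation are all hypothetical choices, as are the example values.

```python
import hashlib
import json


def job_id(model: str, mode: str, hyper: dict) -> str:
    """Hash the identity-defining spec fields into a short, stable id."""
    spec = {"model": model, "mode": mode, "hyper": hyper}
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = job_id("transformer", "oracle", {"lr": 0.001, "epochs": 20})
reordered = job_id("transformer", "oracle", {"epochs": 20, "lr": 0.001})
new_lr = job_id("transformer", "oracle", {"lr": 0.0003, "epochs": 20})
```

`base` and `reordered` are equal because only the spec content matters, while `new_lr` differs and would be queued as a separate job.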

## Endpoints (reference)

```
POST /v1/job/claim                (worker)
POST /v1/job/{id}/heartbeat       (worker)
POST /v1/job/{id}/complete        (worker)
POST /v1/job/{id}/fail            (worker)
PUT  /v1/model/{id}               (worker — uploads tarball)

GET  /v1/jobs[?status=...]        (anyone)
GET  /v1/workers                  (anyone)
POST /v1/job/{id}/cancel          (operator: X-Operator-Token)
POST /v1/job/{id}/requeue         (operator)
POST /v1/manifest/reload          (operator)
GET  /v1/health                   (anyone)
```
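
As an illustration of calling an operator endpoint with only the standard library, the sketch below builds (but does not send) a cancel request. The base URL and job id are placeholders, the `operator_request` helper is hypothetical, and the empty request body is an assumption, since the payload format is not documented here.

```python
import urllib.request


def operator_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying the operator token header."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=b"",  # assumption: no request body is required
        method="POST",
        headers={"X-Operator-Token": token},
    )


# Placeholder receiver URL, job id, and token.
req = operator_request("http://receiver.example/", "/v1/job/abc123/cancel", "s3cret")
```

Passing `req` to `urllib.request.urlopen` would send it. In practice the token would come from `$CIS490_OPERATOR_TOKEN` rather than a literal.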

## Files

- `capability.py` — self-detection
- `manifest.py` — TOML loader + JobSpec / HostSpec
- `queue.py` — SQLite queue with atomic claim
- `store.py` — model-artifact store on the Pi
- `receiver.py` — Starlette app exposing the endpoints above
- `client.py` — stdlib HTTP client (no extra deps)
- `worker.py` — long-running worker daemon
- No `__main__.py` is needed; each module has its own `main()`