training/fleet: distributed multi-host trainer with capability gating

Symmetric companion to the collection fleet (orchestrator/fleet.py) but for *training*. Collection is embarrassingly parallel; training is not (a model is trained at most once across the fleet), so the receiver coordinates which worker gets which job. Operator-control surface is etc/training_manifest.toml.example — single canonical file declaring (a) per-host capability + per-model allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper) with capability constraints (require_cuda, prefer_cuda, min_vram_gib, min_ram_gib, allowed_hosts). Components: capability.py — self-detection: hostname, cores, RAM, CUDA presence, VRAM, torch version, git commit. Used by workers to filter eligible jobs before claiming. manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable sha256 of (model, mode, hyper, split_recipe, train_hosts, seed) so manifest reload is idempotent: existing rows keep their status, new jobs become claimable, removed jobs stay until cancelled. queue.py — SQLite job queue (training_jobs.db) with statuses pending|claimed|running|completed|failed|cancelled. Atomic claim_next via single UPDATE WHERE status='pending'. Heartbeat, complete, fail. Stale-claim sweep (stale_after_s=600s) with max_attempts cutoff to failed. store.py — model artifact store mirroring receiver/store.py. Artifact ID is the sha256 of the uploaded tarball; bit-identical re-runs deduplicate. receiver.py — Starlette app exposing 11 endpoints: POST /v1/job/claim (worker) POST /v1/job/{id}/heartbeat (worker) POST /v1/job/{id}/complete (worker) POST /v1/job/{id}/fail (worker) PUT /v1/model/{id} (worker — uploads tarball) GET /v1/jobs (anyone) GET /v1/workers (anyone) POST /v1/job/{id}/cancel (operator: X-Operator-Token) POST /v1/job/{id}/requeue (operator) POST /v1/manifest/reload (operator) GET /v1/health (anyone) Runs as cis490-trainer-receiver.service on the Pi alongside the existing receiver, on a separate port. client.py — stdlib HTTP client (urllib only, no new deps). worker.py — long-running daemon. Loop: detect capability → claim → spawn training/trainer/run.py subprocess → heartbeat every 30s → tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe. Operator CLI (tools/cis490_jobs.py): status / list / show / cancel / requeue / reload / workers. Cancel and requeue require $CIS490_OPERATOR_TOKEN matching the receiver's configured value. Bootstrap: scripts/install-training-worker.sh (Linux systemd) and scripts/install-training-worker-windows.ps1 (Windows Scheduled Task) let the operator enroll a new host with one command after cloning the repo and setting up the venv. Worker self-tests capability before registering. End-to-end smoke verified on the Pi: receiver up, manifest synced, 14 jobs queued, worker registered, claimed 4 CPU-eligible jobs (allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle, mlp-oracle), 1 failed with the actual error visible via cis490-jobs status, 3 artifacts uploaded to /var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with proper index.jsonl row. 21 unit tests (manifest validation: 8; queue lifecycle + eligibility: 13). All pass alongside the prior 17 training tests = 38 green. Open limitations surfaced inline: - Hyper-key drift between manifest and run.py fails at training time, not at manifest reload (worth tightening to argparse introspection later). - mTLS not yet wired through Caddy for the trainer-receiver port — listens loopback-only until that lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:20:20 -05:00 · 2026-05-08 01:20:20 -05:00 · 8643192a71
commit 8643192a71
parent 3ea6bca6f0
17 changed files with 3070 additions and 0 deletions
--- a/etc/cis490-trainer-receiver.service
+++ b/etc/cis490-trainer-receiver.service
@ -0,0 +1,40 @@
 [Unit]
 Description=CIS490 trainer-receiver (training-fleet coordinator)
 After=network-online.target
 Wants=network-online.target
 Documentation=https://maxgit.wg/spectral/CIS490
 [Service]
 Type=simple
 User=cis490
 Group=cis490
 EnvironmentFile=-/etc/cis490/trainer-receiver.env
 ExecStart=/opt/cis490/.venv/bin/python -m training.fleet.receiver \
    --listen-addr 127.0.0.1:8445 \
    --manifest /etc/cis490/training_manifest.toml \
    --db /var/lib/cis490/training_jobs.db \
    --store-root /var/lib/cis490/models \
    --incoming-root /var/lib/cis490/incoming-models \
    --index-path /var/lib/cis490/models/index.jsonl
 # Reload behavior — SIGHUP re-reads manifest into the queue without dropping
 # in-flight jobs. The receiver's own /v1/manifest/reload endpoint is the
 # preferred control surface; this is for systemctl reload compatibility.
 ExecReload=/bin/kill -HUP $MAINPID
 WorkingDirectory=/opt/cis490
 Restart=on-failure
 RestartSec=5s
 RestartPreventExitStatus=78          # sysadmin error — don't respawn
 # Hardening — same shape as cis490-receiver.service
 ProtectSystem=strict
 ProtectHome=true
 PrivateTmp=true
 NoNewPrivileges=true
 ReadWritePaths=/var/lib/cis490
 [Install]
 WantedBy=multi-user.target
--- a/etc/cis490-trainer-worker.service
+++ b/etc/cis490-trainer-worker.service
@ -0,0 +1,40 @@
 [Unit]
 Description=CIS490 trainer worker (claims jobs, runs trainings, ships artifacts)
 After=network-online.target
 Wants=network-online.target
 Documentation=https://maxgit.wg/spectral/CIS490
 [Service]
 Type=simple
 User=cis490
 Group=cis490
 EnvironmentFile=-/etc/cis490/trainer-worker.env
 # CIS490_TRAINER_RECEIVER_URL — set in trainer-worker.env
 # FLEET_HOST_ID — override the hostname-derived host_id (optional)
 ExecStart=/opt/cis490/.venv/bin/python -m training.fleet.worker \
    --receiver-url ${CIS490_TRAINER_RECEIVER_URL} \
    --validation /opt/cis490/data/processed/validation_v1.parquet \
    --summary /opt/cis490/data/processed/features_window_v1.parquet \
    --tensors /opt/cis490/data/processed/tensor_window_v1 \
    --artifacts-dir artifacts \
    --reports-dir reports/eval
 WorkingDirectory=/opt/cis490
 Restart=on-failure
 RestartSec=15s
 # Workers do compute-heavy training. Don't kill them just because a single
 # job failed; let the daemon's own loop handle that.
 TimeoutStopSec=120s
 ProtectSystem=strict
 ProtectHome=true
 PrivateTmp=false                     # need /tmp for trainer scratch
 NoNewPrivileges=true
 ReadWritePaths=/opt/cis490 /var/lib/cis490 /tmp
 [Install]
 WantedBy=multi-user.target
--- a/etc/training_manifest.toml.example
+++ b/etc/training_manifest.toml.example
@ -0,0 +1,216 @@
 # CIS490 training fleet manifest — example/template.
 #
 # This is the ONLY thing the operator edits to control what gets trained
 # across the training fleet. Mirrors the collection-side manifest.toml in
 # spirit: a single canonical file, no per-host overrides, every host loads
 # THIS exact file when it claims its next job.
 #
 # Copy to /etc/cis490/training_manifest.toml on the Pi (the receiver) and
 # the receiver loads it on startup + on SIGHUP. Workers don't read it
 # directly; they ask the receiver for jobs that match their capability.
 #
 # To change the fleet's plan:
 #   1. Edit this file
 #   2. systemctl reload cis490-receiver        (or send SIGHUP)
 #   3. New jobs become claimable; in-flight jobs continue
 #
 # To add a new training host (e.g., your desktop):
 #   1. Append it to [hosts.<name>] below with its declared capabilities
 #   2. Run scripts/install-training-worker-{linux,windows}.{sh,ps1} on it
 #   3. The worker connects, reports its capability, and starts claiming
 #      jobs whose constraints it satisfies
 schema_version = 1
 name = "cis490-training-v1"
 # --------------------------------------------------------------------
 # [defaults] — applied to every job unless the job overrides
 # --------------------------------------------------------------------
 [defaults]
 split_recipe = "host"               # host | sample | time
 train_hosts  = ["elliott-thinkpad"] # which hosts' episodes train; rest = test
 seed         = 0
 n_resamples  = 1000                  # bootstrap CIs
 # --------------------------------------------------------------------
 # [hosts.<name>] — declared capability for each known training host
 # --------------------------------------------------------------------
 # These declarations are *advisory*. The worker ALSO self-detects
 # capability at startup; the receiver intersects the two and uses the
 # more restrictive set. So if you say a host has a 2070 Super here but
 # the worker doesn't actually find CUDA, the worker is treated as CPU-only
 # and won't claim cuda-required jobs. This prevents misconfiguration.
 [hosts.office-print]
 description = "the Pi (receiver). CPU-only, slow. Useful for GBT smoke runs."
 priority    = 0       # higher number = pick this host first when multiple eligible
 allow_jobs  = ["gbt", "mlp"]    # whitelist of model names this host may run
 deny_jobs   = []      # blacklist; deny wins over allow
 [hosts.spectral-desktop]
 description = "operator desktop. RTX 2070 Super (~8 GiB VRAM)."
 priority    = 100
 # allow_jobs  = []    # empty list (or absent) = all jobs allowed
 # Add more hosts here as you enroll them. Names must match the worker's
 # self-reported hostname (or its FLEET_HOST_ID env var override).
 # --------------------------------------------------------------------
 # [[jobs]] — the training plan. One entry per (model, mode) you want
 # trained. Add or remove freely; the receiver re-syncs the queue
 # against the file on SIGHUP.
 # --------------------------------------------------------------------
 # ============ Tier 1: tree + dense baselines (CPU-friendly) ============
 [[jobs]]
 name        = "gbt-realistic"
 model       = "gbt"
 mode        = "realistic"
 priority    = 100                # higher = picked first when multiple eligible
 require_cuda = false             # no GPU needed; CPU is fine
 min_ram_gib  = 4
 [[jobs]]
 name        = "gbt-oracle"
 model       = "gbt"
 mode        = "oracle"
 priority    = 100
 require_cuda = false
 min_ram_gib  = 4
 [[jobs]]
 name        = "mlp-realistic"
 model       = "mlp"
 mode        = "realistic"
 priority    = 90
 require_cuda = false             # tiny MLP — CPU OK, GPU nice
 min_ram_gib  = 4
 # hyper.* keys must match flags accepted by training/trainer/run.py
 # (currently: --epochs, --batch-size, --lr, --patience). Architecture-
 # specific knobs (hidden, n_layers, dropout) are baked into the model
 # class defaults; override them by editing the model file rather than
 # via the manifest until run.py grows the corresponding flags.
 hyper.epochs = 60
 hyper.batch_size = 1024
 hyper.lr     = 1e-3
 [[jobs]]
 name        = "mlp-oracle"
 model       = "mlp"
 mode        = "oracle"
 priority    = 90
 require_cuda = false
 min_ram_gib  = 4
 # ============ Tier 2: sequence models (GPU strongly preferred) =========
 [[jobs]]
 name        = "cnn-realistic"
 model       = "cnn"
 mode        = "realistic"
 priority    = 80
 require_cuda = false             # 1D-CNN is small enough to run on CPU
 prefer_cuda = true               # but route to a GPU host if available
 min_vram_gib = 1
 hyper.epochs = 60
 hyper.batch_size = 512
 [[jobs]]
 name        = "cnn-oracle"
 model       = "cnn"
 mode        = "oracle"
 priority    = 80
 require_cuda = false
 prefer_cuda = true
 min_vram_gib = 1
 [[jobs]]
 name        = "gru-realistic"
 model       = "gru"
 mode        = "realistic"
 priority    = 70
 require_cuda = true              # RNNs slow on CPU; require GPU
 min_vram_gib = 2
 [[jobs]]
 name        = "gru-oracle"
 model       = "gru"
 mode        = "oracle"
 priority    = 70
 require_cuda = true
 min_vram_gib = 2
 [[jobs]]
 name        = "lstm-realistic"
 model       = "lstm"
 mode        = "realistic"
 priority    = 60
 require_cuda = true
 min_vram_gib = 2
 [[jobs]]
 name        = "lstm-oracle"
 model       = "lstm"
 mode        = "oracle"
 priority    = 60
 require_cuda = true
 min_vram_gib = 2
 [[jobs]]
 name        = "transformer-realistic"
 model       = "transformer"
 mode        = "realistic"
 priority    = 50
 require_cuda = true
 min_vram_gib = 4
 hyper.epochs = 80
 hyper.batch_size = 256
 [[jobs]]
 name        = "transformer-oracle"
 model       = "transformer"
 mode        = "oracle"
 priority    = 50
 require_cuda = true
 min_vram_gib = 4
 hyper.epochs = 80
 hyper.batch_size = 256
 # ============ Tier 3: self-supervised pretrain (GPU recommended) =======
 [[jobs]]
 name        = "transformer-ssl-realistic"
 model       = "transformer_ssl"
 mode        = "realistic"
 priority    = 40
 require_cuda = true
 min_vram_gib = 4
 hyper.epochs = 100
 hyper.target_fpr = 0.05
 [[jobs]]
 name        = "transformer-ssl-oracle"
 model       = "transformer_ssl"
 mode        = "oracle"
 priority    = 40
 require_cuda = true
 min_vram_gib = 4
 hyper.epochs = 100
 # Notes on the priority field:
 #   - Higher number = claimed first when multiple jobs are eligible
 #   - Tier 1 (cheap, fast, foundational) > Tier 2 (slower) > Tier 3 (research)
 #   - You can override on a per-job basis if e.g. you want to rush a
 #     specific architecture
 #
 # Notes on require_cuda vs prefer_cuda:
 #   - require_cuda = true: only CUDA workers can claim
 #   - prefer_cuda = true: any worker can claim, but CUDA workers are preferred
 #                         (the receiver waits ~5 min for a CUDA worker
 #                         before letting a CPU worker take it)
 #
 # Notes on hyperparameters:
 #   - All hyper.* keys are passed to training/trainer/run.py as --<key>
 #   - Unset keys fall back to the trainer's defaults
 #   - The receiver hashes the full (model, mode, hyper) blob into job_id
 #     so the same job always produces the same id; re-queueing is idempotent
--- a/scripts/install-training-worker-windows.ps1
+++ b/scripts/install-training-worker-windows.ps1
@ -0,0 +1,116 @@
 # Install a CIS490 trainer worker on a Windows host (e.g., the operator's
 # desktop with the GPU).
 #
 # Symmetric to install-training-worker.sh but for Windows. Sets up:
 #   - Confirms WireGuard reachability to the Pi receiver
 #   - Confirms a Python venv with torch (CUDA) is present
 #   - Registers a Scheduled Task that runs the worker at startup + every
 #     5 minutes if it isn't running
 #
 # Run as Administrator in PowerShell:
 #   powershell.exe -ExecutionPolicy Bypass -File install-training-worker-windows.ps1
 #
 # Prereqs (set up these manually before running):
 #   - Git clone of the CIS490 repo at $env:CIS490_HOME (default: C:\cis490)
 #   - Python 3.11+ in $env:CIS490_HOME\.venv with torch (CUDA) + xgboost
 #       py -3.11 -m venv .venv
 #       .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
 #       .\.venv\Scripts\pip install -e .
 #   - WireGuard tunnel up to 10.100.0.1
 #
 # After install, the worker logs go to $env:CIS490_HOME\logs\trainer-worker.log
 param(
    [string]$RepoRoot = $(if ($env:CIS490_HOME) { $env:CIS490_HOME } else { "C:\cis490" }),
    [string]$ReceiverUrl = $(if ($env:CIS490_TRAINER_RECEIVER_URL) { $env:CIS490_TRAINER_RECEIVER_URL } else { "http://10.100.0.1:8445" }),
    [string]$HostId = $(if ($env:FLEET_HOST_ID) { $env:FLEET_HOST_ID } else { $env:COMPUTERNAME })
 )
 $ErrorActionPreference = "Stop"
 if (-not (Test-Path $RepoRoot)) {
    Write-Error "Repo not found at $RepoRoot. Set `$env:CIS490_HOME or pass -RepoRoot."
    exit 1
 }
 $VenvPy = Join-Path $RepoRoot ".venv\Scripts\python.exe"
 if (-not (Test-Path $VenvPy)) {
    Write-Error @"
 No Python venv at $VenvPy.
 Set up first:
  cd $RepoRoot
  py -3.11 -m venv .venv
  .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
  .\.venv\Scripts\pip install -e .
 "@
    exit 1
 }
 # Receiver reachability
 Write-Host "Checking trainer-receiver at $ReceiverUrl..."
 try {
    $r = Invoke-WebRequest -Uri "$ReceiverUrl/v1/health" -TimeoutSec 5 -UseBasicParsing
    if ($r.StatusCode -ne 200) { throw "non-200" }
    Write-Host "  receiver OK"
 } catch {
    Write-Error @"
 Cannot reach $ReceiverUrl.
  - Is the WireGuard tunnel up? (Get-NetAdapter | ? Name -like 'wg*')
  - Is cis490-trainer-receiver.service running on the Pi?
 "@
    exit 1
 }
 # Capability self-test
 Write-Host ""
 Write-Host "=== capability self-report ==="
 & $VenvPy -m training.fleet.capability
 Write-Host ""
 # Logs dir
 $LogsDir = Join-Path $RepoRoot "logs"
 New-Item -ItemType Directory -Force -Path $LogsDir | Out-Null
 $LogPath = Join-Path $LogsDir "trainer-worker.log"
 # Build the launcher .cmd that the scheduled task invokes
 $LauncherPath = Join-Path $RepoRoot "scripts\run-trainer-worker.cmd"
@"
@echo off
 cd /d "$RepoRoot"
 set CIS490_TRAINER_RECEIVER_URL=$ReceiverUrl
 set FLEET_HOST_ID=$HostId
 "$VenvPy" -m training.fleet.worker --receiver-url "$ReceiverUrl" --host-id "$HostId" >> "$LogPath" 2>&1
 "@ | Set-Content -Encoding ASCII $LauncherPath
 Write-Host "wrote launcher: $LauncherPath"
 # Register / replace the scheduled task
 $TaskName = "CIS490-TrainerWorker"
 $existing = schtasks /Query /TN $TaskName 2>$null
 if ($existing) {
    Write-Host "removing existing scheduled task $TaskName"
    schtasks /Delete /TN $TaskName /F | Out-Null
 }
 # Run as the current user, at startup, restart if it stops, every 5 min check
 schtasks /Create /TN $TaskName /TR "`"$LauncherPath`"" /SC ONSTART /RU "$env:USERDOMAIN\$env:USERNAME" /RL HIGHEST /F | Out-Null
 # Add a second trigger that ensures the task is running every 5 minutes
 schtasks /Change /TN $TaskName /RI 5 /DU 9999:00 2>$null
 Write-Host ""
 Write-Host "scheduled task '$TaskName' created."
 Write-Host "Starting it now..."
 schtasks /Run /TN $TaskName | Out-Null
 Start-Sleep -Seconds 3
 if (Test-Path $LogPath) {
    Write-Host ""
    Write-Host "=== first 30 log lines ==="
    Get-Content $LogPath -Tail 30
 }
 Write-Host ""
 Write-Host "Done."
 Write-Host "  Logs:    Get-Content '$LogPath' -Wait"
 Write-Host "  Status:  schtasks /Query /TN $TaskName /V /FO LIST"
 Write-Host "  Stop:    schtasks /End /TN $TaskName"
 Write-Host "  Remove:  schtasks /Delete /TN $TaskName /F"
--- a/scripts/install-training-worker.sh
+++ b/scripts/install-training-worker.sh
@ -0,0 +1,83 @@
 #!/usr/bin/env bash
 # Install a CIS490 trainer worker on a Linux host (Pi or x86 GPU box).
 #
 # This is the symmetric companion to install-lab-host.sh — same idea,
 # different role. Run as root on the host you want to enroll. Prereqs:
 # - WireGuard up to 10.100.0.1
 # - A working Python 3.11+ with the training deps installed
 # - Repo cloned to /opt/cis490, working tree clean, on origin/main
 #
 # What this script does:
 #   1. Verifies repo + venv + WG mesh reachability
 #   2. Writes /etc/systemd/system/cis490-trainer-worker.service
 #   3. Drops a default /etc/cis490/trainer-worker.env (operator edits if needed)
 #   4. systemctl enable --now cis490-trainer-worker.service
 #   5. Tails the worker log briefly to confirm it claims at least one job
 set -euo pipefail
 REPO=/opt/cis490
 VENV_PY=$REPO/.venv/bin/python
 RECEIVER_URL=${CIS490_TRAINER_RECEIVER_URL:-http://10.100.0.1:8445}
 HOST_ID=${FLEET_HOST_ID:-$(hostname)}
 if [[ $EUID -ne 0 ]]; then
    echo "must run as root" >&2; exit 1
 fi
 if [[ ! -d $REPO ]]; then
    echo "repo not at $REPO; clone http://maxgit.wg/spectral/CIS490 first" >&2
    exit 1
 fi
 if [[ ! -x $VENV_PY ]]; then
    echo "no venv at $REPO/.venv. Run:" >&2
    echo "  cd $REPO && python3 -m venv .venv && .venv/bin/pip install -e ." >&2
    exit 1
 fi
 # Receiver reachable?
 if ! curl -s --max-time 3 "$RECEIVER_URL/v1/health" >/dev/null; then
    echo "trainer-receiver unreachable at $RECEIVER_URL" >&2
    echo "  - is the WG mesh up? (ip a show wg0)"  >&2
    echo "  - is cis490-trainer-receiver.service running on the Pi?" >&2
    exit 1
 fi
 # Capability self-test — what will the worker report?
 echo "=== capability self-report ==="
 sudo -u cis490 $VENV_PY -m training.fleet.capability
 echo
 # Drop the env file (idempotent — keeps existing edits)
 mkdir -p /etc/cis490
 if [[ ! -f /etc/cis490/trainer-worker.env ]]; then
    cat > /etc/cis490/trainer-worker.env <<EOF
 # CIS490 trainer-worker config
 CIS490_TRAINER_RECEIVER_URL=$RECEIVER_URL
 FLEET_HOST_ID=$HOST_ID
 EOF
    chmod 0644 /etc/cis490/trainer-worker.env
    echo "wrote /etc/cis490/trainer-worker.env"
 else
    echo "/etc/cis490/trainer-worker.env exists; leaving it alone"
 fi
 # Install the systemd unit
 cp $REPO/etc/cis490-trainer-worker.service /etc/systemd/system/
 systemctl daemon-reload
 systemctl enable --now cis490-trainer-worker.service
 # Confirm
 sleep 3
 if ! systemctl is-active --quiet cis490-trainer-worker.service; then
    echo "trainer-worker did not start; see:" >&2
    echo "  journalctl -u cis490-trainer-worker.service -n 50" >&2
    exit 1
 fi
 echo "OK. Tailing 30 lines of journal:"
 journalctl -u cis490-trainer-worker.service --no-pager -n 30
 echo
 echo "Status from the Pi:"
 echo "  ssh max@10.100.0.1 cis490-jobs status"
 echo "Local control:"
 echo "  systemctl status cis490-trainer-worker.service"
 echo "  journalctl -u cis490-trainer-worker.service -f"
--- a/tests/test_fleet_manifest.py
+++ b/tests/test_fleet_manifest.py
@ -0,0 +1,146 @@
 """Tests for training/fleet/manifest.py — TOML loader + schema."""
 from __future__ import annotations
 from pathlib import Path
 import pytest
 from training.fleet.manifest import (
    JobSpec, TrainingManifestError, load,
 )
 def _write(tmp_path: Path, body: str) -> Path:
    p = tmp_path / "training_manifest.toml"
    p.write_text(body)
    return p
 def test_load_minimal(tmp_path):
    p = _write(tmp_path, """
 schema_version = 1
 name = "test"
 [[jobs]]
 name = "gbt-r"
 model = "gbt"
 mode = "realistic"
 """)
    m = load(p)
    assert m.name == "test"
    assert len(m.jobs) == 1
    assert m.jobs[0].model == "gbt"
    assert m.jobs[0].mode == "realistic"
 def test_unknown_model_rejected(tmp_path):
    p = _write(tmp_path, """
 schema_version = 1
 name = "test"
 [[jobs]]
 name = "bogus"
 model = "transformer_xl"
 mode = "realistic"
 """)
    with pytest.raises(TrainingManifestError, match="not in"):
        load(p)
 def test_unknown_mode_rejected(tmp_path):
    p = _write(tmp_path, """
 schema_version = 1
 [[jobs]]
 name = "x"
 model = "gbt"
 mode = "weirdo"
 """)
    with pytest.raises(TrainingManifestError, match="mode"):
        load(p)
 def test_duplicate_job_id_rejected(tmp_path):
    """Same model+mode+hyper → same job_id → operator must disambiguate."""
    p = _write(tmp_path, """
 schema_version = 1
 [[jobs]]
 name = "first"
 model = "gbt"
 mode = "realistic"
 [[jobs]]
 name = "duplicate-by-content"
 model = "gbt"
 mode = "realistic"
 """)
    with pytest.raises(TrainingManifestError, match="duplicates"):
        load(p)
 def test_disambiguation_via_hyper(tmp_path):
    """Same model+mode but different hyper → different job_ids → OK."""
    p = _write(tmp_path, """
 schema_version = 1
 [[jobs]]
 name = "lr1"
 model = "gbt"
 mode = "realistic"
 hyper.lr = 0.1
 [[jobs]]
 name = "lr2"
 model = "gbt"
 mode = "realistic"
 hyper.lr = 0.05
 """)
    m = load(p)
    assert m.jobs[0].job_id != m.jobs[1].job_id
 def test_host_allow_deny(tmp_path):
    p = _write(tmp_path, """
 schema_version = 1
 [hosts.tiny]
 allow_jobs = ["gbt"]
 [hosts.huge]
 deny_jobs  = ["transformer"]
 [[jobs]]
 name = "x"
 model = "gbt"
 mode = "realistic"
 """)
    m = load(p)
    assert m.hosts["tiny"].is_model_allowed("gbt")
    assert not m.hosts["tiny"].is_model_allowed("transformer")
    assert m.hosts["huge"].is_model_allowed("gbt")
    assert not m.hosts["huge"].is_model_allowed("transformer")
 def test_job_id_stable_across_loads(tmp_path):
    src = """
 schema_version = 1
 [[jobs]]
 name = "stable"
 model = "transformer"
 mode = "oracle"
 hyper.epochs = 80
 hyper.batch_size = 256
 """
    a = load(_write(tmp_path / "a", src) if False else _write(tmp_path, src))
    p2 = tmp_path / "b.toml"
    p2.write_text(src)
    b = load(p2)
    # Same content → same job_id (it's the load-portable identity)
    assert a.jobs[0].job_id == b.jobs[0].job_id
 def test_priority_default_zero(tmp_path):
    p = _write(tmp_path, """
 schema_version = 1
 [[jobs]]
 name = "x"
 model = "gbt"
 mode = "realistic"
 """)
    m = load(p)
    assert m.jobs[0].priority == 0
--- a/tests/test_fleet_queue.py
+++ b/tests/test_fleet_queue.py
@ -0,0 +1,189 @@
 """Tests for training/fleet/queue.py — atomic claim + lifecycle."""
 from __future__ import annotations
 import json
 import time
 from pathlib import Path
 import pytest
 from training.fleet.queue import JobQueue, _eligible
@pytest.fixture
 def q(tmp_path):
    return JobQueue(tmp_path / "jobs.db")
 def _job(name: str, *, model="gbt", mode="realistic",
          require_cuda=False, prefer_cuda=False,
          min_vram_gib=0.0, min_ram_gib=2.0, min_cores=1,
          priority=10, hyper=None) -> dict:
    return {
        "name": name, "job_id": f"id-{name}",
        "model": model, "mode": mode, "priority": priority,
        "require_cuda": require_cuda, "prefer_cuda": prefer_cuda,
        "min_vram_gib": min_vram_gib, "min_ram_gib": min_ram_gib,
        "min_cores": min_cores,
        "allowed_hosts": [], "denied_hosts": [],
        "hyper": hyper or {}, "split_recipe": "host",
        "train_hosts": ["a"], "seed": 0, "n_resamples": 100,
    }
 def _cap(*, cuda=False, vram=0.0, ram=8.0, cores=4) -> dict:
    devs = ([{"name": "fake", "vram_total_gib": vram, "vram_free_gib": vram}]
            if cuda else [])
    return {"cuda_available": cuda, "cuda_devices": devs,
            "ram_available_gib": ram, "cpu_cores": cores}
 def test_sync_idempotent(q):
    counts = q.sync_from_manifest([_job("a"), _job("b")])
    assert counts["inserted"] == 2
    counts = q.sync_from_manifest([_job("a"), _job("b")])
    assert counts["unchanged"] == 2
    assert counts["inserted"] == 0
 def test_claim_priority_order(q):
    q.sync_from_manifest([
        _job("low", priority=1),
        _job("high", priority=100),
        _job("mid", priority=50),
    ])
    j = q.claim_next(worker_hostname="w", capability=_cap())
    assert j.name == "high"
    j = q.claim_next(worker_hostname="w", capability=_cap())
    assert j.name == "mid"
 def test_claim_atomic_no_double_assign(q):
    q.sync_from_manifest([_job("only")])
    j1 = q.claim_next(worker_hostname="w1", capability=_cap())
    j2 = q.claim_next(worker_hostname="w2", capability=_cap())
    assert j1 is not None
    assert j2 is None  # already claimed
 def test_eligible_require_cuda(q):
    spec = _job("gpu", require_cuda=True, min_vram_gib=2.0)
    ok, reason = _eligible(spec=spec, hostname="w",
                            capability=_cap(cuda=False),
                            host_spec=None,
                            prefer_cuda_grace_s=0.0, job_age_s=10.0)
    assert not ok
    assert "no CUDA" in reason
    ok, _ = _eligible(spec=spec, hostname="w",
                      capability=_cap(cuda=True, vram=4.0),
                      host_spec=None,
                      prefer_cuda_grace_s=0.0, job_age_s=10.0)
    assert ok
 def test_eligible_min_vram_check(q):
    spec = _job("big-gpu", require_cuda=True, min_vram_gib=8.0)
    ok, reason = _eligible(spec=spec, hostname="w",
                            capability=_cap(cuda=True, vram=2.0),
                            host_spec=None,
                            prefer_cuda_grace_s=0.0, job_age_s=10.0)
    assert not ok
    assert "vram_free" in reason
 def test_prefer_cuda_grace_blocks_cpu_then_releases(q):
    spec = _job("nice-to-cuda", prefer_cuda=True)
    cap = _cap(cuda=False)
    ok_early, _ = _eligible(spec=spec, hostname="w", capability=cap,
                              host_spec=None,
                              prefer_cuda_grace_s=300.0, job_age_s=60.0)
    ok_late, _ = _eligible(spec=spec, hostname="w", capability=cap,
                            host_spec=None,
                            prefer_cuda_grace_s=300.0, job_age_s=400.0)
    assert not ok_early
    assert ok_late
 def test_host_allow_jobs_filter(q):
    spec = _job("gbt-job", model="gbt")
    spec_other = _job("transformer-job", model="transformer")
    host_spec = {"allow_jobs": ["gbt"], "deny_jobs": []}
    ok, _ = _eligible(spec=spec, hostname="pi", capability=_cap(),
                      host_spec=host_spec,
                      prefer_cuda_grace_s=0.0, job_age_s=10.0)
    assert ok
    ok, reason = _eligible(spec=spec_other, hostname="pi",
                            capability=_cap(), host_spec=host_spec,
                            prefer_cuda_grace_s=0.0, job_age_s=10.0)
    assert not ok
    assert "whitelist" in reason
 def test_lifecycle_claim_heartbeat_complete(q):
    q.sync_from_manifest([_job("x")])
    j = q.claim_next(worker_hostname="w", capability=_cap())
    assert j.status == "claimed"
    assert q.heartbeat(j.job_id, "w")
    assert q.complete(j.job_id, "w", artifact_id="abc123")
    after = q.get(j.job_id)
    assert after.status == "completed"
    assert after.artifact_id == "abc123"
 def test_heartbeat_rejects_wrong_worker(q):
    q.sync_from_manifest([_job("x")])
    j = q.claim_next(worker_hostname="w1", capability=_cap())
    assert not q.heartbeat(j.job_id, "w2")
 def test_requeue_from_any_state(q):
    q.sync_from_manifest([_job("x")])
    j = q.claim_next(worker_hostname="w", capability=_cap())
    # Stuck in claimed — operator override must work
    assert q.requeue(j.job_id)
    assert q.get(j.job_id).status == "pending"
 def test_sweep_stale(q):
    q.sync_from_manifest([_job("x")])
    j = q.claim_next(worker_hostname="w", capability=_cap())
    # Manually fudge the heartbeat to look ancient
    q._conn.execute(
        "UPDATE jobs SET heartbeat_at=? WHERE job_id=?",
        (time.time() - 10_000, j.job_id),
    )
    n = q.sweep_stale(stale_after_s=600.0, max_attempts=3)
    assert n == 1
    assert q.get(j.job_id).status == "pending"
 def test_sweep_failed_after_max_attempts(q):
    q.sync_from_manifest([_job("x")])
    # Simulate 3 prior stale claims
    for _ in range(3):
        j = q.claim_next(worker_hostname="w", capability=_cap())
        q._conn.execute(
            "UPDATE jobs SET heartbeat_at=? WHERE job_id=?",
            (time.time() - 10_000, j.job_id),
        )
        q.sweep_stale(stale_after_s=600.0, max_attempts=99)
    # On the 4th claim+stale, with max_attempts=3, sweep should mark failed
    j = q.claim_next(worker_hostname="w", capability=_cap())
    q._conn.execute(
        "UPDATE jobs SET heartbeat_at=? WHERE job_id=?",
        (time.time() - 10_000, j.job_id),
    )
    n = q.sweep_stale(stale_after_s=600.0, max_attempts=3)
    assert n == 1
    assert q.get(j.job_id).status == "failed"
 def test_workers_recorded_on_claim(q):
    q.sync_from_manifest([_job("x")])
    cap = _cap(cores=8, ram=16.0)
    q.claim_next(worker_hostname="w1", capability=cap)
    workers = q.workers()
    assert len(workers) == 1
    assert workers[0]["hostname"] == "w1"
    assert workers[0]["capability"]["cpu_cores"] == 8
--- a/tools/cis490_jobs.py
+++ b/tools/cis490_jobs.py
@ -0,0 +1,198 @@
 """cis490-jobs — operator control CLI for the training fleet.
 Talks to the trainer-receiver over HTTP. Subcommands:
  cis490-jobs status                  pretty-print queue + worker status
  cis490-jobs list [--status pending]
  cis490-jobs show <job_id>
  cis490-jobs cancel <job_id>
  cis490-jobs requeue <job_id>        force-requeue from any state
  cis490-jobs reload                  re-read manifest, sync queue
  cis490-jobs workers                 last-seen capability per worker
 Auth: control endpoints require X-Operator-Token. Set it via
 $CIS490_OPERATOR_TOKEN. Status endpoints (status, list, show, workers)
 work without a token.
 Usage from outside the Pi: set --receiver-url to the Pi's WG address
 (e.g., http://10.100.0.1:8445).
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sys
 import time
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from training.fleet.client import FleetClient
 def _client_from_args(args) -> FleetClient:
    token = (args.token if args.token
             else os.environ.get("CIS490_OPERATOR_TOKEN"))
    return FleetClient(args.receiver_url,
                        host_id=args.as_host or os.uname().nodename,
                        operator_token=token)
 def cmd_status(args) -> int:
    c = _client_from_args(args)
    jobs = c.list_jobs()
    workers = c.workers()
    from collections import Counter
    counts = Counter(j["status"] for j in jobs)
    print("=== queue ===")
    for s in ("pending", "claimed", "running", "completed", "failed", "cancelled"):
        n = counts.get(s, 0)
        print(f"  {s:>10}  {n}")
    print()
    print(f"=== workers ({len(workers)}) ===")
    now = time.time()
    for w in workers:
        cap = w.get("capability", {})
        seen = (now - float(w.get("last_seen", 0)))
        cuda = "CUDA" if cap.get("cuda_available") else "CPU"
        vram = cap.get("cuda_devices", [{}])[0].get("vram_total_gib", 0.0) \
                if cap.get("cuda_devices") else 0.0
        print(f"  {w['hostname']:>20}  {cuda}  cores={cap.get('cpu_cores')}"
              f"  ram={cap.get('ram_available_gib', 0):.1f}/"
              f"{cap.get('ram_total_gib', 0):.1f}GiB"
              f"  vram={vram:.1f}GiB  last_seen={seen:.0f}s ago")
    print()
    print("=== running ===")
    for j in jobs:
        if j["status"] in ("claimed", "running"):
            print(f"  {j['name']:>26}  by={j['claimed_by']}  status={j['status']}")
    print()
    print("=== failed ===")
    for j in jobs:
        if j["status"] == "failed":
            err = (j.get("last_error") or "")[:100]
            print(f"  {j['name']:>26}  attempts={j['attempts']}  err={err}")
    return 0
 def cmd_list(args) -> int:
    c = _client_from_args(args)
    jobs = c.list_jobs(status=args.status)
    if args.json:
        print(json.dumps(jobs, indent=2))
        return 0
    print(f"  {'name':<26}  {'model':<18} {'mode':<10} {'prio':>5} "
          f"{'status':<10} {'host':<16}")
    for j in jobs:
        print(f"  {j['name']:<26}  {j.get('model','?'):<18} "
              f"{j.get('mode','?'):<10} {j.get('priority','?'):>5} "
              f"{j['status']:<10} {(j.get('claimed_by') or '-'):<16}")
    return 0
 def cmd_show(args) -> int:
    c = _client_from_args(args)
    jobs = c.list_jobs()
    job = next((j for j in jobs if j["job_id"] == args.job_id
                or j["name"] == args.job_id), None)
    if job is None:
        print(f"no job matching {args.job_id!r}", file=sys.stderr)
        return 1
    print(json.dumps(job, indent=2))
    return 0
 def cmd_cancel(args) -> int:
    c = _client_from_args(args)
    ok = c.cancel(args.job_id)
    print("cancelled" if ok else "cancel failed (wrong state? unknown id?)",
          file=sys.stderr)
    return 0 if ok else 1
 def cmd_requeue(args) -> int:
    c = _client_from_args(args)
    ok = c.requeue(args.job_id)
    print("requeued" if ok else "requeue failed",
          file=sys.stderr)
    return 0 if ok else 1
 def cmd_reload(args) -> int:
    c = _client_from_args(args)
    res = c.reload_manifest()
    print(json.dumps(res, indent=2))
    return 0
 def cmd_workers(args) -> int:
    c = _client_from_args(args)
    workers = c.workers()
    if args.json:
        print(json.dumps(workers, indent=2))
    else:
        for w in workers:
            print(f"\n=== {w['hostname']} ===")
            cap = w.get("capability", {})
            print(f"  os/arch: {cap.get('os')}/{cap.get('arch')}")
            print(f"  python:  {cap.get('python_version')} torch={cap.get('torch_version')}")
            print(f"  cores:   {cap.get('cpu_cores')}")
            print(f"  ram:     {cap.get('ram_available_gib', 0):.1f} / "
                  f"{cap.get('ram_total_gib', 0):.1f} GiB")
            print(f"  cuda:    {cap.get('cuda_available')}")
            for d in cap.get("cuda_devices") or []:
                print(f"    {d.get('name')}  "
                      f"vram={d.get('vram_free_gib',0):.1f}/{d.get('vram_total_gib',0):.1f} GiB")
            print(f"  commit:  {(cap.get('training_commit') or '-')[:12]}")
    return 0
 def main() -> int:
    p = argparse.ArgumentParser(prog="cis490-jobs")
    p.add_argument("--receiver-url", default=os.environ.get(
        "CIS490_TRAINER_RECEIVER_URL", "http://10.100.0.1:8445"
    ))
    p.add_argument("--token",
                    help="operator token (or $CIS490_OPERATOR_TOKEN)")
    p.add_argument("--as-host", default=None,
                    help="X-Lab-Host header (default: this machine)")
    sub = p.add_subparsers(dest="cmd", required=True)
    s_status = sub.add_parser("status",
                                help="pretty-print queue + worker status")
    s_status.set_defaults(func=cmd_status)
    s_list = sub.add_parser("list", help="list jobs")
    s_list.add_argument("--status",
                          choices=["pending","claimed","running","completed",
                                    "failed","cancelled"])
    s_list.add_argument("--json", action="store_true")
    s_list.set_defaults(func=cmd_list)
    s_show = sub.add_parser("show", help="full detail for one job (id or name)")
    s_show.add_argument("job_id")
    s_show.set_defaults(func=cmd_show)
    s_cancel = sub.add_parser("cancel", help="mark pending/failed → cancelled")
    s_cancel.add_argument("job_id")
    s_cancel.set_defaults(func=cmd_cancel)
    s_requeue = sub.add_parser("requeue",
                                  help="force any non-pending job back to pending")
    s_requeue.add_argument("job_id")
    s_requeue.set_defaults(func=cmd_requeue)
    s_reload = sub.add_parser("reload",
                                  help="re-read manifest, sync queue")
    s_reload.set_defaults(func=cmd_reload)
    s_workers = sub.add_parser("workers", help="list workers + capabilities")
    s_workers.add_argument("--json", action="store_true")
    s_workers.set_defaults(func=cmd_workers)
    args = p.parse_args()
    return args.func(args)
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/training/fleet/README.md
+++ b/training/fleet/README.md
@ -0,0 +1,182 @@
 # training/fleet/ — distributed training across multiple hosts
 Symmetric to the *collection* fleet (`orchestrator/fleet.py`), but for
 *training* the models. The collection fleet is embarrassingly parallel
 (every lab host runs the same manifest and produces independent data).
 The training fleet is the opposite: each `(model, mode, hyper)` job is
 trained at most once, so the receiver coordinates which worker gets
 which job.
 ## Roles
 | Component | Where it runs | Responsibility |
 |---|---|---|
 | `cis490-trainer-receiver.service` | Pi (`10.100.0.1`) | Job queue (SQLite), claim/heartbeat/complete endpoints, artifact ingest |
 | `cis490-trainer-worker.service` | every training host | Self-detect capability → claim eligible job → run trainer → ship artifact → repeat |
 | `etc/training_manifest.toml` | Pi `/etc/cis490/` | Operator's single source of truth: which jobs to train, with what hyperparameters and capability constraints |
 | `cis490-jobs` (`tools/cis490_jobs.py`) | anywhere | Operator CLI: status, list, show, cancel, requeue, reload |
 ## How the operator controls it
 **Edit the manifest** (`/etc/cis490/training_manifest.toml`):
 - Add or remove `[[jobs]]` entries
 - Change priorities, hyperparameters, capability constraints
 - Add a new host under `[hosts.<name>]` with allow_jobs / deny_jobs / priority
 **Reload**:
 ```sh
 cis490-jobs reload
 # or:  systemctl reload cis490-trainer-receiver.service
 # or:  sudo kill -HUP $(pgrep -f training.fleet.receiver)
 ```
 The reload is idempotent. Existing rows keep their status; new jobs become
 claimable; jobs the operator removes from the manifest **stay** in the
 queue (use `cis490-jobs cancel <id>` to mark them `cancelled`).
 **Status**:
 ```sh
 cis490-jobs status
 cis490-jobs list --status running
 cis490-jobs show transformer-oracle
 cis490-jobs workers
 ```
 **Override a stuck job**:
 ```sh
 cis490-jobs requeue <job_id>   # force back to pending from any state
 cis490-jobs cancel <job_id>
 ```
 Note: `requeue` requires `$CIS490_OPERATOR_TOKEN` to match the receiver's
 configured operator token.
 ## Adding a new training host
 ### Linux (Pi, GPU box, anything that can run torch)
 ```sh
 # On the host you want to enroll, as root:
 git clone http://maxgit.wg/spectral/CIS490 /opt/cis490
 cd /opt/cis490
 python3 -m venv .venv && .venv/bin/pip install -e '.[training]'
 sudo /opt/cis490/scripts/install-training-worker.sh
 ```
 The script:
 1. Verifies the WG mesh + receiver reachability
 2. Prints the host's self-reported capability (CPU cores, RAM, CUDA, VRAM)
 3. Drops `/etc/cis490/trainer-worker.env` with the receiver URL
 4. Installs and starts `cis490-trainer-worker.service`
 5. Tails the journal so you see the worker claim its first job
 ### Windows (e.g., the operator's desktop with the GPU)
 ```powershell
 # As Administrator in PowerShell:
 git clone http://maxgit.wg/spectral/CIS490 C:\cis490
 cd C:\cis490
 py -3.11 -m venv .venv
 .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
 .\.venv\Scripts\pip install -e .
 powershell -ExecutionPolicy Bypass -File .\scripts\install-training-worker-windows.ps1
 ```
 Registers a Scheduled Task that runs the worker at startup + restarts it
 if it stops. Logs to `C:\cis490\logs\trainer-worker.log`.
 ### After enrollment
 The new host appears in `cis490-jobs workers` within ~15 s. The receiver
 sees its capability and starts handing it eligible jobs. **You did not
 need to coordinate with anyone** — the operator-defined manifest already
 described what jobs are out there; the new host just claimed the ones
 its CUDA capacity unblocked.
 ## Capability gating
 Each job declares constraints; each worker self-reports capability. The
 receiver computes eligibility and only hands a job to a worker that
 can run it.
 ```
                   require_cuda  prefer_cuda  min_vram_gib  Pi  desktop GPU
 gbt                  no            -             0           ✓        ✓
 mlp                  no            -             0           ✓        ✓
 cnn                  no            yes           1           ✓ (after  ✓
                                                                 5min grace)
 gru / lstm           yes           -             2           -        ✓
 transformer          yes           -             4           -        ✓
 transformer_ssl      yes           -             4           -        ✓
 ```
 `prefer_cuda` jobs wait `prefer_cuda_grace_s` (default 300 s) before a
 CPU worker is allowed to claim them — so a GPU worker has a chance even
 if a CPU worker is idle.
 ## Per-host policy
 In the manifest:
 ```toml
 [hosts.office-print]
 allow_jobs = ["gbt", "mlp"]    # whitelist; absent or empty = all allowed
 deny_jobs = []
 priority   = 0
 ```
 A worker matching `office-print` will only claim jobs whose `model` is in
 `allow_jobs`. Useful for "I want the Pi to never train the Transformer
 even if I happened to put pytorch-cuda on it."
 ## Architecture notes
 ### Atomic claim
 `JobQueue.claim_next` runs the eligibility filter in Python, then the
 state transition is a single `UPDATE … WHERE status='pending'` — exactly
 one of N racing workers wins.
 ### Stale-claim recovery
 Workers heartbeat every 30 s. The receiver periodically sweeps for
 claimed/running rows whose last heartbeat is older than 600 s and
 returns them to pending (or marks failed if attempts ≥ max_attempts).
 A worker crash never permanently strands a job.
 ### Artifact deduplication
 The artifact_id is the sha256 of the uploaded tarball. Re-running a
 job with bit-identical output (same code, same data, same hyper, same
 seed) → already-present, no re-upload.
 ### Schema continuity with the supervised pipeline
 The receiver's queue rows reference job_ids that hash the SAME spec
 fields the trainer uses, so re-syncing a manifest after a code change
 that doesn't affect the trained-model identity is a no-op. Changing
 `hyper.lr` produces a NEW job_id — the queue treats it as a new job
 and the old artifact stays around for comparison.
 ## Endpoints (reference)
 ```
 POST /v1/job/claim                (worker)
 POST /v1/job/{id}/heartbeat       (worker)
 POST /v1/job/{id}/complete        (worker)
 POST /v1/job/{id}/fail            (worker)
 PUT  /v1/model/{id}               (worker — uploads tarball)
 GET  /v1/jobs[?status=...]        (anyone)
 GET  /v1/workers                  (anyone)
 POST /v1/job/{id}/cancel          (operator: X-Operator-Token)
 POST /v1/job/{id}/requeue         (operator)
 POST /v1/manifest/reload          (operator)
 GET  /v1/health                   (anyone)
 ```
 ## Files
 - `capability.py` — self-detection
 - `manifest.py` — TOML loader + JobSpec / HostSpec
 - `queue.py` — SQLite queue with atomic claim
 - `store.py` — model-artifact store on the Pi
 - `receiver.py` — Starlette app exposing the endpoints above
 - `client.py` — stdlib HTTP client (no extra deps)
 - `worker.py` — long-running worker daemon
 - `__main__.py` not needed; each module has its own `main()`
--- a/training/fleet/init.py
+++ b/training/fleet/init.py
@ -0,0 +1,14 @@
 """Training fleet — multi-host distributed training coordinator.
 Mirrors the collection-side fleet pattern:
  - Single canonical training_manifest.toml (operator-edited)
  - Workers self-detect capability + report to the receiver
  - Receiver maintains a SQLite job queue, atomic claim + heartbeat
  - Workers loop: claim → train → ship artifact → repeat
  - Operator controls deployment via the manifest only
 The collection fleet is embarrassingly parallel (every host runs the
 same plan). Training jobs must be assigned at most once across the
 fleet, so the receiver coordinates claims; everything else is symmetric.
 """
--- a/training/fleet/capability.py
+++ b/training/fleet/capability.py
@ -0,0 +1,208 @@
 """Capability self-detection for a training-fleet worker.
 Each worker reports a Capability blob to the receiver at startup +
 periodically thereafter. The receiver intersects this with the
 host's declared capability in the training manifest (more
 restrictive wins) and uses the result to filter claimable jobs.
 What we report:
  hostname        — same as the worker's host_id by default
  os, arch        — for diagnostics
  cpu_cores       — physical, not hyperthreaded (best-effort)
  ram_total_gib
  ram_available_gib
  cuda_available  — bool; torch.cuda.is_available() result
  cuda_devices    — list of {name, vram_total_gib, vram_free_gib}
  torch_version
  python_version
  training_commit — git commit of /opt/cis490 (or the worker's repo)
 Detection is best-effort: if torch isn't importable we report
 cuda_available=false rather than failing. If a CUDA device is
 present but CUDA fails to initialize, we still report it as
 cuda_available=false.
 """
 from __future__ import annotations
 import os
 import platform
 import socket
 import subprocess
 import sys
 from dataclasses import asdict, dataclass, field
 from pathlib import Path
@dataclass(frozen=True)
 class CudaDevice:
    name: str
    vram_total_gib: float
    vram_free_gib: float
@dataclass(frozen=True)
 class Capability:
    hostname: str
    os: str
    arch: str
    cpu_cores: int
    ram_total_gib: float
    ram_available_gib: float
    cuda_available: bool
    cuda_devices: tuple[CudaDevice, ...]
    torch_version: str | None
    python_version: str
    training_commit: str | None
    def to_dict(self) -> dict:
        d = asdict(self)
        d["cuda_devices"] = [asdict(c) for c in self.cuda_devices]
        return d
    def best_vram_gib(self) -> float:
        """VRAM of the largest visible CUDA device (free memory)."""
        if not self.cuda_devices:
            return 0.0
        return max(c.vram_free_gib for c in self.cuda_devices)
    def can_run(self, *, require_cuda: bool, min_vram_gib: float,
                min_ram_gib: float, min_cores: int) -> tuple[bool, str]:
        """Return (eligible, reason). False eligible → reason explains why."""
        if require_cuda and not self.cuda_available:
            return False, "require_cuda but no CUDA device available"
        if require_cuda and self.best_vram_gib() < min_vram_gib:
            return False, (f"require_cuda but largest free VRAM "
                            f"{self.best_vram_gib():.1f} GiB < "
                            f"{min_vram_gib:.1f} GiB needed")
        if self.ram_available_gib < min_ram_gib:
            return False, (f"available RAM {self.ram_available_gib:.1f} GiB < "
                            f"{min_ram_gib:.1f} GiB needed")
        if self.cpu_cores < min_cores:
            return False, (f"cpu_cores {self.cpu_cores} < "
                            f"{min_cores} needed")
        return True, "ok"
 def _detect_ram_gib() -> tuple[float, float]:
    """(total, available) in GiB. Linux /proc/meminfo first, fall
    back to platform-specific tools."""
    try:
        meminfo = Path("/proc/meminfo").read_text()
        parts = {}
        for line in meminfo.splitlines():
            k, _, rest = line.partition(":")
            v = rest.strip().split()
            if v and v[-1].lower() == "kb":
                try:
                    parts[k.strip()] = int(v[0])
                except ValueError:
                    pass
        total_kib = parts.get("MemTotal", 0)
        avail_kib = parts.get("MemAvailable") or parts.get("MemFree", 0)
        return (total_kib / (1024 * 1024), avail_kib / (1024 * 1024))
    except (FileNotFoundError, PermissionError):
        pass
    # Windows/macOS fallback via psutil if installed
    try:
        import psutil  # type: ignore
        v = psutil.virtual_memory()
        return (v.total / (1024 ** 3), v.available / (1024 ** 3))
    except ImportError:
        return (0.0, 0.0)
 def _detect_cpu_cores() -> int:
    """Physical core count, best-effort."""
    try:
        # Linux /proc/cpuinfo "physical id"+"core id" pairs
        info = Path("/proc/cpuinfo").read_text()
        pairs: set[tuple[str, str]] = set()
        cur = {}
        for line in info.splitlines():
            line = line.strip()
            if not line:
                if "physical id" in cur and "core id" in cur:
                    pairs.add((cur["physical id"], cur["core id"]))
                cur = {}
                continue
            if ":" in line:
                k, _, v = line.partition(":")
                cur[k.strip()] = v.strip()
        if pairs:
            return len(pairs)
    except (FileNotFoundError, PermissionError):
        pass
    # Fallback: logical count
    return os.cpu_count() or 1
 def _detect_cuda() -> tuple[bool, tuple[CudaDevice, ...], str | None]:
    """Probe torch for CUDA. Returns (available, devices, torch_version)."""
    try:
        import torch
        torch_ver = torch.__version__
    except Exception:
        return False, (), None
    try:
        if not torch.cuda.is_available():
            return False, (), torch_ver
        devs: list[CudaDevice] = []
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            free, total = torch.cuda.mem_get_info(i)
            devs.append(CudaDevice(
                name=name,
                vram_total_gib=total / (1024 ** 3),
                vram_free_gib=free / (1024 ** 3),
            ))
        return True, tuple(devs), torch_ver
    except Exception:
        return False, (), torch_ver
 def _detect_commit(repo_root: Path) -> str | None:
    try:
        r = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=str(repo_root), capture_output=True, text=True, timeout=2,
        )
        if r.returncode == 0:
            return r.stdout.strip()
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass
    return None
 def detect(*, hostname_override: str | None = None,
            repo_root: Path | None = None) -> Capability:
    hostname = (hostname_override or os.environ.get("FLEET_HOST_ID")
                 or socket.gethostname())
    ram_total, ram_avail = _detect_ram_gib()
    cuda_available, cuda_devs, torch_ver = _detect_cuda()
    commit = _detect_commit(repo_root or Path(__file__).resolve().parents[2])
    return Capability(
        hostname=hostname,
        os=platform.system(),
        arch=platform.machine(),
        cpu_cores=_detect_cpu_cores(),
        ram_total_gib=ram_total,
        ram_available_gib=ram_avail,
        cuda_available=cuda_available,
        cuda_devices=cuda_devs,
        torch_version=torch_ver,
        python_version=platform.python_version(),
        training_commit=commit,
    )
 def main() -> int:
    """`python -m training.fleet.capability` — debug print."""
    import json
    cap = detect()
    print(json.dumps(cap.to_dict(), indent=2))
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/training/fleet/client.py
+++ b/training/fleet/client.py
@ -0,0 +1,141 @@
 """HTTP client for the trainer-receiver. Stdlib-only so the worker
 doesn't pull a new dep into pyproject.toml.
 Used by the worker daemon (training/fleet/worker.py) and by the
 operator CLI (tools/cis490_jobs.py)."""
 from __future__ import annotations
 import hashlib
 import json
 import logging
 import urllib.error
 import urllib.request
 from pathlib import Path
 from typing import Any
 log = logging.getLogger("cis490.fleet.client")
 class FleetClient:
    """HTTP client for the trainer-receiver."""
    def __init__(self, base_url: str = "https://10.100.0.1:8445",
                 *, host_id: str, operator_token: str | None = None,
                 timeout: float = 30.0) -> None:
        self.base_url = base_url.rstrip("/")
        self.host_id = host_id
        self.operator_token = operator_token
        self.timeout = timeout
    def _request(self, method: str, path: str, *,
                  body: bytes | None = None,
                  json_body: Any = None,
                  extra_headers: dict | None = None,
                  expect_status: tuple[int, ...] = (200, 201, 204)
                  ) -> tuple[int, dict | bytes]:
        url = f"{self.base_url}{path}"
        headers = {"x-lab-host": self.host_id}
        if extra_headers:
            headers.update(extra_headers)
        if json_body is not None:
            body = json.dumps(json_body).encode()
            headers["content-type"] = "application/json"
        if self.operator_token:
            headers["x-operator-token"] = self.operator_token
        req = urllib.request.Request(url, data=body, method=method,
                                       headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
                code = resp.status
                raw = resp.read()
        except urllib.error.HTTPError as e:
            return e.code, e.read()
        if code == 204 or not raw:
            return code, {}
        ctype = resp.headers.get("content-type", "")
        if "json" in ctype:
            return code, json.loads(raw)
        return code, raw
    # ------------------------------------------------------------------
    # Worker API
    # ------------------------------------------------------------------
    def claim(self, capability: dict) -> dict | None:
        code, body = self._request("POST", "/v1/job/claim",
                                     json_body={"capability": capability})
        # 200 with {"job": None} is the "no eligible job" sentinel.
        if code != 200 or not isinstance(body, dict):
            return None
        if body.get("job", "<missing>") is None:
            return None
        if not body.get("job_id"):
            return None
        return body
    def heartbeat(self, job_id: str) -> bool:
        code, _ = self._request("POST", f"/v1/job/{job_id}/heartbeat")
        return code == 200
    def complete(self, job_id: str, *, artifact_id: str) -> bool:
        code, _ = self._request("POST", f"/v1/job/{job_id}/complete",
                                  json_body={"artifact_id": artifact_id})
        return code == 200
    def fail(self, job_id: str, *, error: str) -> bool:
        code, _ = self._request("POST", f"/v1/job/{job_id}/fail",
                                  json_body={"error": error})
        return code == 200
    def upload_artifact(self, job_id: str, bundle_path: Path) -> dict:
        h = hashlib.sha256()
        with bundle_path.open("rb") as f:
            for ch in iter(lambda: f.read(1 << 20), b""):
                h.update(ch)
        sha = h.hexdigest()
        size = bundle_path.stat().st_size
        with bundle_path.open("rb") as f:
            data = f.read()
        code, body = self._request(
            "PUT", f"/v1/model/{job_id}",
            body=data,
            extra_headers={
                "x-content-sha256": sha,
                "content-length": str(size),
                "content-type": "application/octet-stream",
            },
            expect_status=(200, 201),
        )
        if code not in (200, 201):
            raise RuntimeError(f"artifact upload failed: code={code} body={body!r}")
        return body if isinstance(body, dict) else {}
    # ------------------------------------------------------------------
    # Operator API
    # ------------------------------------------------------------------
    def list_jobs(self, *, status: str | None = None) -> list[dict]:
        path = "/v1/jobs"
        if status:
            path += f"?status={status}"
        code, body = self._request("GET", path)
        return body.get("jobs", []) if isinstance(body, dict) else []
    def cancel(self, job_id: str) -> bool:
        code, body = self._request("POST", f"/v1/job/{job_id}/cancel")
        return code == 200 and bool((body or {}).get("ok"))
    def requeue(self, job_id: str) -> bool:
        code, body = self._request("POST", f"/v1/job/{job_id}/requeue")
        return code == 200 and bool((body or {}).get("ok"))
    def reload_manifest(self) -> dict:
        code, body = self._request("POST", "/v1/manifest/reload")
        if code != 200:
            raise RuntimeError(f"reload failed: code={code} body={body!r}")
        return body if isinstance(body, dict) else {}
    def workers(self) -> list[dict]:
        code, body = self._request("GET", "/v1/workers")
        return body.get("workers", []) if isinstance(body, dict) else []
--- a/training/fleet/manifest.py
+++ b/training/fleet/manifest.py
@ -0,0 +1,232 @@
 """Loader + validator for ``training_manifest.toml``.
 Every job in the manifest is hashed into a stable ``job_id`` based on
 ``(model, mode, hyper-blob, schema_version)`` so the same manifest entry
 always maps to the same queue row across reload/restart. This makes
 ``systemctl reload cis490-receiver`` idempotent: jobs already complete
 stay complete; new jobs become claimable; deleted jobs are not removed
 (operator marks them cancelled explicitly).
 """
 from __future__ import annotations
 import hashlib
 import json
 import tomllib
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any
 CANONICAL_FILENAMES = (
    "/etc/cis490/training_manifest.toml",
    "training_manifest.toml",
 )
 class TrainingManifestError(ValueError):
    pass
@dataclass(frozen=True)
 class HostSpec:
    name: str
    description: str = ""
    priority: int = 0
    allow_jobs: tuple[str, ...] = ()
    deny_jobs: tuple[str, ...] = ()
    def is_model_allowed(self, model: str) -> bool:
        if model in self.deny_jobs:
            return False
        if self.allow_jobs and model not in self.allow_jobs:
            return False
        return True
@dataclass(frozen=True)
 class JobSpec:
    name: str
    model: str
    mode: str
    priority: int = 0
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0
    min_ram_gib: float = 4.0
    min_cores: int = 1
    allowed_hosts: tuple[str, ...] = ()      # if non-empty, only these hosts
    denied_hosts: tuple[str, ...] = ()
    hyper: dict[str, Any] = field(default_factory=dict)
    split_recipe: str = "host"
    train_hosts: tuple[str, ...] = ("elliott-thinkpad",)
    seed: int = 0
    n_resamples: int = 1000
    @property
    def job_id(self) -> str:
        """Stable hash over all the fields that define what the job IS.
        Excludes priority + cuda preferences (those are scheduling-only
        and shouldn't change the identity of a completed artifact)."""
        payload = {
            "model": self.model, "mode": self.mode,
            "hyper": self.hyper,
            "split_recipe": self.split_recipe,
            "train_hosts": list(self.train_hosts),
            "seed": self.seed,
        }
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:16]
    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "job_id": self.job_id,
            "model": self.model, "mode": self.mode,
            "priority": self.priority,
            "require_cuda": self.require_cuda,
            "prefer_cuda": self.prefer_cuda,
            "min_vram_gib": self.min_vram_gib,
            "min_ram_gib": self.min_ram_gib,
            "min_cores": self.min_cores,
            "allowed_hosts": list(self.allowed_hosts),
            "denied_hosts": list(self.denied_hosts),
            "hyper": dict(self.hyper),
            "split_recipe": self.split_recipe,
            "train_hosts": list(self.train_hosts),
            "seed": self.seed,
            "n_resamples": self.n_resamples,
        }
@dataclass(frozen=True)
 class TrainingManifest:
    schema_version: int
    name: str
    defaults: dict[str, Any]
    hosts: dict[str, HostSpec]
    jobs: tuple[JobSpec, ...]
 # Allowed model names — keep in sync with training/models/REGISTRY
 _ALLOWED_MODELS = frozenset({
    "gbt", "mlp", "cnn", "gru", "lstm", "transformer", "transformer_ssl",
 })
 _ALLOWED_MODES = frozenset({"realistic", "oracle"})
 _ALLOWED_RECIPES = frozenset({"host", "sample", "time"})
 def load(path: Path) -> TrainingManifest:
    if not path.exists():
        raise TrainingManifestError(f"manifest not found at {path}")
    try:
        raw = tomllib.loads(path.read_text())
    except tomllib.TOMLDecodeError as e:
        raise TrainingManifestError(f"invalid TOML at {path}: {e}") from e
    sv = raw.get("schema_version")
    if sv != 1:
        raise TrainingManifestError(
            f"schema_version must be 1, got {sv}"
        )
    defaults = raw.get("defaults", {}) or {}
    hosts_raw = raw.get("hosts", {}) or {}
    jobs_raw = raw.get("jobs", []) or []
    if not jobs_raw:
        raise TrainingManifestError("manifest has no [[jobs]] entries")
    hosts: dict[str, HostSpec] = {}
    for hname, h in hosts_raw.items():
        if not isinstance(h, dict):
            raise TrainingManifestError(
                f"hosts.{hname} must be a table"
            )
        hosts[hname] = HostSpec(
            name=hname,
            description=str(h.get("description", "")),
            priority=int(h.get("priority", 0)),
            allow_jobs=tuple(h.get("allow_jobs", [])),
            deny_jobs=tuple(h.get("deny_jobs", [])),
        )
    seen_ids: set[str] = set()
    jobs: list[JobSpec] = []
    for j in jobs_raw:
        if "name" not in j:
            raise TrainingManifestError(f"job missing 'name': {j}")
        if "model" not in j:
            raise TrainingManifestError(f"job '{j['name']}' missing 'model'")
        model = str(j["model"])
        if model not in _ALLOWED_MODELS:
            raise TrainingManifestError(
                f"job '{j['name']}': model {model!r} not in "
                f"{sorted(_ALLOWED_MODELS)}"
            )
        mode = str(j.get("mode", "realistic"))
        if mode not in _ALLOWED_MODES:
            raise TrainingManifestError(
                f"job '{j['name']}': mode {mode!r} not in "
                f"{sorted(_ALLOWED_MODES)}"
            )
        recipe = str(j.get("split_recipe", defaults.get("split_recipe", "host")))
        if recipe not in _ALLOWED_RECIPES:
            raise TrainingManifestError(
                f"job '{j['name']}': split_recipe {recipe!r} not in "
                f"{sorted(_ALLOWED_RECIPES)}"
            )
        spec = JobSpec(
            name=str(j["name"]),
            model=model,
            mode=mode,
            priority=int(j.get("priority", 0)),
            require_cuda=bool(j.get("require_cuda", False)),
            prefer_cuda=bool(j.get("prefer_cuda", False)),
            min_vram_gib=float(j.get("min_vram_gib", 0.0)),
            min_ram_gib=float(j.get("min_ram_gib", defaults.get("min_ram_gib", 4.0))),
            min_cores=int(j.get("min_cores", defaults.get("min_cores", 1))),
            allowed_hosts=tuple(j.get("allowed_hosts", [])),
            denied_hosts=tuple(j.get("denied_hosts", [])),
            hyper=dict(j.get("hyper", {})),
            split_recipe=recipe,
            train_hosts=tuple(j.get("train_hosts",
                                     defaults.get("train_hosts",
                                                   ["elliott-thinkpad"]))),
            seed=int(j.get("seed", defaults.get("seed", 0))),
            n_resamples=int(j.get("n_resamples",
                                   defaults.get("n_resamples", 1000))),
        )
        if spec.job_id in seen_ids:
            # Two manifest entries with identical (model, mode, hyper, …) —
            # they'd hash to the same job_id and collide. Operator error.
            raise TrainingManifestError(
                f"job '{spec.name}' duplicates an earlier job by content "
                f"(same model+mode+hyper+split). Disambiguate via hyper."
            )
        seen_ids.add(spec.job_id)
        jobs.append(spec)
    return TrainingManifest(
        schema_version=1,
        name=str(raw.get("name", "training-fleet")),
        defaults=dict(defaults),
        hosts=hosts,
        jobs=tuple(jobs),
    )
 def load_canonical(repo_root: Path | None = None) -> TrainingManifest:
    """Load the manifest from the standard locations: /etc/cis490/ first,
    then repo_root/training_manifest.toml. Raises if neither exists."""
    candidates: list[Path] = []
    candidates.append(Path("/etc/cis490/training_manifest.toml"))
    if repo_root is not None:
        candidates.append(repo_root / "training_manifest.toml")
    candidates.append(Path("training_manifest.toml"))
    for p in candidates:
        if p.exists():
            return load(p)
    raise TrainingManifestError(
        f"no training_manifest.toml found at any of: "
        f"{[str(p) for p in candidates]}"
    )
--- a/training/fleet/queue.py
+++ b/training/fleet/queue.py
@ -0,0 +1,422 @@
 """SQLite-backed job queue for the training fleet.
 Used by the receiver. One file: ``training_jobs.db``. One main table:
  jobs(job_id, name, spec_json, status, claimed_by, claimed_at,
        heartbeat_at, completed_at, attempts, last_error, artifact_id)
 Job statuses:
  pending     — claimable
  claimed     — assigned to a worker but not yet running (or briefly so)
  running     — worker has heartbeated since claim
  completed   — artifact uploaded
  failed      — worker reported failure
  cancelled   — operator marked cancelled; never reclaimed
 Atomicity: every state transition uses a single UPDATE with both a WHERE
 clause matching the prior state and a RETURNING (where supported) so two
 workers racing the same row see exactly one winner.
 Stale claim handling: a job in claimed/running with no heartbeat for
 ``stale_after_s`` (default 600 s) is automatically returned to pending
 on the next ``sweep()`` call. Re-queue increments ``attempts``; if a job
 fails ``max_attempts`` times consecutively it stays failed.
 The queue is the receiver's responsibility, not the worker's. Workers
 talk to the receiver over HTTP and never see this file directly.
 """
 from __future__ import annotations
 import json
 import logging
 import sqlite3
 import time
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Iterable
 log = logging.getLogger("cis490.fleet.queue")
 _SCHEMA = """
 CREATE TABLE IF NOT EXISTS jobs (
    job_id        TEXT PRIMARY KEY,
    name          TEXT NOT NULL,
    spec_json     TEXT NOT NULL,
    status        TEXT NOT NULL CHECK (status IN
                    ('pending','claimed','running',
                     'completed','failed','cancelled')),
    claimed_by    TEXT,
    claimed_at    REAL,
    heartbeat_at  REAL,
    completed_at  REAL,
    attempts      INTEGER NOT NULL DEFAULT 0,
    last_error    TEXT,
    artifact_id   TEXT,
    created_at    REAL NOT NULL,
    updated_at    REAL NOT NULL
 );
 CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs(status);
 CREATE INDEX IF NOT EXISTS idx_jobs_claimed_by ON jobs(claimed_by);
 CREATE TABLE IF NOT EXISTS workers (
    hostname        TEXT PRIMARY KEY,
    capability_json TEXT NOT NULL,
    last_seen       REAL NOT NULL,
    last_claim_id   TEXT
 );
 """
@dataclass(frozen=True)
 class JobRow:
    job_id: str
    name: str
    spec: dict[str, Any]
    status: str
    claimed_by: str | None
    claimed_at: float | None
    heartbeat_at: float | None
    completed_at: float | None
    attempts: int
    last_error: str | None
    artifact_id: str | None
 class JobQueue:
    def __init__(self, db_path: Path) -> None:
        self.db_path = db_path
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = sqlite3.connect(
            str(db_path), isolation_level=None,    # autocommit; we use transactions explicitly
            check_same_thread=False, timeout=30.0,
        )
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("PRAGMA synchronous=NORMAL")
        self._conn.execute("PRAGMA foreign_keys=ON")
        self._conn.executescript(_SCHEMA)
    # ------------------------------------------------------------------
    # Sync from manifest
    # ------------------------------------------------------------------
    def sync_from_manifest(self, jobs: Iterable[dict]) -> dict[str, int]:
        """Idempotent insert of manifest jobs. Existing rows keep their
        status; only spec_json/name are updated for jobs that already
        exist (so editing priority/hyper in the manifest then
        SIGHUP-reloading is safe). Jobs deleted from the manifest are
        NOT removed — operator must explicitly cancel them via the
        control CLI.
        Returns counts {"inserted", "updated", "unchanged"}.
        """
        now = time.time()
        c = {"inserted": 0, "updated": 0, "unchanged": 0}
        with self._conn:
            for job in jobs:
                job_id = job["job_id"]
                spec_json = json.dumps(job, sort_keys=True)
                row = self._conn.execute(
                    "SELECT spec_json, name FROM jobs WHERE job_id=?",
                    (job_id,),
                ).fetchone()
                if row is None:
                    self._conn.execute(
                        "INSERT INTO jobs(job_id, name, spec_json, status, "
                        "attempts, created_at, updated_at) "
                        "VALUES (?, ?, ?, 'pending', 0, ?, ?)",
                        (job_id, job["name"], spec_json, now, now),
                    )
                    c["inserted"] += 1
                elif row[0] != spec_json or row[1] != job["name"]:
                    self._conn.execute(
                        "UPDATE jobs SET name=?, spec_json=?, updated_at=? "
                        "WHERE job_id=?",
                        (job["name"], spec_json, now, job_id),
                    )
                    c["updated"] += 1
                else:
                    c["unchanged"] += 1
        return c
    # ------------------------------------------------------------------
    # Claim
    # ------------------------------------------------------------------
    def claim_next(
        self,
        *,
        worker_hostname: str,
        capability: dict,
        host_spec: dict | None = None,
        prefer_cuda_grace_s: float = 300.0,
    ) -> JobRow | None:
        """Atomically claim the highest-priority pending job that this
        worker can run. Returns None if nothing is eligible.
        Capability filter applies inline. We pick within Python rather
        than SQL because the eligibility logic (require_cuda, min_vram,
        prefer_cuda grace, host allow/deny) is more legible here and
        the queue is small (~hundreds of rows).
        """
        now = time.time()
        with self._conn:
            self._record_worker_seen(worker_hostname, capability, now)
            # Pull all pending rows ordered by priority desc, created_at asc
            rows = self._conn.execute(
                "SELECT job_id, name, spec_json, attempts FROM jobs "
                "WHERE status='pending' "
                "ORDER BY json_extract(spec_json, '$.priority') DESC, "
                "         created_at ASC"
            ).fetchall()
            for jid, name, spec_json, attempts in rows:
                spec = json.loads(spec_json)
                ok, reason = _eligible(
                    spec=spec, hostname=worker_hostname,
                    capability=capability, host_spec=host_spec,
                    prefer_cuda_grace_s=prefer_cuda_grace_s,
                    job_age_s=(now - self._conn.execute(
                        "SELECT created_at FROM jobs WHERE job_id=?",
                        (jid,),
                    ).fetchone()[0]),
                )
                if not ok:
                    continue
                # Atomic claim: only succeeds if the row is still pending.
                upd = self._conn.execute(
                    "UPDATE jobs SET status='claimed', claimed_by=?, "
                    "claimed_at=?, heartbeat_at=?, attempts=attempts+1, "
                    "last_error=NULL, updated_at=? "
                    "WHERE job_id=? AND status='pending'",
                    (worker_hostname, now, now, now, jid),
                )
                if upd.rowcount == 1:
                    return self.get(jid)
                # Lost the race; try the next candidate
                continue
        return None
    # ------------------------------------------------------------------
    # Heartbeat / complete / fail
    # ------------------------------------------------------------------
    def heartbeat(self, job_id: str, worker: str) -> bool:
        now = time.time()
        with self._conn:
            r = self._conn.execute(
                "UPDATE jobs SET status='running', heartbeat_at=?, "
                "updated_at=? WHERE job_id=? AND claimed_by=? "
                "AND status IN ('claimed','running')",
                (now, now, job_id, worker),
            )
            return r.rowcount == 1
    def complete(self, job_id: str, worker: str, *,
                  artifact_id: str) -> bool:
        now = time.time()
        with self._conn:
            r = self._conn.execute(
                "UPDATE jobs SET status='completed', completed_at=?, "
                "artifact_id=?, updated_at=? "
                "WHERE job_id=? AND claimed_by=? AND status IN "
                "('claimed','running')",
                (now, artifact_id, now, job_id, worker),
            )
            return r.rowcount == 1
    def fail(self, job_id: str, worker: str, *, error: str) -> bool:
        now = time.time()
        with self._conn:
            r = self._conn.execute(
                "UPDATE jobs SET status='failed', last_error=?, "
                "updated_at=? WHERE job_id=? AND claimed_by=? "
                "AND status IN ('claimed','running')",
                (error[:1024], now, job_id, worker),
            )
            return r.rowcount == 1
    # ------------------------------------------------------------------
    # Operator control
    # ------------------------------------------------------------------
    def cancel(self, job_id: str) -> bool:
        now = time.time()
        with self._conn:
            r = self._conn.execute(
                "UPDATE jobs SET status='cancelled', updated_at=? "
                "WHERE job_id=? AND status IN ('pending','failed')",
                (now, job_id),
            )
            return r.rowcount == 1
    def requeue(self, job_id: str) -> bool:
        """Move a job back to pending. Resets attempts.
        Operator override: force-requeue ANY non-pending state, including
        claimed/running. Useful when a worker has crashed without the
        sweep grace window having elapsed yet."""
        now = time.time()
        with self._conn:
            r = self._conn.execute(
                "UPDATE jobs SET status='pending', claimed_by=NULL, "
                "claimed_at=NULL, heartbeat_at=NULL, completed_at=NULL, "
                "attempts=0, last_error=NULL, artifact_id=NULL, updated_at=? "
                "WHERE job_id=? AND status != 'pending'",
                (now, job_id),
            )
            return r.rowcount == 1
    def sweep_stale(self, *, stale_after_s: float = 600.0,
                     max_attempts: int = 3) -> int:
        """Return claimed/running jobs with no heartbeat in `stale_after_s`
        to pending (or to failed if attempts exceeds max_attempts).
        Returns the number of rows touched."""
        now = time.time()
        with self._conn:
            stale_cutoff = now - stale_after_s
            # First pass: jobs over max_attempts → failed
            r1 = self._conn.execute(
                "UPDATE jobs SET status='failed', "
                "last_error='exceeded max_attempts due to stale claims', "
                "updated_at=? "
                "WHERE status IN ('claimed','running') "
                "AND heartbeat_at < ? AND attempts >= ?",
                (now, stale_cutoff, max_attempts),
            )
            # Second pass: stale but under max_attempts → pending
            r2 = self._conn.execute(
                "UPDATE jobs SET status='pending', claimed_by=NULL, "
                "claimed_at=NULL, heartbeat_at=NULL, updated_at=? "
                "WHERE status IN ('claimed','running') "
                "AND heartbeat_at < ?",
                (now, stale_cutoff),
            )
            return r1.rowcount + r2.rowcount
    # ------------------------------------------------------------------
    # Read API
    # ------------------------------------------------------------------
    def get(self, job_id: str) -> JobRow | None:
        r = self._conn.execute(
            "SELECT job_id, name, spec_json, status, claimed_by, "
            "claimed_at, heartbeat_at, completed_at, attempts, last_error, "
            "artifact_id FROM jobs WHERE job_id=?",
            (job_id,),
        ).fetchone()
        if r is None:
            return None
        return JobRow(
            job_id=r[0], name=r[1], spec=json.loads(r[2]),
            status=r[3], claimed_by=r[4], claimed_at=r[5],
            heartbeat_at=r[6], completed_at=r[7], attempts=r[8],
            last_error=r[9], artifact_id=r[10],
        )
    def list_jobs(self, *, status: str | None = None) -> list[JobRow]:
        sql = ("SELECT job_id, name, spec_json, status, claimed_by, "
               "claimed_at, heartbeat_at, completed_at, attempts, "
               "last_error, artifact_id FROM jobs")
        params: tuple = ()
        if status is not None:
            sql += " WHERE status=?"
            params = (status,)
        sql += (" ORDER BY json_extract(spec_json, '$.priority') DESC, "
                "created_at ASC")
        return [
            JobRow(
                job_id=r[0], name=r[1], spec=json.loads(r[2]),
                status=r[3], claimed_by=r[4], claimed_at=r[5],
                heartbeat_at=r[6], completed_at=r[7], attempts=r[8],
                last_error=r[9], artifact_id=r[10],
            )
            for r in self._conn.execute(sql, params).fetchall()
        ]
    def workers(self) -> list[dict]:
        rows = self._conn.execute(
            "SELECT hostname, capability_json, last_seen, last_claim_id "
            "FROM workers ORDER BY last_seen DESC"
        ).fetchall()
        return [
            {"hostname": r[0],
             "capability": json.loads(r[1]),
             "last_seen": r[2],
             "last_claim_id": r[3]}
            for r in rows
        ]
    # ------------------------------------------------------------------
    # Internal
    # ------------------------------------------------------------------
    def _record_worker_seen(self, hostname: str, capability: dict,
                              now: float) -> None:
        cap_json = json.dumps(capability, sort_keys=True)
        self._conn.execute(
            "INSERT INTO workers(hostname, capability_json, last_seen) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(hostname) DO UPDATE SET "
            "capability_json=excluded.capability_json, "
            "last_seen=excluded.last_seen",
            (hostname, cap_json, now),
        )
 # --------------------------------------------------------------------
 # Eligibility logic — pulled out so we can test it directly
 # --------------------------------------------------------------------
 def _eligible(
    *,
    spec: dict,
    hostname: str,
    capability: dict,
    host_spec: dict | None,
    prefer_cuda_grace_s: float,
    job_age_s: float,
 ) -> tuple[bool, str]:
    """Return (eligible, reason)."""
    # 1. Host-level allow/deny from manifest (operator's per-host policy)
    if host_spec is not None:
        deny_jobs = set(host_spec.get("deny_jobs") or ())
        allow_jobs = set(host_spec.get("allow_jobs") or ())
        if spec["model"] in deny_jobs:
            return False, f"host {hostname} deny_jobs includes {spec['model']!r}"
        if allow_jobs and spec["model"] not in allow_jobs:
            return False, (f"host {hostname} allow_jobs whitelist excludes "
                            f"{spec['model']!r}")
    # 2. Per-job allowed_hosts / denied_hosts
    allowed = set(spec.get("allowed_hosts") or ())
    if allowed and hostname not in allowed:
        return False, f"job restricted to {sorted(allowed)}; hostname={hostname}"
    if hostname in (spec.get("denied_hosts") or ()):
        return False, f"job denies hostname={hostname}"
    # 3. CUDA + VRAM + RAM + cores
    cuda_avail = bool(capability.get("cuda_available"))
    vram_free = max((d.get("vram_free_gib", 0.0)
                      for d in capability.get("cuda_devices", [])),
                     default=0.0)
    ram_avail = float(capability.get("ram_available_gib", 0.0))
    cores = int(capability.get("cpu_cores", 0))
    if spec.get("require_cuda") and not cuda_avail:
        return False, "require_cuda but no CUDA on this worker"
    if spec.get("require_cuda") and vram_free < float(spec.get("min_vram_gib", 0.0)):
        return False, (f"require_cuda but vram_free {vram_free:.1f} GiB < "
                        f"{spec.get('min_vram_gib')} GiB needed")
    if ram_avail < float(spec.get("min_ram_gib", 0.0)):
        return False, (f"ram_available {ram_avail:.1f} GiB < "
                        f"{spec.get('min_ram_gib')} GiB needed")
    if cores < int(spec.get("min_cores", 0)):
        return False, (f"cpu_cores {cores} < "
                        f"{spec.get('min_cores')} needed")
    # 4. prefer_cuda grace: if job prefers CUDA but this worker is CPU,
    # only let the CPU worker claim after the grace window has expired
    # (i.e. assume a CUDA worker had a chance and didn't take it).
    if (spec.get("prefer_cuda") and not cuda_avail
            and job_age_s < prefer_cuda_grace_s):
        return False, (f"prefer_cuda; waiting {prefer_cuda_grace_s:.0f}s for "
                        f"a CUDA worker (job age {job_age_s:.0f}s)")
    return True, "ok"
--- a/training/fleet/receiver.py
+++ b/training/fleet/receiver.py
@ -0,0 +1,379 @@
 """Starlette app — training fleet coordinator endpoints.
 Runs as its own process (``cis490-trainer-receiver.service``) on the
 Pi, listening on a loopback port (default 127.0.0.1:8445). Caddy in
 front of it mTLS-gates external access exactly the way the existing
 receiver does.
 Endpoints:
  POST /v1/job/claim
        body  : {"capability": {...}}
        header: X-Lab-Host: <hostname>
        return: 200 {job_id, name, model, mode, hyper, ...} or 204
  POST /v1/job/{job_id}/heartbeat
        header: X-Lab-Host
        return: 200 {ok: true} or 410 if reclaimed/cancelled
  POST /v1/job/{job_id}/complete
        body  : {"artifact_id": "<sha256>"}
        header: X-Lab-Host
        return: 200
  POST /v1/job/{job_id}/fail
        body  : {"error": "..."}
        return: 200
  PUT /v1/model/{job_id}
        header: X-Content-SHA256, X-Lab-Host
        body  : tar.zst bundle
        return: 201 {artifact_id, size_bytes}
  GET /v1/jobs                    — operator status, no body
  POST /v1/job/{job_id}/cancel
  POST /v1/job/{job_id}/requeue
  POST /v1/manifest/reload        — operator: re-read manifest
  GET /v1/workers                 — last-seen capability per worker
  GET /v1/health                  — liveness probe
 The control endpoints (cancel / requeue / reload) require a separate
 operator-only header X-Operator-Token to match a configured value.
 Worker endpoints are unauthenticated at this layer — Caddy + mTLS
 handles authentication upstream.
 """
 from __future__ import annotations
 import json
 import logging
 import secrets
 import time
 from pathlib import Path
 from starlette.applications import Starlette
 from starlette.requests import Request
 from starlette.responses import JSONResponse, Response
 from starlette.routing import Route
 from training.fleet.manifest import (
    TrainingManifestError, load_canonical, load,
 )
 from training.fleet.queue import JobQueue
 from training.fleet.store import ModelStore, is_valid_id
 log = logging.getLogger("cis490.fleet.receiver")
 def make_app(
    *,
    queue: JobQueue,
    store: ModelStore,
    manifest_path: Path,
    operator_token: str | None = None,
    max_artifact_bytes: int = 1024 * 1024 * 1024,    # 1 GiB
    sweep_every_s: float = 60.0,
 ) -> Starlette:
    """Build the trainer-receiver Starlette app."""
    last_sweep = {"t": 0.0}
    def _maybe_sweep() -> None:
        now = time.time()
        if now - last_sweep["t"] > sweep_every_s:
            n = queue.sweep_stale()
            if n:
                log.info("swept %d stale claim(s)", n)
            last_sweep["t"] = now
    def _operator_check(request: Request) -> Response | None:
        if operator_token is None:
            return None
        presented = request.headers.get("x-operator-token", "")
        if not secrets.compare_digest(presented, operator_token):
            return JSONResponse({"error": "operator token required"},
                                 status_code=401)
        return None
    def _hostname(request: Request) -> str:
        return request.headers.get("x-lab-host", "").strip()
    # ------------------------------------------------------------------
    # Worker endpoints
    # ------------------------------------------------------------------
    async def claim(request: Request) -> JSONResponse:
        _maybe_sweep()
        host = _hostname(request)
        if not is_valid_id(host):
            return JSONResponse({"error": "X-Lab-Host required"},
                                 status_code=400)
        try:
            body = await request.json()
        except (json.JSONDecodeError, ValueError):
            return JSONResponse({"error": "body must be JSON"}, status_code=400)
        capability = (body or {}).get("capability") or {}
        # Look up host_spec from the loaded manifest (re-read each time
        # for simplicity; manifest is small)
        try:
            man = load(manifest_path)
        except TrainingManifestError as e:
            log.warning("claim: manifest load failed: %s", e)
            man = None
        host_spec = None
        if man is not None and host in man.hosts:
            host_spec = {
                "allow_jobs": list(man.hosts[host].allow_jobs),
                "deny_jobs": list(man.hosts[host].deny_jobs),
            }
        job = queue.claim_next(
            worker_hostname=host, capability=capability, host_spec=host_spec,
        )
        if job is None:
            # HTTP 204 forbids a body; we want the body, so 200 + sentinel.
            return JSONResponse({"job": None})
        return JSONResponse({
            "job_id": job.job_id, "name": job.name,
            "spec": job.spec, "attempts": job.attempts,
        })
    async def heartbeat(request: Request) -> JSONResponse:
        host = _hostname(request)
        job_id = request.path_params["job_id"]
        if not is_valid_id(host):
            return JSONResponse({"error": "X-Lab-Host required"},
                                 status_code=400)
        ok = queue.heartbeat(job_id, host)
        if not ok:
            return JSONResponse(
                {"error": "job no longer claimed by you"},
                status_code=410,
            )
        return JSONResponse({"ok": True})
    async def complete(request: Request) -> JSONResponse:
        host = _hostname(request)
        job_id = request.path_params["job_id"]
        try:
            body = await request.json()
        except (json.JSONDecodeError, ValueError):
            return JSONResponse({"error": "body must be JSON"}, status_code=400)
        artifact_id = (body or {}).get("artifact_id")
        if not artifact_id:
            return JSONResponse({"error": "artifact_id required"},
                                 status_code=400)
        ok = queue.complete(job_id, host, artifact_id=artifact_id)
        if not ok:
            return JSONResponse(
                {"error": "job not in claimed/running for this worker"},
                status_code=410,
            )
        log.info("job %s completed by %s artifact=%s",
                 job_id, host, artifact_id[:12])
        return JSONResponse({"ok": True})
    async def fail(request: Request) -> JSONResponse:
        host = _hostname(request)
        job_id = request.path_params["job_id"]
        try:
            body = await request.json()
        except (json.JSONDecodeError, ValueError):
            return JSONResponse({"error": "body must be JSON"}, status_code=400)
        err = (body or {}).get("error", "no error message")
        ok = queue.fail(job_id, host, error=str(err))
        if not ok:
            return JSONResponse(
                {"error": "job not in claimed/running for this worker"},
                status_code=410,
            )
        log.warning("job %s failed by %s: %s", job_id, host, str(err)[:200])
        return JSONResponse({"ok": True})
    async def put_model(request: Request) -> JSONResponse:
        host = _hostname(request)
        job_id = request.path_params["job_id"]
        if not is_valid_id(host) or not is_valid_id(job_id):
            return JSONResponse({"error": "bad host or job_id"},
                                 status_code=400)
        job = queue.get(job_id)
        if job is None:
            return JSONResponse({"error": "unknown job_id"}, status_code=404)
        expected_sha = request.headers.get("x-content-sha256", "").lower()
        if not expected_sha or len(expected_sha) != 64:
            return JSONResponse(
                {"error": "X-Content-SHA256 (64 hex) required"},
                status_code=400,
            )
        cl = request.headers.get("content-length")
        if cl is not None:
            try:
                if int(cl) > max_artifact_bytes:
                    return JSONResponse(
                        {"error": "artifact exceeds max size"},
                        status_code=413,
                    )
            except ValueError:
                return JSONResponse({"error": "bad Content-Length"},
                                     status_code=400)
        result = await store.ingest_stream(
            job_id=job_id, model=job.spec["model"], mode=job.spec["mode"],
            worker=host, expected_sha256=expected_sha,
            body=request.stream(), max_bytes=max_artifact_bytes,
        )
        if result.status == "stored":
            return JSONResponse(
                {"status": "stored", "artifact_id": result.artifact_id,
                 "size_bytes": result.size_bytes},
                status_code=201,
            )
        if result.status == "already-present":
            return JSONResponse(
                {"status": "already-present",
                 "artifact_id": result.artifact_id},
                status_code=200,
            )
        if result.status == "sha-mismatch":
            return JSONResponse(
                {"status": "sha-mismatch",
                 "actual_sha256": result.artifact_id},
                status_code=400,
            )
        if result.status == "too-large":
            return JSONResponse({"error": "artifact exceeds max size"},
                                 status_code=413)
        return JSONResponse({"error": "unknown ingest result"},
                             status_code=500)
    # ------------------------------------------------------------------
    # Operator endpoints
    # ------------------------------------------------------------------
    async def list_jobs(request: Request) -> JSONResponse:
        status_filter = request.query_params.get("status")
        rows = queue.list_jobs(
            status=status_filter if status_filter else None
        )
        return JSONResponse({
            "jobs": [{
                "job_id": r.job_id, "name": r.name,
                "model": r.spec.get("model"), "mode": r.spec.get("mode"),
                "priority": r.spec.get("priority"),
                "status": r.status, "claimed_by": r.claimed_by,
                "claimed_at": r.claimed_at,
                "heartbeat_at": r.heartbeat_at,
                "completed_at": r.completed_at,
                "attempts": r.attempts,
                "last_error": r.last_error,
                "artifact_id": r.artifact_id,
            } for r in rows],
        })
    async def cancel(request: Request) -> JSONResponse:
        guard = _operator_check(request)
        if guard is not None:
            return guard
        job_id = request.path_params["job_id"]
        ok = queue.cancel(job_id)
        return JSONResponse({"ok": ok})
    async def requeue(request: Request) -> JSONResponse:
        guard = _operator_check(request)
        if guard is not None:
            return guard
        job_id = request.path_params["job_id"]
        ok = queue.requeue(job_id)
        return JSONResponse({"ok": ok})
    async def reload(request: Request) -> JSONResponse:
        guard = _operator_check(request)
        if guard is not None:
            return guard
        try:
            man = load(manifest_path)
        except TrainingManifestError as e:
            return JSONResponse({"error": str(e)}, status_code=400)
        counts = queue.sync_from_manifest([j.to_dict() for j in man.jobs])
        return JSONResponse({"ok": True, "counts": counts,
                              "n_jobs": len(man.jobs)})
    async def workers(request: Request) -> JSONResponse:
        return JSONResponse({"workers": queue.workers()})
    async def health(request: Request) -> JSONResponse:
        return JSONResponse({"status": "ok"})
    routes = [
        # Worker
        Route("/v1/job/claim", claim, methods=["POST"]),
        Route("/v1/job/{job_id}/heartbeat", heartbeat, methods=["POST"]),
        Route("/v1/job/{job_id}/complete", complete, methods=["POST"]),
        Route("/v1/job/{job_id}/fail", fail, methods=["POST"]),
        Route("/v1/model/{job_id}", put_model, methods=["PUT"]),
        # Operator
        Route("/v1/jobs", list_jobs, methods=["GET"]),
        Route("/v1/job/{job_id}/cancel", cancel, methods=["POST"]),
        Route("/v1/job/{job_id}/requeue", requeue, methods=["POST"]),
        Route("/v1/manifest/reload", reload, methods=["POST"]),
        Route("/v1/workers", workers, methods=["GET"]),
        Route("/v1/health", health, methods=["GET"]),
    ]
    return Starlette(routes=routes)
 def main() -> int:
    """python -m training.fleet.receiver"""
    import argparse, os, uvicorn
    ap = argparse.ArgumentParser()
    ap.add_argument("--listen-addr", default="127.0.0.1:8445")
    ap.add_argument("--manifest", type=Path,
                    default=Path("/etc/cis490/training_manifest.toml"))
    ap.add_argument("--db", type=Path,
                    default=Path("/var/lib/cis490/training_jobs.db"))
    ap.add_argument("--store-root", type=Path,
                    default=Path("/var/lib/cis490/models"))
    ap.add_argument("--incoming-root", type=Path,
                    default=Path("/var/lib/cis490/incoming-models"))
    ap.add_argument("--index-path", type=Path,
                    default=Path("/var/lib/cis490/models/index.jsonl"))
    ap.add_argument("--operator-token-env", default="CIS490_OPERATOR_TOKEN")
    ap.add_argument("--log-level", default="INFO")
    args = ap.parse_args()
    logging.basicConfig(
        level=args.log_level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    # Load manifest + sync queue at startup
    try:
        man = load(args.manifest)
    except TrainingManifestError as e:
        log.error("manifest load failed: %s", e)
        return 78
    queue = JobQueue(args.db)
    counts = queue.sync_from_manifest([j.to_dict() for j in man.jobs])
    log.info("manifest: %s; sync counts: %s", man.name, counts)
    store = ModelStore(args.store_root, args.incoming_root, args.index_path)
    operator_token = os.environ.get(args.operator_token_env)
    if not operator_token:
        log.warning(
            "no operator token configured (set $%s); "
            "operator endpoints will be open from loopback",
            args.operator_token_env,
        )
    app = make_app(queue=queue, store=store, manifest_path=args.manifest,
                    operator_token=operator_token)
    host, _, port = args.listen_addr.partition(":")
    uvicorn.run(app, host=host, port=int(port), log_level=args.log_level.lower())
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/training/fleet/store.py
+++ b/training/fleet/store.py
@ -0,0 +1,123 @@
 """Trained-artifact store on the Pi.
 Mirrors ``receiver/store.py`` for episodes — same atomic-write,
 sha256-verified, stream-ingest design — but stores trained models
 under ``/var/lib/cis490/models/<model>_<mode>/<artifact_id>/``.
 An ``artifact_id`` is the sha256 of the uploaded tarball. The same
 job_id can produce multiple artifact_ids if the operator re-runs the
 job (different code commit, different epoch, different seed); the
 queue records the latest artifact_id for each completed job, but the
 store keeps every uploaded artifact so re-runs can be compared.
 Layout::
    /var/lib/cis490/models/
        index.jsonl                                — append-only ingest log
        <model>_<mode>/
            <artifact_id>/
                bundle.tar.zst                     — what was uploaded
                meta.json                          — header from the bundle
 """
 from __future__ import annotations
 import hashlib
 import json
 import re
 import time
 from dataclasses import dataclass
 from pathlib import Path
 from typing import AsyncIterator
 _ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,128}$")
 def is_valid_id(s: str) -> bool:
    return bool(_ID_RE.match(s))
@dataclass(frozen=True)
 class StoreResult:
    status: str        # "stored" | "already-present" | "sha-mismatch" | "too-large"
    artifact_id: str | None
    size_bytes: int | None
 class ModelStore:
    def __init__(self, store_root: Path, incoming_root: Path,
                 index_path: Path) -> None:
        self.store_root = store_root
        self.incoming_root = incoming_root
        self.index_path = index_path
        self.store_root.mkdir(parents=True, exist_ok=True)
        self.incoming_root.mkdir(parents=True, exist_ok=True)
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        self.index_path.touch(exist_ok=True)
    def final_dir(self, model: str, mode: str, artifact_id: str) -> Path:
        return self.store_root / f"{model}_{mode}" / artifact_id
    async def ingest_stream(
        self,
        *,
        job_id: str,
        model: str,
        mode: str,
        worker: str,
        expected_sha256: str,
        body: AsyncIterator[bytes],
        max_bytes: int,
    ) -> StoreResult:
        # Final artifact id == the uploaded tarball's sha256, so
        # uploading the same bytes twice deduplicates.
        h = hashlib.sha256()
        n = 0
        incoming_dir = self.incoming_root / f"{model}_{mode}"
        incoming_dir.mkdir(parents=True, exist_ok=True)
        partial = incoming_dir / f"{job_id}-{int(time.time())}.tar.zst.partial"
        try:
            with partial.open("wb") as out:
                async for chunk in body:
                    n += len(chunk)
                    if n > max_bytes:
                        partial.unlink(missing_ok=True)
                        return StoreResult("too-large", None, n)
                    h.update(chunk)
                    out.write(chunk)
            actual = h.hexdigest()
            if expected_sha256 and actual != expected_sha256.lower():
                partial.unlink(missing_ok=True)
                return StoreResult("sha-mismatch", actual, n)
            artifact_id = actual
            final_dir = self.final_dir(model, mode, artifact_id)
            if final_dir.exists() and (final_dir / "bundle.tar.zst").exists():
                partial.unlink(missing_ok=True)
                return StoreResult("already-present", artifact_id, n)
            final_dir.mkdir(parents=True, exist_ok=True)
            final = final_dir / "bundle.tar.zst"
            partial.replace(final)
            self._write_meta(final_dir, model=model, mode=mode,
                              job_id=job_id, worker=worker,
                              artifact_id=artifact_id, size_bytes=n)
            self._append_index({
                "received_at_wall": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                                   time.gmtime()),
                "job_id": job_id, "model": model, "mode": mode,
                "worker": worker, "artifact_id": artifact_id,
                "size_bytes": n,
            })
            return StoreResult("stored", artifact_id, n)
        except BaseException:
            partial.unlink(missing_ok=True)
            raise
    def _write_meta(self, final_dir: Path, **kwargs) -> None:
        (final_dir / "meta.json").write_text(
            json.dumps(kwargs, indent=2) + "\n"
        )
    def _append_index(self, row: dict) -> None:
        line = json.dumps(row, sort_keys=True) + "\n"
        with self.index_path.open("a") as f:
            f.write(line)
--- a/training/fleet/worker.py
+++ b/training/fleet/worker.py
@ -0,0 +1,341 @@
 """Trainer worker daemon.
 Loops:
  1. Detect capability + report to the receiver via /v1/job/claim
  2. If receiver returns a job → run training/trainer/run.py with the
     spec's hyperparameters
  3. Send heartbeats every ``heartbeat_s`` seconds while training runs
  4. On success: tar the artifact, sha256, PUT /v1/model/{job_id}
                 then POST /v1/job/{job_id}/complete
  5. On failure: POST /v1/job/{job_id}/fail with the error
  6. Sleep ``poll_s`` and repeat
  7. SIGTERM: cancel the in-flight training subprocess, mark the job
     failed with reason "worker shutdown" so the queue re-queues.
 The worker is a single Python process. The training subprocess is
 isolated so a torch crash doesn't kill the worker; the worker reads
 the subprocess's stdout/stderr and reports lines via heartbeat
 metadata for live observability.
 """
 from __future__ import annotations
 import argparse
 import hashlib
 import io
 import json
 import logging
 import os
 import signal
 import subprocess
 import sys
 import tarfile
 import threading
 import time
 from pathlib import Path
 import zstandard as zstd
 from training.fleet.capability import detect
 from training.fleet.client import FleetClient
 log = logging.getLogger("cis490.fleet.worker")
 class WorkerStop(Exception):
    """Raised when the worker has been asked to shut down."""
 class TrainerWorker:
    def __init__(
        self,
        *,
        client: FleetClient,
        repo_root: Path,
        venv_python: Path,
        artifacts_dir: Path,
        reports_dir: Path,
        validation_path: Path,
        summary_path: Path,
        tensors_path: Path,
        heartbeat_s: float = 30.0,
        poll_s: float = 15.0,
    ) -> None:
        self.client = client
        self.repo_root = repo_root
        self.venv_python = venv_python
        self.artifacts_dir = artifacts_dir
        self.reports_dir = reports_dir
        self.validation_path = validation_path
        self.summary_path = summary_path
        self.tensors_path = tensors_path
        self.heartbeat_s = heartbeat_s
        self.poll_s = poll_s
        self._stop = threading.Event()
        self._current_proc: subprocess.Popen | None = None
        self._current_job_id: str | None = None
    def stop(self) -> None:
        self._stop.set()
        proc = self._current_proc
        if proc is not None and proc.poll() is None:
            log.info("SIGTERM-ing in-flight trainer (job=%s pid=%s)",
                     self._current_job_id, proc.pid)
            try:
                proc.terminate()
            except OSError:
                pass
    # ------------------------------------------------------------------
    # Main loop
    # ------------------------------------------------------------------
    def run(self) -> int:
        log.info("worker starting, host_id=%s, polling %s every %.0fs",
                 self.client.host_id, self.client.base_url, self.poll_s)
        while not self._stop.is_set():
            try:
                cap = detect(repo_root=self.repo_root)
                claim = self.client.claim(cap.to_dict())
            except Exception as e:
                log.warning("claim failed: %s", e)
                self._sleep(self.poll_s)
                continue
            if not claim or not claim.get("job_id"):
                self._sleep(self.poll_s)
                continue
            job_id = claim["job_id"]
            self._current_job_id = job_id
            try:
                self._run_one_job(claim)
            except WorkerStop:
                # Best-effort: tell receiver we failed so it re-queues
                try:
                    self.client.fail(job_id, error="worker shutdown")
                except Exception:
                    pass
                break
            except Exception as e:
                log.exception("job %s failed: %s", job_id, e)
                try:
                    self.client.fail(job_id, error=f"{type(e).__name__}: {e}")
                except Exception:
                    pass
            finally:
                self._current_job_id = None
        log.info("worker stopped")
        return 0
    def _sleep(self, seconds: float) -> None:
        # Interruptible sleep so SIGTERM responds quickly
        deadline = time.monotonic() + seconds
        while not self._stop.is_set() and time.monotonic() < deadline:
            time.sleep(min(0.5, deadline - time.monotonic()))
    # ------------------------------------------------------------------
    # One job
    # ------------------------------------------------------------------
    def _run_one_job(self, claim: dict) -> None:
        job_id = claim["job_id"]
        spec = claim["spec"]
        name = claim.get("name", spec.get("name", "<unnamed>"))
        log.info("claimed job %s (%s) — model=%s mode=%s",
                 job_id, name, spec["model"], spec["mode"])
        cmd = self._build_cmd(spec, job_id)
        log.info("trainer cmd: %s", " ".join(cmd))
        # Start trainer subprocess
        self.artifacts_dir.mkdir(parents=True, exist_ok=True)
        self.reports_dir.mkdir(parents=True, exist_ok=True)
        proc = subprocess.Popen(
            cmd, cwd=str(self.repo_root),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
        )
        self._current_proc = proc
        # Heartbeat thread
        beat_stop = threading.Event()
        def _beat():
            while not beat_stop.is_set():
                try:
                    self.client.heartbeat(job_id)
                except Exception as e:
                    log.warning("heartbeat failed: %s", e)
                if beat_stop.wait(self.heartbeat_s):
                    return
        beat_thread = threading.Thread(target=_beat, daemon=True)
        beat_thread.start()
        # Stream output
        try:
            assert proc.stdout is not None
            for line in proc.stdout:
                line = line.rstrip()
                if line:
                    log.info("[trainer] %s", line)
                if self._stop.is_set():
                    proc.terminate()
                    raise WorkerStop()
            rc = proc.wait()
        finally:
            beat_stop.set()
            beat_thread.join(timeout=2.0)
            self._current_proc = None
        if rc != 0:
            raise RuntimeError(f"trainer exited with code {rc}")
        # Bundle + upload artifact
        artifact_path = self._bundle_artifact(spec, job_id)
        log.info("uploading artifact (%.1f MiB)…",
                 artifact_path.stat().st_size / (1024 * 1024))
        resp = self.client.upload_artifact(job_id, artifact_path)
        artifact_id = resp.get("artifact_id")
        if not artifact_id:
            raise RuntimeError(f"upload returned no artifact_id: {resp!r}")
        # Mark complete
        ok = self.client.complete(job_id, artifact_id=artifact_id)
        if not ok:
            raise RuntimeError("complete() did not return ok")
        log.info("job %s done — artifact=%s", job_id, artifact_id[:12])
    def _build_cmd(self, spec: dict, job_id: str) -> list[str]:
        """Compose the trainer subprocess command from the job spec."""
        model = spec["model"]
        mode = spec["mode"]
        # transformer_ssl uses run_ssl.py; everything else uses run.py
        if model == "transformer_ssl":
            cmd = [str(self.venv_python),
                   "-m", "training.trainer.run_ssl",
                   "--mode", mode,
                   "--validation", str(self.validation_path),
                   "--tensors", str(self.tensors_path),
                   "--out-dir", str(self.artifacts_dir),
                   "--reports-dir", str(self.reports_dir),
                   "--seed", str(spec.get("seed", 0))]
        else:
            cmd = [str(self.venv_python),
                   "-m", "training.trainer.run",
                   "--model", model, "--mode", mode,
                   "--validation", str(self.validation_path),
                   "--summary", str(self.summary_path),
                   "--tensors", str(self.tensors_path),
                   "--schema", str(self.summary_path.parent / "feature_schema_v1.json"),
                   "--out-dir", str(self.artifacts_dir),
                   "--reports-dir", str(self.reports_dir),
                   "--split-recipe", spec.get("split_recipe", "host"),
                   "--seed", str(spec.get("seed", 0))]
            for h in spec.get("train_hosts") or []:
                cmd.extend(["--train-hosts", h])
        # Hyperparameter pass-through
        hyper = spec.get("hyper") or {}
        for k, v in hyper.items():
            flag = "--" + k.replace("_", "-")
            cmd.extend([flag, str(v)])
        return cmd
    def _bundle_artifact(self, spec: dict, job_id: str) -> Path:
        """Tar the trained checkpoint + sidecar + train report into a
        single .tar.zst file we PUT to the receiver."""
        model = spec["model"]
        mode = spec["mode"]
        base = f"{model}_{mode}"
        if model == "transformer_ssl":
            # SSL emits transformer_ssl_<mode>.{ckpt.json,pt}
            sidecar_suffix = ".pt"
        elif model == "gbt":
            sidecar_suffix = ".xgb.json"
        else:
            sidecar_suffix = ".pt"
        ckpt_json = self.artifacts_dir / f"{base}.ckpt.json"
        sidecar = self.artifacts_dir / f"{base}{sidecar_suffix}"
        train_json_name = ("transformer_ssl_" + mode + "_pretrain.json"
                           if model == "transformer_ssl"
                           else f"{model}_{mode}_train.json")
        train_json = self.reports_dir / train_json_name
        for required in (ckpt_json, sidecar):
            if not required.exists():
                raise FileNotFoundError(
                    f"trainer did not produce {required}"
                )
        bundle_dir = self.artifacts_dir / "_bundle"
        bundle_dir.mkdir(parents=True, exist_ok=True)
        bundle_path = bundle_dir / f"{base}-{job_id}.tar.zst"
        cctx = zstd.ZstdCompressor(level=10)
        with bundle_path.open("wb") as outf:
            with cctx.stream_writer(outf) as zw:
                with tarfile.open(fileobj=zw, mode="w|") as tar:
                    tar.add(ckpt_json, arcname=ckpt_json.name)
                    tar.add(sidecar, arcname=sidecar.name)
                    if train_json.exists():
                        tar.add(train_json, arcname=train_json.name)
        return bundle_path
 def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--receiver-url", default="http://10.100.0.1:8445",
                    help="Trainer-receiver base URL")
    ap.add_argument("--host-id",
                    default=os.environ.get("FLEET_HOST_ID")
                            or os.uname().nodename)
    ap.add_argument("--repo-root", type=Path,
                    default=Path(__file__).resolve().parents[2])
    ap.add_argument("--venv-python", type=Path,
                    default=Path(sys.executable))
    ap.add_argument("--artifacts-dir", type=Path, default=Path("artifacts"))
    ap.add_argument("--reports-dir", type=Path, default=Path("reports/eval"))
    ap.add_argument("--validation", type=Path,
                    default=Path("data/processed/validation_v1.parquet"))
    ap.add_argument("--summary", type=Path,
                    default=Path("data/processed/features_window_v1.parquet"))
    ap.add_argument("--tensors", type=Path,
                    default=Path("data/processed/tensor_window_v1"))
    ap.add_argument("--poll-s", type=float, default=15.0)
    ap.add_argument("--heartbeat-s", type=float, default=30.0)
    ap.add_argument("--log-level", default="INFO")
    args = ap.parse_args()
    logging.basicConfig(level=args.log_level,
                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
    client = FleetClient(args.receiver_url, host_id=args.host_id)
    worker = TrainerWorker(
        client=client,
        repo_root=args.repo_root, venv_python=args.venv_python,
        artifacts_dir=args.repo_root / args.artifacts_dir,
        reports_dir=args.repo_root / args.reports_dir,
        validation_path=args.repo_root / args.validation,
        summary_path=args.repo_root / args.summary,
        tensors_path=args.repo_root / args.tensors,
        poll_s=args.poll_s, heartbeat_s=args.heartbeat_s,
    )
    def _sigterm(signum, frame):
        log.info("received signal %s; stopping after current job", signum)
        worker.stop()
    signal.signal(signal.SIGTERM, _sigterm)
    signal.signal(signal.SIGINT, _sigterm)
    return worker.run()
 if __name__ == "__main__":
    raise SystemExit(main())