Symmetric companion to the collection fleet (orchestrator/fleet.py),
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.

The operator-control surface is etc/training_manifest.toml.example, a
single canonical file declaring (a) per-host capability + per-model
allow/deny policy, and (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).

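A minimal sketch of how the capability constraints above could be checked against a host. The class names, field names on the host side, and the exact semantics (for example, treating prefer_cuda as a ranking hint rather than a filter) are assumptions for illustration, not the actual JobSpec/HostSpec code:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostCapability:
    """Hypothetical self-report of one worker host."""
    hostname: str
    has_cuda: bool
    vram_gib: float
    ram_gib: float


@dataclass
class JobConstraints:
    """Constraint names follow the manifest; defaults are assumed."""
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0
    min_ram_gib: float = 0.0
    allowed_hosts: Optional[List[str]] = None  # None means any host


def is_eligible(host: HostCapability, job: JobConstraints) -> bool:
    """Hard filter: every constraint must hold for the host to claim the job."""
    if job.allowed_hosts is not None and host.hostname not in job.allowed_hosts:
        return False
    if job.require_cuda and not host.has_cuda:
        return False
    if host.vram_gib < job.min_vram_gib:
        return False
    if host.ram_gib < job.min_ram_gib:
        return False
    return True


def preference(host: HostCapability, job: JobConstraints) -> int:
    """Soft hint: prefer_cuda never excludes a host, it only ranks CUDA hosts higher."""
    return 1 if (job.prefer_cuda and host.has_cuda) else 0
```

Under these assumptions a CPU-only host still qualifies for a prefer_cuda job, but never for a require_cuda one.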
Components:

  capability.py: self-detection of hostname, cores, RAM, CUDA presence,
    VRAM, torch version, git commit. Used by workers to filter
    eligible jobs before claiming.

  manifest.py: TOML loader + JobSpec/HostSpec. Job IDs are a stable
    sha256 of (model, mode, hyper, split_recipe, train_hosts, seed),
    so manifest reload is idempotent: existing rows keep their status,
    new jobs become claimable, removed jobs stay until cancelled.

  queue.py: SQLite job queue (training_jobs.db) with statuses
    pending|claimed|running|completed|failed|cancelled. Atomic
    claim_next via a single UPDATE WHERE status='pending'. Heartbeat,
    complete, fail. Stale-claim sweep (stale_after_s=600); a job that
    exceeds max_attempts is marked failed.

  store.py: model artifact store mirroring receiver/store.py.
    Artifact ID is the sha256 of the uploaded tarball; bit-identical
    re-runs deduplicate.

  receiver.py: Starlette app exposing 11 endpoints:
      POST /v1/job/claim             (worker)
      POST /v1/job/{id}/heartbeat    (worker)
      POST /v1/job/{id}/complete     (worker)
      POST /v1/job/{id}/fail         (worker)
      PUT  /v1/model/{id}            (worker; uploads tarball)
      GET  /v1/jobs                  (anyone)
      GET  /v1/workers               (anyone)
      POST /v1/job/{id}/cancel       (operator: X-Operator-Token)
      POST /v1/job/{id}/requeue      (operator)
      POST /v1/manifest/reload       (operator)
      GET  /v1/health                (anyone)
    Runs as cis490-trainer-receiver.service on the Pi alongside the
    existing receiver, on a separate port.

  client.py: stdlib HTTP client (urllib only, no new deps).

  worker.py: long-running daemon. Loop: detect capability → claim →
    spawn training/trainer/run.py subprocess → heartbeat every 30s →
    tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.

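The stable job ID, idempotent reload, and atomic claim described for manifest.py and queue.py can be sketched as follows. This is an illustration under stated assumptions, not the actual schema: the table layout, the canonical-JSON encoding, and sorting train_hosts before hashing are all hypothetical choices.

```python
import hashlib
import json
import sqlite3


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Stable ID: sha256 over a canonical JSON encoding of the job's identity.

    Key order in `hyper` and host order in `train_hosts` do not change the ID
    (an assumed normalisation).
    """
    payload = json.dumps(
        [model, mode, hyper, split_recipe, sorted(train_hosts), seed],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_queue(path=":memory:"):
    """Open the queue database and create the (hypothetical) jobs table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    return db


def sync_manifest(db, job_ids):
    """Idempotent reload: new IDs become pending, existing rows keep their status."""
    with db:
        db.executemany(
            "INSERT OR IGNORE INTO jobs (job_id) VALUES (?)",
            [(j,) for j in job_ids],
        )


def claim_next(db, worker):
    """Claim one pending job, or return None if nothing is pending.

    The UPDATE guarded by status = 'pending' is the atomic step: if another
    worker claimed the row first, rowcount is 0 and we try the next candidate.
    """
    while True:
        row = db.execute(
            "SELECT job_id FROM jobs WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with db:
            cur = db.execute(
                "UPDATE jobs SET status = 'claimed', worker = ? "
                "WHERE job_id = ? AND status = 'pending'",
                (worker, row[0]),
            )
        if cur.rowcount == 1:
            return row[0]
```

Because the ID depends only on the job's declared identity, calling sync_manifest again with the same manifest leaves claimed or completed rows untouched.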
Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
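As a rough illustration of the token check from the client side, the sketch below builds (but does not send) an operator request with urllib. The helper name is hypothetical, and sending the token in X-Operator-Token for requeue as well as cancel is an assumption based on the endpoint list:

```python
import os
import urllib.request


def operator_request(base_url, job_id, action, token=None):
    """Build a POST to /v1/job/{id}/cancel or /v1/job/{id}/requeue.

    The token defaults to $CIS490_OPERATOR_TOKEN; the request is only
    constructed here, so the caller decides when to urlopen() it.
    """
    if action not in ("cancel", "requeue"):
        raise ValueError(f"unsupported operator action: {action}")
    if token is None:
        token = os.environ.get("CIS490_OPERATOR_TOKEN", "")
    if not token:
        raise RuntimeError("CIS490_OPERATOR_TOKEN is not set")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/job/{job_id}/{action}",
        data=b"",
        method="POST",
        headers={"X-Operator-Token": token},
    )
```

Failing early on a missing token keeps the error local instead of surfacing as a rejected request from the receiver.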

Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. The worker self-tests capability
before registering.

End-to-end smoke test verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, and 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl rows.

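The content-addressed layout in that path can be sketched as below. The function names are hypothetical, the index.jsonl row is left out because its format is not stated here, and a local file copy stands in for the HTTP upload:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bundles are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_bundle(root, model, mode, src):
    """Place a bundle at <root>/<model>_<mode>/<sha256>/bundle.tar.zst.

    Returns (artifact_id, destination, was_new). A bit-identical bundle maps
    to the same path, so a re-run is detected and not stored twice.
    """
    artifact_id = sha256_file(src)
    dest = Path(root) / f"{model}_{mode}" / artifact_id / "bundle.tar.zst"
    if dest.exists():
        return artifact_id, dest, False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return artifact_id, dest, True
```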
21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.

Open limitations surfaced inline:
- Hyper-key drift between the manifest and run.py fails at training
  time, not at manifest reload (worth tightening to argparse
  introspection later).
- mTLS is not yet wired through Caddy for the trainer-receiver port;
  it listens loopback-only until that lands.

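One possible shape for the argparse-based tightening mentioned in the first limitation, using only the public parse_known_args API. The parser below is a hypothetical stand-in for run.py's real one, and mapping a hyper key such as max_depth to the flag --max-depth is an assumed convention:

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the parser defined in training/trainer/run.py."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--model", default="gbt")
    parser.add_argument("--max-depth", type=int, default=6)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    return parser


def unknown_hyper_keys(parser, hyper):
    """Return the hyper keys the parser does not recognise as flags.

    Each key is probed on its own: anything parse_known_args cannot place
    comes back in the leftover list, which marks the key as unknown.
    A value of the wrong type for a known flag would still make argparse
    exit, so this checks key names only.
    """
    unknown = []
    for key, value in hyper.items():
        flag = "--" + key.replace("_", "-")
        _, leftover = parser.parse_known_args([flag, str(value)])
        if leftover:
            unknown.append(key)
    return unknown
```

Run at manifest reload, a check like this would reject a drifted key before any worker claims the job. allow_abbrev=False matters here, since prefix matching would otherwise let a truncated key pass.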
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
116 lines
4.1 KiB
PowerShell
# Install a CIS490 trainer worker on a Windows host (e.g., the operator's
# desktop with the GPU).
#
# Symmetric to install-training-worker.sh but for Windows. This script:
#   - Confirms the Pi receiver is reachable over WireGuard
#   - Confirms a Python venv is present and prints its capability
#     self-report (CUDA, VRAM, torch version)
#   - Registers a Scheduled Task that runs the worker at startup + every
#     5 minutes if it isn't running
#
# Run as Administrator in PowerShell:
#   powershell.exe -ExecutionPolicy Bypass -File install-training-worker-windows.ps1
#
# Prereqs (set these up manually before running):
#   - Git clone of the CIS490 repo at $env:CIS490_HOME (default: C:\cis490)
#   - Python 3.11+ in $env:CIS490_HOME\.venv with torch (CUDA) + xgboost
#       py -3.11 -m venv .venv
#       .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
#       .\.venv\Scripts\pip install -e .
#   - WireGuard tunnel up to 10.100.0.1
#
# After install, the worker logs go to $env:CIS490_HOME\logs\trainer-worker.log

param(
    [string]$RepoRoot = $(if ($env:CIS490_HOME) { $env:CIS490_HOME } else { "C:\cis490" }),
    [string]$ReceiverUrl = $(if ($env:CIS490_TRAINER_RECEIVER_URL) { $env:CIS490_TRAINER_RECEIVER_URL } else { "http://10.100.0.1:8445" }),
    [string]$HostId = $(if ($env:FLEET_HOST_ID) { $env:FLEET_HOST_ID } else { $env:COMPUTERNAME })
)

$ErrorActionPreference = "Stop"

if (-not (Test-Path $RepoRoot)) {
    Write-Error "Repo not found at $RepoRoot. Set `$env:CIS490_HOME or pass -RepoRoot."
    exit 1
}

$VenvPy = Join-Path $RepoRoot ".venv\Scripts\python.exe"
if (-not (Test-Path $VenvPy)) {
    Write-Error @"
No Python venv at $VenvPy.
Set up first:
  cd $RepoRoot
  py -3.11 -m venv .venv
  .\.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
  .\.venv\Scripts\pip install -e .
"@
    exit 1
}

# Receiver reachability
Write-Host "Checking trainer-receiver at $ReceiverUrl..."
try {
    $r = Invoke-WebRequest -Uri "$ReceiverUrl/v1/health" -TimeoutSec 5 -UseBasicParsing
    if ($r.StatusCode -ne 200) { throw "non-200" }
    Write-Host "  receiver OK"
} catch {
    Write-Error @"
Cannot reach $ReceiverUrl.
  - Is the WireGuard tunnel up? (Get-NetAdapter | ? Name -like 'wg*')
  - Is cis490-trainer-receiver.service running on the Pi?
"@
    exit 1
}

# Capability self-test
Write-Host ""
Write-Host "=== capability self-report ==="
& $VenvPy -m training.fleet.capability
Write-Host ""

# Logs dir
$LogsDir = Join-Path $RepoRoot "logs"
New-Item -ItemType Directory -Force -Path $LogsDir | Out-Null
$LogPath = Join-Path $LogsDir "trainer-worker.log"

# Build the launcher .cmd that the scheduled task invokes
$LauncherPath = Join-Path $RepoRoot "scripts\run-trainer-worker.cmd"
@"
@echo off
cd /d "$RepoRoot"
set CIS490_TRAINER_RECEIVER_URL=$ReceiverUrl
set FLEET_HOST_ID=$HostId
"$VenvPy" -m training.fleet.worker --receiver-url "$ReceiverUrl" --host-id "$HostId" >> "$LogPath" 2>&1
"@ | Set-Content -Encoding ASCII $LauncherPath
Write-Host "wrote launcher: $LauncherPath"

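# Register / replace the scheduled task
$TaskName = "CIS490-TrainerWorker"
# Get-ScheduledTask returns nothing when the task is absent. Querying with
# "schtasks ... 2>`$null" can instead abort a first-time install, because
# Windows PowerShell may raise redirected native stderr as a terminating
# error under `$ErrorActionPreference = "Stop".
$existing = Get-ScheduledTask -TaskName $TaskName -ErrorAction SilentlyContinue
if ($existing) {
    Write-Host "removing existing scheduled task $TaskName"
    schtasks /Delete /TN $TaskName /F | Out-Null
}

# Run as the current user, at startup, with highest privileges
schtasks /Create /TN $TaskName /TR "`"$LauncherPath`"" /SC ONSTART /RU "$env:USERDOMAIN\$env:USERNAME" /RL HIGHEST /F | Out-Null
# Best effort: set a 5-minute repetition interval on the task's trigger so the
# worker is relaunched if it has exited. Routed through cmd.exe so that any
# stderr output from a failed change is discarded instead of stopping the script.
cmd /c "schtasks /Change /TN $TaskName /RI 5 /DU 9999:00 >nul 2>&1"
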
Write-Host ""
Write-Host "scheduled task '$TaskName' created."
Write-Host "Starting it now..."
schtasks /Run /TN $TaskName | Out-Null

Start-Sleep -Seconds 3
if (Test-Path $LogPath) {
    Write-Host ""
    # -Tail returns the end of the file, so label the output accordingly
    Write-Host "=== last 30 log lines ==="
    Get-Content $LogPath -Tail 30
}

Write-Host ""
Write-Host "Done."
Write-Host "  Logs:    Get-Content '$LogPath' -Wait"
Write-Host "  Status:  schtasks /Query /TN $TaskName /V /FO LIST"
Write-Host "  Stop:    schtasks /End /TN $TaskName"
Write-Host "  Remove:  schtasks /Delete /TN $TaskName /F"