Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.
Operator-control surface is etc/training_manifest.toml.example —
single canonical file declaring (a) per-host capability + per-model
allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).
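
The capability constraints above can be read as a predicate over a host's self-report. The following is a minimal sketch, not the project's code: the `Capability` and `JobConstraints` dataclasses and the `eligible` function are hypothetical names, and only the constraint field names come from the manifest description. It assumes that `prefer_cuda` is a soft preference that never excludes a host, and that an empty `allowed_hosts` means any host may take the job.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """What a worker reports about itself (hypothetical shape)."""
    hostname: str
    ram_gib: float
    has_cuda: bool = False
    vram_gib: float = 0.0


@dataclass(frozen=True)
class JobConstraints:
    """Constraint fields named in the manifest description."""
    require_cuda: bool = False
    prefer_cuda: bool = False             # assumed soft: never excludes a host
    min_vram_gib: float = 0.0
    min_ram_gib: float = 0.0
    allowed_hosts: tuple[str, ...] = ()   # assumed: empty means "any host"


def eligible(cap: Capability, job: JobConstraints) -> bool:
    """Return True if the host satisfies every hard constraint of the job."""
    if job.allowed_hosts and cap.hostname not in job.allowed_hosts:
        return False
    if job.require_cuda and not cap.has_cuda:
        return False
    if cap.ram_gib < job.min_ram_gib:
        return False
    # VRAM only matters when the job asks for some.
    if job.min_vram_gib > 0 and cap.vram_gib < job.min_vram_gib:
        return False
    return True


pi = Capability(hostname="pi", ram_gib=8.0)
gpu_job = JobConstraints(require_cuda=True, min_vram_gib=8.0)
cpu_job = JobConstraints(min_ram_gib=4.0)
print(eligible(pi, gpu_job), eligible(pi, cpu_job))  # False True
```

Filtering on the worker side keeps the claim request cheap, but the receiver would still need to enforce the same rules if workers are not fully trusted.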
Components:
capability.py — self-detection: hostname, cores, RAM, CUDA presence,
VRAM, torch version, git commit. Used by workers to filter
eligible jobs before claiming.
manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable
sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
so manifest reload is idempotent: existing rows keep their status,
new jobs become claimable, removed jobs stay until cancelled.
queue.py — SQLite job queue (training_jobs.db) with statuses
pending|claimed|running|completed|failed|cancelled. Atomic
claim_next via single UPDATE WHERE status='pending'. Heartbeat,
complete, fail. Stale-claim sweep (stale_after_s=600) recovers claims
whose heartbeat has lapsed; a job that reaches max_attempts is marked failed.
store.py — model artifact store mirroring receiver/store.py.
Artifact ID is the sha256 of the uploaded tarball; bit-identical
re-runs deduplicate.
receiver.py — Starlette app exposing 11 endpoints:
    POST /v1/job/claim             (worker)
    POST /v1/job/{id}/heartbeat    (worker)
    POST /v1/job/{id}/complete     (worker)
    POST /v1/job/{id}/fail         (worker)
    PUT  /v1/model/{id}            (worker; uploads tarball)
    GET  /v1/jobs                  (anyone)
    GET  /v1/workers               (anyone)
    POST /v1/job/{id}/cancel       (operator: X-Operator-Token)
    POST /v1/job/{id}/requeue      (operator)
    POST /v1/manifest/reload       (operator)
    GET  /v1/health                (anyone)
Runs as cis490-trainer-receiver.service on the Pi alongside the
existing receiver, on a separate port.
client.py — stdlib HTTP client (urllib only, no new deps).
worker.py — long-running daemon. Loop: detect capability → claim →
spawn training/trainer/run.py subprocess → heartbeat every 30s →
tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
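
The stable job IDs, the idempotent manifest reload, and the atomic claim described in the component list can be sketched together as follows. This is an illustration under stated assumptions, not the contents of manifest.py or queue.py: the table schema, the function names, the canonical-JSON encoding of the job tuple, and the example hyperparameters are all hypothetical. The guard `AND status='pending'` on the UPDATE is what makes the claim safe when two workers race for the same row.

```python
import hashlib
import json
import sqlite3


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Stable ID: sha256 over a canonical JSON encoding of the job tuple."""
    payload = json.dumps(
        [model, mode, hyper, split_recipe, list(train_hosts), seed],
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_queue(path=":memory:"):
    """Open the queue database and create the (hypothetical) jobs table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT, attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def sync_manifest(db, ids):
    """Idempotent reload: new IDs are inserted, existing rows keep their status."""
    with db:
        db.executemany("INSERT OR IGNORE INTO jobs (id) VALUES (?)",
                       [(i,) for i in ids])


def claim_next(db, worker):
    """Claim one pending job; return its id, or None if nothing is pending."""
    while True:
        row = db.execute(
            "SELECT id FROM jobs WHERE status='pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with db:
            cur = db.execute(
                "UPDATE jobs SET status='claimed', worker=?, attempts=attempts+1"
                " WHERE id=? AND status='pending'",
                (worker, row[0]),
            )
        if cur.rowcount == 1:
            return row[0]
        # Another worker claimed this row first; look for the next one.


db = open_queue()
a = job_id("gbt", "oracle", {"depth": 6, "lr": 0.1}, "default", ["pi"], 0)
b = job_id("gbt", "oracle", {"lr": 0.1, "depth": 6}, "default", ["pi"], 0)
sync_manifest(db, [a])
first = claim_next(db, "worker-1")
sync_manifest(db, [a])               # reload: the claimed row is left alone
second = claim_next(db, "worker-2")
print(a == b, first == a, second)    # True True None
```

Because the ID depends only on the job's declared fields, reloading an unchanged manifest cannot reset a row that is already claimed or completed.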
Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
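
A token check of this kind is usually done with a constant-time comparison. The sketch below is an assumption about how such a check could look, not the receiver's actual code: `operator_authorized` is a hypothetical helper, the headers are modelled as a plain dict (so the lookup is case-sensitive here, unlike a real framework's header mapping), and refusing all operator calls when no token is configured is an assumed policy.

```python
import hmac


def operator_authorized(headers, configured_token):
    """Compare X-Operator-Token with the configured value in constant time."""
    if not configured_token:
        # Assumed policy: with no token configured, operator calls are refused.
        return False
    supplied = headers.get("X-Operator-Token", "")
    return hmac.compare_digest(supplied.encode("utf-8"),
                               configured_token.encode("utf-8"))


ok = operator_authorized({"X-Operator-Token": "s3cret"}, "s3cret")
wrong = operator_authorized({"X-Operator-Token": "guess"}, "s3cret")
missing = operator_authorized({}, "s3cret")
print(ok, wrong, missing)  # True False False
```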
Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.
End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst, each with
a correct index.jsonl row.
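
The on-disk layout above implies a content-addressed store: the directory name is the sha256 of the tarball, so a bit-identical re-upload lands on the same path. A minimal sketch of that idea follows. The `store_bundle` function, the temporary-file suffix, and the index row fields are hypothetical; only the path layout and the index.jsonl file name come from the text.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_bundle(root: Path, model: str, mode: str, bundle: bytes) -> str:
    """Store a tarball under its own sha256; identical uploads deduplicate."""
    artifact_id = hashlib.sha256(bundle).hexdigest()
    dest = root / f"{model}_{mode}" / artifact_id / "bundle.tar.zst"
    if dest.exists():
        return artifact_id                    # already stored, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(".part")           # write aside, then rename into place
    tmp.write_bytes(bundle)
    tmp.replace(dest)
    row = {"artifact_id": artifact_id, "model": model, "mode": mode,
           "bytes": len(bundle)}              # hypothetical row fields
    with (root / "index.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return artifact_id


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    first = store_bundle(root, "gbt", "oracle", b"fake tarball bytes")
    again = store_bundle(root, "gbt", "oracle", b"fake tarball bytes")
    rows = (root / "index.jsonl").read_text(encoding="utf-8").splitlines()
    print(first == again, len(rows))  # True 1
```

Writing to a side file and renaming keeps a half-uploaded tarball from ever appearing at the final path.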
21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.
Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training
time, not at manifest reload (worth tightening to argparse
introspection later).
- mTLS not yet wired through Caddy for the trainer-receiver port —
listens loopback-only until that lands.
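
The argparse introspection mentioned in the first limitation could look roughly like the sketch below, which rejects unknown hyper keys at reload time instead of at training time. It is an illustration only: `build_parser` is a stand-in for whatever parser run.py defines, its flags are hypothetical, and the check relies on the private `parser._actions` attribute, which is widely used but not part of argparse's documented API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in for run.py's parser; these flags are hypothetical."""
    p = argparse.ArgumentParser(prog="run.py")
    p.add_argument("--model", required=True)
    p.add_argument("--mode", required=True)
    p.add_argument("--learning-rate", type=float, default=0.1)
    p.add_argument("--max-depth", type=int, default=6)
    return p


def accepted_keys(parser: argparse.ArgumentParser) -> set:
    """Collect the dest names the parser accepts, excluding -h/--help."""
    return {action.dest for action in parser._actions if action.dest != "help"}


def unknown_hyper_keys(hyper: dict, parser: argparse.ArgumentParser) -> list:
    """Return the hyper keys that the parser would not recognise."""
    return sorted(set(hyper) - accepted_keys(parser))


bad = unknown_hyper_keys({"learning_rate": 0.05, "n_estimators": 200},
                         build_parser())
print(bad)  # ['n_estimators']
```

An alternative that avoids the private attribute would be for run.py to export its accepted keys explicitly, at the cost of keeping that list in sync by hand.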
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
83 lines
2.8 KiB
Bash
Executable file
#!/usr/bin/env bash
# Install a CIS490 trainer worker on a Linux host (Pi or x86 GPU box).
#
# This is the symmetric companion to install-lab-host.sh: same idea,
# different role. Run as root on the host you want to enroll. Prereqs:
#   - WireGuard up to 10.100.0.1
#   - A working Python 3.11+ with the training deps installed
#   - Repo cloned to /opt/cis490, working tree clean, on origin/main
#
# What this script does:
#   1. Verifies repo + venv + trainer-receiver reachability over the WG mesh
#   2. Installs /etc/systemd/system/cis490-trainer-worker.service from the repo
#   3. Drops a default /etc/cis490/trainer-worker.env (operator edits if needed)
#   4. systemctl enable --now cis490-trainer-worker.service
#   5. Prints the last 30 journal lines so the operator can confirm startup

set -euo pipefail

REPO=/opt/cis490
VENV_PY=$REPO/.venv/bin/python
RECEIVER_URL=${CIS490_TRAINER_RECEIVER_URL:-http://10.100.0.1:8445}
HOST_ID=${FLEET_HOST_ID:-$(hostname)}

if [[ $EUID -ne 0 ]]; then
    echo "must run as root" >&2; exit 1
fi
if [[ ! -d $REPO ]]; then
    echo "repo not at $REPO; clone http://maxgit.wg/spectral/CIS490 first" >&2
    exit 1
fi
if [[ ! -x $VENV_PY ]]; then
    echo "no venv at $REPO/.venv. Run:" >&2
    echo "  cd $REPO && python3 -m venv .venv && .venv/bin/pip install -e ." >&2
    exit 1
fi

# Receiver reachable? (-f makes curl fail on HTTP error statuses as well)
if ! curl -sf --max-time 3 "$RECEIVER_URL/v1/health" >/dev/null; then
    echo "trainer-receiver unreachable at $RECEIVER_URL" >&2
    echo "  - is the WG mesh up? (ip a show wg0)" >&2
    echo "  - is cis490-trainer-receiver.service running on the Pi?" >&2
    exit 1
fi

# Capability self-test: what will the worker report?
echo "=== capability self-report ==="
sudo -u cis490 "$VENV_PY" -m training.fleet.capability
echo

# Drop the env file (idempotent: keeps existing edits)
mkdir -p /etc/cis490
if [[ ! -f /etc/cis490/trainer-worker.env ]]; then
    cat > /etc/cis490/trainer-worker.env <<EOF
# CIS490 trainer-worker config
CIS490_TRAINER_RECEIVER_URL=$RECEIVER_URL
FLEET_HOST_ID=$HOST_ID
EOF
    chmod 0644 /etc/cis490/trainer-worker.env
    echo "wrote /etc/cis490/trainer-worker.env"
else
    echo "/etc/cis490/trainer-worker.env exists; leaving it alone"
fi

# Install the systemd unit
cp "$REPO/etc/cis490-trainer-worker.service" /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now cis490-trainer-worker.service

# Confirm
sleep 3
if ! systemctl is-active --quiet cis490-trainer-worker.service; then
    echo "trainer-worker did not start; see:" >&2
    echo "  journalctl -u cis490-trainer-worker.service -n 50" >&2
    exit 1
fi
echo "OK. Tailing 30 lines of journal:"
journalctl -u cis490-trainer-worker.service --no-pager -n 30
echo
echo "Status from the Pi:"
echo "  ssh max@10.100.0.1 cis490-jobs status"
echo "Local control:"
echo "  systemctl status cis490-trainer-worker.service"
echo "  journalctl -u cis490-trainer-worker.service -f"