Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.
The operator-control surface is etc/training_manifest.toml.example:
a single canonical file declaring (a) per-host capability plus
per-model allow/deny policy, and (b) one [[jobs]] entry per
(model, mode, hyper) with capability constraints (require_cuda,
prefer_cuda, min_vram_gib, min_ram_gib, allowed_hosts).
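A minimal sketch of how a worker might apply those constraints before claiming. The field names come from the manifest description above; the `Capability` and `JobConstraints` classes, the `is_eligible` and `claim_order` helpers, the default values, and the exact semantics of `prefer_cuda` and `min_vram_gib` are assumptions for illustration, not the repository's code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Capability:
    """What a worker detects about itself (illustrative subset)."""
    hostname: str
    has_cuda: bool
    vram_gib: float
    ram_gib: float


@dataclass(frozen=True)
class JobConstraints:
    """Constraint fields named in the manifest; the defaults are assumptions."""
    model: str
    require_cuda: bool = False
    prefer_cuda: bool = False
    min_vram_gib: float = 0.0
    min_ram_gib: float = 0.0
    allowed_hosts: Tuple[str, ...] = ()  # assumption: empty means "any host"


def is_eligible(cap: Capability, job: JobConstraints,
                allow_jobs: Optional[Iterable[str]] = None) -> bool:
    """Return True if this host may run the job (hard constraints only)."""
    # Per-host allow list, e.g. allow_jobs=["gbt", "mlp"].
    if allow_jobs is not None and job.model not in allow_jobs:
        return False
    if job.allowed_hosts and cap.hostname not in job.allowed_hosts:
        return False
    if job.require_cuda and not cap.has_cuda:
        return False
    # Assumption: the VRAM floor only applies when the host would use a GPU.
    if cap.has_cuda and cap.vram_gib < job.min_vram_gib:
        return False
    return cap.ram_gib >= job.min_ram_gib


def claim_order(cap: Capability, jobs: List[JobConstraints]) -> List[JobConstraints]:
    """Assumption: prefer_cuda is soft, so a CUDA host tries those jobs first."""
    return sorted(jobs, key=lambda j: not (cap.has_cuda and j.prefer_cuda))
```

Under these assumptions a CPU-only host with `allow_jobs=["gbt", "mlp"]` passes the filter for an unconstrained `gbt` job and is rejected for any job that sets `require_cuda`.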
Components:
  capability.py - self-detection: hostname, cores, RAM, CUDA presence,
      VRAM, torch version, git commit. Used by workers to filter
      eligible jobs before claiming.
  manifest.py - TOML loader + JobSpec/HostSpec. Job IDs are stable
      sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
      so manifest reload is idempotent: existing rows keep their status,
      new jobs become claimable, removed jobs stay until cancelled.
  queue.py - SQLite job queue (training_jobs.db) with statuses
      pending|claimed|running|completed|failed|cancelled. Atomic
      claim_next via single UPDATE WHERE status='pending'. Heartbeat,
      complete, fail. Stale-claim sweep (stale_after_s=600s) with
      max_attempts cutoff to failed.
  store.py - model artifact store mirroring receiver/store.py.
      Artifact ID is the sha256 of the uploaded tarball; bit-identical
      re-runs deduplicate.
  receiver.py - Starlette app exposing 11 endpoints:
        POST /v1/job/claim            (worker)
        POST /v1/job/{id}/heartbeat   (worker)
        POST /v1/job/{id}/complete    (worker)
        POST /v1/job/{id}/fail        (worker)
        PUT  /v1/model/{id}           (worker; uploads tarball)
        GET  /v1/jobs                 (anyone)
        GET  /v1/workers              (anyone)
        POST /v1/job/{id}/cancel      (operator: X-Operator-Token)
        POST /v1/job/{id}/requeue     (operator)
        POST /v1/manifest/reload      (operator)
        GET  /v1/health               (anyone)
      Runs as cis490-trainer-receiver.service on the Pi alongside the
      existing receiver, on a separate port.
  client.py - stdlib HTTP client (urllib only, no new deps).
  worker.py - long-running daemon. Loop: detect capability → claim →
      spawn training/trainer/run.py subprocess → heartbeat every 30s →
      tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
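Two of the mechanisms in the component list, the idempotent manifest reload and the atomic claim, can be sketched together. This is a sketch under stated assumptions and not the code in manifest.py or queue.py: the `jobs` table, its columns, the `claim_token` read-back trick, and the helper names `job_id`, `sync_manifest`, and `claim_next` as written here are all hypothetical.

```python
import hashlib
import json
import sqlite3
import uuid
from typing import Iterable, Optional

# Hypothetical schema; the real training_jobs.db layout is not shown in the source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id      TEXT PRIMARY KEY,
    spec        TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    worker      TEXT,
    claim_token TEXT
)
"""


def job_id(spec: dict) -> str:
    """Stable ID: sha256 over canonical JSON, so key order does not matter."""
    canon = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def sync_manifest(conn: sqlite3.Connection, specs: Iterable[dict]) -> int:
    """Insert unseen jobs as pending; rows that already exist keep their status."""
    added = 0
    with conn:  # one transaction, committed on success
        for spec in specs:
            cur = conn.execute(
                "INSERT OR IGNORE INTO jobs (job_id, spec) VALUES (?, ?)",
                (job_id(spec), json.dumps(spec, sort_keys=True)),
            )
            added += cur.rowcount  # 1 if inserted, 0 if the row already existed
    return added


def claim_next(conn: sqlite3.Connection, worker: str) -> Optional[str]:
    """Claim the oldest pending job with a single UPDATE; None if nothing is pending."""
    token = uuid.uuid4().hex  # lets us find the row we just claimed
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status='claimed', worker=?, claim_token=? "
            "WHERE status='pending' AND job_id = ("
            "  SELECT job_id FROM jobs WHERE status='pending' "
            "  ORDER BY rowid LIMIT 1)",
            (worker, token),
        )
        if cur.rowcount == 0:
            return None
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE claim_token=?", (token,)
        ).fetchone()
    return row[0]
```

Because the status check and the status change happen in the same UPDATE, two workers racing for the same row cannot both see `rowcount == 1`. Re-running `sync_manifest` with an unchanged manifest inserts nothing and leaves claimed or completed rows untouched.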
Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
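One way the token match could be done on the receiver side is a constant-time comparison that fails closed when no token is configured. This is only a sketch: the `operator_authorized` and `operator_headers` helpers are hypothetical, and the source does not say how the receiver compares tokens.

```python
import hmac
from typing import Mapping, Optional


def operator_authorized(headers: Mapping[str, str],
                        configured_token: Optional[str]) -> bool:
    """Check X-Operator-Token against the receiver's configured value.

    Assumption: header names arrive lower-cased in a plain mapping.
    """
    if not configured_token:
        return False  # fail closed: no configured token means no operator access
    presented = headers.get("x-operator-token", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode("utf-8"),
                               configured_token.encode("utf-8"))


def operator_headers(env: Mapping[str, str]) -> dict:
    """Client side: build the header from $CIS490_OPERATOR_TOKEN if it is set."""
    token = env.get("CIS490_OPERATOR_TOKEN", "")
    return {"X-Operator-Token": token} if token else {}
```

On the CLI side, `operator_headers(os.environ)` would supply the header for cancel and requeue requests; with the variable unset it returns an empty dict and the receiver rejects the call.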
Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.
End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.
21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.
Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training
time, not at manifest reload (worth tightening to argparse
introspection later).
- mTLS not yet wired through Caddy for the trainer-receiver port —
listens loopback-only until that lands.
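The argparse-introspection tightening mentioned in the first limitation might look roughly like the sketch below. `build_run_parser` is a hypothetical stand-in for whatever parser training/trainer/run.py defines, and its flags are invented for illustration. The check reads `parser._actions`, which is a private argparse attribute, so treat that as an assumption that could need revisiting.

```python
import argparse
from typing import List, Mapping


def build_run_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the run.py parser; the flags are illustrative."""
    p = argparse.ArgumentParser(prog="run.py")
    p.add_argument("--model", required=True)
    p.add_argument("--mode", required=True)
    p.add_argument("--learning-rate", type=float, default=0.1)
    p.add_argument("--n-estimators", type=int, default=100)
    return p


def unknown_hyper_keys(parser: argparse.ArgumentParser,
                       hyper: Mapping[str, object]) -> List[str]:
    """Return manifest hyper keys that the parser would not accept."""
    # _actions is private to argparse; each action's dest is the attribute
    # name, e.g. "--learning-rate" becomes "learning_rate".
    known = {a.dest for a in parser._actions if a.dest != "help"}
    return sorted(k for k in hyper if k.replace("-", "_") not in known)
```

Calling `unknown_hyper_keys` during manifest reload and rejecting a job when the list is non-empty would surface key drift before any worker claims the job.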
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
123 lines
4.5 KiB
Python
"""Trained-artifact store on the Pi.

Mirrors ``receiver/store.py`` for episodes (same atomic-write,
sha256-verified, stream-ingest design) but stores trained models
under ``/var/lib/cis490/models/<model>_<mode>/<artifact_id>/``.

An ``artifact_id`` is the sha256 of the uploaded tarball. The same
job_id can produce multiple artifact_ids if the operator re-runs the
job (different code commit, different epoch, different seed); the
queue records the latest artifact_id for each completed job, but the
store keeps every uploaded artifact so re-runs can be compared.

Layout::

    /var/lib/cis490/models/
        index.jsonl              - append-only ingest log
        <model>_<mode>/
            <artifact_id>/
                bundle.tar.zst   - what was uploaded
                meta.json        - header from the bundle
"""
from __future__ import annotations

import hashlib
import json
import re
import time
from dataclasses import dataclass
from pathlib import Path
from typing import AsyncIterator


_ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,128}$")


def is_valid_id(s: str) -> bool:
    return bool(_ID_RE.match(s))


@dataclass(frozen=True)
class StoreResult:
    status: str  # "stored" | "already-present" | "sha-mismatch" | "too-large"
    artifact_id: str | None
    size_bytes: int | None


class ModelStore:
    def __init__(self, store_root: Path, incoming_root: Path,
                 index_path: Path) -> None:
        self.store_root = store_root
        self.incoming_root = incoming_root
        self.index_path = index_path
        self.store_root.mkdir(parents=True, exist_ok=True)
        self.incoming_root.mkdir(parents=True, exist_ok=True)
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        self.index_path.touch(exist_ok=True)

    def final_dir(self, model: str, mode: str, artifact_id: str) -> Path:
        return self.store_root / f"{model}_{mode}" / artifact_id

    async def ingest_stream(
        self,
        *,
        job_id: str,
        model: str,
        mode: str,
        worker: str,
        expected_sha256: str,
        body: AsyncIterator[bytes],
        max_bytes: int,
    ) -> StoreResult:
        # Final artifact id == the uploaded tarball's sha256, so
        # uploading the same bytes twice deduplicates.
        h = hashlib.sha256()
        n = 0
        incoming_dir = self.incoming_root / f"{model}_{mode}"
        incoming_dir.mkdir(parents=True, exist_ok=True)
        partial = incoming_dir / f"{job_id}-{int(time.time())}.tar.zst.partial"
        try:
            with partial.open("wb") as out:
                async for chunk in body:
                    n += len(chunk)
                    if n > max_bytes:
                        partial.unlink(missing_ok=True)
                        return StoreResult("too-large", None, n)
                    h.update(chunk)
                    out.write(chunk)
            actual = h.hexdigest()
            if expected_sha256 and actual != expected_sha256.lower():
                partial.unlink(missing_ok=True)
                return StoreResult("sha-mismatch", actual, n)
            artifact_id = actual
            final_dir = self.final_dir(model, mode, artifact_id)
            if final_dir.exists() and (final_dir / "bundle.tar.zst").exists():
                partial.unlink(missing_ok=True)
                return StoreResult("already-present", artifact_id, n)
            final_dir.mkdir(parents=True, exist_ok=True)
            final = final_dir / "bundle.tar.zst"
            partial.replace(final)
            self._write_meta(final_dir, model=model, mode=mode,
                             job_id=job_id, worker=worker,
                             artifact_id=artifact_id, size_bytes=n)
            self._append_index({
                "received_at_wall": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                                  time.gmtime()),
                "job_id": job_id, "model": model, "mode": mode,
                "worker": worker, "artifact_id": artifact_id,
                "size_bytes": n,
            })
            return StoreResult("stored", artifact_id, n)
        except BaseException:
            partial.unlink(missing_ok=True)
            raise

    def _write_meta(self, final_dir: Path, **kwargs) -> None:
        (final_dir / "meta.json").write_text(
            json.dumps(kwargs, indent=2) + "\n"
        )

    def _append_index(self, row: dict) -> None:
        line = json.dumps(row, sort_keys=True) + "\n"
        with self.index_path.open("a") as f:
            f.write(line)