CIS490

Max 8643192a71 training/fleet: distributed multi-host trainer with capability gating	2026-05-08 01:20:20 -05:00

Symmetric companion to the collection fleet (orchestrator/fleet.py) but
for training. Collection is embarrassingly parallel; training is not (a
model is trained at most once across the fleet), so the receiver
coordinates which worker gets which job.

Operator-control surface is etc/training_manifest.toml.example - single
canonical file declaring
  (a) per-host capability + per-model allow/deny policy,
  (b) one [[jobs]] entry per (model, mode, hyper) with capability
      constraints (require_cuda, prefer_cuda, min_vram_gib, min_ram_gib,
      allowed_hosts).

Components:

  capability.py - self-detection: hostname, cores, RAM, CUDA presence,
    VRAM, torch version, git commit. Used by workers to filter eligible
    jobs before claiming.

  manifest.py - TOML loader + JobSpec/HostSpec. Job IDs are stable
    sha256 of (model, mode, hyper, split_recipe, train_hosts, seed) so
    manifest reload is idempotent: existing rows keep their status, new
    jobs become claimable, removed jobs stay until cancelled.

  queue.py - SQLite job queue (training_jobs.db) with statuses
    pending|claimed|running|completed|failed|cancelled. Atomic
    claim_next via single UPDATE WHERE status='pending'. Heartbeat,
    complete, fail. Stale-claim sweep (stale_after_s=600s) with
    max_attempts cutoff to failed.

  store.py - model artifact store mirroring receiver/store.py. Artifact
    ID is the sha256 of the uploaded tarball; bit-identical re-runs
    deduplicate.

  receiver.py - Starlette app exposing 11 endpoints:
      POST /v1/job/claim            (worker)
      POST /v1/job/{id}/heartbeat   (worker)
      POST /v1/job/{id}/complete    (worker)
      POST /v1/job/{id}/fail        (worker)
      PUT  /v1/model/{id}           (worker - uploads tarball)
      GET  /v1/jobs                 (anyone)
      GET  /v1/workers              (anyone)
      POST /v1/job/{id}/cancel      (operator: X-Operator-Token)
      POST /v1/job/{id}/requeue     (operator)
      POST /v1/manifest/reload      (operator)
      GET  /v1/health               (anyone)
    Runs as cis490-trainer-receiver.service on the Pi alongside the
    existing receiver, on a separate port.

  client.py - stdlib HTTP client (urllib only, no new deps).

  worker.py - long-running daemon. Loop: detect capability → claim →
    spawn training/trainer/run.py subprocess → heartbeat every 30s →
    tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.

Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.

Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task) let
the operator enroll a new host with one command after cloning the repo
and setting up the venv. Worker self-tests capability before
registering.

End-to-end smoke verified on the Pi: receiver up, manifest synced, 14
jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via cis490-jobs
status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.

21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.

Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training time,
  not at manifest reload (worth tightening to argparse introspection
  later).
- mTLS not yet wired through Caddy for the trainer-receiver port -
  listens loopback-only until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
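
The two queue properties the commit message leans on (stable sha256 job IDs that make manifest reload idempotent, and an atomic claim through a single UPDATE on pending rows) can be sketched roughly as below. This is an illustrative sketch, not the code in manifest.py or queue.py: the table columns, the `claim_token` trick, the canonical-JSON encoding, the sorting of `train_hosts`, and the helper names `open_queue` and `sync_job` are all assumptions made for the example.

```python
import hashlib
import json
import sqlite3
import uuid


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Stable ID: sha256 over a canonical JSON encoding of the job tuple.

    Assumption: hyper is a JSON-serialisable dict and host order is
    irrelevant, so keys and hosts are sorted before hashing.
    """
    canonical = json.dumps(
        [model, mode, hyper, split_recipe, sorted(train_hosts), seed],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_queue(path=":memory:"):
    """Open (or create) a queue database with a hypothetical minimal schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_jobs ("
        " id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " claim_token TEXT)"
    )
    conn.commit()
    return conn


def sync_job(conn, jid):
    """Idempotent reload step: an existing row keeps its current status."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO training_jobs (id) VALUES (?)", (jid,)
        )


def claim_next(conn, worker):
    """Claim the oldest pending job with one UPDATE; None if nothing is pending."""
    token = uuid.uuid4().hex  # lets us find the row this call claimed
    with conn:
        cur = conn.execute(
            "UPDATE training_jobs"
            " SET status = 'claimed', worker = ?, claim_token = ?"
            " WHERE status = 'pending'"
            "   AND id = (SELECT id FROM training_jobs"
            "             WHERE status = 'pending' ORDER BY rowid LIMIT 1)",
            (worker, token),
        )
    if cur.rowcount == 0:
        return None
    row = conn.execute(
        "SELECT id FROM training_jobs WHERE claim_token = ?", (token,)
    ).fetchone()
    return row[0]
```

Because the selection and the status change happen in one statement, two workers calling `claim_next` cannot both move the same row out of pending. Re-running `sync_job` with the same ID after a claim leaves the claimed row untouched, which is the reload behaviour the message describes.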
__init__.py	Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency	2026-04-28 23:34:04 -06:00
test_auto_fetch_samples.py	auto_fetch_samples: pick Linux i386 ELF; manifest matches theZoo	2026-05-01 03:28:26 -05:00
test_collectors_emit.py	PIPELINE §5 step 5: collector admission emit tests (§4.4)	2026-05-04 01:37:40 -05:00
test_containment.py	PIPELINE §5 step 3: target VM build infrastructure + containment posture	2026-05-04 01:31:40 -05:00
test_doctor_shipping.py	shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors	2026-05-01 12:02:59 -05:00
test_episode.py	meta.json: stamp code_version (commit, branch, dirty) per episode	2026-05-01 01:29:01 -05:00
test_event_driven_labeller.py	PIPELINE §5 step 6: event-driven labeller (§4.5)	2026-05-04 01:43:16 -05:00
test_exploits.py	catalog: remove samba_usermap_script — never landed sessions in prod	2026-05-03 22:48:03 -05:00
test_fleet.py	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml	2026-05-04 01:25:01 -05:00
test_fleet_health.py	fleet-health: exit 0 when alerts found (don't mark unit failed)	2026-05-02 13:51:20 -05:00
test_fleet_manifest.py	training/fleet: distributed multi-host trainer with capability gating	2026-05-08 01:20:20 -05:00
test_fleet_queue.py	training/fleet: distributed multi-host trainer with capability gating	2026-05-08 01:20:20 -05:00
test_guest_agent.py	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts	2026-04-30 00:02:27 -05:00
test_host_health.py	fleet-health: proactive alerts on the Pi + per-host doctor reports	2026-05-02 13:48:31 -05:00
test_manifest.py	PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml	2026-05-04 01:25:01 -05:00
test_pcap.py	Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts	2026-04-30 00:02:27 -05:00
test_perf_qemu.py	Close out the open issues: bridge pcap wiring, perf collector, Tier-4	2026-04-30 00:17:49 -05:00
test_proc_qemu.py	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
test_prune.py	Multi-signal prune classifier: rescue valid episodes /proc misses	2026-04-30 19:10:01 -05:00
test_qmp.py	Close out the deployment-readiness gaps	2026-04-30 00:31:55 -05:00
test_quarantine_unstamped.py	fix: lab-host install loop after commit-gate cutover	2026-05-01 11:36:21 -05:00
test_receiver.py	Add receiver: PUT /v1/episodes ingest with sha256 verify and idempotency	2026-04-28 23:34:04 -06:00
test_shipper.py	shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors	2026-05-01 12:02:59 -05:00
test_target_spec.py	PIPELINE §5 step 3: target VM build infrastructure + containment posture	2026-05-04 01:31:40 -05:00
test_tier3_local_verify.py	tools/verify_tier3_local.py: Pi-runnable Tier-3 verifier	2026-05-01 03:41:21 -05:00
test_tier4.py	Close out the deployment-readiness gaps	2026-04-30 00:31:55 -05:00
test_training_checkpoint.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
test_training_features.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
test_training_split.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
test_ulid.py	Add v0 orchestrator + first oracle collector (host /proc)	2026-04-28 23:40:25 -06:00
test_verify_catalog.py	PIPELINE §5 step 4: catalog admission verifier (§4.3)	2026-05-04 01:35:32 -05:00
test_version_gate.py	robustness: gate falls back to local git, queue sweeps stale tarballs	2026-05-01 11:49:38 -05:00
test_vm_load_controller.py	workload audit trail: meta.sample + per-phase events + pre-kill probe	2026-04-30 02:12:34 -05:00