CIS490/tools
Max 308140c6ce training: lambda-cloud one-shot training integration
External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.

Three scripts + one ingest tool:

  scripts/build-lambda-bundle.sh
    Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
      - the repo (sans .git, sans data/, sans artifacts*)
      - data/processed/{validation_v1,features_window_v1}.parquet
      - data/processed/feature_schema_v1.json
      - data/processed/tensor_window_v1/   (npz shards)
      - bootstrap.sh (entrypoint)
      - training_manifest.toml (the canonical job list)
      - BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
    Verifies all four data inputs exist BEFORE compressing 5+ GB.

  scripts/run-on-lambda.sh ubuntu@<ip>
    rsync bundle up → ssh + run bootstrap → rsync artifacts +
    reports/eval back to artifacts-lambda/ + reports/lambda/.
    Resumable rsync; sha256-verified.

  scripts/lambda-bootstrap.sh   (runs ON the Lambda instance)
    Creates .venv with cu121 torch + xgboost + the [training] deps,
    iterates the manifest's job list in priority order (highest first),
    runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
    skips jobs whose .ckpt.json already exists (idempotent on re-run),
    writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
    stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.

  tools/ingest_lambda_artifacts.py
    Bundles each (ckpt.json + sidecar + train.json) trio into a
    .tar.zst, sha256, PUTs to the local trainer-receiver's
    /v1/model/{job_id}, marks the job complete. Maps (model, mode) →
    job_id by re-reading the canonical manifest. Handles the queue
    state churn (requeue if completed, claim if pending, fail-back
    on race losses).

End-to-end smoke verified on the A100 instance just provisioned:
  - SSH from Pi via ed25519 keypair (cis490-trainer-pi)
  - GPU: A100-SXM4-40GB, driver 580.105.08
  - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
  - 464 GB ephemeral disk available

Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:04 -05:00
..
auto_fetch_samples.py auto_fetch_samples: pick Linux i386 ELF; manifest matches theZoo 2026-05-01 03:28:26 -05:00
build_cidata.py PIPELINE §5 step 1: fix four root-cause defects 2026-05-03 17:05:25 -05:00
build_target.py PIPELINE §5 step 3: target VM build infrastructure + containment posture 2026-05-04 01:31:40 -05:00
check_fleet_health.py fleet-health: exit 0 when alerts found (don't mark unit failed) 2026-05-02 13:51:20 -05:00
cis490_doctor.py shipper: systemd watchdog, quarantine cleanup; doctor surfaces ship errors 2026-05-01 12:02:59 -05:00
cis490_jobs.py training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
dataset_validate.py training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers 2026-05-08 01:19:00 -05:00
fetch_sample.py Close out the open issues: bridge pcap wiring, perf collector, Tier-4 2026-04-30 00:17:49 -05:00
index_backfill.py Receiver enforces X-Cis490-Code-Commit allow-list (live, auto-refreshed) 2026-05-01 01:38:50 -05:00
index_reader.py Close out the deployment-readiness gaps 2026-04-30 00:31:55 -05:00
ingest_lambda_artifacts.py training: lambda-cloud one-shot training integration 2026-05-08 12:32:04 -05:00
load_mimic.py Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
plot_envelope.py Close out the deployment-readiness gaps 2026-04-30 00:31:55 -05:00
prune_episodes.py Multi-signal prune classifier: rescue valid episodes /proc misses 2026-04-30 19:10:01 -05:00
quarantine_unstamped.py fix: lab-host install loop after commit-gate cutover 2026-05-01 11:36:21 -05:00
run_campaign.py Add automated campaign runner, shipper, and systemd units 2026-04-30 14:53:40 -06:00
run_envelope_demo.py Synthetic envelope demo: phase-driven load mimic + plotter 2026-04-28 23:53:20 -06:00
run_fleet.py PIPELINE §5 step 2: canonical manifest at <repo>/manifest.toml 2026-05-04 01:25:01 -05:00
run_real_vm_demo.py Tier-2 episodes use clean-only schedule; .gitignore VERSION 2026-05-04 01:55:37 -05:00
run_tier3_demo.py Tier-3 fixes: b'' probe false-positive, requires_bridge, msgpack 2026-05-05 15:15:18 -06:00
ship_health_check.py fleet-health: proactive alerts on the Pi + per-host doctor reports 2026-05-02 13:48:31 -05:00
shipper.py Add automated campaign runner, shipper, and systemd units 2026-04-30 14:53:40 -06:00
show_envelope.sh Interactive envelope plot via WebAgg (browser-based) 2026-04-29 00:06:22 -06:00
verify_catalog.py PIPELINE §5 step 4: catalog admission verifier (§4.3) 2026-05-04 01:35:32 -05:00
verify_tier3_local.py tools/verify_tier3_local.py: Pi-runnable Tier-3 verifier 2026-05-01 03:41:21 -05:00
vm_load_controller.py Fix workload-silent false-positive on Alpine busybox guests (closes #15) 2026-04-30 17:28:48 -05:00
vm_serial.py Tier 2: real Alpine VM, real workload, real envelope 2026-04-29 08:38:53 -06:00