CIS490/scripts
Max 69c563275a training: parallelize lambda bootstrap (2 jobs at a time on the A100)
At our model sizes (max ~250 K params, max batch 512), each training
process uses ~1 GiB VRAM, so two concurrent jobs come nowhere near
saturating a 40 GiB A100. A bounded-concurrency rolling launcher cuts
the full 14-job manifest from ~3.5 h sequential to ~1.7 h parallel.

  PARALLEL=2 (default): override via env var when running on a smaller
  GPU or when testing the queue logic.

Per-job logs still land at logs/<model>_<mode>.log; failure reporting
is unchanged. Still idempotent: already-present checkpoints are skipped.
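
The launcher described above can be sketched roughly as follows. This is a minimal illustration, not the contents of lambda-bootstrap.sh: the `train_one` stub, the `model:mode` job format, the `CKPT_DIR`/`LOG_DIR` variables, and the `.pt` checkpoint name are all assumptions made for the example. Only `PARALLEL` and the `logs/<model>_<mode>.log` path come from the commit message. It relies on `wait -n`, which needs bash 4.3 or newer.

```shell
#!/usr/bin/env bash
# Sketch of a bounded-concurrency rolling launcher (hypothetical names throughout).

PARALLEL="${PARALLEL:-2}"            # max concurrent jobs, overridable via env
LOG_DIR="${LOG_DIR:-logs}"           # assumed variable; the commit only names logs/
CKPT_DIR="${CKPT_DIR:-checkpoints}"  # assumed checkpoint location

# Hypothetical stand-in for the real training command.
train_one() {
  local model="$1" mode="$2"
  echo "training ${model} (${mode})"
  touch "${CKPT_DIR}/${model}_${mode}.pt"
}

# Usage: run_manifest model:mode [model:mode ...]
run_manifest() {
  local failed=0 running=0 job model mode
  mkdir -p "$LOG_DIR" "$CKPT_DIR"
  for job in "$@"; do
    model="${job%%:*}"
    mode="${job##*:}"
    # Idempotency: skip jobs whose checkpoint already exists.
    if [[ -e "${CKPT_DIR}/${model}_${mode}.pt" ]]; then
      echo "skip ${model}_${mode}"
      continue
    fi
    # Rolling window: when all slots are busy, wait for any one job to finish.
    if (( running >= PARALLEL )); then
      wait -n || failed=$((failed + 1))
      running=$((running - 1))
    fi
    train_one "$model" "$mode" >"${LOG_DIR}/${model}_${mode}.log" 2>&1 &
    running=$((running + 1))
  done
  # Drain the jobs that are still running.
  while (( running > 0 )); do
    wait -n || failed=$((failed + 1))
    running=$((running - 1))
  done
  echo "failed=${failed}"
  return $(( failed > 0 ))
}
```

Calling `PARALLEL=1 run_manifest mlp:full cnn:full` (made-up job names) would run the same manifest sequentially, which is one way to exercise the queue logic without a GPU.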

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:37:03 -05:00
auto-update.sh
build-lambda-bundle.sh training: lambda-cloud one-shot training integration 2026-05-08 12:32:04 -05:00
fetch-alpine-baseline.sh
fetch-lab-host-cert.sh
fetch-metasploitable2.sh
install-lab-host.sh
install-msfrpcd.sh
install-receiver.sh
install-tier-3-4.sh
install-training-worker-windows.ps1 training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
install-training-worker.sh training/fleet: distributed multi-host trainer with capability gating 2026-05-08 01:20:20 -05:00
issue-cis490-client-cert-wrapper.sh
lambda-bootstrap.sh training: parallelize lambda bootstrap (2 jobs at a time on the A100) 2026-05-08 12:37:03 -05:00
run-on-lambda.sh training: lambda-cloud one-shot training integration 2026-05-08 12:32:04 -05:00
sync-training-data.sh training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers 2026-05-08 01:19:00 -05:00