At our model sizes (max ~250 K params, max batch 512), each training process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with two concurrent jobs. Bounded-concurrency rolling launcher cuts sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest. PARALLEL=2 (default) — override via env var if running on a smaller GPU or testing the queue logic. Per-job logs still land at logs/<model>_<mode>.log; failure reporting is the same. Idempotent: skipping already-present checkpoints unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| auto-update.sh | ||
| build-lambda-bundle.sh | ||
| fetch-alpine-baseline.sh | ||
| fetch-lab-host-cert.sh | ||
| fetch-metasploitable2.sh | ||
| install-lab-host.sh | ||
| install-msfrpcd.sh | ||
| install-receiver.sh | ||
| install-tier-3-4.sh | ||
| install-training-worker-windows.ps1 | ||
| install-training-worker.sh | ||
| issue-cis490-client-cert-wrapper.sh | ||
| lambda-bootstrap.sh | ||
| run-on-lambda.sh | ||
| sync-training-data.sh | ||