At our model sizes (max ~250 K params, max batch 512), each training
process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with
two concurrent jobs. Bounded-concurrency rolling launcher cuts
sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest.
PARALLEL=2 (default) — override via env var if running on a smaller GPU
or testing the queue logic.
Per-job logs still land at logs/<model>_<mode>.log; failure reporting
is the same. Idempotent: skipping already-present checkpoints unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>