External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.
Three scripts + one ingest tool:
scripts/build-lambda-bundle.sh
Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
- the repo (sans .git, sans data/, sans artifacts*)
- data/processed/{validation_v1,features_window_v1}.parquet
- data/processed/feature_schema_v1.json
- data/processed/tensor_window_v1/ (npz shards)
- bootstrap.sh (entrypoint)
- training_manifest.toml (the canonical job list)
- BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
Verifies all four data inputs exist BEFORE compressing 5+ GB.
scripts/run-on-lambda.sh ubuntu@<ip>
rsync bundle up → ssh + run bootstrap → rsync artifacts +
reports/eval back to artifacts-lambda/ + reports/lambda/.
Resumable rsync; sha256-verified.
scripts/lambda-bootstrap.sh (runs ON the Lambda instance)
Creates .venv with cu121 torch + xgboost + the [training] deps,
iterates the manifest's job list in priority order (highest first),
runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
skips jobs whose .ckpt.json already exists (idempotent on re-run),
writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.
tools/ingest_lambda_artifacts.py
Bundles each (ckpt.json + sidecar + train.json) trio into a
.tar.zst, sha256, PUTs to the local trainer-receiver's
/v1/model/{job_id}, marks the job complete. Maps (model, mode) →
job_id by re-reading the canonical manifest. Handles the queue
state churn (requeue if completed, claim if pending, fail-back
on race losses).
End-to-end smoke verified on the A100 instance just provisioned:
- SSH from Pi via ed25519 keypair (cis490-trainer-pi)
- GPU: A100-SXM4-40GB, driver 580.105.08
- venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
- 464 GB ephemeral disk available
Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
101 lines
3.7 KiB
Bash
Executable file
101 lines
3.7 KiB
Bash
Executable file
#!/usr/bin/env bash
|
|
# End-to-end driver: rsync the bundle to a Lambda instance, run training,
|
|
# rsync artifacts back. Run on the Pi.
|
|
#
|
|
# Usage:
|
|
# bash scripts/run-on-lambda.sh ubuntu@<lambda-ip>
|
|
#
|
|
# What this does:
|
|
# 1. Verifies the bundle exists (if not: build it first)
|
|
# 2. rsync the bundle to ~/cis490-bundle.tar.zst on the Lambda instance
|
|
# 3. SSH in, extract, run bootstrap.sh, stream logs back to the Pi
|
|
# 4. rsync the resulting artifacts/ + reports/ back to the Pi
|
|
# 5. Print a summary; you decide whether to ingest into the local
|
|
# trainer-receiver via tools/ingest-lambda-artifacts.py
|
|
#
|
|
# The script is idempotent on the Lambda side: if you re-run with the
|
|
# same bundle, it skips the rsync (sha256 match) and re-runs training
|
|
# from where bootstrap left off (each model checks for an existing
|
|
# .ckpt.json before retraining).
|
|
set -euo pipefail
|
|
|
|
REMOTE="${1:-}"
|
|
if [[ -z "$REMOTE" ]]; then
|
|
echo "usage: $0 ubuntu@<lambda-ip>" >&2
|
|
exit 1
|
|
fi
|
|
|
|
REPO_ROOT="${REPO_ROOT:-/home/max/.env/CIS490}"
|
|
OUT_DIR="${OUT_DIR:-/tmp/cis490-lambda}"
|
|
SSH_KEY="${SSH_KEY:-$HOME/.ssh/lambda_ed25519}"
|
|
SSH_OPTS=(-i "$SSH_KEY" -o StrictHostKeyChecking=accept-new -o ServerAliveInterval=30)
|
|
|
|
# Find the latest bundle (most-recently-modified .tar.zst)
|
|
BUNDLE=$(ls -t "$OUT_DIR"/lambda-bundle-*.tar.zst 2>/dev/null | head -1 || true)
|
|
if [[ -z "$BUNDLE" ]]; then
|
|
echo "no bundle found in $OUT_DIR. run scripts/build-lambda-bundle.sh first." >&2
|
|
exit 1
|
|
fi
|
|
SHORT=$(basename "$BUNDLE" .tar.zst | sed 's/^lambda-bundle-//')
|
|
|
|
echo "=== bundle ==="
|
|
ls -lh "$BUNDLE" "$BUNDLE.sha256" 2>/dev/null
|
|
echo
|
|
echo "=== remote ==="
|
|
echo " $REMOTE (key=$SSH_KEY)"
|
|
echo
|
|
|
|
# Sanity: can we ssh?
|
|
if ! ssh "${SSH_OPTS[@]}" -o ConnectTimeout=10 "$REMOTE" 'echo connected' 2>&1; then
|
|
echo "ssh to $REMOTE failed. Check the IP, key permissions, and that" >&2
|
|
echo "the instance is fully booted." >&2
|
|
exit 1
|
|
fi
|
|
|
|
# Rsync the bundle. -P resumes partial transfers + shows progress.
|
|
echo "=== rsync bundle → lambda ==="
|
|
rsync -P --partial -e "ssh ${SSH_OPTS[*]}" "$BUNDLE" "$REMOTE:cis490-bundle.tar.zst"
|
|
rsync -P --partial -e "ssh ${SSH_OPTS[*]}" "$BUNDLE.sha256" "$REMOTE:cis490-bundle.tar.zst.sha256"
|
|
echo
|
|
|
|
# Run bootstrap remotely. We pipe stdout/stderr back so the operator
|
|
# sees training progress live.
|
|
echo "=== running bootstrap.sh on lambda ==="
|
|
ssh "${SSH_OPTS[@]}" "$REMOTE" 'bash -s' <<'REMOTE_SCRIPT'
|
|
set -euo pipefail
|
|
cd "$HOME"
|
|
|
|
# Verify the bundle if we have the sha256 alongside it
|
|
if [[ -f cis490-bundle.tar.zst.sha256 ]]; then
|
|
if ! sha256sum -c cis490-bundle.tar.zst.sha256 >/dev/null 2>&1; then
|
|
echo "bundle sha256 mismatch — corrupted rsync? aborting." >&2
|
|
exit 2
|
|
fi
|
|
fi
|
|
|
|
# Extract into ~/cis490 (delete prior extraction if present)
|
|
mkdir -p cis490
|
|
cd cis490
|
|
tar --use-compress-program='zstd -T0' -xf ../cis490-bundle.tar.zst
|
|
|
|
# Hand off to the bundle's bootstrap.sh
|
|
exec bash bootstrap.sh
|
|
REMOTE_SCRIPT
|
|
|
|
echo
|
|
echo "=== bootstrap returned ok; rsync artifacts back ==="
|
|
mkdir -p "$REPO_ROOT/artifacts-lambda" "$REPO_ROOT/reports/lambda"
|
|
|
|
rsync -av --partial -e "ssh ${SSH_OPTS[*]}" \
|
|
"$REMOTE:cis490/artifacts/" "$REPO_ROOT/artifacts-lambda/"
|
|
rsync -av --partial -e "ssh ${SSH_OPTS[*]}" \
|
|
"$REMOTE:cis490/reports/" "$REPO_ROOT/reports/lambda/"
|
|
|
|
echo
|
|
echo "✓ artifacts pulled back"
|
|
echo " $REPO_ROOT/artifacts-lambda/ ($(du -sh "$REPO_ROOT/artifacts-lambda" | awk '{print $1}'))"
|
|
echo " $REPO_ROOT/reports/lambda/ ($(du -sh "$REPO_ROOT/reports/lambda" | awk '{print $1}'))"
|
|
echo
|
|
echo "next: bash scripts/ingest-lambda-artifacts.sh # uploads each artifact"
|
|
echo " # to the local trainer-receiver"
|
|
echo " # so cis490-jobs status reflects them"
|