CIS490/scripts/run-on-lambda.sh
Max 308140c6ce training: lambda-cloud one-shot training integration
External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.

Three scripts + one ingest tool:

  scripts/build-lambda-bundle.sh
    Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
      - the repo (sans .git, sans data/, sans artifacts*)
      - data/processed/{validation_v1,features_window_v1}.parquet
      - data/processed/feature_schema_v1.json
      - data/processed/tensor_window_v1/   (npz shards)
      - bootstrap.sh (entrypoint)
      - training_manifest.toml (the canonical job list)
      - BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
    Verifies all four data inputs exist BEFORE compressing 5+ GB.

  scripts/run-on-lambda.sh ubuntu@<ip>
    rsync bundle up → ssh + run bootstrap → rsync artifacts +
    reports/eval back to artifacts-lambda/ + reports/lambda/.
    Resumable rsync; sha256-verified.

  scripts/lambda-bootstrap.sh   (runs ON the Lambda instance)
    Creates .venv with cu121 torch + xgboost + the [training] deps,
    iterates the manifest's job list in priority order (highest first),
    runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
    skips jobs whose .ckpt.json already exists (idempotent on re-run),
    writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
    stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.

  tools/ingest_lambda_artifacts.py
    Bundles each (ckpt.json + sidecar + train.json) trio into a
    .tar.zst, sha256, PUTs to the local trainer-receiver's
    /v1/model/{job_id}, marks the job complete. Maps (model, mode) →
    job_id by re-reading the canonical manifest. Handles the queue
    state churn (requeue if completed, claim if pending, fail-back
    on race losses).

End-to-end smoke verified on the A100 instance just provisioned:
  - SSH from Pi via ed25519 keypair (cis490-trainer-pi)
  - GPU: A100-SXM4-40GB, driver 580.105.08
  - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
  - 464 GB ephemeral disk available

Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:04 -05:00

101 lines
3.7 KiB
Bash
Executable file

#!/usr/bin/env bash
# End-to-end driver: rsync the bundle to a Lambda instance, run training,
# rsync artifacts back. Run on the Pi.
#
# Usage:
# bash scripts/run-on-lambda.sh ubuntu@<lambda-ip>
#
# What this does:
# 1. Verifies the bundle exists (if not: build it first)
# 2. rsync the bundle to ~/cis490-bundle.tar.zst on the Lambda instance
# 3. SSH in, extract, run bootstrap.sh, stream logs back to the Pi
# 4. rsync the resulting artifacts/ + reports/ back to the Pi
# 5. Print a summary; you decide whether to ingest into the local
# trainer-receiver via tools/ingest-lambda-artifacts.py
#
# The script is idempotent on the Lambda side: if you re-run with the
# same bundle, it skips the rsync (sha256 match) and re-runs training
# from where bootstrap left off (each model checks for an existing
# .ckpt.json before retraining).
set -euo pipefail
REMOTE="${1:-}"
if [[ -z "$REMOTE" ]]; then
echo "usage: $0 ubuntu@<lambda-ip>" >&2
exit 1
fi
REPO_ROOT="${REPO_ROOT:-/home/max/.env/CIS490}"
OUT_DIR="${OUT_DIR:-/tmp/cis490-lambda}"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/lambda_ed25519}"
SSH_OPTS=(-i "$SSH_KEY" -o StrictHostKeyChecking=accept-new -o ServerAliveInterval=30)
# Find the latest bundle (most-recently-modified .tar.zst)
BUNDLE=$(ls -t "$OUT_DIR"/lambda-bundle-*.tar.zst 2>/dev/null | head -1 || true)
if [[ -z "$BUNDLE" ]]; then
echo "no bundle found in $OUT_DIR. run scripts/build-lambda-bundle.sh first." >&2
exit 1
fi
SHORT=$(basename "$BUNDLE" .tar.zst | sed 's/^lambda-bundle-//')
echo "=== bundle ==="
ls -lh "$BUNDLE" "$BUNDLE.sha256" 2>/dev/null
echo
echo "=== remote ==="
echo " $REMOTE (key=$SSH_KEY)"
echo
# Sanity: can we ssh?
if ! ssh "${SSH_OPTS[@]}" -o ConnectTimeout=10 "$REMOTE" 'echo connected' 2>&1; then
echo "ssh to $REMOTE failed. Check the IP, key permissions, and that" >&2
echo "the instance is fully booted." >&2
exit 1
fi
# Rsync the bundle. -P resumes partial transfers + shows progress.
echo "=== rsync bundle → lambda ==="
rsync -P --partial -e "ssh ${SSH_OPTS[*]}" "$BUNDLE" "$REMOTE:cis490-bundle.tar.zst"
rsync -P --partial -e "ssh ${SSH_OPTS[*]}" "$BUNDLE.sha256" "$REMOTE:cis490-bundle.tar.zst.sha256"
echo
# Run bootstrap remotely. We pipe stdout/stderr back so the operator
# sees training progress live.
echo "=== running bootstrap.sh on lambda ==="
ssh "${SSH_OPTS[@]}" "$REMOTE" 'bash -s' <<'REMOTE_SCRIPT'
set -euo pipefail
cd "$HOME"
# Verify the bundle if we have the sha256 alongside it
if [[ -f cis490-bundle.tar.zst.sha256 ]]; then
if ! sha256sum -c cis490-bundle.tar.zst.sha256 >/dev/null 2>&1; then
echo "bundle sha256 mismatch — corrupted rsync? aborting." >&2
exit 2
fi
fi
# Extract into ~/cis490 (delete prior extraction if present)
mkdir -p cis490
cd cis490
tar --use-compress-program='zstd -T0' -xf ../cis490-bundle.tar.zst
# Hand off to the bundle's bootstrap.sh
exec bash bootstrap.sh
REMOTE_SCRIPT
echo
echo "=== bootstrap returned ok; rsync artifacts back ==="
mkdir -p "$REPO_ROOT/artifacts-lambda" "$REPO_ROOT/reports/lambda"
rsync -av --partial -e "ssh ${SSH_OPTS[*]}" \
"$REMOTE:cis490/artifacts/" "$REPO_ROOT/artifacts-lambda/"
rsync -av --partial -e "ssh ${SSH_OPTS[*]}" \
"$REMOTE:cis490/reports/" "$REPO_ROOT/reports/lambda/"
echo
echo "✓ artifacts pulled back"
echo " $REPO_ROOT/artifacts-lambda/ ($(du -sh "$REPO_ROOT/artifacts-lambda" | awk '{print $1}'))"
echo " $REPO_ROOT/reports/lambda/ ($(du -sh "$REPO_ROOT/reports/lambda" | awk '{print $1}'))"
echo
echo "next: bash scripts/ingest-lambda-artifacts.sh # uploads each artifact"
echo " # to the local trainer-receiver"
echo " # so cis490-jobs status reflects them"