External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.

Three scripts + one ingest tool:

  scripts/build-lambda-bundle.sh
    Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
      - the repo (sans .git, sans data/, sans artifacts*)
      - data/processed/{validation_v1,features_window_v1}.parquet
      - data/processed/feature_schema_v1.json
      - data/processed/tensor_window_v1/ (npz shards)
      - bootstrap.sh (entrypoint)
      - training_manifest.toml (the canonical job list)
      - BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
    Verifies all four data inputs exist BEFORE compressing 5+ GB.

  scripts/run-on-lambda.sh ubuntu@<ip>
    rsync bundle up → ssh + run bootstrap → rsync artifacts +
    reports/eval back to artifacts-lambda/ + reports/lambda/.
    Resumable rsync; sha256-verified.

  scripts/lambda-bootstrap.sh (runs ON the Lambda instance)
    Creates .venv with cu121 torch + xgboost + the [training] deps,
    iterates the manifest's job list in priority order (highest first),
    runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
    skips jobs whose .ckpt.json already exists (idempotent on re-run),
    writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
    stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.

  tools/ingest_lambda_artifacts.py
    Bundles each (ckpt.json + sidecar + train.json) trio into a
    .tar.zst, sha256, PUTs to the local trainer-receiver's
    /v1/model/{job_id}, marks the job complete. Maps (model, mode) →
    job_id by re-reading the canonical manifest. Handles the queue
    state churn (requeue if completed, claim if pending, fail-back
    on race losses).
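
The idempotent, priority-ordered loop described for scripts/lambda-bootstrap.sh can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script itself: the one-job-per-line "priority model mode" input, the artifacts/<model>_<mode>.ckpt.json layout, and the run_job stub are all hypothetical, whereas the real bootstrap reads training_manifest.toml and calls trainer/run.py or run_ssl.py.

```shell
#!/usr/bin/env bash
# Sketch only: the "priority model mode" job format, the checkpoint
# layout, and run_job are assumptions made for illustration.
set -euo pipefail

ARTIFACTS="${ARTIFACTS:-artifacts}"

# Hypothetical stand-in for the real trainer call; it only writes the
# marker file that a finished job is assumed to leave behind.
run_job() {
  local model="$1" mode="$2"
  echo "training ${model} (${mode})"
  : > "$ARTIFACTS/${model}_${mode}.ckpt.json"
}

# Reads "priority model mode" lines on stdin and runs them highest
# priority first, skipping any job whose checkpoint already exists.
run_jobs() {
  local sorted priority model mode ran=0 skipped=0 failed=0
  mkdir -p "$ARTIFACTS"
  sorted=$(sort -k1,1nr)                  # numeric, descending priority
  while read -r priority model mode; do
    [[ -n "$model" ]] || continue         # ignore blank lines
    if [[ -e "$ARTIFACTS/${model}_${mode}.ckpt.json" ]]; then
      skipped=$((skipped + 1))            # done on an earlier run
    elif run_job "$model" "$mode" </dev/null >&2; then
      ran=$((ran + 1))
    else
      failed=$((failed + 1))              # keep going; report at the end
    fi
  done <<< "$sorted"
  echo "ran=$ran skipped=$skipped failed=$failed"
}
```

Feeding the same job list to run_jobs twice trains everything on the first pass and skips everything on the second, which is the property that makes a re-run after a dropped SSH session cheap. A failed job is counted rather than aborting the loop, in the spirit of the failed-job list in RUN_SUMMARY.json.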

End-to-end smoke verified on the A100 instance just provisioned:
  - SSH from Pi via ed25519 keypair (cis490-trainer-pi)
  - GPU: A100-SXM4-40GB, driver 580.105.08
  - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
  - 464 GB ephemeral disk available

Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scripts/build-lambda-bundle.sh (120 lines, 4.4 KiB, Bash, executable file)

#!/usr/bin/env bash
# Build a self-contained tarball ready for rsync to a Lambda GPU instance.
#
# Inputs:
#   - The repo at /home/max/.env/CIS490 (or $REPO_ROOT)
#   - data/processed/validation_v1.parquet
#   - data/processed/features_window_v1.parquet
#   - data/processed/feature_schema_v1.json
#   - data/processed/tensor_window_v1/ (npz shards, one per episode)
#
# Output:
#   $OUT_DIR/lambda-bundle-<git-short>.tar.zst
#
# What's IN the bundle:
#   - repo/ (sans .git, sans data/, sans artifacts*, sans .venv*)
#   - data/processed/ (the four artifacts above)
#   - bootstrap.sh (entrypoint that runs ON Lambda)
#   - training_manifest.toml (the operator's canonical plan; bootstrap loops over jobs)
#
# What's NOT in the bundle:
#   - raw .tar.zst episodes (not needed once tensors are pre-built)
#   - .git directory (we ship a code snapshot, not history)
#   - prior artifacts/ (Lambda generates fresh)
#
# Run on the Pi:
#   bash scripts/build-lambda-bundle.sh
set -euo pipefail

REPO_ROOT="${REPO_ROOT:-/home/max/.env/CIS490}"
OUT_DIR="${OUT_DIR:-/tmp/cis490-lambda}"
SHORT=$(cd "$REPO_ROOT" && git rev-parse --short HEAD)
BUNDLE="$OUT_DIR/lambda-bundle-$SHORT.tar.zst"

mkdir -p "$OUT_DIR"

# Check the four required inputs exist BEFORE we start tarring 5 GB.
required=(
  "$REPO_ROOT/data/processed/validation_v1.parquet"
  "$REPO_ROOT/data/processed/features_window_v1.parquet"
  "$REPO_ROOT/data/processed/feature_schema_v1.json"
  "$REPO_ROOT/data/processed/tensor_window_v1"
)
for r in "${required[@]}"; do
  if [[ ! -e "$r" ]]; then
    echo "missing required input: $r" >&2
    echo "did the Pi-side feature build finish? check data/logs/build_features_full.log" >&2
    exit 1
  fi
done

# Stage the manifest into the bundle's working dir so bootstrap can read it.
STAGE="$(mktemp -d)"
trap 'rm -rf "$STAGE"' EXIT

# Pre-built data the Lambda instance needs
mkdir -p "$STAGE/data/processed"
cp "$REPO_ROOT/data/processed/validation_v1.parquet" "$STAGE/data/processed/"
cp "$REPO_ROOT/data/processed/features_window_v1.parquet" "$STAGE/data/processed/"
cp "$REPO_ROOT/data/processed/feature_schema_v1.json" "$STAGE/data/processed/"
cp -r "$REPO_ROOT/data/processed/tensor_window_v1" "$STAGE/data/processed/"

# Code snapshot: exclude .git, runtime caches, and anything under data/
mkdir -p "$STAGE/repo"
rsync -a \
  --exclude='.git/' \
  --exclude='.venv*/' \
  --exclude='__pycache__/' \
  --exclude='*.pyc' \
  --exclude='data/' \
  --exclude='artifacts*/' \
  --exclude='reports/eval/' \
  --exclude='reports/pca/' \
  --exclude='reports/xai/' \
  --exclude='reports/fleet-*/' \
  --exclude='/tmp/*' \
  --exclude='vm/images/' \
  --exclude='vm/snapshots/' \
  "$REPO_ROOT/" "$STAGE/repo/"

# The bootstrap script Lambda runs after extracting the bundle.
cp "$REPO_ROOT/scripts/lambda-bootstrap.sh" "$STAGE/bootstrap.sh"
chmod +x "$STAGE/bootstrap.sh"

# Use the canonical training manifest as the job list. If the operator
# wants a different plan, they edit etc/training_manifest.toml.example
# and we ship the edited version.
cp "$REPO_ROOT/etc/training_manifest.toml.example" \
   "$STAGE/training_manifest.toml"

# Manifest pinning: Lambda gets a stamp of what code commit produced
# this bundle, so rerunning against the same data with the same code
# is reproducible. n_episodes is read from the parquet footer using the
# training venv under $REPO_ROOT; it is left empty if that lookup fails.
cat > "$STAGE/BUNDLE_MANIFEST.json" <<EOF
{
  "code_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD)",
  "code_commit_short": "$SHORT",
  "code_branch": "$(cd "$REPO_ROOT" && git rev-parse --abbrev-ref HEAD)",
  "code_dirty": "$(cd "$REPO_ROOT" && git status --porcelain | wc -l | xargs)",
  "built_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "built_on": "$(hostname)",
  "n_episodes": "$("$REPO_ROOT/.venv-training/bin/python" -c "import sys, pyarrow.parquet as pq; print(pq.read_metadata(sys.argv[1]).num_rows)" "$STAGE/data/processed/validation_v1.parquet" 2>/dev/null)",
  "n_tensor_shards": "$(find "$STAGE/data/processed/tensor_window_v1" -name '*.npz' | wc -l | xargs)"
}
EOF

# tar.zst (zstd > gzip for both speed and ratio on this kind of payload)
echo "compressing bundle to $BUNDLE..."
tar -C "$STAGE" --use-compress-program='zstd -T0 -3' -cf "$BUNDLE" .

# Stamp the bundle's own sha256 so rsync resume + verify is stable.
sha256sum "$BUNDLE" > "$BUNDLE.sha256"

# Report
size=$(du -sh "$BUNDLE" | awk '{print $1}')
echo
echo "✓ bundle ready"
echo "  $BUNDLE ($size)"
echo "  $BUNDLE.sha256"
echo
echo "next: bash scripts/run-on-lambda.sh ubuntu@<lambda-ip>"
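
Because the .sha256 sidecar is written by running sha256sum on the full $BUNDLE path, it records that path verbatim, so a plain `sha256sum -c` only finds the file where the same path exists. A path-independent check that a receiving side could run might look like the sketch below. The helper name verify_bundle is hypothetical, and this is not necessarily how scripts/run-on-lambda.sh performs its verification.

```shell
#!/usr/bin/env bash
# Sketch only: verify_bundle is a hypothetical helper, not part of the repo.
set -euo pipefail

# Compare a bundle's digest with the first field of its .sha256 sidecar,
# ignoring whatever path the sidecar recorded at build time.
verify_bundle() {
  local bundle="$1" sidecar="${2:-$1.sha256}"
  local expected actual
  expected=$(awk 'NR == 1 { print $1 }' "$sidecar")
  actual=$(sha256sum "$bundle" | awk '{ print $1 }')
  [[ -n "$expected" && "$expected" == "$actual" ]]
}
```

A caller would use it as a guard, for example `verify_bundle ./lambda-bundle-<short>.tar.zst || exit 1`, before extracting the tarball. Comparing only the digest field keeps the check valid after the bundle has been copied to a different directory.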