CIS490/scripts/sync-training-data.sh
Max 1fabd4a246 training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers
The model layer of the project, built honestly:

  - tools/dataset_validate.py — full-sweep validator over the receiver
    store (sha256, schema, monotonic labels, telemetry-row gate). On the
    current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
    7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
    is committed as the per-episode acceptance index.

  - training/_features.py — channel registry (46 channels across
    proc/guest/qmp/netflow), summary-stat windowing AND channel×time
    tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
    (Unix ns) — tested fix for a real netflow-vs-host clock-base
    inconsistency that was silently dropping every netflow channel.

  - training/_split.py — three held-out recipes (host / sample / time)
    with profile-stratification assertions. held_out_host carries
    untested_profiles for cases like scan-and-dial absent from the test
    host (5 of 6 profiles tested cross-device, never silently averaged).

  - training/models/ — 6 architectures behind a common BaseModel
    interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
    trained twice (realistic / oracle) per the deployment threat model.
    Schema-hashed checkpoints refuse to load if _features.py changed
    since training (silent-input-drift protection, tested).

  - training/trainer/ — unified training loop: class-weighted CE, LR
    warmup + cosine, gradient clipping, mixed precision when CUDA,
    early stopping on val macro F1, best-on-val checkpoint. Same loop
    runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
    early_stopping_rounds on val mlogloss.

  - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
    per-profile and per-host breakdown, paired-bootstrap significance
    for model-vs-model gap. Confusion matrix uses union of seen labels.

  - training/dashboard/producers/ — replay/metrics/perf/profiles
    emitting the six event types the dashboard's awaiting scenes
    consume; on-demand tensor extraction so the Pi can run live
    inference without 65 GB of shards.

  - 17 unit tests (split coverage, features round-trip, schema mismatch,
    determinism, time-base alignment regression).

End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:19:00 -05:00

#!/usr/bin/env bash
# Pull training data from the receiver Pi to a local trainer box.
#
# Run this on the trainer (e.g. the Windows/2070-Super box via WSL or a
# Linux desktop). Requires WireGuard up to 10.100.0.1 with `cis490-trainer`
# enrollment so SSH key auth works.
#
# What gets pulled (both land in ${LOCAL_DIR}):
#   /var/lib/cis490/episodes/      raw .tar.zst episode tarballs (~3GB)
#   /var/lib/cis490/index.jsonl    shipped-episode index
#
# Not pulled, because it is already committed in the repo:
#   data/processed/validation_v1.parquet   validator output
#
# Once those are local you can run:
#   uv run --group training python training/build_features.py \
#       --validation data/processed/validation_v1.parquet \
#       --store ./episodes \
#       --out-dir data/processed
#
# Then training/train_gbt.py and training/train_nn.py.
set -euo pipefail
PI_HOST="${PI_HOST:-10.100.0.1}"
PI_USER="${PI_USER:-max}"
LOCAL_DIR="${LOCAL_DIR:-./episodes}"
mkdir -p "${LOCAL_DIR}"
echo "→ rsyncing episodes from ${PI_USER}@${PI_HOST}:/var/lib/cis490/episodes/"
rsync -ah --info=progress2 \
    --exclude='*.partial' \
    "${PI_USER}@${PI_HOST}:/var/lib/cis490/episodes/" \
    "${LOCAL_DIR}/"
echo "→ rsyncing index.jsonl"
rsync -a --info=progress2 \
    "${PI_USER}@${PI_HOST}:/var/lib/cis490/index.jsonl" \
    "${LOCAL_DIR}/index.jsonl"
echo "done. ${LOCAL_DIR} contains:"
du -sh "${LOCAL_DIR}"
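# `head` exits after 10 lines. Under `pipefail`, a long listing can make `ls`
# die of SIGPIPE and turn a successful sync into a non-zero exit, so ignore
# the status of this purely informational pipeline.
ls "${LOCAL_DIR}/" | head || true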
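
A possible post-sync sanity check, sketched under assumptions the script itself does not confirm: that index.jsonl holds one record per shipped episode, and that every shipped episode appears as a `*.tar.zst` file somewhere under the local directory. The helper names (`count_episodes`, `count_index_records`, `check_sync`) are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count episode tarballs under a directory (recursive).
count_episodes() {
    find "$1" -type f -name '*.tar.zst' | wc -l | tr -d '[:space:]'
}

# Count non-empty lines in a JSONL file.
# grep -c exits 1 when the count is 0, so tolerate that status.
count_index_records() {
    grep -c . "$1" || true
}

# Compare the two counts; returns 0 only when they match.
# ASSUMPTION: one index line per episode tarball.
check_sync() {
    local dir="$1"
    local index="${2:-$1/index.jsonl}"
    local n_tar n_idx
    if [ ! -f "$index" ]; then
        echo "missing index: ${index}" >&2
        return 2
    fi
    n_tar="$(count_episodes "$dir")"
    n_idx="$(count_index_records "$index")"
    echo "tarballs=${n_tar} index_records=${n_idx}"
    [ "$n_tar" -eq "$n_idx" ]
}

# Example call (commented out so sourcing this file has no side effects):
# check_sync "${LOCAL_DIR:-./episodes}" || echo "warning: counts differ" >&2
```

A mismatch is not necessarily an error: the index may also list episodes that were later pruned on the receiver, so treat the result as a prompt to investigate rather than a hard failure.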