# training/

Train a behavioral malware detector from labeled episode tarballs. Six architectures × two threat-model modes = twelve trained models, evaluated head-to-head on a held-out-by-host split.

## What lives where

```
training/
  _episode_io.py        tarball decoder
  _features.py          channel registry + windowing (summary + tensor)
  _split.py             held-out recipes (host / sample / time)
  build_features.py     summary-stat parquet builder
  build_tensors.py      channel × time tensor shard builder
  models/               6 architectures behind a common BaseModel interface
    gbt.py              XGBoost on summary features
    mlp.py              MLP on summary features (NN baseline parity to GBT)
    cnn.py              1D-CNN on tensor windows
    gru.py              GRU on tensor windows
    lstm.py             LSTM on tensor windows
    transformer.py      small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
  _checkpoint.py        schema-hashed save/load; refuses mismatches
  trainer/
    run.py              end-to-end training driver (one model at a time)
    _loop.py            shared training loop: class-weighted CE, LR warmup +
                        cosine, early stop on val macro F1, best-on-val
    _data.py            loaders for summary parquet + tensor shards
  eval_/
    run.py              load every checkpoint, score, write comparison_v2.md
    _metrics.py         macro F1 + per-class F1 with bootstrap 95 % CIs;
                        paired-bootstrap significance for model-vs-model
    breakdown.py        per-profile, per-host metric tables
  producers/            live event sources for the dashboard (replay, metrics,
                        perf, profiles) → see training/dashboard/PRODUCERS.md
```

## The honesty rules this implements

1. **Held-out by host (primary):** train on `elliott-thinkpad`, test on `k-gamingcom`. Tests cross-device generalization, the claim a deployed model has to support. 5 of 6 profiles are populated cross-device; `scan-and-dial` is *untested* (k-gamingcom never ran it) and explicitly reported as such, not silently averaged in.
2. **Profile-stratified, sample-stratified, or time:** all three split recipes are available via `--split-recipe {host,sample,time}`. `held_out_sample` excludes profiles with too few unique sample_names (the split would be mathematically unsound otherwise); the dataset has 2 such profiles today (`cpu-saturate`, `low-and-slow`).
3. **In-distribution val carved from train host** for hyperparameter selection. The test set is never touched at training time.
4. **Class-weighted cross-entropy** computed from the train slice (inverse frequency, clipped). Class imbalance is real (`armed`/`infecting` rare, `infected_running` common) and unweighted loss under-trains on the operationally interesting phases.
5. **Best-on-val checkpoint** selected by macro F1 (not accuracy, which hides imbalance). Early stopping with patience=8. LR warmup (5 % of steps) + cosine decay to 0.
6. **Schema-hashed checkpoints.** Every saved model carries a sha256 of its input schema. Loading a checkpoint against a changed `_features.py` registry raises `ValueError` instead of silently feeding mis-aligned columns to the model.
7. **Bootstrap CIs on every test metric.** Reporting `macro_f1 = 0.873 ± 0.012` is the bar; a single point estimate from one finite test set is dishonest.
8. **Paired-bootstrap significance** for the model-vs-model gap. CI excludes 0 → significant.
9. **NaN handling for the `degraded` set** (k-gamingcom shipped without netflow): NaNs are carried through standardization and zeroed afterwards, but a missingness mask is kept on tensor data so the sequence models can learn to discount sparse channels. (Indicator features for summary models are a v2 enhancement.)

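Rules 7 and 8 can be illustrated with a small stdlib-only sketch. This is not the code in `eval_/_metrics.py`; the function names (`macro_f1`, `bootstrap_ci`, `paired_bootstrap_gap`), the percentile method, and the choice to resample individual windows are all assumptions made for illustration. The actual module may resample at a different granularity (for example per episode).

```python
import random
from typing import List, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of per-class F1. A class with no support and no
    predictions contributes 0 (an assumption of this sketch)."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / n_classes


def _percentile(sorted_vals: List[float], q: float) -> float:
    """Nearest-rank percentile on an already sorted list, q in [0, 1]."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def bootstrap_ci(y_true, y_pred, n_classes, n_boot=1000, alpha=0.05, seed=0) -> Tuple[float, float]:
    """Percentile bootstrap CI on macro F1, resampling windows with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], n_classes))
    stats.sort()
    return _percentile(stats, alpha / 2), _percentile(stats, 1 - alpha / 2)


def paired_bootstrap_gap(y_true, pred_a, pred_b, n_classes, n_boot=1000, alpha=0.05, seed=0):
    """CI on macro_f1(A) - macro_f1(B). Both models are scored on the SAME
    resampled indices, which is what makes the comparison paired."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f1_a = macro_f1(yt, [pred_a[i] for i in idx], n_classes)
        f1_b = macro_f1(yt, [pred_b[i] for i in idx], n_classes)
        gaps.append(f1_a - f1_b)
    gaps.sort()
    lo, hi = _percentile(gaps, alpha / 2), _percentile(gaps, 1 - alpha / 2)
    significant = not (lo <= 0.0 <= hi)  # CI excludes 0 -> significant
    return lo, hi, significant
```

Pairing matters because both models see the same test windows: resampling them independently would inflate the variance of the gap and hide real differences.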
## Pipeline

```
/var/lib/cis490/episodes/            ← raw .tar.zst
/var/lib/cis490/index.jsonl
                 │
                 ▼
tools/dataset_validate.py            ← full-sweep validator
                 │
                 ▼
data/processed/validation_v1.parquet ← committed
                 │
        ┌────────┴─────────────────────────┐
        ▼                                  ▼   (rsync to GPU box)
training/build_features.py    training/build_tensors.py
        │                                  │
        ▼                                  ▼
features_window_v1.parquet    tensor_window_v1/host=*/.npz
feature_schema_v1.json        (channel × time, ~12 GB at full scale)
        │                                  │
        └────────┬─────────────────────────┘
                 ▼
training/trainer/run.py  (per model × mode)
                 │
                 ▼
artifacts/_.ckpt.json + sidecar (.pt or .xgb.json)
                 │
                 ▼
training/eval_/run.py
                 │
                 ▼
reports/eval/comparison_v2.md
reports/eval/__eval.json  (full per-phase, per-profile, per-host metrics with CIs)
```

## Quickstart on the GPU box

```sh
git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
  bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed
uv run --group training python training/build_tensors.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
  for mode in realistic oracle; do
    uv run --group training python -m training.trainer.run \
      --model "$model" --mode "$mode" \
      --validation data/processed/validation_v1.parquet \
      --summary data/processed/features_window_v1.parquet \
      --tensors data/processed/tensor_window_v1 \
      --schema data/processed/feature_schema_v1.json \
      --train-hosts elliott-thinkpad \
      --epochs 60
  done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
  --validation data/processed/validation_v1.parquet \
  --summary data/processed/features_window_v1.parquet \
  --tensors data/processed/tensor_window_v1 \
  --reports-dir reports/eval
```

## Live dashboard

Producers under `training/producers/` push events to the `dashboard.wg` WebSocket via the canonical `training.dashboard.client.Publisher` (loopback HTTP, stdlib-only). See [`../dashboard/PRODUCERS.md`](../dashboard/PRODUCERS.md) for the event contract.

```sh
# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.producers.metrics \
  --validation data/processed/validation_v1.parquet \
  --artifacts artifacts \
  --summary data/processed/features_window_v1.parquet \
  --tensors data/processed/tensor_window_v1

# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.producers.replay \
  --episode /var/lib/cis490/episodes/elliott-thinkpad/.tar.zst \
  --host-id elliott-thinkpad \
  --artifacts artifacts
```

## Tests

```sh
pytest tests/test_training_split.py tests/test_training_features.py \
  tests/test_training_checkpoint.py
```

Guards: split coverage assertions, time-base alignment (the `t_wall_ns` vs `t_mono_ns` netflow regression), counter-to-rate correctness, schema-mismatch rejection, deterministic split.

## Open data-quality issues found while building this

Surfaced for the writeup, not silently worked around:

- **`receiver/store.py:130` torn write:** index.jsonl line 19500 has two records concatenated. The "atomic for sub-PIPE_BUF" comment isn't holding. Validator skips and warns; producer-side fix needed.
- **k-gamingcom silent downgrade:** ~24 k episodes shipped without `netflow.jsonl`. Per AGENTS.md "Do not silently downgrade a host", this is a producer hard-rule violation. We accept them as `degraded` and train, but the realistic model loses bridge-pcap signal on those.
- **`scan-and-dial` absent from k-gamingcom:** held-out-by-host can't evaluate that profile cross-device. Reported as `untested_profiles` in every metrics output rather than averaged in.
- **Cross-source clock drift:** `_features.py` aligns on `t_wall_ns` because netflow's `t_mono_ns` is system-uptime, not episode-relative. Fix is in this repo; the producer should be patched to emit episode-relative `t_mono_ns` consistently.
- **Sample diversity is low** (12 unique sample_names total across 6 profiles). `held_out_sample` only fits the `io-walk` profile. Held-out-by-host is the right primary eval until more samples are added.
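
The torn-write issue above (two records concatenated on one `index.jsonl` line) is the kind of damage a line-level scan can flag. The following is a minimal sketch of such a check, not the logic in `tools/dataset_validate.py`; the helper names `split_records` and `scan_index` and the record fields in the usage example are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def split_records(line: str) -> List[object]:
    """Decode every JSON value found back-to-back on one line.
    Raises json.JSONDecodeError if any part of the line is not valid JSON."""
    decoder = json.JSONDecoder()
    records = []
    pos, n = 0, len(line)
    while pos < n:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < n and line[pos].isspace():
            pos += 1
        if pos >= n:
            break
        obj, pos = decoder.raw_decode(line, pos)
        records.append(obj)
    return records


def scan_index(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, n_records) for every non-blank line that does not
    hold exactly one record. n_records == 0 means the line failed to parse."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            count = len(split_records(line))
        except json.JSONDecodeError:
            count = 0  # truncated or interleaved bytes
        if count != 1:
            yield lineno, count
```

A usage example with made-up records:

```python
sample = ['{"id": 1}\n', '{"id": 2}{"id": 3}\n', '{"id": 4']
for lineno, count in scan_index(sample):
    print(f"line {lineno}: {count} record(s)")
```

A scan like this only detects the damage. It cannot tell whether the concatenated records are themselves complete, so the producer-side fix is still needed.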