CIS490/training/README.md
Max 1fabd4a246 training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers
The model layer of the project, built honestly:

  - tools/dataset_validate.py — full-sweep validator over the receiver
    store (sha256, schema, monotonic labels, telemetry-row gate). On the
    current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
    7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
    is committed as the per-episode acceptance index.

  - training/_features.py — channel registry (46 channels across
    proc/guest/qmp/netflow), summary-stat windowing AND channel×time
    tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
    (Unix ns) — tested fix for a real netflow-vs-host clock-base
    inconsistency that was silently dropping every netflow channel.

  - training/_split.py — three held-out recipes (host / sample / time)
    with profile-stratification assertions. held_out_host carries
    untested_profiles for cases like scan-and-dial absent from the test
    host (5 of 6 profiles tested cross-device, never silently averaged).

  - training/models/ — 6 architectures behind a common BaseModel
    interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
    trained twice (realistic / oracle) per the deployment threat model.
    Schema-hashed checkpoints refuse to load if _features.py changed
    since training (silent-input-drift protection, tested).

  - training/trainer/ — unified training loop: class-weighted CE, LR
    warmup + cosine, gradient clipping, mixed precision when CUDA,
    early stopping on val macro F1, best-on-val checkpoint. Same loop
    runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
    early_stopping_rounds on val mlogloss.

  - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
    per-profile and per-host breakdown, paired-bootstrap significance
    for model-vs-model gap. Confusion matrix uses union of seen labels.

  - training/dashboard/producers/ — replay/metrics/perf/profiles
    emitting the six event types the dashboard's awaiting scenes
    consume; on-demand tensor extraction so the Pi can run live
    inference without 65 GB of shards.

  - 17 unit tests (split coverage, features round-trip, schema mismatch,
    determinism, time-base alignment regression).

End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:19:00 -05:00


# training/
Train a behavioral malware detector from labeled episode tarballs.
Six architectures × two threat-model modes = twelve trained models,
evaluated head-to-head on a held-out-by-host split.
## What lives where
```
training/
  _episode_io.py        tarball decoder
  _features.py          channel registry + windowing (summary + tensor)
  _split.py             held-out recipes (host / sample / time)
  build_features.py     summary-stat parquet builder
  build_tensors.py      channel × time tensor shard builder
  models/               6 architectures behind a common BaseModel interface
    gbt.py              XGBoost on summary features
    mlp.py              MLP on summary features (NN baseline parity to GBT)
    cnn.py              1D-CNN on tensor windows
    gru.py              GRU on tensor windows
    lstm.py             LSTM on tensor windows
    transformer.py      small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
    _checkpoint.py      schema-hashed save/load; refuses mismatches
  trainer/
    run.py              end-to-end training driver (one model at a time)
    _loop.py            shared training loop: class-weighted CE, LR warmup +
                        cosine, early stop on val macro F1, best-on-val
    _data.py            loaders for summary parquet + tensor shards
  eval_/
    run.py              load every checkpoint, score, write comparison_v2.md
    _metrics.py         macro F1 + per-class F1 with bootstrap 95 % CIs;
                        paired-bootstrap significance for model-vs-model
    breakdown.py        per-profile, per-host metric tables
  dashboard/producers/  live event emitters; see ../dashboard/PRODUCERS.md
```
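As orientation for the `_loop.py` entry, here is a minimal sketch of two pieces it is described as owning: class weights derived from the train slice, and an LR schedule with warmup followed by cosine decay. This is not the project's code. The function names, the "balanced" form of inverse frequency, and the clip bounds are assumptions made for illustration; only the 5 % warmup and decay-to-0 shape come from this README.

```python
import math
from collections import Counter


def class_weights(labels, clip=(0.25, 4.0)):
    """Balanced inverse-frequency weights, n / (k * count), clipped to a range.

    The clip bounds are illustrative assumptions, not the project's values.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    lo, hi = clip
    return {c: min(max(n / (k * cnt), lo), hi) for c, cnt in counts.items()}


def lr_at(step, total_steps, base_lr, warmup_frac=0.05):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Computing the weights from the train slice only keeps test-set label frequencies out of the loss. Clipping stops a very rare phase from dominating every gradient step.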
## The honesty rules this implements
1. **Held-out by host (primary):** train on `elliott-thinkpad`, test on
`k-gamingcom`. Tests cross-device generalization, the claim a deployed
model has to support. 5 of 6 profiles populated cross-device;
`scan-and-dial` is *untested* (k-gamingcom never ran it) and explicitly
reported as such, not silently averaged in.
2. **Profile-stratified, sample-stratified, or time:** all three split
recipes are available via `--split-recipe {host,sample,time}`.
`held_out_sample` excludes profiles with too few unique sample_names,
since a profile needs enough distinct samples to place disjoint ones in
train and test; the dataset has 2 such profiles today
(`cpu-saturate`, `low-and-slow`).
3. **In-distribution val carved from train host** for hyperparameter
selection. Test set is never touched at training time.
4. **Class-weighted cross-entropy** computed from the train slice
(inverse frequency, clipped). Class imbalance is real
(`armed`/`infecting` rare, `infected_running` common) and unweighted
loss under-trains on the operationally interesting phases.
5. **Best-on-val checkpoint** selected by macro F1 (not accuracy —
accuracy hides imbalance). Early stopping with patience=8.
LR warmup (5 % of steps) + cosine decay to 0.
6. **Schema-hashed checkpoints.** Every saved model carries a sha256
of its input schema. Loading a checkpoint against a changed
`_features.py` registry raises `ValueError` instead of silently
feeding mis-aligned columns to the model.
7. **Bootstrap CIs on every test metric.** Reporting
`macro_f1 = 0.873 ± 0.012` is the bar; a single point estimate
from one finite test set is dishonest.
8. **Paired-bootstrap significance** for the model-vs-model gap. If the CI
of the paired difference excludes 0, the gap is significant.
9. **NaN handling for the `degraded` set** (k-gamingcom shipped without
netflow): NaNs are set to 0 after standardization, and a missingness
mask is kept on tensor data so the sequence models can learn to
discount sparse channels. (Indicator features for the summary models
are a v2 enhancement.)
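
Rules 7 and 8 can be illustrated with a small stdlib-only sketch. It is not the code in `eval_/_metrics.py`. The function names, the nearest-rank percentile, and the choice to resample individual predictions are assumptions; this README does not say whether the project resamples windows or whole episodes.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over a fixed label set."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    idx = min(len(sorted_vals) - 1, max(0, round(q * (len(sorted_vals) - 1))))
    return sorted_vals[idx]


def bootstrap_ci(y_true, pred_a, pred_b=None, n_boot=1000, seed=0):
    """95% percentile CI for macro F1 of A, or for F1(A) - F1(B) if pred_b is given.

    The paired variant scores both models on the same resampled indices,
    so test-set noise shared by the two models cancels in the difference.
    """
    rng = random.Random(seed)
    labels = sorted(set(y_true))
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        stat = macro_f1(yt, [pred_a[i] for i in idx], labels)
        if pred_b is not None:
            stat -= macro_f1(yt, [pred_b[i] for i in idx], labels)
        stats.append(stat)
    stats.sort()
    return percentile(stats, 0.025), percentile(stats, 0.975)
```

Calling `bootstrap_ci(y_true, pred_a)` gives the interval for one model; `bootstrap_ci(y_true, pred_a, pred_b)` gives the interval for the gap, which is read as significant when it excludes 0.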
## Pipeline
```
/var/lib/cis490/episodes/                    ← raw .tar.zst
/var/lib/cis490/index.jsonl
                 │
                 ▼
tools/dataset_validate.py                    ← full-sweep validator
                 │
                 ▼
data/processed/validation_v1.parquet         ← committed
                 │
        ┌────────┴─────────┐
        ▼                  ▼                 (rsync to GPU box)
training/build_features.py        training/build_tensors.py
        │                                  │
        ▼                                  ▼
features_window_v1.parquet        tensor_window_v1/host=*/<id>.npz
feature_schema_v1.json            (channel × time, ~12 GB at full scale)
        │                                  │
        └────────┬─────────────────────────┘
                 ▼
training/trainer/run.py
(per model × mode)
                 │
                 ▼
artifacts/<model>_<mode>.ckpt.json + sidecar (.pt or .xgb.json)
                 │
                 ▼
training/eval_/run.py
                 │
                 ▼
reports/eval/comparison_v2.md
reports/eval/<model>_<mode>_eval.json        (full per-phase, per-profile,
                                              per-host metrics with CIs)
```
## Quickstart on the GPU box
```sh
git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
  bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed
uv run --group training python training/build_tensors.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
  for mode in realistic oracle; do
    uv run --group training python -m training.trainer.run \
      --model "$model" --mode "$mode" \
      --validation data/processed/validation_v1.parquet \
      --summary data/processed/features_window_v1.parquet \
      --tensors data/processed/tensor_window_v1 \
      --schema data/processed/feature_schema_v1.json \
      --train-hosts elliott-thinkpad \
      --epochs 60
  done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
  --validation data/processed/validation_v1.parquet \
  --summary data/processed/features_window_v1.parquet \
  --tensors data/processed/tensor_window_v1 \
  --reports-dir reports/eval
```
## Live dashboard
Producers under `training/dashboard/producers/` push events to the
`dashboard.wg` WebSocket via the canonical
`training.dashboard.client.Publisher` (loopback HTTP, stdlib-only).
See [`../dashboard/PRODUCERS.md`](../dashboard/PRODUCERS.md) for the
event contract.
```sh
# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.dashboard.producers.metrics \
--validation data/processed/validation_v1.parquet \
--artifacts artifacts \
--summary data/processed/features_window_v1.parquet \
--tensors data/processed/tensor_window_v1
# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.dashboard.producers.replay \
--episode /var/lib/cis490/episodes/elliott-thinkpad/<id>.tar.zst \
--host-id elliott-thinkpad \
--artifacts artifacts
```
## Tests
```sh
pytest tests/test_training_split.py tests/test_training_features.py \
tests/test_training_checkpoint.py
```
Guards: split coverage assertions, time-base alignment (the
`t_wall_ns` vs `t_mono_ns` netflow regression), counter-to-rate
correctness, schema-mismatch rejection, deterministic split.
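
The schema-mismatch guard these tests exercise can be pictured with the sketch below. It is an illustration, not the contents of `_checkpoint.py`. The helper names, the `schema_sha256` key, and the example channel names are hypothetical; the sha256-of-schema idea and the `ValueError` on mismatch are what this README describes.

```python
import hashlib
import json


def schema_hash(schema):
    """sha256 over a canonical JSON encoding of the input schema.

    Dict keys are sorted, but list order is preserved on purpose:
    reordering channels changes the column layout the model was trained on.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_meta(path, schema, extra=None):
    """Write checkpoint metadata carrying the schema hash (hypothetical layout)."""
    meta = {"schema_sha256": schema_hash(schema), **(extra or {})}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(meta, fh)


def load_meta(path, current_schema):
    """Load metadata, refusing if the current schema hashes differently."""
    with open(path, encoding="utf-8") as fh:
        meta = json.load(fh)
    expected, actual = meta["schema_sha256"], schema_hash(current_schema)
    if expected != actual:
        raise ValueError(
            f"schema mismatch: checkpoint {expected[:12]} vs current {actual[:12]}"
        )
    return meta
```

A regression test in this style saves with one schema, swaps two channel names, and asserts that loading raises `ValueError`.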
## Open data-quality issues found while building this
Surfaced for the writeup, not silently worked around:
- **`receiver/store.py:130` torn write** — index.jsonl line 19500 has
two records concatenated. The "atomic for sub-PIPE_BUF" comment isn't
holding. Validator skips and warns; producer-side fix needed.
- **k-gamingcom silent downgrade** — ~24 k episodes shipped without
`netflow.jsonl`. Per AGENTS.md "Do not silently downgrade a host"
this is a producer hard-rule violation. We accept them as `degraded`
and train, but the realistic model loses bridge-pcap signal on those.
- **`scan-and-dial` absent from k-gamingcom** — held-out-by-host can't
evaluate that profile cross-device. Reported as `untested_profiles`
in every metrics output rather than averaged in.
- **Cross-source clock drift** — `_features.py` aligns on `t_wall_ns`
because netflow's `t_mono_ns` is system-uptime, not episode-relative.
Fix is in this repo; the producer should be patched to emit
episode-relative `t_mono_ns` consistently.
- **Sample diversity is low** (12 unique sample_names total across 6
profiles). `held_out_sample` only fits the `io-walk` profile.
Held-out-by-host is the right primary eval until more samples are
added.
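
For the torn-write issue, a reader can detect a line holding two concatenated records without crashing. The sketch below shows one way to do that with the standard library. It is not the validator's implementation, which skips and warns; the function name and return shape are assumptions for illustration.

```python
import json


def split_records(line):
    """Decode every JSON object found back-to-back on one line.

    Returns (records, torn). torn is True when the line did not hold exactly
    one complete object: concatenated records, a truncated tail, or nothing.
    """
    decoder = json.JSONDecoder()
    text = line.strip()
    records, pos, clean = [], 0, True
    while pos < len(text):
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            clean = False  # trailing bytes are not a complete JSON value
            break
        records.append(obj)
        pos = end
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return records, (not clean) or len(records) != 1
```

A caller that wants the skip-and-warn behaviour can drop any line where `torn` is true and log its line number; a caller that wants to salvage data can keep the decoded records.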