# training/

Train a behavioral malware detector from labeled episode tarballs.
Six architectures × two threat-model modes = twelve trained models,
evaluated head-to-head on a held-out-by-host split.

## What lives where

```
training/
  _episode_io.py          tarball decoder
  _features.py            channel registry + windowing (summary + tensor)
  _split.py               held-out recipes (host / sample / time)
  build_features.py       summary-stat parquet builder
  build_tensors.py        channel × time tensor shard builder
  models/                 6 architectures behind a common BaseModel interface
    gbt.py                XGBoost on summary features
    mlp.py                MLP on summary features (NN baseline parity to GBT)
    cnn.py                1D-CNN on tensor windows
    gru.py                GRU on tensor windows
    lstm.py               LSTM on tensor windows
    transformer.py        small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
    _checkpoint.py        schema-hashed save/load — refuses mismatches
  trainer/
    run.py                end-to-end training driver (one model at a time)
    _loop.py              shared training loop: class-weighted CE, LR warmup +
                          cosine, early stop on val macro F1, best-on-val
    _data.py              loaders for summary parquet + tensor shards
  eval_/
    run.py                load every checkpoint, score, write comparison_v2.md
    _metrics.py           macro F1 + per-class F1 with bootstrap 95 % CIs;
                          paired-bootstrap significance for model-vs-model
    breakdown.py          per-profile, per-host metric tables
  dashboard/producers/    live event emitters — see ../dashboard/PRODUCERS.md
```

## The honesty rules this implements

1. **Held-out by host (primary):** train on `elliott-thinkpad`, test on
   `k-gamingcom`. Tests cross-device generalization, the claim a deployed
   model has to support. 5 of 6 profiles have data on both hosts;
   `scan-and-dial` is *untested* (k-gamingcom never ran it) and is
   explicitly reported as such, not silently averaged in.

2. **Profile-stratified, sample-stratified, or time:** all three split
   recipes are available via `--split-recipe {host,sample,time}`.
   `held_out_sample` excludes profiles with too few unique sample_names
   (the split would be mathematically unsound otherwise); the dataset
   has 2 such profiles today (`cpu-saturate`, `low-and-slow`).

3. **In-distribution val carved from train host** for hyperparameter
   selection. Test set is never touched at training time.

4. **Class-weighted cross-entropy** computed from the train slice
   (inverse frequency, clipped). Class imbalance is real
   (`armed`/`infecting` rare, `infected_running` common) and unweighted
   loss under-trains on the operationally interesting phases.

5. **Best-on-val checkpoint** selected by macro F1 (not accuracy —
   accuracy hides imbalance). Early stopping with patience=8.
   LR warmup (5 % of steps) + cosine decay to 0.

6. **Schema-hashed checkpoints.** Every saved model carries a sha256
   of its input schema. Loading a checkpoint against a changed
   `_features.py` registry raises `ValueError` instead of silently
   feeding mis-aligned columns to the model.

7. **Bootstrap CIs on every test metric.** Reporting
   `macro_f1 = 0.873 ± 0.012` is the bar; a single point estimate
   from one finite test set is dishonest.

8. **Paired-bootstrap significance** for every model-vs-model gap: the
   gap is significant when its CI excludes 0.

9. **NaN handling for the `degraded` set** (k-gamingcom shipped without
   netflow): NaNs pass through standardization and are set to 0
   afterwards, but a missingness mask is kept on tensor data so the
   sequence models can learn to discount sparse channels. (Indicator
   features for summary models are a v2 enhancement.)
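
Rules 7 and 8 can be illustrated with a small stdlib-only sketch. This is
not the code in `eval_/_metrics.py`; the function names, the percentile
method, the resampling unit (individual test rows), and the convention that
an empty class scores F1 = 0 are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of per-class F1. A class with no support and no
    predictions scores 0.0 here (an assumption; conventions differ)."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / n_classes


def _interval(values: List[float], alpha: float) -> Tuple[float, float]:
    """Percentile interval (nearest rank) covering 1 - alpha of the values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(alpha / 2 * last)], ordered[round((1 - alpha / 2) * last)]


def bootstrap_ci(y_true, y_pred, n_classes, n_boot=1000, alpha=0.05, seed=0):
    """CI on macro F1, resampling test rows with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx], n_classes))
    return _interval(scores, alpha)


def paired_bootstrap_gap(y_true, pred_a, pred_b, n_classes,
                         n_boot=1000, alpha=0.05, seed=0):
    """CI on macro_f1(A) - macro_f1(B). Both models are scored on the same
    resampled rows in each replicate, which is what makes the test paired."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        gaps.append(macro_f1(truth, [pred_a[i] for i in idx], n_classes)
                    - macro_f1(truth, [pred_b[i] for i in idx], n_classes))
    lo, hi = _interval(gaps, alpha)
    return lo, hi, not (lo <= 0.0 <= hi)  # significant if the CI excludes 0
```

Pairing matters because both models see the same hard and easy rows in each
replicate, so the shared difficulty of a resample cancels out of the gap.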

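The inverse-frequency, clipped class weights from rule 4 might look roughly
like the sketch below. The function name, the `n / (n_classes * count)`
normalization, and the clip bounds are assumptions, not values taken from
`trainer/_loop.py`.

```python
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(train_labels: Sequence[int], n_classes: int,
                              min_w: float = 0.1, max_w: float = 10.0) -> List[float]:
    """Per-class loss weights n / (n_classes * count_c), clipped to [min_w, max_w].

    The clip bounds are illustrative assumptions. A class absent from the
    train slice gets max_w rather than a division by zero.
    """
    counts = Counter(train_labels)
    n = len(train_labels)
    weights = []
    for c in range(n_classes):
        raw = n / (n_classes * counts[c]) if counts[c] else max_w
        weights.append(min(max(raw, min_w), max_w))
    return weights


# Toy train slice: class 0 is common, classes 1 and 2 are rare.
labels = [0] * 90 + [1] * 9 + [2] * 1
weights = inverse_frequency_weights(labels, n_classes=3)
```

The weights are computed from the train slice only, so the test
distribution never leaks into the loss. With PyTorch, a list like this can
be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`.
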
## Pipeline

```
   /var/lib/cis490/episodes/                ← raw .tar.zst
   /var/lib/cis490/index.jsonl
            │
            ▼
   tools/dataset_validate.py                ← full-sweep validator
            │
            ▼
   data/processed/validation_v1.parquet     ← committed
            │
   ┌────────┴─────────┐
   ▼                  ▼                       (rsync to GPU box)
   training/build_features.py     training/build_tensors.py
            │                                  │
            ▼                                  ▼
   features_window_v1.parquet       tensor_window_v1/host=*/<id>.npz
   feature_schema_v1.json           (channel × time, ~12 GB at full scale)
            │                                  │
            └────────┬─────────────────────────┘
                     ▼
          training/trainer/run.py
            (per model × mode)
                     │
                     ▼
   artifacts/<model>_<mode>.ckpt.json   + sidecar (.pt or .xgb.json)
                     │
                     ▼
          training/eval_/run.py
                     │
                     ▼
   reports/eval/comparison_v2.md
   reports/eval/<model>_<mode>_eval.json   (full per-phase, per-profile,
                                            per-host metrics with CIs)
```

## Quickstart on the GPU box

```sh
git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
    bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
    --validation data/processed/validation_v1.parquet \
    --store      ./episodes \
    --out-dir    data/processed

uv run --group training python training/build_tensors.py \
    --validation data/processed/validation_v1.parquet \
    --store      ./episodes \
    --out-dir    data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
  for mode in realistic oracle; do
    uv run --group training python -m training.trainer.run \
      --model $model --mode $mode \
      --validation data/processed/validation_v1.parquet \
      --summary    data/processed/features_window_v1.parquet \
      --tensors    data/processed/tensor_window_v1 \
      --schema     data/processed/feature_schema_v1.json \
      --train-hosts elliott-thinkpad \
      --epochs 60
  done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
    --validation data/processed/validation_v1.parquet \
    --summary    data/processed/features_window_v1.parquet \
    --tensors    data/processed/tensor_window_v1 \
    --reports-dir reports/eval
```

## Live dashboard

Producers under `training/dashboard/producers/` push events to the
`dashboard.wg` WebSocket via the canonical
`training.dashboard.client.Publisher` (loopback HTTP, stdlib-only).
See [`../dashboard/PRODUCERS.md`](../dashboard/PRODUCERS.md) for the
event contract.

```sh
# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.dashboard.producers.metrics \
    --validation data/processed/validation_v1.parquet \
    --artifacts artifacts \
    --summary data/processed/features_window_v1.parquet \
    --tensors data/processed/tensor_window_v1

# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.dashboard.producers.replay \
    --episode  /var/lib/cis490/episodes/elliott-thinkpad/<id>.tar.zst \
    --host-id  elliott-thinkpad \
    --artifacts artifacts
```

## Tests

```sh
pytest tests/test_training_split.py tests/test_training_features.py \
       tests/test_training_checkpoint.py
```

Guards: split coverage assertions, time-base alignment (the
`t_wall_ns` vs `t_mono_ns` netflow regression), counter-to-rate
correctness, schema-mismatch rejection, deterministic split.
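
As a rough illustration of the schema-mismatch guard, the sketch below
hashes a canonical JSON form of the schema and refuses to load on a
mismatch. The helper names, the `schema_sha256` key, and the example schema
fields are hypothetical; the real logic lives in `models/_checkpoint.py`
and may differ.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """sha256 over a canonical JSON encoding, so dict key order is irrelevant
    but list order (e.g. channel order) is significant."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_schema(checkpoint_meta: dict, current_schema: dict) -> None:
    """Raise ValueError if the checkpoint was trained against another schema."""
    expected = checkpoint_meta["schema_sha256"]  # hypothetical key name
    actual = schema_hash(current_schema)
    if expected != actual:
        raise ValueError(
            f"schema mismatch: checkpoint {expected[:12]} != current {actual[:12]}"
        )


# Hypothetical schema: the channel names and window length are made up.
schema = {"channels": ["cpu_pct", "net_rx_rate"], "window_s": 10}
meta = {"schema_sha256": schema_hash(schema)}
check_schema(meta, schema)  # same schema: loads silently
```

Hashing the ordered channel list means that even a reordering of columns,
which would leave shapes intact, is rejected.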

## Open data-quality issues found while building this

Surfaced for the writeup, not silently worked around:

- **`receiver/store.py:130` torn write** — index.jsonl line 19500 has
  two records concatenated, so the "atomic for sub-PIPE_BUF" comment in
  that code does not hold in practice. The validator skips the line and
  warns; a producer-side fix is still needed.
- **k-gamingcom silent downgrade** — ~24 k episodes shipped without
  `netflow.jsonl`. Per the AGENTS.md rule "Do not silently downgrade a
  host", this is a producer hard-rule violation. We accept them as
  `degraded` and train on them, but the realistic model loses
  bridge-pcap signal on those episodes.
- **`scan-and-dial` absent from k-gamingcom** — held-out-by-host can't
  evaluate that profile cross-device. Reported as `untested_profiles`
  in every metrics output rather than averaged in.
- **Cross-source clock drift** — `_features.py` aligns on `t_wall_ns`
  because netflow's `t_mono_ns` is system-uptime, not episode-relative.
  The workaround lives in this repo; the producer should still be
  patched to emit episode-relative `t_mono_ns` consistently.
- **Sample diversity is low** (12 unique sample_names total across 6
  profiles). `held_out_sample` only fits the `io-walk` profile.
  Held-out-by-host is the right primary eval until more samples are
  added.
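
The skip-and-warn handling of the torn `index.jsonl` line could be sketched
as follows. The function name and record shape are hypothetical, and this
is not the code in `tools/dataset_validate.py`; it only shows that two
concatenated JSON objects fail to parse as one record and can be skipped
without aborting the sweep.

```python
import json
import warnings
from typing import Iterable, Iterator


def iter_index_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSONL line; warn and skip the rest."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # A torn write such as '{...}{...}' fails here with "Extra data".
            warnings.warn(f"index line {lineno}: skipped ({exc.msg})")
            continue
        if not isinstance(record, dict):
            warnings.warn(f"index line {lineno}: skipped (not a JSON object)")
            continue
        yield record


sample = ['{"id": 1}', '{"id": 2}{"id": 3}', '{"id": 4}']
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    records = list(iter_index_records(sample))
```

Skipping loses both records on the torn line, which is why the producer
side still needs the fix.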