The model layer of the project, built honestly:
- tools/dataset_validate.py — full-sweep validator over the receiver
store (sha256, schema, monotonic labels, telemetry-row gate). On the
current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
is committed as the per-episode acceptance index.
- training/_features.py — channel registry (46 channels across
proc/guest/qmp/netflow), summary-stat windowing AND channel×time
tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
(Unix ns) — tested fix for a real netflow-vs-host clock-base
inconsistency that was silently dropping every netflow channel.
- training/_split.py — three held-out recipes (host / sample / time)
with profile-stratification assertions. held_out_host carries
untested_profiles for cases like scan-and-dial absent from the test
host (5 of 6 profiles tested cross-device, never silently averaged).
- training/models/ — 6 architectures behind a common BaseModel
interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
trained twice (realistic / oracle) per the deployment threat model.
Schema-hashed checkpoints refuse to load if _features.py changed
since training (silent-input-drift protection, tested).
- training/trainer/ — unified training loop: class-weighted CE, LR
warmup + cosine, gradient clipping, mixed precision when CUDA,
early stopping on val macro F1, best-on-val checkpoint. Same loop
runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
early_stopping_rounds on val mlogloss.
- training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
per-profile and per-host breakdown, paired-bootstrap significance
for model-vs-model gap. Confusion matrix uses union of seen labels.
- training/dashboard/producers/ — replay/metrics/perf/profiles
emitting the six event types the dashboard's awaiting scenes
consume; on-demand tensor extraction so the Pi can run live
inference without 65 GB of shards.
- 17 unit tests (split coverage, features round-trip, schema mismatch,
determinism, time-base alignment regression).
End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
219 lines
9.1 KiB
Markdown
# training/

Train a behavioral malware detector from labeled episode tarballs.
Six architectures × two threat-model modes = twelve trained models,
evaluated head-to-head on a held-out-by-host split.

## What lives where

```
training/
  _episode_io.py        tarball decoder
  _features.py          channel registry + windowing (summary + tensor)
  _split.py             held-out recipes (host / sample / time)
  build_features.py     summary-stat parquet builder
  build_tensors.py      channel × time tensor shard builder
  models/               6 architectures behind a common BaseModel interface
    gbt.py              XGBoost on summary features
    mlp.py              MLP on summary features (NN baseline parity to GBT)
    cnn.py              1D-CNN on tensor windows
    gru.py              GRU on tensor windows
    lstm.py             LSTM on tensor windows
    transformer.py      small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
    _checkpoint.py      schema-hashed save/load; refuses mismatches
  trainer/
    run.py              end-to-end training driver (one model at a time)
    _loop.py            shared training loop: class-weighted CE, LR warmup +
                        cosine, early stop on val macro F1, best-on-val
    _data.py            loaders for summary parquet + tensor shards
  eval_/
    run.py              load every checkpoint, score, write comparison_v2.md
    _metrics.py         macro F1 + per-class F1 with bootstrap 95 % CIs;
                        paired-bootstrap significance for model-vs-model
    breakdown.py        per-profile, per-host metric tables
  dashboard/producers/  live event emitters (see ../dashboard/PRODUCERS.md)
```

## The honesty rules this implements

1. **Held-out by host (primary):** train on `elliott-thinkpad`, test on
   `k-gamingcom`. Tests cross-device generalization, the claim a deployed
   model has to support. 5 of 6 profiles populated cross-device;
   `scan-and-dial` is *untested* (k-gamingcom never ran it) and explicitly
   reported as such, not silently averaged in.

2. **Profile-stratified, sample-stratified, or time:** all three split
   recipes are available via `--split-recipe {host,sample,time}`.
   `held_out_sample` excludes profiles with too few unique sample_names
   (holding out samples would be mathematically unsound otherwise); the
   dataset has 2 such profiles today (`cpu-saturate`, `low-and-slow`).

3. **In-distribution val carved from train host** for hyperparameter
   selection. Test set is never touched at training time.

4. **Class-weighted cross-entropy** computed from the train slice
   (inverse frequency, clipped). Class imbalance is real
   (`armed`/`infecting` rare, `infected_running` common) and unweighted
   loss under-trains on the operationally interesting phases.

5. **Best-on-val checkpoint** selected by macro F1 (not accuracy, which
   hides imbalance). Early stopping with patience=8.
   LR warmup (5 % of steps) + cosine decay to 0.

6. **Schema-hashed checkpoints.** Every saved model carries a sha256
   of its input schema. Loading a checkpoint against a changed
   `_features.py` registry raises `ValueError` instead of silently
   feeding mis-aligned columns to the model.

7. **Bootstrap CIs on every test metric.** Reporting
   `macro_f1 = 0.873 ± 0.012` is the bar; a single point estimate
   from one finite test set is dishonest.

8. **Paired-bootstrap significance** for model-vs-model gap. If the CI
   on the gap excludes 0, the gap is significant.

9. **NaN handling for the `degraded` set** (k-gamingcom shipped without
   netflow): NaNs pass through standardization and are set to 0
   afterwards, but a missingness mask is kept on tensor data so the
   sequence models can learn to discount sparse channels. (Indicator
   features for summary models are a v2 enhancement.)
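
Rules 7 and 8 can be illustrated with a minimal paired-bootstrap sketch.
This is not the code in `eval_/_metrics.py`; the function names, the
nearest-rank percentile, and the default of 1000 resamples are assumptions
made for the example. Both models are scored on the same resampled test
indices, so the interval describes the gap itself rather than two
independent scores.

```python
import random
from typing import List, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over a fixed label set."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def percentile(sorted_vals: List[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def paired_bootstrap_gap(y_true: Sequence[int], pred_a: Sequence[int],
                         pred_b: Sequence[int], n_boot: int = 1000,
                         seed: int = 0) -> Tuple[float, float, bool]:
    """95% CI on macro_f1(A) - macro_f1(B); significant if it excludes 0."""
    rng = random.Random(seed)
    n = len(y_true)
    # Union of seen labels, fixed once so every resample averages the same classes.
    labels = sorted(set(y_true) | set(pred_a) | set(pred_b))
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for A and B
        yt = [y_true[i] for i in idx]
        f1_a = macro_f1(yt, [pred_a[i] for i in idx], labels)
        f1_b = macro_f1(yt, [pred_b[i] for i in idx], labels)
        gaps.append(f1_a - f1_b)
    gaps.sort()
    lo, hi = percentile(gaps, 0.025), percentile(gaps, 0.975)
    return lo, hi, not (lo <= 0.0 <= hi)


# Toy data: model A is perfect, model B always predicts class 0.
y = [0, 1, 2] * 40
lo, hi, significant = paired_bootstrap_gap(y, y, [0] * len(y))
print(f"gap CI = [{lo:.3f}, {hi:.3f}], significant = {significant}")
```

A per-model CI (rule 7) follows the same pattern with a single prediction
vector and the resampled score in place of the gap.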

## Pipeline

```
/var/lib/cis490/episodes/                ← raw .tar.zst
/var/lib/cis490/index.jsonl
                 │
                 ▼
tools/dataset_validate.py                ← full-sweep validator
                 │
                 ▼
data/processed/validation_v1.parquet     ← committed
                 │
        ┌────────┴─────────┐
        ▼                  ▼             (rsync to GPU box)
training/build_features.py    training/build_tensors.py
        │                                  │
        ▼                                  ▼
features_window_v1.parquet    tensor_window_v1/host=*/<id>.npz
feature_schema_v1.json        (channel × time, ~12 GB at full scale)
        │                                  │
        └────────┬─────────────────────────┘
                 ▼
training/trainer/run.py
(per model × mode)
                 │
                 ▼
artifacts/<model>_<mode>.ckpt.json + sidecar (.pt or .xgb.json)
                 │
                 ▼
training/eval_/run.py
                 │
                 ▼
reports/eval/comparison_v2.md
reports/eval/<model>_<mode>_eval.json    (full per-phase, per-profile,
                                          per-host metrics with CIs)
```
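
The `.ckpt.json` step is where the schema hash does its work. The sketch
below shows one way such a check could look; it is not the contents of
`_checkpoint.py`. The `schema_sha256` field name, the shape of the schema
dict, and the channel names are hypothetical. Only the use of sha256 and
the `ValueError` on mismatch come from this README.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """sha256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_schema(ckpt_meta: dict, current_schema: dict) -> None:
    """Raise ValueError if the checkpoint was trained on a different schema."""
    expected = ckpt_meta["schema_sha256"]  # hypothetical field name
    actual = schema_hash(current_schema)
    if expected != actual:
        raise ValueError(
            f"schema mismatch: checkpoint {expected[:12]} != current {actual[:12]}"
        )


# Hypothetical schema: channel order is part of the hash because list
# order survives sort_keys, so a reordered registry is rejected too.
trained_on = {"channels": ["proc.cpu", "netflow.bytes_out"], "window_s": 10}
meta = {"schema_sha256": schema_hash(trained_on)}

check_schema(meta, trained_on)  # same schema: loads quietly

reordered = {"channels": ["netflow.bytes_out", "proc.cpu"], "window_s": 10}
try:
    check_schema(meta, reordered)
except ValueError as exc:
    print("refused:", exc)
```

Hashing a canonical encoding rather than the raw file keeps the check
insensitive to key order and whitespace while still catching column
reordering.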

## Quickstart on the GPU box

```sh
git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
  bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed

uv run --group training python training/build_tensors.py \
  --validation data/processed/validation_v1.parquet \
  --store ./episodes \
  --out-dir data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
  for mode in realistic oracle; do
    uv run --group training python -m training.trainer.run \
      --model "$model" --mode "$mode" \
      --validation data/processed/validation_v1.parquet \
      --summary data/processed/features_window_v1.parquet \
      --tensors data/processed/tensor_window_v1 \
      --schema data/processed/feature_schema_v1.json \
      --train-hosts elliott-thinkpad \
      --epochs 60
  done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
  --validation data/processed/validation_v1.parquet \
  --summary data/processed/features_window_v1.parquet \
  --tensors data/processed/tensor_window_v1 \
  --reports-dir reports/eval
```

## Live dashboard

Producers under `training/dashboard/producers/` push events to the
`dashboard.wg` WebSocket via the canonical
`training.dashboard.client.Publisher` (loopback HTTP, stdlib-only).
See [`../dashboard/PRODUCERS.md`](../dashboard/PRODUCERS.md) for the
event contract.

```sh
# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.dashboard.producers.metrics \
  --validation data/processed/validation_v1.parquet \
  --artifacts artifacts \
  --summary data/processed/features_window_v1.parquet \
  --tensors data/processed/tensor_window_v1

# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.dashboard.producers.replay \
  --episode /var/lib/cis490/episodes/elliott-thinkpad/<id>.tar.zst \
  --host-id elliott-thinkpad \
  --artifacts artifacts
```

## Tests

```sh
pytest tests/test_training_split.py tests/test_training_features.py \
  tests/test_training_checkpoint.py
```

Guards: split coverage assertions, time-base alignment (the
`t_wall_ns` vs `t_mono_ns` netflow regression), counter-to-rate
correctness, schema-mismatch rejection, deterministic split.
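
To make the time-base regression concrete, here is a toy reproduction of
how windowing on the wrong clock can drop every netflow row. It is a
sketch, not the test in `tests/test_training_features.py`: the helper
names, the row layout, the 10 s window, the 60 s episode, and the
assumption that host rows carry an episode-relative `t_mono_ns` are all
made up for illustration.

```python
from typing import List, Optional

WINDOW_NS = 10 * 1_000_000_000  # assumed 10 s window
N_WINDOWS = 6                   # assumed 60 s episode


def window_index(t_ns: int, start_ns: int,
                 n_windows: int = N_WINDOWS,
                 window_ns: int = WINDOW_NS) -> Optional[int]:
    """Window a timestamp falls into, or None if outside the episode."""
    offset = t_ns - start_ns
    if offset < 0:
        return None
    idx = offset // window_ns
    return idx if idx < n_windows else None


def kept_rows(rows: List[dict], clock_key: str, start_ns: int) -> List[dict]:
    """Rows that land inside the episode when aligned on clock_key."""
    return [r for r in rows if window_index(r[clock_key], start_ns) is not None]


EPISODE_START_WALL_NS = 1_700_000_000 * 10**9  # arbitrary Unix ns
UPTIME_NS = 86_400 * 10**9                     # host has been up for a day
offsets = [0, 5 * 10**9, 15 * 10**9]

# Host rows: t_mono_ns counted from the start of the episode (assumption).
host = [{"t_mono_ns": o, "t_wall_ns": EPISODE_START_WALL_NS + o} for o in offsets]
# Netflow rows: t_mono_ns is system uptime, as described in this README.
netflow = [{"t_mono_ns": UPTIME_NS + o, "t_wall_ns": EPISODE_START_WALL_NS + o}
           for o in offsets]

# Aligning on t_mono_ns keeps the host rows and silently drops all netflow rows.
print(len(kept_rows(host, "t_mono_ns", 0)), len(kept_rows(netflow, "t_mono_ns", 0)))
# Aligning on t_wall_ns keeps both sources.
print(len(kept_rows(host, "t_wall_ns", EPISODE_START_WALL_NS)),
      len(kept_rows(netflow, "t_wall_ns", EPISODE_START_WALL_NS)))
```

A regression test in this style only needs to assert that the netflow row
count after windowing is non-zero whenever the raw episode contains
netflow rows.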

## Open data-quality issues found while building this

Surfaced for the writeup, not silently worked around:

- **`receiver/store.py:130` torn write:** index.jsonl line 19500 has
  two records concatenated. The "atomic for sub-PIPE_BUF" comment isn't
  holding. Validator skips and warns; producer-side fix needed.
- **k-gamingcom silent downgrade:** ~24 k episodes shipped without
  `netflow.jsonl`. Per AGENTS.md "Do not silently downgrade a host"
  this is a producer hard-rule violation. We accept them as `degraded`
  and train, but the realistic model loses bridge-pcap signal on those.
- **`scan-and-dial` absent from k-gamingcom:** held-out-by-host can't
  evaluate that profile cross-device. Reported as `untested_profiles`
  in every metrics output rather than averaged in.
- **Cross-source clock drift:** `_features.py` aligns on `t_wall_ns`
  because netflow's `t_mono_ns` is system-uptime, not episode-relative.
  Fix is in this repo; the producer should be patched to emit
  episode-relative `t_mono_ns` consistently.
- **Sample diversity is low** (12 unique sample_names total across 6
  profiles). `held_out_sample` only fits the `io-walk` profile.
  Held-out-by-host is the right primary eval until more samples are
  added.
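
For the torn-write issue, a line holding two concatenated records can be
detected with the standard library alone. The sketch below is an
illustration, not the logic in `tools/dataset_validate.py`; the function
names and the sample records are hypothetical. It uses
`json.JSONDecoder.raw_decode` to pull consecutive JSON values off a
single line and reports any line that does not hold exactly one.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def split_records(line: str) -> List[object]:
    """Decode every JSON value found back-to-back on one line."""
    decoder = json.JSONDecoder()
    text = line.strip()
    records, pos = [], 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
        # raw_decode does not skip whitespace, so step over it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return records


def scan_index(lines: Iterable[str]) -> Iterator[Tuple[int, List[object]]]:
    """Yield (line_number, records) for lines without exactly one record."""
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            records = split_records(line)
        except json.JSONDecodeError:
            records = []  # truncated or interleaved write: nothing recoverable
        if len(records) != 1:
            yield n, records


sample = [
    '{"id": 1}\n',
    '{"id": 2}{"id": 3}\n',  # two records concatenated on one line
    '{"id": 4',              # truncated record
]
for line_no, records in scan_index(sample):
    print(line_no, records)
```

Whether recovered records from a torn line should be accepted or skipped
is a policy choice; the validator described above skips and warns.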