training/
Train a behavioral malware detector from labeled episode tarballs. Six architectures × two threat-model modes = twelve trained models, evaluated head-to-head on a held-out-by-host split.
What lives where
training/
  _episode_io.py        tarball decoder
  _features.py          channel registry + windowing (summary + tensor)
  _split.py             held-out recipes (host / sample / time)
  build_features.py     summary-stat parquet builder
  build_tensors.py      channel × time tensor shard builder
  models/               6 architectures behind a common BaseModel interface
    gbt.py              XGBoost on summary features
    mlp.py              MLP on summary features (NN baseline parity to GBT)
    cnn.py              1D-CNN on tensor windows
    gru.py              GRU on tensor windows
    lstm.py             LSTM on tensor windows
    transformer.py      small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
    _checkpoint.py      schema-hashed save/load; refuses mismatches
  trainer/
    run.py              end-to-end training driver (one model at a time)
    _loop.py            shared training loop: class-weighted CE, LR warmup +
                        cosine, early stop on val macro F1, best-on-val
    _data.py            loaders for summary parquet + tensor shards
  eval_/
    run.py              load every checkpoint, score, write comparison_v2.md
    _metrics.py         macro F1 + per-class F1 with bootstrap 95% CIs;
                        paired-bootstrap significance for model-vs-model
    breakdown.py        per-profile, per-host metric tables
  producers/            live event sources for the dashboard
                        (replay, metrics, perf, profiles)
                        → see training/dashboard/PRODUCERS.md
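The _loop.py entry above names two numeric ingredients: class-weighted cross-entropy and LR warmup + cosine. The following is a minimal sketch of both, not the repo's code. The function names, the choice to normalise weights so the most common class gets 1.0, and the clip value of 10 are assumptions made for illustration; the 5 % warmup fraction and decay to 0 come from the rules stated in this README.

```python
import math
from collections import Counter


def class_weights(labels, clip=10.0):
    """Inverse-frequency weights computed from the train slice.

    Normalised so the most common class gets 1.0, then clipped so a very
    rare class cannot dominate the loss. Both choices are assumptions.
    """
    counts = Counter(labels)
    most_common = max(counts.values())
    return {cls: min(most_common / n, clip) for cls, n in counts.items()}


def lr_at(step, total_steps, base_lr, warmup_frac=0.05):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    train_labels = ["infected_running"] * 90 + ["armed"] * 6 + ["infecting"] * 4
    print(class_weights(train_labels))
    print([lr_at(s, 1000, 1e-3) for s in (0, 49, 50, 500, 999)])
```

In a PyTorch loop, the weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by class index.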
The honesty rules this implements
- Held-out by host (primary): train on elliott-thinkpad, test on k-gamingcom. Tests cross-device generalization, the claim a deployed model has to support. 5 of 6 profiles populated cross-device; scan-and-dial is untested (k-gamingcom never ran it) and explicitly reported as such, not silently averaged in.
- Profile-stratified, sample-stratified, or time: all three split recipes are available via --split-recipe {host,sample,time}. held_out_sample excludes profiles with too few unique sample_names (would be mathematically unsound otherwise); the dataset has 2 such profiles today (cpu-saturate, low-and-slow).
- In-distribution val carved from train host for hyperparameter selection. Test set is never touched at training time.
- Class-weighted cross-entropy computed from the train slice (inverse frequency, clipped). Class imbalance is real (armed/infecting rare, infected_running common) and unweighted loss under-trains on the operationally interesting phases.
- Best-on-val checkpoint selected by macro F1 (not accuracy, which hides imbalance). Early stopping with patience=8. LR warmup (5 % of steps) + cosine decay to 0.
- Schema-hashed checkpoints. Every saved model carries a sha256 of its input schema. Loading a checkpoint against a changed _features.py registry raises ValueError instead of silently feeding mis-aligned columns to the model.
- Bootstrap CIs on every test metric. Reporting macro_f1 = 0.873 ± 0.012 is the bar; a single point estimate from one finite test is dishonest.
- Paired-bootstrap significance for model-vs-model gap. CI excludes 0 → significant.
- NaN handling for the degraded set (k-gamingcom shipped without netflow): NaN fed through standardization → 0 after, but a missingness mask is kept on tensor data for the sequence models to learn to discount sparse channels. (Indicator features for summary models is a v2 enhancement.)
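The two bootstrap rules above can be illustrated with a stdlib-only sketch. This is not the _metrics.py implementation: the function names, the nearest-rank percentile, and the default of 1000 resamples are assumptions. Macro F1 is averaged over the classes present in each resampled y_true, which is one of several reasonable conventions.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def percentile(sorted_vals, q):
    """Nearest-rank percentile on an already sorted list, q in [0, 1]."""
    idx = min(len(sorted_vals) - 1, max(0, round(q * (len(sorted_vals) - 1))))
    return sorted_vals[idx]


def bootstrap_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Point estimate plus a 95% percentile-bootstrap interval for macro F1."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return macro_f1(y_true, y_pred), percentile(stats, 0.025), percentile(stats, 0.975)


def paired_bootstrap_gap(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """95% CI on macro_f1(A) - macro_f1(B), using the same resample for both models."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        gap = macro_f1(yt, [pred_a[i] for i in idx]) - macro_f1(yt, [pred_b[i] for i in idx])
        gaps.append(gap)
    gaps.sort()
    lo, hi = percentile(gaps, 0.025), percentile(gaps, 0.975)
    return lo, hi, (lo > 0 or hi < 0)  # significant when the CI excludes 0
```

Resampling the same indices for both models is what makes the comparison paired: windows that are hard for both models cancel out of the gap instead of inflating its variance.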
Pipeline
/var/lib/cis490/episodes/               ← raw .tar.zst
/var/lib/cis490/index.jsonl
            │
            ▼
tools/dataset_validate.py               ← full-sweep validator
            │
            ▼
data/processed/validation_v1.parquet    ← committed
            │
   ┌────────┴─────────────────┐
   ▼                          ▼         (rsync to GPU box)
training/build_features.py    training/build_tensors.py
   │                          │
   ▼                          ▼
features_window_v1.parquet    tensor_window_v1/host=*/<id>.npz
feature_schema_v1.json        (channel × time, ~12 GB at full scale)
   │                          │
   └────────┬─────────────────┘
            ▼
training/trainer/run.py
(per model × mode)
            │
            ▼
artifacts/<model>_<mode>.ckpt.json + sidecar (.pt or .xgb.json)
            │
            ▼
training/eval_/run.py
            │
            ▼
reports/eval/comparison_v2.md
reports/eval/<model>_<mode>_eval.json   (full per-phase, per-profile,
                                         per-host metrics with CIs)
Quickstart on the GPU box
git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
    bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
    --validation data/processed/validation_v1.parquet \
    --store ./episodes \
    --out-dir data/processed
uv run --group training python training/build_tensors.py \
    --validation data/processed/validation_v1.parquet \
    --store ./episodes \
    --out-dir data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
    for mode in realistic oracle; do
        uv run --group training python -m training.trainer.run \
            --model "$model" --mode "$mode" \
            --validation data/processed/validation_v1.parquet \
            --summary data/processed/features_window_v1.parquet \
            --tensors data/processed/tensor_window_v1 \
            --schema data/processed/feature_schema_v1.json \
            --train-hosts elliott-thinkpad \
            --epochs 60
    done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
    --validation data/processed/validation_v1.parquet \
    --summary data/processed/features_window_v1.parquet \
    --tensors data/processed/tensor_window_v1 \
    --reports-dir reports/eval
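Step 3 trains on elliott-thinkpad only, which is the held-out-by-host recipe. As a rough illustration of that split, including explicit reporting of profiles the test host never ran, a sketch might look like the following. The episode dict layout and the function name are hypothetical and are not the _split.py API.

```python
def split_by_host(episodes, train_hosts):
    """Split episodes by host and list profiles that cannot be tested cross-device.

    episodes: iterable of dicts with 'host' and 'profile' keys (assumed layout).
    Returns (train, test, untested_profiles).
    """
    episodes = list(episodes)
    train = [e for e in episodes if e["host"] in train_hosts]
    test = [e for e in episodes if e["host"] not in train_hosts]
    train_profiles = {e["profile"] for e in train}
    test_profiles = {e["profile"] for e in test}
    # Profiles seen in training but absent from the held-out host are reported,
    # not silently averaged into the test metrics.
    untested = sorted(train_profiles - test_profiles)
    return train, test, untested


if __name__ == "__main__":
    eps = [
        {"host": "elliott-thinkpad", "profile": "scan-and-dial"},
        {"host": "elliott-thinkpad", "profile": "io-walk"},
        {"host": "k-gamingcom", "profile": "io-walk"},
    ]
    train, test, untested = split_by_host(eps, {"elliott-thinkpad"})
    print(len(train), len(test), untested)
```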
Live dashboard
Producers under training/producers/ push events to the
dashboard.wg WebSocket via the canonical
training.dashboard.client.Publisher (loopback HTTP, stdlib-only).
See ../dashboard/PRODUCERS.md for the
event contract.
# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.producers.metrics \
    --validation data/processed/validation_v1.parquet \
    --artifacts artifacts \
    --summary data/processed/features_window_v1.parquet \
    --tensors data/processed/tensor_window_v1

# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.producers.replay \
    --episode /var/lib/cis490/episodes/elliott-thinkpad/<id>.tar.zst \
    --host-id elliott-thinkpad \
    --artifacts artifacts
Tests
pytest tests/test_training_split.py tests/test_training_features.py \
tests/test_training_checkpoint.py
Guards: split coverage assertions, time-base alignment (the
t_wall_ns vs t_mono_ns netflow regression), counter-to-rate
correctness, schema-mismatch rejection, deterministic split.
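The schema-mismatch guard can be pictured with a small stdlib sketch. It assumes the schema is a JSON-serialisable dict and that the checkpoint metadata stores the digest under a `schema_sha256` key. Both are illustrative assumptions, as are the function names; the actual _checkpoint.py format may differ.

```python
import hashlib
import json


def schema_hash(schema):
    """sha256 over a canonical JSON encoding, so dict key order does not matter.

    List order (for example the channel order) does change the hash, which is
    the point: reordered columns would mis-align model inputs.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_schema(checkpoint_meta, current_schema):
    """Raise ValueError when the checkpoint was saved against a different schema."""
    expected = checkpoint_meta["schema_sha256"]  # hypothetical metadata key
    actual = schema_hash(current_schema)
    if expected != actual:
        raise ValueError(
            f"schema mismatch: checkpoint {expected[:12]} vs current {actual[:12]}"
        )


if __name__ == "__main__":
    schema = {"channels": ["cpu", "netflow"], "window": 32}
    meta = {"schema_sha256": schema_hash(schema)}
    check_schema(meta, schema)  # passes silently
    try:
        check_schema(meta, {"channels": ["netflow", "cpu"], "window": 32})
    except ValueError as exc:
        print(exc)
```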
Open data-quality issues found while building this
Surfaced for the writeup, not silently worked around:
- receiver/store.py:130 torn write: index.jsonl line 19500 has two records concatenated. The "atomic for sub-PIPE_BUF" comment isn't holding. Validator skips and warns; producer-side fix needed.
- k-gamingcom silent downgrade: ~24 k episodes shipped without netflow.jsonl. Per AGENTS.md "Do not silently downgrade a host" this is a producer hard-rule violation. We accept them as degraded and train, but the realistic model loses bridge-pcap signal on those.
- scan-and-dial absent from k-gamingcom: held-out-by-host can't evaluate that profile cross-device. Reported as untested_profiles in every metrics output rather than averaged in.
- Cross-source clock drift: _features.py aligns on t_wall_ns because netflow's t_mono_ns is system-uptime, not episode-relative. Fix is in this repo; the producer should be patched to emit episode-relative t_mono_ns consistently.
- Sample diversity is low (12 unique sample_names total across 6 profiles). held_out_sample only fits the io-walk profile. Held-out-by-host is the right primary eval until more samples are added.
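For the torn-write issue in index.jsonl, a reader can detect a line that holds two concatenated records (or a truncated one) and skip it with a warning, which is the behaviour described for the validator. The sketch below is an illustration of that idea using only the standard library; the function names are hypothetical and this is not the code in tools/dataset_validate.py.

```python
import json


def split_records(line):
    """Return every JSON value found back-to-back on a single JSONL line."""
    decoder = json.JSONDecoder()
    line = line.strip()
    records, pos = [], 0
    while pos < len(line):
        obj, end = decoder.raw_decode(line, pos)
        records.append(obj)
        # raw_decode does not skip whitespace, so step over any gap manually
        while end < len(line) and line[end].isspace():
            end += 1
        pos = end
    return records


def read_index(lines):
    """Keep well-formed single-record lines; report torn line numbers (1-based)."""
    good, torn = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            records = split_records(line)
        except json.JSONDecodeError:
            torn.append(lineno)  # truncated or otherwise unparseable
            continue
        if len(records) == 1:
            good.append(records[0])
        else:
            torn.append(lineno)  # two or more records fused on one line
    return good, torn


if __name__ == "__main__":
    sample = ['{"id": 1}\n', '{"id": 2}{"id": 3}\n', '{"id": 4}\n']
    good, torn = read_index(sample)
    print(good, "torn lines:", torn)
```

Skipping the whole line rather than salvaging both records is the conservative choice here, since a torn write gives no guarantee that either half is complete.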