History

Max 8643192a71 training/fleet: distributed multi-host trainer with capability gating		2026-05-08 01:20:20 -05:00

Symmetric companion to the collection fleet (orchestrator/fleet.py) but for training. Collection is embarrassingly parallel; training is not (a model is trained at most once across the fleet), so the receiver coordinates which worker gets which job.

Operator-control surface is etc/training_manifest.toml.example, a single canonical file declaring (a) per-host capability + per-model allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper) with capability constraints (require_cuda, prefer_cuda, min_vram_gib, min_ram_gib, allowed_hosts).

Components:

capability.py: self-detection of hostname, cores, RAM, CUDA presence, VRAM, torch version, git commit. Used by workers to filter eligible jobs before claiming.

manifest.py: TOML loader + JobSpec/HostSpec. Job IDs are stable sha256 of (model, mode, hyper, split_recipe, train_hosts, seed) so manifest reload is idempotent: existing rows keep their status, new jobs become claimable, removed jobs stay until cancelled.

queue.py: SQLite job queue (training_jobs.db) with statuses pending|claimed|running|completed|failed|cancelled. Atomic claim_next via single UPDATE WHERE status='pending'. Heartbeat, complete, fail. Stale-claim sweep (stale_after_s=600s) with max_attempts cutoff to failed.

store.py: model artifact store mirroring receiver/store.py. Artifact ID is the sha256 of the uploaded tarball; bit-identical re-runs deduplicate.

receiver.py: Starlette app exposing 11 endpoints:
  POST /v1/job/claim              (worker)
  POST /v1/job/{id}/heartbeat     (worker)
  POST /v1/job/{id}/complete      (worker)
  POST /v1/job/{id}/fail          (worker)
  PUT  /v1/model/{id}             (worker; uploads tarball)
  GET  /v1/jobs                   (anyone)
  GET  /v1/workers                (anyone)
  POST /v1/job/{id}/cancel        (operator: X-Operator-Token)
  POST /v1/job/{id}/requeue       (operator)
  POST /v1/manifest/reload        (operator)
  GET  /v1/health                 (anyone)
Runs as cis490-trainer-receiver.service on the Pi alongside the existing receiver, on a separate port.

client.py: stdlib HTTP client (urllib only, no new deps).

worker.py: long-running daemon. Loop: detect capability → claim → spawn training/trainer/run.py subprocess → heartbeat every 30s → tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.

Operator CLI (tools/cis490_jobs.py): status / list / show / cancel / requeue / reload / workers. Cancel and requeue require $CIS490_OPERATOR_TOKEN matching the receiver's configured value.

Bootstrap: scripts/install-training-worker.sh (Linux systemd) and scripts/install-training-worker-windows.ps1 (Windows Scheduled Task) let the operator enroll a new host with one command after cloning the repo and setting up the venv. Worker self-tests capability before registering.

End-to-end smoke verified on the Pi: receiver up, manifest synced, 14 jobs queued, worker registered, claimed 4 CPU-eligible jobs (allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle, mlp-oracle), 1 failed with the actual error visible via cis490-jobs status, 3 artifacts uploaded to /var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with proper index.jsonl row.

21 unit tests (manifest validation: 8; queue lifecycle + eligibility: 13). All pass alongside the prior 17 training tests = 38 green.

Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training time, not at manifest reload (worth tightening to argparse introspection later).
- mTLS not yet wired through Caddy for the trainer-receiver port; listens loopback-only until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dashboard	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
eval_	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
fleet	training/fleet: distributed multi-host trainer with capability gating	2026-05-08 01:20:20 -05:00
models	training: self-supervised pretrain + IG XAI + project brief / slide planner	2026-05-08 01:19:41 -05:00
trainer	training: self-supervised pretrain + IG XAI + project brief / slide planner	2026-05-08 01:19:41 -05:00
xai	training: self-supervised pretrain + IG XAI + project brief / slide planner	2026-05-08 01:19:41 -05:00
__init__.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
_episode_io.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
_features.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
_split.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
build_features.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
build_tensors.py	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00
README.md	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers	2026-05-08 01:19:00 -05:00

README.md

training/

Train a behavioral malware detector from labeled episode tarballs. Six architectures × two threat-model modes = twelve trained models, evaluated head-to-head on a held-out-by-host split.

What lives where

training/
  _episode_io.py          tarball decoder
  _features.py            channel registry + windowing (summary + tensor)
  _split.py               held-out recipes (host / sample / time)
  build_features.py       summary-stat parquet builder
  build_tensors.py        channel × time tensor shard builder
  models/                 6 architectures behind a common BaseModel interface
    gbt.py                XGBoost on summary features
    mlp.py                MLP on summary features (NN baseline parity to GBT)
    cnn.py                1D-CNN on tensor windows
    gru.py                GRU on tensor windows
    lstm.py               LSTM on tensor windows
    transformer.py        small Transformer encoder on tensor windows
    _base.py, _torch_seq.py
    _checkpoint.py        schema-hashed save/load — refuses mismatches
  trainer/
    run.py                end-to-end training driver (one model at a time)
    _loop.py              shared training loop: class-weighted CE, LR warmup +
                          cosine, early stop on val macro F1, best-on-val
    _data.py              loaders for summary parquet + tensor shards
  eval_/
    run.py                load every checkpoint, score, write comparison_v2.md
    _metrics.py           macro F1 + per-class F1 with bootstrap 95 % CIs;
                          paired-bootstrap significance for model-vs-model
    breakdown.py          per-profile, per-host metric tables
  dashboard/producers/    live event emitters — see ../dashboard/PRODUCERS.md
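
The `_loop.py` entry above names class-weighted cross-entropy and LR warmup + cosine decay. The following is a minimal stdlib sketch of those two pieces, not the repository's implementation: the function names, the normalisation against the most common class, and the `clip=10.0` default are assumptions made for illustration. The 5 % warmup fraction and the phase names come from this README.

```python
import math
from collections import Counter


def class_weights(labels, clip=10.0):
    """Inverse-frequency class weights, clipped at `clip`.

    Assumption: weights are normalised so the most common class gets 1.0.
    """
    counts = Counter(labels)
    most_common = max(counts.values())
    return {cls: min(most_common / n, clip) for cls, n in counts.items()}


def lr_at(step, total_steps, base_lr, warmup_frac=0.05):
    """Linear warmup over the first `warmup_frac` of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Toy label distribution: one common phase, two rare ones.
    labels = ["infected_running"] * 90 + ["armed"] * 9 + ["infecting"] * 1
    print(class_weights(labels))
    print([round(lr_at(s, 1000, 1e-3), 6) for s in (0, 49, 50, 500, 1000)])
```

In a PyTorch loop the weights would typically be passed to the loss as a tensor ordered by class index; the clip keeps a class with only a handful of windows from dominating the gradient.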

The honesty rules this implements

Held-out by host (primary): train on elliott-thinkpad, test on k-gamingcom. Tests cross-device generalization, the claim a deployed model has to support. 5 of 6 profiles populated cross-device; scan-and-dial is untested (k-gamingcom never ran it) and explicitly reported as such, not silently averaged in.
Profile-stratified, sample-stratified, or time: all three split recipes are available via --split-recipe {host,sample,time}. held_out_sample excludes profiles with too few unique sample_names (would be mathematically unsound otherwise) — the dataset has 2 such profiles today (cpu-saturate, low-and-slow).
In-distribution val carved from train host for hyperparameter selection. Test set is never touched at training time.
Class-weighted cross-entropy computed from the train slice (inverse frequency, clipped). Class imbalance is real (armed/infecting rare, infected_running common) and unweighted loss under-trains on the operationally interesting phases.
Best-on-val checkpoint selected by macro F1 (not accuracy — accuracy hides imbalance). Early stopping with patience=8. LR warmup (5 % of steps) + cosine decay to 0.
Schema-hashed checkpoints. Every saved model carries a sha256 of its input schema. Loading a checkpoint against a changed _features.py registry raises ValueError instead of silently feeding mis-aligned columns to the model.
Bootstrap CIs on every test metric. Reporting macro_f1 = 0.873 ± 0.012 is the bar; a single point estimate from one finite test set is dishonest.
Paired-bootstrap significance for model-vs-model gap. CI excludes 0 → significant.
NaN handling for the degraded set (k-gamingcom shipped without netflow): NaN values are carried through standardization and set to 0 afterwards, but a missingness mask is kept on tensor data so the sequence models can learn to discount sparse channels. (Indicator features for summary models are a v2 enhancement.)
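
The NaN rule above can be illustrated with a small sketch. It assumes a window is a list of channels, each a list of floats with NaN marking a missing reading, and that per-channel `mean` and `std` were computed on the train slice; `standardize_with_mask` is a hypothetical name, not a function from this repository.

```python
import math


def standardize_with_mask(window, mean, std):
    """Z-score each channel, replace missing values with 0.0, and return a mask.

    window: list of channels, each a list of floats (NaN = missing).
    mean, std: per-channel statistics, assumed to come from the train slice.
    Returns (values, mask) where mask is 1.0 for observed and 0.0 for missing.
    """
    values, mask = [], []
    for channel, mu, sigma in zip(window, mean, std):
        sigma = sigma if sigma > 0 else 1.0  # guard constant channels
        observed = [not math.isnan(x) for x in channel]
        values.append(
            [(x - mu) / sigma if ok else 0.0 for x, ok in zip(channel, observed)]
        )
        mask.append([1.0 if ok else 0.0 for ok in observed])
    return values, mask


if __name__ == "__main__":
    nan = float("nan")
    # Channel 0 is fully observed; channel 1 stands in for a missing netflow channel.
    window = [[1.0, 3.0], [nan, nan]]
    values, mask = standardize_with_mask(window, mean=[2.0, 0.0], std=[1.0, 1.0])
    print(values)
    print(mask)
```

Setting missing values to 0 after standardization places them at the train mean, so the mask is what lets a sequence model tell "average reading" apart from "no reading".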

Pipeline

   /var/lib/cis490/episodes/                ← raw .tar.zst
   /var/lib/cis490/index.jsonl
            │
            ▼
   tools/dataset_validate.py                ← full-sweep validator
            │
            ▼
   data/processed/validation_v1.parquet     ← committed
            │
   ┌────────┴─────────┐
   ▼                  ▼                       (rsync to GPU box)
   training/build_features.py     training/build_tensors.py
            │                                  │
            ▼                                  ▼
   features_window_v1.parquet       tensor_window_v1/host=*/<id>.npz
   feature_schema_v1.json           (channel × time, ~12 GB at full scale)
            │                                  │
            └────────┬─────────────────────────┘
                     ▼
          training/trainer/run.py
            (per model × mode)
                     │
                     ▼
   artifacts/<model>_<mode>.ckpt.json   + sidecar (.pt or .xgb.json)
                     │
                     ▼
          training/eval_/run.py
                     │
                     ▼
   reports/eval/comparison_v2.md
   reports/eval/<model>_<mode>_eval.json   (full per-phase, per-profile,
                                            per-host metrics with CIs)
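
The CIs in the eval outputs come from bootstrap resampling, and model-vs-model gaps use a paired bootstrap. The sketch below shows one way this could look with the standard library only. It is not the code in `eval_/_metrics.py`: the function names, the percentile interval, and the `n_boot` default are assumptions.

```python
import random


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over a fixed class list."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def percentile_ci(samples, alpha=0.05):
    """Percentile interval, 95 % by default."""
    s = sorted(samples)
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


def paired_bootstrap_gap(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """CI on macro_f1(a) - macro_f1(b), resampling the same rows for both models."""
    rng = random.Random(seed)
    n = len(y_true)
    classes = sorted(set(y_true))
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, a, classes) - macro_f1(t, b, classes))
    return percentile_ci(deltas)


if __name__ == "__main__":
    y_true = ["clean", "infected_running"] * 50
    pred_a = list(y_true)                       # toy model A: always right
    pred_b = ["clean"] * len(y_true)            # toy model B: constant guess
    lo, hi = paired_bootstrap_gap(y_true, pred_a, pred_b, n_boot=200)
    print(f"gap CI: [{lo:.3f}, {hi:.3f}]  significant: {lo > 0 or hi < 0}")
```

Resampling the same indices for both models is what makes the comparison paired: per-window difficulty cancels out, so the interval on the gap is tighter than the two individual intervals would suggest. A gap interval that excludes 0 is read as significant.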

Quickstart on the GPU box

git clone http://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync --group training

# 1. pull raw episodes from the Pi (needs WireGuard + cis490-trainer)
PI_USER=max PI_HOST=10.100.0.1 LOCAL_DIR=./episodes \
    bash scripts/sync-training-data.sh

# 2. build features + tensors
uv run --group training python training/build_features.py \
    --validation data/processed/validation_v1.parquet \
    --store      ./episodes \
    --out-dir    data/processed

uv run --group training python training/build_tensors.py \
    --validation data/processed/validation_v1.parquet \
    --store      ./episodes \
    --out-dir    data/processed/tensor_window_v1

# 3. train all 12 (one process per model × mode)
for model in gbt mlp cnn gru lstm transformer; do
  for mode in realistic oracle; do
    uv run --group training python -m training.trainer.run \
      --model $model --mode $mode \
      --validation data/processed/validation_v1.parquet \
      --summary    data/processed/features_window_v1.parquet \
      --tensors    data/processed/tensor_window_v1 \
      --schema     data/processed/feature_schema_v1.json \
      --train-hosts elliott-thinkpad \
      --epochs 60
  done
done

# 4. evaluate, write comparison_v2.md
uv run --group training python -m training.eval_.run \
    --validation data/processed/validation_v1.parquet \
    --summary    data/processed/features_window_v1.parquet \
    --tensors    data/processed/tensor_window_v1 \
    --reports-dir reports/eval

Live dashboard

Producers under training/dashboard/producers/ push events to the dashboard.wg WebSocket via the canonical training.dashboard.client.Publisher (loopback HTTP, stdlib-only). See ../dashboard/PRODUCERS.md for the event contract.

# After training, push live model_metric + model_perf bars:
uv run --group training python -m training.dashboard.producers.metrics \
    --validation data/processed/validation_v1.parquet \
    --artifacts artifacts \
    --summary data/processed/features_window_v1.parquet \
    --tensors data/processed/tensor_window_v1

# Replay one episode at wall-clock speed (drives phase + prediction +
# embedding events):
uv run --group training python -m training.dashboard.producers.replay \
    --episode  /var/lib/cis490/episodes/elliott-thinkpad/<id>.tar.zst \
    --host-id  elliott-thinkpad \
    --artifacts artifacts

Tests

pytest tests/test_training_split.py tests/test_training_features.py \
       tests/test_training_checkpoint.py

Guards: split coverage assertions, time-base alignment (the t_wall_ns vs t_mono_ns netflow regression), counter-to-rate correctness, schema-mismatch rejection, deterministic split.
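
The schema-mismatch guard can be pictured with a small sketch. It assumes the schema is an ordered list of channel names hashed through canonical JSON. The channel names and function names here are hypothetical, and the actual layout used by `_checkpoint.py` may differ.

```python
import hashlib
import json


def schema_hash(channels):
    """sha256 over a canonical JSON encoding of the ordered channel list."""
    payload = json.dumps(list(channels), separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def checkpoint_meta(channels):
    """Metadata stored next to the weights at save time."""
    return {"schema_sha256": schema_hash(channels), "channels": list(channels)}


def check_schema(meta, current_channels):
    """Raise ValueError if the current registry no longer matches the checkpoint."""
    expected = meta["schema_sha256"]
    actual = schema_hash(current_channels)
    if expected != actual:
        raise ValueError(
            f"schema mismatch: checkpoint {expected[:12]} vs registry {actual[:12]}"
        )


if __name__ == "__main__":
    meta = checkpoint_meta(["cpu_pct", "rss_bytes", "net_tx_bytes"])  # hypothetical names
    check_schema(meta, ["cpu_pct", "rss_bytes", "net_tx_bytes"])      # passes silently
    try:
        check_schema(meta, ["rss_bytes", "cpu_pct", "net_tx_bytes"])  # reordered columns
    except ValueError as exc:
        print(exc)
```

Hashing the ordered list means a reordering is rejected as well as an added or removed channel, which matters because a reordered registry would feed the right values into the wrong input columns.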

Open data-quality issues found while building this

Surfaced for the writeup, not silently worked around:

receiver/store.py:130 torn write — index.jsonl line 19500 has two records concatenated. The "atomic for sub-PIPE_BUF" comment isn't holding. Validator skips and warns; producer-side fix needed.
k-gamingcom silent downgrade — ~24 k episodes shipped without netflow.jsonl. Per AGENTS.md "Do not silently downgrade a host" this is a producer hard-rule violation. We accept them as degraded and train, but the realistic model loses bridge-pcap signal on those.
scan-and-dial absent from k-gamingcom — held-out-by-host can't evaluate that profile cross-device. Reported as untested_profiles in every metrics output rather than averaged in.
Cross-source clock drift — _features.py aligns on t_wall_ns because netflow's t_mono_ns is system-uptime, not episode-relative. Fix is in this repo; the producer should be patched to emit episode-relative t_mono_ns consistently.
Sample diversity is low (12 unique sample_names total across 6 profiles). held_out_sample only fits the io-walk profile. Held-out-by-host is the right primary eval until more samples are added.
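
For the torn-write issue above, one possible way to detect two records concatenated on a single index.jsonl line is to decode JSON values back-to-back with `json.JSONDecoder.raw_decode`. This is a hedged sketch of the idea, not the validator's actual logic; `scan_index` and its status strings are assumptions.

```python
import json


def split_concatenated(line):
    """Return every JSON value found back-to-back on one line."""
    decoder = json.JSONDecoder()
    text = line.strip()
    records, pos = [], 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return records


def scan_index(lines):
    """Yield (lineno, status, records) with status 'ok', 'torn', or 'corrupt'."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            records = split_concatenated(line)
        except json.JSONDecodeError:
            yield lineno, "corrupt", []
            continue
        yield lineno, ("ok" if len(records) == 1 else "torn"), records


if __name__ == "__main__":
    sample = ['{"id": 1}', '{"id": 2}{"id": 3}', '{"id": 4']
    for lineno, status, records in scan_index(sample):
        print(lineno, status, records)
```

A line flagged as torn still contains recoverable records, so a reader could choose between skipping it with a warning and salvaging both records; which is appropriate depends on whether the two halves can be trusted to be complete.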
