The model layer of the project, built honestly:
- tools/dataset_validate.py — full-sweep validator over the receiver
store (sha256, schema, monotonic labels, telemetry-row gate). On the
current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
is committed as the per-episode acceptance index.
- training/_features.py — channel registry (46 channels across
proc/guest/qmp/netflow), summary-stat windowing AND channel×time
tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
(Unix ns) — tested fix for a real netflow-vs-host clock-base
inconsistency that was silently dropping every netflow channel.
- training/_split.py — three held-out recipes (host / sample / time)
with profile-stratification assertions. held_out_host records
untested_profiles for profiles absent from the test host, such as
scan-and-dial (5 of 6 profiles are tested cross-device; the missing
one is reported, never silently averaged).
- training/models/ — 6 architectures behind a common BaseModel
interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
trained twice (realistic / oracle) per the deployment threat model.
Schema-hashed checkpoints refuse to load if _features.py changed
since training (silent-input-drift protection, tested).
- training/trainer/ — unified training loop: class-weighted CE, LR
warmup + cosine decay, gradient clipping, mixed precision when CUDA
is available, early stopping on val macro F1, best-on-val checkpoint.
The same loop runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
early_stopping_rounds on val mlogloss.
- training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
per-profile and per-host breakdown, paired-bootstrap significance
for model-vs-model gap. Confusion matrix uses union of seen labels.
- training/dashboard/producers/ — replay/metrics/perf/profiles
emitting the six event types the dashboard's awaiting scenes
consume; on-demand tensor extraction so the Pi can run live
inference without 65 GB of shards.
- 17 unit tests (split coverage, features round-trip, schema mismatch,
determinism, time-base alignment regression).
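The schema-hashed checkpoint guard described above could look roughly like the sketch below. It is not the code in training/models/. The function names, the metadata key, the channel names, and the reading of "10s/5s" as a 10 s window with a 5 s stride are all assumptions made for illustration.

```python
import hashlib
import json


def schema_hash(channels, window_s, stride_s):
    """Digest of the feature layout; any change to it changes the hash.

    Hypothetical helper: the real registry lives in training/_features.py.
    """
    payload = json.dumps(
        {"channels": list(channels), "window_s": window_s, "stride_s": stride_s},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature layout."""


def check_checkpoint(meta, current_hash):
    """Refuse to load when the stored hash differs from the current one."""
    saved = meta.get("schema_hash")  # assumed metadata key
    if saved != current_hash:
        raise SchemaMismatchError(
            f"checkpoint schema {saved!r} != current {current_hash!r}"
        )


# Channel names below are made up; order matters because it fixes tensor layout.
trained = schema_hash(["proc.cpu", "netflow.bytes"], 10.0, 5.0)
check_checkpoint({"schema_hash": trained}, trained)  # same layout: passes

# Adding a channel after training changes the digest, so loading is refused.
drifted = schema_hash(["proc.cpu", "netflow.bytes", "qmp.events"], 10.0, 5.0)
try:
    check_checkpoint({"schema_hash": trained}, drifted)
    refused = False
except SchemaMismatchError:
    refused = True
```

Hashing a canonical JSON dump (sorted keys, fixed separators) keeps the digest stable across runs while still reacting to any reordering or addition of channels.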
End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
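
For readers unfamiliar with paired-bootstrap significance on macro F1, here is a minimal stdlib-only sketch of the idea. It is not the training/eval_/ implementation. The function names, the resample count, and the toy predictions are assumptions, and a class missing from a resample is scored 0 for both models as a simplification.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no TP/FP/FN scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(y_true, pred_a, pred_b, n_boot=500, seed=0):
    """Resample episodes with replacement, scoring both models on the same draw."""
    rng = random.Random(seed)
    # Union of seen labels, so a class predicted but never true still counts.
    labels = sorted(set(y_true) | set(pred_a) | set(pred_b))
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, a, labels) - macro_f1(t, b, labels))
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    # Two-sided p-value: how often the resampled gap reaches or crosses zero.
    frac_le = sum(d <= 0 for d in deltas) / n_boot
    frac_ge = sum(d >= 0 for d in deltas) / n_boot
    return lo, hi, min(1.0, 2 * min(frac_le, frac_ge))


# Toy data: model A is perfect, model B confuses class 2 with class 0.
y_true = [0, 1, 2] * 40
pred_a = list(y_true)
pred_b = [0 if t == 2 else t for t in y_true]
lo, hi, p = paired_bootstrap(y_true, pred_a, pred_b)
```

Because both models are scored on the same resampled indices, per-episode difficulty cancels out of the gap, which is what makes the comparison paired rather than two independent intervals.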
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
77 lines
1.3 KiB
Text
# Disk images and snapshots
*.iso
*.img
*.qcow2
*.qcow2.*
*.vmdk
*.vdi
*.raw
vm/images/
vm/snapshots/

# VERSION file is install-script-stamped (provenance for episodes
# generated from /opt/cis490 install copies). Tracking it would
# trigger spurious dirty-tree state on lab hosts and reject every
# episode at the §4.6 acceptance gate.
/VERSION

# Telemetry output
data/episodes/
data/campaign.json
data/campaign_done.marker
data/outbox/
data/shipped/
*.pcap
*.pcapng

# Training artifacts that are regenerated from raw episodes:
# features are large and deterministic from code+episodes, so we don't
# track them. validation_v1.parquet IS tracked — it's small and pins
# the accepted/degraded set.
data/processed/features_*.parquet
data/processed/feature_schema_*.json
data/processed/.validation_checkpoint.parquet
data/processed/validation_smoke.parquet
data/logs/
artifacts/
artifacts-*/
reports/eval/
reports/pca/
reports/xai/
reports/fleet-*/

# Per-developer training venv
.venv-training/

# Malware samples — NEVER commit binaries
samples/store/
*.bin
*.elf
*.exe
*.dll
*.so.malware

# Python
__pycache__/
*.py[cod]
.venv/
venv/
.pytest_cache/
.mypy_cache/
.ruff_cache/
*.egg-info/
dist/
build/

# Editor
.vscode/
.idea/
*.swp
.DS_Store

# Local secrets (never commit)
.env
.env.local
secrets.toml
*.pat
*.token