The model layer of the project, built honestly:
- tools/dataset_validate.py — full-sweep validator over the receiver
store (sha256, schema, monotonic labels, telemetry-row gate). On the
current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
is committed as the per-episode acceptance index.
- training/_features.py — channel registry (46 channels across
proc/guest/qmp/netflow), summary-stat windowing AND channel×time
tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
(Unix ns) — tested fix for a real netflow-vs-host clock-base
inconsistency that was silently dropping every netflow channel.
- training/_split.py — three held-out recipes (host / sample / time)
with profile-stratification assertions. held_out_host carries
untested_profiles for profiles such as scan-and-dial that are absent
from the test host (5 of 6 profiles are tested cross-device; the
untested one is never silently averaged in).
- training/models/ — 6 architectures behind a common BaseModel
interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
trained twice (realistic / oracle) per the deployment threat model.
Schema-hashed checkpoints refuse to load if _features.py changed
since training (silent-input-drift protection, tested).
- training/trainer/ — unified training loop: class-weighted CE, LR
warmup + cosine, gradient clipping, mixed precision when CUDA,
early stopping on val macro F1, best-on-val checkpoint. Same loop
runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
early_stopping_rounds on val mlogloss.
- training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
per-profile and per-host breakdown, paired-bootstrap significance
for model-vs-model gap. Confusion matrix uses union of seen labels.
- training/dashboard/producers/ — replay/metrics/perf/profiles
emitting the six event types the dashboard's awaiting scenes
consume; on-demand tensor extraction so the Pi can run live
inference without 65 GB of shards.
- 17 unit tests (split coverage, features round-trip, schema mismatch,
determinism, time-base alignment regression).
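
The schema-hash guard described in the training/models/ bullet can be
illustrated with a minimal sketch. Every name here (SchemaMismatchError,
schema_hash, save_checkpoint, load_checkpoint, window_s, stride_s) is
hypothetical and not taken from the repository, and the sketch assumes the
hash covers an ordered channel list plus two windowing parameters; what
_features.py actually hashes may differ.

```python
import hashlib
import json


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(channels, window_s, stride_s):
    # Canonical JSON so the same schema always yields the same digest.
    # Channel order is preserved on purpose: reordering channels changes
    # what each input column means to the model.
    payload = json.dumps(
        {"channels": list(channels), "window_s": window_s, "stride_s": stride_s},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_checkpoint(weights, channels, window_s, stride_s):
    # Store the digest next to the weights (hypothetical checkpoint layout).
    return {"schema": schema_hash(channels, window_s, stride_s), "weights": weights}


def load_checkpoint(ckpt, channels, window_s, stride_s):
    # Refuse to load if the current feature schema differs from training time.
    current = schema_hash(channels, window_s, stride_s)
    if ckpt["schema"] != current:
        raise SchemaMismatchError(
            f"checkpoint schema {ckpt['schema'][:12]} != current {current[:12]}"
        )
    return ckpt["weights"]
```

Hashing the ordered channel list means that even a reordering of channels,
which would leave tensor shapes unchanged, is caught at load time.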
End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
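
The paired-bootstrap comparison mentioned above can be sketched as follows.
This is an illustration in plain Python, not the code in training/eval_/;
the function names, the default of 1000 resamples, and the percentile
interval are assumptions. Both models are scored on the same resampled
episodes, so the difference is paired.

```python
import random


def macro_f1(y_true, y_pred, labels):
    # Unweighted mean of per-class F1; a class with no support and no
    # predictions in this sample contributes 0.0.
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    rng = random.Random(seed)
    # Fix the label set once, as the union of labels seen anywhere.
    labels = sorted(set(y_true) | set(pred_a) | set(pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # One index sample shared by both models keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(yt, a, labels) - macro_f1(yt, b, labels))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[min(n_boot - 1, int(0.975 * n_boot))]
    # Fraction of resamples in which model A failed to beat model B.
    frac_a_not_better = sum(1 for d in diffs if d <= 0) / n_boot
    return {"ci95": (lo, hi), "frac_a_not_better": frac_a_not_better}
```

A 95% interval on the difference that excludes zero is the usual reading of
a significant model-vs-model gap under this scheme.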
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
53 lines
2.3 KiB
Python
"""Tiny Transformer encoder over channel × time windows.

Linear projection of channels → d_model, learned positional embedding,
two encoder layers, mean-pool over time, linear head. Deliberately
small (d_model=64, 4 heads, 2 layers) — the dataset is small enough
that anything bigger overfits within a few epochs."""
from __future__ import annotations

from training.models import register
from training.models._torch_seq import _SeqBase


@register("transformer")
class Transformer(_SeqBase):
    def _build_module(self, *, n_channels_in: int, n_timesteps: int,
                      n_classes: int, d_model: int = 64, n_heads: int = 4,
                      n_layers: int = 2, ffn_hidden: int = 128,
                      dropout: float = 0.1):
        return _TransformerClassifier(
            n_channels_in=n_channels_in, n_timesteps=n_timesteps,
            n_classes=n_classes, d_model=d_model, n_heads=n_heads,
            n_layers=n_layers, ffn_hidden=ffn_hidden, dropout=dropout,
        )


import torch  # noqa: E402
from torch import nn  # noqa: E402


class _TransformerClassifier(nn.Module):
    def __init__(self, *, n_channels_in: int, n_timesteps: int, n_classes: int,
                 d_model: int, n_heads: int, n_layers: int, ffn_hidden: int,
                 dropout: float):
        super().__init__()
        self.proj = nn.Linear(n_channels_in, d_model)
        self.pos = nn.Parameter(torch.zeros(1, n_timesteps, d_model))
        nn.init.trunc_normal_(self.pos, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_hidden,
            dropout=dropout, batch_first=True, activation="gelu",
            norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Dropout(dropout),
                                  nn.Linear(d_model, n_classes))

    def forward(self, x):  # (B, C, T) → (B, T, C)
        x = x.transpose(1, 2)
        h = self.proj(x) + self.pos[:, : x.size(1), :]
        h = self.encoder(h)  # (B, T, d_model)
        h = h.mean(dim=1)  # mean-pool over time
        return self.head(h)
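
To make the "deliberately small" claim in the docstring concrete, the
trainable parameter count of this architecture can be worked out by hand.
The helper names below are hypothetical, and the example values
n_timesteps=10 and n_classes=6 are assumptions chosen for illustration;
only the 46 input channels and the default hyperparameters come from the
source. The arithmetic follows the standard layout of
nn.TransformerEncoderLayer (fused Q/K/V projection, output projection,
two feed-forward linears, two LayerNorms).

```python
def encoder_layer_params(d_model, ffn_hidden):
    attn_in = 3 * d_model * d_model + 3 * d_model  # fused Q/K/V weight + bias
    attn_out = d_model * d_model + d_model         # attention output projection
    ffn = (d_model * ffn_hidden + ffn_hidden) + (ffn_hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                      # two LayerNorms, weight + bias
    return attn_in + attn_out + ffn + norms


def classifier_params(n_channels_in, n_timesteps, n_classes,
                      d_model=64, n_layers=2, ffn_hidden=128):
    proj = n_channels_in * d_model + d_model       # nn.Linear(n_channels_in, d_model)
    pos = n_timesteps * d_model                    # learned positional embedding
    encoder = n_layers * encoder_layer_params(d_model, ffn_hidden)
    head = 2 * d_model + (d_model * n_classes + n_classes)  # LayerNorm + Linear
    return proj + pos + encoder + head


# Assumed example shapes: 46 channels, 10 timesteps, 6 classes.
print(classifier_params(46, 10, 6))  # 71110
```

The head count does not appear in the formula because multi-head attention
splits d_model across heads rather than adding parameters per head.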