CIS490/training/models/_torch_seq.py
Max 1fabd4a246 training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers
The model layer of the project, built honestly:

  - tools/dataset_validate.py — full-sweep validator over the receiver
    store (sha256, schema, monotonic labels, telemetry-row gate). On the
    current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
    7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
    is committed as the per-episode acceptance index.

  - training/_features.py — channel registry (46 channels across
    proc/guest/qmp/netflow), summary-stat windowing AND channel×time
    tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
    (Unix ns) — tested fix for a real netflow-vs-host clock-base
    inconsistency that was silently dropping every netflow channel.

  - training/_split.py — three held-out recipes (host / sample / time)
    with profile-stratification assertions. held_out_host carries
    untested_profiles for cases like scan-and-dial absent from the test
    host (5 of 6 profiles tested cross-device, never silently averaged).

  - training/models/ — 6 architectures behind a common BaseModel
    interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
    trained twice (realistic / oracle) per the deployment threat model.
    Schema-hashed checkpoints refuse to load if _features.py changed
    since training (silent-input-drift protection, tested).

  - training/trainer/ — unified training loop: class-weighted CE, LR
    warmup + cosine, gradient clipping, mixed precision when CUDA,
    early stopping on val macro F1, best-on-val checkpoint. Same loop
    runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
    early_stopping_rounds on val mlogloss.

  - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
    per-profile and per-host breakdown, paired-bootstrap significance
    for model-vs-model gap. Confusion matrix uses union of seen labels.

  - training/dashboard/producers/ — replay/metrics/perf/profiles
    emitting the six event types the dashboard's awaiting scenes
    consume; on-demand tensor extraction so the Pi can run live
    inference without 65 GB of shards.

  - 17 unit tests (split coverage, features round-trip, schema mismatch,
    determinism, time-base alignment regression).
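
The schema-hashed checkpoint guard mentioned in the training/models/ item can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's code: the channel names, `WINDOW_S`, `schema_hash`, and `check_schema` are all hypothetical, and the real hash may cover more of _features.py than a channel list and a window size.

```python
from __future__ import annotations

import hashlib
import json

# Hypothetical stand-in for the channel registry in training/_features.py.
CHANNELS = ["proc.cpu_pct", "guest.mem_rss", "qmp.block_rd", "netflow.bytes_out"]
WINDOW_S = 10


def schema_hash(channels: list[str], window_s: int) -> str:
    """Hash the feature schema in a canonical, channel-order-sensitive form."""
    blob = json.dumps(
        {"channels": channels, "window_s": window_s},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def check_schema(header: dict, channels: list[str], window_s: int) -> None:
    """Raise if the checkpoint was written against a different schema."""
    current = schema_hash(channels, window_s)
    if header["schema_hash"] != current:
        raise ValueError(
            f"schema mismatch: checkpoint {header['schema_hash'][:12]} "
            f"!= current {current[:12]}"
        )


# Written at training time.
header = {"schema_hash": schema_hash(CHANNELS, WINDOW_S)}

# Same schema at load time: passes silently.
check_schema(header, CHANNELS, WINDOW_S)

# A channel added after training: the load is refused.
try:
    check_schema(header, CHANNELS + ["netflow.pkts_in"], WINDOW_S)
    drift_detected = False
except ValueError:
    drift_detected = True
```

Hashing the ordered channel list means a reordering also counts as drift, which matters when a model indexes channels by position.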

End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
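
For readers unfamiliar with the paired-bootstrap comparison mentioned above, here is a small pure-Python sketch. It assumes a percentile interval on the macro-F1 gap and scores a class that is absent from a resample as 0.0; `macro_f1`, `paired_bootstrap_gap`, and the toy labels are hypothetical and are not taken from training/eval_/.

```python
from __future__ import annotations

import random


def macro_f1(y_true: list[int], y_pred: list[int], labels: list[int]) -> float:
    """Unweighted mean of per-class F1; a class with no support scores 0.0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_gap(y_true, pred_a, pred_b, *, n_boot=1000, seed=0):
    """95% percentile interval for macro_f1(A) - macro_f1(B).

    Both models are scored on the SAME resampled episodes in every
    replicate, which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    labels = sorted(set(y_true) | set(pred_a) | set(pred_b))
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f1_a = macro_f1(yt, [pred_a[i] for i in idx], labels)
        f1_b = macro_f1(yt, [pred_b[i] for i in idx], labels)
        gaps.append(f1_a - f1_b)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]


# Toy data: model A is perfect, model B mislabels some episodes.
y_true = [0, 1, 2] * 20
pred_a = list(y_true)
pred_b = [0 if i % 4 == 0 else t for i, t in enumerate(y_true)]
lo, hi = paired_bootstrap_gap(y_true, pred_a, pred_b, n_boot=200, seed=1)
```

If the interval excludes zero, the gap between the two models is unlikely to be resampling noise at roughly the 5% level.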

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:19:00 -05:00


"""Shared scaffolding for sequence models.

All four sequence models (CNN, GRU, LSTM, Transformer) follow the same
input/output contract:

    Input:  (B, n_channels_keep, n_timesteps) float32
    Output: (B, n_classes) float32 logits

This module factors out the common BaseModel boilerplate so each
architecture file only declares its torch.nn.Module.
"""
from __future__ import annotations

from typing import Any

import numpy as np

from training.models._base import BaseModel, StandardizeStats


class _SeqBase(BaseModel):
    """Composition wrapper: a torch.nn.Module under self._mod plus the
    BaseModel interface (select, predict, predict_proba, save_sidecar).
    Subclasses override _build_module(self, **cfg) -> nn.Module."""

    input_kind = "tensor"

    def __init__(
        self,
        *,
        n_channels_in: int,
        n_timesteps: int,
        n_classes: int,
        keep_mask: np.ndarray,
        standardize: StandardizeStats,
        device: str = "cpu",
        **arch_config,
    ) -> None:
        self.n_classes = n_classes
        self.keep_mask = keep_mask.astype(bool)
        self.standardize = standardize
        self.config = {
            "n_channels_in": n_channels_in,
            "n_timesteps": n_timesteps,
            **arch_config,
        }
        self._device = device
        self._mod = self._build_module(
            n_channels_in=n_channels_in,
            n_timesteps=n_timesteps,
            n_classes=n_classes,
            **arch_config,
        ).to(device)

    @property
    def module(self):
        return self._mod

    def _build_module(self, **cfg):
        raise NotImplementedError

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        import torch

        Xk = self.select(X)  # (N, C_keep, T) float32
        self._mod.eval()
        with torch.no_grad():
            t = torch.from_numpy(Xk).to(self._device)
            logits = self._mod(t)
        return torch.softmax(logits, dim=-1).cpu().numpy()

    def state_for_checkpoint(self) -> dict[str, Any]:
        return {"state_dict": self._mod.state_dict(), "config": self.config}

    @classmethod
    def from_checkpoint(cls, header: dict, payload: dict, *,
                        device: str = "cpu") -> "_SeqBase":
        cfg = dict(payload["config"])
        n_ch = cfg.pop("n_channels_in")
        n_t = cfg.pop("n_timesteps")
        m = cls(
            n_channels_in=n_ch, n_timesteps=n_t,
            n_classes=int(header["n_classes"]),
            keep_mask=np.asarray(header["keep_mask"], dtype=bool),
            standardize=StandardizeStats.from_dict(header["standardize"]),
            device=device,
            **cfg,
        )
        m._mod.load_state_dict(payload["state_dict"])
        return m
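
The config round-trip between state_for_checkpoint and from_checkpoint can be exercised without torch. The sketch below uses a hypothetical `ToySeqModel` that mirrors only the config handling of `_SeqBase`; the empty `state_dict`, the `hidden` and `dropout` keys, and the shape values are illustrative assumptions.

```python
from __future__ import annotations

from typing import Any


class ToySeqModel:
    """Hypothetical stand-in for a _SeqBase subclass, with no torch dependency."""

    def __init__(self, *, n_channels_in: int, n_timesteps: int,
                 n_classes: int, **arch_config: Any) -> None:
        self.n_classes = n_classes
        # Shape arguments and architecture kwargs share one flat dict,
        # as in _SeqBase.__init__.
        self.config = {
            "n_channels_in": n_channels_in,
            "n_timesteps": n_timesteps,
            **arch_config,
        }

    def state_for_checkpoint(self) -> dict[str, Any]:
        # A real model would put module weights under "state_dict".
        return {"state_dict": {}, "config": self.config}

    @classmethod
    def from_checkpoint(cls, header: dict, payload: dict) -> "ToySeqModel":
        cfg = dict(payload["config"])  # copy so the payload is not mutated
        n_ch = cfg.pop("n_channels_in")
        n_t = cfg.pop("n_timesteps")
        # Whatever remains in cfg is forwarded as architecture kwargs.
        return cls(n_channels_in=n_ch, n_timesteps=n_t,
                   n_classes=int(header["n_classes"]), **cfg)


original = ToySeqModel(n_channels_in=46, n_timesteps=12, n_classes=6,
                       hidden=64, dropout=0.1)
payload = original.state_for_checkpoint()
restored = ToySeqModel.from_checkpoint({"n_classes": 6}, payload)
```

Copying the config before popping keys keeps the payload reusable, so the same checkpoint dict can be loaded more than once.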