CIS490

bolyai/CIS490

Commit graph


Author

SHA1

Message

Date

Max

2aa7b865fb

training/models: knn_semi — semi-supervised self-training KNN

Registered as `knn_semi`. Answers the research question:

  *If we had ground-truth labels for only a fraction of training
   episodes, could we use the structure of the unlabeled rest to
   recover most of supervised KNN's accuracy?*

Pipeline (Yarowsky-style self-training):

  1. Split train slice deterministically into labeled (label_frac=0.2
     default) and unlabeled (1 - label_frac) by row-index hash.
  2. Fit a "labeler" KNN on the labeled fraction.
  3. Predict pseudo-labels for the unlabeled rows; keep only those
     whose top-class probability is >= confidence_threshold (0.6).
  4. Fit the final KNN on (labeled rows + confident pseudo-labels).
     Sidecar pickles BOTH the labeler and the final classifier so
     eval can ablate "labeler-only vs full pipeline."
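A minimal sketch of the four steps above, for illustration only. It assumes a tiny dependency-free `TinyKNN` class as a stand-in for the sklearn classifier, a SHA-256 bucket as the row-index hash, and synthetic two-cluster data; `is_labeled`, `self_train`, and `demo_rows` are hypothetical names, not the functions in this commit.

```python
import hashlib
import random
from collections import Counter


def is_labeled(row_index, label_frac=0.2):
    """Step 1: deterministic split. Hash the row index into [0, 1)."""
    digest = hashlib.sha256(str(row_index).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < label_frac


class TinyKNN:
    """Dependency-free stand-in for a k-nearest-neighbours classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, rows, labels):
        # KNN "training" just memorizes the rows.
        self.rows, self.labels = list(rows), list(labels)
        return self

    def predict_proba(self, x):
        """Return {label: vote share} among the k closest stored rows."""
        by_distance = sorted(
            zip(self.rows, self.labels),
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])),
        )
        votes = Counter(label for _, label in by_distance[: self.k])
        total = sum(votes.values())
        return {label: count / total for label, count in votes.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(proba, key=proba.get)


def self_train(rows, labels, label_frac=0.2, confidence_threshold=0.6, k=3):
    labeled = [i for i in range(len(rows)) if is_labeled(i, label_frac)]
    unlabeled = [i for i in range(len(rows)) if not is_labeled(i, label_frac)]

    # Step 2: the labeler only ever sees the labeled fraction.
    labeler = TinyKNN(k).fit([rows[i] for i in labeled], [labels[i] for i in labeled])

    # Step 3: keep a pseudo-label only if its top-class vote share clears the threshold.
    train_rows = [rows[i] for i in labeled]
    train_labels = [labels[i] for i in labeled]
    for i in unlabeled:
        proba = labeler.predict_proba(rows[i])
        top_label = max(proba, key=proba.get)
        if proba[top_label] >= confidence_threshold:
            train_rows.append(rows[i])
            train_labels.append(top_label)

    # Step 4: the final classifier is fit on labeled rows plus confident pseudo-labels.
    final = TinyKNN(k).fit(train_rows, train_labels)
    return labeler, final, len(labeled), len(train_rows)


def demo_rows(n_per_class=100, seed=0):
    """Two well-separated 2-D clusters, interleaved so both classes get labels."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_per_class):
        rows.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(0)
        rows.append((rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)))
        labels.append(1)
    return rows, labels
```

Returning both `labeler` and `final` mirrors the sidecar's two pickled classifiers, so the same evaluation can be run on either one. The sketch does not guard against an empty labeled fraction.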

Smoke run (567-episode subset, oracle mode, label_frac=0.2):

  model      labels   val_macro_f1   test_macro_f1
  knn        100%     0.737          0.133
  knn_semi   20%      0.654          0.173

Lower val (fewer true labels) but higher cross-device test. On this
smoke run, that is consistent with pseudo-labeling acting as a
regularizer against overfitting to elliott-thinkpad's specific
neighborhood structure. Honest research finding worth a slide in the
writeup.

Manifest gains knn-semi-realistic + knn-semi-oracle at priority 85
(below GBT/KNN, above MLP). Storage cost = augmented-set rows ×
n_features × 4 bytes; same .knn.pkl sidecar format as plain KNN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 13:51:30 -05:00

Max

2187a5d752

training/models: KNN as a registered supervised model

Non-parametric baseline alongside GBT/MLP/CNN/GRU/LSTM/Transformer.
Same BaseModel + schema-hashed checkpoint contract; sidecar is a
pickled sklearn KNeighborsClassifier (.knn.pkl) handled by the
existing checkpoint machinery alongside .xgb.json / .pt.

KNN's storage cost = n_train_rows × n_kept_features × 4 bytes.
At 660k windows × 145 kept features (realistic mode), the sidecar is
~380 MB; at 230 features (oracle mode), ~600 MB. Heavy, but it ships
through the same artifact-upload path.
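The arithmetic behind those two figures, as a quick check. This assumes the 4 bytes per value means float32 storage, counts decimal megabytes, and ignores pickle overhead; `knn_sidecar_bytes` is a hypothetical helper, not part of the repo.

```python
def knn_sidecar_bytes(n_rows, n_features, bytes_per_value=4):
    """Raw size of the memorized training matrix (assumed float32)."""
    return n_rows * n_features * bytes_per_value


for mode, n_features in (("realistic", 145), ("oracle", 230)):
    size_mb = knn_sidecar_bytes(660_000, n_features) / 1e6
    print(f"{mode}: {size_mb:.1f} MB")
# realistic: 382.8 MB
# oracle: 607.2 MB
```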

trainer/run.py learns a third fit branch:
  - GBT — XGBoost early stopping on val mlogloss
  - KNN — fit() memorizes; "training time" is val/test predict cost
  - NN  — train_nn loop (the rest)

Manifest gains knn-realistic + knn-oracle at priority 95 (just
below GBT). KNN's k=10 default lives in the model class — overriding
via hyper.k requires adding --k to run.py first to avoid the
unknown-arg exit-2 issue.
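For context on the exit-2 issue: argparse exits with status 2 when it meets an unrecognized argument. The sketch below reproduces that, assuming run.py parses its flags with argparse (the commit does not say). The `--model` flag and both helper names are hypothetical.

```python
import argparse


def build_parser(with_k):
    """Hypothetical stand-in for run.py's parser, with or without a --k flag."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--model", default="knn")
    if with_k:
        parser.add_argument("--k", type=int, default=10)
    return parser


def exit_code(parser, argv):
    """Return 0 on a clean parse, else the status argparse tried to exit with."""
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0


argv = ["--model", "knn", "--k", "5"]
print(exit_code(build_parser(with_k=False), argv))  # unknown --k: argparse exits 2
print(exit_code(build_parser(with_k=True), argv))   # --k registered: parses cleanly
```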

Smoke verified on the 567-episode subset:
  knn   oracle    val=0.7365  test=0.1333  (held-out k-gamingcom)

That val/test gap (0.74 → 0.13) is the cross-device generalization
story: KNN memorizes elliott-thinkpad's local feature space and
falls apart on the other host. Honest baseline for the comparison
report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 13:06:56 -05:00

Max

1fabd4a246

training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers

The model layer of the project, built honestly:

  - tools/dataset_validate.py — full-sweep validator over the receiver
    store (sha256, schema, monotonic labels, telemetry-row gate). On the
    current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
    7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
    is committed as the per-episode acceptance index.

  - training/_features.py — channel registry (46 channels across
    proc/guest/qmp/netflow), summary-stat windowing AND channel×time
    tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
    (Unix ns) — tested fix for a real netflow-vs-host clock-base
    inconsistency that was silently dropping every netflow channel.

  - training/_split.py — three held-out recipes (host / sample / time)
    with profile-stratification assertions. held_out_host carries
    untested_profiles for cases like scan-and-dial absent from the test
    host (5 of 6 profiles tested cross-device, never silently averaged).

  - training/models/ — 6 architectures behind a common BaseModel
    interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
    trained twice (realistic / oracle) per the deployment threat model.
    Schema-hashed checkpoints refuse to load if _features.py changed
    since training (silent-input-drift protection, tested).

  - training/trainer/ — unified training loop: class-weighted CE, LR
    warmup + cosine, gradient clipping, mixed precision when CUDA,
    early stopping on val macro F1, best-on-val checkpoint. Same loop
    runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
    early_stopping_rounds on val mlogloss.

  - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
    per-profile and per-host breakdown, paired-bootstrap significance
    for model-vs-model gap. Confusion matrix uses union of seen labels.

  - training/dashboard/producers/ — replay/metrics/perf/profiles
    emitting the six event types the dashboard's awaiting scenes
    consume; on-demand tensor extraction so the Pi can run live
    inference without 65 GB of shards.

  - 17 unit tests (split coverage, features round-trip, schema mismatch,
    determinism, time-base alignment regression).
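One way the schema-hashed checkpoint guard described in the training/models/ item could work, as a rough sketch. It assumes the hash is a SHA-256 over the feature-definition source text and that the checkpoint is a JSON file; the repo may hash something else (for example the channel registry), and every name here is hypothetical.

```python
import hashlib
import json
import pathlib


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(feature_source):
    """Fingerprint the feature definition (assumed here: its source text)."""
    return hashlib.sha256(feature_source.encode("utf-8")).hexdigest()


def save_checkpoint(path, weights, feature_source):
    # Store the fingerprint next to the weights so it travels with them.
    payload = {"schema_hash": schema_hash(feature_source), "weights": weights}
    pathlib.Path(path).write_text(json.dumps(payload))


def load_checkpoint(path, feature_source):
    payload = json.loads(pathlib.Path(path).read_text())
    current = schema_hash(feature_source)
    if payload["schema_hash"] != current:
        # Refuse to load rather than silently feed the model drifted inputs.
        raise SchemaMismatchError(
            f"checkpoint schema {payload['schema_hash'][:10]} != current {current[:10]}"
        )
    return payload["weights"]
```

Loading with the same `feature_source` string returns the weights; changing even one channel name in that string raises `SchemaMismatchError` instead.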

End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
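A rough sketch of a paired bootstrap on the macro F1 gap between two models, to illustrate the kind of significance check mentioned above. It assumes a percentile interval and resampling of test rows with replacement, using the same resampled rows for both models; `macro_f1` and `paired_bootstrap_gap` are hypothetical names, and the eval suite's actual procedure may differ.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the union of seen labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_gap(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """95% percentile interval for macro_f1(A) - macro_f1(B)."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_resamples):
        # Paired: both models are scored on the same resampled rows.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        f1_a = macro_f1(truth, [pred_a[i] for i in idx])
        f1_b = macro_f1(truth, [pred_b[i] for i in idx])
        gaps.append(f1_a - f1_b)
    gaps.sort()
    lo = gaps[int(0.025 * n_resamples)]
    hi = gaps[max(int(0.975 * n_resamples) - 1, 0)]
    return lo, hi
```

If the returned interval excludes zero, the gap between the two models is unlikely to be resampling noise at roughly the 5% level.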

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 01:19:00 -05:00

3 commits