Commit graph

188 commits

Author SHA1 Message Date
Max Gorog
6230f18692 model bars: paint every architecture (+ neutral fallback)
The bar widget had gradients for lstm / gru / rnn / bert / knn
only — any other model name (cnn, mlp, transformer, gbt, knn_semi,
transformer_ssl) rendered a track but no fill. Now:

- Added explicit gradients for cnn, mlp, transformer,
  transformer_ssl, gbt, knn_semi (each visually distinct from the
  existing five).
- Added a neutral grey-to-grey fallback gradient on .model-fill itself, so any
  unanticipated model name still produces a visible bar instead of
  silently disappearing. The specific class rules override it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:20:21 -05:00
Max
c2a71de4b2 scene 9 bars: paint full zoo + 0–1 visible scale
- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
  gru / lstm / bert (knn handled by knn streamer); read both
  *_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
  non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
  range so honest-low cross-host F1s show as a bar instead of
  clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
  (scenes 7/8/12, sticky cache, lambda-inference-demo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:18:00 -05:00
Max Gorog
06bfcef3d6 demo button: include (d) in tooltip
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:07:47 -05:00
Max Gorog
cedf64c708 hotkeys: 'd' toggles demo mode
Saves a click during live demos. Topbar tooltip updated to mention
the binding. Hotkey is gated by the same input-focus check as 'c' /
arrow keys, so typing 'd' in a search box won't fire it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:07:23 -05:00
Max Gorog
0bc2b57ccb live demo: back to 2500 ms cadence
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:59:16 -05:00
Max Gorog
00d11740eb live demo: drop elliott-lab from inference host list
It contributed no training data, so the A100 wouldn't be running
inference on its windows. Only hosts that actually produced data
(elliott-thinkpad, k-gamingcom) should appear as the source of
synthetic predictions in the live scene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:59:04 -05:00
Max Gorog
ac630997c3 live demo: bump cadence back up to ~1 event/sec
2500ms read too slow. 1000ms is the sweet spot — under the real
ceiling of ~1.5/sec but still lively enough to feel like a working
inference loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:52:37 -05:00
Max Gorog
ab21217261 live demo: slow A100 inference cadence to ~0.4 events/sec
Was 280ms (~3.5 events/sec) — way too fast for real fleet
inference. The bottleneck is window arrival (one 10-second window
per host per 10 s), not A100 forward-pass speed. With ~3 hosts × 5
models that's ~1.5 events/sec real ceiling, so demo at 2500ms
(~0.4/sec) reads honest without claiming impossible throughput.
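
The ceiling arithmetic above can be checked with a short Python sketch. The function names are hypothetical and exist only for illustration; the figures (3 hosts, 5 models, 10-second windows, 280 ms and 2500 ms intervals) are the ones quoted in this message.

```python
def ceiling_events_per_sec(hosts: int, models: int, window_s: float) -> float:
    """Each host yields one window per window_s; every model scores every window."""
    return hosts * models / window_s


def demo_events_per_sec(interval_ms: float) -> float:
    """Event rate of a timer that fires once every interval_ms."""
    return 1000.0 / interval_ms


ceiling = ceiling_events_per_sec(hosts=3, models=5, window_s=10.0)
print(f"real ceiling: {ceiling:.2f} events/sec")
print(f"280 ms  -> {demo_events_per_sec(280):.2f} events/sec")
print(f"2500 ms -> {demo_events_per_sec(2500):.2f} events/sec")
```

Any demo interval whose rate stays below the ceiling is defensible; 280 ms is the only one of the two that exceeds it.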

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:52:04 -05:00
Max Gorog
3b96537b3e live scene HTML: stats line + prose match per-model framing
Stats line now reads "A100 inference · live · N models · X infer/sec
· last window: <host> · hit-rate: …" instead of "live detections ·
N hosts · model: …". Prose rewritten to describe lanes as side-by-
side model-agreement check rather than per-host activity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:44:53 -05:00
Max Gorog
5533043b02 live scene: per-model lanes (A100 inference), not per-host
The scene's framing was wrong. It's about the A100 doing live
model predictions, not about per-host telemetry collection. Lanes
now key on `model` instead of `host_id`; the callout leads with
model name + A100 latency, demoting host/profile to secondary
metadata. Stats line reads "N models · X infer/sec · last window
from <host>" instead of "N hosts · model: X".

Demo synthesis updated to match: 5 trained models cycle through
predictions on rotating fleet windows, each model with its own
accuracy + latency profile (KNN fast/loose, BERT slow/precise) so
the lanes visually differ. Article prose reframes the scene as
side-by-side model agreement, the natural read of per-model lanes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:44:23 -05:00
Max Gorog
804220d7f6 knn scatter: revert to real-data only (no demo handlers)
The KNN producer works; KNN does not need a demo-mode fallback.
Remove demo_start / demo_stop / cachedReal / demoActive scaffolding
that I'd added speculatively. Embedding events render directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:39:45 -05:00
Max Gorog
ef6bc71009 knn scatter: exclusive demo (not additive)
Same pattern as models / perf / live: cachedReal accumulates real
embedding events at all times; demoActive flag gates which source
renders.
- demo on  → only synthetic clusters
- demo off → only real embeddings (replayed from cachedReal)

Cache cap 5000 points to bound memory across long sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:36:31 -05:00
Max Gorog
b6e478c578 demo mode: exclusive (not additive) on models / perf / live
Was: demo seeded on demo_start, then real producer events rendered
on top of the synthetic bars/points/cells. Both sources visible
simultaneously — visually confusing.

Now: each widget tracks demoActive + a cachedReal store.
- demo_start: set demoActive=true, clear, repaint from synthetic
- demo_stop:  set demoActive=false, clear, repaint from cachedReal
- on real event: always cache; only render when demo is off

Toggling demo flips between two clean pictures with no overlap.
cachedReal grows as real producer events arrive even while demo is
on, so demo_stop restores immediately without waiting for the
producer to re-publish.

Applied to: models bars, perf scatter, live detections.
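
A minimal Python sketch of the exclusive toggle described above. The real widgets live in the dashboard's JavaScript; the class name and the `rendered` list (a stand-in for what is on screen) are assumptions made for illustration.

```python
class ExclusiveDemoWidget:
    """Renders either synthetic or real events, never both at once."""

    def __init__(self, synthetic):
        self.synthetic = list(synthetic)  # demo data painted on demo_start
        self.cached_real = []             # real events, cached at all times
        self.demo_active = False
        self.rendered = []                # stand-in for what is on screen

    def demo_start(self):
        self.demo_active = True
        self.rendered = list(self.synthetic)    # clear, repaint from synthetic

    def demo_stop(self):
        self.demo_active = False
        self.rendered = list(self.cached_real)  # clear, repaint from cache

    def on_real_event(self, event):
        self.cached_real.append(event)          # always cache
        if not self.demo_active:
            self.rendered.append(event)         # render only when demo is off


w = ExclusiveDemoWidget(synthetic=["fake-1", "fake-2"])
w.on_real_event("real-1")
w.demo_start()
w.on_real_event("real-2")            # cached but not drawn while demo is on
shown_during_demo = list(w.rendered)
w.demo_stop()
shown_after_demo = list(w.rendered)
```

Because real events keep landing in the cache during the demo, `demo_stop` can repaint at once without waiting for the producer.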

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:36:12 -05:00
Max Gorog
a04ea60aef demo mode: never overwrite real data on perf / models / live
Same hasReal* gating I already used for phase_mix, applied to:
- models bars (model_metric)
- perf scatter (model_perf)
- live detections (live_detection)

Each widget tracks whether a real producer event has arrived; demo
only seeds when nothing real has been seen yet, and demo_stop
preserves real state instead of wiping it.

demoTick is now a no-op — periodic model_metric jitter was
overwriting real values mid-stream. Per-widget one-shot seeding
on demo_start (gated by hasReal*) is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:32:55 -05:00
Max Gorog
f429bd4223 perf scatter: log-x + alternating-position labels (kill overlap)
Two issues with the accuracy-vs-latency scatter:
1. Linear x crammed RNN/GRU/LSTM into ~25 px of axis (380/520/700 μs)
   while BERT alone took the right 80 % (3200 μs).
2. Labels placed at fixed +12 right of each point overlapped both
   neighbouring points and other labels in the recurrent cluster.

Fixes:
- X-axis switched to log10 with bounds 10μs–10ms; tick labels and
  marks added at 10μs / 100μs / 1ms / 10ms so the audience can
  read the scale.
- Y-axis bounds widened to [0.5, 1.0] (was [0.7, 1.0]) so KNN's
  ~0.43 cross-host F1 falls within the visible plot area instead
  of off-bottom; ticks added at 0.6 / 0.8 / 1.0.
- Anti-overlap label placement: sort points by x, alternate
  above (-12) / below (+18) the circle. Adjacent labels can no
  longer share both x and y bands. repaintLabels() re-runs on
  each model_perf event so late arrivals slot into the staircase.

Y-axis title also updated: "held-out accuracy" → "held-out macro-F1"
to match the actual metric the producer reports.
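
The two fixes can be sketched in Python as below. This is an illustration of the log10 mapping and the alternating offsets only; the scatter itself is dashboard JavaScript, and the function names and the 600 px plot width are assumptions. The latencies are the ones quoted in this message.

```python
import math

X_MIN_US, X_MAX_US = 10.0, 10_000.0   # 10 us to 10 ms axis bounds


def log_x(latency_us: float, width_px: float) -> float:
    """Map a latency onto a log10 axis spanning X_MIN_US..X_MAX_US."""
    clamped = min(max(latency_us, X_MIN_US), X_MAX_US)
    span = math.log10(X_MAX_US) - math.log10(X_MIN_US)
    return (math.log10(clamped) - math.log10(X_MIN_US)) / span * width_px


def label_offsets(points: dict[str, float]) -> dict[str, int]:
    """Sort by x, then alternate labels above (-12) and below (+18)."""
    ordered = sorted(points, key=points.get)
    return {name: (-12 if i % 2 == 0 else 18) for i, name in enumerate(ordered)}


latencies_us = {"rnn": 380, "gru": 520, "lstm": 700, "bert": 3200}
xs = {name: log_x(us, width_px=600) for name, us in latencies_us.items()}
offsets = label_offsets(xs)
```

Re-running `label_offsets` on every new point is what lets a late arrival take its place in the alternating sequence.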

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:31:42 -05:00
Max Gorog
9e7d9999a3 demo mode: omit attack envelopes too
Same scope-narrowing as collect / hosts / db / knn — attack profiles
are real data from the orchestrator's catalog, so the deck should
display whatever the producer publishes via attack_profile events
and not overwrite that with synthetic curves on demo_start.

Removed both demo_start (synthesize) and demo_stop (clearAll)
handlers; the syntheticProfiles helper is left in place for
reference but is no longer wired to anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:29:38 -05:00
Max Gorog
7e7fb52d32 demo mode: stop synthesizing episode + phase events
Per the data-ownership scope: collect (episodes-ingested counter),
hosts (per-host bars), and db (database explorer) all work fine in
or out of demo mode — they read real values from the server's
snapshot. Demo mode shouldn't be injecting fake `episode` records
into them.

Removed both dispatches from demoTick:
- `episode` (was 70% per tick) — no longer clobbers collect/hosts/db
- `phase` (was 50% per tick) — dead code anyway; baseline now
  consumes the dataset-derived `phase_mix` event, not raw `phase`

demoTick is now just the model_metric jitter (5% per tick) so the
sequence-model bars don't sit frozen during a long demo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:29:13 -05:00
Max Gorog
af1f7fb56d demo mode: backfill phase mix + knn metric (no clobber on real data)
Two targeted fixes for the demo-toggle path; intentionally narrow so
we don't override widgets that already work in both modes (KNN
scatter, DB explorer).

Phase-mix bar
- Tracks `hasRealMix` and only injects a synthetic fallback on
  demo_start if no real snapshot/phase_mix event has been seen.
  If real data later arrives, applyMix overwrites the synthetic
  value automatically.
- Synthetic numbers mirror a real production run (500/78705
  episodes, ~4.5 hours of weighted seconds) so the bar reads
  correctly during a deck-only demo.

KNN model_metric
- Periodic demoTick tweaks now include `knn` alongside rnn/gru/lstm/
  bert. Initial demo_start already populated all five bars; the
  periodic tweak just keeps the knn bar from sitting frozen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:27:25 -05:00
Max
4b2863ea99 producers/multi_model_metrics + scripts/rsync-from-lambda
Pi-safe replacement for the original metrics.py + perf.py producers,
which load every checkpoint into memory and score the test set on each
cycle. That pattern crashed the Pi during this project (300 MB knn
pickles × 6 variants + 226 MB test set in memory at peak ≈ OOM).

The new producer:
  - reads reports/eval/<model>_<mode>_train.json files (already
    contain the test_macro_f1 each trainer wrote)
  - publishes one model_metric event per file
  - publishes one model_perf event per file with a hardcoded
    per-architecture latency estimate (gbt 250 µs, knn 3500, mlp 50,
    cnn 500, gru 1500, lstm 2000, transformer 800, transformer_ssl
    1000). These are family-level order-of-magnitude figures; proper
    benchmarks need to run on the deployment hardware (which is the
    A100, not the Pi).
  - re-publishes on a tick (default 30 s) for refresh-resilience.
  - NO model loading. Pi-safe.
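
A rough sketch of the report-reading approach, under stated assumptions: the event dictionaries, the `collect_events` name, and the way `<model>_<mode>` is split are all hypothetical, and publishing is left out. Only the `test_macro_f1` key, the file naming pattern, and the latency table come from this message. The F1 values written in the demo are toy numbers, not results.

```python
import json
import tempfile
from pathlib import Path

# Family-level latency estimates in microseconds, copied from the message.
LATENCY_US = {"gbt": 250, "knn": 3500, "mlp": 50, "cnn": 500,
              "gru": 1500, "lstm": 2000, "transformer": 800,
              "transformer_ssl": 1000}


def collect_events(eval_dir: Path) -> list[dict]:
    """Build events from *_train.json reports only; no checkpoint is loaded."""
    events = []
    for path in sorted(eval_dir.glob("*_train.json")):
        stem = path.name[: -len("_train.json")]
        model, _, mode = stem.rpartition("_")   # model names may contain "_"
        f1 = json.loads(path.read_text()).get("test_macro_f1")
        if f1 is None:
            continue
        events.append({"type": "model_metric", "model": model,
                       "mode": mode, "value": f1})
        latency = LATENCY_US.get(model)
        if latency is not None:                 # perf event needs an estimate
            events.append({"type": "model_perf", "model": model, "mode": mode,
                           "latency_us": latency, "accuracy": f1})
    return events


with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp)
    (eval_dir / "gbt_oracle_train.json").write_text(
        json.dumps({"test_macro_f1": 0.91}))
    (eval_dir / "transformer_ssl_realistic_train.json").write_text(
        json.dumps({"test_macro_f1": 0.65}))
    events = collect_events(eval_dir)
```

Peak memory is bounded by the size of one small JSON report, which is the property that makes this safe on the Pi.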

scripts/rsync-from-lambda.sh — pulls Lambda's artifacts/ + reports/eval/
to the Pi every 30 s. As Lambda finishes each model and writes its
train.json, the Pi sees the new file within a cycle and the publisher
broadcasts the metric on its next tick. Live multi-model dashboard
during training, with no Pi-side inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:04:15 -05:00
Max Gorog
233390a40e deck: reorder + correct eval framing to held-out-by-sample
REORDER
- collect (big-number ingest counter) moved from #7 to #2 — sits
  right after the title as the dataset-quantity hook
- training-code moved from #15 to #14 — "how we trained" now
  appears before "what we got" (models accuracy bars)

EVAL FRAMING CORRECTION
The fleet hosts are uniform — every host runs every profile, just
at different rates — so the actual split is held-out-by-sample
(profile-stratified), NOT held-out-by-host. Both hosts contribute
to train, val, AND test. The generalization claim is "unseen
malware sample_name", not "unseen device".

Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was
  "cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" →
  "generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never
  saw" → "sample_names the training set never saw"); literature-gap
  bullet flipped from "cross-host generalization" to "sample-
  stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" →
  "held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts
  contribute to train/val/test"; "KNN cross-host gap" → "KNN
  val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as
  primary axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:59:22 -05:00
Max Gorog
db9f013969 deck: 9 new scenes to meet CIS-490 assignment-guide rubric
Five required + four optional slides, slotted into the existing flow
without renumbering the visible deck UI:

REQUIRED
- problem-statement (after motivation): single-sentence problem,
  three numeric stat cards, explicit task-type justification
  (multi-class classification, why not regression/ranking)
- research-questions (after problem-statement): two-column literature
  gap layout + RQ1/RQ2/RQ3
- solution-overview (after research-questions): inline-SVG block
  diagram of the pipeline (fleet hosts → receiver → episodes →
  windowing → model zoo → per-window phase → trust score →
  containment + reset)
- evaluation-setup (between chunking and models): four blocks
  covering split recipe, primary metric, baselines compared, and
  what's reported alongside accuracy. Each block leads with the
  *why*, matching the assignment's "explain not only what will be
  measured but why" requirement.
- conclusion-future (before references): two-column "what we showed"
  + unsupervised next steps (clustering / anomaly / SSL pretrain /
  embedding viz). Addresses Section 8 of the assignment guide.

OPTIONAL
- theoretical-contributions: window-centre labelling,
  schema-hashed checkpoints, cross-host as eval axis
- practical-contributions: /proc-only deployment,
  producer-agnostic dashboard, labelled dataset on disk
- design-principles: one-loop-many-models, typed events as
  contract, two-agent path ownership
- limitations: two-host fleet, synthetic profiles, 10 Hz floor,
  KNN cross-host gap

Plus references/links.md gains four real online references (PyTorch,
XGBoost, scikit-learn, proc(5)) bringing the citation count from 8
to 12 — over the assignment's 10-source minimum.

CSS additions cover the new layouts (.problem-claim, .problem-stats,
.research-grid, .pipeline-svg + .pipeline-stage / .pipeline-arrow,
.eval-blocks, .conclusion-grid). Limitations cards reuse the
motivation-card pattern with an armed-phase amber marker for the
"warning" feel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:32:50 -05:00
Max
4172ddb0c8 docs: request to dashboard side — cap + evict for the KNN scatter
The scene-9 embedding handler appends to a `points` array without
ever capping. The producer republishes its (stable, deterministic)
point set on a cycle so reconnecting browsers eventually see the
scatter; each cycle pushes the same N points again and the in-memory
count grows without bound. Browser slows after ~10 min.

Two complementary fixes proposed:
  A. FIFO cap (1-line change in the handler — fixes the leak today)
  B. embedding_batch event with replace=true (cleaner, pairs with
     the snapshot/sticky-cache request for refresh-time hydration)

Producer side has already reduced cadence as a band-aid (200 pts
every 30 s, was 600 every 5 s) — 18x slower accumulation but still
unbounded.
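
Fix A can be illustrated with a bounded queue. This is a Python sketch of the idea, not the dashboard handler (which is JavaScript); `MAX_POINTS` and `on_embedding` are hypothetical names, and the request itself does not fix a cap value.

```python
from collections import deque

MAX_POINTS = 5000   # assumed cap for illustration

points = deque(maxlen=MAX_POINTS)


def on_embedding(point):
    """Append a point; once full, the deque evicts the oldest entry."""
    points.append(point)


# Simulate the producer republishing 200 points per cycle for 100 cycles.
for cycle in range(100):
    for i in range(200):
        on_embedding((cycle, i))
```

After 20,000 appends the queue still holds only the most recent 5,000 points, so memory stays flat however long the session runs.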

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:30:45 -05:00
Max
3413a7c405 scripts/lambda-bootstrap.sh: also fix eval invocation paths
The eval suite at the end of the bootstrap was using ../artifacts and
../data/* paths because they were originally invoked from inside repo/.
Now that we no longer cd into repo, drop the ../ prefix. Same root
cause as the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:25:40 -05:00
Max
ed7e3db035 scripts/lambda-bootstrap.sh: drop the cd into repo/ before launching trainer
The previous version did `(cd repo && "${cmd[@]}")` to "cd into repo
for module imports." But PYTHONPATH was already set to $PWD/repo at
the top of the script — so the cd was redundant for imports AND
broke relative paths: the trainer expects to find
data/processed/validation_v1.parquet from $HOME/cis490, not from
$HOME/cis490/repo/.

Symptom: every training job failed immediately with
  FileNotFoundError: data/processed/validation_v1.parquet

Drop the cd; PYTHONPATH already does the import work.

Found while running on the A100 today; trainer relaunched manually
in-place via a stand-in bootstrap2.sh; this commit makes the next
bundle clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:25:40 -05:00
Max Gorog
997c399cf9 deck: virtualize to a 3-scene mount window (active ± 1)
Previously every scene rendered at all times — paint, layout, and
the per-scene widgets all ran in parallel. Now only the active
scene and its immediate neighbours carry [data-mounted]; far ones
get content-visibility: hidden on the prose side (paint skipped,
layout placeholder sized via contain-intrinsic-size so scroll
position stays accurate) and display: none on the absolutely-
positioned stage views.

The window is recomputed every time the active scene changes and
pre-computed before programmatic scrolls (Home/End/scrollToScene)
so the destination is rendered before it scrolls into view.

JS state in widgets is preserved — DOM nodes stick around, just
without paint cost — so the KNN scatter, live-detection lanes, and
sparkline state survive scrolling between scenes.
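
The mount window itself is a small index calculation. A Python sketch, assuming zero-based scene indices; the function name is hypothetical and the deck's own code is JavaScript.

```python
def mount_window(active: int, scene_count: int, radius: int = 1) -> set[int]:
    """Indices of scenes that should carry [data-mounted]: active ± radius."""
    lo = max(0, active - radius)
    hi = min(scene_count - 1, active + radius)
    return set(range(lo, hi + 1))


print(mount_window(0, 10))   # first scene: window is clipped on the left
print(mount_window(5, 10))   # interior scene: full three-scene window
```

Computing the window for the destination before a programmatic scroll is what guarantees the target scene is already mounted when it comes into view.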

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:19:46 -05:00
3b3bdab9df Upload files to "references" 2026-05-08 15:07:58 -05:00
Max Gorog
644b9a48fb motivation scene: why detection matters before how we do it
New scene 2 (between intro and stack) framing the operational case
for a per-host detector. Three consequence cards on the stage —
network-level trust scoring, containment before pivot, fast
post-attack reset — backed by a prose section that cites IEEE
document 9881803 for the trust-aggregation argument.

Sidecar md for the paper lands in references/ as a citation note;
when the PDF is dropped in with a matching stem it'll show up in
the references viewer automatically. Link added to links.md too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:49:45 -05:00
Max
c42bf033e5 training/fleet/manifest: accept knn + knn_semi in _ALLOWED_MODELS
Validator's allowed-models frozenset was missing knn and knn_semi
even though the manifest gained those jobs and the model registry
registered the classes. Lambda bootstrap blocked at:
  TrainingManifestError: job 'knn-realistic': model 'knn' not in
    ['cnn', 'gbt', 'gru', 'lstm', 'mlp', 'transformer', 'transformer_ssl']

Now {gbt, knn, knn_semi, mlp, cnn, gru, lstm, transformer, transformer_ssl}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:46:33 -05:00
Max Gorog
4bf241f6ec code cards: presenter-friendly comments on every block
The four code snippets shown on stack and training-code scenes get
inline comments explaining the *why* of each line, not just *what*.
Aimed at the live audience: a presenter reads the comment as the
narration; a reader scans them top-to-bottom for the design story.

Covers: pyproject's three install profiles and what each library
contributes; receiver's bearer auth and why constant-time compare
matters; LSTM model's registry pattern, batch_first transpose,
last-step classification head; trainer loop's class weights vs the
imbalanced dataset, AMP scaler vs fp16 underflow, cosine + warmup
schedule, macro-F1 vs accuracy on imbalanced classes, best-state
restore vs last-epoch weights.
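
For the constant-time compare point, a short sketch using the standard library. `bearer_ok` is a hypothetical helper, not the receiver's own bearer check, and the header format is assumed to be `Bearer <token>`.

```python
import hmac


def bearer_ok(header_value: str, expected_token: str) -> bool:
    """Check an Authorization header against the expected bearer token."""
    scheme, _, presented = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest is designed so its running time does not depend on
    # where the inputs differ, so timing does not reveal a matching prefix.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


print(bearer_ok("Bearer s3cret", "s3cret"))
print(bearer_ok("Bearer wrong", "s3cret"))
```

A plain `==` on strings can return as soon as the first differing character is found, which is the leak the constant-time variant avoids.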

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:17:31 -05:00
Max Gorog
da0e9ce83c code cards: mirror the actual training stack and trainer loop
The stack scene's pyproject snippet was missing the `training`
group (torch, sklearn, xgboost, zstandard) — the libraries that
do the actual model work. Updated to match the real pyproject.toml.

The receiver snippet now ends at _bearer_check(...) instead of the
import block alone — gives the slide a non-trivial line of code to
read.

The training-code scene replaces the toy "PhaseLSTM" hand-rolled
loop with the real LSTM model class (registry-decorated _SeqBase
subclass + _LSTMClassifier wrapping nn.LSTM with last-step
classification head) and adds a second card showing the actual
train_nn loop: AMP autocast/scaler, cosine LR with linear warmup,
inverse-frequency class weights, gradient clipping, macro-F1
on val, early stop with best-state restore.
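
The schedule named above (cosine LR with linear warmup) can be written as a pure function. This is a generic sketch of that schedule, not the project's train_nn code; the function name, the `min_lr` parameter, and the step-based interface are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          base_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (base_lr - min_lr) * cosine


schedule = [lr_at(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
```

The warmup ramp keeps the first updates small while optimizer statistics settle; the cosine tail lowers the rate smoothly instead of in steps.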

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:15:01 -05:00
Max
c1c8e98180 scripts/train-pi-cpu-models.sh — sequential Pi-side trainer chain
Pi has 4 cores; only KNN and tree-based models are realistic to train
here without GPU. While Lambda runs the full 16-job manifest in
parallel (~1.7h), this chain trains the CPU-friendly subset on the
Pi (~30 min) so scenes 8 & 12 populate with multi-model numbers
within minutes instead of waiting on Lambda's full cycle.

Order: gbt-realistic, knn-realistic, knn-oracle, knn_semi-realistic,
knn_semi-oracle. Skips models whose .ckpt.json already exists
(idempotent restart). Each is a subprocess of training/trainer/run.py
so XGBoost/numpy/sklearn don't fight each other for cores.

Caller is expected to start gbt-oracle separately (it's the longest
single training and we kicked it off before invoking this script).
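
The idempotent-restart rule can be sketched as a filter over the job list. The script itself is shell; this Python version is only an illustration, and both `pending_jobs` and the `<job>.ckpt.json` file naming are assumptions (the message says only that a job is skipped when its .ckpt.json exists). Launching each remaining job as a subprocess of training/trainer/run.py is omitted because its arguments are not shown here.

```python
import tempfile
from pathlib import Path

# Job order taken from the message above.
JOBS = ["gbt-realistic", "knn-realistic", "knn-oracle",
        "knn_semi-realistic", "knn_semi-oracle"]


def pending_jobs(artifacts_dir: Path, jobs=JOBS) -> list[str]:
    """Jobs still to train: those without an existing checkpoint marker."""
    return [job for job in jobs
            if not (artifacts_dir / f"{job}.ckpt.json").exists()]


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    (artifacts / "gbt-realistic.ckpt.json").write_text("{}")
    remaining = pending_jobs(artifacts)   # gbt-realistic is skipped
```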

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:12:34 -05:00
Max
05bccac29f producers: phase-aware attack envelopes + tickable KNN metric/perf
profiles.py — non-shortcut fit:

  Old: pick one accepted episode per profile, emit its raw
       fraction-of-duration curve. Confounded by single-episode noise,
       phase-budget timing variance, and the cumulative-counter
       startup-spike artifact.

  New: aggregate up to N=100 accepted episodes per profile, slice each
       by labels.jsonl phase events, resample EACH PHASE to a fixed
       budget so the median across episodes captures the canonical
       per-phase shape rather than smearing peaks across the timeline.
       Save median + p25/p75 band to data/processed/attack_profiles_v1.parquet.

  Per-phase point budget (sums to 80):
       clean_lead 10, armed 5, infecting 10, infected_running 40,
       clean_tail 15. dormant (when present) folded into infected_running.
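
  A pure-Python sketch of the per-phase resample-then-median idea. It assumes every episode has a non-empty series for every phase, uses linear interpolation as the resampler (the message does not say which one profiles.py uses), and omits the p25/p75 band. All function names and the toy episodes are hypothetical.

```python
from statistics import median

# Per-phase point budget from the message (sums to 80).
BUDGET = {"clean_lead": 10, "armed": 5, "infecting": 10,
          "infected_running": 40, "clean_tail": 15}


def resample(values: list[float], n: int) -> list[float]:
    """Linearly interpolate values onto n evenly spaced points."""
    if len(values) == 1:
        return [values[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1) if n > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        out.append(values[lo] + (values[hi] - values[lo]) * (pos - lo))
    return out


def canonical_curve(episodes: list[dict[str, list[float]]]) -> list[float]:
    """Resample each phase to its budget, then take the per-point median."""
    curve = []
    for phase, n in BUDGET.items():
        resampled = [resample(ep[phase], n) for ep in episodes]
        for i in range(n):
            curve.append(median(r[i] for r in resampled))
    return curve


def make_episode(level: float, scale: int) -> dict[str, list[float]]:
    """Toy episode: flat baseline, plateau of `level` while infected_running."""
    return {phase: [level if phase == "infected_running" else 0.0] * (n * scale)
            for phase, n in BUDGET.items()}


# Three toy episodes of different durations and plateau heights.
episodes = [make_episode(90.0, 1), make_episode(100.0, 2), make_episode(110.0, 3)]
curve = canonical_curve(episodes)
```

  Because each phase is aligned before the median is taken, episodes of different lengths contribute to the same plateau instead of smearing it along the timeline.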

  Channel swap: io-walk uses proc.cpu_sys_jiffies, NOT
  proc.io_write_bytes. Host /proc on QEMU doesn't see virtio-blk
  writes via io.write_bytes (writes go through KVM's I/O path, not
  write() syscalls); cpu_sys_jiffies tracks kernel time which spikes
  during heavy I/O scheduling.

  Concrete result: cpu-saturate now shows the proper plateau-during-
  infected_running with peak at 100 j/s (was 30 j/s spike at idx 0
  then mostly zero); low-and-slow shows its distinctive low-amplitude
  profile (peak 21 vs cpu-saturate's 100); io-walk shows the
  rapid-rise-then-decay shape consistent with dd finishing mid-phase.

knn.py — sticky model_metric / model_perf:

  Stream subcommand gains --also-metric / --also-perf-latency-us
  flags. When set, each cycle publishes a model_metric event
  (tagged model=knn) for scene-8 (model bars) and a model_perf
  event for scene-12 (accuracy vs inference cost). Republishing on
  the cycle keeps reconnecting browsers populated without depending
  on the dashboard's not-yet-built sticky-event cache.

  Measured KNN inference latency on the 150k-trained classifier:
      single-window predict: 61.5 ms (sklearn brute-force at 230 D)
      per-window in batch=64: 3.4 ms (the production-realistic number)

  Streamer published: model_metric{knn, 0.762} +
                      model_perf{knn, latency_us=3410, accuracy=0.762}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:08:03 -05:00
Max Gorog
3783fabe86 live scene: per-host swim lanes + latest-detection callout
New scene 13 (between perf and references) for fleet-wide live
predictions. Each host gets a row of recent prediction cells
(capped at 60), painted by predicted phase; mismatch with ground
truth shows a hatched overlay. A callout below the lanes holds
the most recent detection with model, profile, confidence, and
latency.

Producer contract is the new LiveDetection dataclass in events.py.
The dashboard side is producer-agnostic — the inference loop can
run locally or offload to A100 (or any GPU/host); just POST events
back. No rate-limiting needed; the swim-lane DOM does the capping.

Demo synthesizes 5 hosts walking through phases at ~92% accuracy
so the scene reads as live the moment the deck loads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:03:32 -05:00
Max
9d56bcc923 docs: request to dashboard side — persist KNN embeddings on refresh
Producer-side knn fit is saved at data/processed/knn_v1.parquet
(150k rows, 3.4 MB). Live streamer publishes 2000-point cycles every
~2 s, but per PRODUCERS.md §reconnect-gotcha live events aren't
replayed; refresh-to-data is currently bounded by cycle time.

Three options laid out for the dashboard chat to pick:
  A. Sticky cache (per-event-type ring buffer in the broadcaster)
  B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
  C. Caddy fileserver + JS fetch on load

Whichever option lands, the producer side will adapt (e.g., dump a
JSON sidecar if Option C is picked). Path ownership preserved —
dashboard owns dashboard/, producer owns producers/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:54:38 -05:00
Max
2aa7b865fb training/models: knn_semi — semi-supervised self-training KNN
Registered as `knn_semi`. Answers the research question:

  *If we had ground-truth labels for only a fraction of training
   episodes, could we use the structure of the unlabeled rest to
   recover most of supervised KNN's accuracy?*

Pipeline (Yarowsky-style self-training):

  1. Split train slice deterministically into labeled (label_frac=0.2
     default) and unlabeled (1 - label_frac) by row-index hash.
  2. Fit a "labeler" KNN on the labeled fraction.
  3. Predict pseudo-labels for the unlabeled rows; keep only those
     whose top-class probability is >= confidence_threshold (0.6).
  4. Fit the final KNN on (labeled rows + confident pseudo-labels).
     Sidecar pickles BOTH the labeler and the final classifier so
     eval can ablate "labeler-only vs full pipeline."
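
The four steps can be sketched in pure Python with a toy brute-force KNN. This is an illustration of the self-training flow only: the registered model uses scikit-learn, while the SHA-256 row hash, the function names, and the toy one-dimensional data here are assumptions. Pickling of the labeler and final classifier is omitted.

```python
import hashlib
from collections import Counter


def is_labeled(row_index: int, label_frac: float = 0.2) -> bool:
    """Deterministic split: hash the row index into [0, 1)."""
    digest = hashlib.sha256(str(row_index).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < label_frac


def knn_proba(train, x, k=5):
    """Return (top class, vote share) among the k nearest training rows."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    label, votes = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, votes / len(nearest)


def self_train(rows, label_frac=0.2, confidence_threshold=0.6, k=5):
    """rows: list of (features, label); labels of unlabeled rows are ignored."""
    labeled = [r for i, r in enumerate(rows) if is_labeled(i, label_frac)]
    unlabeled = [r[0] for i, r in enumerate(rows) if not is_labeled(i, label_frac)]
    pseudo = []
    for x in unlabeled:
        label, conf = knn_proba(labeled, x, k)   # labeler predicts
        if conf >= confidence_threshold:
            pseudo.append((x, label))            # keep confident pseudo-labels
    return labeled, labeled + pseudo             # labeler set, final training set


# Toy data: two well-separated one-dimensional clusters.
rows = [((10.0 * (i % 2) + 0.1 * (i % 7),), i % 2) for i in range(400)]
labeled, final = self_train(rows)
```

The final KNN is then fit on `final`; keeping `labeled` around is what allows the labeler-only versus full-pipeline ablation.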

Smoke run (567-episode subset, oracle mode, label_frac=0.2):

                       val_macro_f1   test_macro_f1
  knn       (100% labels)   0.737        0.133
  knn_semi  (20% labels)    0.654        0.173

Lower val (less data) but HIGHER cross-device test — pseudo-labeling
acts as a regularizer that prevents overfitting to elliott-thinkpad's
specific neighborhood structure. Honest research finding worth a slide
in the writeup.

Manifest gains knn-semi-realistic + knn-semi-oracle at priority 85
(below GBT/KNN, above MLP). Storage cost = augmented set × n_features
× 4 bytes; same .knn.pkl sidecar format as plain KNN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:51:30 -05:00
Max
e46906b68c training/producers/knn: supervised LDA / UMAP projector + batched publish
Two changes that make scene-11 actually look like a clustering scene:

1. Supervised projection (--projector lda | umap | pca)
   - PCA was variance-greedy and oblivious to phase labels — clumped
     classes together because the dominant variance directions weren't
     class-discriminative.
   - LDA (default): Fisher Linear Discriminant. Linear, fast (~seconds),
     reproducible. On 150k windows: between-class variance 0.462 / 0.331
     / 0.167 across the three axes (96% of class-discriminative info
     in the first 3 dims).
   - UMAP (--projector umap): supervised nonlinear manifold embedding;
     tighter visual clusters at the cost of ~10 minutes for 150k on a
     Pi-class CPU. Reproducible via random_state. Subsamples to 20k for
     fit then transforms remaining points.
   - PCA still available for reference / debugging.

2. Batched concurrent publish (--burst-size N)
   - Sequential publish was ~6.5 ms/event over loopback HTTP → 13 s
     per 2000-point cycle.
   - asyncio.gather with burst_size=50 turns each batch into ~5 ms,
     so the same cycle is ~0.5 s. Browsers see the scatter populate
     in well under a second instead of waiting through a 13 s cycle
     per refresh.
   - Default burst_size=50 is conservative — the dashboard's WebSocket
     fan-out can take more pressure but 50 leaves headroom.
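
The burst pattern can be sketched with asyncio alone. `publish` here is a stand-in coroutine, not the producer's HTTP call, and `publish_in_bursts` is a hypothetical name; only `asyncio.gather` and the burst size of 50 come from this message.

```python
import asyncio


async def publish(event, sent):
    """Stand-in for one HTTP POST to the dashboard."""
    await asyncio.sleep(0)   # yield to the event loop, as real I/O would
    sent.append(event)


async def publish_in_bursts(events, burst_size=50):
    """Send events in concurrent bursts of burst_size via asyncio.gather."""
    sent = []
    bursts = 0
    for start in range(0, len(events), burst_size):
        batch = events[start:start + burst_size]
        await asyncio.gather(*(publish(e, sent) for e in batch))
        bursts += 1
    return sent, bursts


sent, bursts = asyncio.run(publish_in_bursts(list(range(2000))))
```

Awaiting each burst before starting the next caps the number of in-flight requests at `burst_size`, which is the headroom the default of 50 is meant to preserve.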

Saved fit format unchanged (data/processed/knn_v1.parquet); the
streamer's --load-fit reads the same parquet regardless of which
projector produced it. The LDA / UMAP choice is captured in the
producer's log + saved parquet metadata, not in the file shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:45:16 -05:00
Max Gorog
2abc55a59b knn scatter: auto-fit projection to running data spread
Project around mean ± k·σ instead of the raw [0,1]³ producer-unit
cube. PCA-3 outputs are Gaussian-ish so even after the producer's
min/max rescale, the bulk of points clusters near the centroid;
without auto-fit the scatter looks dead-centre and tiny.

Implementation: incremental Welford-ish stats (running sum / sum²)
per axis, recomputed lazily on the first frame after new data
arrives. project() centers and σ-scales each point to ~[-0.5, 0.5];
outliers clamp to ±0.7 so they're visible just outside the cube.
The bounding cube now traces mean ± k·σ instead of [0,1]³, which is
also the natural visual unit for the "data spread" the user reads
off the screen.

resetStats() runs on demo toggle and is implicit when points are
cleared. SPREAD_K=2.5 puts ~99% of normally-distributed data inside
the cube's bounds on each axis; MIN_STD=0.02 keeps degenerate
(all-equal) data from collapsing the divisor to zero.
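
A Python sketch of the per-axis running statistics and the projection, assuming the mapping sends mean ± SPREAD_K·σ to ±0.5. The scatter is dashboard JavaScript; `AxisStats` and this `project` signature are hypothetical. Note that a running sum / sum-of-squares is simpler than Welford's update but less numerically robust for large offsets.

```python
import math

SPREAD_K = 2.5
MIN_STD = 0.02


class AxisStats:
    """Running sum and sum of squares for one axis."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, v: float) -> None:
        self.n += 1
        self.total += v
        self.total_sq += v * v

    def mean_std(self) -> tuple[float, float]:
        mean = self.total / self.n
        var = max(0.0, self.total_sq / self.n - mean * mean)
        return mean, max(math.sqrt(var), MIN_STD)   # floor guards the divisor


def project(v: float, stats: AxisStats) -> float:
    """Centre and scale so mean ± SPREAD_K·σ maps to ±0.5; clamp to ±0.7."""
    mean, std = stats.mean_std()
    scaled = (v - mean) / (2 * SPREAD_K * std)
    return max(-0.7, min(0.7, scaled))


stats = AxisStats()
for value in (0.4, 0.5, 0.6):
    stats.add(value)
centre = project(0.5, stats)
outlier = project(100.0, stats)
```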

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:33:19 -05:00
Max
aa6187042b .gitignore: exclude data/processed/knn_*.parquet
KNN fit output (PCA-3 + KMeans + KNN-classifier predictions per
window) is a derived artifact regenerable from features_window_v1.
Like features_window itself it stays out of git; the streamer
reads it from disk on the producing host.
2026-05-08 13:20:17 -05:00
Max Gorog
f537ab8686 models scene: paint the knn bar (CSS color + demo entry)
The model-bar widget rendered .model-fill.knn with no gradient when
a model_metric{model:"knn"} arrived, leaving an empty track. Add a
green gradient and include knn in the demo-mode set so the row is
visible without waiting on the producer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:16:38 -05:00
Max
ba5ff70c14 training/producers/knn: add stream subcommand for disk-loaded loop
The fit pipeline (PCA-3 + KMeans + KNN classifier) can be expensive
to recompute every time a producer starts. `produce --fit-out` already
dumps the per-window (x, y, z, phase_int, predicted_int, cluster) to a
parquet; this commit adds a `stream` subcommand that loads that
parquet and publishes Embedding events on a loop.

Why a separate streamer:
  - The dashboard's live event stream is not replayed on browser
    reconnect (PRODUCERS.md §reconnect-gotcha). A browser that
    connects 30 s after the last cycle of the producer sees an empty
    scatter unless we re-publish.
  - The fit is deterministic given (features, seed) — no need to
    repeat it just to re-publish points. The streamer is small and
    stateless; it can run as a long-lived service.

Usage:
  python -m training.producers.knn produce \
      --window data/processed/features_window_v1.parquet \
      --schema data/processed/feature_schema_v1.json \
      --fit-out data/processed/knn_v1.parquet \
      --no-publish

  python -m training.producers.knn stream \
      --load-fit data/processed/knn_v1.parquet \
      --loop --max-points 2000

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:13:09 -05:00
Max Gorog
97eb34f7f6 baseline prose: reflect the dataset-derived phase mix
The widget no longer rolls the last 5 minutes; it aggregates
time-weighted phase durations across a sampled slice of the
on-disk dataset. The prose now matches the bar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:07:36 -05:00
Max
2187a5d752 training/models: KNN as a registered supervised model
Non-parametric baseline alongside GBT/MLP/CNN/GRU/LSTM/Transformer.
Same BaseModel + schema-hashed checkpoint contract; sidecar is a
pickled sklearn KNeighborsClassifier (.knn.pkl) handled by the
existing checkpoint machinery alongside .xgb.json / .pt.

KNN's storage cost = n_train_rows × n_kept_features × 4 bytes.
At 660k windows × 145 kept features (realistic mode) that is a
~380 MB sidecar; at 230 features (oracle), ~600 MB. Heavy, but it
ships through the same artifact-upload path.
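
The arithmetic above, worked through in Python. The function name is
illustrative only, and the estimate ignores any pickle overhead.

```python
def knn_sidecar_bytes(n_train_rows, n_kept_features, bytes_per_value=4):
    """Raw size of the memorized training matrix at 4 bytes per value."""
    return n_train_rows * n_kept_features * bytes_per_value


for mode, n_features in (("realistic", 145), ("oracle", 230)):
    mb = knn_sidecar_bytes(660_000, n_features) / 1e6
    print(f"{mode}: {mb:.0f} MB")
```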

trainer/run.py learns a third fit branch:
  - GBT — XGBoost early stopping on val mlogloss
  - KNN — fit() memorizes; "training time" is val/test predict cost
  - NN  — train_nn loop (the rest)

Manifest gains knn-realistic + knn-oracle at priority 95 (just
below GBT). KNN's k=10 default lives in the model class — overriding
via hyper.k requires adding --k to run.py first to avoid the
unknown-arg exit-2 issue.

Smoke verified on the 567-episode subset:
  knn   oracle    val=0.7365  test=0.1333  (held-out k-gamingcom)

That val/test gap (0.74 → 0.13) is the cross-device generalization
story: KNN memorizes elliott-thinkpad's local feature space and
falls apart on the other host. Honest baseline for the comparison
report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:06:56 -05:00
Max Gorog
51f2437b71 baseline: phase mix from sampled dataset, not 5-min window
The widget was waiting on live `phase` events that don't flow when no
orchestrator is running, so it sat empty. Replace the rolling
5-minute window with a periodic feeder that samples 500 random
episode tarballs from /var/lib/cis490/episodes, extracts each
labels.jsonl, and aggregates phase durations using consecutive
t_mono_ns deltas. Result lands in broadcaster.state["phase_mix"]
(survives snapshot cycles via dict.update) and re-broadcasts every
~10 min.
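
A sketch of the duration aggregation, under stated assumptions: each
labels.jsonl record is taken to carry a "phase" field next to
t_mono_ns (only t_mono_ns is named in this message), and the gap
between two consecutive records is credited to the earlier record's
phase. The real feeder may attribute time differently.

```python
import json
from collections import defaultdict


def phase_durations_ns(labels_lines):
    """Sum nanoseconds per phase from the lines of one labels.jsonl file."""
    records = [json.loads(line) for line in labels_lines if line.strip()]
    records.sort(key=lambda r: r["t_mono_ns"])
    totals = defaultdict(int)
    # Each interval runs from one record to the next; the final record
    # has no successor, so it contributes no duration.
    for cur, nxt in zip(records, records[1:]):
        totals[cur["phase"]] += nxt["t_mono_ns"] - cur["t_mono_ns"]
    return dict(totals)
```

Summing these dictionaries across the sampled episodes gives the
time-weighted mix.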

Frontend reads phase_mix from snapshot on connect and from live
phase_mix events on refresh; the bar uses time-weighted proportions
when available (falls back to label counts), and only sums canonical
phases for the denominator so non-displayed `failed` records don't
shrink the visible bars. Eyebrow and sub-line update with live
sample/population/label counts.
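
The proportion rule can be sketched as follows (in Python for
consistency with the training code; the frontend is JavaScript). The
phase names in CANONICAL_PHASES are placeholders, not the project's
real list.

```python
CANONICAL_PHASES = ("idle", "recon", "exploit")  # hypothetical names


def bar_proportions(phase_mix_ns, label_counts):
    """Prefer time-weighted durations; fall back to raw label counts."""
    source = phase_mix_ns if phase_mix_ns else label_counts
    # Only canonical phases enter the denominator, so records such as
    # "failed" cannot shrink the visible bars.
    total = sum(source.get(p, 0) for p in CANONICAL_PHASES)
    if total == 0:
        return {p: 0.0 for p in CANONICAL_PHASES}
    return {p: source.get(p, 0) / total for p in CANONICAL_PHASES}
```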

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:04:36 -05:00
Max
ac9b5b6f07 training/producers: knn producer for scene-11 + ModelMetric{knn}
KNN-driven embedding events for the dashboard's KNN scatter scene
(scene 11). One forward pass populates all three of the scatter's
mode-toggle fields:

  x, y, z    — PCA-3 projection of the standardized window features
  phase      — ground-truth phase from labels.jsonl
  predicted  — KNN classifier's prediction (k=10, distance-weighted)
  cluster    — MiniBatchKMeans cluster id (k=8 default)

Two subcommands:

  python -m training.producers.knn produce  ...  emit Embedding events
  python -m training.producers.knn metric    ...  publish ModelMetric{knn}
                                                  on a tick (re-publish
                                                  for reconnect-warmth)

KNN classifier uses the held-out-by-host split aligned with the
supervised pipeline (train ∪ val on elliott-thinkpad, predict on
k-gamingcom) so the predictions reflect cross-device generalization,
not in-distribution self-prediction.
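
For illustration, a pure-Python sketch of the held-out-by-host idea
with distance-weighted voting. It is not the producer's code (which,
per the related commits, uses a scikit-learn classifier); the row
layout with "host", "features" and "phase" keys is an assumption.

```python
import math
from collections import defaultdict


def split_by_host(rows, train_host, test_host):
    """Fit on one host's rows, predict on the other's."""
    train = [r for r in rows if r["host"] == train_host]
    test = [r for r in rows if r["host"] == test_host]
    return train, test


def knn_predict(train, query, k=10, eps=1e-12):
    """Distance-weighted vote among the k nearest training rows."""
    nearest = sorted(train, key=lambda r: math.dist(r["features"], query))[:k]
    votes = defaultdict(float)
    for row in nearest:
        # Closer neighbours count more: weight = 1 / distance.
        votes[row["phase"]] += 1.0 / (math.dist(row["features"], query) + eps)
    return max(votes, key=votes.get)
```

Every prediction on the test host is made from neighbours recorded on
the training host only, which is what makes the score a cross-device
number.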

Smoke-verified end-to-end against the live dashboard (3 clients):
800 embedding events delivered in 12 s; ModelMetric{knn} with
test_macro_f1 = 0.4297 on the 567-episode smoke subset, sitting
between the trained GBT (0.557) and the under-trained NN models
(0.09–0.18) — sensible for a non-parametric baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:03:19 -05:00
Max Gorog
12ac409ab2 knn scene: drag-to-rotate 3-D scatter + KNN/cluster color modes
Replace the SVG 2-D scatter with a canvas-based 3-D one. Three color
modes (phase / predicted / cluster) with a toggle; drag the surface
to rotate; reset button. Bounding cube draws faintly so the rotation
reads as 3-D rather than re-shuffled 2-D.

Embedding event gains optional z / predicted / cluster fields. 2-D
producers still work (z defaults to 0.5, no other behavior changes).
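
A sketch of what the backward-compatible event shape could look like.
The real definition lives in the dashboard's events module and is not
shown here, so the field types and the parsing helper below are
assumptions; only the field names and the z default of 0.5 come from
this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Embedding:
    x: float
    y: float
    z: float = 0.5                    # 2-D producers omit z
    phase: Optional[str] = None       # type assumed
    predicted: Optional[str] = None   # new, optional
    cluster: Optional[int] = None     # new, optional


def embedding_from_payload(payload):
    """Build an Embedding from a dict, tolerating 2-D payloads."""
    return Embedding(
        x=float(payload["x"]),
        y=float(payload["y"]),
        z=float(payload.get("z", 0.5)),
        phase=payload.get("phase"),
        predicted=payload.get("predicted"),
        cluster=payload.get("cluster"),
    )
```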

CSS adds .scatter3d-* rules; --theme-h-num exposed for cluster-color
hue arithmetic. Synthetic demo data is now 3-D Gaussian clusters with
~7% mislabeled "predictions" so the predicted-mode view differs from
ground truth at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:55:31 -05:00
Max Gorog
9e38f78379 training/dashboard(references): description sidebar + better space use
Two changes per the user's feedback that the slide had unused
horizontal space and needed per-PDF context.

Layout
- The reference scene is now a 2-column grid inside the
  metric-stack: PDF iframe at ~1.7fr on the left, description
  panel at ~0.55fr on the right (min 280px). On narrow viewports
  (<1100px) it falls back to a vertical stack with the
  description capped to 240px.
- Added #zoom=page-width to the iframe URL so the PDF's page
  fits its column width instead of leaving margins beside an
  8.5x11 page rendered in a wider iframe.
- Hide the prose card on the references scene — the description
  panel inside the stack covers what the prose was saying, and
  freeing the right edge gives the description proper room.

Description content
- Backend reads <stem>.md sidecar files alongside each PDF and
  returns the contents in the /api/references payload.
- Frontend renders them with a tiny built-in markdown subset
  (headings, bold/italic, lists, inline code, paragraphs) — no
  third-party renderer dependency.
- Initial draft sidecar .md files committed for the four PDFs
  currently in references/. Each describes how the paper informs
  a specific scene of the deck (which model row, which eval
  protocol, which channel selection). Edit them in place and the
  panel updates on the next reload.
2026-05-08 12:40:32 -05:00
Max
69c563275a training: parallelize lambda bootstrap (2 jobs at a time on the A100)
At our model sizes (max ~250 K params, max batch 512), each training
process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with
two concurrent jobs. Bounded-concurrency rolling launcher cuts
sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest.

  PARALLEL=2 (default) — override via env var if running on a smaller GPU
  or testing the queue logic.

Per-job logs still land at logs/<model>_<mode>.log; failure reporting
is the same. Idempotency is unchanged: jobs whose checkpoints already
exist are still skipped.
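
A bounded-concurrency launcher can be sketched in Python as below.
This is not the bootstrap script itself; run_job and run_all are
hypothetical names, and only the PARALLEL variable and the per-job log
layout are taken from this message.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

PARALLEL = int(os.environ.get("PARALLEL", "2"))


def run_job(name, argv, log_dir):
    """Run one job, sending stdout and stderr to <log_dir>/<name>.log."""
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, f"{name}.log"), "w") as log:
        code = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT).returncode
    return name, code


def run_all(jobs, log_dir="logs", parallel=PARALLEL):
    """jobs is a list of (name, argv). Returns the names of failed jobs."""
    # The pool keeps at most `parallel` jobs running; a new one starts
    # as soon as a slot frees up, which gives the rolling behaviour.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda job: run_job(job[0], job[1], log_dir), jobs))
    return [name for name, code in results if code != 0]
```

Threads are sufficient here because each worker only waits on a child
process.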

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:37:03 -05:00
Max Gorog
bee40a6ae9 training/dashboard: references scene with PDF viewer + tab strip
New scene 13 (after perf, the last in the deck) renders a tabbed
PDF viewer. Each tab is one .pdf in /opt/cis490/references/; the
active tab swaps the iframe's src to /refs/<encoded-filename>.

Backend
- /api/references — lists pdfs in REFS_DIR, returning
  {"name": stem (newlines stripped), "path": "/refs/<urlencoded>"}.
- /refs static mount — serves the PDFs directly. check_dir=False
  so the dashboard still boots if the directory is missing.
- REFS_DIR resolves relative to the install root so it works on
  /opt/cis490 in production and any dev tree.
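
The listing side of the backend could look roughly like this. The
function name is hypothetical and the web-framework wiring is left
out; the payload shape follows the bullet above.

```python
from pathlib import Path
from urllib.parse import quote


def list_references(refs_dir):
    """Return [{"name": ..., "path": ...}] for each PDF in refs_dir."""
    root = Path(refs_dir)
    if not root.is_dir():
        # A missing directory yields an empty list instead of an error.
        return []
    items = []
    for pdf in sorted(root.glob("*.pdf")):
        name = pdf.stem.replace("\n", "").replace("\r", "")
        items.append({"name": name, "path": "/refs/" + quote(pdf.name)})
    return items
```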

Frontend
- Stage view uses metric-stack-wide for the broader card; the
  references scene also overrides .stage-view padding-right down
  to a small gutter so the iframe takes most of the screen
  horizontally — the prose card still sits on the right but the
  PDF area is roughly 70% wide on standard viewports.
- Tabs are styled like .db-tab (palette-aware pills) and stop
  propagation so they don't trigger the click-to-advance gesture.
- Iframe is lazy-loaded: src isn't set until the user actually
  scrolls into the references scene OR clicks a tab, so the
  browser doesn't fetch a big PDF the user may never view.
2026-05-08 12:34:52 -05:00
Max
308140c6ce training: lambda-cloud one-shot training integration
External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.

Three scripts + one ingest tool:

  scripts/build-lambda-bundle.sh
    Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
      - the repo (sans .git, sans data/, sans artifacts*)
      - data/processed/{validation_v1,features_window_v1}.parquet
      - data/processed/feature_schema_v1.json
      - data/processed/tensor_window_v1/   (npz shards)
      - bootstrap.sh (entrypoint)
      - training_manifest.toml (the canonical job list)
      - BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
    Verifies all four data inputs exist BEFORE compressing 5+ GB.

  scripts/run-on-lambda.sh ubuntu@<ip>
    rsync bundle up → ssh + run bootstrap → rsync artifacts +
    reports/eval back to artifacts-lambda/ + reports/lambda/.
    Resumable rsync; sha256-verified.

  scripts/lambda-bootstrap.sh   (runs ON the Lambda instance)
    Creates .venv with cu121 torch + xgboost + the [training] deps,
    iterates the manifest's job list in priority order (highest first),
    runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
    skips jobs whose .ckpt.json already exists (idempotent on re-run),
    writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
    stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.

  tools/ingest_lambda_artifacts.py
    Bundles each (ckpt.json + sidecar + train.json) trio into a
    .tar.zst, sha256, PUTs to the local trainer-receiver's
    /v1/model/{job_id}, marks the job complete. Maps (model, mode) →
    job_id by re-reading the canonical manifest. Handles the queue
    state churn (requeue if completed, claim if pending, fail-back
    on race losses).
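
The bootstrap's ordering and skip rule, sketched in Python under
assumptions: jobs are dicts with "model", "mode" and "priority" keys,
and the checkpoint is named <model>_<mode>.ckpt.json by analogy with
the log files. Only the .ckpt.json suffix and the highest-first order
are stated in this message.

```python
from pathlib import Path


def plan_jobs(jobs, artifacts_dir):
    """Return the jobs still to run, highest priority first."""
    root = Path(artifacts_dir)
    ordered = sorted(jobs, key=lambda j: j["priority"], reverse=True)
    todo = []
    for job in ordered:
        ckpt = root / f'{job["model"]}_{job["mode"]}.ckpt.json'  # assumed name
        if not ckpt.exists():
            # Jobs with an existing checkpoint are skipped, so a re-run
            # resumes where the previous one stopped.
            todo.append(job)
    return todo
```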

End-to-end smoke verified on the A100 instance just provisioned:
  - SSH from Pi via ed25519 keypair (cis490-trainer-pi)
  - GPU: A100-SXM4-40GB, driver 580.105.08
  - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
  - 464 GB ephemeral disk available

Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:04 -05:00
Max
697e36a315 training/producers: move out of dashboard/ per ownership boundary
Producers are event *sources* — the renderer is everything inside
training/dashboard/. Sibling layout makes the dependency direction
one-way (producers import from training.dashboard.events; dashboard
never reaches into producers).

  training/dashboard/producers/   →   training/producers/

Internal imports rewritten via sed; eval_/run.py and training/README.md
cross-references updated. CLI entry stays via `python -m training.producers.<sub>`
(replay / metrics / perf / profiles).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:06:56 -05:00