CIS490

Author	SHA1	Message	Date
Max Gorog	6230f18692	model bars: paint every architecture (+ neutral fallback) The bar widget had gradients for lstm / gru / rnn / bert / knn only — any other model name (cnn, mlp, transformer, gbt, knn_semi, transformer_ssl) rendered a track but no fill. Now: - Added explicit gradients for cnn, mlp, transformer, transformer_ssl, gbt, knn_semi (each visually distinct from the existing five). - Added a neutral grey-grey fallback on .model-fill itself, so any unanticipated model name still produces a visible bar instead of silently disappearing. The specific class rules override it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:20:21 -05:00
Max	c2a71de4b2	scene 9 bars: paint full zoo + 0–1 visible scale - multi_model_metrics: publish gbt / mlp / cnn / knn_semi / gru / lstm / bert (knn handled by knn streamer); read both _train.json and _eval.json with macro_f1.point fallback - dashboard.css: add palette gradients for the four non-canonical names so the bars render with a fill colour - dashboard.js: open the bar's visible scale to the full 0–1 range so honest-low cross-host F1s show as a bar instead of clamping to 0% - ship lambda-live-detection-loop.py + dashboard request docs (scenes 7/8/12, sticky cache, lambda-inference-demo) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:18:00 -05:00
Max Gorog	06bfcef3d6	demo button: include (d) in tooltip Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:07:47 -05:00
Max Gorog	cedf64c708	hotkeys: 'd' toggles demo mode Saves a click during live demos. Topbar tooltip updated to mention the binding. Hotkey is gated by the same input-focus check as 'c' / arrow keys, so typing 'd' in a search box won't fire it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:07:23 -05:00
Max Gorog	0bc2b57ccb	live demo: back to 2500 ms cadence Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:59:16 -05:00
Max Gorog	00d11740eb	live demo: drop elliott-lab from inference host list It contributed no training data, so the A100 wouldn't be running inference on its windows. Only hosts that actually produced data (elliott-thinkpad, k-gamingcom) should appear as the source of synthetic predictions in the live scene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:59:04 -05:00
Max Gorog	ac630997c3	live demo: bump cadence back up to ~1 event/sec 2500ms read too slow. 1000ms is the sweet spot — under the real ceiling of ~1.5/sec but still lively enough to feel like a working inference loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:52:37 -05:00
Max Gorog	ab21217261	live demo: slow A100 inference cadence to ~0.4 events/sec Was 280ms (~3.5 events/sec) — way too fast for real fleet inference. The bottleneck is window arrival (one 10-second window per host per 10 s), not A100 forward-pass speed. With ~3 hosts × 5 models that's ~1.5 events/sec real ceiling, so demo at 2500ms (~0.4/sec) reads honest without claiming impossible throughput. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:52:04 -05:00
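The cadence arithmetic in this message can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the host, model and window figures are the approximate ones quoted in the commit.

```python
def events_per_second(interval_ms: float) -> float:
    """Convert a timer interval in milliseconds to an event rate."""
    return 1000.0 / interval_ms


def fleet_ceiling(hosts: int, models: int, window_seconds: float) -> float:
    """Upper bound on predictions/sec when each host yields one window per
    window_seconds and every model scores every window."""
    return hosts * models / window_seconds


# Figures quoted in the commit: ~3 hosts, 5 models, one 10-second window per host.
ceiling = fleet_ceiling(hosts=3, models=5, window_seconds=10.0)

for interval_ms in (280, 1000, 2500):
    rate = events_per_second(interval_ms)
    verdict = "within" if rate <= ceiling else "above"
    print(f"{interval_ms} ms -> {rate:.2f} events/sec ({verdict} the {ceiling:.1f}/sec ceiling)")
```

The loop covers the three intervals that appear in this part of the log (280 ms, 1000 ms, 2500 ms), so each cadence can be compared against the same ceiling.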
Max Gorog	3b96537b3e	live scene HTML: stats line + prose match per-model framing Stats line now reads "A100 inference · live · N models · X infer/sec · last window: <host> · hit-rate: …" instead of "live detections · N hosts · model: …". Prose rewritten to describe lanes as side-by- side model-agreement check rather than per-host activity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:44:53 -05:00
Max Gorog	5533043b02	live scene: per-model lanes (A100 inference), not per-host The scene's framing was wrong. It's about the A100 doing live model predictions, not about per-host telemetry collection. Lanes now key on `model` instead of `host_id`; the callout leads with model name + A100 latency, demoting host/profile to secondary metadata. Stats line reads "N models · X infer/sec · last window from <host>" instead of "N hosts · model: X". Demo synthesis updated to match: 5 trained models cycle through predictions on rotating fleet windows, each model with its own accuracy + latency profile (KNN fast/loose, BERT slow/precise) so the lanes visually differ. Article prose reframes the scene as side-by-side model agreement, the natural read of per-model lanes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:44:23 -05:00
Max Gorog	804220d7f6	knn scatter: revert to real-data only (no demo handlers) The KNN producer works; KNN does not need a demo-mode fallback. Remove demo_start / demo_stop / cachedReal / demoActive scaffolding that I'd added speculatively. Embedding events render directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:39:45 -05:00
Max Gorog	ef6bc71009	knn scatter: exclusive demo (not additive) Same pattern as models / perf / live: cachedReal accumulates real embedding events at all times; demoActive flag gates which source renders. - demo on → only synthetic clusters - demo off → only real embeddings (replayed from cachedReal) Cache cap 5000 points to bound memory across long sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:36:31 -05:00
Max Gorog	b6e478c578	demo mode: exclusive (not additive) on models / perf / live Was: demo seeded on demo_start, then real producer events rendered on top of the synthetic bars/points/cells. Both sources visible simultaneously — visually confusing. Now: each widget tracks demoActive + a cachedReal store. - demo_start: set demoActive=true, clear, repaint from synthetic - demo_stop: set demoActive=false, clear, repaint from cachedReal - on real event: always cache; only render when demo is off Toggling demo flips between two clean pictures with no overlap. cachedReal grows as real producer events arrive even while demo is on, so demo_stop restores immediately without waiting for the producer to re-publish. Applied to: models bars, perf scatter, live detections. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:36:12 -05:00
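The exclusive-demo rule described above is a small state machine. The sketch below restates it in Python for illustration only; the real widgets are dashboard JavaScript, and the class and attribute names here are hypothetical.

```python
class DemoGatedWidget:
    """Renders either synthetic or real data, never both at once."""

    def __init__(self, synthetic):
        self.synthetic = list(synthetic)  # canned demo values
        self.cached_real = []             # every real event seen so far
        self.demo_active = False
        self.rendered = []                # what is currently on screen

    def demo_start(self):
        # Clear the view and repaint from synthetic data only.
        self.demo_active = True
        self.rendered = list(self.synthetic)

    def demo_stop(self):
        # Clear the view and repaint from the real-event cache.
        self.demo_active = False
        self.rendered = list(self.cached_real)

    def on_real_event(self, event):
        # Always cache; only render while demo mode is off.
        self.cached_real.append(event)
        if not self.demo_active:
            self.rendered.append(event)
```

Because real events keep accumulating in `cached_real` while demo mode is on, `demo_stop` can restore the real picture immediately instead of waiting for the producer to publish again.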
Max Gorog	a04ea60aef	demo mode: never overwrite real data on perf / models / live Same hasReal* gating I already used for phase_mix, applied to: - models bars (model_metric) - perf scatter (model_perf) - live detections (live_detection) Each widget tracks whether a real producer event has arrived; demo only seeds when nothing real has been seen yet, and demo_stop preserves real state instead of wiping it. demoTick is now a no-op — periodic model_metric jitter was overwriting real values mid-stream. Per-widget one-shot seeding on demo_start (gated by hasReal*) is enough. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:32:55 -05:00
Max Gorog	f429bd4223	perf scatter: log-x + alternating-position labels (kill overlap) Two issues with the accuracy-vs-latency scatter: 1. Linear x crammed RNN/GRU/LSTM into ~25 px of axis (380/520/700 μs) while BERT alone took the right 80 % (3200 μs). 2. Labels placed at fixed +12 right of each point overlapped both neighbouring points and other labels in the recurrent cluster. Fixes: - X-axis switched to log10 with bounds 10μs–10ms; tick labels and marks added at 10μs / 100μs / 1ms / 10ms so the audience can read the scale. - Y-axis bounds widened to [0.5, 1.0] (was [0.7, 1.0]) so KNN's ~0.43 cross-host F1 falls within the visible plot area instead of off-bottom; ticks added at 0.6 / 0.8 / 1.0. - Anti-overlap label placement: sort points by x, alternate above (-12) / below (+18) the circle. Adjacent labels can no longer share both x and y bands. repaintLabels() re-runs on each model_perf event so late arrivals slot into the staircase. Y-axis title also updated: "held-out accuracy" → "held-out macro-F1" to match the actual metric the producer reports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:31:42 -05:00
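The two fixes in this commit, the log10 x-axis and the alternating label offsets, can be sketched in Python as follows. The actual change lives in the dashboard's JavaScript; the function names are hypothetical, and mapping 380/520/700 μs to RNN/GRU/LSTM in that order is an assumption read off the message.

```python
import math

X_MIN_US, X_MAX_US = 10.0, 10_000.0  # axis bounds from the commit: 10 us to 10 ms
LABEL_ABOVE, LABEL_BELOW = -12, 18   # pixel offsets quoted in the commit


def log_x_fraction(latency_us: float) -> float:
    """Position along the x-axis as a 0..1 fraction on a log10 scale."""
    clamped = min(max(latency_us, X_MIN_US), X_MAX_US)
    span = math.log10(X_MAX_US) - math.log10(X_MIN_US)
    return (math.log10(clamped) - math.log10(X_MIN_US)) / span


def label_offsets(latencies_us: dict) -> dict:
    """Sort points by x, then alternate labels above / below the circle."""
    ordered = sorted(latencies_us, key=latencies_us.get)
    return {
        name: (LABEL_ABOVE if rank % 2 == 0 else LABEL_BELOW)
        for rank, name in enumerate(ordered)
    }


points = {"rnn": 380, "gru": 520, "lstm": 700, "bert": 3200}
for name, offset in label_offsets(points).items():
    print(f"{name}: x={log_x_fraction(points[name]):.2f}, label dy={offset:+d}")
```

On the log scale each decade takes one third of the axis, so the three recurrent models no longer share a thin strip at the left, and neighbours in x order always get opposite label offsets.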
Max Gorog	9e7d9999a3	demo mode: omit attack envelopes too Same scope-narrowing as collect / hosts / db / knn — attack profiles are real data from the orchestrator's catalog, so the deck should display whatever the producer publishes via attack_profile events and not overwrite that with synthetic curves on demo_start. Removed both demo_start (synthesize) and demo_stop (clearAll) handlers; the syntheticProfiles helper is left in place for reference but is no longer wired to anything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:29:38 -05:00
Max Gorog	7e7fb52d32	demo mode: stop synthesizing episode + phase events Per the data-ownership scope: collect (episodes-ingested counter), hosts (per-host bars), and db (database explorer) all work fine in or out of demo mode — they read real values from the server's snapshot. Demo mode shouldn't be injecting fake `episode` records into them. Removed both dispatches from demoTick: - `episode` (was 70% per tick) — no longer clobbers collect/hosts/db - `phase` (was 50% per tick) — dead code anyway; baseline now consumes the dataset-derived `phase_mix` event, not raw `phase` demoTick is now just the model_metric jitter (5% per tick) so the sequence-model bars don't sit frozen during a long demo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:29:13 -05:00
Max Gorog	af1f7fb56d	demo mode: backfill phase mix + knn metric (no clobber on real data) Two targeted fixes for the demo-toggle path; intentionally narrow so we don't override widgets that already work in both modes (KNN scatter, DB explorer). Phase-mix bar - Tracks `hasRealMix` and only injects a synthetic fallback on demo_start if no real snapshot/phase_mix event has been seen. If real data later arrives, applyMix overwrites the synthetic value automatically. - Synthetic numbers mirror a real production run (500/78705 episodes, ~4.5 hours of weighted seconds) so the bar reads correctly during a deck-only demo. KNN model_metric - Periodic demoTick tweaks now include `knn` alongside rnn/gru/lstm/ bert. Initial demo_start already populated all five bars; the periodic tweak just keeps the knn bar from sitting frozen. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:27:25 -05:00
Max	4b2863ea99	producers/multi_model_metrics + scripts/rsync-from-lambda Pi-safe replacement for the original metrics.py + perf.py producers which load every checkpoint into memory and score the test set on each cycle. That pattern crashed the Pi during this project (300 MB knn pickles × 6 variants + 226 MB test set in memory at peak ≈ OOM). The new producer: - reads reports/eval/<model>_<mode>_train.json files (already contain the test_macro_f1 each trainer wrote) - publishes one model_metric event per file - publishes one model_perf event per file with a hardcoded per-architecture latency estimate (gbt 250 µs, knn 3500, mlp 50, cnn 500, gru 1500, lstm 2000, transformer 800, transformer_ssl 1000). These are family-level order-of-magnitude figures; proper benchmarks need to run on the deployment hardware (which is the A100, not the Pi). - re-publishes on a tick (default 30 s) for refresh-resilience. - NO model loading. Pi-safe. scripts/rsync-from-lambda.sh — pulls Lambda's artifacts/ + reports/eval/ to the Pi every 30 s. As Lambda finishes each model and writes its train.json, the Pi sees the new file within a cycle and the publisher broadcasts the metric on its next tick. Live multi-model dashboard during training, with no Pi-side inference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:04:15 -05:00
Max Gorog	233390a40e	deck: reorder + correct eval framing to held-out-by-sample REORDER - collect (big-number ingest counter) moved from #7 to #2 — sits right after the title as the dataset-quantity hook - training-code moved from #15 to #14 — "how we trained" now appears before "what we got" (models accuracy bars) EVAL FRAMING CORRECTION The fleet hosts are uniform — every host runs every profile, just at different rates — so the actual split is held-out-by-sample (profile-stratified), NOT held-out-by-host. Both hosts contribute to train, val, AND test. The generalization claim is "unseen malware sample_name", not "unseen device". Fixed across: - evaluation-setup: split-recipe block, val↔test gap (was "cross-host gap"), prose - problem-statement: RQ wording, "generalize across hosts" → "generalize to sample_names" - research-questions: RQ2 ("from a host the training set never saw" → "sample_names the training set never saw"); literature-gap bullet flipped from "cross-host generalization" to "sample- stratified evaluation"; prose - solution-overview: pipeline diagram caption - theoretical-contributions: "cross-host as the eval axis" → "held-out-by-sample as the eval axis" - limitations: two-host-fleet card now states "both hosts contribute to train/val/test"; "KNN cross-host gap" → "KNN val ↔ test gap" - conclusion-future: bullet flipped to held-out-by-sample as primary axis Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:59:22 -05:00
Max Gorog	db9f013969	deck: 9 new scenes to meet CIS-490 assignment-guide rubric Five required + four optional slides, slotted into the existing flow without renumbering the visible deck UI: REQUIRED - problem-statement (after motivation): single-sentence problem, three numeric stat cards, explicit task-type justification (multi-class classification, why not regression/ranking) - research-questions (after problem-statement): two-column literature gap layout + RQ1/RQ2/RQ3 - solution-overview (after research-questions): inline-SVG block diagram of the pipeline (fleet hosts → receiver → episodes → windowing → model zoo → per-window phase → trust score → containment + reset) - evaluation-setup (between chunking and models): four blocks covering split recipe, primary metric, baselines compared, and what's reported alongside accuracy. Each block leads with the why, matching the assignment's "explain not only what will be measured but why" requirement. - conclusion-future (before references): two-column "what we showed" + unsupervised next steps (clustering / anomaly / SSL pretrain / embedding viz). Addresses Section 8 of the assignment guide. OPTIONAL - theoretical-contributions: window-centre labelling, schema-hashed checkpoints, cross-host as eval axis - practical-contributions: /proc-only deployment, producer-agnostic dashboard, labelled dataset on disk - design-principles: one-loop-many-models, typed events as contract, two-agent path ownership - limitations: two-host fleet, synthetic profiles, 10 Hz floor, KNN cross-host gap Plus references/links.md gains four real online references (PyTorch, XGBoost, scikit-learn, proc(5)) bringing the citation count from 8 to 12 — over the assignment's 10-source minimum. CSS additions cover the new layouts (.problem-claim, .problem-stats, .research-grid, .pipeline-svg + .pipeline-stage / .pipeline-arrow, .eval-blocks, .conclusion-grid). 
Limitations cards reuse the motivation-card pattern with an armed-phase amber marker for the "warning" feel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:32:50 -05:00
Max	4172ddb0c8	docs: request to dashboard side — cap + evict for the KNN scatter The scene-9 embedding handler appends to a `points` array without ever capping. The producer republishes its (stable, deterministic) point set on a cycle so reconnecting browsers eventually see the scatter; each cycle pushes the same N points again and the in-memory count grows without bound. Browser slows after ~10 min. Two complementary fixes proposed: A. FIFO cap (1-line change in the handler — fixes the leak today) B. embedding_batch event with replace=true (cleaner, pairs with the snapshot/sticky-cache request for refresh-time hydration) Producer side has already reduced cadence as a band-aid (200 pts every 30 s, was 600 every 5 s) — 18x slower accumulation but still unbounded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:30:45 -05:00
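Both proposed fixes amount to bounding the point store. A minimal Python sketch of the idea follows; the handler itself is dashboard JavaScript, and the class name, method names and the 5000-point cap are hypothetical choices made for illustration.

```python
from collections import deque

MAX_POINTS = 5000  # hypothetical cap; the request does not fix a number


class ScatterBuffer:
    """Point store for the scatter that cannot grow without bound."""

    def __init__(self, max_points: int = MAX_POINTS):
        # Fix A: a deque with maxlen evicts the oldest point on overflow (FIFO).
        self.points = deque(maxlen=max_points)

    def on_embedding(self, point):
        self.points.append(point)

    def on_embedding_batch(self, batch, replace: bool = False):
        # Fix B: a batch flagged replace=True swaps the whole set instead of
        # appending another copy of the same republished points.
        if replace:
            self.points.clear()
        self.points.extend(batch)
```

With fix A alone, a producer that republishes the same N points every cycle still fills the buffer with duplicates, but memory stays bounded; fix B removes the duplicates as well.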
Max	3413a7c405	scripts/lambda-bootstrap.sh: also fix eval invocation paths The eval suite at the end of the bootstrap was using ../artifacts and ../data/* paths because they were originally invoked from inside repo/. Now that we no longer cd into repo, drop the ../ prefix. Same root cause as the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:25:40 -05:00
Max	ed7e3db035	scripts/lambda-bootstrap.sh — drop the cd-into-repo / before launching trainer The previous version did `(cd repo && "${cmd[@]}")` to "cd into repo for module imports." But PYTHONPATH was already set to $PWD/repo at the top of the script — so the cd was redundant for imports AND broke relative paths: the trainer expects to find data/processed/validation_v1.parquet from $HOME/cis490, not from $HOME/cis490/repo/. Symptom: every training job failed immediately with FileNotFoundError: data/processed/validation_v1.parquet Drop the cd; PYTHONPATH already does the import work. Found while running on the A100 today; trainer relaunched manually in-place via a stand-in bootstrap2.sh; this commit makes the next bundle clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:25:40 -05:00
Max Gorog	997c399cf9	deck: virtualize to a 3-scene mount window (active ± 1) Previously every scene rendered at all times — paint, layout, and the per-scene widgets all ran in parallel. Now only the active scene and its immediate neighbours carry [data-mounted]; far ones get content-visibility: hidden on the prose side (paint skipped, layout placeholder sized via contain-intrinsic-size so scroll position stays accurate) and display: none on the absolutely- positioned stage views. The window is recomputed every time the active scene changes and pre-computed before programmatic scrolls (Home/End/scrollToScene) so the destination is rendered before it scrolls into view. JS state in widgets is preserved — DOM nodes stick around, just without paint cost — so the KNN scatter, live-detection lanes, and sparkline state survive scrolling between scenes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:19:46 -05:00
elliott	3b3bdab9df	Upload files to "references"	2026-05-08 15:07:58 -05:00
Max Gorog	644b9a48fb	motivation scene: why detection matters before how we do it New scene 2 (between intro and stack) framing the operational case for a per-host detector. Three consequence cards on the stage — network-level trust scoring, containment before pivot, fast post-attack reset — backed by a prose section that cites IEEE document 9881803 for the trust-aggregation argument. Sidecar md for the paper lands in references/ as a citation note; when the PDF is dropped in with a matching stem it'll show up in the references viewer automatically. Link added to links.md too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:49:45 -05:00
Max	c42bf033e5	training/fleet/manifest: accept knn + knn_semi in _ALLOWED_MODELS Validator's allowed-models frozenset was missing knn and knn_semi even though the manifest gained those jobs and the model registry registered the classes. Lambda bootstrap blocked at: TrainingManifestError: job 'knn-realistic': model 'knn' not in ['cnn', 'gbt', 'gru', 'lstm', 'mlp', 'transformer', 'transformer_ssl'] Now {gbt, knn, knn_semi, mlp, cnn, gru, lstm, transformer, transformer_ssl}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:46:33 -05:00
Max Gorog	4bf241f6ec	code cards: presenter-friendly comments on every block The four code snippets shown on stack and training-code scenes get inline comments explaining the why of each line, not just what. Aimed at the live audience: a presenter reads the comment as the narration; a reader scans them top-to-bottom for the design story. Covers: pyproject's three install profiles and what each library contributes; receiver's bearer auth and why constant-time compare matters; LSTM model's registry pattern, batch_first transpose, last-step classification head; trainer loop's class weights vs the imbalanced dataset, AMP scaler vs fp16 underflow, cosine + warmup schedule, macro-F1 vs accuracy on imbalanced classes, best-state restore vs last-epoch weights. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:17:31 -05:00
Max Gorog	da0e9ce83c	code cards: mirror the actual training stack and trainer loop The stack scene's pyproject snippet was missing the `training` group (torch, sklearn, xgboost, zstandard) — the libraries that do the actual model work. Updated to match the real pyproject.toml. The receiver snippet now ends at _bearer_check(...) instead of the import block alone — gives the slide a non-trivial line of code to read. The training-code scene replaces the toy "PhaseLSTM" hand-rolled loop with the real LSTM model class (registry-decorated _SeqBase subclass + _LSTMClassifier wrapping nn.LSTM with last-step classification head) and adds a second card showing the actual train_nn loop: AMP autocast/scaler, cosine LR with linear warmup, inverse-frequency class weights, gradient clipping, macro-F1 on val, early stop with best-state restore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:15:01 -05:00
Max	c1c8e98180	scripts/train-pi-cpu-models.sh — sequential Pi-side trainer chain Pi has 4 cores; only KNN and tree-based models are realistic to train here without GPU. While Lambda runs the full 16-job manifest in parallel (~1.7h), this chain trains the CPU-friendly subset on the Pi (~30 min) so scenes 8 & 12 populate with multi-model numbers within minutes instead of waiting on Lambda's full cycle. Order: gbt-realistic, knn-realistic, knn-oracle, knn_semi-realistic, knn_semi-oracle. Skips models whose .ckpt.json already exists (idempotent restart). Each is a subprocess of training/trainer/run.py so XGBoost/numpy/sklearn don't fight each other for cores. Caller is expected to start gbt-oracle separately (it's the longest single training and we kicked it off before invoking this script). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:12:34 -05:00
Max	05bccac29f	producers: phase-aware attack envelopes + tickable KNN metric/perf profiles.py — non-shortcut fit: Old: pick one accepted episode per profile, emit its raw fraction-of-duration curve. Confounded by single-episode noise, phase-budget timing variance, and the cumulative-counter startup-spike artifact. New: aggregate up to N=100 accepted episodes per profile, slice each by labels.jsonl phase events, resample EACH PHASE to a fixed budget so the median across episodes captures the canonical per-phase shape rather than smearing peaks across the timeline. Save median + p25/p75 band to data/processed/attack_profiles_v1.parquet. Per-phase point budget (sums to 80): clean_lead 10, armed 5, infecting 10, infected_running 40, clean_tail 15. dormant (when present) folded into infected_running. Channel swap: io-walk uses proc.cpu_sys_jiffies, NOT proc.io_write_bytes. Host /proc on QEMU doesn't see virtio-blk writes via io.write_bytes (writes go through KVM's I/O path, not write() syscalls); cpu_sys_jiffies tracks kernel time which spikes during heavy I/O scheduling. Concrete result: cpu-saturate now shows the proper plateau-during- infected_running with peak at 100 j/s (was 30 j/s spike at idx 0 then mostly zero); low-and-slow shows its distinctive low-amplitude profile (peak 21 vs cpu-saturate's 100); io-walk shows the rapid-rise-then-decay shape consistent with dd finishing mid-phase. knn.py — sticky model_metric / model_perf: Stream subcommand gains --also-metric / --also-perf-latency-us flags. When set, each cycle publishes a model_metric event (tagged model=knn) for scene-8 (model bars) and a model_perf event for scene-12 (accuracy vs inference cost). Republishing on the cycle keeps reconnecting browsers populated without depending on the dashboard's not-yet-built sticky-event cache. 
Measured KNN inference latency on the 150k-trained classifier: single-window predict: 61.5 ms (sklearn brute-force at 230 D) per-window in batch=64: 3.4 ms (the production-realistic number) Streamer published: model_metric{knn, 0.762} + model_perf{knn, latency_us=3410, accuracy=0.762}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:08:03 -05:00
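The per-phase resampling in profiles.py can be sketched as below. This is not the producer's code: linear interpolation, zero-filling a missing phase, and the function names are all assumptions, and the p25/p75 band and the dormant fold are left out for brevity.

```python
from statistics import median

# Per-phase point budget from the commit message (sums to 80).
PHASE_BUDGET = {
    "clean_lead": 10,
    "armed": 5,
    "infecting": 10,
    "infected_running": 40,
    "clean_tail": 15,
}


def resample(values, n):
    """Linearly interpolate a series onto n evenly spaced points."""
    if not values:
        return [0.0] * n  # assumption: a missing phase contributes zeros
    if len(values) == 1 or n == 1:
        return [float(values[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def episode_curve(phases):
    """phases maps phase name -> raw samples; returns one 80-point curve."""
    curve = []
    for phase, budget in PHASE_BUDGET.items():
        curve.extend(resample(phases.get(phase, []), budget))
    return curve


def median_profile(episodes):
    """Point-wise median across the fixed-length curves of many episodes."""
    curves = [episode_curve(ep) for ep in episodes]
    return [median(column) for column in zip(*curves)]
```

Because every episode is stretched onto the same 80 slots, index 25 always marks the start of infected_running regardless of how long each phase ran, which is what keeps the median from smearing peaks across the timeline.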
Max Gorog	3783fabe86	live scene: per-host swim lanes + latest-detection callout New scene 13 (between perf and references) for fleet-wide live predictions. Each host gets a row of recent prediction cells (capped at 60), painted by predicted phase; mismatch with ground truth shows a hatched overlay. A callout below the lanes holds the most recent detection with model, profile, confidence, and latency. Producer contract is the new LiveDetection dataclass in events.py. The dashboard side is producer-agnostic — the inference loop can run locally or offload to A100 (or any GPU/host); just POST events back. No rate-limiting needed; the swim-lane DOM does the capping. Demo synthesizes 5 hosts walking through phases at ~92% accuracy so the scene reads as live the moment the deck loads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:03:32 -05:00
Max	9d56bcc923	docs: request to dashboard side — persist KNN embeddings on refresh Producer-side knn fit is saved at data/processed/knn_v1.parquet (150k rows, 3.4 MB). Live streamer publishes 2000-point cycles every ~2 s, but per PRODUCERS.md §reconnect-gotcha live events aren't replayed; refresh-to-data is currently bounded by cycle time. Three options laid out for the dashboard chat to pick: A. Sticky cache (per-event-type ring buffer in the broadcaster) B. Feeder reading the parquet → broadcaster.state["embedding_cache"] C. Caddy fileserver + JS fetch on load Whichever option lands, the producer side will adapt (e.g., dump a JSON sidecar if Option C is picked). Path ownership preserved — dashboard owns dashboard/, producer owns producers/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:54:38 -05:00
Max	2aa7b865fb	training/models: knn_semi — semi-supervised self-training KNN Registered as `knn_semi`. Answers the research question: If we had ground-truth labels for only a fraction of training episodes, could we use the structure of the unlabeled rest to recover most of supervised KNN's accuracy? Pipeline (Yarowsky-style self-training): 1. Split train slice deterministically into labeled (label_frac=0.2 default) and unlabeled (1 - label_frac) by row-index hash. 2. Fit a "labeler" KNN on the labeled fraction. 3. Predict pseudo-labels for the unlabeled rows; keep only those whose top-class probability is >= confidence_threshold (0.6). 4. Fit the final KNN on (labeled rows + confident pseudo-labels). Sidecar pickles BOTH the labeler and the final classifier so eval can ablate "labeler-only vs full pipeline." Smoke run (567-episode subset, oracle mode, label_frac=0.2): val_macro_f1 test_macro_f1 knn (100% labels) 0.737 0.133 knn_semi (20% labels) 0.654 0.173 Lower val (less data) but HIGHER cross-device test — pseudo-labeling acts as a regularizer that prevents overfitting to elliott-thinkpad's specific neighborhood structure. Honest research finding worth a slide in the writeup. Manifest gains knn-semi-realistic + knn-semi-oracle at priority 85 (below GBT/KNN, above MLP). Storage cost = augmented set × n_features × 4 bytes; same .knn.pkl sidecar format as plain KNN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:51:30 -05:00
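The four-step pipeline above can be illustrated with a dependency-free sketch. The registered model wraps a scikit-learn KNN; here a toy brute-force KNN is swapped in so the example runs on the standard library alone. `ToyKNN`, `is_labeled`, `self_train` and the SHA-256 split are hypothetical stand-ins, while `label_frac=0.2` and `confidence_threshold=0.6` are the defaults quoted in the message.

```python
import hashlib
from collections import Counter


def is_labeled(row_index: int, label_frac: float = 0.2) -> bool:
    """Deterministic labeled/unlabeled split by hashing the row index."""
    digest = hashlib.sha256(str(row_index).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < label_frac


class ToyKNN:
    """Brute-force k-nearest-neighbours on tuples of floats."""

    def __init__(self, k: int = 3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_top(self, x):
        """Return (majority label, share of the k neighbours voting for it)."""
        def dist(i):
            return sum((a - b) ** 2 for a, b in zip(self.X[i], x))
        nearest = sorted(range(len(self.X)), key=dist)[: self.k]
        label, count = Counter(self.y[i] for i in nearest).most_common(1)[0]
        return label, count / len(nearest)


def self_train(X, y, label_frac=0.2, confidence_threshold=0.6, k=3):
    # Step 1: split rows into labeled and unlabeled by row-index hash.
    labeled = [i for i in range(len(X)) if is_labeled(i, label_frac)]
    unlabeled = [i for i in range(len(X)) if not is_labeled(i, label_frac)]
    # Step 2: fit the labeler on the labeled fraction only.
    labeler = ToyKNN(k).fit([X[i] for i in labeled], [y[i] for i in labeled])
    # Step 3: keep pseudo-labels whose top-class share clears the threshold.
    aug_X = [X[i] for i in labeled]
    aug_y = [y[i] for i in labeled]
    for i in unlabeled:
        label, confidence = labeler.predict_top(X[i])
        if confidence >= confidence_threshold:
            aug_X.append(X[i])
            aug_y.append(label)
    # Step 4: fit the final model on labeled rows plus confident pseudo-labels.
    final = ToyKNN(k).fit(aug_X, aug_y)
    return labeler, final
```

Returning both models mirrors the sidecar layout described above, so an evaluation can compare the labeler alone against the full pipeline.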
Max	e46906b68c	training/producers/knn: supervised LDA / UMAP projector + batched publish Two changes that make scene-11 actually look like a clustering scene: 1. Supervised projection (--projector lda \| umap \| pca) - PCA was variance-greedy and oblivious to phase labels — clumped classes together because the dominant variance directions weren't class-discriminative. - LDA (default): Fisher Linear Discriminant. Linear, fast (~seconds), reproducible. On 150k windows: between-class variance 0.462 / 0.331 / 0.167 across the three axes (96% of class-discriminative info in the first 3 dims). - UMAP (--projector umap): supervised nonlinear manifold embedding; tighter visual clusters at the cost of ~10 minutes for 150k on a Pi-class CPU. Reproducible via random_state. Subsamples to 20k for fit then transforms remaining points. - PCA still available for reference / debugging. 2. Batched concurrent publish (--burst-size N) - Sequential publish was ~6.5 ms/event over loopback HTTP → 13 s per 2000-point cycle. - asyncio.gather with burst_size=50 turns each batch into ~5 ms, so the same cycle is ~0.5 s. Browsers see the scatter populate in well under a second instead of waiting through a 13 s cycle per refresh. - Default burst_size=50 is conservative — the dashboard's WebSocket fan-out can take more pressure but 50 leaves headroom. Saved fit format unchanged (data/processed/knn_v1.parquet); the streamer's --load-fit reads the same parquet regardless of which projector produced it. The LDA / UMAP choice is captured in the producer's log + saved parquet metadata, not in the file shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:45:16 -05:00
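The batched publish in item 2 follows a common asyncio pattern: slice the events into bursts and await each burst with `asyncio.gather`. The sketch below assumes a hypothetical `publish_one` coroutine standing in for the loopback HTTP POST; it is not the producer's actual client code.

```python
import asyncio


async def publish_one(event):
    """Hypothetical stand-in for one HTTP POST to the dashboard."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return event


async def publish_in_bursts(events, burst_size=50, publish=publish_one):
    """Publish events concurrently, at most burst_size in flight at a time."""
    sent = 0
    for start in range(0, len(events), burst_size):
        batch = events[start:start + burst_size]
        # All publishes in the batch run concurrently; the next batch
        # starts only after every one of them has finished.
        await asyncio.gather(*(publish(event) for event in batch))
        sent += len(batch)
    return sent


if __name__ == "__main__":
    total = asyncio.run(publish_in_bursts(list(range(2000))))
    print(f"published {total} events in {2000 // 50} bursts")
```

For scale, 2000 sequential publishes at about 6.5 ms each add up to the 13 s cycle quoted above, while a burst size of 50 reduces the same cycle to 40 awaited batches.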
Max Gorog	2abc55a59b	knn scatter: auto-fit projection to running data spread Project around mean ± k·σ instead of the raw [0,1]³ producer-unit cube. PCA-3 outputs are Gaussian-ish so even after the producer's min/max rescale, the bulk of points clusters near the centroid; without auto-fit the scatter looks dead-centre and tiny. Implementation: incremental Welford-ish stats (running sum / sum²) per axis, recomputed lazily on the first frame after new data arrives. project() centers and σ-scales each point to ~[-0.5, 0.5]; outliers clamp to ±0.7 so they're visible just outside the cube. The bounding cube now traces mean ± k·σ instead of [0,1]³, which is also the natural visual unit for the "data spread" the user reads off the screen. resetStats() runs on demo toggle and is implicit when points are cleared. SPREAD_K=2.5 puts ~99% of normally-distributed data inside the cube; MIN_STD=0.02 keeps degenerate (all-equal) data from exploding the divisor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:33:19 -05:00
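The auto-fit projection can be sketched per axis in Python. The widget itself is JavaScript; the class name is hypothetical, the use of population variance is an assumption, and dividing by 2·k·σ is one reading of "σ-scales each point to ~[-0.5, 0.5]". The constants are the ones quoted in the message.

```python
import math

SPREAD_K = 2.5   # cube edge spans mean +/- k*sigma
MIN_STD = 0.02   # floor so all-equal data cannot blow up the divisor
CLAMP = 0.7      # outliers are pinned just outside the cube


class AxisStats:
    """Running sum / sum-of-squares statistics for one axis."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value: float) -> None:
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def mean(self) -> float:
        return self.total / self.n if self.n else 0.0

    def std(self) -> float:
        if self.n == 0:
            return MIN_STD
        m = self.mean()
        variance = max(self.total_sq / self.n - m * m, 0.0)
        return max(math.sqrt(variance), MIN_STD)

    def project(self, value: float) -> float:
        """Map mean +/- k*sigma onto [-0.5, 0.5], clamping outliers to +/-0.7."""
        scaled = (value - self.mean()) / (2 * SPREAD_K * self.std())
        return max(-CLAMP, min(CLAMP, scaled))
```

A point exactly k standard deviations above the mean lands on the cube face at 0.5, and anything far beyond that is drawn at 0.7 so it stays visible just outside the cube.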
Max	aa6187042b	.gitignore: exclude data/processed/knn_*.parquet KNN fit output (PCA-3 + KMeans + KNN-classifier predictions per window) is a derived artifact regenerable from features_window_v1. Like features_window itself it stays out of git; the streamer reads it from disk on the producing host.	2026-05-08 13:20:17 -05:00
Max Gorog	f537ab8686	models scene: paint the knn bar (CSS color + demo entry) The model-bar widget rendered .model-fill.knn with no gradient when a model_metric{model:"knn"} arrived, leaving an empty track. Add a green gradient and include knn in the demo-mode set so the row is visible without waiting on the producer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:16:38 -05:00
Max	ba5ff70c14	training/producers/knn: add `stream` subcommand for disk-loaded loop The fit pipeline (PCA-3 + KMeans + KNN classifier) can be expensive to recompute every time a producer starts. `produce --fit-out` already dumps the per-window (x, y, z, phase_int, predicted_int, cluster) to a parquet; this commit adds a `stream` subcommand that loads that parquet and publishes Embedding events on a loop. Why a separate streamer: - The dashboard's live event stream is not replayed on browser reconnect (PRODUCERS.md §reconnect-gotcha). A browser that connects 30 s after the last cycle of the producer sees an empty scatter unless we re-publish. - The fit is deterministic given (features, seed) — no need to repeat it just to re-publish points. The streamer is small and stateless; it can run as a long-lived service. Usage: python -m training.producers.knn produce \\ --window data/processed/features_window_v1.parquet \\ --schema data/processed/feature_schema_v1.json \\ --fit-out data/processed/knn_v1.parquet \\ --no-publish python -m training.producers.knn stream \\ --load-fit data/processed/knn_v1.parquet \\ --loop --max-points 2000 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:13:09 -05:00
Max Gorog	97eb34f7f6	baseline prose: reflect the dataset-derived phase mix The widget no longer rolls the last 5 minutes; it aggregates time-weighted phase durations across a sampled slice of the on-disk dataset. The prose now matches the bar. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:07:36 -05:00
Max	2187a5d752	training/models: KNN as a registered supervised model Non-parametric baseline alongside GBT/MLP/CNN/GRU/LSTM/Transformer. Same BaseModel + schema-hashed checkpoint contract; sidecar is a pickled sklearn KNeighborsClassifier (.knn.pkl) handled by the existing checkpoint machinery alongside .xgb.json / .pt. KNN's storage cost = n_train_rows × n_kept_features × 4 bytes. At 660k windows × 145 kept features (realistic mode) the sidecar is ~380 MB; at 230 features (oracle) it is ~600 MB. Heavy but ships through the same artifact-upload path. trainer/run.py learns a third fit branch: - GBT: XGBoost early stopping on val mlogloss - KNN: fit() memorizes; "training time" is val/test predict cost - NN: train_nn loop (the rest) Manifest gains knn-realistic + knn-oracle at priority 95 (just below GBT). KNN's k=10 default lives in the model class; overriding via hyper.k requires adding --k to run.py first to avoid the unknown-arg exit-2 issue. Smoke verified on the 567-episode subset: knn oracle val=0.7365 test=0.1333 (held-out k-gamingcom). That val/test gap (0.74 → 0.13) is the cross-device generalization story: KNN memorizes elliott-thinkpad's local feature space and falls apart on the other host. Honest baseline for the comparison report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:06:56 -05:00
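The sidecar sizes quoted in commit 2187a5d752 follow directly from its storage formula. A quick arithmetic check, assuming 4-byte (float32) feature values and decimal megabytes; the function name is illustrative only and is not part of the repository.

```python
def knn_sidecar_bytes(n_train_rows, n_kept_features, bytes_per_value=4):
    """Storage for the memorized training matrix: rows x features x bytes."""
    return n_train_rows * n_kept_features * bytes_per_value


realistic_mb = knn_sidecar_bytes(660_000, 145) / 1e6   # 382.8 MB, quoted as ~380 MB
oracle_mb = knn_sidecar_bytes(660_000, 230) / 1e6      # 607.2 MB, quoted as ~600 MB
print(f"realistic: {realistic_mb:.1f} MB, oracle: {oracle_mb:.1f} MB")
```

This counts only the raw training matrix. It ignores pickle overhead and any neighbor-index structure the classifier may store, so the file on disk could be somewhat larger.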
Max Gorog	51f2437b71	baseline: phase mix from sampled dataset, not 5-min window The widget was waiting on live `phase` events that don't flow when no orchestrator is running, so it sat empty. Replace the rolling 5-minute window with a periodic feeder that samples 500 random episode tarballs from /var/lib/cis490/episodes, extracts each labels.jsonl, and aggregates phase durations using consecutive t_mono_ns deltas. Result lands in broadcaster.state["phase_mix"] (survives snapshot cycles via dict.update) and re-broadcasts every ~10 min. Frontend reads phase_mix from snapshot on connect and from live phase_mix events on refresh; the bar uses time-weighted proportions when available (falls back to label counts), and only sums canonical phases for the denominator so non-displayed `failed` records don't shrink the visible bars. Eyebrow and sub-line update with live sample/population/label counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:04:36 -05:00
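A hedged sketch of the time-weighted aggregation described in commit 51f2437b71: durations come from consecutive t_mono_ns deltas, and only canonical phases count toward the denominator. It assumes each labels.jsonl record has already been parsed into a dict with `phase` and `t_mono_ns` keys, sorted by time, and that a phase lasts until the next record's timestamp (so the final record contributes no duration). The phase names `idle` and `active` are placeholders; the commit message names only `failed` as a non-displayed record type.

```python
from collections import defaultdict

# Placeholder phase names; the real canonical set is not given in the commit message.
CANONICAL_PHASES = {"idle", "active"}


def phase_durations(records):
    """Sum time spent in each phase from consecutive t_mono_ns deltas."""
    durations = defaultdict(int)
    for current, following in zip(records, records[1:]):
        durations[current["phase"]] += following["t_mono_ns"] - current["t_mono_ns"]
    return dict(durations)


def phase_mix(durations):
    """Time-weighted proportions over canonical phases only."""
    total = sum(ns for phase, ns in durations.items() if phase in CANONICAL_PHASES)
    if total == 0:
        return {}
    return {p: ns / total for p, ns in durations.items() if p in CANONICAL_PHASES}


labels = [
    {"phase": "idle", "t_mono_ns": 0},
    {"phase": "active", "t_mono_ns": 3_000_000_000},
    {"phase": "failed", "t_mono_ns": 4_000_000_000},
    {"phase": "idle", "t_mono_ns": 6_000_000_000},
]
print(phase_mix(phase_durations(labels)))   # {'idle': 0.75, 'active': 0.25}
```

Here the two seconds spent in `failed` are excluded from the denominator, so the visible proportions still sum to 1, which is the behavior the commit message describes for the bar.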
Max	ac9b5b6f07	training/producers: knn producer for scene-11 + ModelMetric{knn} KNN-driven embedding events for the dashboard's KNN scatter scene (scene 11). One forward pass populates all three of the scatter's mode-toggle fields: x, y, z — PCA-3 projection of the standardized window features phase — ground-truth phase from labels.jsonl predicted — KNN classifier's prediction (k=10, distance-weighted) cluster — MiniBatchKMeans cluster id (k=8 default) Two subcommands: python -m training.producers.knn produce ... emit Embedding events python -m training.producers.knn metric ... publish ModelMetric{knn} on a tick (re-publish for reconnect-warmth) KNN classifier uses the held-out-by-host split aligned with the supervised pipeline (train ∪ val on elliott-thinkpad, predict on k-gamingcom) so the predictions reflect cross-device generalization, not in-distribution self-prediction. Smoke-verified end-to-end against the live dashboard (3 clients): 800 embedding events delivered in 12 s; ModelMetric{knn} with test_macro_f1 = 0.4297 on the 567-episode smoke subset, sitting between the trained GBT (0.557) and the under-trained NN models (0.09–0.18) — sensible for a non-parametric baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:03:19 -05:00
Max Gorog	12ac409ab2	knn scene: drag-to-rotate 3-D scatter + KNN/cluster color modes Replace the SVG 2-D scatter with a canvas-based 3-D one. Three color modes (phase / predicted / cluster) with a toggle; drag the surface to rotate; reset button. Bounding cube draws faintly so the rotation reads as 3-D rather than re-shuffled 2-D. Embedding event gains optional z / predicted / cluster fields. 2-D producers still work (z defaults to 0.5, no other behavior changes). CSS adds .scatter3d-* rules; --theme-h-num exposed for cluster-color hue arithmetic. Synthetic demo data is now 3-D Gaussian clusters with ~7% mislabeled "predictions" so the predicted-mode view differs from ground truth at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:55:31 -05:00
Max Gorog	9e38f78379	training/dashboard(references): description sidebar + better space use Two changes per the user's feedback that the slide had unused horizontal space and needed per-PDF context. Layout - The reference scene is now a 2-column grid inside the metric-stack: PDF iframe at ~1.7fr on the left, description panel at ~0.55fr on the right (min 280px). On narrow viewports (<1100px) it falls back to a vertical stack with the description capped to 240px. - Added #zoom=page-width to the iframe URL so the PDF's page fits its column width instead of leaving margins beside an 8.5x11 page rendered in a wider iframe. - Hide the prose card on the references scene — the description panel inside the stack covers what the prose was saying, and freeing the right edge gives the description proper room. Description content - Backend reads <stem>.md sidecar files alongside each PDF and returns the contents in the /api/references payload. - Frontend renders them with a tiny built-in markdown subset (headings, bold/italic, lists, inline code, paragraphs) — no third-party renderer dependency. - Initial draft sidecar .md files committed for the four PDFs currently in references/. Each describes how the paper informs a specific scene of the deck (which model row, which eval protocol, which channel selection). Edit them in place and the panel updates on the next reload.	2026-05-08 12:40:32 -05:00
Max	69c563275a	training: parallelize lambda bootstrap (2 jobs at a time on the A100) At our model sizes (max ~250 K params, max batch 512), each training process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with two concurrent jobs. Bounded-concurrency rolling launcher cuts sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest. PARALLEL=2 (default) — override via env var if running on a smaller GPU or testing the queue logic. Per-job logs still land at logs/<model>_<mode>.log; failure reporting is the same. Idempotent: skipping already-present checkpoints unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:37:03 -05:00
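The launcher in commit 69c563275a is part of the shell bootstrap, so the following Python is only an illustration of the same bounded-concurrency idea, not the script itself. It runs a job list with at most PARALLEL processes in flight and collects the names of failed jobs. The job commands are placeholders standing in for the real training invocations, and the helper names are hypothetical.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(name, command):
    """Run one job to completion and return (name, exit code)."""
    result = subprocess.run(command, capture_output=True)
    return name, result.returncode


def run_all(jobs, parallel):
    """Run jobs with at most `parallel` in flight; return names of failed jobs."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # A new job starts as soon as a worker frees up (rolling launch);
        # results come back in submission order.
        results = list(pool.map(lambda job: run_job(*job), jobs))
    return [name for name, code in results if code != 0]


# Placeholder commands; a real launcher would invoke the trainer per manifest entry.
jobs = [
    ("ok-job", [sys.executable, "-c", "pass"]),
    ("bad-job", [sys.executable, "-c", "raise SystemExit(3)"]),
]
parallel = int(os.environ.get("PARALLEL", "2"))   # default of 2 per the commit message
failed = run_all(jobs, parallel)
print("failed:", failed)
```

Threads are enough here because each worker only waits on a child process. An idempotent variant would check for an existing checkpoint before calling `subprocess.run` and return exit code 0 immediately when one is present.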
Max Gorog	bee40a6ae9	training/dashboard: references scene with PDF viewer + tab strip New scene 13 (after perf, the last in the deck) renders a tabbed PDF viewer. Each tab is one .pdf in /opt/cis490/references/; the active tab swaps the iframe's src to /refs/<encoded-filename>. Backend - /api/references — lists pdfs in REFS_DIR, returning {"name": stem (newlines stripped), "path": "/refs/<urlencoded>"}. - /refs static mount — serves the PDFs directly. check_dir=False so the dashboard still boots if the directory is missing. - REFS_DIR resolves relative to the install root so it works on /opt/cis490 in production and any dev tree. Frontend - Stage view uses metric-stack-wide for the broader card; the references scene also overrides .stage-view padding-right down to a small gutter so the iframe takes most of the screen horizontally — the prose card still sits on the right but the PDF area is roughly 70% wide on standard viewports. - Tabs are styled like .db-tab (palette-aware pills) and stop propagation so they don't trigger the click-to-advance gesture. - Iframe is lazy-loaded: src isn't set until the user actually scrolls into the references scene OR clicks a tab, so the browser doesn't fetch a big PDF the user may never view.	2026-05-08 12:34:52 -05:00
Max	308140c6ce	training: lambda-cloud one-shot training integration External-GPU path for the time-pressured first round, before the Windows desktop joins the WG fleet. Lambda is treated as an "external worker" whose output lands in the same /var/lib/cis490/models/ tree the receiver-coordinated fleet uses, so cis490-jobs status reflects Lambda runs identically to fleet runs. Three scripts + one ingest tool: scripts/build-lambda-bundle.sh Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with: - the repo (sans .git, sans data/, sans artifacts*) - data/processed/{validation_v1,features_window_v1}.parquet - data/processed/feature_schema_v1.json - data/processed/tensor_window_v1/ (npz shards) - bootstrap.sh (entrypoint) - training_manifest.toml (the canonical job list) - BUNDLE_MANIFEST.json (commit hash + counts + build stamp) Verifies all four data inputs exist BEFORE compressing 5+ GB. scripts/run-on-lambda.sh ubuntu@<ip> rsync bundle up → ssh + run bootstrap → rsync artifacts + reports/eval back to artifacts-lambda/ + reports/lambda/. Resumable rsync; sha256-verified. scripts/lambda-bootstrap.sh (runs ON the Lambda instance) Creates .venv with cu121 torch + xgboost + the [training] deps, iterates the manifest's job list in priority order (highest first), runs trainer/run.py (or run_ssl.py for transformer_ssl) per job, skips jobs whose .ckpt.json already exists (idempotent on re-run), writes per-job logs/<model>_<mode>.log, runs eval suite at the end, stamps artifacts/RUN_SUMMARY.json with counts + failed-job list. tools/ingest_lambda_artifacts.py Bundles each (ckpt.json + sidecar + train.json) trio into a .tar.zst, sha256, PUTs to the local trainer-receiver's /v1/model/{job_id}, marks the job complete. Maps (model, mode) → job_id by re-reading the canonical manifest. Handles the queue state churn (requeue if completed, claim if pending, fail-back on race losses). 
End-to-end smoke verified on the A100 instance just provisioned: - SSH from Pi via ed25519 keypair (cis490-trainer-pi) - GPU: A100-SXM4-40GB, driver 580.105.08 - venv warmed: torch 2.5.1+cu121, xgboost 3.2.0 - 464 GB ephemeral disk available Pi-side feature build (build_features.py + build_tensors.py against all 72,952 accepted+degraded episodes) is in progress; bundle build gates on its completion. Estimated wall-clock for the full Lambda training run on A100: ~2.5 hours for 12 supervised + 2 SSL models + eval suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:32:04 -05:00
Max	697e36a315	training/producers: move out of dashboard/ per ownership boundary Producers are event sources — the renderer is everything inside training/dashboard/. Sibling layout makes the dependency direction one-way (producers import from training.dashboard.events; dashboard never reaches into producers). training/dashboard/producers/ → training/producers/ Internal imports rewritten via sed; eval_/run.py and training/README.md cross-references updated. CLI entry stays via `python -m training.producers.<sub>` (replay / metrics / perf / profiles). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:06:56 -05:00


188 commits