The CSS-rule-per-canonical-name approach was wrong: any name the
producer publishes that wasn't in the hardcoded list (mlp_realistic,
cnn_oracle, knn_semi, anything new tomorrow) rendered grey because
no .model-fill.<name> rule matched.
Replace with a deterministic FNV-1a hash of the model string → hue,
applied inline as an OKLCH gradient when the row is created. Every
model string gets a stable, distinct color regardless of suffix or
case. Inline style beats any stylesheet rule that is not !important,
so the color no longer depends on which class rules exist in
dashboard.css.
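A minimal sketch of the hash-to-hue mapping, written in Python for
illustration (the dashboard does this in JavaScript). The function
names, the lightness/chroma values, and the 40-degree gradient offset
are assumptions, not the values in dashboard.js.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of text."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the product in 32 bits
    return h


def model_hue(model: str) -> int:
    """Map a model name to a hue in [0, 360)."""
    return fnv1a_32(model) % 360


def model_gradient(model: str) -> str:
    """Build an inline CSS gradient; lightness/chroma are illustrative."""
    hue = model_hue(model)
    end = (hue + 40) % 360  # hypothetical offset for the gradient's far stop
    return (f"linear-gradient(90deg, oklch(0.70 0.15 {hue}), "
            f"oklch(0.80 0.15 {end}))")
```

The same string always yields the same gradient, so a name that first
shows up tomorrow gets a color without any stylesheet change.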
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the user's request — the rubric-derived scenes I added in one
sweep weren't tied closely enough to their actual project narrative
and ate up presentation time. Reverting to the pre-insertion deck:
removed:
  problem-statement / research-questions / solution-overview /
  evaluation-setup / theoretical / practical / design-principles /
  limitations / conclusion-future
kept (user-requested earlier in the session):
  motivation (with the IEEE 9881803 citation)
  live (A100 inference scene)
CSS rules and references/* sidecar files for the removed scenes
are left in place as harmless dead code; they can be cleaned up
later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mlp / cnn / knn_semi were rendering grey for at least one client even
though their .model-fill.<name> rules were identical specificity to
the working ones (lstm/gru/bert/knn/gbt). Probable cause: stale
browser cache or a theme-pass rule clobbering background:.
!important on every model gradient is heavy-handed but guarantees
the deck reads right during the live talk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bar widget had gradients for lstm / gru / rnn / bert / knn
only — any other model name (cnn, mlp, transformer, gbt, knn_semi,
transformer_ssl) rendered a track but no fill. Now:
- Added explicit gradients for cnn, mlp, transformer,
transformer_ssl, gbt, knn_semi (each visually distinct from the
existing five).
- Added a neutral grey gradient fallback on .model-fill itself, so any
unanticipated model name still produces a visible bar instead of
silently disappearing. The specific class rules override it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
gru / lstm / bert (knn handled by knn streamer); read both
*_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
range so honest-low cross-host F1s show as a bar instead of
clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
(scenes 7/8/12, sticky cache, lambda-inference-demo)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Saves a click during live demos. Topbar tooltip updated to mention
the binding. Hotkey is gated by the same input-focus check as 'c' /
arrow keys, so typing 'd' in a search box won't fire it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It contributed no training data, so the A100 wouldn't be running
inference on its windows. Only hosts that actually produced data
(elliott-thinkpad, k-gamingcom) should appear as the source of
synthetic predictions in the live scene.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2500ms read too slow. 1000ms is the sweet spot — under the real
ceiling of ~1.5/sec but still lively enough to feel like a working
inference loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was 280ms (~3.5 events/sec) — way too fast for real fleet
inference. The bottleneck is window arrival (one 10-second window
per host per 10 s), not A100 forward-pass speed. With ~3 hosts × 5
models that's ~1.5 events/sec real ceiling, so demo at 2500ms
(~0.4/sec) reads honest without claiming impossible throughput.
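The ceiling arithmetic above, as a quick check. The function name is
hypothetical; the host, model, and window figures are the ones quoted
in this message.

```python
def events_per_second(hosts: int, models: int, window_seconds: float) -> float:
    """Upper bound on inference events: each host yields one window per
    period, and every model scores each window once."""
    return hosts * models / window_seconds


ceiling = events_per_second(hosts=3, models=5, window_seconds=10.0)
old_rate = 1000 / 280    # events/sec at the old 280 ms interval
new_rate = 1000 / 2500   # events/sec at the new 2500 ms interval
```

The old interval sits above the ceiling and the new one below it,
which is the point of the change.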
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stats line now reads "A100 inference · live · N models · X infer/sec
· last window: <host> · hit-rate: …" instead of "live detections ·
N hosts · model: …". Prose rewritten to describe lanes as side-by-
side model-agreement check rather than per-host activity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scene's framing was wrong. It's about the A100 doing live
model predictions, not about per-host telemetry collection. Lanes
now key on `model` instead of `host_id`; the callout leads with
model name + A100 latency, demoting host/profile to secondary
metadata. Stats line reads "N models · X infer/sec · last window
from <host>" instead of "N hosts · model: X".
Demo synthesis updated to match: 5 trained models cycle through
predictions on rotating fleet windows, each model with its own
accuracy + latency profile (KNN fast/loose, BERT slow/precise) so
the lanes visually differ. Article prose reframes the scene as
side-by-side model agreement, the natural read of per-model lanes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The KNN producer works; KNN does not need a demo-mode fallback.
Remove demo_start / demo_stop / cachedReal / demoActive scaffolding
that I'd added speculatively. Embedding events render directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as models / perf / live: cachedReal accumulates real
embedding events at all times; demoActive flag gates which source
renders.
- demo on → only synthetic clusters
- demo off → only real embeddings (replayed from cachedReal)
Cache cap 5000 points to bound memory across long sessions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was: demo seeded on demo_start, then real producer events rendered
on top of the synthetic bars/points/cells. Both sources visible
simultaneously — visually confusing.
Now: each widget tracks demoActive + a cachedReal store.
- demo_start: set demoActive=true, clear, repaint from synthetic
- demo_stop: set demoActive=false, clear, repaint from cachedReal
- on real event: always cache; only render when demo is off
Toggling demo flips between two clean pictures with no overlap.
cachedReal grows as real producer events arrive even while demo is
on, so demo_stop restores immediately without waiting for the
producer to re-publish.
Applied to: models bars, perf scatter, live detections.
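A sketch of the gating pattern in Python, for illustration only. The
widgets themselves are JavaScript; the class and attribute names here
are hypothetical stand-ins for demoActive / cachedReal.

```python
class DemoGatedWidget:
    """Holds two sources and renders exactly one of them at a time."""

    def __init__(self):
        self.demo_active = False
        self.cached_real = []   # real events, cached even while demo is on
        self.rendered = []      # stand-in for what is painted on screen

    def on_real_event(self, event):
        self.cached_real.append(event)   # always cache
        if not self.demo_active:         # only render when demo is off
            self.rendered.append(event)

    def demo_start(self, synthetic_events):
        self.demo_active = True
        self.rendered = list(synthetic_events)   # clear, repaint from synthetic

    def demo_stop(self):
        self.demo_active = False
        self.rendered = list(self.cached_real)   # clear, repaint from cache
```

A real event that arrives while demo is on never reaches the screen,
but it is present the moment demo_stop() repaints.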
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same hasReal* gating I already used for phase_mix, applied to:
- models bars (model_metric)
- perf scatter (model_perf)
- live detections (live_detection)
Each widget tracks whether a real producer event has arrived; demo
only seeds when nothing real has been seen yet, and demo_stop
preserves real state instead of wiping it.
demoTick is now a no-op — periodic model_metric jitter was
overwriting real values mid-stream. Per-widget one-shot seeding
on demo_start (gated by hasReal*) is enough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues with the accuracy-vs-latency scatter:
1. Linear x crammed RNN/GRU/LSTM into ~25 px of axis (380/520/700 μs)
while BERT alone took the right 80 % (3200 μs).
2. Labels placed at fixed +12 right of each point overlapped both
neighbouring points and other labels in the recurrent cluster.
Fixes:
- X-axis switched to log10 with bounds 10μs–10ms; tick labels and
marks added at 10μs / 100μs / 1ms / 10ms so the audience can
read the scale.
- Y-axis bounds widened to [0.5, 1.0] (was [0.7, 1.0]) so KNN's
~0.43 cross-host F1 falls within the visible plot area instead
of off-bottom; ticks added at 0.6 / 0.8 / 1.0.
- Anti-overlap label placement: sort points by x, alternate
above (-12) / below (+18) the circle. Adjacent labels can no
longer share both x and y bands. repaintLabels() re-runs on
each model_perf event so late arrivals slot into the staircase.
Y-axis title also updated: "held-out accuracy" → "held-out macro-F1"
to match the actual metric the producer reports.
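A Python sketch of the two fixes, assuming points are keyed by model
name with latency in microseconds. The function names are
hypothetical; the axis bounds and the -12 / +18 offsets are the ones
listed above.

```python
import math

X_MIN_US, X_MAX_US = 10.0, 10_000.0   # axis bounds: 10 us to 10 ms


def log_x_fraction(latency_us: float) -> float:
    """Position along the x-axis in [0, 1] on a log10 scale."""
    span = math.log10(X_MAX_US) - math.log10(X_MIN_US)
    return (math.log10(latency_us) - math.log10(X_MIN_US)) / span


def label_offsets(points: dict) -> dict:
    """Sort models by latency and alternate labels above (-12) / below (+18)."""
    ordered = sorted(points, key=lambda name: points[name])
    return {name: (-12 if i % 2 == 0 else 18) for i, name in enumerate(ordered)}


latencies_us = {"rnn": 380, "gru": 520, "lstm": 700, "bert": 3200}
offsets = label_offsets(latencies_us)
```

Re-running label_offsets() on every model_perf event is what lets a
late arrival take its place in the alternation.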
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same scope-narrowing as collect / hosts / db / knn — attack profiles
are real data from the orchestrator's catalog, so the deck should
display whatever the producer publishes via attack_profile events
and not overwrite that with synthetic curves on demo_start.
Removed both demo_start (synthesize) and demo_stop (clearAll)
handlers; the syntheticProfiles helper is left in place for
reference but is no longer wired to anything.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the data-ownership scope: collect (episodes-ingested counter),
hosts (per-host bars), and db (database explorer) all work fine in
or out of demo mode — they read real values from the server's
snapshot. Demo mode shouldn't be injecting fake `episode` records
into them.
Removed both dispatches from demoTick:
- `episode` (was 70% per tick) — no longer clobbers collect/hosts/db
- `phase` (was 50% per tick) — dead code anyway; baseline now
consumes the dataset-derived `phase_mix` event, not raw `phase`
demoTick is now just the model_metric jitter (5% per tick) so the
sequence-model bars don't sit frozen during a long demo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two targeted fixes for the demo-toggle path; intentionally narrow so
we don't override widgets that already work in both modes (KNN
scatter, DB explorer).
Phase-mix bar
- Tracks `hasRealMix` and only injects a synthetic fallback on
demo_start if no real snapshot/phase_mix event has been seen.
If real data later arrives, applyMix overwrites the synthetic
value automatically.
- Synthetic numbers mirror a real production run (500/78705
episodes, ~4.5 hours of weighted seconds) so the bar reads
correctly during a deck-only demo.
KNN model_metric
- Periodic demoTick tweaks now include `knn` alongside rnn/gru/lstm/
bert. Initial demo_start already populated all five bars; the
periodic tweak just keeps the knn bar from sitting frozen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi-safe replacement for the original metrics.py + perf.py producers
which load every checkpoint into memory and score the test set on each
cycle. That pattern crashed the Pi during this project (300 MB knn
pickles × 6 variants + 226 MB test set in memory at peak ≈ OOM).
The new producer:
- reads reports/eval/<model>_<mode>_train.json files (already
contain the test_macro_f1 each trainer wrote)
- publishes one model_metric event per file
- publishes one model_perf event per file with a hardcoded
per-architecture latency estimate (gbt 250 µs, knn 3500, mlp 50,
cnn 500, gru 1500, lstm 2000, transformer 800, transformer_ssl
1000). These are family-level order-of-magnitude figures; proper
benchmarks need to run on the deployment hardware (which is the
A100, not the Pi).
- re-publishes on a tick (default 30 s) for refresh-resilience.
- NO model loading. Pi-safe.
scripts/rsync-from-lambda.sh — pulls Lambda's artifacts/ + reports/eval/
to the Pi every 30 s. As Lambda finishes each model and writes its
train.json, the Pi sees the new file within a cycle and the publisher
broadcasts the metric on its next tick. Live multi-model dashboard
during training, with no Pi-side inference.
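A sketch of the file-to-event step, under stated assumptions: the
event dict layout and the function name are hypothetical; only the
file naming, the test_macro_f1 key, and the latency table come from
this message.

```python
import json
from pathlib import Path

# Family-level latency estimates in microseconds, as listed above.
LATENCY_US = {"gbt": 250, "knn": 3500, "mlp": 50, "cnn": 500, "gru": 1500,
              "lstm": 2000, "transformer": 800, "transformer_ssl": 1000}


def build_events(eval_dir: Path) -> list:
    """Turn each <model>_<mode>_train.json into metric/perf event dicts."""
    events = []
    for path in sorted(eval_dir.glob("*_train.json")):
        # rsplit keeps underscores inside names such as transformer_ssl.
        parts = path.stem.rsplit("_", 2)
        if len(parts) != 3:
            continue  # not a <model>_<mode>_train file
        model, mode, _ = parts
        report = json.loads(path.read_text())
        f1 = report.get("test_macro_f1")
        if f1 is None:
            continue  # no score recorded yet; skip instead of guessing
        events.append({"type": "model_metric", "model": model,
                       "mode": mode, "value": f1})
        if model in LATENCY_US:
            events.append({"type": "model_perf", "model": model,
                           "latency_us": LATENCY_US[model], "accuracy": f1})
    return events
```

Nothing here imports a model or touches a checkpoint, so memory use
stays at the size of the small JSON reports.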
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REORDER
- collect (big-number ingest counter) moved from #7 to #2 — sits
right after the title as the dataset-quantity hook
- training-code moved from #15 to #14 — "how we trained" now
appears before "what we got" (models accuracy bars)
EVAL FRAMING CORRECTION
The fleet hosts are uniform — every host runs every profile, just
at different rates — so the actual split is held-out-by-sample
(profile-stratified), NOT held-out-by-host. Both hosts contribute
to train, val, AND test. The generalization claim is "unseen
malware sample_name", not "unseen device".
Fixed across:
- evaluation-setup: split-recipe block, val↔test gap (was
"cross-host gap"), prose
- problem-statement: RQ wording, "generalize across hosts" →
"generalize to sample_names"
- research-questions: RQ2 ("from a host the training set never
saw" → "sample_names the training set never saw"); literature-gap
bullet flipped from "cross-host generalization" to "sample-
stratified evaluation"; prose
- solution-overview: pipeline diagram caption
- theoretical-contributions: "cross-host as the eval axis" →
"held-out-by-sample as the eval axis"
- limitations: two-host-fleet card now states "both hosts
contribute to train/val/test"; "KNN cross-host gap" → "KNN
val ↔ test gap"
- conclusion-future: bullet flipped to held-out-by-sample as
primary axis
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five required + four optional slides, slotted into the existing flow
without renumbering the visible deck UI:
REQUIRED
- problem-statement (after motivation): single-sentence problem,
three numeric stat cards, explicit task-type justification
(multi-class classification, why not regression/ranking)
- research-questions (after problem-statement): two-column literature
gap layout + RQ1/RQ2/RQ3
- solution-overview (after research-questions): inline-SVG block
diagram of the pipeline (fleet hosts → receiver → episodes →
windowing → model zoo → per-window phase → trust score →
containment + reset)
- evaluation-setup (between chunking and models): four blocks
covering split recipe, primary metric, baselines compared, and
what's reported alongside accuracy. Each block leads with the
*why*, matching the assignment's "explain not only what will be
measured but why" requirement.
- conclusion-future (before references): two-column "what we showed"
+ unsupervised next steps (clustering / anomaly / SSL pretrain /
embedding viz). Addresses Section 8 of the assignment guide.
OPTIONAL
- theoretical-contributions: window-centre labelling,
schema-hashed checkpoints, cross-host as eval axis
- practical-contributions: /proc-only deployment,
producer-agnostic dashboard, labelled dataset on disk
- design-principles: one-loop-many-models, typed events as
contract, two-agent path ownership
- limitations: two-host fleet, synthetic profiles, 10 Hz floor,
KNN cross-host gap
Plus references/links.md gains four real online references (PyTorch,
XGBoost, scikit-learn, proc(5)) bringing the citation count from 8
to 12 — over the assignment's 10-source minimum.
CSS additions cover the new layouts (.problem-claim, .problem-stats,
.research-grid, .pipeline-svg + .pipeline-stage / .pipeline-arrow,
.eval-blocks, .conclusion-grid). Limitations cards reuse the
motivation-card pattern with an armed-phase amber marker for the
"warning" feel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scene-9 embedding handler appends to a `points` array without
ever capping. The producer republishes its (stable, deterministic)
point set on a cycle so reconnecting browsers eventually see the
scatter; each cycle pushes the same N points again and the in-memory
count grows without bound. Browser slows after ~10 min.
Two complementary fixes proposed:
A. FIFO cap (1-line change in the handler — fixes the leak today)
B. embedding_batch event with replace=true (cleaner, pairs with
the snapshot/sticky-cache request for refresh-time hydration)
Producer side has already reduced cadence as a band-aid (200 pts
every 30 s, was 600 every 5 s) — 18x slower accumulation but still
unbounded.
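Both fixes sketched in Python for illustration (the handler itself is
dashboard JavaScript). MAX_POINTS and the function names are
hypothetical.

```python
from collections import deque

MAX_POINTS = 5000  # hypothetical cap; size it to what the scatter draws smoothly

points = deque(maxlen=MAX_POINTS)  # oldest points fall off automatically


def on_embedding(event: dict) -> None:
    """Fix A: append under a FIFO cap so republished cycles cannot grow it."""
    points.append(event)


def on_embedding_batch(batch: list, replace: bool = False) -> None:
    """Fix B: a batch flagged replace=True swaps the whole point set."""
    if replace:
        points.clear()
    points.extend(batch)
```

Fix A bounds memory but still holds duplicates of the republished
set; fix B removes the duplicates as well.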
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval suite at the end of the bootstrap was using ../artifacts and
../data/* paths because they were originally invoked from inside repo/.
Now that we no longer cd into repo, drop the ../ prefix. Same root
cause as the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version did `(cd repo && "${cmd[@]}")` to "cd into repo
for module imports." But PYTHONPATH was already set to $PWD/repo at
the top of the script — so the cd was redundant for imports AND
broke relative paths: the trainer expects to find
data/processed/validation_v1.parquet from $HOME/cis490, not from
$HOME/cis490/repo/.
Symptom: every training job failed immediately with
FileNotFoundError: data/processed/validation_v1.parquet
Drop the cd; PYTHONPATH already does the import work.
Found while running on the A100 today; trainer relaunched manually
in-place via a stand-in bootstrap2.sh; this commit makes the next
bundle clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously every scene rendered at all times — paint, layout, and
the per-scene widgets all ran in parallel. Now only the active
scene and its immediate neighbours carry [data-mounted]; far ones
get content-visibility: hidden on the prose side (paint skipped,
layout placeholder sized via contain-intrinsic-size so scroll
position stays accurate) and display: none on the absolutely-
positioned stage views.
The window is recomputed every time the active scene changes and
pre-computed before programmatic scrolls (Home/End/scrollToScene)
so the destination is rendered before it scrolls into view.
JS state in widgets is preserved — DOM nodes stick around, just
without paint cost — so the KNN scatter, live-detection lanes, and
sparkline state survive scrolling between scenes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New scene 2 (between intro and stack) framing the operational case
for a per-host detector. Three consequence cards on the stage —
network-level trust scoring, containment before pivot, fast
post-attack reset — backed by a prose section that cites IEEE
document 9881803 for the trust-aggregation argument.
Sidecar md for the paper lands in references/ as a citation note;
when the PDF is dropped in with a matching stem it'll show up in
the references viewer automatically. Link added to links.md too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validator's allowed-models frozenset was missing knn and knn_semi
even though the manifest gained those jobs and the model registry
registered the classes. Lambda bootstrap blocked at:
TrainingManifestError: job 'knn-realistic': model 'knn' not in
['cnn', 'gbt', 'gru', 'lstm', 'mlp', 'transformer', 'transformer_ssl']
Now {gbt, knn, knn_semi, mlp, cnn, gru, lstm, transformer, transformer_ssl}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The four code snippets shown on stack and training-code scenes get
inline comments explaining the *why* of each line, not just *what*.
Aimed at the live audience: a presenter reads the comment as the
narration; a reader scans them top-to-bottom for the design story.
Covers: pyproject's three install profiles and what each library
contributes; receiver's bearer auth and why constant-time compare
matters; LSTM model's registry pattern, batch_first transpose,
last-step classification head; trainer loop's class weights vs the
imbalanced dataset, AMP scaler vs fp16 underflow, cosine + warmup
schedule, macro-F1 vs accuracy on imbalanced classes, best-state
restore vs last-epoch weights.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stack scene's pyproject snippet was missing the `training`
group (torch, sklearn, xgboost, zstandard) — the libraries that
do the actual model work. Updated to match the real pyproject.toml.
The receiver snippet now ends at _bearer_check(...) instead of the
import block alone — gives the slide a non-trivial line of code to
read.
The training-code scene replaces the toy "PhaseLSTM" hand-rolled
loop with the real LSTM model class (registry-decorated _SeqBase
subclass + _LSTMClassifier wrapping nn.LSTM with last-step
classification head) and adds a second card showing the actual
train_nn loop: AMP autocast/scaler, cosine LR with linear warmup,
inverse-frequency class weights, gradient clipping, macro-F1
on val, early stop with best-state restore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi has 4 cores; only KNN and tree-based models are realistic to train
here without GPU. While Lambda runs the full 16-job manifest in
parallel (~1.7h), this chain trains the CPU-friendly subset on the
Pi (~30 min) so scenes 8 & 12 populate with multi-model numbers
within minutes instead of waiting on Lambda's full cycle.
Order: gbt-realistic, knn-realistic, knn-oracle, knn_semi-realistic,
knn_semi-oracle. Skips models whose .ckpt.json already exists
(idempotent restart). Each is a subprocess of training/trainer/run.py
so XGBoost/numpy/sklearn don't fight each other for cores.
Caller is expected to start gbt-oracle separately (it's the longest
single training and we kicked it off before invoking this script).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
profiles.py — non-shortcut fit:
Old: pick one accepted episode per profile, emit its raw
fraction-of-duration curve. Confounded by single-episode noise,
phase-budget timing variance, and the cumulative-counter
startup-spike artifact.
New: aggregate up to N=100 accepted episodes per profile, slice each
by labels.jsonl phase events, resample EACH PHASE to a fixed
budget so the median across episodes captures the canonical
per-phase shape rather than smearing peaks across the timeline.
Save median + p25/p75 band to data/processed/attack_profiles_v1.parquet.
Per-phase point budget (sums to 80):
clean_lead 10, armed 5, infecting 10, infected_running 40,
clean_tail 15. dormant (when present) folded into infected_running.
Channel swap: io-walk uses proc.cpu_sys_jiffies, NOT
proc.io_write_bytes. Host /proc on QEMU doesn't see virtio-blk
writes via io.write_bytes (writes go through KVM's I/O path, not
write() syscalls); cpu_sys_jiffies tracks kernel time which spikes
during heavy I/O scheduling.
Concrete result: cpu-saturate now shows the proper plateau-during-
infected_running with peak at 100 j/s (was 30 j/s spike at idx 0
then mostly zero); low-and-slow shows its distinctive low-amplitude
profile (peak 21 vs cpu-saturate's 100); io-walk shows the
rapid-rise-then-decay shape consistent with dd finishing mid-phase.
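A pure-Python sketch of the per-phase resample-then-median step. It
assumes each episode has already been sliced into {phase: [samples]};
the function names and the 0.0 fill for a phase no episode contains
are hypothetical, and the p25/p75 band is left out for brevity.

```python
from statistics import median

# Per-phase point budget from this message (sums to 80).
PHASE_BUDGET = {"clean_lead": 10, "armed": 5, "infecting": 10,
                "infected_running": 40, "clean_tail": 15}


def resample(values: list, n: int) -> list:
    """Linearly interpolate values onto n evenly spaced positions."""
    if len(values) == 1 or n == 1:
        return [float(values[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def canonical_curve(episodes: list) -> list:
    """Median curve across episodes, phase by phase."""
    curve = []
    for phase, budget in PHASE_BUDGET.items():
        resampled = [resample(ep[phase], budget)
                     for ep in episodes if ep.get(phase)]
        for i in range(budget):
            column = [r[i] for r in resampled]
            curve.append(median(column) if column else 0.0)
    return curve
```

Because every phase is stretched to its own fixed budget first, a
peak inside infected_running lands on the same indices in every
episode, whatever that episode's phase timing was.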
knn.py — sticky model_metric / model_perf:
Stream subcommand gains --also-metric / --also-perf-latency-us
flags. When set, each cycle publishes a model_metric event
(tagged model=knn) for scene-8 (model bars) and a model_perf
event for scene-12 (accuracy vs inference cost). Republishing on
the cycle keeps reconnecting browsers populated without depending
on the dashboard's not-yet-built sticky-event cache.
Measured KNN inference latency on the 150k-trained classifier:
  single-window predict:    61.5 ms (sklearn brute-force at 230 D)
  per-window in batch=64:    3.4 ms (the production-realistic number)
Streamer published: model_metric{knn, 0.762} +
model_perf{knn, latency_us=3410, accuracy=0.762}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New scene 13 (between perf and references) for fleet-wide live
predictions. Each host gets a row of recent prediction cells
(capped at 60), painted by predicted phase; mismatch with ground
truth shows a hatched overlay. A callout below the lanes holds
the most recent detection with model, profile, confidence, and
latency.
Producer contract is the new LiveDetection dataclass in events.py.
The dashboard side is producer-agnostic — the inference loop can
run locally or offload to A100 (or any GPU/host); just POST events
back. No rate-limiting needed; the swim-lane DOM does the capping.
Demo synthesizes 5 hosts walking through phases at ~92% accuracy
so the scene reads as live the moment the deck loads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Producer-side knn fit is saved at data/processed/knn_v1.parquet
(150k rows, 3.4 MB). Live streamer publishes 2000-point cycles every
~2 s, but per PRODUCERS.md §reconnect-gotcha live events aren't
replayed; refresh-to-data is currently bounded by cycle time.
Three options laid out for the dashboard chat to pick:
A. Sticky cache (per-event-type ring buffer in the broadcaster)
B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
C. Caddy fileserver + JS fetch on load
Whichever option lands, the producer side will adapt (e.g., dump a
JSON sidecar if Option C is picked). Path ownership preserved —
dashboard owns dashboard/, producer owns producers/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Registered as `knn_semi`. Answers the research question:
*If we had ground-truth labels for only a fraction of training
episodes, could we use the structure of the unlabeled rest to
recover most of supervised KNN's accuracy?*
Pipeline (Yarowsky-style self-training):
1. Split train slice deterministically into labeled (label_frac=0.2
default) and unlabeled (1 - label_frac) by row-index hash.
2. Fit a "labeler" KNN on the labeled fraction.
3. Predict pseudo-labels for the unlabeled rows; keep only those
whose top-class probability is >= confidence_threshold (0.6).
4. Fit the final KNN on (labeled rows + confident pseudo-labels).
Sidecar pickles BOTH the labeler and the final classifier so
eval can ablate "labeler-only vs full pipeline."
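A dependency-free sketch of the four steps. It swaps in a tiny
majority-vote KNN for sklearn's classifier and uses crc32 as the
row-index hash; both are assumptions for illustration, as are all
function names.

```python
import zlib
from collections import Counter


def is_labeled(row_index: int, label_frac: float = 0.2) -> bool:
    """Step 1: hash the row index into [0, 1) and compare to label_frac."""
    bucket = zlib.crc32(str(row_index).encode()) / 2**32
    return bucket < label_frac


def knn_vote(train, x, k=3):
    """Return (top_class, vote_fraction) from the k nearest (features, label) rows."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[:k]
    votes = Counter(label for _, label in nearest)
    top, count = votes.most_common(1)[0]
    return top, count / len(nearest)


def self_train(rows, label_frac=0.2, confidence_threshold=0.6, k=3):
    """rows: list of (features, label). Returns the augmented training set."""
    labeled = [r for i, r in enumerate(rows) if is_labeled(i, label_frac)]
    if not labeled:
        return []
    # True labels of the unlabeled rows are deliberately dropped here.
    unlabeled = [r[0] for i, r in enumerate(rows) if not is_labeled(i, label_frac)]
    augmented = list(labeled)
    for x in unlabeled:
        pseudo, confidence = knn_vote(labeled, x, k)   # steps 2-3: labeler KNN
        if confidence >= confidence_threshold:
            augmented.append((x, pseudo))              # keep confident pseudo-labels
    return augmented                                   # step 4: fit final KNN on this
```

The final classifier is then fitted on the returned set; keeping the
labeler around separately is what makes the ablation possible.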
Smoke run (567-episode subset, oracle mode, label_frac=0.2):
                          val_macro_f1   test_macro_f1
  knn      (100% labels)      0.737          0.133
  knn_semi (20% labels)       0.654          0.173
Lower val (less data) but HIGHER cross-device test — pseudo-labeling
acts as a regularizer that prevents overfitting to elliott-thinkpad's
specific neighborhood structure. Honest research finding worth a slide
in the writeup.
Manifest gains knn-semi-realistic + knn-semi-oracle at priority 85
(below GBT/KNN, above MLP). Storage cost = augmented set × n_features
× 4 bytes; same .knn.pkl sidecar format as plain KNN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that make scene-11 actually look like a clustering scene:
1. Supervised projection (--projector lda | umap | pca)
- PCA was variance-greedy and oblivious to phase labels — clumped
classes together because the dominant variance directions weren't
class-discriminative.
- LDA (default): Fisher Linear Discriminant. Linear, fast (~seconds),
reproducible. On 150k windows the three axes carry 0.462 / 0.331 /
0.167 of the between-class variance (96% of class-discriminative
info in the first 3 dims).
- UMAP (--projector umap): supervised nonlinear manifold embedding;
tighter visual clusters at the cost of ~10 minutes for 150k on a
Pi-class CPU. Reproducible via random_state. Subsamples to 20k for
fit then transforms remaining points.
- PCA still available for reference / debugging.
2. Batched concurrent publish (--burst-size N)
- Sequential publish was ~6.5 ms/event over loopback HTTP → 13 s
per 2000-point cycle.
- asyncio.gather with burst_size=50 turns each batch into ~5 ms,
so the same cycle is ~0.5 s. Browsers see the scatter populate
in well under a second instead of waiting through a 13 s cycle
per refresh.
- Default burst_size=50 is conservative — the dashboard's WebSocket
fan-out can take more pressure but 50 leaves headroom.
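A sketch of the burst pattern using asyncio.gather. The publish()
coroutine is a hypothetical stand-in for the HTTP POST to the
dashboard; only the burst size and cycle size come from this message.

```python
import asyncio


async def publish(event: dict, sent: list) -> None:
    """Stand-in for one HTTP POST to the dashboard (hypothetical)."""
    await asyncio.sleep(0)      # yield to the loop, as real network I/O would
    sent.append(event)


async def publish_in_bursts(events: list, sent: list, burst_size: int = 50) -> int:
    """Send events burst_size at a time; each burst runs concurrently."""
    bursts = 0
    for start in range(0, len(events), burst_size):
        batch = events[start:start + burst_size]
        await asyncio.gather(*(publish(e, sent) for e in batch))
        bursts += 1
    return bursts


sent_events = []
burst_count = asyncio.run(
    publish_in_bursts([{"i": i} for i in range(2000)], sent_events))
```

Awaiting each gather before starting the next burst is what caps
in-flight requests at burst_size.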
Saved fit format unchanged (data/processed/knn_v1.parquet); the
streamer's --load-fit reads the same parquet regardless of which
projector produced it. The LDA / UMAP choice is captured in the
producer's log + saved parquet metadata, not in the file shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project around mean ± k·σ instead of the raw [0,1]³ producer-unit
cube. PCA-3 outputs are Gaussian-ish so even after the producer's
min/max rescale, the bulk of points clusters near the centroid;
without auto-fit the scatter looks dead-centre and tiny.
Implementation: incremental Welford-ish stats (running sum / sum²)
per axis, recomputed lazily on the first frame after new data
arrives. project() centers and σ-scales each point to ~[-0.5, 0.5];
outliers clamp to ±0.7 so they're visible just outside the cube.
The bounding cube now traces mean ± k·σ instead of [0,1]³, which is
also the natural visual unit for the "data spread" the user reads
off the screen.
resetStats() runs on demo toggle and is implicit when points are
cleared. SPREAD_K=2.5 puts ~99% of normally-distributed data within
the cube's extent along each axis; MIN_STD=0.02 keeps degenerate
(all-equal) data from exploding the divisor.
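A one-axis sketch of the running stats and projection, in Python for
illustration (the widget is JavaScript). The class and method names
are hypothetical; SPREAD_K, MIN_STD, and the 0.7 clamp are the values
quoted above.

```python
import math

SPREAD_K = 2.5   # cube half-width in standard deviations
MIN_STD = 0.02   # floor that keeps all-equal data from dividing by ~0
CLAMP = 0.7      # outliers are pinned just outside the +/-0.5 cube


class AxisStats:
    """Running sum / sum-of-squares for one axis."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value: float) -> None:
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def mean(self) -> float:
        return self.total / self.n if self.n else 0.0

    def std(self) -> float:
        if self.n == 0:
            return MIN_STD
        variance = max(self.total_sq / self.n - self.mean() ** 2, 0.0)
        return max(math.sqrt(variance), MIN_STD)

    def project(self, value: float) -> float:
        """Map mean +/- SPREAD_K*std onto [-0.5, 0.5], clamping outliers."""
        scaled = (value - self.mean()) / (2 * SPREAD_K * self.std())
        return max(-CLAMP, min(CLAMP, scaled))
```

One AxisStats per axis gives the 3-D projection; resetting is just
replacing the three instances.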
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KNN fit output (PCA-3 + KMeans + KNN-classifier predictions per
window) is a derived artifact regenerable from features_window_v1.
Like features_window itself it stays out of git; the streamer
reads it from disk on the producing host.
The model-bar widget rendered .model-fill.knn with no gradient when
a model_metric{model:"knn"} arrived, leaving an empty track. Add a
green gradient and include knn in the demo-mode set so the row is
visible without waiting on the producer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fit pipeline (PCA-3 + KMeans + KNN classifier) can be expensive
to recompute every time a producer starts. `produce --fit-out` already
dumps the per-window (x, y, z, phase_int, predicted_int, cluster) to a
parquet; this commit adds a `stream` subcommand that loads that
parquet and publishes Embedding events on a loop.
Why a separate streamer:
- The dashboard's live event stream is not replayed on browser
reconnect (PRODUCERS.md §reconnect-gotcha). A browser that
connects 30 s after the last cycle of the producer sees an empty
scatter unless we re-publish.
- The fit is deterministic given (features, seed) — no need to
repeat it just to re-publish points. The streamer is small and
stateless; it can run as a long-lived service.
Usage:
  python -m training.producers.knn produce \
      --window data/processed/features_window_v1.parquet \
      --schema data/processed/feature_schema_v1.json \
      --fit-out data/processed/knn_v1.parquet \
      --no-publish
  python -m training.producers.knn stream \
      --load-fit data/processed/knn_v1.parquet \
      --loop --max-points 2000
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The widget no longer rolls the last 5 minutes; it aggregates
time-weighted phase durations across a sampled slice of the
on-disk dataset. The prose now matches the bar.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Non-parametric baseline alongside GBT/MLP/CNN/GRU/LSTM/Transformer.
Same BaseModel + schema-hashed checkpoint contract; sidecar is a
pickled sklearn KNeighborsClassifier (.knn.pkl) handled by the
existing checkpoint machinery alongside .xgb.json / .pt.
KNN's storage cost = n_train_rows × n_kept_features × 4 bytes.
At 660k windows × 145 kept (realistic mode) features = ~380 MB
sidecar; at 230 features (oracle) = ~600 MB. Heavy but ships through
the same artifact-upload path.
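The storage arithmetic, as a check. The function name is
hypothetical, and treating the 4 bytes as float32 values is an
assumption.

```python
BYTES_PER_VALUE = 4  # assumed float32, matching the "x 4 bytes" above


def knn_sidecar_bytes(n_train_rows: int, n_kept_features: int) -> int:
    """Raw size of the memorized training matrix."""
    return n_train_rows * n_kept_features * BYTES_PER_VALUE


realistic_mb = knn_sidecar_bytes(660_000, 145) / 1e6   # about 383 MB
oracle_mb = knn_sidecar_bytes(660_000, 230) / 1e6      # about 607 MB
```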
trainer/run.py learns a third fit branch:
- GBT — XGBoost early stopping on val mlogloss
- KNN — fit() memorizes; "training time" is val/test predict cost
- NN — train_nn loop (the rest)
Manifest gains knn-realistic + knn-oracle at priority 95 (just
below GBT). KNN's k=10 default lives in the model class — overriding
via hyper.k requires adding --k to run.py first to avoid the
unknown-arg exit-2 issue.
Smoke verified on the 567-episode subset:
knn oracle val=0.7365 test=0.1333 (held-out k-gamingcom)
That val/test gap (0.74 → 0.13) is the cross-device generalization
story: KNN memorizes elliott-thinkpad's local feature space and
falls apart on the other host. Honest baseline for the comparison
report.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The widget was waiting on live `phase` events that don't flow when no
orchestrator is running, so it sat empty. Replace the rolling
5-minute window with a periodic feeder that samples 500 random
episode tarballs from /var/lib/cis490/episodes, extracts each
labels.jsonl, and aggregates phase durations using consecutive
t_mono_ns deltas. Result lands in broadcaster.state["phase_mix"]
(survives snapshot cycles via dict.update) and re-broadcasts every
~10 min.
Frontend reads phase_mix from snapshot on connect and from live
phase_mix events on refresh; the bar uses time-weighted proportions
when available (falls back to label counts), and only sums canonical
phases for the denominator so non-displayed `failed` records don't
shrink the visible bars. Eyebrow and sub-line update with live
sample/population/label counts.
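A sketch of the duration aggregation and the canonical-only
denominator. The record layout ({"t_mono_ns": ..., "phase": ...}) and
the function names are assumptions about labels.jsonl, not its
confirmed schema.

```python
import json
from collections import defaultdict


def phase_seconds(label_lines: list) -> dict:
    """Sum time per phase from consecutive t_mono_ns deltas.

    A phase lasts until the next record's timestamp, so the final
    record contributes no duration.
    """
    records = sorted((json.loads(line) for line in label_lines),
                     key=lambda r: r["t_mono_ns"])
    totals = defaultdict(float)
    for current, following in zip(records, records[1:]):
        delta_ns = following["t_mono_ns"] - current["t_mono_ns"]
        totals[current["phase"]] += delta_ns / 1e9
    return dict(totals)


def phase_mix(totals: dict, canonical: set) -> dict:
    """Proportions over canonical phases only, so others do not shrink the bars."""
    denom = sum(v for k, v in totals.items() if k in canonical)
    if denom == 0:
        return {}
    return {k: v / denom for k, v in totals.items() if k in canonical}
```

Passing the canonical set explicitly keeps `failed` time out of the
denominator without hiding it from the raw totals.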
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KNN-driven embedding events for the dashboard's KNN scatter scene
(scene 11). One forward pass populates the scatter's coordinates and
all three of its mode-toggle fields:
  x, y, z     PCA-3 projection of the standardized window features
  phase       ground-truth phase from labels.jsonl
  predicted   KNN classifier's prediction (k=10, distance-weighted)
  cluster     MiniBatchKMeans cluster id (k=8 default)
Two subcommands:
  python -m training.producers.knn produce ...   emit Embedding events
  python -m training.producers.knn metric ...    publish ModelMetric{knn} on a
                                                 tick (re-publish for
                                                 reconnect-warmth)
KNN classifier uses the held-out-by-host split aligned with the
supervised pipeline (train ∪ val on elliott-thinkpad, predict on
k-gamingcom) so the predictions reflect cross-device generalization,
not in-distribution self-prediction.
Smoke-verified end-to-end against the live dashboard (3 clients):
800 embedding events delivered in 12 s; ModelMetric{knn} with
test_macro_f1 = 0.4297 on the 567-episode smoke subset, sitting
between the trained GBT (0.557) and the under-trained NN models
(0.09–0.18) — sensible for a non-parametric baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the SVG 2-D scatter with a canvas-based 3-D one. Three color
modes (phase / predicted / cluster) with a toggle; drag the surface
to rotate; reset button. Bounding cube draws faintly so the rotation
reads as 3-D rather than re-shuffled 2-D.
Embedding event gains optional z / predicted / cluster fields. 2-D
producers still work (z defaults to 0.5, no other behavior changes).
CSS adds .scatter3d-* rules; --theme-h-num exposed for cluster-color
hue arithmetic. Synthetic demo data is now 3-D Gaussian clusters with
~7% mislabeled "predictions" so the predicted-mode view differs from
ground truth at a glance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes per the user's feedback that the slide had unused
horizontal space and needed per-PDF context.
Layout
- The reference scene is now a 2-column grid inside the
metric-stack: PDF iframe at ~1.7fr on the left, description
panel at ~0.55fr on the right (min 280px). On narrow viewports
(<1100px) it falls back to a vertical stack with the
description capped to 240px.
- Added #zoom=page-width to the iframe URL so the PDF's page
fits its column width instead of leaving margins beside an
8.5x11 page rendered in a wider iframe.
- Hide the prose card on the references scene — the description
panel inside the stack covers what the prose was saying, and
freeing the right edge gives the description proper room.
Description content
- Backend reads <stem>.md sidecar files alongside each PDF and
returns the contents in the /api/references payload.
- Frontend renders them with a tiny built-in markdown subset
(headings, bold/italic, lists, inline code, paragraphs) — no
third-party renderer dependency.
- Initial draft sidecar .md files committed for the four PDFs
currently in references/. Each describes how the paper informs
a specific scene of the deck (which model row, which eval
protocol, which channel selection). Edit them in place and the
panel updates on the next reload.