Dashboard request — sticky cache for slowly-changing event types
Audience: dashboard session (owns training/dashboard/).
Producer side: training/producers/multi_model_metrics.py
(scenes 9 + 12), training/producers/knn.py stream (scene 11),
Lambda-side live_detection_loop_v2.py (scene 13).
Problem
The broadcaster fans events out to currently-connected browsers only. Reconnects (page refresh, second tab opening, mid-talk page reload) see empty widgets until the next producer tick rebroadcasts. The user has explicitly flagged this as a bug:
"Your functions need to be more stateful, when we call your data it needs to be available right away. For the streaming data, when we call a new page it needs to connect correctly."
The broadcaster already does sticky caching for some keys — its
/healthz reports cached state under host_counts, phase_mix,
recent_episodes, total_alerts, total_bytes, total_episodes.
What's missing is sticky caching for the model + scatter + embedding
event types.
Producer-side band-aid (already in place)
We've shortened the multi_model_metrics tick from 20 s to 5 s, so the worst-case staleness on reconnect drops to ~5 s. That's acceptable for the talk but not the right architecture: at a 5 s tick × 4 events × 2 event types, we're spending bandwidth and CPU on retransmits the broadcaster could just remember.
Asks
Please add sticky caching to the broadcaster for these event types:
| event type | scene | key | TTL | replay-on-connect? |
|---|---|---|---|---|
| model_metric | 9 | one entry per model (last value wins) | none | yes |
| model_perf | 12 | one entry per model (last value wins) | none | yes |
| live_detection | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
| embedding | 11 | one snapshot; see companion request dashboard-request-knn-cap-evict.md for the snapshot-replace pattern | none | yes |
| attack_profile | 7 | one entry per name (last curve wins) | none | yes |
| prediction | 8 | one entry per (episode_id, window_idx) (last value wins) | none | yes |
Implementation suggestion: extend the broadcaster's existing state-keys cache with a per-event-type "sticky map." On new client connect, replay the cache before any live event reaches the new client.
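A minimal sketch of what such a sticky map could look like, assuming a Python broadcaster with synchronous per-client send callbacks. The class names (StickyMap, Broadcaster), the event field names ("type", "model", "name") and the key extractors are all hypothetical stand-ins, not the broadcaster's real API or event schema; only the event type names and the (episode_id, window_idx) key come from the table above.

```python
from typing import Any, Callable, Dict, Hashable, List

Event = Dict[str, Any]
Send = Callable[[Event], None]

# Hypothetical key extractors: the field names are assumptions for illustration.
KEY_FUNCS: Dict[str, Callable[[Event], Hashable]] = {
    "model_metric": lambda e: e["model"],
    "model_perf": lambda e: e["model"],
    "attack_profile": lambda e: e["name"],
    "prediction": lambda e: (e["episode_id"], e["window_idx"]),
    "embedding": lambda e: "snapshot",  # one snapshot, replaced wholesale
}


class StickyMap:
    """Last-value-wins cache: one dict per sticky event type, no TTL."""

    def __init__(self) -> None:
        self._cache: Dict[str, Dict[Hashable, Event]] = {}

    def remember(self, event: Event) -> None:
        key_func = KEY_FUNCS.get(event["type"])
        if key_func is None:
            return  # not a sticky event type, nothing to cache
        self._cache.setdefault(event["type"], {})[key_func(event)] = event

    def replay(self) -> List[Event]:
        # Flatten every cached entry across all sticky event types.
        return [e for per_type in self._cache.values() for e in per_type.values()]


class Broadcaster:
    """Hypothetical stand-in for the real broadcaster's fan-out loop."""

    def __init__(self) -> None:
        self.sticky = StickyMap()
        self.clients: List[Send] = []

    def connect(self, send: Send) -> None:
        # Replay first, register second: the new client cannot receive a
        # live event before it has seen the cached state.
        for event in self.sticky.replay():
            send(event)
        self.clients.append(send)

    def publish(self, event: Event) -> None:
        self.sticky.remember(event)
        for send in self.clients:
            send(event)
```

In a concurrent broadcaster the replay and the registration would need to happen under the same lock (or on the same event-loop turn) as publish, otherwise a live event could slip in between the two steps.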
For live_detection the right structure is a ring buffer: 60 cells
per lane matches the widget's DOM cap, and replaying the 60 newest
events lets a new browser paint the lanes immediately.
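The ring buffer could be as small as the sketch below, assuming a Python broadcaster. LiveDetectionRing and the "seq" field in the usage example are hypothetical names; the size of 60 is the figure from the text. collections.deque with maxlen discards the oldest entry automatically once the buffer is full.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]

RING_SIZE = 60  # from the request: 60 newest events


class LiveDetectionRing:
    """Keeps only the newest `size` live_detection events."""

    def __init__(self, size: int = RING_SIZE) -> None:
        # A bounded deque drops from the left when appending past maxlen.
        self._ring: Deque[Event] = deque(maxlen=size)

    def push(self, event: Event) -> None:
        self._ring.append(event)

    def replay(self) -> List[Event]:
        # Oldest first, so a new client paints cells in arrival order.
        return list(self._ring)


ring = LiveDetectionRing()
for seq in range(100):
    ring.push({"type": "live_detection", "seq": seq})
replayed = ring.replay()
```

After 100 pushes, replayed holds the 60 newest events (seq 40 through 99). The per-host variant mentioned in the table would be a dict of such rings keyed by host_id, each with a smaller size.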
Verification
After this lands, our producers can drop their republish cadence
back to a sane 30 s + on-change-only, and a cold page-load on
dashboard.wg paints scenes 9, 11, 12, 13 within one frame.
We'll also drop the 5 s tick on multi_model_metrics once we
verify replay works.
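
On the producer side, the "30 s + on-change-only" cadence could be gated roughly as follows. This is a sketch under assumptions: PublishGate is a hypothetical helper, not code that exists in training/producers/, and the key and payload shapes are illustrative.

```python
import time
from typing import Any, Callable, Dict, Tuple


class PublishGate:
    """Allow a publish when the payload changed or the heartbeat elapsed."""

    def __init__(
        self,
        heartbeat_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._heartbeat_s = heartbeat_s
        self._clock = clock
        # key -> (last published payload, time it was published)
        self._last: Dict[str, Tuple[Any, float]] = {}

    def should_publish(self, key: str, payload: Any) -> bool:
        now = self._clock()
        prev = self._last.get(key)
        if prev is not None:
            prev_payload, prev_time = prev
            unchanged = payload == prev_payload
            if unchanged and now - prev_time < self._heartbeat_s:
                return False  # same value, heartbeat not yet due
        self._last[key] = (payload, now)
        return True
```

A producer loop would call should_publish once per model per tick and skip the send when it returns False, so an unchanged metric goes out at most once every 30 s while a changed one goes out immediately.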