# Dashboard request — sticky cache for slowly-changing event types

**Audience:** dashboard session (owns `training/dashboard/`).

**Producer side:** `training/producers/multi_model_metrics.py`
(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
Lambda-side `live_detection_loop_v2.py` (scene 13).

## Problem

The broadcaster fans events out to **currently-connected** browsers
only. A client that reconnects (page refresh, second tab opening,
mid-talk page reload) sees empty widgets until the next producer tick
rebroadcasts. The user has explicitly flagged this as a bug:

> "Your functions need to be more stateful, when we call your data it
> needs to be available right away. For the streaming data, when we
> call a new page it needs to connect correctly."

The broadcaster already does sticky caching for some keys: its
`/healthz` reports cached state under `host_counts`, `phase_mix`,
`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
What's missing is sticky caching for the model, scatter, and
embedding event types.

## Producer-side band-aid (already in place)

We've shortened the `multi_model_metrics` tick from 20 s → **5 s** so
worst-case staleness on reconnect drops to ~5 s. That's acceptable for
the talk but not the right architecture: at 5 s × 4 events × 2
event types we're spending bandwidth and CPU on retransmits the
broadcaster could just remember.

## Asks

Please add sticky caching to the broadcaster for these event types:

| event type | scene | key | TTL | replay-on-connect? |
|------------------|-------|-----|------|--------------------|
| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes |
| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes |
| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per `host_id`) | none | yes |
| `embedding` | 11 | one snapshot; see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes |
| `prediction` | 8 | one entry per `(episode_id, window_idx)` (last value wins) | none | yes |

Implementation suggestion: extend the broadcaster's existing
state-keys cache with a per-event-type "sticky map." On new client
connect, replay the cache before any live event reaches the new
client.

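To make the sticky-map suggestion concrete, here is a minimal Python sketch of the "last value wins" rows and the replay-before-live ordering. It is not the broadcaster's actual code: the `StickyCache` and `Broadcaster` classes, the `type` field on each event, and the use of plain callables as clients are all assumptions for illustration. Only the key fields (`model`, `name`, `episode_id`, `window_idx`) come from the table.

```python
from typing import Any, Callable, Dict, Hashable, List

Event = Dict[str, Any]

# Key extractors for the "last value wins" rows of the table.
# Assumption: every event carries its event type in a "type" field.
STICKY_KEYS: Dict[str, Callable[[Event], Hashable]] = {
    "model_metric": lambda e: e["model"],
    "model_perf": lambda e: e["model"],
    "attack_profile": lambda e: e["name"],
    "prediction": lambda e: (e["episode_id"], e["window_idx"]),
    "embedding": lambda e: "snapshot",  # one snapshot, replaced wholesale
}


class StickyCache:
    """Hypothetical per-event-type sticky map: event type -> key -> last event."""

    def __init__(self) -> None:
        self._maps: Dict[str, Dict[Hashable, Event]] = {}

    def remember(self, event: Event) -> None:
        key_fn = STICKY_KEYS.get(event.get("type"))
        if key_fn is None:
            return  # not a sticky event type; nothing to cache
        self._maps.setdefault(event["type"], {})[key_fn(event)] = event

    def replay(self) -> List[Event]:
        # One event per (type, key), i.e. the latest value for each entry.
        return [e for per_type in self._maps.values() for e in per_type.values()]


class Broadcaster:
    """Hypothetical stand-in; a client is modelled as a send(event) callable."""

    def __init__(self) -> None:
        self.cache = StickyCache()
        self.clients: List[Callable[[Event], None]] = []

    def connect(self, send: Callable[[Event], None]) -> None:
        # Replay first, register second, so no live event can overtake
        # the cached state for this client.
        for event in self.cache.replay():
            send(event)
        self.clients.append(send)

    def publish(self, event: Event) -> None:
        self.cache.remember(event)
        for send in self.clients:
            send(event)
```

In a concurrent broadcaster the replay and the registration in `connect` would need to happen under the same lock (or on the same event-loop turn) to keep that ordering. Note also that with no TTL the `prediction` map grows with every new `(episode_id, window_idx)` pair, so some cap may be worth discussing.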
For `live_detection` the right structure is a ring buffer: 60 cells
per lane matches the widget's DOM cap, so replaying the 60 newest
events lets a new browser paint the lanes immediately.

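As a rough illustration of the ring-buffer option, the sketch below uses `collections.deque` with `maxlen`, which drops the oldest entry automatically once the buffer is full. The class names and the `host_id` field lookup are assumptions; the capacities (60 globally, 12 per host) are the ones suggested in this request.

```python
from collections import defaultdict, deque
from typing import Any, DefaultDict, Deque, Dict, List

Event = Dict[str, Any]


class LiveDetectionBuffer:
    """Hypothetical global ring buffer: keeps only the newest `capacity` events."""

    def __init__(self, capacity: int = 60) -> None:
        self._events: Deque[Event] = deque(maxlen=capacity)

    def remember(self, event: Event) -> None:
        self._events.append(event)  # oldest event falls off when full

    def replay(self) -> List[Event]:
        return list(self._events)  # oldest first, newest last


class PerHostLiveDetectionBuffer:
    """Hypothetical per-host variant: newest `per_host` events for each host_id."""

    def __init__(self, per_host: int = 12) -> None:
        self._by_host: DefaultDict[str, Deque[Event]] = defaultdict(
            lambda: deque(maxlen=per_host)
        )

    def remember(self, event: Event) -> None:
        self._by_host[event["host_id"]].append(event)

    def replay(self) -> List[Event]:
        # Grouped by host; order is preserved within each host only.
        return [e for events in self._by_host.values() for e in events]
```

The global buffer preserves arrival order across hosts, while the per-host variant guarantees that a quiet host still has something to paint after a burst from a noisy one.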
## Verification

After this lands, our producers can drop their republish cadence
back to a sane 30 s + on-change-only, and a cold page-load on
`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.

We'll also drop the 5 s tick on `multi_model_metrics` once we
verify replay works.
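
One possible reading of "30 s + on-change-only" on the producer side is sketched below: publish an event when its payload changes, and otherwise republish at most once every 30 s as a safety net. The `OnChangePublisher` class, its `offer` method, and the injectable `clock` are hypothetical names, not part of the existing producers.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple

Event = Dict[str, Any]


class OnChangePublisher:
    """Hypothetical helper: send on change, else at most every max_interval_s."""

    def __init__(
        self,
        send: Callable[[Event], None],
        max_interval_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_interval_s = max_interval_s
        self._clock = clock
        # key -> (payload fingerprint, time of last send)
        self._last: Dict[str, Tuple[str, float]] = {}

    def offer(self, key: str, event: Event) -> bool:
        """Send the event if it changed or the republish interval elapsed.

        Returns True when the event was actually sent.
        """
        # Assumption: events are JSON-serialisable, so a sorted dump is a
        # stable fingerprint of the payload.
        fingerprint = json.dumps(event, sort_keys=True)
        now = self._clock()
        previous = self._last.get(key)
        if previous is not None:
            unchanged = previous[0] == fingerprint
            fresh = (now - previous[1]) < self._max_interval_s
            if unchanged and fresh:
                return False  # nothing new and recently sent: skip
        self._send(event)
        self._last[key] = (fingerprint, now)
        return True
```

A producer loop could keep its existing tick and call `offer(model_name, event)` on every pass; only changed values, plus one refresh per 30 s, would reach the broadcaster.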