CIS490/docs/dashboard-request-sticky-cache.md
Max c2a71de4b2 scene 9 bars: paint full zoo + 0–1 visible scale
- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
  gru / lstm / bert (knn handled by knn streamer); read both
  *_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
  non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
  range so honest-low cross-host F1s show as a bar instead of
  clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
  (scenes 7/8/12, sticky cache, lambda-inference-demo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:18:00 -05:00

# Dashboard request — sticky cache for slowly-changing event types
**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side:** `training/producers/multi_model_metrics.py`
(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
Lambda-side `live_detection_loop_v2.py` (scene 13).
## Problem
The broadcaster fans events out to **currently-connected** browsers
only. A client that connects later (page refresh, a second tab, a
mid-talk reload) sees empty widgets until the next producer tick
rebroadcasts.
The user has explicitly flagged this as a bug:
> "Your functions need to be more stateful, when we call your data it
> needs to be available right away. For the streaming data, when we
> call a new page it needs to connect correctly."
The broadcaster already does sticky caching for some keys — its
`/healthz` reports cached state under `host_counts`, `phase_mix`,
`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
What's missing is sticky caching for the model + scatter + embedding
event types.
## Producer-side band-aid (already in place)
We've shortened the `multi_model_metrics` tick from 20 s → **5 s** so
the worst-case staleness on reconnect drops to ~5 s. That's acceptable
for the talk but not the right architecture: at 5 s × 4 events × 2
event types we're spending bandwidth and CPU on retransmits the
broadcaster could just remember.
## Asks
Please add sticky caching to the broadcaster for these event types:
| event type | scene | key | TTL | replay-on-connect? |
|-------------------|-------|--------------------|-------|---------------------|
| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes |
| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes |
| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
| `embedding` | 11 | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes |
| `prediction` | 8 | one entry per `(episode_id, window_idx)` (last value wins) | none | yes |
Implementation suggestion: extend the broadcaster's existing
state-keys cache with a per-event-type "sticky map." On new client
connect, replay the cache before any live event reaches the new
client.
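
The "sticky map" idea for the last-value-wins rows of the table could look roughly like the sketch below. It is not the broadcaster's actual code: the `StickyMap` class, the `KEY_FUNCS` table, and the assumption that each event is a dict carrying a `type` field plus the key fields named in the table are all hypothetical, chosen only to illustrate the pattern.

```python
from typing import Any, Callable, Dict, Hashable, List

Event = Dict[str, Any]

# Hypothetical key extractors, one per sticky event type from the table.
# Assumes events are dicts with a "type" field and the listed key fields.
KEY_FUNCS: Dict[str, Callable[[Event], Hashable]] = {
    "model_metric": lambda e: e["model"],
    "model_perf": lambda e: e["model"],
    "attack_profile": lambda e: e["name"],
    "prediction": lambda e: (e["episode_id"], e["window_idx"]),
    "embedding": lambda e: "snapshot",  # a single slot: newest snapshot replaces the old one
}


class StickyMap:
    """Remembers the latest event per (event type, key); no TTL."""

    def __init__(self) -> None:
        self._cache: Dict[str, Dict[Hashable, Event]] = {}

    def remember(self, event: Event) -> None:
        """Call for every broadcast event; non-sticky types are ignored."""
        key_func = KEY_FUNCS.get(event.get("type"))
        if key_func is None:
            return
        # Last value wins: overwrite whatever was stored under this key.
        self._cache.setdefault(event["type"], {})[key_func(event)] = event

    def replay(self) -> List[Event]:
        """Events to send to a newly connected client before any live event."""
        return [e for per_type in self._cache.values() for e in per_type.values()]
```

On connect, the broadcaster would send everything returned by `replay()` to the new client first and only then attach it to the live fan-out, so the client never sees a live event ahead of the cached state.
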
For `live_detection` the right structure is a ring buffer: 60 cells
per lane matches the widget's DOM cap, so replaying the 60 newest
events lets a new browser paint the lanes immediately.
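
A bounded ring buffer of this kind is a one-liner with the standard library's `collections.deque`. The sketch below is only illustrative: the `LiveDetectionRing` name is hypothetical, and it implements the "last 60 events globally" variant from the table.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]


class LiveDetectionRing:
    """Keeps only the newest `capacity` live_detection events."""

    def __init__(self, capacity: int = 60) -> None:
        # deque(maxlen=...) silently drops the oldest item once full.
        self._events: Deque[Event] = deque(maxlen=capacity)

    def remember(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> List[Event]:
        """Buffered events, oldest first, for a newly connected client."""
        return list(self._events)
```

The per-host variant (last 12 per `host_id`) would be the same structure held in a dict keyed by `host_id`, with `capacity=12` for each entry.
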
## Verification
After this lands, our producers can drop their republish cadence
back to a sane 30 s + on-change-only, and a cold page-load on
`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
We'll also drop the 5 s tick on `multi_model_metrics` once we
verify replay works.
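
On the producer side, "30 s + on-change-only" could be implemented along the lines of the sketch below. The `OnChangePublisher` class and its `send` callback are hypothetical stand-ins for whatever the producers use to publish today; the injectable `clock` exists only to make the behaviour testable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple

Event = Dict[str, Any]


class OnChangePublisher:
    """Sends an event only if it changed or the last send is older than max_age_s."""

    def __init__(
        self,
        send: Callable[[Event], None],
        max_age_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_age_s = max_age_s
        self._clock = clock
        self._last: Dict[Hashable, Tuple[Event, float]] = {}

    def publish(self, key: Hashable, event: Event) -> bool:
        """Returns True if the event was actually sent."""
        now = self._clock()
        previous = self._last.get(key)
        if previous is not None:
            last_event, last_sent = previous
            if last_event == event and now - last_sent < self._max_age_s:
                return False  # unchanged and recent enough: skip the retransmit
        self._send(event)
        self._last[key] = (dict(event), now)  # store a copy so later mutation is detected
        return True
```

With the broadcaster replaying its cache on connect, the 30 s republish acts only as a safety net (for example after a broadcaster restart) rather than as the mechanism that fills empty widgets.
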