# Dashboard request: persist KNN embeddings across reconnects

**Audience:** dashboard session (owns `training/dashboard/`).

**Producer side:** producer session (`training/producers/knn.py`, `data/processed/knn_v1.parquet`).

**Status:** request, not implementation. Producer side has no opinion on which option is picked; whichever you prefer is fine.

## Problem

Scene-11 (KNN scatter) populates from live `embedding` events. Per `PRODUCERS.md §reconnect-gotcha`, live events aren't replayed on reconnect, so a browser that refreshes mid-cycle sees a blank scatter until the next cycle finishes. The current producer (`training/producers/knn.py stream --loop`) cycles every ~2 s with `--cycle-pause-s 0`, which keeps worst-case refresh-to-data around 2 s. Operator wants ~0 s.

## Producer side already has

- A saved fit at `data/processed/knn_v1.parquet` (3.4 MB, 150k rows, schema: `episode_id, host_id, profile, x, y, z, phase_int, predicted_int, cluster, is_train`).
- The fit is regenerated by the producer's `knn produce --fit-out`. Whoever rewrites that producer for fresh data updates the same file.
- The streamer (`knn stream`) is one option for delivery, but isn't required; the file is the source of truth.

## Three options the dashboard could implement

### Option A: sticky cache for embedding events

Inside the broadcaster, add a per-event-type ring buffer (e.g., last N=2000 `embedding` events). On each new WebSocket connection, replay the buffer contents. The producer keeps publishing live; the dashboard remembers what it saw.

**Pros:** producer doesn't change; works for any event type if the ring is generic. PRODUCERS.md already mentions this as a future "file a request" feature.

**Cons:** memory grows with the number of buffered event types; needs a per-type cap.
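
For reference, the ring buffer in Option A could be sketched roughly as below. This is only an illustration, not a claim about how the broadcaster is structured: the `StickyCache` class, its method names, and the assumption that each event is a dict carrying a `type` key are all hypothetical. The cap of 2000 is the example value from the option text.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]  # assumed event shape: a dict with a "type" key


class StickyCache:
    """Per-event-type ring buffers with a fixed cap per type (hypothetical helper)."""

    def __init__(self, caps: Dict[str, int]) -> None:
        # Only event types listed in `caps` are buffered, which bounds memory.
        self._buffers: Dict[str, Deque[Event]] = {
            event_type: deque(maxlen=cap) for event_type, cap in caps.items()
        }

    def record(self, event: Event) -> None:
        """Remember an event if its type is buffered; ignore it otherwise."""
        buf = self._buffers.get(event.get("type"))
        if buf is not None:
            buf.append(event)  # a full deque silently drops its oldest entry

    def replay(self) -> List[Event]:
        """Return buffered events, oldest first within each type."""
        out: List[Event] = []
        for buf in self._buffers.values():
            out.extend(buf)
        return out


# Example: cap `embedding` at 2000 events; other types are not retained.
cache = StickyCache({"embedding": 2000})
cache.record({"type": "embedding", "x": 0.1, "y": 0.2, "z": 0.3})
cache.record({"type": "heartbeat"})
```

The broadcaster would call `record()` wherever it already fans events out, and send `replay()` to each new WebSocket connection before live traffic. Where exactly those hooks live is for the dashboard side to decide.
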
### Option B: feeder that reads the parquet at startup

Add a feeder under `training/dashboard/feeder.py` that opens `data/processed/knn_v1.parquet` (or a configurable path) and populates `broadcaster.state["embedding_cache"]` with a list of point dicts. The existing snapshot-on-connect flow then carries it. Frontend gains a `snapshot` handler that consumes `m.embedding_cache` and renders the points immediately.

**Pros:** survives dashboard restart; doesn't depend on the streamer running; cheapest "read from disk" path.

**Cons:** the cache goes stale if the producer regenerates the file without the dashboard noticing. Watching the file's mtime and reloading on change would cover this.

### Option C: Caddy fileserver + JS fetch on load

Caddy serves `data/processed/knn_v1.parquet` (or a JSON dump) at a known path; the dashboard JS fetches it during `connect()` and renders the points before any WebSocket events arrive.

**Pros:** zero runtime cost on the dashboard process; the file is just static content.

**Cons:** parquet isn't browser-native, so we'd want a small JSON dump beside it. Operator would need to remember to regenerate the JSON when the parquet changes (or we add a tiny "dump" step to the producer).

## Operator preference (if you ask)

Option B feels most in-spirit with the existing feeder pattern (which already reads `index.jsonl` and `host_counts/`). If you want, the producer side will:

- Keep the parquet at the same path and shape.
- Add a JSON sidecar `data/processed/knn_v1.json` next to it (cheap, ~3 MB) so the feeder can `json.loads(...)` without parquet deps in the dashboard process.

Just tell me which option you want; I'll add whatever the producer side needs to land it cleanly.
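
If Option B with the JSON sidecar is the pick, the feeder side might look something like the sketch below. It is a rough illustration under stated assumptions: the sidecar is assumed to hold a single JSON array of point objects (that layout is not fixed yet), `state` is a plain dict standing in for `broadcaster.state`, and the function names `load_points` and `refresh_embedding_cache` are made up for this example.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def load_points(path: Path) -> List[Dict[str, Any]]:
    """Read the sidecar; assumed to contain one JSON array of point objects."""
    points = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(points, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return points


def refresh_embedding_cache(
    state: Dict[str, Any], path: Path, last_mtime_ns: Optional[int]
) -> Optional[int]:
    """Reload state["embedding_cache"] when the file's mtime has changed.

    Returns the mtime now reflected in the cache. If the file is missing,
    the cache is left untouched and the previous mtime is returned.
    """
    try:
        mtime_ns = path.stat().st_mtime_ns
    except FileNotFoundError:
        return last_mtime_ns
    if mtime_ns != last_mtime_ns:
        state["embedding_cache"] = load_points(path)
    return mtime_ns


# Hypothetical wiring: call once at startup with last_mtime_ns=None, then
# periodically with the value returned by the previous call.
state: Dict[str, Any] = {}
last_seen = refresh_embedding_cache(state, Path("data/processed/knn_v1.json"), None)
```

Polling the mtime keeps the sketch dependency-free; a filesystem watcher would also work if the dashboard already uses one.
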