From 9d56bcc923d80800ab9897bd252619b2a9a5362d Mon Sep 17 00:00:00 2001
From: Max
Date: Fri, 8 May 2026 13:54:38 -0500
Subject: [PATCH] =?UTF-8?q?docs:=20request=20to=20dashboard=20side=20?=
 =?UTF-8?q?=E2=80=94=20persist=20KNN=20embeddings=20on=20refresh?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Producer-side knn fit is saved at data/processed/knn_v1.parquet (150k
rows, 3.4 MB). Live streamer publishes 2000-point cycles every ~2 s, but
per PRODUCERS.md §reconnect-gotcha live events aren't replayed;
refresh-to-data is currently bounded by cycle time.

Three options laid out for the dashboard chat to pick:
  A. Sticky cache (per-event-type ring buffer in the broadcaster)
  B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
  C. Caddy fileserver + JS fetch on load

Whichever option lands, the producer side will adapt (e.g., dump a JSON
sidecar if Option C is picked). Path ownership preserved — dashboard
owns dashboard/, producer owns producers/.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ...dashboard-request-embedding-persistence.md | 82 +++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100644 docs/dashboard-request-embedding-persistence.md

diff --git a/docs/dashboard-request-embedding-persistence.md b/docs/dashboard-request-embedding-persistence.md
new file mode 100644
index 0000000..27a67b4
--- /dev/null
+++ b/docs/dashboard-request-embedding-persistence.md
@@ -0,0 +1,82 @@
+# Dashboard request — persist KNN embeddings across reconnects
+
+**Audience:** dashboard session (owns `training/dashboard/`).
+**Producer side:** producer session (`training/producers/knn.py`,
+`data/processed/knn_v1.parquet`).
+**Status:** request, not implementation. Producer side has no opinion
+on which option is picked; whichever you prefer is fine.
+
+## Problem
+
+Scene-11 (KNN scatter) populates from live `embedding` events.
+Per `PRODUCERS.md §reconnect-gotcha`, live events aren't replayed on
+reconnect, so a browser that refreshes mid-cycle sees a blank scatter until
+the next cycle finishes. The current producer (`training/producers/knn.py
+stream --loop`) cycles every ~2 s with `--cycle-pause-s 0`, which keeps
+worst-case refresh-to-data around 2 s. Operator wants ~0 s.
+
+## Producer side already has
+
+- A saved fit at `data/processed/knn_v1.parquet` (3.4 MB, 150k rows,
+  schema: `episode_id, host_id, profile, x, y, z, phase_int,
+  predicted_int, cluster, is_train`).
+- The fit is regenerated by the producer's `knn produce --fit-out`.
+  Whoever rewrites that producer for fresh data updates the same file.
+- The streamer (`knn stream`) is one option for delivery, but isn't
+  required — the file is the source of truth.
+
+## Three options the dashboard could implement
+
+### Option A — sticky cache for embedding events
+
+Inside the broadcaster, add a per-event-type ring buffer (e.g., last
+N=2000 `embedding` events). On each new WebSocket connection, replay
+the buffer contents. The producer keeps publishing live; the dashboard
+remembers what it saw.
+
+**Pros:** producer doesn't change; works for any event type if the
+ring is generic. PRODUCERS.md already mentions this as a future
+"file a request" feature.
+**Cons:** memory grows with the number of buffered event types; needs
+a per-type cap.
+
+### Option B — feeder that reads the parquet at startup
+
+Add a feeder under `training/dashboard/feeder.py` that opens
+`data/processed/knn_v1.parquet` (or a configurable path) and
+populates `broadcaster.state["embedding_cache"]` with a list of point
+dicts. The existing snapshot-on-connect flow then carries it.
+Frontend gains a `snapshot` handler that consumes
+`m.embedding_cache` and renders the points immediately.
+
+**Pros:** survives dashboard restart; doesn't depend on the streamer
+running; cheapest "read from disk" path.
+**Cons:** the file goes stale if the producer regenerates it without
+the dashboard noticing; an mtime watch plus reload would cover that.
+
+### Option C — Caddy fileserver + JS fetch on load
+
+Caddy serves `data/processed/knn_v1.parquet` (or a JSON dump) at a
+known path; the dashboard JS fetches it during `connect()` and
+renders the points before any WebSocket events arrive.
+
+**Pros:** zero runtime cost on the dashboard process; the file is
+just static content.
+**Cons:** parquet isn't browser-native; we'd want a small JSON dump
+beside it. Operator would need to remember to regenerate the JSON
+when the parquet changes (or we add a tiny "dump" step to the
+producer).
+
+## Operator preference (if you ask)
+
+Option B feels most in-spirit with the existing feeder pattern (which
+already reads `index.jsonl` and `host_counts/`). If you want, the
+producer side will:
+
+- Keep the parquet at the same path and shape
+- Add a JSON sidecar `data/processed/knn_v1.json` next to it (cheap,
+  ~3 MB) so the feeder can `json.loads(...)` without parquet deps in
+  the dashboard process
+
+Just tell me which option you want; I'll add whatever the producer
+side needs to land it cleanly.
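
If Option B with the JSON sidecar is picked, the feeder side could look roughly like the sketch below. It is illustrative only and rests on several assumptions: the class name `EmbeddingCacheFeeder` and its `refresh()` method are hypothetical, the sidecar is assumed to be a single JSON array of point dicts, and a plain `dict` stands in for `broadcaster.state`, whose real interface is not shown in this request. The mtime comparison is one simple way to address the staleness concern raised under Option B.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class EmbeddingCacheFeeder:
    """Hypothetical feeder: mirrors a JSON sidecar into a shared state dict.

    Assumes the sidecar is one JSON array of point dicts. `state` is a
    plain dict standing in for broadcaster.state (an assumption).
    """

    def __init__(self, path: Path, state: Dict[str, Any]) -> None:
        self.path = Path(path)
        self.state = state
        # mtime of the file version currently held in the cache, if any.
        self._loaded_mtime_ns: Optional[int] = None

    def refresh(self) -> bool:
        """Reload the sidecar if it is new or has changed on disk.

        Returns True when the cache was replaced, False when the file is
        missing or unchanged since the last successful load.
        """
        try:
            mtime_ns = self.path.stat().st_mtime_ns
        except FileNotFoundError:
            # Producer has not written the sidecar yet; keep any old cache.
            return False
        if mtime_ns == self._loaded_mtime_ns:
            return False
        with self.path.open("r", encoding="utf-8") as fh:
            points = json.load(fh)
        # Swap the whole list in one assignment so readers never see a
        # half-filled cache.
        self.state["embedding_cache"] = points
        self._loaded_mtime_ns = mtime_ns
        return True
```

Calling `refresh()` once at startup and then on a timer (or before each snapshot-on-connect) would keep the cache aligned with the file. Whether polling or a filesystem watcher fits the existing feeder loop better is the dashboard side's call.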