CIS490/docs/dashboard-request-embedding-persistence.md
Max 9d56bcc923 docs: request to dashboard side — persist KNN embeddings on refresh
Producer-side knn fit is saved at data/processed/knn_v1.parquet
(150k rows, 3.4 MB). Live streamer publishes 2000-point cycles every
~2 s, but per PRODUCERS.md §reconnect-gotcha live events aren't
replayed; refresh-to-data is currently bounded by cycle time.

Three options laid out for the dashboard chat to pick:
  A. Sticky cache (per-event-type ring buffer in the broadcaster)
  B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
  C. Caddy fileserver + JS fetch on load

Whichever option lands, the producer side will adapt (e.g., dump a
JSON sidecar if Option C is picked). Path ownership preserved —
dashboard owns dashboard/, producer owns producers/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:54:38 -05:00


Dashboard request — persist KNN embeddings across reconnects

Audience: dashboard session (owns training/dashboard/).
Producer side: producer session (training/producers/knn.py, data/processed/knn_v1.parquet).
Status: request, not implementation. Producer side has no opinion on which option is picked; whichever you prefer is fine.

Problem

Scene-11 (KNN scatter) populates from live embedding events. Per PRODUCERS.md §reconnect-gotcha, live events aren't replayed on reconnect, so a browser that refreshes mid-cycle sees a blank scatter until the next cycle finishes. The current producer (training/producers/knn.py stream --loop) cycles every ~2 s with --cycle-pause-s 0, which keeps worst-case refresh-to-data around 2 s. The operator wants that to be ~0 s.

Producer side already has

  • A saved fit at data/processed/knn_v1.parquet (3.4 MB, 150k rows, schema: episode_id, host_id, profile, x, y, z, phase_int, predicted_int, cluster, is_train).
  • The fit is regenerated by the producer's knn produce --fit-out. Whoever rewrites that producer for fresh data updates the same file.
  • The streamer (knn stream) is one option for delivery, but isn't required — the file is the source of truth.

Three options the dashboard could implement

Option A — sticky cache for embedding events

Inside the broadcaster, add a per-event-type ring buffer (e.g., last N=2000 embedding events). On each new WebSocket connection, replay the buffer contents. The producer keeps publishing live; the dashboard remembers what it saw.

Pros: producer doesn't change; works for any event type if the ring is generic. PRODUCERS.md already mentions this as a future "file a request" feature. Cons: memory grows with the number of buffered event types; needs a per-type cap.
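A minimal sketch of the ring-buffer idea in Option A, assuming nothing about the real broadcaster. The class name StickyCache and its methods are hypothetical; the only real API used is collections.deque with maxlen, which drops the oldest entry once the cap is reached.

```python
from collections import deque
from typing import Any, Dict, List, Optional, Tuple


class StickyCache:
    """Per-event-type ring buffers (hypothetical helper, not the broadcaster API)."""

    def __init__(self, default_cap: int = 2000,
                 caps: Optional[Dict[str, int]] = None) -> None:
        self._default_cap = default_cap
        self._caps = dict(caps or {})          # optional per-type overrides
        self._buffers: Dict[str, deque] = {}

    def record(self, event_type: str, event: Any) -> None:
        """Remember one live event; the oldest entry falls off at the cap."""
        buf = self._buffers.get(event_type)
        if buf is None:
            cap = self._caps.get(event_type, self._default_cap)
            buf = self._buffers[event_type] = deque(maxlen=cap)
        buf.append(event)

    def replay(self, event_type: str) -> List[Any]:
        """Buffered events of one type, oldest first."""
        return list(self._buffers.get(event_type, ()))

    def replay_all(self) -> List[Tuple[str, Any]]:
        """Everything a newly connected client should receive."""
        return [(t, e) for t, buf in self._buffers.items() for e in buf]
```

On a new WebSocket connection the broadcaster would send each item from replay_all() before switching to live events. Total memory is bounded by the sum of the per-type caps.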

Option B — feeder that reads the parquet at startup

Add a feeder under training/dashboard/feeder.py that opens data/processed/knn_v1.parquet (or a configurable path) and populates broadcaster.state["embedding_cache"] with a list of point dicts. The existing snapshot-on-connect flow then carries it. Frontend gains a snapshot handler that consumes m.embedding_cache and renders the points immediately.

Pros: survives dashboard restart; doesn't depend on the streamer running; cheapest "read from disk" path. Cons: the cached copy goes stale if the producer regenerates the file without the dashboard noticing; an mtime watch that triggers a reload would cover this.
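A rough sketch of the Option B feeder with an mtime check, under stated assumptions: state is a plain dict standing in for broadcaster.state, EmbeddingFeeder and load_json_points are hypothetical names, and the loader is injected so a parquet-based loader with the same signature could replace the stdlib JSON stand-in used here.

```python
import json
import os
from typing import Any, Callable, Dict, List, Optional

Loader = Callable[[str], List[Dict[str, Any]]]


def load_json_points(path: str) -> List[Dict[str, Any]]:
    """Stand-in loader: reads a JSON list of point dicts."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


class EmbeddingFeeder:
    """Fills state["embedding_cache"] and reloads only when the file changes."""

    def __init__(self, path: str, loader: Loader = load_json_points) -> None:
        self.path = path
        self.loader = loader
        self._seen_mtime_ns: Optional[int] = None

    def refresh(self, state: Dict[str, Any]) -> bool:
        """Reload if the file's mtime moved; return True when a reload happened."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self._seen_mtime_ns:
            return False
        state["embedding_cache"] = self.loader(self.path)
        self._seen_mtime_ns = mtime_ns
        return True
```

Calling refresh() once at startup and then on a timer keeps the snapshot current without re-reading an unchanged file. The check is only reliable if the producer replaces the file in a way that updates its mtime.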

Option C — Caddy fileserver + JS fetch on load

Caddy serves data/processed/knn_v1.parquet (or a JSON dump) at a known path; the dashboard JS fetches it during connect() and renders the points before any WebSocket events arrive.

Pros: zero runtime cost on the dashboard process; the file is just static content. Cons: parquet isn't browser-native, so we'd want a small JSON dump beside it. The operator would need to remember to regenerate the JSON when the parquet changes (or we add a tiny "dump" step to the producer).
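The producer-side "dump" step could look roughly like the sketch below. It assumes the rows are already in memory as a list of dicts (how they are read from the parquet is left out), and dump_sidecar is a hypothetical name. The file is written to a temporary path and then swapped in with os.replace, so a file server or browser should never see a half-written JSON.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def dump_sidecar(rows: List[Dict[str, Any]], out_path: str) -> None:
    """Write rows as compact JSON, replacing out_path in a single rename."""
    out_dir = os.path.dirname(os.path.abspath(out_path))
    # Temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(rows, fh, separators=(",", ":"))
        # mkstemp creates the file owner-only; widen it so a separate
        # file-server user can read it (assumption about the deployment).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, out_path)
    except BaseException:
        # Leave no stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Running this right after the fit is regenerated would keep the JSON in step with the parquet without a manual operator step.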

Operator preference (if you ask)

Option B feels most in-spirit with the existing feeder pattern (which already reads index.jsonl and host_counts/). If you want, the producer side will:

  • Keep the parquet at the same path and shape
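  • Add a JSON sidecar data/processed/knn_v1.json next to it so the feeder can json.loads(...) without parquet deps in the dashboard process (cheap to produce, though as uncompressed text for 150k rows it will likely be noticeably larger than the 3.4 MB parquet)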

Just tell me which option you want; I'll add whatever the producer side needs to land it cleanly.