docs: request to dashboard side — persist KNN embeddings on refresh
Producer-side KNN fit is saved at data/processed/knn_v1.parquet (150k rows,
3.4 MB). The live streamer publishes 2000-point cycles every ~2 s, but per
PRODUCERS.md §reconnect-gotcha live events aren't replayed; refresh-to-data
is currently bounded by cycle time.

Three options laid out for the dashboard chat to pick:

A. Sticky cache (per-event-type ring buffer in the broadcaster)
B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
C. Caddy fileserver + JS fetch on load

Whichever option lands, the producer side will adapt (e.g., dump a JSON
sidecar if Option C is picked). Path ownership preserved: dashboard owns
dashboard/, producer owns producers/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent: 2aa7b865fb
Commit: 9d56bcc923
1 changed file with 82 additions and 0 deletions: docs/dashboard-request-embedding-persistence.md (new file)

# Dashboard request: persist KNN embeddings across reconnects

**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side:** producer session (`training/producers/knn.py`,
`data/processed/knn_v1.parquet`).
**Status:** request, not implementation. The producer side has no opinion
on which option is picked; whichever you prefer is fine.

## Problem

Scene-11 (KNN scatter) populates from live `embedding` events. Per
`PRODUCERS.md §reconnect-gotcha`, live events aren't replayed on
reconnect, so a browser that refreshes mid-cycle sees a blank scatter
until the next cycle finishes. The current producer
(`training/producers/knn.py stream --loop`) cycles every ~2 s with
`--cycle-pause-s 0`, which keeps worst-case refresh-to-data around 2 s.
The operator wants ~0 s.

## Producer side already has

- A saved fit at `data/processed/knn_v1.parquet` (3.4 MB, 150k rows,
  schema: `episode_id, host_id, profile, x, y, z, phase_int,
  predicted_int, cluster, is_train`).
- The fit is regenerated by the producer's `knn produce --fit-out`.
  Whoever rewrites that producer for fresh data updates the same file.
- The streamer (`knn stream`) is one option for delivery, but isn't
  required; the file is the source of truth.
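
Since the file is the contract between the two sides, a consumer may want to check a row against the schema listed above before rendering it. A minimal sketch follows; the column names come from the list above, while `KNN_COLUMNS` and `missing_columns` are hypothetical names, and nothing here is taken from the actual producer code.

```python
from typing import Any, List, Mapping

# Column names as listed in the schema above; the constant name is illustrative.
KNN_COLUMNS = (
    "episode_id", "host_id", "profile",
    "x", "y", "z",
    "phase_int", "predicted_int", "cluster", "is_train",
)


def missing_columns(row: Mapping[str, Any]) -> List[str]:
    """Return the schema columns absent from one row (a dict-like record)."""
    return [name for name in KNN_COLUMNS if name not in row]
```

An empty return value means the row carries every expected column; the sketch does not check value types, which this document does not specify.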

## Three options the dashboard could implement

### Option A: sticky cache for embedding events

Inside the broadcaster, add a per-event-type ring buffer (e.g., last
N=2000 `embedding` events). On each new WebSocket connection, replay
the buffer contents. The producer keeps publishing live; the dashboard
remembers what it saw.

**Pros:** producer doesn't change; works for any event type if the
ring is generic. PRODUCERS.md already mentions this as a future
"file a request" feature.
**Cons:** memory grows with the number of buffered event types; needs
a per-type cap.
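
One way Option A could look is sketched below, assuming events are dicts with a `"type"` key. That shape, and the names `StickyCache`, `record`, and `replay`, are assumptions for illustration; the broadcaster's real internals are not shown in this document. Only event types given an explicit cap are buffered, which is one way to keep memory bounded.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]


class StickyCache:
    """Per-event-type ring buffer (hypothetical helper, not the real broadcaster)."""

    def __init__(self, caps: Dict[str, int]) -> None:
        # One bounded deque per opted-in event type; deque(maxlen=...) drops
        # the oldest entry automatically once the cap is reached.
        self._buffers: Dict[str, Deque[Event]] = {
            etype: deque(maxlen=cap) for etype, cap in caps.items()
        }

    def record(self, event: Event) -> None:
        """Remember an event if its type has a buffer; ignore it otherwise."""
        buf = self._buffers.get(event.get("type"))
        if buf is not None:
            buf.append(event)

    def replay(self) -> List[Event]:
        """Return buffered events, oldest first within each type."""
        out: List[Event] = []
        for buf in self._buffers.values():
            out.extend(buf)
        return out
```

With `StickyCache({"embedding": 2000})`, the broadcaster would call `record()` on every published event and send the result of `replay()` to each new WebSocket connection before live traffic.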

### Option B: feeder that reads the parquet at startup

Add a feeder in `training/dashboard/feeder.py` that opens
`data/processed/knn_v1.parquet` (or a configurable path) and
populates `broadcaster.state["embedding_cache"]` with a list of point
dicts. The existing snapshot-on-connect flow then carries it.
The frontend gains a `snapshot` handler that consumes
`m.embedding_cache` and renders the points immediately.

**Pros:** survives dashboard restart; doesn't depend on the streamer
running; cheapest "read from disk" path.
**Cons:** the file goes stale if the producer regenerates it
without the dashboard noticing. An mtime watch that triggers a reload
would cover this.
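
A rough sketch of the mtime-based reload is below. To stay within the standard library it assumes the feeder reads a JSON file holding a list of point dicts rather than the parquet itself. The class name `EmbeddingFeeder`, its `refresh` method, and the plain-dict `state` are hypothetical stand-ins for whatever the dashboard's feeder and `broadcaster.state` actually look like.

```python
import json
import os
from typing import Any, Dict, Optional


class EmbeddingFeeder:
    """Loads a JSON list of point dicts into state["embedding_cache"] (sketch)."""

    def __init__(self, path: str, state: Dict[str, Any]) -> None:
        self._path = path
        self._state = state
        self._mtime_ns: Optional[int] = None  # mtime of the last loaded version

    def refresh(self) -> bool:
        """Reload the file if it changed; return True when state was updated."""
        try:
            mtime_ns = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            return False  # nothing on disk yet; keep whatever is cached
        if mtime_ns == self._mtime_ns:
            return False  # unchanged since the last load
        with open(self._path, "r", encoding="utf-8") as fh:
            points = json.load(fh)
        self._state["embedding_cache"] = points
        self._mtime_ns = mtime_ns
        return True
```

Calling `refresh()` once at startup and then on a timer would cover both the initial load and later regenerations; how often to poll is left open here.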

### Option C: Caddy fileserver + JS fetch on load

Caddy serves `data/processed/knn_v1.parquet` (or a JSON dump) at a
known path; the dashboard JS fetches it during `connect()` and
renders the points before any WebSocket events arrive.

**Pros:** zero runtime cost on the dashboard process; the file is
just static content.
**Cons:** parquet isn't browser-native, so we'd want a small JSON dump
beside it. The operator would need to remember to regenerate the JSON
when the parquet changes (or we add a tiny "dump" step to the
producer).
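
The "dump" step could be as small as the sketch below. It takes rows that have already been read from the parquet (the reading itself needs a parquet library and is left out) and writes them as a JSON list via a temporary file plus `os.replace`, so a reader such as a file server never sees a half-written file. The function name `dump_json_sidecar` and the list-of-dicts layout are assumptions, not an agreed format.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def dump_json_sidecar(rows: Iterable[Dict[str, Any]], out_path: str) -> int:
    """Write rows as a JSON list to out_path atomically; return the row count."""
    points = list(rows)
    out_dir = os.path.dirname(os.path.abspath(out_path))
    # Create the temp file in the target directory so os.replace stays on one
    # filesystem and the final swap is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(points, fh, separators=(",", ":"))
        # mkstemp creates the file owner-only; widen it so a separate
        # file-server process can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, out_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return len(points)
```

Running this right after the fit is written would keep the JSON and the parquet in step without relying on the operator to remember.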

## Operator preference (if you ask)

Option B feels most in-spirit with the existing feeder pattern (which
already reads `index.jsonl` and `host_counts/`). If you want, the
producer side will:

- Keep the parquet at the same path and shape
- Add a JSON sidecar `data/processed/knn_v1.json` next to it (cheap,
  ~3 MB) so the feeder can `json.loads(...)` without parquet deps in
  the dashboard process

Just tell me which option you want; I'll add whatever the producer
side needs to land it cleanly.