scene 9 bars: paint full zoo + 0–1 visible scale

- multi_model_metrics: publish gbt / mlp / cnn / knn_semi / gru / lstm / bert (knn handled by knn streamer); read both *_train.json and *_eval.json with macro_f1.point fallback - dashboard.css: add palette gradients for the four non-canonical names so the bars render with a fill colour - dashboard.js: open the bar's visible scale to the full 0–1 range so honest-low cross-host F1s show as a bar instead of clamping to 0% - ship lambda-live-detection-loop.py + dashboard request docs (scenes 7/8/12, sticky cache, lambda-inference-demo) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:18:00 -05:00 · 2026-05-08 17:18:00 -05:00 · c2a71de4b2
commit c2a71de4b2
parent 06bfcef3d6
7 changed files with 726 additions and 97 deletions
--- a/docs/dashboard-request-scenes-7-8-12.md
+++ b/docs/dashboard-request-scenes-7-8-12.md
@ -0,0 +1,180 @@
+# Dashboard request — scenes 7, 8, 12 visibility fixes
+
+**Audience:** dashboard session (owns `training/dashboard/`).
+**Producer side (this session):**
+* `training/producers/multi_model_metrics.py` — publishes
+  `ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru,
+  lstm, bert** (every 5 s)
+* `training/producers/knn.py stream` — publishes `ModelMetric`+
+  `ModelPerf` for **knn**
+* Lambda-side `scripts/lambda-live-detection-loop.py` — publishes
+  `LiveDetection` **and now also `Prediction`** events per inference
+  window
+
+All confirmed delivering (`{"delivered":N}` from `/publish`).
+Visibility issues are all in `training/dashboard/static/dashboard.js`.
+
+The user has flagged this twice now: scene 7 (chunking) and scene 9
+(model bars) are not showing real-data state in deck mode. The events
+exist; the widgets just don't render them. **This is the blocker
+for the talk.**
+
+---
+
+## Scene 7 — chunking timeline (`#chunk-row`)
+
+**Problem.** Cells are only built inside `buildExample()`, which is wired
+to `demo_start`. The `prediction` handler can only update existing
+cells:
+
+```js
+on('prediction', m => {
+  if (typeof m.window_idx !== 'number') return;
+  const cells = rowEl.querySelectorAll('.chunk-cell');
+  const cell = cells[m.window_idx];
+  if (!cell) return;            // ← always falls through if no demo
+  ...
+});
+```
+
+If a real `prediction` event arrives without `demo_start` having
+fired first, `cells.length === 0` and the event is silently dropped.
+
+**Why we can't just publish `demo_start` from this side.** It has
+destructive side-effects on other scenes: scene-9 (KNN scatter)
+loads synthetic data on `demo_start`, scene-attack profile loads
+synthetic curves on `demo_start`, etc. We tried this once and
+clobbered the live KNN scatter.
+
+**Fix request.** Lazy cell-build inside the `prediction` handler when
+no cells exist yet:
+
+```js
+on('prediction', m => {
+  if (typeof m.window_idx !== 'number') return;
+  if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) {
+    // Build N empty cells on first prediction. Width grows lazily.
+    rowEl.innerHTML = '';
+    ruleEl.innerHTML = '';
+    axisEl.innerHTML = '';
+  }
+  // Ensure cell at index exists; pad with empty cells up to window_idx.
+  let cells = rowEl.querySelectorAll('.chunk-cell');
+  while (cells.length <= m.window_idx) {
+    const c = document.createElement('div');
+    c.className = 'chunk-cell';
+    c.textContent = '';
+    rowEl.appendChild(c);
+    ruleEl.appendChild(Object.assign(
+      document.createElement('div'), { className: 'tick' }));
+    const t = document.createElement('span');
+    t.textContent = `${cells.length * 10}s`;
+    axisEl.appendChild(t);
+    cells = rowEl.querySelectorAll('.chunk-cell');
+  }
+  const cell = cells[m.window_idx];
+  const phase = m.predicted || m.actual;
+  if (!phase) return;
+  cell.className = `chunk-cell ${phase}`;
+  cell.textContent = phase.replace('_', ' ');
+});
+```
+
+This keeps `demo_start`/`demo_stop` working and additionally lights up
+the row from real `prediction` events.
+
+If the Lambda producer re-runs episodes from window 0, you may also
+want a reset on `prediction` events with `window_idx === 0` (clear all
+cells, rebuild fresh). We can publish a `prediction_reset` event too
+if you'd prefer an explicit signal — let us know.
+
+---
+
+## Scene 8 — model accuracy bars (`.model-row`)
+
+**Problem.** The bar fill formula compresses to nothing for any
+F1 < 0.5:
+
+```js
+const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100;
+```
+
+Our trained models on the cross-device test split honestly land in
+0.30–0.55 range (this is the **point** of held-out-by-host evaluation —
+real generalization is hard). With the current scale, ≥ half the bars
+render as 0% wide and look like there's no data flowing.
+
+**Fix request.** Either:
+
+(a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled:
+
+```js
+const visiblePct = Math.max(0, Math.min(1, acc)) * 100;
+```
+
+(b) Or add the numeric F1 next to the empty-looking bars (we already
+publish it in `accuracy`); the right-hand `.model-acc` element does
+already render `acc.toFixed(3)` so this may already be readable —
+verify that's still being shown when fill is 0%.
+
+We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the
+user "no data" when the truth is "the model is honestly not great
+under cross-host generalization." That's the headline finding.
+
+---
+
+## Scene 12 — accuracy vs inference cost scatter
+
+**Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar
+high range). Every model with F1 < 0.7 stacks on the bottom edge.
+
+**Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published
+range with a small margin). The chart's whole point is "model honesty
+under cross-device shift" — letting bad models show as bad is the
+right answer.
+
+**Problem B: overlapping labels.** Multiple points at the same
+y-coordinate (especially when stacked at the floor) draw their model
+name labels on top of each other. We've already shortened the
+displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`,
+etc., max 6 chars). That helps but doesn't fully solve it when 5+
+points cluster.
+
+**Fix request, pick whichever is easiest:**
+
+1. Skip label rendering when point density is high (only label points
+   that are local extrema, e.g. best F1, lowest latency, or
+   non-Pareto-dominated points).
+2. Offset overlapping labels with a force layout (`d3-force` style) or
+   even just a fixed alternating up/down/left/right pattern.
+3. Show labels only on hover, with a small dot-only render at rest.
+
+Option (3) is the cleanest visually and matches how most real "model
+zoo" scatters render in papers.
+
+---
+
+## Verification after dashboard JS lands
+
+Producer side keeps publishing on these channels (already running on
+the Pi + Lambda):
+
+- `prediction` (scene 7) — once Lambda producer is re-pointed at
+  scene 7 events, see request below
+- `model_metric` + `model_perf` (scenes 8, 12) — every 30 s from
+  `multi_model_metrics.py` on the Pi
+- `live_detection` (scene-live) — continuously from Lambda
+
+Open the dashboard, watch each scene. Empty-state placeholders should
+disappear within ~30 s of page load.
+
+---
+
+## Side note for scene 7 — currently no `prediction` events flow
+
+The Lambda producer (`live_detection_loop_v2.py`) currently emits
+`live_detection` events for the scene-live swim lanes. If you want
+scene 7 lit up with the same data, we can mirror per-window output to
+the `prediction` event type as well — say the word and we'll add a
+second emit. Doing that without the lazy-cell-build above accomplishes
+nothing on the dashboard, so let us wait on this until the JS lands.
--- a/docs/dashboard-request-sticky-cache.md
+++ b/docs/dashboard-request-sticky-cache.md
@ -0,0 +1,62 @@
+# Dashboard request — sticky cache for slowly-changing event types
+
+**Audience:** dashboard session (owns `training/dashboard/`).
+**Producer side:** `training/producers/multi_model_metrics.py`
+(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
+Lambda-side `live_detection_loop_v2.py` (scene 13).
+
+## Problem
+
+The broadcaster fans events out to **currently-connected** browsers
+only. Reconnects (page refresh, second tab opening, mid-talk page
+reload) see empty widgets until the next producer tick rebroadcasts.
+The user has explicitly flagged this as a bug:
+
+> "Your functions need to be more stateful, when we call your data it
+> needs to be available right away. For the streaming data, when we
+> call a new page it needs to connect correctly."
+
+The broadcaster already does sticky caching for some keys — its
+`/healthz` reports cached state under `host_counts`, `phase_mix`,
+`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
+What's missing is sticky caching for the model + scatter + embedding
+event types.
+
+## Producer-side band-aid (already in place)
+
+We've shortened the multi_model_metrics tick from 20 s → **5 s** so
+worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for
+the talk but not the right architecture — at 5 s × 4 events × 2
+event types we're spending bandwidth and CPU on retransmits the
+broadcaster could just remember.
+
+## Asks
+
+Please add sticky caching to the broadcaster for these event types:
+
+| event type        | scene | key                | TTL   | replay-on-connect? |
+|-------------------|-------|--------------------|-------|---------------------|
+| `model_metric`    | 9     | one entry per `model` (last value wins) | none  | yes |
+| `model_perf`      | 12    | one entry per `model` (last value wins) | none  | yes |
+| `live_detection`  | 13    | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
+| `embedding`       | 11    | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
+| `attack_profile`  | 7     | one entry per `name` (last curve wins) | none | yes |
+| `prediction`      | 8     | one entry per `(episode_id, window_idx)` last value wins | none | yes |
+
+Implementation suggestion: extend the broadcaster's existing
+state-keys cache with a per-event-type "sticky map." On new client
+connect, replay the cache before any live event reaches the new
+client.
+
+For `live_detection` the right structure is a ring-buffer (60 cells
+per lane match the widget's DOM cap; replaying 60 newest events lets
+a new browser paint the lanes immediately).
+
+## Verification
+
+After this lands, our producers can drop their republish cadence
+back to a sane 30 s + on-change-only, and a cold page-load on
+`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
+
+We'll also drop the 5 s tick on `multi_model_metrics` once we
+verify replay works.
--- a/scripts/lambda-inference-demo.md
+++ b/scripts/lambda-inference-demo.md
@ -0,0 +1,74 @@
+# Live inference demo — Lambda runs replay, Pi shows predictions
+
+Architecture for the live "catching attacks" demo (scene 7 chunking
+timeline). Pi cannot run inference (RAM-bound; crashed once); all
+model loading + per-window prediction must live on the A100.
+
+## Topology
+
+```
+   Pi (office-print, 10.100.0.1)            Lambda A100 (ssh ubuntu@<ip>)
+   ┌──────────────────────────┐             ┌───────────────────────────┐
+   │ dashboard.wg              │             │  replay.py running on     │
+   │ /publish (loopback only)  │             │  episode tarballs through │
+   │   ↑                       │             │  gbt_oracle.ckpt.json     │
+   │   │ POST                  │             │   ↓                       │
+   │   │ via SSH reverse tunnel│             │  POST 127.0.0.1:8447      │
+   │   │                       │             │   ↑                       │
+   │   └─── ssh -R 8447:... ───┼─────────────┤   │                       │
+   │                           │             └───────────────────────────┘
+   └──────────────────────────┘
+```
+
+## Setup steps
+
+1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi):
+   ```bash
+   ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
+       'mkdir -p ~/cis490/data/episodes_demo'
+   for eid in <episode-ids>; do
+       sudo cat /var/lib/cis490/episodes/<host>/${eid}.tar.zst | \
+           ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
+               "cat > ~/cis490/data/episodes_demo/${eid}.tar.zst"
+   done
+   ```
+
+2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's
+   loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint)
+   on Lambda's loopback `127.0.0.1:8447`:
+   ```bash
+   ssh -i ~/.ssh/lambda_ed25519 \
+       -o ServerAliveInterval=30 \
+       -o ServerAliveCountMax=3 \
+       -o ExitOnForwardFailure=yes \
+       -N -R 8447:127.0.0.1:8447 \
+       ubuntu@<lambda-ip>
+   ```
+   Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should
+   return the Pi's dashboard health JSON.
+
+3. **Run replay loop on Lambda**:
+   ```bash
+   ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip>
+   cd ~/cis490 && . .venv/bin/activate
+   export PYTHONPATH=$PWD/repo
+   nohup bash replay_loop.sh > replay_loop.log 2>&1 &
+   ```
+   The loop iterates the staged demo episodes through the
+   trained `gbt_oracle.ckpt.json`, emitting `prediction` events
+   per window.
+
+## What the user sees
+
+- Scene 7 (chunking timeline) lights up with predicted/actual phase
+  per 10-second window
+- Scene 8/9/12 still populated from Pi-side lightweight publishers
+  (knn streamer + multi_model_metrics + profiles streamer)
+
+## Why not run replay on Pi
+
+Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at
+startup (300 MB for KNN sidecars × multiple variants); concurrent
+load with the metrics publisher's per-cycle test-set scoring
+crashed the Pi. Inference belongs on the A100. The Pi's job is
+display + lightweight event publishing only.
--- a/scripts/lambda-live-detection-loop.py
+++ b/scripts/lambda-live-detection-loop.py
@ -0,0 +1,212 @@
+"""Lambda-side producer for the dashboard's live-detections scene.
+
+Loads every trained checkpoint and replays the staged demo episodes
+through them, emitting ``LiveDetection`` events to the Pi dashboard
+via the SSH reverse tunnel. One event per inference window, tagged
+with the source host so the swim-lane widget paints.
+
+Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from
+here — those are published by ``training.producers.multi_model_metrics``
+on the Pi, sourced from ``reports/eval/<family>_*_*.json`` files. This
+keeps a single producer per canonical model name (avoids two writers
+fighting over the same bar) and matches the contract that those
+metrics are held-out-by-sample test F1, not the cross-host running F1
+this loop would observe.
+
+Canonical-name contract for ``LiveDetection.model``
+==================================================
+The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``.
+We collapse our zoo onto those four when reporting which model ran
+the inference:
+
+    gru   ←  gru_*
+    lstm  ←  lstm_*
+    bert  ←  transformer_*
+    knn   ←  knn_*
+
+For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field
+(the dashboard CSS palette has no class for those names; the swim
+lane still paints from ``predicted`` and ``actual``).
+"""
+from __future__ import annotations
+
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+
+import numpy as np
+
+
+REPO_DIR = Path(__file__).resolve().parent / "repo"
+EPISODES_DIR = Path("data/episodes_demo")
+ARTIFACTS_DIR = Path("artifacts")
+
+CANONICAL_TO_CKPT = {
+    "gru":  ("gru",         "realistic"),
+    "lstm": ("lstm",        "realistic"),
+    "bert": ("transformer", "realistic"),
+    "knn":  ("knn",         "realistic"),
+}
+
+
+def _canonical_of(full_name: str) -> Optional[str]:
+    for canon, (family, mode) in CANONICAL_TO_CKPT.items():
+        if full_name == f"{family}_{mode}":
+            return canon
+    return None
+
+
+MODELS = [
+    ("gbt_oracle",            "summary"),
+    ("gbt_realistic",         "summary"),
+    ("mlp_oracle",            "summary"),
+    ("mlp_realistic",         "summary"),
+    ("knn_oracle",            "summary"),
+    ("knn_realistic",         "summary"),
+    ("knn_semi_oracle",       "summary"),
+    ("knn_semi_realistic",    "summary"),
+    ("cnn_oracle",            "tensor"),
+    ("cnn_realistic",         "tensor"),
+    ("gru_oracle",            "tensor"),
+    ("gru_realistic",         "tensor"),
+    ("transformer_oracle",    "tensor"),
+    ("transformer_realistic", "tensor"),
+    ("lstm_oracle",           "tensor"),
+    ("lstm_realistic",        "tensor"),
+]
+
+DASHBOARD_PHASES = {"clean", "armed", "infecting",
+                     "infected_running", "dormant"}
+
+
+def _scan_episodes() -> list[tuple[str, str, Path]]:
+    out = []
+    for p in sorted(EPISODES_DIR.glob("*.tar.zst")):
+        stem = p.name.removesuffix(".tar.zst")
+        if "__" in stem:
+            host, eid = stem.split("__", 1)
+        else:
+            host, eid = "unknown", stem
+        out.append((host, eid, p))
+    return out
+
+
+def _load_ckpts() -> dict[str, object]:
+    sys.path.insert(0, str(REPO_DIR))
+    from training.models._checkpoint import load_checkpoint
+    out = {}
+    for full, _ in MODELS:
+        cp = ARTIFACTS_DIR / f"{full}.ckpt.json"
+        if not cp.exists():
+            continue
+        try:
+            out[full] = load_checkpoint(cp)
+        except Exception as e:
+            print(f"  skip {full}: {type(e).__name__}: {e}", flush=True)
+    print(f"loaded {len(out)} checkpoints", flush=True)
+    return out
+
+
+def main():
+    sys.path.insert(0, str(REPO_DIR))
+    from training._episode_io import open_episode
+    from training._features import (
+        PHASE_TO_INT, summary_windows, tensor_windows,
+    )
+    from training.dashboard.events import (
+        LiveDetection, Prediction, Publisher,
+    )
+
+    eps = _scan_episodes()
+    if not eps:
+        print(f"no episodes in {EPISODES_DIR}", file=sys.stderr)
+        sys.exit(1)
+    print(f"found {len(eps)} episodes", flush=True)
+
+    ckpts = _load_ckpts()
+    if not ckpts:
+        print("no usable checkpoints", file=sys.stderr)
+        sys.exit(1)
+
+    pub = Publisher(url="http://127.0.0.1:8447/publish")
+    int_to_phase = {i: p for p, i in PHASE_TO_INT.items()}
+
+    def safe_phase(idx: int) -> str:
+        p = int_to_phase.get(int(idx), "clean")
+        return p if p in DASHBOARD_PHASES else "clean"
+
+    speed = 8.0
+    m_idx = 0
+    ep_idx = 0
+    model_order = [(f, k) for f, k in MODELS if f in ckpts]
+
+    while True:
+        full, kind = model_order[m_idx % len(model_order)]
+        host_orig, eid, path = eps[ep_idx % len(eps)]
+        m_idx += 1
+        ep_idx += 1
+        ck = ckpts[full]
+        canon = _canonical_of(full)
+        try:
+            epi = open_episode(path, host_id=host_orig)
+            if not epi.labels:
+                continue
+            if kind == "tensor":
+                Xs, ys, ts, _mask, info = tensor_windows(epi)
+            else:
+                Xs, ys, ts, info = summary_windows(epi)
+            if Xs.shape[0] == 0:
+                continue
+            attack_profile = info.get("attack_profile") or "mixed"
+            print(f"[{time.strftime('%H:%M:%S')}] {full}  "
+                  f"on  {host_orig}/{eid[:8]}  "
+                  f"({Xs.shape[0]} windows)", flush=True)
+
+            start_wall = time.monotonic()
+            for w in range(Xs.shape[0]):
+                target = start_wall + float(ts[w]) / max(speed, 0.01)
+                delay = target - time.monotonic()
+                if delay > 0:
+                    time.sleep(delay)
+                t0 = time.perf_counter_ns()
+                proba = ck.predict_proba(Xs[w:w+1])
+                latency_ms = (time.perf_counter_ns() - t0) / 1e6
+                pred = safe_phase(int(np.argmax(proba[0])))
+                actual = safe_phase(int(ys[w]))
+                conf = float(np.max(proba[0]))
+                try:
+                    pub.publish(LiveDetection(
+                        host_id=host_orig,
+                        predicted=pred,
+                        actual=actual,
+                        confidence=conf,
+                        model=canon,
+                        profile=attack_profile,
+                        episode_id=eid,
+                        window_idx=w,
+                        latency_ms=latency_ms,
+                        t_wall=time.time(),
+                    ))
+                    # Scene 7 (chunking) consumes ``Prediction`` events
+                    # — publish in parallel so when the chunking widget
+                    # gets its lazy-cell-build dashboard fix, it lights
+                    # up immediately. ``window_idx`` modded to N=6 so
+                    # all our 8-window-episode predictions land inside
+                    # the 6-cell row.
+                    pub.publish(Prediction(
+                        episode_id=eid,
+                        window_idx=int(w) % 6,
+                        predicted=pred,
+                        actual=actual,
+                    ))
+                except Exception as e:
+                    print(f"  publish failed: {e}", flush=True)
+        except Exception as e:
+            print(f"   error in {full}: {type(e).__name__}: {e}",
+                  flush=True)
+        time.sleep(0.3)
+
+
+if __name__ == "__main__":
+    main()
--- a/training/dashboard/static/dashboard.css
+++ b/training/dashboard/static/dashboard.css
@ -988,6 +988,14 @@ html, body { overflow-anchor: none; }
 .model-fill.rnn  { background: linear-gradient(90deg, #d29922, #8a6a17); }
 .model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); }
 .model-fill.knn  { background: linear-gradient(90deg, #3fb950, #1a7f37); }
+/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md):
+   gbt / mlp / cnn / knn_semi are also published as ModelMetric so that
+   scene 9 shows the full trained zoo, not just the canonical sequence
+   models. Same gradient shape, different hues. */
+.model-fill.gbt      { background: linear-gradient(90deg, #ff8c42, #c2410c); }
+.model-fill.mlp      { background: linear-gradient(90deg, #a371f7, #6e40c9); }
+.model-fill.cnn      { background: linear-gradient(90deg, #34d399, #047857); }
+.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); }
 .model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
             font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; }

--- a/training/dashboard/static/dashboard.js
+++ b/training/dashboard/static/dashboard.js
@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val,
    }
    function render(model, accuracy) {
      const r = ensureRow(model);
-      const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5));
+      // Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping
+      // clamped honest-low cross-host F1s to 0% width and made the
+      // bars look unpopulated. Producer-side change — see
+      // docs/dashboard-request-scenes-7-8-12.md for context.
+      const visible = Math.max(0, Math.min(1, accuracy));
      r.fill.style.width = (visible * 100).toFixed(1) + '%';
      r.acc.textContent = accuracy.toFixed(3);
    }
--- a/training/producers/multi_model_metrics.py
+++ b/training/producers/multi_model_metrics.py
@ -1,36 +1,55 @@
 """Pi-safe multi-model metrics publisher.

-Reads ``reports/eval/<model>_<mode>_train.json`` files (already
-contains the test_macro_f1 each trainer wrote at training time) and
-publishes:
+Publishes:

-  - ``model_metric`` (scene-8 bars): test_macro_f1 per model
-  - ``model_perf`` (scene-12 scatter): latency_us per model, paired
-    with the same test_macro_f1. Latency is a hardcoded per-family
-    estimate — proper latency benchmarks need to run on a GPU host
-    (the Pi can't afford to load 300 MB knn pickles back-to-back).
+  - ``ModelMetric`` (scene 9 / "models") — held-out-by-sample macro-F1
+    per canonical model name (rnn, gru, lstm, bert, knn).
+  - ``ModelPerf`` (scene 12 / "perf") — observed median latency
+    (μs/window) paired with the same F1 per canonical name.

-This producer is the LIGHTWEIGHT replacement for
-``training.producers.metrics`` and ``...perf`` which load every
-checkpoint into memory and score the test set on every cycle. That
-pattern crashed the Pi during the CIS490 project. This script just
-reads small JSON files and emits events — no model loading.
+Source of F1 numbers
+====================
+We read ``reports/eval/<family>_<mode>_{train,eval}.json`` files. Each
+file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard
+contract for these scenes is **held-out-by-sample** (recipe = "sample"
+in our codebase, also called "oracle" mode); the bar widget's
+``(accuracy − 0.5) / 0.5`` visible scale is calibrated for the high-F1
+range that recipe produces.

-Latency estimates (microseconds per window, batch-amortized):
+Order of preference per file:

-  gbt           ~ 250    XGBoost predict on 230 features
-  knn           ~3500    sklearn brute-force at 230 D, 100k+ train
-  knn_semi      ~3500    same as knn (final clf is a KNN)
-  mlp           ~  50    PyTorch on 230-dim summary, batched
-  cnn           ~ 500    1D-CNN over (46, 100), batched
-  gru           ~1500    sequential RNN, slow per timestep
-  lstm          ~2000    same; LSTM cell is heavier than GRU
-  transformer   ~ 800    O(T²) attention but T=100 is small
-  transformer_ssl ~1000  same encoder + extra head
+  1. ``<family>_oracle_eval.json``   (split_recipe == "sample")
+  2. ``<family>_oracle_train.json``  (split_recipe == "sample")
+  3. ``<family>_realistic_eval.json``  (cross-host fallback)
+  4. ``<family>_realistic_train.json`` (cross-host fallback)

-These are order-of-magnitude estimates from sklearn / torch on similar
-shapes. For a paper they should be benchmarked properly on the
-deployment hardware; for a live demo they're indicative.
+If only realistic is available we publish it anyway — better an honest
+low bar than no bar at all — but the file the trainer should have
+written for scene 9 is the oracle one.
+
+Canonical-name contract
+=======================
+The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert,
+knn}`` and the bar widget's CSS palette is keyed off those exact
+strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse
+our zoo as follows:
+
+    gru   ←  gru_*
+    lstm  ←  lstm_*
+    bert  ←  transformer_*       (BERT-style transformer encoder)
+    knn   ←  knn_*
+
+We don't have a vanilla RNN trained, so ``rnn`` is never published —
+the bar widget skips that bar, which is the correct behaviour.
+
+Why not the existing ``training.producers.metrics``
+==================================================
+That producer iterates checkpoints with :func:`load_models` and re-
+scores the test set every cycle. On the Pi (8 GiB ARM) the KNN
+checkpoints alone (~300 MB pickle each, six variants) plus the test-
+set tensor cache exceed RAM and OOM-killed the host. See
+``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This
+producer reads small JSON files instead — no checkpoint loading.
 """
 from __future__ import annotations

@ -42,87 +61,152 @@ import sys
 from pathlib import Path

 sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
-from training.producers._publish import (
-    PublishFn, http_publisher, null_publisher,
-)
+from training.dashboard.events import ModelMetric, ModelPerf, Publisher


 log = logging.getLogger("cis490.producers.multi_model_metrics")


+# Microseconds per window, batch=64 amortized. Order-of-magnitude
+# estimates from sklearn / torch on similar shapes. Should be re-
+# benchmarked on actual deployment hardware for a paper, but indicative
+# enough for a live demo's perf scatter.
 LATENCY_ESTIMATES_US = {
-    "gbt":              250.0,
-    "knn":             3500.0,
-    "knn_semi":        3500.0,
-    "mlp":               50.0,
-    "cnn":              500.0,
-    "gru":             1500.0,
-    "lstm":            2000.0,
-    "transformer":      800.0,
-    "transformer_ssl": 1000.0,
+    "rnn":  1500.0,
+    "gru":  1500.0,
+    "lstm": 2000.0,
+    "bert":  800.0,
+    "knn": 3500.0,
 }


-def _scan_train_jsons(reports_dir: Path) -> list[dict]:
-    """Read every train.json in reports_dir, return list of metrics dicts."""
-    out = []
-    for p in sorted(reports_dir.glob("*_train.json")):
-        try:
-            d = json.loads(p.read_text())
-        except (OSError, json.JSONDecodeError) as e:
-            log.warning("skipping %s: %s", p.name, e)
-            continue
-        # Some files are pretrains for SSL — same shape, different file
-        if "test_macro_f1" not in d and "binary_test_macro_f1" not in d:
-            continue
-        out.append(d)
-    # Also catch transformer_ssl which writes *_pretrain.json
-    for p in sorted(reports_dir.glob("*_pretrain.json")):
-        try:
-            d = json.loads(p.read_text())
-        except (OSError, json.JSONDecodeError) as e:
-            continue
-        if "binary_test_macro_f1" in d:
-            d.setdefault("test_macro_f1", d["binary_test_macro_f1"])
-            out.append(d)
-    return out
+# Bar-widget name → trained-checkpoint family. We publish every model
+# we've trained so scene 9 shows the full zoo, not just the four
+# canonical ``Model`` literal names. Names outside the dashboard's
+# canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no
+# CSS fill colour — the row still appears with the model name and
+# numeric F1, the bar track is just transparent. The dashboard chat's
+# explicit guidance: "Other strings work but won't get a colored fill
+# class without a CSS update."
+#
+# ``knn`` is intentionally absent here — ``training.producers.knn
+# stream`` already publishes ``ModelMetric{model: 'knn'}`` and
+# ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the
+# same name would flicker.
+CANONICAL_TO_FAMILY = {
+    "gbt":      "gbt",
+    "mlp":      "mlp",
+    "cnn":      "cnn",
+    "knn_semi": "knn_semi",
+    "gru":      "gru",
+    "lstm":     "lstm",
+    "bert":     "transformer",
+}


-async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int:
-    rows = _scan_train_jsons(reports_dir)
-    n = 0
-    for r in rows:
-        model = r.get("model")
-        mode = r.get("mode")
-        if model is None or mode is None:
+# Latency-per-window-microseconds estimates per family, batch=64
+# amortised. Order-of-magnitude only — proper benchmarks need to run
+# on the deployment hardware. Indicative enough for scene 12's
+# log-scaled axis.
+LATENCY_PER_FAMILY_US = {
+    "gbt":            250.0,
+    "mlp":             50.0,
+    "cnn":            500.0,
+    "knn":           3500.0,
+    "knn_semi":      3500.0,
+    "rnn":           1500.0,
+    "gru":           1500.0,
+    "lstm":          2000.0,
+    "bert":           800.0,
+}
+
+
+def _read_json(path: Path) -> dict | None:
+    try:
+        return json.loads(path.read_text())
+    except (OSError, json.JSONDecodeError) as e:
+        log.warning("could not read %s: %s", path.name, e)
+        return None
+
+
+def _extract_f1(d: dict) -> float | None:
+    """Pull a scalar test_macro_f1 from one of two known shapes.
+
+    - ``training.trainer.run`` writes ``test_macro_f1`` flat.
+    - ``training.eval_.run`` writes ``macro_f1: {point, low, high}``
+      and the family name only (no oracle/realistic suffix), so the
+      filename carries the mode if at all.
+    """
+    if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)):
+        return float(d["test_macro_f1"])
+    mf1 = d.get("macro_f1")
+    if isinstance(mf1, dict) and "point" in mf1:
+        return float(mf1["point"])
+    if isinstance(mf1, (int, float)):
+        return float(mf1)
+    return None
+
+
+def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None:
+    """Pick the best-available test_macro_f1 for one family.
+
+    Returns ``(f1, source_label)`` or ``None`` if no candidate file
+    has a usable score.
+
+    Filename precedence (most-preferred first):
+
+    1. ``<family>_oracle_train.json``  — trainer-time, sample split
+    2. ``<family>_eval.json``          — eval_/run.py output, recipe
+                                         set by --split-recipe
+    3. ``<family>_realistic_train.json`` — cross-host fallback
+    """
+    candidates = [
+        ("oracle_train",     f"{family}_oracle_train.json"),
+        ("eval",             f"{family}_eval.json"),
+        ("realistic_train",  f"{family}_realistic_train.json"),
+    ]
+    for label, fname in candidates:
+        p = reports_dir / fname
+        if not p.exists():
            continue
-        f1 = r.get("test_macro_f1")
+        d = _read_json(p)
+        if d is None:
+            continue
+        f1 = _extract_f1(d)
        if f1 is None:
            continue
-        # Display name combines model+mode for the bar widget
-        display = f"{model}_{mode}"
-        await publish({
-            "type": "model_metric",
-            "model": display,
-            "accuracy": float(f1),
-        })
-        latency = LATENCY_ESTIMATES_US.get(model, 1000.0)
-        await publish({
-            "type": "model_perf",
-            "model": display,
-            "latency_us": float(latency),
-            "accuracy": float(f1),
-        })
-        n += 1
-    log.info("published %d model pairs (metric+perf)", n)
+        return f1, label
+    return None
+
+
+def emit_once(*, publisher: Publisher, reports_dir: Path) -> int:
+    n = 0
+    for bar_name, family in CANONICAL_TO_FAMILY.items():
+        result = _best_f1_for_family(reports_dir, family)
+        if result is None:
+            log.info("no F1 yet for %s (family=%s) — skipping",
+                     bar_name, family)
+            continue
+        f1, source = result
+        latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0))
+        try:
+            publisher.publish(ModelMetric(
+                model=bar_name, accuracy=f1))
+            publisher.publish(ModelPerf(
+                model=bar_name, latency_us=latency, accuracy=f1))
+            n += 1
+            log.debug("%s: F1=%.4f latency=%.0fus (from %s)",
+                      bar_name, f1, latency, source)
+        except Exception as e:
+            log.warning("publish failed for %s: %s", bar_name, e)
+    log.info("published %d (model_metric + model_perf) pairs", n)
    return n


 async def _run(args) -> int:
-    publisher = (null_publisher() if args.dry_run
-                 else http_publisher(args.publish_url))
+    publisher = Publisher(url=args.publish_url)
    while True:
-        await emit_once(publish=publisher, reports_dir=args.reports_dir)
+        emit_once(publisher=publisher, reports_dir=args.reports_dir)
        if args.interval <= 0:
            return 0
        await asyncio.sleep(args.interval)
@ -131,16 +215,21 @@ async def _run(args) -> int:
 def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--reports-dir", type=Path,
-                    default=Path("reports/eval"),
-                    help="dir containing <model>_<mode>_train.json files")
-    ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
-    ap.add_argument("--interval", type=float, default=30.0,
-                    help="re-publish period (s); 0 = one-shot")
-    ap.add_argument("--dry-run", action="store_true")
+                    default=Path("reports/eval"))
+    ap.add_argument("--publish-url",
+                    default="http://127.0.0.1:8447/publish")
+    ap.add_argument("--interval", type=float, default=5.0,
+                    help="re-publish period (s); 0 = one-shot. "
+                         "Kept short so a fresh page-load sees populated "
+                         "bars/scatter within a few seconds. The dashboard "
+                         "broadcaster does not replay events to new "
+                         "connections by default — see "
+                         "docs/dashboard-request-sticky-cache.md.")
    ap.add_argument("--log-level", default="INFO")
    args = ap.parse_args()
-    logging.basicConfig(level=args.log_level,
-                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
+    logging.basicConfig(
+        level=args.log_level,
+        format="%(asctime)s %(levelname)s %(name)s %(message)s")
    return asyncio.run(_run(args))