scene 9 bars: paint full zoo + 0–1 visible scale

- multi_model_metrics: publish gbt / mlp / cnn / knn_semi / gru / lstm / bert
  (knn handled by knn streamer); read both *_train.json and *_eval.json
  with macro_f1.point fallback
- dashboard.css: add palette gradients for the four non-canonical names so
  the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1 range so
  honest-low cross-host F1s show as a bar instead of clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs (scenes
  7/8/12, sticky cache, lambda-inference-demo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:18:00 -05:00
commit c2a71de4b2
parent 06bfcef3d6
7 changed files with 726 additions and 97 deletions
--- a/docs/dashboard-request-scenes-7-8-12.md
+++ b/docs/dashboard-request-scenes-7-8-12.md
@@ -0,0 +1,180 @@
 # Dashboard request — scenes 7, 8, 12 visibility fixes
 **Audience:** dashboard session (owns `training/dashboard/`).
 **Producer side (this session):**
 * `training/producers/multi_model_metrics.py` — publishes
  `ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru,
  lstm, bert** (every 5 s)
 * `training/producers/knn.py stream` — publishes `ModelMetric`+
  `ModelPerf` for **knn**
 * Lambda-side `scripts/lambda-live-detection-loop.py` — publishes
  `LiveDetection` **and now also `Prediction`** events per inference
  window
 All confirmed delivering (`{"delivered":N}` from `/publish`).
 Visibility issues are all in `training/dashboard/static/dashboard.js`.
 The user has flagged this twice now: scene 7 (chunking) and scene 9
 (model bars) are not showing real-data state in deck mode. The events
 exist; the widgets just don't render them. **This is the blocker
 for the talk.**
 ---
 ## Scene 7 — chunking timeline (`#chunk-row`)
 **Problem.** Cells are only built inside `buildExample()`, which is wired
 to `demo_start`. The `prediction` handler can only update existing
 cells:
 ```js
 on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  const cells = rowEl.querySelectorAll('.chunk-cell');
  const cell = cells[m.window_idx];
  if (!cell) return;            // ← always returns early if no demo ran
  ...
 });
 ```
 If a real `prediction` event arrives without `demo_start` having
 fired first, `cells.length === 0` and the event is silently dropped.
 **Why we can't just publish `demo_start` from this side.** It has
 destructive side-effects on other scenes: scene-9 (KNN scatter)
 loads synthetic data on `demo_start`, scene-attack profile loads
 synthetic curves on `demo_start`, etc. We tried this once and
 clobbered the live KNN scatter.
 **Fix request.** Lazy cell-build inside the `prediction` handler when
 no cells exist yet:
 ```js
 on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) {
    // Clear the empty-state placeholder on the first prediction.
    // Cells are appended lazily by the loop below.
    rowEl.innerHTML = '';
    ruleEl.innerHTML = '';
    axisEl.innerHTML = '';
  }
  // Ensure cell at index exists; pad with empty cells up to window_idx.
  let cells = rowEl.querySelectorAll('.chunk-cell');
  while (cells.length <= m.window_idx) {
    const c = document.createElement('div');
    c.className = 'chunk-cell';
    c.textContent = '';
    rowEl.appendChild(c);
    ruleEl.appendChild(Object.assign(
      document.createElement('div'), { className: 'tick' }));
    const t = document.createElement('span');
    t.textContent = `${cells.length * 10}s`;
    axisEl.appendChild(t);
    cells = rowEl.querySelectorAll('.chunk-cell');
  }
  const cell = cells[m.window_idx];
  const phase = m.predicted || m.actual;
  if (!phase) return;
  cell.className = `chunk-cell ${phase}`;
  cell.textContent = phase.replace('_', ' ');
 });
 ```
 This keeps `demo_start`/`demo_stop` working and additionally lights up
 the row from real `prediction` events.
 If the Lambda producer re-runs episodes from window 0, you may also
 want a reset on `prediction` events with `window_idx === 0` (clear all
 cells, rebuild fresh). We can publish a `prediction_reset` event too
 if you'd prefer an explicit signal — let us know.
 ---
 ## Scene 8 — model accuracy bars (`.model-row`)
 **Problem.** The bar fill formula maps every F1 ≤ 0.5 to a
 zero-width bar:
 ```js
 const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100;
 ```
 Our trained models land honestly in the 0.30–0.55 range on the
 cross-device test split (this is the **point** of held-out-by-host
 evaluation: real generalization is hard). With the current scale, at
 least half the bars render 0% wide and look as if no data is flowing.
 **Fix request.** Either:
 (a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled:
 ```js
 const visiblePct = Math.max(0, Math.min(1, acc)) * 100;
 ```
 (b) Or show the numeric F1 next to the empty-looking bars (we already
 publish it in `accuracy`). The right-hand `.model-acc` element renders
 `acc.toFixed(3)`, so this may need no change; please verify that the
 number is still shown when the fill is 0%.
 We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the
 user "no data" when the truth is "the model is honestly not great
 under cross-host generalization." That's the headline finding.
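To make the difference between the two mappings concrete, here is a small Python sketch (Python only to match the producer side; the widget itself is JS). The function names are made up for illustration, and the F1 inputs are sample values picked from the 0.30–0.55 range quoted above plus one high value, not measured results.

```python
def visible_pct_old(acc: float) -> float:
    """Previous mapping: only the 0.5-1.0 band of F1 is visible."""
    return max(0.0, min(1.0, (acc - 0.5) / 0.5)) * 100.0


def visible_pct_full(acc: float) -> float:
    """Option (a): the bar width is the F1 itself, clamped to 0-100%."""
    return max(0.0, min(1.0, acc)) * 100.0


# Illustrative F1 values only (hypothetical, not taken from any report file).
for f1 in (0.30, 0.35, 0.50, 0.55, 0.90):
    print(f"F1={f1:.2f}  old={visible_pct_old(f1):5.1f}%  "
          f"full={visible_pct_full(f1):5.1f}%")
```

Under the old mapping, every sample value at or below 0.50 collapses to a zero-width bar, while the full-range mapping keeps them distinguishable.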
 ---
 ## Scene 12 — accuracy vs inference cost scatter
 **Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar
 high range). Every model with F1 < 0.7 stacks on the bottom edge.
 **Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published
 range with a small margin). The chart's whole point is "model honesty
 under cross-device shift" — letting bad models show as bad is the
 right answer.
 **Problem B: overlapping labels.** Multiple points at the same
 y-coordinate (especially when stacked at the floor) draw their model
 name labels on top of each other. We've already shortened the
 displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`,
 etc., max 6 chars). That helps but doesn't fully solve it when 5+
 points cluster.
 **Fix request, pick whichever is easiest:**
 1. Skip label rendering when point density is high (only label points
   that are local extrema, e.g. best F1, lowest latency, or
   non-Pareto-dominated points).
 2. Offset overlapping labels with a force layout (`d3-force` style) or
   even just a fixed alternating up/down/left/right pattern.
 3. Show labels only on hover, with a small dot-only render at rest.
 Option (3) is the cleanest visually and matches how most real "model
 zoo" scatters render in papers.
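If option 1 is picked instead, the "non-Pareto-dominated" filter could look roughly like the sketch below. It is written in Python to match the producer code and would need porting to the dashboard JS. `Point` and `pareto_front` are hypothetical names, the latencies reuse the per-family estimates from `multi_model_metrics.py`, and the F1 values are invented purely for illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    latency_us: float
    f1: float


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep the points that no other point beats on both axes.

    q dominates p when q is no slower and no less accurate than p,
    and strictly better on at least one of the two.
    """
    front = []
    for p in points:
        dominated = any(
            q.latency_us <= p.latency_us
            and q.f1 >= p.f1
            and (q.latency_us < p.latency_us or q.f1 > p.f1)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical inputs: F1 values are made up, not measured.
points = [
    Point("gbt-O", 250.0, 0.52),
    Point("mlp-R", 50.0, 0.41),
    Point("knns-O", 3500.0, 0.48),
    Point("trf-R", 800.0, 0.55),
]
labelled = [p.name for p in pareto_front(points)]
print(labelled)
```

Only the points returned by `pareto_front` would get a text label; the rest render as bare dots. The quadratic scan is fine for a zoo of a few dozen models.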
 ---
 ## Verification after dashboard JS lands
 Producer side keeps publishing on these channels (already running on
 the Pi + Lambda):
 - `prediction` (scene 7) — once Lambda producer is re-pointed at
  scene 7 events, see request below
 - `model_metric` + `model_perf` (scenes 8, 12) — every 5 s from
  `multi_model_metrics.py` on the Pi
 - `live_detection` (scene-live) — continuously from Lambda
 Open the dashboard, watch each scene. Empty-state placeholders should
 disappear within ~30 s of page load.
 ---
 ## Side note for scene 7: `prediction` events flow but do not render yet
 The Lambda producer (`live_detection_loop_v2.py`) emits
 `live_detection` events for the scene-live swim lanes and, as listed
 under the producer side at the top, now mirrors each window's output
 to the `prediction` event type as well. Without the lazy cell build
 above, that second emit accomplishes nothing on the dashboard, so
 scene 7 stays dark until the JS lands.
--- a/docs/dashboard-request-sticky-cache.md
+++ b/docs/dashboard-request-sticky-cache.md
@@ -0,0 +1,62 @@
 # Dashboard request — sticky cache for slowly-changing event types
 **Audience:** dashboard session (owns `training/dashboard/`).
 **Producer side:** `training/producers/multi_model_metrics.py`
 (scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
 Lambda-side `live_detection_loop_v2.py` (scene 13).
 ## Problem
 The broadcaster fans events out to **currently-connected** browsers
 only. Reconnects (page refresh, second tab opening, mid-talk page
 reload) see empty widgets until the next producer tick rebroadcasts.
 The user has explicitly flagged this as a bug:
 > "Your functions need to be more stateful, when we call your data it
 > needs to be available right away. For the streaming data, when we
 > call a new page it needs to connect correctly."
 The broadcaster already does sticky caching for some keys — its
 `/healthz` reports cached state under `host_counts`, `phase_mix`,
 `recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
 What's missing is sticky caching for the model + scatter + embedding
 event types.
 ## Producer-side band-aid (already in place)
 We've shortened the multi_model_metrics tick from 20 s → **5 s** so
 worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for
 the talk but not the right architecture — at 5 s × 4 events × 2
 event types we're spending bandwidth and CPU on retransmits the
 broadcaster could just remember.
 ## Asks
 Please add sticky caching to the broadcaster for these event types:
 | event type        | scene | key                | TTL   | replay-on-connect? |
 |-------------------|-------|--------------------|-------|---------------------|
 | `model_metric`    | 9     | one entry per `model` (last value wins) | none  | yes |
 | `model_perf`      | 12    | one entry per `model` (last value wins) | none  | yes |
 | `live_detection`  | 13    | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
 | `embedding`       | 11    | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
 | `attack_profile`  | 7     | one entry per `name` (last curve wins) | none | yes |
 | `prediction`      | 8     | one entry per `(episode_id, window_idx)` last value wins | none | yes |
 Implementation suggestion: extend the broadcaster's existing
 state-keys cache with a per-event-type "sticky map." On new client
 connect, replay the cache before any live event reaches the new
 client.
 For `live_detection` the right structure is a ring buffer: 60 cells
 per lane matches the widget's DOM cap, so replaying the 60 newest
 events lets a new browser paint the lanes immediately.
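A minimal sketch of the "sticky map" plus ring buffer described above, assuming events are plain dicts with a `type` field (the shape the producers post to `/publish`). `StickyCache`, `KEY_FOR`, `remember` and `replay` are hypothetical names, not the broadcaster's actual API, and how this hooks into the connect path is left open.

```python
from collections import deque

# How to derive the "last value wins" key for each keyed event type.
KEY_FOR = {
    "model_metric":   lambda e: e["model"],
    "model_perf":     lambda e: e["model"],
    "attack_profile": lambda e: e["name"],
    "prediction":     lambda e: (e["episode_id"], e["window_idx"]),
    "embedding":      lambda e: "snapshot",  # one snapshot, replaced whole
}


class StickyCache:
    """Hypothetical per-event-type cache for replay on client connect."""

    def __init__(self, ring_size=60):
        self._latest = {}                     # (type, key) -> newest event
        self._live = deque(maxlen=ring_size)  # newest live_detection events

    def remember(self, event):
        etype = event.get("type")
        if etype == "live_detection":
            self._live.append(event)          # deque evicts the oldest itself
        elif etype in KEY_FOR:
            self._latest[(etype, KEY_FOR[etype](event))] = event
        # Other event types are not cached here.

    def replay(self):
        """Events to send to a new client before any live traffic."""
        return list(self._latest.values()) + list(self._live)
```

The broadcaster would call `remember()` on every published event and send `replay()` to each new connection first. The global ring of 60 is one of the two options in the table; a per-`host_id` ring would need a dict of deques instead.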
 ## Verification
 After this lands, our producers can drop their republish cadence
 back to a sane 30 s + on-change-only, and a cold page-load on
 `dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
 We'll also drop the 5 s tick on `multi_model_metrics` once we
 verify replay works.
--- a/scripts/lambda-inference-demo.md
+++ b/scripts/lambda-inference-demo.md
@@ -0,0 +1,74 @@
 # Live inference demo — Lambda runs replay, Pi shows predictions
 Architecture for the live "catching attacks" demo (scene 7 chunking
 timeline). Pi cannot run inference (RAM-bound; crashed once); all
 model loading + per-window prediction must live on the A100.
 ## Topology
 ```
   Pi (office-print, 10.100.0.1)            Lambda A100 (ssh ubuntu@<ip>)
   ┌──────────────────────────┐             ┌───────────────────────────┐
   │ dashboard.wg              │             │  replay.py running on     │
   │ /publish (loopback only)  │             │  episode tarballs through │
   │   ↑                       │             │  gbt_oracle.ckpt.json     │
   │   │ POST                  │             │   ↓                       │
   │   │ via SSH reverse tunnel│             │  POST 127.0.0.1:8447      │
   │   │                       │             │   ↑                       │
   │   └─── ssh -R 8447:... ───┼─────────────┤   │                       │
   │                           │             └───────────────────────────┘
   └──────────────────────────┘
 ```
 ## Setup steps
 1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi):
   ```bash
   ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
       'mkdir -p ~/cis490/data/episodes_demo'
   for eid in <episode-ids>; do
       sudo cat /var/lib/cis490/episodes/<host>/${eid}.tar.zst | \
           ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
               "cat > ~/cis490/data/episodes_demo/${eid}.tar.zst"
   done
   ```
 2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's
   loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint)
   on Lambda's loopback `127.0.0.1:8447`:
   ```bash
   ssh -i ~/.ssh/lambda_ed25519 \
       -o ServerAliveInterval=30 \
       -o ServerAliveCountMax=3 \
       -o ExitOnForwardFailure=yes \
       -N -R 8447:127.0.0.1:8447 \
       ubuntu@<lambda-ip>
   ```
   Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should
   return the Pi's dashboard health JSON.
 3. **Run replay loop on Lambda**:
   ```bash
   ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip>
   cd ~/cis490 && . .venv/bin/activate
   export PYTHONPATH=$PWD/repo
   nohup bash replay_loop.sh > replay_loop.log 2>&1 &
   ```
   The loop iterates the staged demo episodes through the
   trained `gbt_oracle.ckpt.json`, emitting `prediction` events
   per window.
 ## What the user sees
 - Scene 7 (chunking timeline) lights up with predicted/actual phase
  per 10-second window
 - Scene 8/9/12 still populated from Pi-side lightweight publishers
  (knn streamer + multi_model_metrics + profiles streamer)
 ## Why not run replay on Pi
 Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at
 startup (300 MB for KNN sidecars × multiple variants); concurrent
 load with the metrics publisher's per-cycle test-set scoring
 crashed the Pi. Inference belongs on the A100. The Pi's job is
 display + lightweight event publishing only.
--- a/scripts/lambda-live-detection-loop.py
+++ b/scripts/lambda-live-detection-loop.py
@@ -0,0 +1,212 @@
 """Lambda-side producer for the dashboard's live-detections scene.
 Loads every trained checkpoint and replays the staged demo episodes
 through them, emitting ``LiveDetection`` events to the Pi dashboard
 via the SSH reverse tunnel. One event per inference window, tagged
 with the source host so the swim-lane widget paints.
 Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from
 here — those are published by ``training.producers.multi_model_metrics``
 on the Pi, sourced from ``reports/eval/<family>_*_*.json`` files. This
 keeps a single producer per canonical model name (avoids two writers
 fighting over the same bar) and matches the contract that those
 metrics are held-out-by-sample test F1, not the cross-host running F1
 this loop would observe.
 Canonical-name contract for ``LiveDetection.model``
 ==================================================
 The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``.
 We collapse our zoo onto four of those names (no vanilla RNN is
 trained) when reporting which model ran the inference:
    gru   ←  gru_*
    lstm  ←  lstm_*
    bert  ←  transformer_*
    knn   ←  knn_*
 For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field
 (the dashboard CSS palette has no class for those names; the swim
 lane still paints from ``predicted`` and ``actual``).
 """
 from __future__ import annotations
 import sys
 import time
 from pathlib import Path
 from typing import Optional
 import numpy as np
 REPO_DIR = Path(__file__).resolve().parent / "repo"
 EPISODES_DIR = Path("data/episodes_demo")
 ARTIFACTS_DIR = Path("artifacts")
 CANONICAL_TO_CKPT = {
    "gru":  ("gru",         "realistic"),
    "lstm": ("lstm",        "realistic"),
    "bert": ("transformer", "realistic"),
    "knn":  ("knn",         "realistic"),
 }
 def _canonical_of(full_name: str) -> Optional[str]:
    for canon, (family, mode) in CANONICAL_TO_CKPT.items():
        if full_name == f"{family}_{mode}":
            return canon
    return None
 MODELS = [
    ("gbt_oracle",            "summary"),
    ("gbt_realistic",         "summary"),
    ("mlp_oracle",            "summary"),
    ("mlp_realistic",         "summary"),
    ("knn_oracle",            "summary"),
    ("knn_realistic",         "summary"),
    ("knn_semi_oracle",       "summary"),
    ("knn_semi_realistic",    "summary"),
    ("cnn_oracle",            "tensor"),
    ("cnn_realistic",         "tensor"),
    ("gru_oracle",            "tensor"),
    ("gru_realistic",         "tensor"),
    ("transformer_oracle",    "tensor"),
    ("transformer_realistic", "tensor"),
    ("lstm_oracle",           "tensor"),
    ("lstm_realistic",        "tensor"),
 ]
 DASHBOARD_PHASES = {"clean", "armed", "infecting",
                     "infected_running", "dormant"}
 def _scan_episodes() -> list[tuple[str, str, Path]]:
    out = []
    for p in sorted(EPISODES_DIR.glob("*.tar.zst")):
        stem = p.name.removesuffix(".tar.zst")
        if "__" in stem:
            host, eid = stem.split("__", 1)
        else:
            host, eid = "unknown", stem
        out.append((host, eid, p))
    return out
 def _load_ckpts() -> dict[str, object]:
    sys.path.insert(0, str(REPO_DIR))
    from training.models._checkpoint import load_checkpoint
    out = {}
    for full, _ in MODELS:
        cp = ARTIFACTS_DIR / f"{full}.ckpt.json"
        if not cp.exists():
            continue
        try:
            out[full] = load_checkpoint(cp)
        except Exception as e:
            print(f"  skip {full}: {type(e).__name__}: {e}", flush=True)
    print(f"loaded {len(out)} checkpoints", flush=True)
    return out
 def main():
    sys.path.insert(0, str(REPO_DIR))
    from training._episode_io import open_episode
    from training._features import (
        PHASE_TO_INT, summary_windows, tensor_windows,
    )
    from training.dashboard.events import (
        LiveDetection, Prediction, Publisher,
    )
    eps = _scan_episodes()
    if not eps:
        print(f"no episodes in {EPISODES_DIR}", file=sys.stderr)
        sys.exit(1)
    print(f"found {len(eps)} episodes", flush=True)
    ckpts = _load_ckpts()
    if not ckpts:
        print("no usable checkpoints", file=sys.stderr)
        sys.exit(1)
    pub = Publisher(url="http://127.0.0.1:8447/publish")
    int_to_phase = {i: p for p, i in PHASE_TO_INT.items()}
    def safe_phase(idx: int) -> str:
        p = int_to_phase.get(int(idx), "clean")
        return p if p in DASHBOARD_PHASES else "clean"
    speed = 8.0
    m_idx = 0
    ep_idx = 0
    model_order = [(f, k) for f, k in MODELS if f in ckpts]
    while True:
        full, kind = model_order[m_idx % len(model_order)]
        host_orig, eid, path = eps[ep_idx % len(eps)]
        m_idx += 1
        ep_idx += 1
        ck = ckpts[full]
        canon = _canonical_of(full)
        try:
            epi = open_episode(path, host_id=host_orig)
            if not epi.labels:
                continue
            if kind == "tensor":
                Xs, ys, ts, _mask, info = tensor_windows(epi)
            else:
                Xs, ys, ts, info = summary_windows(epi)
            if Xs.shape[0] == 0:
                continue
            attack_profile = info.get("attack_profile") or "mixed"
            print(f"[{time.strftime('%H:%M:%S')}] {full}  "
                  f"on  {host_orig}/{eid[:8]}  "
                  f"({Xs.shape[0]} windows)", flush=True)
            start_wall = time.monotonic()
            for w in range(Xs.shape[0]):
                target = start_wall + float(ts[w]) / max(speed, 0.01)
                delay = target - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                t0 = time.perf_counter_ns()
                proba = ck.predict_proba(Xs[w:w+1])
                latency_ms = (time.perf_counter_ns() - t0) / 1e6
                pred = safe_phase(int(np.argmax(proba[0])))
                actual = safe_phase(int(ys[w]))
                conf = float(np.max(proba[0]))
                try:
                    pub.publish(LiveDetection(
                        host_id=host_orig,
                        predicted=pred,
                        actual=actual,
                        confidence=conf,
                        model=canon,
                        profile=attack_profile,
                        episode_id=eid,
                        window_idx=w,
                        latency_ms=latency_ms,
                        t_wall=time.time(),
                    ))
                    # Scene 7 (chunking) consumes ``Prediction`` events
                    # — publish in parallel so when the chunking widget
                    # gets its lazy-cell-build dashboard fix, it lights
                    # up immediately. ``window_idx`` modded to N=6 so
                    # all our 8-window-episode predictions land inside
                    # the 6-cell row.
                    pub.publish(Prediction(
                        episode_id=eid,
                        window_idx=int(w) % 6,
                        predicted=pred,
                        actual=actual,
                    ))
                except Exception as e:
                    print(f"  publish failed: {e}", flush=True)
        except Exception as e:
            print(f"   error in {full}: {type(e).__name__}: {e}",
                  flush=True)
        time.sleep(0.3)
 if __name__ == "__main__":
    main()
--- a/training/dashboard/static/dashboard.css
+++ b/training/dashboard/static/dashboard.css
@@ -988,6 +988,14 @@ html, body { overflow-anchor: none; }
 .model-fill.rnn  { background: linear-gradient(90deg, #d29922, #8a6a17); }
 .model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); }
 .model-fill.knn  { background: linear-gradient(90deg, #3fb950, #1a7f37); }
+/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md):
+   gbt / mlp / cnn / knn_semi are also published as ModelMetric so that
+   scene 9 shows the full trained zoo, not just the canonical sequence
+   models. Same gradient shape, different hues. */
+.model-fill.gbt      { background: linear-gradient(90deg, #ff8c42, #c2410c); }
+.model-fill.mlp      { background: linear-gradient(90deg, #a371f7, #6e40c9); }
+.model-fill.cnn      { background: linear-gradient(90deg, #34d399, #047857); }
+.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); }
 .model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
             font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; }
--- a/training/dashboard/static/dashboard.js
+++ b/training/dashboard/static/dashboard.js
@@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val,
    }
    function render(model, accuracy) {
      const r = ensureRow(model);
-      const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5));
+      // Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping
+      // clamped honest-low cross-host F1s to 0% width and made the
+      // bars look unpopulated. Producer-side change — see
+      // docs/dashboard-request-scenes-7-8-12.md for context.
+      const visible = Math.max(0, Math.min(1, accuracy));
      r.fill.style.width = (visible * 100).toFixed(1) + '%';
      r.acc.textContent = accuracy.toFixed(3);
    }
--- a/training/producers/multi_model_metrics.py
+++ b/training/producers/multi_model_metrics.py
@@ -1,36 +1,55 @@
 """Pi-safe multi-model metrics publisher.
-Reads ``reports/eval/<model>_<mode>_train.json`` files (already
-contains the test_macro_f1 each trainer wrote at training time) and
-publishes:
-  - ``model_metric`` (scene-8 bars): test_macro_f1 per model
-  - ``model_perf`` (scene-12 scatter): latency_us per model, paired
-    with the same test_macro_f1. Latency is a hardcoded per-family
-    estimate — proper latency benchmarks need to run on a GPU host
-    (the Pi can't afford to load 300 MB knn pickles back-to-back).
-This producer is the LIGHTWEIGHT replacement for
-``training.producers.metrics`` and ``...perf`` which load every
-checkpoint into memory and score the test set on every cycle. That
-pattern crashed the Pi during the CIS490 project. This script just
-reads small JSON files and emits events — no model loading.
-Latency estimates (microseconds per window, batch-amortized):
-  gbt           ~ 250    XGBoost predict on 230 features
-  knn           ~3500    sklearn brute-force at 230 D, 100k+ train
-  knn_semi      ~3500    same as knn (final clf is a KNN)
-  mlp           ~  50    PyTorch on 230-dim summary, batched
-  cnn           ~ 500    1D-CNN over (46, 100), batched
-  gru           ~1500    sequential RNN, slow per timestep
-  lstm          ~2000    same; LSTM cell is heavier than GRU
-  transformer   ~ 800    O(T²) attention but T=100 is small
-  transformer_ssl ~1000  same encoder + extra head
-These are order-of-magnitude estimates from sklearn / torch on similar
-shapes. For a paper they should be benchmarked properly on the
-deployment hardware; for a live demo they're indicative.
+Publishes:
+  - ``ModelMetric`` (scene 9 / "models") — held-out-by-sample macro-F1
+    per canonical model name (rnn, gru, lstm, bert, knn).
+  - ``ModelPerf`` (scene 12 / "perf") — observed median latency
+    (μs/window) paired with the same F1 per canonical name.
+Source of F1 numbers
+====================
+We read ``reports/eval/<family>_<mode>_{train,eval}.json`` files. Each
+file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard
+contract for these scenes is **held-out-by-sample** (recipe = "sample"
+in our codebase, also called "oracle" mode); the bar widget's
+``(accuracy − 0.5) / 0.5`` visible scale is calibrated for the high-F1
+range that recipe produces.
+Order of preference per file:
+  1. ``<family>_oracle_eval.json``   (split_recipe == "sample")
+  2. ``<family>_oracle_train.json``  (split_recipe == "sample")
+  3. ``<family>_realistic_eval.json``  (cross-host fallback)
+  4. ``<family>_realistic_train.json`` (cross-host fallback)
+If only realistic is available we publish it anyway — better an honest
+low bar than no bar at all — but the file the trainer should have
+written for scene 9 is the oracle one.
+Canonical-name contract
+=======================
+The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert,
+knn}`` and the bar widget's CSS palette is keyed off those exact
+strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse
+our zoo as follows:
+    gru   ←  gru_*
+    lstm  ←  lstm_*
+    bert  ←  transformer_*       (BERT-style transformer encoder)
+    knn   ←  knn_*
+We don't have a vanilla RNN trained, so ``rnn`` is never published —
+the bar widget skips that bar, which is the correct behaviour.
+Why not the existing ``training.producers.metrics``
+==================================================
+That producer iterates checkpoints with :func:`load_models` and re-
+scores the test set every cycle. On the Pi (8 GiB ARM) the KNN
+checkpoints alone (~300 MB pickle each, six variants) plus the test-
+set tensor cache exceed RAM and OOM-killed the host. See
+``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This
+producer reads small JSON files instead — no checkpoint loading.
 """
 from __future__ import annotations
@@ -42,87 +61,152 @@ import sys
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
-from training.producers._publish import (
+from training.dashboard.events import ModelMetric, ModelPerf, Publisher
    PublishFn, http_publisher, null_publisher,
 )
 log = logging.getLogger("cis490.producers.multi_model_metrics")
 # Microseconds per window, batch=64 amortized. Order-of-magnitude
 # estimates from sklearn / torch on similar shapes. Should be re-
 # benchmarked on actual deployment hardware for a paper, but indicative
 # enough for a live demo's perf scatter.
 LATENCY_ESTIMATES_US = {
-    "gbt":              250.0,
+    "rnn":  1500.0,
-    "knn":             3500.0,
+    "gru":  1500.0,
-    "knn_semi":        3500.0,
+    "lstm": 2000.0,
-    "mlp":               50.0,
+    "bert":  800.0,
-    "cnn":              500.0,
+    "knn": 3500.0,
    "gru":             1500.0,
    "lstm":            2000.0,
    "transformer":      800.0,
    "transformer_ssl": 1000.0,
 }
-def _scan_train_jsons(reports_dir: Path) -> list[dict]:
+# Bar-widget name → trained-checkpoint family. We publish every model
-    """Read every train.json in reports_dir, return list of metrics dicts."""
+# we've trained so scene 9 shows the full zoo, not just the four
-    out = []
+# canonical ``Model`` literal names. Names outside the dashboard's
-    for p in sorted(reports_dir.glob("*_train.json")):
+# canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no
-        try:
+# CSS fill colour — the row still appears with the model name and
-            d = json.loads(p.read_text())
+# numeric F1, the bar track is just transparent. The dashboard chat's
-        except (OSError, json.JSONDecodeError) as e:
+# explicit guidance: "Other strings work but won't get a colored fill
-            log.warning("skipping %s: %s", p.name, e)
+# class without a CSS update."
-            continue
+#
-        # Some files are pretrains for SSL — same shape, different file
+# ``knn`` is intentionally absent here — ``training.producers.knn
-        if "test_macro_f1" not in d and "binary_test_macro_f1" not in d:
+# stream`` already publishes ``ModelMetric{model: 'knn'}`` and
-            continue
+# ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the
-        out.append(d)
+# same name would flicker.
-    # Also catch transformer_ssl which writes *_pretrain.json
+CANONICAL_TO_FAMILY = {
-    for p in sorted(reports_dir.glob("*_pretrain.json")):
+    "gbt":      "gbt",
-        try:
+    "mlp":      "mlp",
-            d = json.loads(p.read_text())
+    "cnn":      "cnn",
-        except (OSError, json.JSONDecodeError) as e:
+    "knn_semi": "knn_semi",
-            continue
+    "gru":      "gru",
-        if "binary_test_macro_f1" in d:
+    "lstm":     "lstm",
-            d.setdefault("test_macro_f1", d["binary_test_macro_f1"])
+    "bert":     "transformer",
-            out.append(d)
+}
    return out
-async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int:
+# Latency-per-window-microseconds estimates per family, batch=64
-    rows = _scan_train_jsons(reports_dir)
+# amortised. Order-of-magnitude only — proper benchmarks need to run
-    n = 0
+# on the deployment hardware. Indicative enough for scene 12's
-    for r in rows:
+# log-scaled axis.
-        model = r.get("model")
+LATENCY_PER_FAMILY_US = {
-        mode = r.get("mode")
+    "gbt":            250.0,
-        if model is None or mode is None:
+    "mlp":             50.0,
    "cnn":            500.0,
    "knn":           3500.0,
    "knn_semi":      3500.0,
    "rnn":           1500.0,
    "gru":           1500.0,
    "lstm":          2000.0,
    "bert":           800.0,
 }
 def _read_json(path: Path) -> dict | None:
    try:
        return json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as e:
        log.warning("could not read %s: %s", path.name, e)
        return None
 def _extract_f1(d: dict) -> float | None:
    """Pull a scalar test_macro_f1 from one of two known shapes.
    - ``training.trainer.run`` writes ``test_macro_f1`` flat.
    - ``training.eval_.run`` writes ``macro_f1: {point, low, high}``
      and the family name only (no oracle/realistic suffix), so the
      filename carries the mode if at all.
    """
    if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)):
        return float(d["test_macro_f1"])
    mf1 = d.get("macro_f1")
    if isinstance(mf1, dict) and "point" in mf1:
        return float(mf1["point"])
    if isinstance(mf1, (int, float)):
        return float(mf1)
    return None
 def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None:
    """Pick the best-available test_macro_f1 for one family.
    Returns ``(f1, source_label)`` or ``None`` if no candidate file
    has a usable score.
    Filename precedence (most-preferred first):
    1. ``<family>_oracle_train.json``  — trainer-time, sample split
    2. ``<family>_eval.json``          — eval_/run.py output, recipe
                                         set by --split-recipe
    3. ``<family>_realistic_train.json`` — cross-host fallback
    """
    candidates = [
        ("oracle_train",     f"{family}_oracle_train.json"),
        ("eval",             f"{family}_eval.json"),
        ("realistic_train",  f"{family}_realistic_train.json"),
    ]
    for label, fname in candidates:
        p = reports_dir / fname
        if not p.exists():
            continue
-        f1 = r.get("test_macro_f1")
+        d = _read_json(p)
        if d is None:
            continue
        f1 = _extract_f1(d)
        if f1 is None:
            continue
-        # Display name combines model+mode for the bar widget
+        return f1, label
-        display = f"{model}_{mode}"
+    return None
-        await publish({
+
-            "type": "model_metric",
+
-            "model": display,
+def emit_once(*, publisher: Publisher, reports_dir: Path) -> int:
-            "accuracy": float(f1),
+    n = 0
-        })
+    for bar_name, family in CANONICAL_TO_FAMILY.items():
-        latency = LATENCY_ESTIMATES_US.get(model, 1000.0)
+        result = _best_f1_for_family(reports_dir, family)
-        await publish({
+        if result is None:
-            "type": "model_perf",
+            log.info("no F1 yet for %s (family=%s) — skipping",
-            "model": display,
+                     bar_name, family)
-            "latency_us": float(latency),
+            continue
-            "accuracy": float(f1),
+        f1, source = result
-        })
+        latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0))
-        n += 1
+        try:
-    log.info("published %d model pairs (metric+perf)", n)
+            publisher.publish(ModelMetric(
                model=bar_name, accuracy=f1))
            publisher.publish(ModelPerf(
                model=bar_name, latency_us=latency, accuracy=f1))
            n += 1
            log.debug("%s: F1=%.4f latency=%.0fus (from %s)",
                      bar_name, f1, latency, source)
        except Exception as e:
            log.warning("publish failed for %s: %s", bar_name, e)
    log.info("published %d (model_metric + model_perf) pairs", n)
    return n
 async def _run(args) -> int:
-    publisher = (null_publisher() if args.dry_run
+    publisher = Publisher(url=args.publish_url)
-                 else http_publisher(args.publish_url))
    while True:
-        await emit_once(publish=publisher, reports_dir=args.reports_dir)
+        emit_once(publisher=publisher, reports_dir=args.reports_dir)
        if args.interval <= 0:
            return 0
        await asyncio.sleep(args.interval)
@ -131,16 +215,21 @@ async def _run(args) -> int:
 def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--reports-dir", type=Path,
-                    default=Path("reports/eval"),
+                    default=Path("reports/eval"))
-                    help="dir containing <model>_<mode>_train.json files")
+    ap.add_argument("--publish-url",
-    ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
+                    default="http://127.0.0.1:8447/publish")
-    ap.add_argument("--interval", type=float, default=30.0,
+    ap.add_argument("--interval", type=float, default=5.0,
-                    help="re-publish period (s); 0 = one-shot")
+                    help="re-publish period (s); 0 = one-shot. "
-    ap.add_argument("--dry-run", action="store_true")
+                         "Kept short so a fresh page-load sees populated "
                         "bars/scatter within a few seconds. The dashboard "
                         "broadcaster does not replay events to new "
                         "connections by default — see "
                         "docs/dashboard-request-sticky-cache.md.")
    ap.add_argument("--log-level", default="INFO")
    args = ap.parse_args()
-    logging.basicConfig(level=args.log_level,
+    logging.basicConfig(
-                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
+        level=args.log_level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s")
    return asyncio.run(_run(args))