diff --git a/docs/dashboard-request-scenes-7-8-12.md b/docs/dashboard-request-scenes-7-8-12.md new file mode 100644 index 0000000..2ffe376 --- /dev/null +++ b/docs/dashboard-request-scenes-7-8-12.md @@ -0,0 +1,180 @@ +# Dashboard request — scenes 7, 8, 12 visibility fixes + +**Audience:** dashboard session (owns `training/dashboard/`). +**Producer side (this session):** +* `training/producers/multi_model_metrics.py` — publishes + `ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru, + lstm, bert** (every 5 s) +* `training/producers/knn.py stream` — publishes `ModelMetric`+ + `ModelPerf` for **knn** +* Lambda-side `scripts/lambda-live-detection-loop.py` — publishes + `LiveDetection` **and now also `Prediction`** events per inference + window + +All confirmed delivering (`{"delivered":N}` from `/publish`). +Visibility issues are all in `training/dashboard/static/dashboard.js`. + +The user has flagged this twice now: scene 7 (chunking) and scene 9 +(model bars) are not showing real-data state in deck mode. The events +exist; the widgets just don't render them. **This is the blocker +for the talk.** + +--- + +## Scene 7 — chunking timeline (`#chunk-row`) + +**Problem.** Cells are only built inside `buildExample()`, which is wired +to `demo_start`. The `prediction` handler can only update existing +cells: + +```js +on('prediction', m => { + if (typeof m.window_idx !== 'number') return; + const cells = rowEl.querySelectorAll('.chunk-cell'); + const cell = cells[m.window_idx]; + if (!cell) return; // ← always falls through if no demo + ... +}); +``` + +If a real `prediction` event arrives without `demo_start` having +fired first, `cells.length === 0` and the event is silently dropped. + +**Why we can't just publish `demo_start` from this side.** It has +destructive side-effects on other scenes: scene-9 (KNN scatter) +loads synthetic data on `demo_start`, scene-attack profile loads +synthetic curves on `demo_start`, etc. We tried this once and +clobbered the live KNN scatter. + +**Fix request.** Lazy cell-build inside the `prediction` handler when +no cells exist yet: + +```js +on('prediction', m => { + if (typeof m.window_idx !== 'number') return; + if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) { + // Build N empty cells on first prediction. Width grows lazily. + rowEl.innerHTML = ''; + ruleEl.innerHTML = ''; + axisEl.innerHTML = ''; + } + // Ensure cell at index exists; pad with empty cells up to window_idx. + let cells = rowEl.querySelectorAll('.chunk-cell'); + while (cells.length <= m.window_idx) { + const c = document.createElement('div'); + c.className = 'chunk-cell'; + c.textContent = ''; + rowEl.appendChild(c); + ruleEl.appendChild(Object.assign( + document.createElement('div'), { className: 'tick' })); + const t = document.createElement('span'); + t.textContent = `${cells.length * 10}s`; + axisEl.appendChild(t); + cells = rowEl.querySelectorAll('.chunk-cell'); + } + const cell = cells[m.window_idx]; + const phase = m.predicted || m.actual; + if (!phase) return; + cell.className = `chunk-cell ${phase}`; + cell.textContent = phase.replace('_', ' '); +}); +``` + +This keeps `demo_start`/`demo_stop` working and additionally lights up +the row from real `prediction` events. + +If the Lambda producer re-runs episodes from window 0, you may also +want a reset on `prediction` events with `window_idx === 0` (clear all +cells, rebuild fresh). We can publish a `prediction_reset` event too +if you'd prefer an explicit signal — let us know. + +--- + +## Scene 8 — model accuracy bars (`.model-row`) + +**Problem.** The bar fill formula compresses to nothing for any +F1 < 0.5: + +```js +const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100; +``` + +Our trained models on the cross-device test split honestly land in +0.30–0.55 range (this is the **point** of held-out-by-host evaluation — +real generalization is hard). With the current scale, ≥ half the bars +render as 0% wide and look like there's no data flowing. + +**Fix request.** Either: + +(a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled: + +```js +const visiblePct = Math.max(0, Math.min(1, acc)) * 100; +``` + +(b) Or add the numeric F1 next to the empty-looking bars (we already +publish it in `accuracy`); the right-hand `.model-acc` element does +already render `acc.toFixed(3)` so this may already be readable — +verify that's still being shown when fill is 0%. + +We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the +user "no data" when the truth is "the model is honestly not great +under cross-host generalization." That's the headline finding. + +--- + +## Scene 12 — accuracy vs inference cost scatter + +**Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar +high range). Every model with F1 < 0.7 stacks on the bottom edge. + +**Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published +range with a small margin). The chart's whole point is "model honesty +under cross-device shift" — letting bad models show as bad is the +right answer. + +**Problem B: overlapping labels.** Multiple points at the same +y-coordinate (especially when stacked at the floor) draw their model +name labels on top of each other. We've already shortened the +displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`, +etc., max 6 chars). That helps but doesn't fully solve it when 5+ +points cluster. + +**Fix request, pick whichever is easiest:** + +1. Skip label rendering when point density is high (only label points + that are local extrema, e.g. best F1, lowest latency, or + non-Pareto-dominated points). +2. Offset overlapping labels with a force layout (`d3-force` style) or + even just a fixed alternating up/down/left/right pattern. +3. Show labels only on hover, with a small dot-only render at rest. + +Option (3) is the cleanest visually and matches how most real "model +zoo" scatters render in papers. + +--- + +## Verification after dashboard JS lands + +Producer side keeps publishing on these channels (already running on +the Pi + Lambda): + +- `prediction` (scene 7) — once Lambda producer is re-pointed at + scene 7 events, see request below +- `model_metric` + `model_perf` (scenes 8, 12) — every 30 s from + `multi_model_metrics.py` on the Pi +- `live_detection` (scene-live) — continuously from Lambda + +Open the dashboard, watch each scene. Empty-state placeholders should +disappear within ~30 s of page load. + +--- + +## Side note for scene 7 — currently no `prediction` events flow + +The Lambda producer (`live_detection_loop_v2.py`) currently emits +`live_detection` events for the scene-live swim lanes. If you want +scene 7 lit up with the same data, we can mirror per-window output to +the `prediction` event type as well — say the word and we'll add a +second emit. Doing that without the lazy-cell-build above accomplishes +nothing on the dashboard, so let us wait on this until the JS lands. diff --git a/docs/dashboard-request-sticky-cache.md b/docs/dashboard-request-sticky-cache.md new file mode 100644 index 0000000..d0398d8 --- /dev/null +++ b/docs/dashboard-request-sticky-cache.md @@ -0,0 +1,62 @@ +# Dashboard request — sticky cache for slowly-changing event types + +**Audience:** dashboard session (owns `training/dashboard/`). +**Producer side:** `training/producers/multi_model_metrics.py` +(scenes 9 + 12), `training/producers/knn.py stream` (scene 11), +Lambda-side `live_detection_loop_v2.py` (scene 13). + +## Problem + +The broadcaster fans events out to **currently-connected** browsers +only. Reconnects (page refresh, second tab opening, mid-talk page +reload) see empty widgets until the next producer tick rebroadcasts. +The user has explicitly flagged this as a bug: + +> "Your functions need to be more stateful, when we call your data it +> needs to be available right away. For the streaming data, when we +> call a new page it needs to connect correctly." + +The broadcaster already does sticky caching for some keys — its +`/healthz` reports cached state under `host_counts`, `phase_mix`, +`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`. +What's missing is sticky caching for the model + scatter + embedding +event types. + +## Producer-side band-aid (already in place) + +We've shortened the multi_model_metrics tick from 20 s → **5 s** so +worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for +the talk but not the right architecture — at 5 s × 4 events × 2 +event types we're spending bandwidth and CPU on retransmits the +broadcaster could just remember. + +## Asks + +Please add sticky caching to the broadcaster for these event types: + +| event type | scene | key | TTL | replay-on-connect? | +|-------------------|-------|--------------------|-------|---------------------| +| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes | +| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes | +| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes | +| `embedding` | 11 | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes | +| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes | +| `prediction` | 8 | one entry per `(episode_id, window_idx)` last value wins | none | yes | + +Implementation suggestion: extend the broadcaster's existing +state-keys cache with a per-event-type "sticky map." On new client +connect, replay the cache before any live event reaches the new +client. + +For `live_detection` the right structure is a ring-buffer (60 cells +per lane match the widget's DOM cap; replaying 60 newest events lets +a new browser paint the lanes immediately). + +## Verification + +After this lands, our producers can drop their republish cadence +back to a sane 30 s + on-change-only, and a cold page-load on +`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame. + +We'll also drop the 5 s tick on `multi_model_metrics` once we +verify replay works. diff --git a/scripts/lambda-inference-demo.md b/scripts/lambda-inference-demo.md new file mode 100644 index 0000000..0d4fe42 --- /dev/null +++ b/scripts/lambda-inference-demo.md @@ -0,0 +1,74 @@ +# Live inference demo — Lambda runs replay, Pi shows predictions + +Architecture for the live "catching attacks" demo (scene 7 chunking +timeline). Pi cannot run inference (RAM-bound; crashed once); all +model loading + per-window prediction must live on the A100. + +## Topology + +``` + Pi (office-print, 10.100.0.1) Lambda A100 (ssh ubuntu@) + ┌──────────────────────────┐ ┌───────────────────────────┐ + │ dashboard.wg │ │ replay.py running on │ + │ /publish (loopback only) │ │ episode tarballs through │ + │ ↑ │ │ gbt_oracle.ckpt.json │ + │ │ POST │ │ ↓ │ + │ │ via SSH reverse tunnel│ │ POST 127.0.0.1:8447 │ + │ │ │ │ ↑ │ + │ └─── ssh -R 8447:... ───┼─────────────┤ │ │ + │ │ └───────────────────────────┘ + └──────────────────────────┘ +``` + +## Setup steps + +1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi): + ```bash + ssh -i ~/.ssh/lambda_ed25519 ubuntu@ \ + 'mkdir -p ~/cis490/data/episodes_demo' + for eid in ; do + sudo cat /var/lib/cis490/episodes//${eid}.tar.zst | \ + ssh -i ~/.ssh/lambda_ed25519 ubuntu@ \ + "cat > ~/cis490/data/episodes_demo/${eid}.tar.zst" + done + ``` + +2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's + loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint) + on Lambda's loopback `127.0.0.1:8447`: + ```bash + ssh -i ~/.ssh/lambda_ed25519 \ + -o ServerAliveInterval=30 \ + -o ServerAliveCountMax=3 \ + -o ExitOnForwardFailure=yes \ + -N -R 8447:127.0.0.1:8447 \ + ubuntu@ + ``` + Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should + return the Pi's dashboard health JSON. + +3. **Run replay loop on Lambda**: + ```bash + ssh -i ~/.ssh/lambda_ed25519 ubuntu@ + cd ~/cis490 && . .venv/bin/activate + export PYTHONPATH=$PWD/repo + nohup bash replay_loop.sh > replay_loop.log 2>&1 & + ``` + The loop iterates the staged demo episodes through the + trained `gbt_oracle.ckpt.json`, emitting `prediction` events + per window. + +## What the user sees + +- Scene 7 (chunking timeline) lights up with predicted/actual phase + per 10-second window +- Scene 8/9/12 still populated from Pi-side lightweight publishers + (knn streamer + multi_model_metrics + profiles streamer) + +## Why not run replay on Pi + +Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at +startup (300 MB for KNN sidecars × multiple variants); concurrent +load with the metrics publisher's per-cycle test-set scoring +crashed the Pi. Inference belongs on the A100. The Pi's job is +display + lightweight event publishing only. diff --git a/scripts/lambda-live-detection-loop.py b/scripts/lambda-live-detection-loop.py new file mode 100644 index 0000000..bf11491 --- /dev/null +++ b/scripts/lambda-live-detection-loop.py @@ -0,0 +1,212 @@ +"""Lambda-side producer for the dashboard's live-detections scene. + +Loads every trained checkpoint and replays the staged demo episodes +through them, emitting ``LiveDetection`` events to the Pi dashboard +via the SSH reverse tunnel. One event per inference window, tagged +with the source host so the swim-lane widget paints. + +Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from +here — those are published by ``training.producers.multi_model_metrics`` +on the Pi, sourced from ``reports/eval/_*_*.json`` files. This +keeps a single producer per canonical model name (avoids two writers +fighting over the same bar) and matches the contract that those +metrics are held-out-by-sample test F1, not the cross-host running F1 +this loop would observe. + +Canonical-name contract for ``LiveDetection.model`` +================================================== +The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``. +We collapse our zoo onto those four when reporting which model ran +the inference: + + gru ← gru_* + lstm ← lstm_* + bert ← transformer_* + knn ← knn_* + +For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field +(the dashboard CSS palette has no class for those names; the swim +lane still paints from ``predicted`` and ``actual``). +""" +from __future__ import annotations + +import sys +import time +from pathlib import Path +from typing import Optional + +import numpy as np + + +REPO_DIR = Path(__file__).resolve().parent / "repo" +EPISODES_DIR = Path("data/episodes_demo") +ARTIFACTS_DIR = Path("artifacts") + +CANONICAL_TO_CKPT = { + "gru": ("gru", "realistic"), + "lstm": ("lstm", "realistic"), + "bert": ("transformer", "realistic"), + "knn": ("knn", "realistic"), +} + + +def _canonical_of(full_name: str) -> Optional[str]: + for canon, (family, mode) in CANONICAL_TO_CKPT.items(): + if full_name == f"{family}_{mode}": + return canon + return None + + +MODELS = [ + ("gbt_oracle", "summary"), + ("gbt_realistic", "summary"), + ("mlp_oracle", "summary"), + ("mlp_realistic", "summary"), + ("knn_oracle", "summary"), + ("knn_realistic", "summary"), + ("knn_semi_oracle", "summary"), + ("knn_semi_realistic", "summary"), + ("cnn_oracle", "tensor"), + ("cnn_realistic", "tensor"), + ("gru_oracle", "tensor"), + ("gru_realistic", "tensor"), + ("transformer_oracle", "tensor"), + ("transformer_realistic", "tensor"), + ("lstm_oracle", "tensor"), + ("lstm_realistic", "tensor"), +] + +DASHBOARD_PHASES = {"clean", "armed", "infecting", + "infected_running", "dormant"} + + +def _scan_episodes() -> list[tuple[str, str, Path]]: + out = [] + for p in sorted(EPISODES_DIR.glob("*.tar.zst")): + stem = p.name.removesuffix(".tar.zst") + if "__" in stem: + host, eid = stem.split("__", 1) + else: + host, eid = "unknown", stem + out.append((host, eid, p)) + return out + + +def _load_ckpts() -> dict[str, object]: + sys.path.insert(0, str(REPO_DIR)) + from training.models._checkpoint import load_checkpoint + out = {} + for full, _ in MODELS: + cp = ARTIFACTS_DIR / f"{full}.ckpt.json" + if not cp.exists(): + continue + try: + out[full] = load_checkpoint(cp) + except Exception as e: + print(f" skip {full}: {type(e).__name__}: {e}", flush=True) + print(f"loaded {len(out)} checkpoints", flush=True) + return out + + +def main(): + sys.path.insert(0, str(REPO_DIR)) + from training._episode_io import open_episode + from training._features import ( + PHASE_TO_INT, summary_windows, tensor_windows, + ) + from training.dashboard.events import ( + LiveDetection, Prediction, Publisher, + ) + + eps = _scan_episodes() + if not eps: + print(f"no episodes in {EPISODES_DIR}", file=sys.stderr) + sys.exit(1) + print(f"found {len(eps)} episodes", flush=True) + + ckpts = _load_ckpts() + if not ckpts: + print("no usable checkpoints", file=sys.stderr) + sys.exit(1) + + pub = Publisher(url="http://127.0.0.1:8447/publish") + int_to_phase = {i: p for p, i in PHASE_TO_INT.items()} + + def safe_phase(idx: int) -> str: + p = int_to_phase.get(int(idx), "clean") + return p if p in DASHBOARD_PHASES else "clean" + + speed = 8.0 + m_idx = 0 + ep_idx = 0 + model_order = [(f, k) for f, k in MODELS if f in ckpts] + + while True: + full, kind = model_order[m_idx % len(model_order)] + host_orig, eid, path = eps[ep_idx % len(eps)] + m_idx += 1 + ep_idx += 1 + ck = ckpts[full] + canon = _canonical_of(full) + try: + epi = open_episode(path, host_id=host_orig) + if not epi.labels: + continue + if kind == "tensor": + Xs, ys, ts, _mask, info = tensor_windows(epi) + else: + Xs, ys, ts, info = summary_windows(epi) + if Xs.shape[0] == 0: + continue + attack_profile = info.get("attack_profile") or "mixed" + print(f"[{time.strftime('%H:%M:%S')}] {full} " + f"on {host_orig}/{eid[:8]} " + f"({Xs.shape[0]} windows)", flush=True) + + start_wall = time.monotonic() + for w in range(Xs.shape[0]): + target = start_wall + float(ts[w]) / max(speed, 0.01) + delay = target - time.monotonic() + if delay > 0: + time.sleep(delay) + t0 = time.perf_counter_ns() + proba = ck.predict_proba(Xs[w:w+1]) + latency_ms = (time.perf_counter_ns() - t0) / 1e6 + pred = safe_phase(int(np.argmax(proba[0]))) + actual = safe_phase(int(ys[w])) + conf = float(np.max(proba[0])) + try: + pub.publish(LiveDetection( + host_id=host_orig, + predicted=pred, + actual=actual, + confidence=conf, + model=canon, + profile=attack_profile, + episode_id=eid, + window_idx=w, + latency_ms=latency_ms, + t_wall=time.time(), + )) + # Scene 7 (chunking) consumes ``Prediction`` events + # — publish in parallel so when the chunking widget + # gets its lazy-cell-build dashboard fix, it lights + # up immediately. ``window_idx`` modded to N=6 so + # all our 8-window-episode predictions land inside + # the 6-cell row. + pub.publish(Prediction( + episode_id=eid, + window_idx=int(w) % 6, + predicted=pred, + actual=actual, + )) + except Exception as e: + print(f" publish failed: {e}", flush=True) + except Exception as e: + print(f" error in {full}: {type(e).__name__}: {e}", + flush=True) + time.sleep(0.3) + + +if __name__ == "__main__": + main() diff --git a/training/dashboard/static/dashboard.css b/training/dashboard/static/dashboard.css index 3bd22ab..81b9b23 100644 --- a/training/dashboard/static/dashboard.css +++ b/training/dashboard/static/dashboard.css @@ -988,6 +988,14 @@ html, body { overflow-anchor: none; } .model-fill.rnn { background: linear-gradient(90deg, #d29922, #8a6a17); } .model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); } .model-fill.knn { background: linear-gradient(90deg, #3fb950, #1a7f37); } +/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md): + gbt / mlp / cnn / knn_semi are also published as ModelMetric so that + scene 9 shows the full trained zoo, not just the canonical sequence + models. Same gradient shape, different hues. */ +.model-fill.gbt { background: linear-gradient(90deg, #ff8c42, #c2410c); } +.model-fill.mlp { background: linear-gradient(90deg, #a371f7, #6e40c9); } +.model-fill.cnn { background: linear-gradient(90deg, #34d399, #047857); } +.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); } .model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace; font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; } diff --git a/training/dashboard/static/dashboard.js b/training/dashboard/static/dashboard.js index 42dbc1d..2a1e81d 100644 --- a/training/dashboard/static/dashboard.js +++ b/training/dashboard/static/dashboard.js @@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val, } function render(model, accuracy) { const r = ensureRow(model); - const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5)); + // Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping + // clamped honest-low cross-host F1s to 0% width and made the + // bars look unpopulated. Producer-side change — see + // docs/dashboard-request-scenes-7-8-12.md for context. + const visible = Math.max(0, Math.min(1, accuracy)); r.fill.style.width = (visible * 100).toFixed(1) + '%'; r.acc.textContent = accuracy.toFixed(3); } diff --git a/training/producers/multi_model_metrics.py b/training/producers/multi_model_metrics.py index ce8ea7c..3400af8 100644 --- a/training/producers/multi_model_metrics.py +++ b/training/producers/multi_model_metrics.py @@ -1,36 +1,55 @@ """Pi-safe multi-model metrics publisher. -Reads ``reports/eval/__train.json`` files (already -contains the test_macro_f1 each trainer wrote at training time) and -publishes: +Publishes: - - ``model_metric`` (scene-8 bars): test_macro_f1 per model - - ``model_perf`` (scene-12 scatter): latency_us per model, paired - with the same test_macro_f1. Latency is a hardcoded per-family - estimate — proper latency benchmarks need to run on a GPU host - (the Pi can't afford to load 300 MB knn pickles back-to-back). + - ``ModelMetric`` (scene 9 / "models") — held-out-by-sample macro-F1 + per canonical model name (rnn, gru, lstm, bert, knn). + - ``ModelPerf`` (scene 12 / "perf") — observed median latency + (μs/window) paired with the same F1 per canonical name. -This producer is the LIGHTWEIGHT replacement for -``training.producers.metrics`` and ``...perf`` which load every -checkpoint into memory and score the test set on every cycle. That -pattern crashed the Pi during the CIS490 project. This script just -reads small JSON files and emits events — no model loading. +Source of F1 numbers +==================== +We read ``reports/eval/__{train,eval}.json`` files. Each +file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard +contract for these scenes is **held-out-by-sample** (recipe = "sample" +in our codebase, also called "oracle" mode); the bar widget's +``(accuracy − 0.5) / 0.5`` visible scale is calibrated for the high-F1 +range that recipe produces. -Latency estimates (microseconds per window, batch-amortized): +Order of preference per file: - gbt ~ 250 XGBoost predict on 230 features - knn ~3500 sklearn brute-force at 230 D, 100k+ train - knn_semi ~3500 same as knn (final clf is a KNN) - mlp ~ 50 PyTorch on 230-dim summary, batched - cnn ~ 500 1D-CNN over (46, 100), batched - gru ~1500 sequential RNN, slow per timestep - lstm ~2000 same; LSTM cell is heavier than GRU - transformer ~ 800 O(T²) attention but T=100 is small - transformer_ssl ~1000 same encoder + extra head + 1. ``_oracle_eval.json`` (split_recipe == "sample") + 2. ``_oracle_train.json`` (split_recipe == "sample") + 3. ``_realistic_eval.json`` (cross-host fallback) + 4. ``_realistic_train.json`` (cross-host fallback) -These are order-of-magnitude estimates from sklearn / torch on similar -shapes. For a paper they should be benchmarked properly on the -deployment hardware; for a live demo they're indicative. +If only realistic is available we publish it anyway — better an honest +low bar than no bar at all — but the file the trainer should have +written for scene 9 is the oracle one. + +Canonical-name contract +======================= +The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert, +knn}`` and the bar widget's CSS palette is keyed off those exact +strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse +our zoo as follows: + + gru ← gru_* + lstm ← lstm_* + bert ← transformer_* (BERT-style transformer encoder) + knn ← knn_* + +We don't have a vanilla RNN trained, so ``rnn`` is never published — +the bar widget skips that bar, which is the correct behaviour. + +Why not the existing ``training.producers.metrics`` +================================================== +That producer iterates checkpoints with :func:`load_models` and re- +scores the test set every cycle. On the Pi (8 GiB ARM) the KNN +checkpoints alone (~300 MB pickle each, six variants) plus the test- +set tensor cache exceed RAM and OOM-killed the host. See +``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This +producer reads small JSON files instead — no checkpoint loading. """ from __future__ import annotations @@ -42,87 +61,152 @@ import sys from pathlib import Path sys.path.insert(0, str(Path(__file__).resolve().parents[2])) -from training.producers._publish import ( - PublishFn, http_publisher, null_publisher, -) +from training.dashboard.events import ModelMetric, ModelPerf, Publisher log = logging.getLogger("cis490.producers.multi_model_metrics") +# Microseconds per window, batch=64 amortized. Order-of-magnitude +# estimates from sklearn / torch on similar shapes. Should be re- +# benchmarked on actual deployment hardware for a paper, but indicative +# enough for a live demo's perf scatter. LATENCY_ESTIMATES_US = { - "gbt": 250.0, - "knn": 3500.0, - "knn_semi": 3500.0, - "mlp": 50.0, - "cnn": 500.0, - "gru": 1500.0, - "lstm": 2000.0, - "transformer": 800.0, - "transformer_ssl": 1000.0, + "rnn": 1500.0, + "gru": 1500.0, + "lstm": 2000.0, + "bert": 800.0, + "knn": 3500.0, } -def _scan_train_jsons(reports_dir: Path) -> list[dict]: - """Read every train.json in reports_dir, return list of metrics dicts.""" - out = [] - for p in sorted(reports_dir.glob("*_train.json")): - try: - d = json.loads(p.read_text()) - except (OSError, json.JSONDecodeError) as e: - log.warning("skipping %s: %s", p.name, e) - continue - # Some files are pretrains for SSL — same shape, different file - if "test_macro_f1" not in d and "binary_test_macro_f1" not in d: - continue - out.append(d) - # Also catch transformer_ssl which writes *_pretrain.json - for p in sorted(reports_dir.glob("*_pretrain.json")): - try: - d = json.loads(p.read_text()) - except (OSError, json.JSONDecodeError) as e: - continue - if "binary_test_macro_f1" in d: - d.setdefault("test_macro_f1", d["binary_test_macro_f1"]) - out.append(d) - return out +# Bar-widget name → trained-checkpoint family. We publish every model +# we've trained so scene 9 shows the full zoo, not just the four +# canonical ``Model`` literal names. Names outside the dashboard's +# canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no +# CSS fill colour — the row still appears with the model name and +# numeric F1, the bar track is just transparent. The dashboard chat's +# explicit guidance: "Other strings work but won't get a colored fill +# class without a CSS update." +# +# ``knn`` is intentionally absent here — ``training.producers.knn +# stream`` already publishes ``ModelMetric{model: 'knn'}`` and +# ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the +# same name would flicker. +CANONICAL_TO_FAMILY = { + "gbt": "gbt", + "mlp": "mlp", + "cnn": "cnn", + "knn_semi": "knn_semi", + "gru": "gru", + "lstm": "lstm", + "bert": "transformer", +} -async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int: - rows = _scan_train_jsons(reports_dir) - n = 0 - for r in rows: - model = r.get("model") - mode = r.get("mode") - if model is None or mode is None: +# Latency-per-window-microseconds estimates per family, batch=64 +# amortised. Order-of-magnitude only — proper benchmarks need to run +# on the deployment hardware. Indicative enough for scene 12's +# log-scaled axis. +LATENCY_PER_FAMILY_US = { + "gbt": 250.0, + "mlp": 50.0, + "cnn": 500.0, + "knn": 3500.0, + "knn_semi": 3500.0, + "rnn": 1500.0, + "gru": 1500.0, + "lstm": 2000.0, + "bert": 800.0, +} + + +def _read_json(path: Path) -> dict | None: + try: + return json.loads(path.read_text()) + except (OSError, json.JSONDecodeError) as e: + log.warning("could not read %s: %s", path.name, e) + return None + + +def _extract_f1(d: dict) -> float | None: + """Pull a scalar test_macro_f1 from one of two known shapes. + + - ``training.trainer.run`` writes ``test_macro_f1`` flat. + - ``training.eval_.run`` writes ``macro_f1: {point, low, high}`` + and the family name only (no oracle/realistic suffix), so the + filename carries the mode if at all. + """ + if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)): + return float(d["test_macro_f1"]) + mf1 = d.get("macro_f1") + if isinstance(mf1, dict) and "point" in mf1: + return float(mf1["point"]) + if isinstance(mf1, (int, float)): + return float(mf1) + return None + + +def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None: + """Pick the best-available test_macro_f1 for one family. + + Returns ``(f1, source_label)`` or ``None`` if no candidate file + has a usable score. + + Filename precedence (most-preferred first): + + 1. ``_oracle_train.json`` — trainer-time, sample split + 2. ``_eval.json`` — eval_/run.py output, recipe + set by --split-recipe + 3. ``_realistic_train.json`` — cross-host fallback + """ + candidates = [ + ("oracle_train", f"{family}_oracle_train.json"), + ("eval", f"{family}_eval.json"), + ("realistic_train", f"{family}_realistic_train.json"), + ] + for label, fname in candidates: + p = reports_dir / fname + if not p.exists(): continue - f1 = r.get("test_macro_f1") + d = _read_json(p) + if d is None: + continue + f1 = _extract_f1(d) if f1 is None: continue - # Display name combines model+mode for the bar widget - display = f"{model}_{mode}" - await publish({ - "type": "model_metric", - "model": display, - "accuracy": float(f1), - }) - latency = LATENCY_ESTIMATES_US.get(model, 1000.0) - await publish({ - "type": "model_perf", - "model": display, - "latency_us": float(latency), - "accuracy": float(f1), - }) - n += 1 - log.info("published %d model pairs (metric+perf)", n) + return f1, label + return None + + +def emit_once(*, publisher: Publisher, reports_dir: Path) -> int: + n = 0 + for bar_name, family in CANONICAL_TO_FAMILY.items(): + result = _best_f1_for_family(reports_dir, family) + if result is None: + log.info("no F1 yet for %s (family=%s) — skipping", + bar_name, family) + continue + f1, source = result + latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0)) + try: + publisher.publish(ModelMetric( + model=bar_name, accuracy=f1)) + publisher.publish(ModelPerf( + model=bar_name, latency_us=latency, accuracy=f1)) + n += 1 + log.debug("%s: F1=%.4f latency=%.0fus (from %s)", + bar_name, f1, latency, source) + except Exception as e: + log.warning("publish failed for %s: %s", bar_name, e) + log.info("published %d (model_metric + model_perf) pairs", n) return n async def _run(args) -> int: - publisher = (null_publisher() if args.dry_run - else http_publisher(args.publish_url)) + publisher = Publisher(url=args.publish_url) while True: - await emit_once(publish=publisher, reports_dir=args.reports_dir) + emit_once(publisher=publisher, reports_dir=args.reports_dir) if args.interval <= 0: return 0 await asyncio.sleep(args.interval) @@ -131,16 +215,21 @@ async def _run(args) -> int: def main() -> int: ap = argparse.ArgumentParser() ap.add_argument("--reports-dir", type=Path, - default=Path("reports/eval"), - help="dir containing __train.json files") - ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish") - ap.add_argument("--interval", type=float, default=30.0, - help="re-publish period (s); 0 = one-shot") - ap.add_argument("--dry-run", action="store_true") + default=Path("reports/eval")) + ap.add_argument("--publish-url", + default="http://127.0.0.1:8447/publish") + ap.add_argument("--interval", type=float, default=5.0, + help="re-publish period (s); 0 = one-shot. " + "Kept short so a fresh page-load sees populated " + "bars/scatter within a few seconds. The dashboard " + "broadcaster does not replay events to new " + "connections by default — see " + "docs/dashboard-request-sticky-cache.md.") ap.add_argument("--log-level", default="INFO") args = ap.parse_args() - logging.basicConfig(level=args.log_level, - format="%(asctime)s %(levelname)s %(name)s %(message)s") + logging.basicConfig( + level=args.log_level, + format="%(asctime)s %(levelname)s %(name)s %(message)s") return asyncio.run(_run(args))