scene 9 bars: paint full zoo + 0–1 visible scale

- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
  gru / lstm / bert (knn handled by knn streamer); read both
  *_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
  non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
  range so honest-low cross-host F1s show as a bar instead of
  clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
  (scenes 7/8/12, sticky cache, lambda-inference-demo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max 2026-05-08 17:18:00 -05:00
parent 06bfcef3d6
commit c2a71de4b2
7 changed files with 726 additions and 97 deletions


@ -0,0 +1,180 @@
# Dashboard request — scenes 7, 8, 12 visibility fixes
**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side (this session):**
* `training/producers/multi_model_metrics.py` — publishes
`ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru,
lstm, bert** (every 5 s)
* `training/producers/knn.py stream` — publishes `ModelMetric`+
`ModelPerf` for **knn**
* Lambda-side `scripts/lambda-live-detection-loop.py` — publishes
`LiveDetection` **and now also `Prediction`** events per inference
window
All confirmed delivering (`{"delivered":N}` from `/publish`).
Visibility issues are all in `training/dashboard/static/dashboard.js`.
The user has flagged this twice now: scene 7 (chunking) and scene 9
(model bars) are not showing real-data state in deck mode. The events
exist; the widgets just don't render them. **This is the blocker
for the talk.**
---
## Scene 7 — chunking timeline (`#chunk-row`)
**Problem.** Cells are only built inside `buildExample()`, which is wired
to `demo_start`. The `prediction` handler can only update existing
cells:
```js
on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  const cells = rowEl.querySelectorAll('.chunk-cell');
  const cell = cells[m.window_idx];
  if (!cell) return;            // ← always returns here if demo_start never built the cells
  ...
});
```
If a real `prediction` event arrives without `demo_start` having
fired first, `cells.length === 0` and the event is silently dropped.
**Why we can't just publish `demo_start` from this side.** It has
destructive side-effects on other scenes: scene-9 (KNN scatter)
loads synthetic data on `demo_start`, scene-attack profile loads
synthetic curves on `demo_start`, etc. We tried this once and
clobbered the live KNN scatter.
**Fix request.** Lazy cell-build inside the `prediction` handler when
no cells exist yet:
```js
on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) {
    // First prediction: clear the empty-state placeholder. Cells are
    // appended lazily below, so the row's width grows as windows arrive.
    rowEl.innerHTML = '';
    ruleEl.innerHTML = '';
    axisEl.innerHTML = '';
  }
  // Ensure cell at index exists; pad with empty cells up to window_idx.
  let cells = rowEl.querySelectorAll('.chunk-cell');
  while (cells.length <= m.window_idx) {
    const c = document.createElement('div');
    c.className = 'chunk-cell';
    c.textContent = '';
    rowEl.appendChild(c);
    ruleEl.appendChild(Object.assign(
      document.createElement('div'), { className: 'tick' }));
    const t = document.createElement('span');
    // cells still holds the pre-append list, so its length is the new cell's index.
    t.textContent = `${cells.length * 10}s`;
    axisEl.appendChild(t);
    cells = rowEl.querySelectorAll('.chunk-cell');
  }
  const cell = cells[m.window_idx];
  const phase = m.predicted || m.actual;
  if (!phase) return;
  cell.className = `chunk-cell ${phase}`;
  cell.textContent = phase.replace('_', ' ');
});
```
This keeps `demo_start`/`demo_stop` working and additionally lights up
the row from real `prediction` events.
If the Lambda producer re-runs episodes from window 0, you may also
want a reset on `prediction` events with `window_idx === 0` (clear all
cells, rebuild fresh). We can publish a `prediction_reset` event too
if you'd prefer an explicit signal — let us know.
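
The padding and reset behaviour described above can be modelled outside the browser. This is a minimal Python sketch, not dashboard code: `ChunkRow` is a hypothetical stand-in for `#chunk-row`, and treating `window_idx == 0` as the reset trigger is the assumption under discussion, not something the dashboard does today.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChunkRow:
    """Hypothetical in-memory stand-in for the #chunk-row element."""

    cells: list[str] = field(default_factory=list)  # "" means an empty cell

    def on_prediction(self, window_idx: int, phase: str | None) -> None:
        if window_idx == 0:
            # Assumed reset rule: window 0 marks the start of a new episode.
            self.cells.clear()
        # Pad with empty cells until index window_idx exists.
        while len(self.cells) <= window_idx:
            self.cells.append("")
        if phase:
            self.cells[window_idx] = phase


if __name__ == "__main__":
    row = ChunkRow()
    row.on_prediction(2, "armed")   # arrives before windows 0 and 1
    print(row.cells)
    row.on_prediction(0, "clean")   # new episode: row is rebuilt
    print(row.cells)
```

One caveat with the implicit rule: the Lambda loop in this commit publishes `window_idx` modulo 6, so index 0 can recur in the middle of an episode. An explicit `prediction_reset` event would avoid that ambiguity.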
---
## Scene 8 — model accuracy bars (`.model-row`)
**Problem.** The bar fill formula compresses to nothing for any
F1 < 0.5:
```js
const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100;
```
Our trained models on the cross-device test split honestly land in the
0.30–0.55 range (this is the **point** of held-out-by-host evaluation:
real generalization is hard). With the current scale, at least half the
bars render as 0% wide and look like there's no data flowing.
**Fix request.** Either:
(a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled:
```js
const visiblePct = Math.max(0, Math.min(1, acc)) * 100;
```
(b) Or add the numeric F1 next to the empty-looking bars (we already
publish it in `accuracy`); the right-hand `.model-acc` element does
already render `acc.toFixed(3)` so this may already be readable —
verify that's still being shown when fill is 0%.
We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the
user "no data" when the truth is "the model is honestly not great
under cross-host generalization." That's the headline finding.
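
To make the difference concrete, here is a small Python sketch of the two mappings. The function names are made up for illustration; the formulas mirror the two JS expressions quoted above.

```python
def visible_pct_old(acc: float) -> float:
    """Current mapping: only the 0.5-1.0 band of F1 is visible."""
    return max(0.0, min(1.0, (acc - 0.5) / 0.5)) * 100.0


def visible_pct_full(acc: float) -> float:
    """Option (a): the full 0-1 range maps linearly onto bar width."""
    return max(0.0, min(1.0, acc)) * 100.0


if __name__ == "__main__":
    for f1 in (0.30, 0.35, 0.55, 0.90):
        print(f"F1={f1:.2f}  old={visible_pct_old(f1):5.1f}%  "
              f"full={visible_pct_full(f1):5.1f}%")
```

Under the current mapping every F1 at or below 0.5 collapses to a 0% bar, and an F1 of 0.55 shows as roughly 10%. Under option (a) the bar width simply equals the F1 as a percentage.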
---
## Scene 12 — accuracy vs inference cost scatter
**Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar
high range). Every model with F1 < 0.7 stacks on the bottom edge.
**Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published
range with a small margin). The chart's whole point is "model honesty
under cross-device shift" — letting bad models show as bad is the
right answer.
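
If auto-fit is the preferred route, the range computation could look like the following Python sketch. `fit_axis` and the 0.05 default margin are illustrative assumptions, not existing dashboard code.

```python
def fit_axis(values: list[float], margin: float = 0.05) -> tuple[float, float]:
    """Return a (lo, hi) y-range covering the values, clamped to [0, 1]."""
    if not values:
        return (0.0, 1.0)  # nothing published yet: show the full range
    lo = max(0.0, min(values) - margin)
    hi = min(1.0, max(values) + margin)
    if hi - lo < 1e-9:
        return (0.0, 1.0)  # degenerate range (only possible with margin=0)
    return (lo, hi)


if __name__ == "__main__":
    print(fit_axis([0.30, 0.42, 0.55]))
```

Clamping to `[0, 1]` keeps the axis meaningful for F1, and the fallback to the full range avoids a zero-height chart before any events arrive.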
**Problem B: overlapping labels.** Multiple points at the same
y-coordinate (especially when stacked at the floor) draw their model
name labels on top of each other. We've already shortened the
displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`,
etc., max 6 chars). That helps but doesn't fully solve it when 5+
points cluster.
**Fix request, pick whichever is easiest:**
1. Skip label rendering when point density is high (only label points
that are local extrema, e.g. best F1, lowest latency, or
non-Pareto-dominated points).
2. Offset overlapping labels with a force layout (`d3-force` style) or
even just a fixed alternating up/down/left/right pattern.
3. Show labels only on hover, with a small dot-only render at rest.
Option (3) is the cleanest visually and matches how most real "model
zoo" scatters render in papers.
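
For option (1), selecting the non-Pareto-dominated points is a short computation. The sketch below is in Python for illustration only; `Point` and `pareto_labels` are hypothetical names, and the sample values are made up, not measured results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    latency_us: float  # lower is better
    f1: float          # higher is better


def pareto_labels(points: list[Point]) -> set[str]:
    """Names of points that no other point dominates.

    q dominates p when q is at least as fast and at least as accurate,
    and strictly better on one of the two axes.
    """
    keep: set[str] = set()
    for p in points:
        dominated = any(
            q.latency_us <= p.latency_us and q.f1 >= p.f1
            and (q.latency_us < p.latency_us or q.f1 > p.f1)
            for q in points
        )
        if not dominated:
            keep.add(p.name)
    return keep


if __name__ == "__main__":
    # Illustrative values only.
    pts = [
        Point("mlp-R", 50.0, 0.40),
        Point("gbt-O", 250.0, 0.55),
        Point("cnn-R", 500.0, 0.45),
        Point("knn-O", 3500.0, 0.50),
    ]
    print(sorted(pareto_labels(pts)))
```

The quadratic scan is fine for a zoo of a dozen models; only the returned names would get a text label, the rest render as dots.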
---
## Verification after dashboard JS lands
Producer side keeps publishing on these channels (already running on
the Pi + Lambda):
- `prediction` (scene 7) — once Lambda producer is re-pointed at
scene 7 events, see request below
- `model_metric` + `model_perf` (scenes 8, 12), every 5 s from
`multi_model_metrics.py` on the Pi
- `live_detection` (scene-live) — continuously from Lambda
Open the dashboard, watch each scene. Empty-state placeholders should
disappear within ~30 s of page load.
---
## Side note for scene 7 — currently no `prediction` events flow
The Lambda producer (`live_detection_loop_v2.py`) currently emits
`live_detection` events for the scene-live swim lanes. If you want
scene 7 lit up with the same data, we can mirror per-window output to
the `prediction` event type as well — say the word and we'll add a
second emit. Doing that without the lazy-cell-build above accomplishes
nothing on the dashboard, so let us wait on this until the JS lands.


@ -0,0 +1,62 @@
# Dashboard request — sticky cache for slowly-changing event types
**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side:** `training/producers/multi_model_metrics.py`
(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
Lambda-side `live_detection_loop_v2.py` (scene 13).
## Problem
The broadcaster fans events out to **currently-connected** browsers
only. Reconnects (page refresh, second tab opening, mid-talk page
reload) see empty widgets until the next producer tick rebroadcasts.
The user has explicitly flagged this as a bug:
> "Your functions need to be more stateful, when we call your data it
> needs to be available right away. For the streaming data, when we
> call a new page it needs to connect correctly."
The broadcaster already does sticky caching for some keys — its
`/healthz` reports cached state under `host_counts`, `phase_mix`,
`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
What's missing is sticky caching for the model + scatter + embedding
event types.
## Producer-side band-aid (already in place)
We've shortened the multi_model_metrics tick from 20 s → **5 s** so
worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for
the talk but not the right architecture — at 5 s × 4 events × 2
event types we're spending bandwidth and CPU on retransmits the
broadcaster could just remember.
## Asks
Please add sticky caching to the broadcaster for these event types:
| event type | scene | key | TTL | replay-on-connect? |
|-------------------|-------|--------------------|-------|---------------------|
| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes |
| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes |
| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
| `embedding` | 11 | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes |
| `prediction` | 8 | one entry per `(episode_id, window_idx)` (last value wins) | none | yes |
Implementation suggestion: extend the broadcaster's existing
state-keys cache with a per-event-type "sticky map." On new client
connect, replay the cache before any live event reaches the new
client.
For `live_detection` the right structure is a ring-buffer (60 cells
per lane match the widget's DOM cap; replaying 60 newest events lets
a new browser paint the lanes immediately).
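
A rough Python sketch of the sticky map plus ring buffer follows. It makes several assumptions: events are plain dicts with a `type` field (the shape the old producer code published), `StickyCache`, `remember` and `replay` are hypothetical names rather than the broadcaster's real API, and the `embedding` snapshot is left out because its replace pattern lives in the companion request.

```python
from __future__ import annotations

from collections import deque
from typing import Any, Callable, Iterator

Event = dict[str, Any]

# Event type -> function extracting the "last value wins" key.
KEYED: dict[str, Callable[[Event], Any]] = {
    "model_metric": lambda e: e["model"],
    "model_perf": lambda e: e["model"],
    "attack_profile": lambda e: e["name"],
    "prediction": lambda e: (e["episode_id"], e["window_idx"]),
}
# Event type -> ring-buffer capacity (60 matches the widget's DOM cap).
RING_SIZES = {"live_detection": 60}


class StickyCache:
    def __init__(self) -> None:
        self._maps: dict[str, dict[Any, Event]] = {t: {} for t in KEYED}
        self._rings: dict[str, deque[Event]] = {
            t: deque(maxlen=n) for t, n in RING_SIZES.items()
        }

    def remember(self, event: Event) -> None:
        """Record an event on its way out to connected clients."""
        etype = event.get("type")
        if etype in KEYED:
            self._maps[etype][KEYED[etype](event)] = event
        elif etype in RING_SIZES:
            self._rings[etype].append(event)  # oldest entry drops off when full

    def replay(self) -> Iterator[Event]:
        """Yield cached events for a newly connected client."""
        for entries in self._maps.values():
            yield from entries.values()
        for ring in self._rings.values():
            yield from ring
```

The broadcaster would call `remember` for every published event and drain `replay()` to a new client before forwarding any live event, which is the ordering requested above.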
## Verification
After this lands, our producers can drop their republish cadence
back to a sane 30 s + on-change-only, and a cold page-load on
`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
We'll also drop the 5 s tick on `multi_model_metrics` once we
verify replay works.


@ -0,0 +1,74 @@
# Live inference demo — Lambda runs replay, Pi shows predictions
Architecture for the live "catching attacks" demo (scene 7 chunking
timeline). Pi cannot run inference (RAM-bound; crashed once); all
model loading + per-window prediction must live on the A100.
## Topology
```
Pi (office-print, 10.100.0.1)             Lambda A100 (ssh ubuntu@<ip>)
┌───────────────────────────┐             ┌───────────────────────────┐
│ dashboard.wg              │             │ replay.py running on      │
│ /publish (loopback only)  │             │ episode tarballs through  │
│   ↑                       │             │ gbt_oracle.ckpt.json      │
│   │ POST                  │             │   ↓                       │
│   │ via SSH reverse tunnel│             │ POST 127.0.0.1:8447       │
│   │                       │             │   ↑                       │
│   └─── ssh -R 8447:... ───┼─────────────┤   │                       │
│                           │             └───────────────────────────┘
└───────────────────────────┘
```
## Setup steps
1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi):
```bash
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
  'mkdir -p ~/cis490/data/episodes_demo'
for eid in <episode-ids>; do
  sudo cat /var/lib/cis490/episodes/<host>/${eid}.tar.zst | \
    ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
      "cat > ~/cis490/data/episodes_demo/${eid}.tar.zst"
done
```
2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's
loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint)
on Lambda's loopback `127.0.0.1:8447`:
```bash
ssh -i ~/.ssh/lambda_ed25519 \
-o ServerAliveInterval=30 \
-o ServerAliveCountMax=3 \
-o ExitOnForwardFailure=yes \
-N -R 8447:127.0.0.1:8447 \
ubuntu@<lambda-ip>
```
Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should
return the Pi's dashboard health JSON.
3. **Run replay loop on Lambda**:
```bash
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip>
cd ~/cis490 && . .venv/bin/activate
export PYTHONPATH=$PWD/repo
nohup bash replay_loop.sh > replay_loop.log 2>&1 &
```
The loop iterates the staged demo episodes through the
trained `gbt_oracle.ckpt.json`, emitting `prediction` events
per window.
## What the user sees
- Scene 7 (chunking timeline) lights up with predicted/actual phase
per 10-second window
- Scene 8/9/12 still populated from Pi-side lightweight publishers
(knn streamer + multi_model_metrics + profiles streamer)
## Why not run replay on Pi
Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at
startup (300 MB for KNN sidecars × multiple variants); concurrent
load with the metrics publisher's per-cycle test-set scoring
crashed the Pi. Inference belongs on the A100. The Pi's job is
display + lightweight event publishing only.


@ -0,0 +1,212 @@
"""Lambda-side producer for the dashboard's live-detections scene.

Loads every trained checkpoint and replays the staged demo episodes
through them, emitting ``LiveDetection`` events to the Pi dashboard
via the SSH reverse tunnel. One event per inference window, tagged
with the source host so the swim-lane widget paints.

Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from
here; those are published by ``training.producers.multi_model_metrics``
on the Pi, sourced from ``reports/eval/<family>_*_*.json`` files. This
keeps a single producer per canonical model name (avoids two writers
fighting over the same bar) and matches the contract that those
metrics are held-out-by-sample test F1, not the cross-host running F1
this loop would observe.

Canonical-name contract for ``LiveDetection.model``
===================================================
The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``.
We collapse our zoo onto four of those names (``rnn`` is never used)
when reporting which model ran the inference:

    gru   <- gru_*
    lstm  <- lstm_*
    bert  <- transformer_*
    knn   <- knn_*

For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field
(the dashboard CSS palette has no class for those names; the swim
lane still paints from ``predicted`` and ``actual``).
"""
from __future__ import annotations
import sys
import time
from pathlib import Path
from typing import Optional
import numpy as np
REPO_DIR = Path(__file__).resolve().parent / "repo"
EPISODES_DIR = Path("data/episodes_demo")
ARTIFACTS_DIR = Path("artifacts")
CANONICAL_TO_CKPT = {
"gru": ("gru", "realistic"),
"lstm": ("lstm", "realistic"),
"bert": ("transformer", "realistic"),
"knn": ("knn", "realistic"),
}
def _canonical_of(full_name: str) -> Optional[str]:
    for canon, (family, mode) in CANONICAL_TO_CKPT.items():
        if full_name == f"{family}_{mode}":
            return canon
    return None
MODELS = [
("gbt_oracle", "summary"),
("gbt_realistic", "summary"),
("mlp_oracle", "summary"),
("mlp_realistic", "summary"),
("knn_oracle", "summary"),
("knn_realistic", "summary"),
("knn_semi_oracle", "summary"),
("knn_semi_realistic", "summary"),
("cnn_oracle", "tensor"),
("cnn_realistic", "tensor"),
("gru_oracle", "tensor"),
("gru_realistic", "tensor"),
("transformer_oracle", "tensor"),
("transformer_realistic", "tensor"),
("lstm_oracle", "tensor"),
("lstm_realistic", "tensor"),
]
DASHBOARD_PHASES = {"clean", "armed", "infecting",
"infected_running", "dormant"}
def _scan_episodes() -> list[tuple[str, str, Path]]:
    out = []
    for p in sorted(EPISODES_DIR.glob("*.tar.zst")):
        stem = p.name.removesuffix(".tar.zst")
        if "__" in stem:
            host, eid = stem.split("__", 1)
        else:
            host, eid = "unknown", stem
        out.append((host, eid, p))
    return out


def _load_ckpts() -> dict[str, object]:
    sys.path.insert(0, str(REPO_DIR))
    from training.models._checkpoint import load_checkpoint
    out = {}
    for full, _ in MODELS:
        cp = ARTIFACTS_DIR / f"{full}.ckpt.json"
        if not cp.exists():
            continue
        try:
            out[full] = load_checkpoint(cp)
        except Exception as e:
            print(f" skip {full}: {type(e).__name__}: {e}", flush=True)
    print(f"loaded {len(out)} checkpoints", flush=True)
    return out
def main():
    sys.path.insert(0, str(REPO_DIR))
    from training._episode_io import open_episode
    from training._features import (
        PHASE_TO_INT, summary_windows, tensor_windows,
    )
    from training.dashboard.events import (
        LiveDetection, Prediction, Publisher,
    )

    eps = _scan_episodes()
    if not eps:
        print(f"no episodes in {EPISODES_DIR}", file=sys.stderr)
        sys.exit(1)
    print(f"found {len(eps)} episodes", flush=True)
    ckpts = _load_ckpts()
    if not ckpts:
        print("no usable checkpoints", file=sys.stderr)
        sys.exit(1)
    pub = Publisher(url="http://127.0.0.1:8447/publish")
    int_to_phase = {i: p for p, i in PHASE_TO_INT.items()}

    def safe_phase(idx: int) -> str:
        p = int_to_phase.get(int(idx), "clean")
        return p if p in DASHBOARD_PHASES else "clean"

    speed = 8.0
    m_idx = 0
    ep_idx = 0
    model_order = [(f, k) for f, k in MODELS if f in ckpts]
    while True:
        full, kind = model_order[m_idx % len(model_order)]
        host_orig, eid, path = eps[ep_idx % len(eps)]
        m_idx += 1
        ep_idx += 1
        ck = ckpts[full]
        canon = _canonical_of(full)
        try:
            epi = open_episode(path, host_id=host_orig)
            if not epi.labels:
                continue
            if kind == "tensor":
                Xs, ys, ts, _mask, info = tensor_windows(epi)
            else:
                Xs, ys, ts, info = summary_windows(epi)
            if Xs.shape[0] == 0:
                continue
            attack_profile = info.get("attack_profile") or "mixed"
            print(f"[{time.strftime('%H:%M:%S')}] {full} "
                  f"on {host_orig}/{eid[:8]} "
                  f"({Xs.shape[0]} windows)", flush=True)
            start_wall = time.monotonic()
            for w in range(Xs.shape[0]):
                target = start_wall + float(ts[w]) / max(speed, 0.01)
                delay = target - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                t0 = time.perf_counter_ns()
                proba = ck.predict_proba(Xs[w:w+1])
                latency_ms = (time.perf_counter_ns() - t0) / 1e6
                pred = safe_phase(int(np.argmax(proba[0])))
                actual = safe_phase(int(ys[w]))
                conf = float(np.max(proba[0]))
                try:
                    pub.publish(LiveDetection(
                        host_id=host_orig,
                        predicted=pred,
                        actual=actual,
                        confidence=conf,
                        model=canon,
                        profile=attack_profile,
                        episode_id=eid,
                        window_idx=w,
                        latency_ms=latency_ms,
                        t_wall=time.time(),
                    ))
                    # Scene 7 (chunking) consumes ``Prediction`` events
                    # — publish in parallel so when the chunking widget
                    # gets its lazy-cell-build dashboard fix, it lights
                    # up immediately. ``window_idx`` modded to N=6 so
                    # all our 8-window-episode predictions land inside
                    # the 6-cell row.
                    pub.publish(Prediction(
                        episode_id=eid,
                        window_idx=int(w) % 6,
                        predicted=pred,
                        actual=actual,
                    ))
                except Exception as e:
                    print(f" publish failed: {e}", flush=True)
        except Exception as e:
            print(f" error in {full}: {type(e).__name__}: {e}",
                  flush=True)
        time.sleep(0.3)


if __name__ == "__main__":
    main()


@ -988,6 +988,14 @@ html, body { overflow-anchor: none; }
 .model-fill.rnn { background: linear-gradient(90deg, #d29922, #8a6a17); }
 .model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); }
 .model-fill.knn { background: linear-gradient(90deg, #3fb950, #1a7f37); }
+/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md):
+   gbt / mlp / cnn / knn_semi are also published as ModelMetric so that
+   scene 9 shows the full trained zoo, not just the canonical sequence
+   models. Same gradient shape, different hues. */
+.model-fill.gbt { background: linear-gradient(90deg, #ff8c42, #c2410c); }
+.model-fill.mlp { background: linear-gradient(90deg, #a371f7, #6e40c9); }
+.model-fill.cnn { background: linear-gradient(90deg, #34d399, #047857); }
+.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); }
 .model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
   font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; }


@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val,
   }
   function render(model, accuracy) {
     const r = ensureRow(model);
-    const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5));
+    // Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping
+    // clamped honest-low cross-host F1s to 0% width and made the
+    // bars look unpopulated. Producer-side change — see
+    // docs/dashboard-request-scenes-7-8-12.md for context.
+    const visible = Math.max(0, Math.min(1, accuracy));
     r.fill.style.width = (visible * 100).toFixed(1) + '%';
     r.acc.textContent = accuracy.toFixed(3);
   }


@ -1,36 +1,55 @@
"""Pi-safe multi-model metrics publisher. """Pi-safe multi-model metrics publisher.
Reads ``reports/eval/<model>_<mode>_train.json`` files (already Publishes:
contains the test_macro_f1 each trainer wrote at training time) and
publishes:
- ``model_metric`` (scene-8 bars): test_macro_f1 per model - ``ModelMetric`` (scene 9 / "models") held-out-by-sample macro-F1
- ``model_perf`` (scene-12 scatter): latency_us per model, paired per canonical model name (rnn, gru, lstm, bert, knn).
with the same test_macro_f1. Latency is a hardcoded per-family - ``ModelPerf`` (scene 12 / "perf") observed median latency
estimate proper latency benchmarks need to run on a GPU host (μs/window) paired with the same F1 per canonical name.
(the Pi can't afford to load 300 MB knn pickles back-to-back).
This producer is the LIGHTWEIGHT replacement for Source of F1 numbers
``training.producers.metrics`` and ``...perf`` which load every ====================
checkpoint into memory and score the test set on every cycle. That We read ``reports/eval/<family>_<mode>_{train,eval}.json`` files. Each
pattern crashed the Pi during the CIS490 project. This script just file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard
reads small JSON files and emits events no model loading. contract for these scenes is **held-out-by-sample** (recipe = "sample"
in our codebase, also called "oracle" mode); the bar widget's
``(accuracy - 0.5) / 0.5`` visible scale is calibrated for the high-F1
range that recipe produces.
Latency estimates (microseconds per window, batch-amortized): Order of preference per file:
gbt ~ 250 XGBoost predict on 230 features 1. ``<family>_oracle_eval.json`` (split_recipe == "sample")
knn ~3500 sklearn brute-force at 230 D, 100k+ train 2. ``<family>_oracle_train.json`` (split_recipe == "sample")
knn_semi ~3500 same as knn (final clf is a KNN) 3. ``<family>_realistic_eval.json`` (cross-host fallback)
mlp ~ 50 PyTorch on 230-dim summary, batched 4. ``<family>_realistic_train.json`` (cross-host fallback)
cnn ~ 500 1D-CNN over (46, 100), batched
gru ~1500 sequential RNN, slow per timestep
lstm ~2000 same; LSTM cell is heavier than GRU
transformer ~ 800 O(T²) attention but T=100 is small 4. ``<family>_realistic_train.json`` (cross-host fallback)
transformer_ssl ~1000 same encoder + extra head
These are order-of-magnitude estimates from sklearn / torch on similar If only realistic is available we publish it anyway better an honest
shapes. For a paper they should be benchmarked properly on the low bar than no bar at all but the file the trainer should have
deployment hardware; for a live demo they're indicative. written for scene 9 is the oracle one.
Canonical-name contract
=======================
The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert,
knn}`` and the bar widget's CSS palette is keyed off those exact
strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse
our zoo as follows:
gru   <- gru_*
lstm  <- lstm_*
bert  <- transformer_* (BERT-style transformer encoder)
knn   <- knn_*
We don't have a vanilla RNN trained, so ``rnn`` is never published —
the bar widget skips that bar, which is the correct behaviour.
Why not the existing ``training.producers.metrics``
==================================================
That producer iterates checkpoints with :func:`load_models` and re-
scores the test set every cycle. On the Pi (8 GiB ARM) the KNN
checkpoints alone (~300 MB pickle each, six variants) plus the test-
set tensor cache exceed RAM and OOM-killed the host. See
``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This
producer reads small JSON files instead no checkpoint loading.
""" """
from __future__ import annotations from __future__ import annotations
@ -42,87 +61,152 @@ import sys
from pathlib import Path from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[2])) sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from training.producers._publish import ( from training.dashboard.events import ModelMetric, ModelPerf, Publisher
PublishFn, http_publisher, null_publisher,
)
log = logging.getLogger("cis490.producers.multi_model_metrics") log = logging.getLogger("cis490.producers.multi_model_metrics")
# Microseconds per window, batch=64 amortized. Order-of-magnitude
# estimates from sklearn / torch on similar shapes. Should be re-
# benchmarked on actual deployment hardware for a paper, but indicative
# enough for a live demo's perf scatter.
LATENCY_ESTIMATES_US = { LATENCY_ESTIMATES_US = {
"gbt": 250.0, "rnn": 1500.0,
"knn": 3500.0, "gru": 1500.0,
"knn_semi": 3500.0, "lstm": 2000.0,
"mlp": 50.0, "bert": 800.0,
"cnn": 500.0, "knn": 3500.0,
"gru": 1500.0,
"lstm": 2000.0,
"transformer": 800.0,
"transformer_ssl": 1000.0,
} }
def _scan_train_jsons(reports_dir: Path) -> list[dict]: # Bar-widget name → trained-checkpoint family. We publish every model
"""Read every train.json in reports_dir, return list of metrics dicts.""" # we've trained so scene 9 shows the full zoo, not just the four
out = [] # canonical ``Model`` literal names. Names outside the dashboard's
for p in sorted(reports_dir.glob("*_train.json")): # canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no
try: # CSS fill colour — the row still appears with the model name and
d = json.loads(p.read_text()) # numeric F1, the bar track is just transparent. The dashboard chat's
except (OSError, json.JSONDecodeError) as e: # explicit guidance: "Other strings work but won't get a colored fill
log.warning("skipping %s: %s", p.name, e) # class without a CSS update."
continue #
# Some files are pretrains for SSL — same shape, different file # ``knn`` is intentionally absent here — ``training.producers.knn
if "test_macro_f1" not in d and "binary_test_macro_f1" not in d: # stream`` already publishes ``ModelMetric{model: 'knn'}`` and
continue # ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the
out.append(d) # same name would flicker.
# Also catch transformer_ssl which writes *_pretrain.json CANONICAL_TO_FAMILY = {
for p in sorted(reports_dir.glob("*_pretrain.json")): "gbt": "gbt",
try: "mlp": "mlp",
d = json.loads(p.read_text()) "cnn": "cnn",
except (OSError, json.JSONDecodeError) as e: "knn_semi": "knn_semi",
continue "gru": "gru",
if "binary_test_macro_f1" in d: "lstm": "lstm",
d.setdefault("test_macro_f1", d["binary_test_macro_f1"]) "bert": "transformer",
out.append(d) }
return out
async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int: # Latency-per-window-microseconds estimates per family, batch=64
rows = _scan_train_jsons(reports_dir) # amortised. Order-of-magnitude only — proper benchmarks need to run
n = 0 # on the deployment hardware. Indicative enough for scene 12's
for r in rows: # log-scaled axis.
model = r.get("model") LATENCY_PER_FAMILY_US = {
mode = r.get("mode") "gbt": 250.0,
if model is None or mode is None: "mlp": 50.0,
"cnn": 500.0,
"knn": 3500.0,
"knn_semi": 3500.0,
"rnn": 1500.0,
"gru": 1500.0,
"lstm": 2000.0,
"bert": 800.0,
}
def _read_json(path: Path) -> dict | None:
try:
return json.loads(path.read_text())
except (OSError, json.JSONDecodeError) as e:
log.warning("could not read %s: %s", path.name, e)
return None
def _extract_f1(d: dict) -> float | None:
"""Pull a scalar test_macro_f1 from one of two known shapes.
- ``training.trainer.run`` writes ``test_macro_f1`` flat.
- ``training.eval_.run`` writes ``macro_f1: {point, low, high}``
and the family name only (no oracle/realistic suffix), so the
filename carries the mode if at all.
"""
if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)):
return float(d["test_macro_f1"])
mf1 = d.get("macro_f1")
if isinstance(mf1, dict) and "point" in mf1:
return float(mf1["point"])
if isinstance(mf1, (int, float)):
return float(mf1)
return None
def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None:
"""Pick the best-available test_macro_f1 for one family.
Returns ``(f1, source_label)`` or ``None`` if no candidate file
has a usable score.
Filename precedence (most-preferred first):
1. ``<family>_oracle_train.json`` trainer-time, sample split
2. ``<family>_eval.json`` eval_/run.py output, recipe
set by --split-recipe
3. ``<family>_realistic_train.json`` cross-host fallback
"""
candidates = [
("oracle_train", f"{family}_oracle_train.json"),
("eval", f"{family}_eval.json"),
("realistic_train", f"{family}_realistic_train.json"),
]
for label, fname in candidates:
p = reports_dir / fname
if not p.exists():
continue continue
f1 = r.get("test_macro_f1") d = _read_json(p)
if d is None:
continue
f1 = _extract_f1(d)
if f1 is None: if f1 is None:
continue continue
# Display name combines model+mode for the bar widget return f1, label
display = f"{model}_{mode}" return None
await publish({
"type": "model_metric",
"model": display, def emit_once(*, publisher: Publisher, reports_dir: Path) -> int:
"accuracy": float(f1), n = 0
}) for bar_name, family in CANONICAL_TO_FAMILY.items():
latency = LATENCY_ESTIMATES_US.get(model, 1000.0) result = _best_f1_for_family(reports_dir, family)
await publish({ if result is None:
"type": "model_perf", log.info("no F1 yet for %s (family=%s) — skipping",
"model": display, bar_name, family)
"latency_us": float(latency), continue
"accuracy": float(f1), f1, source = result
}) latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0))
n += 1 try:
log.info("published %d model pairs (metric+perf)", n) publisher.publish(ModelMetric(
model=bar_name, accuracy=f1))
publisher.publish(ModelPerf(
model=bar_name, latency_us=latency, accuracy=f1))
n += 1
log.debug("%s: F1=%.4f latency=%.0fus (from %s)",
bar_name, f1, latency, source)
except Exception as e:
log.warning("publish failed for %s: %s", bar_name, e)
log.info("published %d (model_metric + model_perf) pairs", n)
return n return n
async def _run(args) -> int: async def _run(args) -> int:
publisher = (null_publisher() if args.dry_run publisher = Publisher(url=args.publish_url)
else http_publisher(args.publish_url))
while True: while True:
await emit_once(publish=publisher, reports_dir=args.reports_dir) emit_once(publisher=publisher, reports_dir=args.reports_dir)
if args.interval <= 0: if args.interval <= 0:
return 0 return 0
await asyncio.sleep(args.interval) await asyncio.sleep(args.interval)
@ -131,16 +215,21 @@ async def _run(args) -> int:
def main() -> int: def main() -> int:
ap = argparse.ArgumentParser() ap = argparse.ArgumentParser()
ap.add_argument("--reports-dir", type=Path, ap.add_argument("--reports-dir", type=Path,
default=Path("reports/eval"), default=Path("reports/eval"))
help="dir containing <model>_<mode>_train.json files") ap.add_argument("--publish-url",
ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish") default="http://127.0.0.1:8447/publish")
ap.add_argument("--interval", type=float, default=30.0, ap.add_argument("--interval", type=float, default=5.0,
help="re-publish period (s); 0 = one-shot") help="re-publish period (s); 0 = one-shot. "
ap.add_argument("--dry-run", action="store_true") "Kept short so a fresh page-load sees populated "
"bars/scatter within a few seconds. The dashboard "
"broadcaster does not replay events to new "
"connections by default — see "
"docs/dashboard-request-sticky-cache.md.")
ap.add_argument("--log-level", default="INFO") ap.add_argument("--log-level", default="INFO")
args = ap.parse_args() args = ap.parse_args()
logging.basicConfig(level=args.log_level, logging.basicConfig(
format="%(asctime)s %(levelname)s %(name)s %(message)s") level=args.log_level,
format="%(asctime)s %(levelname)s %(name)s %(message)s")
return asyncio.run(_run(args)) return asyncio.run(_run(args))