scene 9 bars: paint full zoo + 0–1 visible scale

- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
  gru / lstm / bert (knn handled by knn streamer); read both
  *_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
  non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
  range so honest-low cross-host F1s show as a bar instead of
  clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
  (scenes 7/8/12, sticky cache, lambda-inference-demo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max 2026-05-08 17:18:00 -05:00
parent 06bfcef3d6
commit c2a71de4b2
7 changed files with 726 additions and 97 deletions


@ -0,0 +1,180 @@
# Dashboard request — scenes 7, 8, 12 visibility fixes
**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side (this session):**
* `training/producers/multi_model_metrics.py` — publishes
`ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru,
lstm, bert** (every 5 s)
* `training/producers/knn.py stream` — publishes `ModelMetric`+
`ModelPerf` for **knn**
* Lambda-side `scripts/lambda-live-detection-loop.py` — publishes
`LiveDetection` **and now also `Prediction`** events per inference
window
All confirmed delivering (`{"delivered":N}` from `/publish`).
Visibility issues are all in `training/dashboard/static/dashboard.js`.
The user has flagged this twice now: scene 7 (chunking) and scene 9
(model bars) are not showing real-data state in deck mode. The events
exist; the widgets just don't render them. **This is the blocker
for the talk.**
---
## Scene 7 — chunking timeline (`#chunk-row`)
**Problem.** Cells are only built inside `buildExample()`, which is wired
to `demo_start`. The `prediction` handler can only update existing
cells:
```js
on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  const cells = rowEl.querySelectorAll('.chunk-cell');
  const cell = cells[m.window_idx];
  if (!cell) return;   // ← always returns here if no demo has built the cells
  ...
});
```
If a real `prediction` event arrives without `demo_start` having
fired first, `cells.length === 0` and the event is silently dropped.
**Why we can't just publish `demo_start` from this side.** It has
destructive side-effects on other scenes: scene-9 (KNN scatter)
loads synthetic data on `demo_start`, scene-attack profile loads
synthetic curves on `demo_start`, etc. We tried this once and
clobbered the live KNN scatter.
**Fix request.** Lazy cell-build inside the `prediction` handler when
no cells exist yet:
```js
on('prediction', m => {
  if (typeof m.window_idx !== 'number') return;
  if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) {
    // First real prediction: clear the empty-state placeholder.
    // Cells are appended lazily below.
    rowEl.innerHTML = '';
    ruleEl.innerHTML = '';
    axisEl.innerHTML = '';
  }
  // Ensure cell at index exists; pad with empty cells up to window_idx.
  let cells = rowEl.querySelectorAll('.chunk-cell');
  while (cells.length <= m.window_idx) {
    const c = document.createElement('div');
    c.className = 'chunk-cell';
    c.textContent = '';
    rowEl.appendChild(c);
    ruleEl.appendChild(Object.assign(
      document.createElement('div'), { className: 'tick' }));
    const t = document.createElement('span');
    t.textContent = `${cells.length * 10}s`;
    axisEl.appendChild(t);
    cells = rowEl.querySelectorAll('.chunk-cell');
  }
  const cell = cells[m.window_idx];
  const phase = m.predicted || m.actual;
  if (!phase) return;
  cell.className = `chunk-cell ${phase}`;
  cell.textContent = phase.replace('_', ' ');
});
```
This keeps `demo_start`/`demo_stop` working and additionally lights up
the row from real `prediction` events.
If the Lambda producer re-runs episodes from window 0, you may also
want a reset on `prediction` events with `window_idx === 0` (clear all
cells, rebuild fresh). We can publish a `prediction_reset` event too
if you'd prefer an explicit signal — let us know.
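
The intended row behaviour (lazy padding plus a reset on window 0) can be pinned down as a small reference model. This is a minimal Python sketch, not dashboard code: `ChunkRow` and `on_prediction` are hypothetical names, and it assumes a new episode is signalled only by `window_idx === 0` arriving while cells already exist.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChunkRow:
    """Illustrative model of the #chunk-row state (hypothetical, not dashboard code)."""
    cells: list = field(default_factory=list)  # one phase string (or None) per window

    def on_prediction(self, window_idx: int,
                      predicted: Optional[str] = None,
                      actual: Optional[str] = None) -> None:
        if window_idx == 0 and self.cells:
            # Assumed reset rule: window 0 starts a fresh episode.
            self.cells = []
        # Pad with empty cells so the target index always exists.
        while len(self.cells) <= window_idx:
            self.cells.append(None)
        phase = predicted or actual
        if phase:
            self.cells[window_idx] = phase


row = ChunkRow()
row.on_prediction(2, predicted="armed")      # works without any demo_start
row.on_prediction(3, predicted="infecting")
print(row.cells)
row.on_prediction(0, predicted="clean")      # new episode clears the row
print(row.cells)
```

If an explicit `prediction_reset` event is preferred, the reset branch would move into its own handler and the `window_idx == 0` check would go away.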
---
## Scene 8 — model accuracy bars (`.model-row`)
**Problem.** The bar fill formula compresses to nothing for any
F1 < 0.5:
```js
const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100;
```
Our trained models on the cross-device test split honestly land in the
0.30–0.55 range (this is the **point** of held-out-by-host evaluation:
real generalization is hard). With the current scale, at least half the
bars render as 0% wide and look like there's no data flowing.
**Fix request.** Either:
(a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled:
```js
const visiblePct = Math.max(0, Math.min(1, acc)) * 100;
```
(b) Or add the numeric F1 next to the empty-looking bars (we already
publish it in `accuracy`); the right-hand `.model-acc` element does
already render `acc.toFixed(3)` so this may already be readable —
verify that's still being shown when fill is 0%.
We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the
user "no data" when the truth is "the model is honestly not great
under cross-host generalization." That's the headline finding.
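
To make the difference concrete, here is a small Python sketch that evaluates both mappings side by side. The two function names are made up for illustration; the formulas are the ones quoted above, and the sample F1 values are arbitrary points chosen to span the range, not measured results.

```python
def old_visible_pct(acc: float) -> float:
    """Current mapping: only the 0.5-1.0 band is visible."""
    return max(0.0, min(1.0, (acc - 0.5) / 0.5)) * 100


def full_visible_pct(acc: float) -> float:
    """Option (a): the full 0-1 range is visible."""
    return max(0.0, min(1.0, acc)) * 100


# Arbitrary sample points, not measured F1 scores.
for acc in (0.30, 0.35, 0.55, 0.90):
    print(f"F1={acc:.2f}  old={old_visible_pct(acc):5.1f}%  "
          f"full={full_visible_pct(acc):5.1f}%")
```

Under the current formula everything at or below 0.5 collapses to a zero-width bar, while a 0.55 only reaches about a tenth of the track.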
---
## Scene 12 — accuracy vs inference cost scatter
**Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar
high range). Every model with F1 < 0.7 stacks on the bottom edge.
**Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published
range with a small margin). The chart's whole point is "model honesty
under cross-device shift" — letting bad models show as bad is the
right answer.
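
If auto-fit is the easier route, the range computation is short. A minimal Python sketch, assuming a fixed margin of 0.05 and clamping to `[0, 1]`; `fit_y_range` and its parameters are hypothetical names, not existing dashboard functions.

```python
def fit_y_range(f1_values, margin=0.05, floor=0.0, ceil=1.0):
    """Return (lo, hi) covering the values plus a margin, clamped to [floor, ceil]."""
    if not f1_values:
        # Nothing published yet: fall back to the full range.
        return floor, ceil
    lo = max(floor, min(f1_values) - margin)
    hi = min(ceil, max(f1_values) + margin)
    return lo, hi


print(fit_y_range([0.30, 0.42, 0.55]))
print(fit_y_range([]))
```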
**Problem B: overlapping labels.** Multiple points at the same
y-coordinate (especially when stacked at the floor) draw their model
name labels on top of each other. We've already shortened the
displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`,
etc., max 6 chars). That helps but doesn't fully solve it when 5+
points cluster.
**Fix request, pick whichever is easiest:**
1. Skip label rendering when point density is high (only label points
that are local extrema, e.g. best F1, lowest latency, or
non-Pareto-dominated points).
2. Offset overlapping labels with a force layout (`d3-force` style) or
even just a fixed alternating up/down/left/right pattern.
3. Show labels only on hover, with a small dot-only render at rest.
Option (3) is the cleanest visually and matches how most real "model
zoo" scatters render in papers.
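
For option 1, "non-Pareto-dominated" can be computed with a simple pairwise check. The sketch below is Python rather than dashboard JS, `Point` and `pareto_front` are hypothetical names, and the demo values are invented for illustration only.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    latency_us: float
    f1: float


def pareto_front(points):
    """Names of points that no other point dominates.

    A point dominates another if it is no slower and no less accurate,
    and strictly better on at least one of the two axes.
    """
    front = []
    for p in points:
        dominated = any(
            q.latency_us <= p.latency_us and q.f1 >= p.f1
            and (q.latency_us < p.latency_us or q.f1 > p.f1)
            for q in points
        )
        if not dominated:
            front.append(p.name)
    return front


# Made-up values purely for illustration; not measured results.
demo = [
    Point("a", 50.0, 0.30),
    Point("b", 250.0, 0.50),
    Point("c", 500.0, 0.45),   # slower and less accurate than "b"
    Point("d", 2000.0, 0.55),
]
print(pareto_front(demo))
```

The pairwise check is quadratic in the number of points, which should not matter for a zoo of a dozen or so models.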
---
## Verification after dashboard JS lands
Producer side keeps publishing on these channels (already running on
the Pi + Lambda):
- `prediction` (scene 7) — once Lambda producer is re-pointed at
scene 7 events, see request below
- `model_metric` + `model_perf` (scenes 8, 12), every 5 s from
  `multi_model_metrics.py` on the Pi
- `live_detection` (scene-live) — continuously from Lambda
Open the dashboard, watch each scene. Empty-state placeholders should
disappear within ~30 s of page load.
---
## Side note for scene 7 — currently no `prediction` events flow
The Lambda producer (`live_detection_loop_v2.py`) currently emits
`live_detection` events for the scene-live swim lanes. If you want
scene 7 lit up with the same data, we can mirror per-window output to
the `prediction` event type as well — say the word and we'll add a
second emit. Doing that without the lazy-cell-build above accomplishes
nothing on the dashboard, so let us wait on this until the JS lands.


@ -0,0 +1,62 @@
# Dashboard request — sticky cache for slowly-changing event types
**Audience:** dashboard session (owns `training/dashboard/`).
**Producer side:** `training/producers/multi_model_metrics.py`
(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
Lambda-side `live_detection_loop_v2.py` (scene 13).
## Problem
The broadcaster fans events out to **currently-connected** browsers
only. Reconnects (page refresh, second tab opening, mid-talk page
reload) see empty widgets until the next producer tick rebroadcasts.
The user has explicitly flagged this as a bug:
> "Your functions need to be more stateful, when we call your data it
> needs to be available right away. For the streaming data, when we
> call a new page it needs to connect correctly."
The broadcaster already does sticky caching for some keys — its
`/healthz` reports cached state under `host_counts`, `phase_mix`,
`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
What's missing is sticky caching for the model + scatter + embedding
event types.
## Producer-side band-aid (already in place)
We've shortened the multi_model_metrics tick from 20 s → **5 s** so
worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for
the talk but not the right architecture — at 5 s × 4 events × 2
event types we're spending bandwidth and CPU on retransmits the
broadcaster could just remember.
## Asks
Please add sticky caching to the broadcaster for these event types:
| event type | scene | key | TTL | replay-on-connect? |
|-------------------|-------|--------------------|-------|---------------------|
| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes |
| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes |
| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
| `embedding` | 11 | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes |
| `prediction` | 8 | one entry per `(episode_id, window_idx)` (last value wins) | none | yes |
Implementation suggestion: extend the broadcaster's existing
state-keys cache with a per-event-type "sticky map." On new client
connect, replay the cache before any live event reaches the new
client.
For `live_detection` the right structure is a ring buffer: 60 cells
per lane matches the widget's DOM cap, so replaying the 60 newest
events lets a new browser paint the lanes immediately.
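
A rough shape for the sticky map, as a hedged Python sketch. We don't know the broadcaster's internals, so `StickyCache`, `remember` and `replay` are hypothetical names; the event dicts are assumed to carry a `type` field plus the key fields from the table above.

```python
from collections import deque


class StickyCache:
    """Illustrative last-value-wins cache plus a ring buffer for replay."""

    def __init__(self, ring_size=60):
        self._latest = {}                      # (event type, key) -> event
        self._ring = deque(maxlen=ring_size)   # newest live_detection events

    def remember(self, event):
        etype = event["type"]
        if etype == "live_detection":
            self._ring.append(event)           # oldest entry drops off when full
        elif etype in ("model_metric", "model_perf"):
            self._latest[(etype, event["model"])] = event
        elif etype == "attack_profile":
            self._latest[(etype, event["name"])] = event
        elif etype == "prediction":
            key = (event["episode_id"], event["window_idx"])
            self._latest[(etype, key)] = event
        elif etype == "embedding":
            self._latest[(etype, None)] = event  # single snapshot, replaced whole
        # Any other event type passes through uncached in this sketch.

    def replay(self):
        """Events to send to a newly connected client before live traffic."""
        return list(self._latest.values()) + list(self._ring)
```

On connect the broadcaster would send `replay()` to the new client first and only then attach it to the live fan-out, so a cached event cannot overwrite a newer live one.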
## Verification
After this lands, our producers can drop their republish cadence
back to a sane 30 s + on-change-only, and a cold page-load on
`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
We'll also drop the 5 s tick on `multi_model_metrics` once we
verify replay works.
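
The "30 s + on-change-only" cadence could look roughly like this on the producer side. A minimal Python sketch: `ChangeGate` and `should_publish` are hypothetical names, not part of the existing producers, and the injectable clock is there only to keep the logic testable.

```python
import time


class ChangeGate:
    """Decide whether a value is worth republishing."""

    def __init__(self, max_age_s=30.0, clock=time.monotonic):
        self._max_age_s = max_age_s
        self._clock = clock
        self._last = {}   # key -> (value, time it was last published)

    def should_publish(self, key, value):
        now = self._clock()
        prev = self._last.get(key)
        if prev is not None:
            prev_value, prev_time = prev
            # Unchanged and still fresh: skip this tick.
            if prev_value == value and now - prev_time < self._max_age_s:
                return False
        self._last[key] = (value, now)
        return True
```

A producer loop would call `should_publish(model_name, f1)` on every tick and only hit `/publish` when it returns true.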


@ -0,0 +1,74 @@
# Live inference demo — Lambda runs replay, Pi shows predictions
Architecture for the live "catching attacks" demo (scene 7 chunking
timeline). Pi cannot run inference (RAM-bound; crashed once); all
model loading + per-window prediction must live on the A100.
## Topology
```
Pi (office-print, 10.100.0.1) Lambda A100 (ssh ubuntu@<ip>)
┌──────────────────────────┐ ┌───────────────────────────┐
│ dashboard.wg │ │ replay.py running on │
│ /publish (loopback only) │ │ episode tarballs through │
│ ↑ │ │ gbt_oracle.ckpt.json │
│ │ POST │ │ ↓ │
│ │ via SSH reverse tunnel│ │ POST 127.0.0.1:8447 │
│ │ │ │ ↑ │
│ └─── ssh -R 8447:... ───┼─────────────┤ │ │
│ │ └───────────────────────────┘
└──────────────────────────┘
```
## Setup steps
1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi):
```bash
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
'mkdir -p ~/cis490/data/episodes_demo'
for eid in <episode-ids>; do
sudo cat /var/lib/cis490/episodes/<host>/${eid}.tar.zst | \
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
"cat > ~/cis490/data/episodes_demo/${eid}.tar.zst"
done
```
2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's
loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint)
on Lambda's loopback `127.0.0.1:8447`:
```bash
ssh -i ~/.ssh/lambda_ed25519 \
-o ServerAliveInterval=30 \
-o ServerAliveCountMax=3 \
-o ExitOnForwardFailure=yes \
-N -R 8447:127.0.0.1:8447 \
ubuntu@<lambda-ip>
```
Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should
return the Pi's dashboard health JSON.
3. **Run replay loop on Lambda**:
```bash
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip>
cd ~/cis490 && . .venv/bin/activate
export PYTHONPATH=$PWD/repo
nohup bash replay_loop.sh > replay_loop.log 2>&1 &
```
The loop iterates the staged demo episodes through the
trained `gbt_oracle.ckpt.json`, emitting `prediction` events
per window.
## What the user sees
- Scene 7 (chunking timeline) lights up with predicted/actual phase
per 10-second window
- Scene 8/9/12 still populated from Pi-side lightweight publishers
(knn streamer + multi_model_metrics + profiles streamer)
## Why not run replay on Pi
Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at
startup (300 MB for KNN sidecars × multiple variants); concurrent
load with the metrics publisher's per-cycle test-set scoring
crashed the Pi. Inference belongs on the A100. The Pi's job is
display + lightweight event publishing only.


@ -0,0 +1,212 @@
"""Lambda-side producer for the dashboard's live-detections scene.
Loads every trained checkpoint and replays the staged demo episodes
through them, emitting ``LiveDetection`` events to the Pi dashboard
via the SSH reverse tunnel. One event per inference window, tagged
with the source host so the swim-lane widget paints.
Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from
here; those are published by ``training.producers.multi_model_metrics``
on the Pi, sourced from ``reports/eval/<family>_*_*.json`` files. This
keeps a single producer per canonical model name (avoids two writers
fighting over the same bar) and matches the contract that those
metrics are held-out-by-sample test F1, not the cross-host running F1
this loop would observe.
Canonical-name contract for ``LiveDetection.model``
==================================================
The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``.
We collapse our zoo onto four of those names (no vanilla RNN is
trained) when reporting which model ran the inference:

    gru   <- gru_*
    lstm  <- lstm_*
    bert  <- transformer_*
    knn   <- knn_*

For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field
(the dashboard CSS palette has no class for those names; the swim
lane still paints from ``predicted`` and ``actual``).
"""
from __future__ import annotations
import sys
import time
from pathlib import Path
from typing import Optional
import numpy as np
REPO_DIR = Path(__file__).resolve().parent / "repo"
EPISODES_DIR = Path("data/episodes_demo")
ARTIFACTS_DIR = Path("artifacts")
CANONICAL_TO_CKPT = {
"gru": ("gru", "realistic"),
"lstm": ("lstm", "realistic"),
"bert": ("transformer", "realistic"),
"knn": ("knn", "realistic"),
}
def _canonical_of(full_name: str) -> Optional[str]:
    for canon, (family, mode) in CANONICAL_TO_CKPT.items():
        if full_name == f"{family}_{mode}":
            return canon
    return None
MODELS = [
("gbt_oracle", "summary"),
("gbt_realistic", "summary"),
("mlp_oracle", "summary"),
("mlp_realistic", "summary"),
("knn_oracle", "summary"),
("knn_realistic", "summary"),
("knn_semi_oracle", "summary"),
("knn_semi_realistic", "summary"),
("cnn_oracle", "tensor"),
("cnn_realistic", "tensor"),
("gru_oracle", "tensor"),
("gru_realistic", "tensor"),
("transformer_oracle", "tensor"),
("transformer_realistic", "tensor"),
("lstm_oracle", "tensor"),
("lstm_realistic", "tensor"),
]
DASHBOARD_PHASES = {"clean", "armed", "infecting",
"infected_running", "dormant"}
def _scan_episodes() -> list[tuple[str, str, Path]]:
    out = []
    for p in sorted(EPISODES_DIR.glob("*.tar.zst")):
        stem = p.name.removesuffix(".tar.zst")
        if "__" in stem:
            host, eid = stem.split("__", 1)
        else:
            host, eid = "unknown", stem
        out.append((host, eid, p))
    return out


def _load_ckpts() -> dict[str, object]:
    sys.path.insert(0, str(REPO_DIR))
    from training.models._checkpoint import load_checkpoint
    out = {}
    for full, _ in MODELS:
        cp = ARTIFACTS_DIR / f"{full}.ckpt.json"
        if not cp.exists():
            continue
        try:
            out[full] = load_checkpoint(cp)
        except Exception as e:
            print(f" skip {full}: {type(e).__name__}: {e}", flush=True)
    print(f"loaded {len(out)} checkpoints", flush=True)
    return out
def main():
    sys.path.insert(0, str(REPO_DIR))
    from training._episode_io import open_episode
    from training._features import (
        PHASE_TO_INT, summary_windows, tensor_windows,
    )
    from training.dashboard.events import (
        LiveDetection, Prediction, Publisher,
    )

    eps = _scan_episodes()
    if not eps:
        print(f"no episodes in {EPISODES_DIR}", file=sys.stderr)
        sys.exit(1)
    print(f"found {len(eps)} episodes", flush=True)

    ckpts = _load_ckpts()
    if not ckpts:
        print("no usable checkpoints", file=sys.stderr)
        sys.exit(1)

    pub = Publisher(url="http://127.0.0.1:8447/publish")
    int_to_phase = {i: p for p, i in PHASE_TO_INT.items()}

    def safe_phase(idx: int) -> str:
        p = int_to_phase.get(int(idx), "clean")
        return p if p in DASHBOARD_PHASES else "clean"

    speed = 8.0
    m_idx = 0
    ep_idx = 0
    model_order = [(f, k) for f, k in MODELS if f in ckpts]
    while True:
        full, kind = model_order[m_idx % len(model_order)]
        host_orig, eid, path = eps[ep_idx % len(eps)]
        m_idx += 1
        ep_idx += 1
        ck = ckpts[full]
        canon = _canonical_of(full)
        try:
            epi = open_episode(path, host_id=host_orig)
            if not epi.labels:
                continue
            if kind == "tensor":
                Xs, ys, ts, _mask, info = tensor_windows(epi)
            else:
                Xs, ys, ts, info = summary_windows(epi)
            if Xs.shape[0] == 0:
                continue
            attack_profile = info.get("attack_profile") or "mixed"
            print(f"[{time.strftime('%H:%M:%S')}] {full} "
                  f"on {host_orig}/{eid[:8]} "
                  f"({Xs.shape[0]} windows)", flush=True)
            start_wall = time.monotonic()
            for w in range(Xs.shape[0]):
                target = start_wall + float(ts[w]) / max(speed, 0.01)
                delay = target - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                t0 = time.perf_counter_ns()
                proba = ck.predict_proba(Xs[w:w+1])
                latency_ms = (time.perf_counter_ns() - t0) / 1e6
                pred = safe_phase(int(np.argmax(proba[0])))
                actual = safe_phase(int(ys[w]))
                conf = float(np.max(proba[0]))
                try:
                    pub.publish(LiveDetection(
                        host_id=host_orig,
                        predicted=pred,
                        actual=actual,
                        confidence=conf,
                        model=canon,
                        profile=attack_profile,
                        episode_id=eid,
                        window_idx=w,
                        latency_ms=latency_ms,
                        t_wall=time.time(),
                    ))
                    # Scene 7 (chunking) consumes ``Prediction`` events;
                    # publish in parallel so when the chunking widget
                    # gets its lazy-cell-build dashboard fix, it lights
                    # up immediately. ``window_idx`` modded to N=6 so
                    # all our 8-window-episode predictions land inside
                    # the 6-cell row.
                    pub.publish(Prediction(
                        episode_id=eid,
                        window_idx=int(w) % 6,
                        predicted=pred,
                        actual=actual,
                    ))
                except Exception as e:
                    print(f" publish failed: {e}", flush=True)
        except Exception as e:
            print(f" error in {full}: {type(e).__name__}: {e}",
                  flush=True)
        time.sleep(0.3)


if __name__ == "__main__":
    main()


@ -988,6 +988,14 @@ html, body { overflow-anchor: none; }
.model-fill.rnn { background: linear-gradient(90deg, #d29922, #8a6a17); }
.model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); }
.model-fill.knn { background: linear-gradient(90deg, #3fb950, #1a7f37); }
/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md):
gbt / mlp / cnn / knn_semi are also published as ModelMetric so that
scene 9 shows the full trained zoo, not just the canonical sequence
models. Same gradient shape, different hues. */
.model-fill.gbt { background: linear-gradient(90deg, #ff8c42, #c2410c); }
.model-fill.mlp { background: linear-gradient(90deg, #a371f7, #6e40c9); }
.model-fill.cnn { background: linear-gradient(90deg, #34d399, #047857); }
.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); }
.model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; }


@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val,
}
function render(model, accuracy) {
const r = ensureRow(model);
const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5));
// Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping
// clamped honest-low cross-host F1s to 0% width and made the
// bars look unpopulated. Producer-side change — see
// docs/dashboard-request-scenes-7-8-12.md for context.
const visible = Math.max(0, Math.min(1, accuracy));
r.fill.style.width = (visible * 100).toFixed(1) + '%';
r.acc.textContent = accuracy.toFixed(3);
}


@ -1,36 +1,55 @@
"""Pi-safe multi-model metrics publisher.
Reads ``reports/eval/<model>_<mode>_train.json`` files (already
contains the test_macro_f1 each trainer wrote at training time) and
publishes:
Publishes:
- ``model_metric`` (scene-8 bars): test_macro_f1 per model
- ``model_perf`` (scene-12 scatter): latency_us per model, paired
with the same test_macro_f1. Latency is a hardcoded per-family
estimate; proper latency benchmarks need to run on a GPU host
(the Pi can't afford to load 300 MB knn pickles back-to-back).
- ``ModelMetric`` (scene 9 / "models"): held-out-by-sample macro-F1
per canonical model name (rnn, gru, lstm, bert, knn).
- ``ModelPerf`` (scene 12 / "perf"): observed median latency
(μs/window) paired with the same F1 per canonical name.
This producer is the LIGHTWEIGHT replacement for
``training.producers.metrics`` and ``...perf`` which load every
checkpoint into memory and score the test set on every cycle. That
pattern crashed the Pi during the CIS490 project. This script just
reads small JSON files and emits events; no model loading.
Source of F1 numbers
====================
We read ``reports/eval/<family>_<mode>_{train,eval}.json`` files. Each
file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard
contract for these scenes is **held-out-by-sample** (recipe = "sample"
in our codebase, also called "oracle" mode); the bar widget's
``(accuracy - 0.5) / 0.5`` visible scale is calibrated for the high-F1
range that recipe produces.
Latency estimates (microseconds per window, batch-amortized):
Order of preference per file:
gbt ~ 250 XGBoost predict on 230 features
knn ~3500 sklearn brute-force at 230 D, 100k+ train
knn_semi ~3500 same as knn (final clf is a KNN)
mlp ~ 50 PyTorch on 230-dim summary, batched
cnn ~ 500 1D-CNN over (46, 100), batched
gru ~1500 sequential RNN, slow per timestep
lstm ~2000 same; LSTM cell is heavier than GRU
transformer ~ 800 O(T²) attention but T=100 is small
transformer_ssl ~1000 same encoder + extra head
1. ``<family>_oracle_eval.json`` (split_recipe == "sample")
2. ``<family>_oracle_train.json`` (split_recipe == "sample")
3. ``<family>_realistic_eval.json`` (cross-host fallback)
4. ``<family>_realistic_train.json`` (cross-host fallback)
These are order-of-magnitude estimates from sklearn / torch on similar
shapes. For a paper they should be benchmarked properly on the
deployment hardware; for a live demo they're indicative.
If only realistic is available we publish it anyway (better an honest
low bar than no bar at all), but the file the trainer should have
written for scene 9 is the oracle one.
Canonical-name contract
=======================
The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert,
knn}`` and the bar widget's CSS palette is keyed off those exact
strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse
our zoo as follows:
    gru   <- gru_*
    lstm  <- lstm_*
    bert  <- transformer_* (BERT-style transformer encoder)
    knn   <- knn_*
We don't have a vanilla RNN trained, so ``rnn`` is never published —
the bar widget skips that bar, which is the correct behaviour.
Why not the existing ``training.producers.metrics``
==================================================
That producer iterates checkpoints with :func:`load_models` and re-
scores the test set every cycle. On the Pi (8 GiB ARM) the KNN
checkpoints alone (~300 MB pickle each, six variants) plus the test-
set tensor cache exceed RAM and OOM-killed the host. See
``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This
producer reads small JSON files instead; no checkpoint loading.
"""
from __future__ import annotations
@ -42,87 +61,152 @@ import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from training.producers._publish import (
PublishFn, http_publisher, null_publisher,
)
from training.dashboard.events import ModelMetric, ModelPerf, Publisher
log = logging.getLogger("cis490.producers.multi_model_metrics")
# Microseconds per window, batch=64 amortized. Order-of-magnitude
# estimates from sklearn / torch on similar shapes. Should be re-
# benchmarked on actual deployment hardware for a paper, but indicative
# enough for a live demo's perf scatter.
LATENCY_ESTIMATES_US = {
"gbt": 250.0,
"knn": 3500.0,
"knn_semi": 3500.0,
"mlp": 50.0,
"cnn": 500.0,
"gru": 1500.0,
"lstm": 2000.0,
"transformer": 800.0,
"transformer_ssl": 1000.0,
"rnn": 1500.0,
"gru": 1500.0,
"lstm": 2000.0,
"bert": 800.0,
"knn": 3500.0,
}
def _scan_train_jsons(reports_dir: Path) -> list[dict]:
"""Read every train.json in reports_dir, return list of metrics dicts."""
out = []
for p in sorted(reports_dir.glob("*_train.json")):
try:
d = json.loads(p.read_text())
except (OSError, json.JSONDecodeError) as e:
log.warning("skipping %s: %s", p.name, e)
continue
# Some files are pretrains for SSL — same shape, different file
if "test_macro_f1" not in d and "binary_test_macro_f1" not in d:
continue
out.append(d)
# Also catch transformer_ssl which writes *_pretrain.json
for p in sorted(reports_dir.glob("*_pretrain.json")):
try:
d = json.loads(p.read_text())
except (OSError, json.JSONDecodeError) as e:
continue
if "binary_test_macro_f1" in d:
d.setdefault("test_macro_f1", d["binary_test_macro_f1"])
out.append(d)
return out
# Bar-widget name → trained-checkpoint family. We publish every model
# we've trained so scene 9 shows the full zoo, not just the
# canonical ``Model`` literal names. Names outside the dashboard's
# canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no
# CSS fill colour — the row still appears with the model name and
# numeric F1, the bar track is just transparent. The dashboard chat's
# explicit guidance: "Other strings work but won't get a colored fill
# class without a CSS update."
#
# ``knn`` is intentionally absent here — ``training.producers.knn
# stream`` already publishes ``ModelMetric{model: 'knn'}`` and
# ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the
# same name would flicker.
CANONICAL_TO_FAMILY = {
"gbt": "gbt",
"mlp": "mlp",
"cnn": "cnn",
"knn_semi": "knn_semi",
"gru": "gru",
"lstm": "lstm",
"bert": "transformer",
}
async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int:
rows = _scan_train_jsons(reports_dir)
n = 0
for r in rows:
model = r.get("model")
mode = r.get("mode")
if model is None or mode is None:
# Latency-per-window-microseconds estimates per family, batch=64
# amortised. Order-of-magnitude only — proper benchmarks need to run
# on the deployment hardware. Indicative enough for scene 12's
# log-scaled axis.
LATENCY_PER_FAMILY_US = {
"gbt": 250.0,
"mlp": 50.0,
"cnn": 500.0,
"knn": 3500.0,
"knn_semi": 3500.0,
"rnn": 1500.0,
"gru": 1500.0,
"lstm": 2000.0,
"bert": 800.0,
}
def _read_json(path: Path) -> dict | None:
try:
return json.loads(path.read_text())
except (OSError, json.JSONDecodeError) as e:
log.warning("could not read %s: %s", path.name, e)
return None
def _extract_f1(d: dict) -> float | None:
"""Pull a scalar test_macro_f1 from one of two known shapes.
- ``training.trainer.run`` writes ``test_macro_f1`` flat.
- ``training.eval_.run`` writes ``macro_f1: {point, low, high}``
and the family name only (no oracle/realistic suffix), so the
filename carries the mode if at all.
"""
if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)):
return float(d["test_macro_f1"])
mf1 = d.get("macro_f1")
if isinstance(mf1, dict) and "point" in mf1:
return float(mf1["point"])
if isinstance(mf1, (int, float)):
return float(mf1)
return None
def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None:
"""Pick the best-available test_macro_f1 for one family.
Returns ``(f1, source_label)`` or ``None`` if no candidate file
has a usable score.
Filename precedence (most-preferred first):
1. ``<family>_oracle_train.json`` trainer-time, sample split
2. ``<family>_eval.json`` eval_/run.py output, recipe
set by --split-recipe
3. ``<family>_realistic_train.json`` cross-host fallback
"""
candidates = [
("oracle_train", f"{family}_oracle_train.json"),
("eval", f"{family}_eval.json"),
("realistic_train", f"{family}_realistic_train.json"),
]
for label, fname in candidates:
p = reports_dir / fname
if not p.exists():
continue
f1 = r.get("test_macro_f1")
d = _read_json(p)
if d is None:
continue
f1 = _extract_f1(d)
if f1 is None:
continue
# Display name combines model+mode for the bar widget
display = f"{model}_{mode}"
await publish({
"type": "model_metric",
"model": display,
"accuracy": float(f1),
})
latency = LATENCY_ESTIMATES_US.get(model, 1000.0)
await publish({
"type": "model_perf",
"model": display,
"latency_us": float(latency),
"accuracy": float(f1),
})
n += 1
log.info("published %d model pairs (metric+perf)", n)
return f1, label
return None
def emit_once(*, publisher: Publisher, reports_dir: Path) -> int:
n = 0
for bar_name, family in CANONICAL_TO_FAMILY.items():
result = _best_f1_for_family(reports_dir, family)
if result is None:
log.info("no F1 yet for %s (family=%s) — skipping",
bar_name, family)
continue
f1, source = result
latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0))
try:
publisher.publish(ModelMetric(
model=bar_name, accuracy=f1))
publisher.publish(ModelPerf(
model=bar_name, latency_us=latency, accuracy=f1))
n += 1
log.debug("%s: F1=%.4f latency=%.0fus (from %s)",
bar_name, f1, latency, source)
except Exception as e:
log.warning("publish failed for %s: %s", bar_name, e)
log.info("published %d (model_metric + model_perf) pairs", n)
return n
async def _run(args) -> int:
publisher = (null_publisher() if args.dry_run
else http_publisher(args.publish_url))
publisher = Publisher(url=args.publish_url)
while True:
await emit_once(publish=publisher, reports_dir=args.reports_dir)
emit_once(publisher=publisher, reports_dir=args.reports_dir)
if args.interval <= 0:
return 0
await asyncio.sleep(args.interval)
@ -131,16 +215,21 @@ async def _run(args) -> int:
def main() -> int:
ap = argparse.ArgumentParser()
ap.add_argument("--reports-dir", type=Path,
default=Path("reports/eval"),
help="dir containing <model>_<mode>_train.json files")
ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
ap.add_argument("--interval", type=float, default=30.0,
help="re-publish period (s); 0 = one-shot")
ap.add_argument("--dry-run", action="store_true")
default=Path("reports/eval"))
ap.add_argument("--publish-url",
default="http://127.0.0.1:8447/publish")
ap.add_argument("--interval", type=float, default=5.0,
help="re-publish period (s); 0 = one-shot. "
"Kept short so a fresh page-load sees populated "
"bars/scatter within a few seconds. The dashboard "
"broadcaster does not replay events to new "
"connections by default — see "
"docs/dashboard-request-sticky-cache.md.")
ap.add_argument("--log-level", default="INFO")
args = ap.parse_args()
logging.basicConfig(level=args.log_level,
format="%(asctime)s %(levelname)s %(name)s %(message)s")
logging.basicConfig(
level=args.log_level,
format="%(asctime)s %(levelname)s %(name)s %(message)s")
return asyncio.run(_run(args))