scene 9 bars: paint full zoo + 0–1 visible scale
- multi_model_metrics: publish gbt / mlp / cnn / knn_semi / gru / lstm / bert (knn handled by knn streamer); read both *_train.json and *_eval.json with macro_f1.point fallback - dashboard.css: add palette gradients for the four non-canonical names so the bars render with a fill colour - dashboard.js: open the bar's visible scale to the full 0–1 range so honest-low cross-host F1s show as a bar instead of clamping to 0% - ship lambda-live-detection-loop.py + dashboard request docs (scenes 7/8/12, sticky cache, lambda-inference-demo) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
06bfcef3d6
commit
c2a71de4b2
7 changed files with 726 additions and 97 deletions
180
docs/dashboard-request-scenes-7-8-12.md
Normal file
180
docs/dashboard-request-scenes-7-8-12.md
Normal file
|
|
@ -0,0 +1,180 @@
|
|||
# Dashboard request — scenes 7, 8, 12 visibility fixes
|
||||
|
||||
**Audience:** dashboard session (owns `training/dashboard/`).
|
||||
**Producer side (this session):**
|
||||
* `training/producers/multi_model_metrics.py` — publishes
|
||||
`ModelMetric` and `ModelPerf` for **gbt, mlp, cnn, knn_semi, gru,
|
||||
lstm, bert** (every 5 s)
|
||||
* `training/producers/knn.py stream` — publishes `ModelMetric`+
|
||||
`ModelPerf` for **knn**
|
||||
* Lambda-side `scripts/lambda-live-detection-loop.py` — publishes
|
||||
`LiveDetection` **and now also `Prediction`** events per inference
|
||||
window
|
||||
|
||||
All confirmed delivering (`{"delivered":N}` from `/publish`).
|
||||
Visibility issues are all in `training/dashboard/static/dashboard.js`.
|
||||
|
||||
The user has flagged this twice now: scene 7 (chunking) and scene 9
|
||||
(model bars) are not showing real-data state in deck mode. The events
|
||||
exist; the widgets just don't render them. **This is the blocker
|
||||
for the talk.**
|
||||
|
||||
---
|
||||
|
||||
## Scene 7 — chunking timeline (`#chunk-row`)
|
||||
|
||||
**Problem.** Cells are only built inside `buildExample()`, which is wired
|
||||
to `demo_start`. The `prediction` handler can only update existing
|
||||
cells:
|
||||
|
||||
```js
|
||||
on('prediction', m => {
|
||||
if (typeof m.window_idx !== 'number') return;
|
||||
const cells = rowEl.querySelectorAll('.chunk-cell');
|
||||
const cell = cells[m.window_idx];
|
||||
if (!cell) return; // ← always falls through if no demo
|
||||
...
|
||||
});
|
||||
```
|
||||
|
||||
If a real `prediction` event arrives without `demo_start` having
|
||||
fired first, `cells.length === 0` and the event is silently dropped.
|
||||
|
||||
**Why we can't just publish `demo_start` from this side.** It has
|
||||
destructive side-effects on other scenes: scene-9 (KNN scatter)
|
||||
loads synthetic data on `demo_start`, scene-attack profile loads
|
||||
synthetic curves on `demo_start`, etc. We tried this once and
|
||||
clobbered the live KNN scatter.
|
||||
|
||||
**Fix request.** Lazy cell-build inside the `prediction` handler when
|
||||
no cells exist yet:
|
||||
|
||||
```js
|
||||
on('prediction', m => {
|
||||
if (typeof m.window_idx !== 'number') return;
|
||||
if (rowEl.children.length === 0 || rowEl.querySelector('.chunk-empty')) {
|
||||
// Build N empty cells on first prediction. Width grows lazily.
|
||||
rowEl.innerHTML = '';
|
||||
ruleEl.innerHTML = '';
|
||||
axisEl.innerHTML = '';
|
||||
}
|
||||
// Ensure cell at index exists; pad with empty cells up to window_idx.
|
||||
let cells = rowEl.querySelectorAll('.chunk-cell');
|
||||
while (cells.length <= m.window_idx) {
|
||||
const c = document.createElement('div');
|
||||
c.className = 'chunk-cell';
|
||||
c.textContent = '';
|
||||
rowEl.appendChild(c);
|
||||
ruleEl.appendChild(Object.assign(
|
||||
document.createElement('div'), { className: 'tick' }));
|
||||
const t = document.createElement('span');
|
||||
t.textContent = `${cells.length * 10}s`;
|
||||
axisEl.appendChild(t);
|
||||
cells = rowEl.querySelectorAll('.chunk-cell');
|
||||
}
|
||||
const cell = cells[m.window_idx];
|
||||
const phase = m.predicted || m.actual;
|
||||
if (!phase) return;
|
||||
cell.className = `chunk-cell ${phase}`;
|
||||
cell.textContent = phase.replace('_', ' ');
|
||||
});
|
||||
```
|
||||
|
||||
This keeps `demo_start`/`demo_stop` working and additionally lights up
|
||||
the row from real `prediction` events.
|
||||
|
||||
If the Lambda producer re-runs episodes from window 0, you may also
|
||||
want a reset on `prediction` events with `window_idx === 0` (clear all
|
||||
cells, rebuild fresh). We can publish a `prediction_reset` event too
|
||||
if you'd prefer an explicit signal — let us know.
|
||||
|
||||
---
|
||||
|
||||
## Scene 8 — model accuracy bars (`.model-row`)
|
||||
|
||||
**Problem.** The bar fill formula compresses to nothing for any
|
||||
F1 < 0.5:
|
||||
|
||||
```js
|
||||
const visiblePct = Math.max(0, Math.min(1, (acc - 0.5) / 0.5)) * 100;
|
||||
```
|
||||
|
||||
Our trained models on the cross-device test split honestly land in
|
||||
0.30–0.55 range (this is the **point** of held-out-by-host evaluation —
|
||||
real generalization is hard). With the current scale, ≥ half the bars
|
||||
render as 0% wide and look like there's no data flowing.
|
||||
|
||||
**Fix request.** Either:
|
||||
|
||||
(a) Use the full 0–1 range so a 0.35-F1 bar is still visibly 35% filled:
|
||||
|
||||
```js
|
||||
const visiblePct = Math.max(0, Math.min(1, acc)) * 100;
|
||||
```
|
||||
|
||||
(b) Or add the numeric F1 next to the empty-looking bars (we already
|
||||
publish it in `accuracy`); the right-hand `.model-acc` element does
|
||||
already render `acc.toFixed(3)` so this may already be readable —
|
||||
verify that's still being shown when fill is 0%.
|
||||
|
||||
We strongly prefer (a). Hiding 0.30-F1 models behind a 0% bar tells the
|
||||
user "no data" when the truth is "the model is honestly not great
|
||||
under cross-host generalization." That's the headline finding.
|
||||
|
||||
---
|
||||
|
||||
## Scene 12 — accuracy vs inference cost scatter
|
||||
|
||||
**Problem A: y-axis range.** y is clamped to `[0.7, 1.0]` (or similar
|
||||
high range). Every model with F1 < 0.7 stacks on the bottom edge.
|
||||
|
||||
**Fix.** Open the y-axis to `[0.0, 1.0]` (or auto-fit to the published
|
||||
range with a small margin). The chart's whole point is "model honesty
|
||||
under cross-device shift" — letting bad models show as bad is the
|
||||
right answer.
|
||||
|
||||
**Problem B: overlapping labels.** Multiple points at the same
|
||||
y-coordinate (especially when stacked at the floor) draw their model
|
||||
name labels on top of each other. We've already shortened the
|
||||
displayed names producer-side (`gbt-O`, `mlp-R`, `knns-O`, `trf-R`,
|
||||
etc., max 6 chars). That helps but doesn't fully solve it when 5+
|
||||
points cluster.
|
||||
|
||||
**Fix request, pick whichever is easiest:**
|
||||
|
||||
1. Skip label rendering when point density is high (only label points
|
||||
that are local extrema, e.g. best F1, lowest latency, or
|
||||
non-Pareto-dominated points).
|
||||
2. Offset overlapping labels with a force layout (`d3-force` style) or
|
||||
even just a fixed alternating up/down/left/right pattern.
|
||||
3. Show labels only on hover, with a small dot-only render at rest.
|
||||
|
||||
Option (3) is the cleanest visually and matches how most real "model
|
||||
zoo" scatters render in papers.
|
||||
|
||||
---
|
||||
|
||||
## Verification after dashboard JS lands
|
||||
|
||||
Producer side keeps publishing on these channels (already running on
|
||||
the Pi + Lambda):
|
||||
|
||||
- `prediction` (scene 7) — once Lambda producer is re-pointed at
|
||||
scene 7 events, see request below
|
||||
- `model_metric` + `model_perf` (scenes 8, 12) — every 30 s from
|
||||
`multi_model_metrics.py` on the Pi
|
||||
- `live_detection` (scene-live) — continuously from Lambda
|
||||
|
||||
Open the dashboard, watch each scene. Empty-state placeholders should
|
||||
disappear within ~30 s of page load.
|
||||
|
||||
---
|
||||
|
||||
## Side note for scene 7 — currently no `prediction` events flow
|
||||
|
||||
The Lambda producer (`live_detection_loop_v2.py`) currently emits
|
||||
`live_detection` events for the scene-live swim lanes. If you want
|
||||
scene 7 lit up with the same data, we can mirror per-window output to
|
||||
the `prediction` event type as well — say the word and we'll add a
|
||||
second emit. Doing that without the lazy-cell-build above accomplishes
|
||||
nothing on the dashboard, so let us wait on this until the JS lands.
|
||||
62
docs/dashboard-request-sticky-cache.md
Normal file
62
docs/dashboard-request-sticky-cache.md
Normal file
|
|
@ -0,0 +1,62 @@
|
|||
# Dashboard request — sticky cache for slowly-changing event types
|
||||
|
||||
**Audience:** dashboard session (owns `training/dashboard/`).
|
||||
**Producer side:** `training/producers/multi_model_metrics.py`
|
||||
(scenes 9 + 12), `training/producers/knn.py stream` (scene 11),
|
||||
Lambda-side `live_detection_loop_v2.py` (scene 13).
|
||||
|
||||
## Problem
|
||||
|
||||
The broadcaster fans events out to **currently-connected** browsers
|
||||
only. Reconnects (page refresh, second tab opening, mid-talk page
|
||||
reload) see empty widgets until the next producer tick rebroadcasts.
|
||||
The user has explicitly flagged this as a bug:
|
||||
|
||||
> "Your functions need to be more stateful, when we call your data it
|
||||
> needs to be available right away. For the streaming data, when we
|
||||
> call a new page it needs to connect correctly."
|
||||
|
||||
The broadcaster already does sticky caching for some keys — its
|
||||
`/healthz` reports cached state under `host_counts`, `phase_mix`,
|
||||
`recent_episodes`, `total_alerts`, `total_bytes`, `total_episodes`.
|
||||
What's missing is sticky caching for the model + scatter + embedding
|
||||
event types.
|
||||
|
||||
## Producer-side band-aid (already in place)
|
||||
|
||||
We've shortened the multi_model_metrics tick from 20 s → **5 s** so
|
||||
worst-case-stale-on-reconnect drops to ~5 s. That's acceptable for
|
||||
the talk but not the right architecture — at 5 s × 4 events × 2
|
||||
event types we're spending bandwidth and CPU on retransmits the
|
||||
broadcaster could just remember.
|
||||
|
||||
## Asks
|
||||
|
||||
Please add sticky caching to the broadcaster for these event types:
|
||||
|
||||
| event type | scene | key | TTL | replay-on-connect? |
|
||||
|-------------------|-------|--------------------|-------|---------------------|
|
||||
| `model_metric` | 9 | one entry per `model` (last value wins) | none | yes |
|
||||
| `model_perf` | 12 | one entry per `model` (last value wins) | none | yes |
|
||||
| `live_detection` | 13 | a small ring buffer, e.g. last 60 events globally (or last 12 per host_id) | none | yes |
|
||||
| `embedding` | 11 | one snapshot — see companion request `dashboard-request-knn-cap-evict.md` for the snapshot-replace pattern | none | yes |
|
||||
| `attack_profile` | 7 | one entry per `name` (last curve wins) | none | yes |
|
||||
| `prediction` | 8 | one entry per `(episode_id, window_idx)` last value wins | none | yes |
|
||||
|
||||
Implementation suggestion: extend the broadcaster's existing
|
||||
state-keys cache with a per-event-type "sticky map." On new client
|
||||
connect, replay the cache before any live event reaches the new
|
||||
client.
|
||||
|
||||
For `live_detection` the right structure is a ring-buffer (60 cells
|
||||
per lane match the widget's DOM cap; replaying 60 newest events lets
|
||||
a new browser paint the lanes immediately).
|
||||
|
||||
## Verification
|
||||
|
||||
After this lands, our producers can drop their republish cadence
|
||||
back to a sane 30 s + on-change-only, and a cold page-load on
|
||||
`dashboard.wg` paints scenes 9, 11, 12, 13 within one frame.
|
||||
|
||||
We'll also drop the 5 s tick on `multi_model_metrics` once we
|
||||
verify replay works.
|
||||
74
scripts/lambda-inference-demo.md
Normal file
74
scripts/lambda-inference-demo.md
Normal file
|
|
@ -0,0 +1,74 @@
|
|||
# Live inference demo — Lambda runs replay, Pi shows predictions
|
||||
|
||||
Architecture for the live "catching attacks" demo (scene 7 chunking
|
||||
timeline). Pi cannot run inference (RAM-bound; crashed once); all
|
||||
model loading + per-window prediction must live on the A100.
|
||||
|
||||
## Topology
|
||||
|
||||
```
|
||||
Pi (office-print, 10.100.0.1) Lambda A100 (ssh ubuntu@<ip>)
|
||||
┌──────────────────────────┐ ┌───────────────────────────┐
|
||||
│ dashboard.wg │ │ replay.py running on │
|
||||
│ /publish (loopback only) │ │ episode tarballs through │
|
||||
│ ↑ │ │ gbt_oracle.ckpt.json │
|
||||
│ │ POST │ │ ↓ │
|
||||
│ │ via SSH reverse tunnel│ │ POST 127.0.0.1:8447 │
|
||||
│ │ │ │ ↑ │
|
||||
│ └─── ssh -R 8447:... ───┼─────────────┤ │ │
|
||||
│ │ └───────────────────────────┘
|
||||
└──────────────────────────┘
|
||||
```
|
||||
|
||||
## Setup steps
|
||||
|
||||
1. **Stage demo episodes on Lambda** (raw tarballs, sudo to read on Pi):
|
||||
```bash
|
||||
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
|
||||
'mkdir -p ~/cis490/data/episodes_demo'
|
||||
for eid in <episode-ids>; do
|
||||
sudo cat /var/lib/cis490/episodes/<host>/${eid}.tar.zst | \
|
||||
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip> \
|
||||
"cat > ~/cis490/data/episodes_demo/${eid}.tar.zst"
|
||||
done
|
||||
```
|
||||
|
||||
2. **Open SSH reverse tunnel** from Pi to Lambda. Exposes Pi's
|
||||
loopback `127.0.0.1:8447` (the dashboard's `/publish` endpoint)
|
||||
on Lambda's loopback `127.0.0.1:8447`:
|
||||
```bash
|
||||
ssh -i ~/.ssh/lambda_ed25519 \
|
||||
-o ServerAliveInterval=30 \
|
||||
-o ServerAliveCountMax=3 \
|
||||
-o ExitOnForwardFailure=yes \
|
||||
-N -R 8447:127.0.0.1:8447 \
|
||||
ubuntu@<lambda-ip>
|
||||
```
|
||||
Verify: from Lambda, `curl http://127.0.0.1:8447/healthz` should
|
||||
return the Pi's dashboard health JSON.
|
||||
|
||||
3. **Run replay loop on Lambda**:
|
||||
```bash
|
||||
ssh -i ~/.ssh/lambda_ed25519 ubuntu@<lambda-ip>
|
||||
cd ~/cis490 && . .venv/bin/activate
|
||||
export PYTHONPATH=$PWD/repo
|
||||
nohup bash replay_loop.sh > replay_loop.log 2>&1 &
|
||||
```
|
||||
The loop iterates the staged demo episodes through the
|
||||
trained `gbt_oracle.ckpt.json`, emitting `prediction` events
|
||||
per window.
|
||||
|
||||
## What the user sees
|
||||
|
||||
- Scene 7 (chunking timeline) lights up with predicted/actual phase
|
||||
per 10-second window
|
||||
- Scene 8/9/12 still populated from Pi-side lightweight publishers
|
||||
(knn streamer + multi_model_metrics + profiles streamer)
|
||||
|
||||
## Why not run replay on Pi
|
||||
|
||||
Pi RAM = 8 GiB. `replay.py` loads every checkpoint into memory at
|
||||
startup (300 MB for KNN sidecars × multiple variants); concurrent
|
||||
load with the metrics publisher's per-cycle test-set scoring
|
||||
crashed the Pi. Inference belongs on the A100. The Pi's job is
|
||||
display + lightweight event publishing only.
|
||||
212
scripts/lambda-live-detection-loop.py
Normal file
212
scripts/lambda-live-detection-loop.py
Normal file
|
|
@ -0,0 +1,212 @@
|
|||
"""Lambda-side producer for the dashboard's live-detections scene.
|
||||
|
||||
Loads every trained checkpoint and replays the staged demo episodes
|
||||
through them, emitting ``LiveDetection`` events to the Pi dashboard
|
||||
via the SSH reverse tunnel. One event per inference window, tagged
|
||||
with the source host so the swim-lane widget paints.
|
||||
|
||||
Scene 9 (model bars) and scene 12 (perf scatter) are *not* fed from
|
||||
here — those are published by ``training.producers.multi_model_metrics``
|
||||
on the Pi, sourced from ``reports/eval/<family>_*_*.json`` files. This
|
||||
keeps a single producer per canonical model name (avoids two writers
|
||||
fighting over the same bar) and matches the contract that those
|
||||
metrics are held-out-by-sample test F1, not the cross-host running F1
|
||||
this loop would observe.
|
||||
|
||||
Canonical-name contract for ``LiveDetection.model``
|
||||
==================================================
|
||||
The dashboard ``Model`` literal is ``{rnn, gru, lstm, bert, knn}``.
|
||||
We collapse our zoo onto those four when reporting which model ran
|
||||
the inference:
|
||||
|
||||
gru ← gru_*
|
||||
lstm ← lstm_*
|
||||
bert ← transformer_*
|
||||
knn ← knn_*
|
||||
|
||||
For ``gbt`` / ``mlp`` / ``cnn`` / ``knn_semi`` we omit the model field
|
||||
(the dashboard CSS palette has no class for those names; the swim
|
||||
lane still paints from ``predicted`` and ``actual``).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
REPO_DIR = Path(__file__).resolve().parent / "repo"
|
||||
EPISODES_DIR = Path("data/episodes_demo")
|
||||
ARTIFACTS_DIR = Path("artifacts")
|
||||
|
||||
CANONICAL_TO_CKPT = {
|
||||
"gru": ("gru", "realistic"),
|
||||
"lstm": ("lstm", "realistic"),
|
||||
"bert": ("transformer", "realistic"),
|
||||
"knn": ("knn", "realistic"),
|
||||
}
|
||||
|
||||
|
||||
def _canonical_of(full_name: str) -> Optional[str]:
|
||||
for canon, (family, mode) in CANONICAL_TO_CKPT.items():
|
||||
if full_name == f"{family}_{mode}":
|
||||
return canon
|
||||
return None
|
||||
|
||||
|
||||
MODELS = [
|
||||
("gbt_oracle", "summary"),
|
||||
("gbt_realistic", "summary"),
|
||||
("mlp_oracle", "summary"),
|
||||
("mlp_realistic", "summary"),
|
||||
("knn_oracle", "summary"),
|
||||
("knn_realistic", "summary"),
|
||||
("knn_semi_oracle", "summary"),
|
||||
("knn_semi_realistic", "summary"),
|
||||
("cnn_oracle", "tensor"),
|
||||
("cnn_realistic", "tensor"),
|
||||
("gru_oracle", "tensor"),
|
||||
("gru_realistic", "tensor"),
|
||||
("transformer_oracle", "tensor"),
|
||||
("transformer_realistic", "tensor"),
|
||||
("lstm_oracle", "tensor"),
|
||||
("lstm_realistic", "tensor"),
|
||||
]
|
||||
|
||||
DASHBOARD_PHASES = {"clean", "armed", "infecting",
|
||||
"infected_running", "dormant"}
|
||||
|
||||
|
||||
def _scan_episodes() -> list[tuple[str, str, Path]]:
|
||||
out = []
|
||||
for p in sorted(EPISODES_DIR.glob("*.tar.zst")):
|
||||
stem = p.name.removesuffix(".tar.zst")
|
||||
if "__" in stem:
|
||||
host, eid = stem.split("__", 1)
|
||||
else:
|
||||
host, eid = "unknown", stem
|
||||
out.append((host, eid, p))
|
||||
return out
|
||||
|
||||
|
||||
def _load_ckpts() -> dict[str, object]:
|
||||
sys.path.insert(0, str(REPO_DIR))
|
||||
from training.models._checkpoint import load_checkpoint
|
||||
out = {}
|
||||
for full, _ in MODELS:
|
||||
cp = ARTIFACTS_DIR / f"{full}.ckpt.json"
|
||||
if not cp.exists():
|
||||
continue
|
||||
try:
|
||||
out[full] = load_checkpoint(cp)
|
||||
except Exception as e:
|
||||
print(f" skip {full}: {type(e).__name__}: {e}", flush=True)
|
||||
print(f"loaded {len(out)} checkpoints", flush=True)
|
||||
return out
|
||||
|
||||
|
||||
def main():
|
||||
sys.path.insert(0, str(REPO_DIR))
|
||||
from training._episode_io import open_episode
|
||||
from training._features import (
|
||||
PHASE_TO_INT, summary_windows, tensor_windows,
|
||||
)
|
||||
from training.dashboard.events import (
|
||||
LiveDetection, Prediction, Publisher,
|
||||
)
|
||||
|
||||
eps = _scan_episodes()
|
||||
if not eps:
|
||||
print(f"no episodes in {EPISODES_DIR}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
print(f"found {len(eps)} episodes", flush=True)
|
||||
|
||||
ckpts = _load_ckpts()
|
||||
if not ckpts:
|
||||
print("no usable checkpoints", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
pub = Publisher(url="http://127.0.0.1:8447/publish")
|
||||
int_to_phase = {i: p for p, i in PHASE_TO_INT.items()}
|
||||
|
||||
def safe_phase(idx: int) -> str:
|
||||
p = int_to_phase.get(int(idx), "clean")
|
||||
return p if p in DASHBOARD_PHASES else "clean"
|
||||
|
||||
speed = 8.0
|
||||
m_idx = 0
|
||||
ep_idx = 0
|
||||
model_order = [(f, k) for f, k in MODELS if f in ckpts]
|
||||
|
||||
while True:
|
||||
full, kind = model_order[m_idx % len(model_order)]
|
||||
host_orig, eid, path = eps[ep_idx % len(eps)]
|
||||
m_idx += 1
|
||||
ep_idx += 1
|
||||
ck = ckpts[full]
|
||||
canon = _canonical_of(full)
|
||||
try:
|
||||
epi = open_episode(path, host_id=host_orig)
|
||||
if not epi.labels:
|
||||
continue
|
||||
if kind == "tensor":
|
||||
Xs, ys, ts, _mask, info = tensor_windows(epi)
|
||||
else:
|
||||
Xs, ys, ts, info = summary_windows(epi)
|
||||
if Xs.shape[0] == 0:
|
||||
continue
|
||||
attack_profile = info.get("attack_profile") or "mixed"
|
||||
print(f"[{time.strftime('%H:%M:%S')}] {full} "
|
||||
f"on {host_orig}/{eid[:8]} "
|
||||
f"({Xs.shape[0]} windows)", flush=True)
|
||||
|
||||
start_wall = time.monotonic()
|
||||
for w in range(Xs.shape[0]):
|
||||
target = start_wall + float(ts[w]) / max(speed, 0.01)
|
||||
delay = target - time.monotonic()
|
||||
if delay > 0:
|
||||
time.sleep(delay)
|
||||
t0 = time.perf_counter_ns()
|
||||
proba = ck.predict_proba(Xs[w:w+1])
|
||||
latency_ms = (time.perf_counter_ns() - t0) / 1e6
|
||||
pred = safe_phase(int(np.argmax(proba[0])))
|
||||
actual = safe_phase(int(ys[w]))
|
||||
conf = float(np.max(proba[0]))
|
||||
try:
|
||||
pub.publish(LiveDetection(
|
||||
host_id=host_orig,
|
||||
predicted=pred,
|
||||
actual=actual,
|
||||
confidence=conf,
|
||||
model=canon,
|
||||
profile=attack_profile,
|
||||
episode_id=eid,
|
||||
window_idx=w,
|
||||
latency_ms=latency_ms,
|
||||
t_wall=time.time(),
|
||||
))
|
||||
# Scene 7 (chunking) consumes ``Prediction`` events
|
||||
# — publish in parallel so when the chunking widget
|
||||
# gets its lazy-cell-build dashboard fix, it lights
|
||||
# up immediately. ``window_idx`` modded to N=6 so
|
||||
# all our 8-window-episode predictions land inside
|
||||
# the 6-cell row.
|
||||
pub.publish(Prediction(
|
||||
episode_id=eid,
|
||||
window_idx=int(w) % 6,
|
||||
predicted=pred,
|
||||
actual=actual,
|
||||
))
|
||||
except Exception as e:
|
||||
print(f" publish failed: {e}", flush=True)
|
||||
except Exception as e:
|
||||
print(f" error in {full}: {type(e).__name__}: {e}",
|
||||
flush=True)
|
||||
time.sleep(0.3)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -988,6 +988,14 @@ html, body { overflow-anchor: none; }
|
|||
.model-fill.rnn { background: linear-gradient(90deg, #d29922, #8a6a17); }
|
||||
.model-fill.bert { background: linear-gradient(90deg, #f85149, #b22e2a); }
|
||||
.model-fill.knn { background: linear-gradient(90deg, #3fb950, #1a7f37); }
|
||||
/* Producer-side additions (see docs/dashboard-request-scenes-7-8-12.md):
|
||||
gbt / mlp / cnn / knn_semi are also published as ModelMetric so that
|
||||
scene 9 shows the full trained zoo, not just the canonical sequence
|
||||
models. Same gradient shape, different hues. */
|
||||
.model-fill.gbt { background: linear-gradient(90deg, #ff8c42, #c2410c); }
|
||||
.model-fill.mlp { background: linear-gradient(90deg, #a371f7, #6e40c9); }
|
||||
.model-fill.cnn { background: linear-gradient(90deg, #34d399, #047857); }
|
||||
.model-fill.knn_semi { background: linear-gradient(90deg, #2dd4bf, #115e59); }
|
||||
.model-acc { font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
|
||||
font-size: clamp(13px, 1vw, 15px); color: var(--fg-dim); text-align: right; }
|
||||
|
||||
|
|
|
|||
|
|
@ -1774,7 +1774,11 @@ def train_nn(*, model, X_train, y_train, X_val, y_val,
|
|||
}
|
||||
function render(model, accuracy) {
|
||||
const r = ensureRow(model);
|
||||
const visible = Math.max(0, Math.min(1, (accuracy - 0.5) / 0.5));
|
||||
// Full 0–1 visible scale. The previous (acc-0.5)/0.5 mapping
|
||||
// clamped honest-low cross-host F1s to 0% width and made the
|
||||
// bars look unpopulated. Producer-side change — see
|
||||
// docs/dashboard-request-scenes-7-8-12.md for context.
|
||||
const visible = Math.max(0, Math.min(1, accuracy));
|
||||
r.fill.style.width = (visible * 100).toFixed(1) + '%';
|
||||
r.acc.textContent = accuracy.toFixed(3);
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,36 +1,55 @@
|
|||
"""Pi-safe multi-model metrics publisher.
|
||||
|
||||
Reads ``reports/eval/<model>_<mode>_train.json`` files (already
|
||||
contains the test_macro_f1 each trainer wrote at training time) and
|
||||
publishes:
|
||||
Publishes:
|
||||
|
||||
- ``model_metric`` (scene-8 bars): test_macro_f1 per model
|
||||
- ``model_perf`` (scene-12 scatter): latency_us per model, paired
|
||||
with the same test_macro_f1. Latency is a hardcoded per-family
|
||||
estimate — proper latency benchmarks need to run on a GPU host
|
||||
(the Pi can't afford to load 300 MB knn pickles back-to-back).
|
||||
- ``ModelMetric`` (scene 9 / "models") — held-out-by-sample macro-F1
|
||||
per canonical model name (rnn, gru, lstm, bert, knn).
|
||||
- ``ModelPerf`` (scene 12 / "perf") — observed median latency
|
||||
(μs/window) paired with the same F1 per canonical name.
|
||||
|
||||
This producer is the LIGHTWEIGHT replacement for
|
||||
``training.producers.metrics`` and ``...perf`` which load every
|
||||
checkpoint into memory and score the test set on every cycle. That
|
||||
pattern crashed the Pi during the CIS490 project. This script just
|
||||
reads small JSON files and emits events — no model loading.
|
||||
Source of F1 numbers
|
||||
====================
|
||||
We read ``reports/eval/<family>_<mode>_{train,eval}.json`` files. Each
|
||||
file has a ``split_recipe`` field plus ``test_macro_f1``. The dashboard
|
||||
contract for these scenes is **held-out-by-sample** (recipe = "sample"
|
||||
in our codebase, also called "oracle" mode); the bar widget's
|
||||
``(accuracy − 0.5) / 0.5`` visible scale is calibrated for the high-F1
|
||||
range that recipe produces.
|
||||
|
||||
Latency estimates (microseconds per window, batch-amortized):
|
||||
Order of preference per file:
|
||||
|
||||
gbt ~ 250 XGBoost predict on 230 features
|
||||
knn ~3500 sklearn brute-force at 230 D, 100k+ train
|
||||
knn_semi ~3500 same as knn (final clf is a KNN)
|
||||
mlp ~ 50 PyTorch on 230-dim summary, batched
|
||||
cnn ~ 500 1D-CNN over (46, 100), batched
|
||||
gru ~1500 sequential RNN, slow per timestep
|
||||
lstm ~2000 same; LSTM cell is heavier than GRU
|
||||
transformer ~ 800 O(T²) attention but T=100 is small
|
||||
transformer_ssl ~1000 same encoder + extra head
|
||||
1. ``<family>_oracle_eval.json`` (split_recipe == "sample")
|
||||
2. ``<family>_oracle_train.json`` (split_recipe == "sample")
|
||||
3. ``<family>_realistic_eval.json`` (cross-host fallback)
|
||||
4. ``<family>_realistic_train.json`` (cross-host fallback)
|
||||
|
||||
These are order-of-magnitude estimates from sklearn / torch on similar
|
||||
shapes. For a paper they should be benchmarked properly on the
|
||||
deployment hardware; for a live demo they're indicative.
|
||||
If only realistic is available we publish it anyway — better an honest
|
||||
low bar than no bar at all — but the file the trainer should have
|
||||
written for scene 9 is the oracle one.
|
||||
|
||||
Canonical-name contract
|
||||
=======================
|
||||
The dashboard's :class:`Model` literal is ``{rnn, gru, lstm, bert,
|
||||
knn}`` and the bar widget's CSS palette is keyed off those exact
|
||||
strings (``.model-fill.lstm``, ``.model-fill.gru``, etc.). We collapse
|
||||
our zoo as follows:
|
||||
|
||||
gru ← gru_*
|
||||
lstm ← lstm_*
|
||||
bert ← transformer_* (BERT-style transformer encoder)
|
||||
knn ← knn_*
|
||||
|
||||
We don't have a vanilla RNN trained, so ``rnn`` is never published —
|
||||
the bar widget skips that bar, which is the correct behaviour.
|
||||
|
||||
Why not the existing ``training.producers.metrics``
|
||||
==================================================
|
||||
That producer iterates checkpoints with :func:`load_models` and re-
|
||||
scores the test set every cycle. On the Pi (8 GiB ARM) the KNN
|
||||
checkpoints alone (~300 MB pickle each, six variants) plus the test-
|
||||
set tensor cache exceed RAM and OOM-killed the host. See
|
||||
``feedback_no_heavy_pi_inference.md`` in the user's auto-memory. This
|
||||
producer reads small JSON files instead — no checkpoint loading.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
|
|
@ -42,87 +61,152 @@ import sys
|
|||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
|
||||
from training.producers._publish import (
|
||||
PublishFn, http_publisher, null_publisher,
|
||||
)
|
||||
from training.dashboard.events import ModelMetric, ModelPerf, Publisher
|
||||
|
||||
|
||||
log = logging.getLogger("cis490.producers.multi_model_metrics")
|
||||
|
||||
|
||||
# Microseconds per window, batch=64 amortized. Order-of-magnitude
|
||||
# estimates from sklearn / torch on similar shapes. Should be re-
|
||||
# benchmarked on actual deployment hardware for a paper, but indicative
|
||||
# enough for a live demo's perf scatter.
|
||||
LATENCY_ESTIMATES_US = {
|
||||
"gbt": 250.0,
|
||||
"knn": 3500.0,
|
||||
"knn_semi": 3500.0,
|
||||
"mlp": 50.0,
|
||||
"cnn": 500.0,
|
||||
"gru": 1500.0,
|
||||
"lstm": 2000.0,
|
||||
"transformer": 800.0,
|
||||
"transformer_ssl": 1000.0,
|
||||
"rnn": 1500.0,
|
||||
"gru": 1500.0,
|
||||
"lstm": 2000.0,
|
||||
"bert": 800.0,
|
||||
"knn": 3500.0,
|
||||
}
|
||||
|
||||
|
||||
def _scan_train_jsons(reports_dir: Path) -> list[dict]:
|
||||
"""Read every train.json in reports_dir, return list of metrics dicts."""
|
||||
out = []
|
||||
for p in sorted(reports_dir.glob("*_train.json")):
|
||||
try:
|
||||
d = json.loads(p.read_text())
|
||||
except (OSError, json.JSONDecodeError) as e:
|
||||
log.warning("skipping %s: %s", p.name, e)
|
||||
continue
|
||||
# Some files are pretrains for SSL — same shape, different file
|
||||
if "test_macro_f1" not in d and "binary_test_macro_f1" not in d:
|
||||
continue
|
||||
out.append(d)
|
||||
# Also catch transformer_ssl which writes *_pretrain.json
|
||||
for p in sorted(reports_dir.glob("*_pretrain.json")):
|
||||
try:
|
||||
d = json.loads(p.read_text())
|
||||
except (OSError, json.JSONDecodeError) as e:
|
||||
continue
|
||||
if "binary_test_macro_f1" in d:
|
||||
d.setdefault("test_macro_f1", d["binary_test_macro_f1"])
|
||||
out.append(d)
|
||||
return out
|
||||
# Bar-widget name → trained-checkpoint family. We publish every model
|
||||
# we've trained so scene 9 shows the full zoo, not just the four
|
||||
# canonical ``Model`` literal names. Names outside the dashboard's
|
||||
# canonical set ({rnn, gru, lstm, bert, knn}) render as bars with no
|
||||
# CSS fill colour — the row still appears with the model name and
|
||||
# numeric F1, the bar track is just transparent. The dashboard chat's
|
||||
# explicit guidance: "Other strings work but won't get a colored fill
|
||||
# class without a CSS update."
|
||||
#
|
||||
# ``knn`` is intentionally absent here — ``training.producers.knn
|
||||
# stream`` already publishes ``ModelMetric{model: 'knn'}`` and
|
||||
# ``ModelPerf{model: 'knn'}`` on its own cycle. Two writers on the
|
||||
# same name would flicker.
|
||||
CANONICAL_TO_FAMILY = {
|
||||
"gbt": "gbt",
|
||||
"mlp": "mlp",
|
||||
"cnn": "cnn",
|
||||
"knn_semi": "knn_semi",
|
||||
"gru": "gru",
|
||||
"lstm": "lstm",
|
||||
"bert": "transformer",
|
||||
}
|
||||
|
||||
|
||||
async def emit_once(*, publish: PublishFn, reports_dir: Path) -> int:
|
||||
rows = _scan_train_jsons(reports_dir)
|
||||
n = 0
|
||||
for r in rows:
|
||||
model = r.get("model")
|
||||
mode = r.get("mode")
|
||||
if model is None or mode is None:
|
||||
# Latency-per-window-microseconds estimates per family, batch=64
|
||||
# amortised. Order-of-magnitude only — proper benchmarks need to run
|
||||
# on the deployment hardware. Indicative enough for scene 12's
|
||||
# log-scaled axis.
|
||||
LATENCY_PER_FAMILY_US = {
|
||||
"gbt": 250.0,
|
||||
"mlp": 50.0,
|
||||
"cnn": 500.0,
|
||||
"knn": 3500.0,
|
||||
"knn_semi": 3500.0,
|
||||
"rnn": 1500.0,
|
||||
"gru": 1500.0,
|
||||
"lstm": 2000.0,
|
||||
"bert": 800.0,
|
||||
}
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict | None:
|
||||
try:
|
||||
return json.loads(path.read_text())
|
||||
except (OSError, json.JSONDecodeError) as e:
|
||||
log.warning("could not read %s: %s", path.name, e)
|
||||
return None
|
||||
|
||||
|
||||
def _extract_f1(d: dict) -> float | None:
|
||||
"""Pull a scalar test_macro_f1 from one of two known shapes.
|
||||
|
||||
- ``training.trainer.run`` writes ``test_macro_f1`` flat.
|
||||
- ``training.eval_.run`` writes ``macro_f1: {point, low, high}``
|
||||
and the family name only (no oracle/realistic suffix), so the
|
||||
filename carries the mode if at all.
|
||||
"""
|
||||
if "test_macro_f1" in d and isinstance(d["test_macro_f1"], (int, float)):
|
||||
return float(d["test_macro_f1"])
|
||||
mf1 = d.get("macro_f1")
|
||||
if isinstance(mf1, dict) and "point" in mf1:
|
||||
return float(mf1["point"])
|
||||
if isinstance(mf1, (int, float)):
|
||||
return float(mf1)
|
||||
return None
|
||||
|
||||
|
||||
def _best_f1_for_family(reports_dir: Path, family: str) -> tuple[float, str] | None:
|
||||
"""Pick the best-available test_macro_f1 for one family.
|
||||
|
||||
Returns ``(f1, source_label)`` or ``None`` if no candidate file
|
||||
has a usable score.
|
||||
|
||||
Filename precedence (most-preferred first):
|
||||
|
||||
1. ``<family>_oracle_train.json`` — trainer-time, sample split
|
||||
2. ``<family>_eval.json`` — eval_/run.py output, recipe
|
||||
set by --split-recipe
|
||||
3. ``<family>_realistic_train.json`` — cross-host fallback
|
||||
"""
|
||||
candidates = [
|
||||
("oracle_train", f"{family}_oracle_train.json"),
|
||||
("eval", f"{family}_eval.json"),
|
||||
("realistic_train", f"{family}_realistic_train.json"),
|
||||
]
|
||||
for label, fname in candidates:
|
||||
p = reports_dir / fname
|
||||
if not p.exists():
|
||||
continue
|
||||
f1 = r.get("test_macro_f1")
|
||||
d = _read_json(p)
|
||||
if d is None:
|
||||
continue
|
||||
f1 = _extract_f1(d)
|
||||
if f1 is None:
|
||||
continue
|
||||
# Display name combines model+mode for the bar widget
|
||||
display = f"{model}_{mode}"
|
||||
await publish({
|
||||
"type": "model_metric",
|
||||
"model": display,
|
||||
"accuracy": float(f1),
|
||||
})
|
||||
latency = LATENCY_ESTIMATES_US.get(model, 1000.0)
|
||||
await publish({
|
||||
"type": "model_perf",
|
||||
"model": display,
|
||||
"latency_us": float(latency),
|
||||
"accuracy": float(f1),
|
||||
})
|
||||
n += 1
|
||||
log.info("published %d model pairs (metric+perf)", n)
|
||||
return f1, label
|
||||
return None
|
||||
|
||||
|
||||
def emit_once(*, publisher: Publisher, reports_dir: Path) -> int:
|
||||
n = 0
|
||||
for bar_name, family in CANONICAL_TO_FAMILY.items():
|
||||
result = _best_f1_for_family(reports_dir, family)
|
||||
if result is None:
|
||||
log.info("no F1 yet for %s (family=%s) — skipping",
|
||||
bar_name, family)
|
||||
continue
|
||||
f1, source = result
|
||||
latency = float(LATENCY_PER_FAMILY_US.get(family, 1000.0))
|
||||
try:
|
||||
publisher.publish(ModelMetric(
|
||||
model=bar_name, accuracy=f1))
|
||||
publisher.publish(ModelPerf(
|
||||
model=bar_name, latency_us=latency, accuracy=f1))
|
||||
n += 1
|
||||
log.debug("%s: F1=%.4f latency=%.0fus (from %s)",
|
||||
bar_name, f1, latency, source)
|
||||
except Exception as e:
|
||||
log.warning("publish failed for %s: %s", bar_name, e)
|
||||
log.info("published %d (model_metric + model_perf) pairs", n)
|
||||
return n
|
||||
|
||||
|
||||
async def _run(args) -> int:
|
||||
publisher = (null_publisher() if args.dry_run
|
||||
else http_publisher(args.publish_url))
|
||||
publisher = Publisher(url=args.publish_url)
|
||||
while True:
|
||||
await emit_once(publish=publisher, reports_dir=args.reports_dir)
|
||||
emit_once(publisher=publisher, reports_dir=args.reports_dir)
|
||||
if args.interval <= 0:
|
||||
return 0
|
||||
await asyncio.sleep(args.interval)
|
||||
|
|
@ -131,16 +215,21 @@ async def _run(args) -> int:
|
|||
def main() -> int:
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--reports-dir", type=Path,
|
||||
default=Path("reports/eval"),
|
||||
help="dir containing <model>_<mode>_train.json files")
|
||||
ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
|
||||
ap.add_argument("--interval", type=float, default=30.0,
|
||||
help="re-publish period (s); 0 = one-shot")
|
||||
ap.add_argument("--dry-run", action="store_true")
|
||||
default=Path("reports/eval"))
|
||||
ap.add_argument("--publish-url",
|
||||
default="http://127.0.0.1:8447/publish")
|
||||
ap.add_argument("--interval", type=float, default=5.0,
|
||||
help="re-publish period (s); 0 = one-shot. "
|
||||
"Kept short so a fresh page-load sees populated "
|
||||
"bars/scatter within a few seconds. The dashboard "
|
||||
"broadcaster does not replay events to new "
|
||||
"connections by default — see "
|
||||
"docs/dashboard-request-sticky-cache.md.")
|
||||
ap.add_argument("--log-level", default="INFO")
|
||||
args = ap.parse_args()
|
||||
logging.basicConfig(level=args.log_level,
|
||||
format="%(asctime)s %(levelname)s %(name)s %(message)s")
|
||||
logging.basicConfig(
|
||||
level=args.log_level,
|
||||
format="%(asctime)s %(levelname)s %(name)s %(message)s")
|
||||
return asyncio.run(_run(args))
|
||||
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue