training/dashboard: PRODUCERS.md + client.py for the model session

Documents the event contract for producers that want to drive widgets on the dashboard: - /publish endpoint (loopback only; Caddy 404s externally) - All six widget-driving event types and their shapes - The reconnect gotcha (live events not replayed; only `snapshot` is) - Two integration patterns (separate process vs in-process) - Three options for browser-triggered demos - systemd hardening that constrains producers running on the Pi Adds a stdlib-only Publisher helper so producers don't need to hand-roll urllib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 21:34:54 -05:00 · 2026-05-07 21:34:54 -05:00 · e5931df8ad
commit e5931df8ad
parent a8157ed177
3 changed files with 291 additions and 0 deletions
--- a/training/dashboard/PRODUCERS.md
+++ b/training/dashboard/PRODUCERS.md
@ -0,0 +1,219 @@
+# Producing events for the dashboard
+
+If your code wants to drive a widget on `dashboard.wg`, this is the
+contract you need to honor and the pitfalls to avoid. The dashboard
+itself doesn't know anything about models — it just renders typed
+events that arrive on the message bus.
+
+## The `/publish` endpoint
+
+```
+POST http://127.0.0.1:8447/publish
+content-type: application/json
+body: {"type": "...", ...}
+```
+
+Returns `{"delivered": N}` — the number of currently-connected
+browsers that got the message. Returns `403` from anywhere other
+than `127.0.0.1` / `::1` (defense in depth) and Caddy's
+`dashboard.wg` site explicitly **404s** the path so WG peers can
+never reach it. Producers must run *on the Pi*.
+
+That loopback-only constraint is deliberate: any process on the Pi
+can publish without auth, and nothing off-host can. If you need a
+lab host to push events, route them through the receiver pipeline,
+not this endpoint.
+
+## Event types the widgets already subscribe to
+
+Each widget on the deck listens for one or more event types. Match
+these shapes and the corresponding scene starts updating live.
+
+| `type`           | Body                                                | Drives                                                  |
+| ---------------- | --------------------------------------------------- | ------------------------------------------------------- |
+| `phase`          | `{phase}`                                           | scene 5 — rolling 5-min phase mix                       |
+| `attack_profile` | `{name, shape, curve: [n0, n1, ...]}`               | scene 6 — per-profile envelope thumbnails               |
+| `prediction`     | `{episode_id, window_idx: 0-5, predicted, actual}`  | scene 7 — 10-second chunking timeline                   |
+| `model_metric`   | `{model: "rnn"\|"gru"\|"lstm"\|"bert", accuracy}`   | scene 8 — model accuracy bars                           |
+| `embedding`      | `{x: 0-1, y: 0-1, phase}`                           | scene 9 — KNN scatter (one circle per event)            |
+| `model_perf`     | `{model, latency_us, accuracy}`                     | scene 10 — accuracy vs inference cost                   |
+
+Phase values are
+`clean | armed | infecting | infected_running | dormant`. The
+visualization will accept any other phase string but it'll fall back
+to a neutral color and won't appear in the legend.
+
+`x` / `y` for `embedding` events are normalized to `[0, 1]`. Run
+your projector (PCA, UMAP, t-SNE, whatever) and rescale before
+publishing.
+
+You're free to introduce new types — they'll flow through unharmed,
+they just won't drive any existing widget. If you want a new widget,
+ping the dashboard side.
+
+## The reconnect gotcha (read this)
+
+The dashboard publishes a `snapshot` event to every new browser
+connection (and every 30 s thereafter). The snapshot contains
+*disk-derived* state only:
+`total_episodes`, `total_alerts`, `total_bytes`, `host_counts`,
+`recent_episodes`. Anything you publish via `/publish` is
+**not** part of that snapshot.
+
+Concretely: if you publish `model_metric` once at startup, only
+browsers connected at that moment see it. A browser opened ten
+seconds later sees an empty model-bars panel.
+
+Two ways out, pick one:
+
+1. **Re-publish on a tick.** Cheapest. Every 10–30 s, publish your
+   current state again. Cheap because the bus is in-process and the
+   serializer is small.
+
+2. **Ask the dashboard to cache it.** If you want server-side
+   persistence so browsers reconnect to a warm view, file a request
+   and we'll add a per-type accumulator + replay-on-connect to the
+   broadcaster. This is intentionally not built yet — wait until you
+   know which event types need it.
+
+## Two integration patterns
+
+### A. Separate publisher process (recommended for live inference)
+
+Your model code is its own process, packaged however you like.
+It POSTs events as it runs. No coupling to the dashboard's import
+graph, separate restart cycle, separate venv if you want.
+
+```python
+# training/models/run_inference.py (your file, your jurisdiction)
+from training.dashboard.client import Publisher
+
+pub = Publisher()  # defaults to http://127.0.0.1:8447/publish
+
+for window_idx, (predicted, actual) in enumerate(predictions):
+    pub.publish({
+        "type": "prediction",
+        "episode_id": episode_id,
+        "window_idx": window_idx,
+        "predicted": predicted,
+        "actual": actual,
+    })
+```
+
+If you run this as a systemd unit, mind the hardening notes below.
+
+### B. In-process (only if your model is light and stateless)
+
+Import the broadcaster directly and `await` its publish. This is
+faster (no HTTP) but couples your code to the dashboard's lifetime
+— a bug in your handler crashes the dashboard.
+
+```python
+from training.dashboard.app import broadcaster
+
+async def push_metric(model, accuracy):
+    await broadcaster.publish({"type": "model_metric",
+                                "model": model, "accuracy": accuracy})
+```
+
+Only do this if your code (a) is async-friendly, (b) won't block the
+event loop on heavy computation (use `asyncio.to_thread` for that),
+and (c) is genuinely on the dashboard's import path. Otherwise stick
+with pattern A.
+
+## Browser-triggered runs ("running examples on the Pi")
+
+The WebSocket is currently one-way (server → browser). If you want
+a *Run demo* button on the dashboard that triggers your model code
+on the Pi, you have three options:
+
+1. **Polling-style.** Your model code wakes periodically, picks the
+   latest episode, runs inference, publishes events. Browser is a
+   pure viewer. Simplest; no UI changes needed on the dashboard side.
+
+2. **HTTP endpoint, separate service.** Run your model code as
+   `cis490-models.service` listening on its own port (say `8448`).
+   Add a Caddy block (or extend the `dashboard.wg` snippet) so
+   `https://dashboard.wg/api/...` reverse-proxies to it. Add a
+   button to the dashboard frontend that hits that path. Your
+   service runs the demo and POSTs results back to `/publish`.
+
+3. **In-process action registry.** If your code lives in the
+   dashboard process, we can add a `/trigger/<action>` endpoint and
+   a registry where you call `register_action("run_lstm_demo",
+   fn)` at import time. Not built yet — request it if (3) fits.
+
+Pick one and we'll wire the dashboard side to match. (1) is the
+default — start there.
+
+## systemd hardening to watch out for
+
+`/etc/systemd/system/cis490-dashboard.service` runs as user
+`cis490` with:
+
+- `ProtectSystem=strict` — entire filesystem mounted read-only
+  except `/dev`, `/proc`, `/sys`, and any `ReadWritePaths`. There
+  are currently no `ReadWritePaths`. So inside the service, **you
+  can READ `/var/lib/cis490` but you can't write anywhere outside
+  `/tmp`**.
+- `ProtectHome=true` — `/home` is invisible. Don't reference paths
+  there from inside this service.
+- `User=cis490` — same identity as the receiver, so file
+  permissions for everything under `/var/lib/cis490` work uniformly.
+
+What this means for you:
+
+- **Reading episode tarballs from `/var/lib/cis490/episodes/` is
+  fine.** Your inference code can open them.
+- **Writing inference caches / model checkpoints needs a writable
+  path.** If you want a cache under `/var/lib/cis490/models/` we
+  can add `ReadWritePaths=/var/lib/cis490` to the unit. File a
+  request with the path you need.
+- **Loading model weights from disk works** as long as they're
+  baked into `/opt/cis490` (e.g. `/opt/cis490/training/models/weights/`).
+- **The `.venv` at `/opt/cis490/.venv`** is shared with the
+  receiver and bootstrap services. New Python deps go in
+  `pyproject.toml` and someone with write access to `/opt/cis490`
+  re-runs `uv sync` (or `pip install`).
+
+If your code runs as a separate `cis490-models.service`, copy the
+hardening template from `etc/cis490-dashboard.service` and adjust
+`ReadWritePaths` for your needs.
+
+## Worked examples
+
+```bash
+# Quick smoke test from a Pi shell
+curl -s http://127.0.0.1:8447/publish \
+  -H 'content-type: application/json' \
+  -d '{"type":"model_metric","model":"lstm","accuracy":0.928}'
+# → {"delivered": <N>}
+```
+
+```python
+# Re-publishing model state every 20 seconds so reconnects stay warm
+import time
+from training.dashboard.client import Publisher
+
+pub = Publisher()
+state = {"rnn": 0.872, "gru": 0.911, "lstm": 0.928, "bert": 0.954}
+
+while True:
+    for model, acc in state.items():
+        pub.publish({"type": "model_metric", "model": model, "accuracy": acc})
+    time.sleep(20)
+```
+
+```python
+# Async producer streaming embeddings as a UMAP runs
+import asyncio
+from training.dashboard.client import Publisher
+
+async def stream_embeddings(points):
+    pub = Publisher()
+    for x, y, phase in points:
+        await asyncio.to_thread(pub.publish, {
+            "type": "embedding", "x": x, "y": y, "phase": phase,
+        })
+        await asyncio.sleep(0.02)   # let the UI breathe
+```
--- a/training/dashboard/README.md
+++ b/training/dashboard/README.md
@ -35,6 +35,21 @@ curl -s http://127.0.0.1:8447/publish \
 The `/publish` endpoint is loopback-only (403 otherwise) and is **not**
 reverse-proxied by Caddy, so it cannot be hit from the WG mesh.

+## Producing events from your own code
+
+If you're writing code that should drive a dashboard widget (model
+inference, training-loop metrics, profile envelopes), see
+[`PRODUCERS.md`](./PRODUCERS.md) — it documents every event type
+the widgets subscribe to, the loopback-only `/publish` contract, the
+reconnect gotcha, and the systemd hardening that constrains
+producers running on the Pi. There's a stdlib-only Python helper at
+[`client.py`](./client.py):
+
+```python
+from training.dashboard.client import Publisher
+Publisher().publish({"type": "model_metric", "model": "lstm", "accuracy": 0.928})
+```
+
 ## Customizing the page

 The default `static/index.html` exposes `window.dashboard`:
--- a/training/dashboard/client.py
+++ b/training/dashboard/client.py
@ -0,0 +1,57 @@
+"""Tiny HTTP client for publishing events to the dashboard.
+
+For producer code outside the dashboard process. See PRODUCERS.md
+for the event contract and the loopback-only constraint.
+
+Stdlib-only on purpose so adding a producer doesn't pull a new
+dependency into ``pyproject.toml``. Sync API; wrap in
+``asyncio.to_thread`` if you're calling from an event loop.
+"""
+from __future__ import annotations
+
+import json
+import logging
+import urllib.error
+import urllib.request
+from typing import Any
+
+
+log = logging.getLogger("cis490.dashboard.client")
+
+DEFAULT_URL = "http://127.0.0.1:8447/publish"
+
+
+class Publisher:
+    """One-shot publisher. Reuses no connection; the dashboard's
+    upstream is uvicorn on loopback so the per-call overhead is
+    sub-millisecond. If you're publishing >100 events/s, switch to
+    the in-process pattern documented in PRODUCERS.md instead."""
+
+    def __init__(self, url: str = DEFAULT_URL, timeout: float = 2.0) -> None:
+        self.url = url
+        self.timeout = timeout
+
+    def publish(self, msg: dict[str, Any]) -> dict[str, Any]:
+        """POST ``msg`` to /publish. Returns the parsed response
+        body (``{"delivered": N}`` on success). Raises
+        ``urllib.error.HTTPError`` on non-2xx; producers usually
+        want to log + continue rather than abort, so wrap as
+        appropriate."""
+        body = json.dumps(msg).encode("utf-8")
+        req = urllib.request.Request(
+            self.url, data=body,
+            headers={"content-type": "application/json"},
+            method="POST",
+        )
+        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
+            return json.loads(resp.read())
+
+    def try_publish(self, msg: dict[str, Any]) -> int:
+        """Like ``publish`` but swallows exceptions and returns the
+        delivered count (0 on failure). Convenience for producers
+        that want fire-and-forget semantics."""
+        try:
+            return int(self.publish(msg).get("delivered", 0))
+        except (urllib.error.URLError, OSError, ValueError):
+            log.exception("publish to %s failed for %r", self.url, msg.get("type"))
+            return 0