training/dashboard: PRODUCERS.md + client.py for the model session
Documents the event contract for producers that want to drive widgets on the dashboard: - /publish endpoint (loopback only; Caddy 404s externally) - All six widget-driving event types and their shapes - The reconnect gotcha (live events not replayed; only `snapshot` is) - Two integration patterns (separate process vs in-process) - Three options for browser-triggered demos - systemd hardening that constrains producers running on the Pi Adds a stdlib-only Publisher helper so producers don't need to hand-roll urllib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
a8157ed177
commit
e5931df8ad
3 changed files with 291 additions and 0 deletions
219
training/dashboard/PRODUCERS.md
Normal file
219
training/dashboard/PRODUCERS.md
Normal file
|
|
@ -0,0 +1,219 @@
|
|||
# Producing events for the dashboard
|
||||
|
||||
If your code wants to drive a widget on `dashboard.wg`, this is the
|
||||
contract you need to honor and the pitfalls to avoid. The dashboard
|
||||
itself doesn't know anything about models — it just renders typed
|
||||
events that arrive on the message bus.
|
||||
|
||||
## The `/publish` endpoint
|
||||
|
||||
```
|
||||
POST http://127.0.0.1:8447/publish
|
||||
content-type: application/json
|
||||
body: {"type": "...", ...}
|
||||
```
|
||||
|
||||
Returns `{"delivered": N}` — the number of currently-connected
|
||||
browsers that got the message. Returns `403` from anywhere other
|
||||
than `127.0.0.1` / `::1` (defense in depth) and Caddy's
|
||||
`dashboard.wg` site explicitly **404s** the path so WG peers can
|
||||
never reach it. Producers must run *on the Pi*.
|
||||
|
||||
That loopback-only constraint is deliberate: any process on the Pi
|
||||
can publish without auth, and nothing off-host can. If you need a
|
||||
lab host to push events, route them through the receiver pipeline,
|
||||
not this endpoint.
|
||||
|
||||
## Event types the widgets already subscribe to
|
||||
|
||||
Each widget on the deck listens for one or more event types. Match
|
||||
these shapes and the corresponding scene starts updating live.
|
||||
|
||||
| `type` | Body | Drives |
|
||||
| ---------------- | --------------------------------------------------- | ------------------------------------------------------- |
|
||||
| `phase` | `{phase}` | scene 5 — rolling 5-min phase mix |
|
||||
| `attack_profile` | `{name, shape, curve: [n0, n1, ...]}` | scene 6 — per-profile envelope thumbnails |
|
||||
| `prediction` | `{episode_id, window_idx: 0-5, predicted, actual}` | scene 7 — 10-second chunking timeline |
|
||||
| `model_metric` | `{model: "rnn"\|"gru"\|"lstm"\|"bert", accuracy}` | scene 8 — model accuracy bars |
|
||||
| `embedding` | `{x: 0-1, y: 0-1, phase}` | scene 9 — KNN scatter (one circle per event) |
|
||||
| `model_perf` | `{model, latency_us, accuracy}` | scene 10 — accuracy vs inference cost |
|
||||
|
||||
Phase values are
|
||||
`clean | armed | infecting | infected_running | dormant`. The
|
||||
visualization will accept any other phase string but it'll fall back
|
||||
to a neutral color and won't appear in the legend.
|
||||
|
||||
`x` / `y` for `embedding` events are normalized to `[0, 1]`. Run
|
||||
your projector (PCA, UMAP, t-SNE, whatever) and rescale before
|
||||
publishing.
|
||||
|
||||
You're free to introduce new types — they'll flow through unharmed,
|
||||
they just won't drive any existing widget. If you want a new widget,
|
||||
ping the dashboard side.
|
||||
|
||||
## The reconnect gotcha (read this)
|
||||
|
||||
The dashboard publishes a `snapshot` event to every new browser
|
||||
connection (and every 30 s thereafter). The snapshot contains
|
||||
*disk-derived* state only:
|
||||
`total_episodes`, `total_alerts`, `total_bytes`, `host_counts`,
|
||||
`recent_episodes`. Anything you publish via `/publish` is
|
||||
**not** part of that snapshot.
|
||||
|
||||
Concretely: if you publish `model_metric` once at startup, only
|
||||
browsers connected at that moment see it. A browser opened ten
|
||||
seconds later sees an empty model-bars panel.
|
||||
|
||||
Two ways out, pick one:
|
||||
|
||||
1. **Re-publish on a tick.** Cheapest. Every 10–30 s, publish your
|
||||
current state again. Cheap because the bus is in-process and the
|
||||
serializer is small.
|
||||
|
||||
2. **Ask the dashboard to cache it.** If you want server-side
|
||||
persistence so browsers reconnect to a warm view, file a request
|
||||
and we'll add a per-type accumulator + replay-on-connect to the
|
||||
broadcaster. This is intentionally not built yet — wait until you
|
||||
know which event types need it.
|
||||
|
||||
## Two integration patterns
|
||||
|
||||
### A. Separate publisher process (recommended for live inference)
|
||||
|
||||
Your model code is its own process, packaged however you like.
|
||||
It POSTs events as it runs. No coupling to the dashboard's import
|
||||
graph, separate restart cycle, separate venv if you want.
|
||||
|
||||
```python
|
||||
# training/models/run_inference.py (your file, your jurisdiction)
|
||||
from training.dashboard.client import Publisher
|
||||
|
||||
pub = Publisher() # defaults to http://127.0.0.1:8447/publish
|
||||
|
||||
for window_idx, (predicted, actual) in enumerate(predictions):
|
||||
pub.publish({
|
||||
"type": "prediction",
|
||||
"episode_id": episode_id,
|
||||
"window_idx": window_idx,
|
||||
"predicted": predicted,
|
||||
"actual": actual,
|
||||
})
|
||||
```
|
||||
|
||||
If you run this as a systemd unit, mind the hardening notes below.
|
||||
|
||||
### B. In-process (only if your model is light and stateless)
|
||||
|
||||
Import the broadcaster directly and `await` its publish. This is
|
||||
faster (no HTTP) but couples your code to the dashboard's lifetime
|
||||
— a bug in your handler crashes the dashboard.
|
||||
|
||||
```python
|
||||
from training.dashboard.app import broadcaster
|
||||
|
||||
async def push_metric(model, accuracy):
|
||||
await broadcaster.publish({"type": "model_metric",
|
||||
"model": model, "accuracy": accuracy})
|
||||
```
|
||||
|
||||
Only do this if your code (a) is async-friendly, (b) won't block the
|
||||
event loop on heavy computation (use `asyncio.to_thread` for that),
|
||||
and (c) is genuinely on the dashboard's import path. Otherwise stick
|
||||
with pattern A.
|
||||
|
||||
## Browser-triggered runs ("running examples on the Pi")
|
||||
|
||||
The WebSocket is currently one-way (server → browser). If you want
|
||||
a *Run demo* button on the dashboard that triggers your model code
|
||||
on the Pi, you have three options:
|
||||
|
||||
1. **Polling-style.** Your model code wakes periodically, picks the
|
||||
latest episode, runs inference, publishes events. Browser is a
|
||||
pure viewer. Simplest; no UI changes needed on the dashboard side.
|
||||
|
||||
2. **HTTP endpoint, separate service.** Run your model code as
|
||||
`cis490-models.service` listening on its own port (say `8448`).
|
||||
Add a Caddy block (or extend the `dashboard.wg` snippet) so
|
||||
`https://dashboard.wg/api/...` reverse-proxies to it. Add a
|
||||
button to the dashboard frontend that hits that path. Your
|
||||
service runs the demo and POSTs results back to `/publish`.
|
||||
|
||||
3. **In-process action registry.** If your code lives in the
|
||||
dashboard process, we can add a `/trigger/<action>` endpoint and
|
||||
a registry where you call `register_action("run_lstm_demo",
|
||||
fn)` at import time. Not built yet — request it if (3) fits.
|
||||
|
||||
Pick one and we'll wire the dashboard side to match. (1) is the
|
||||
default — start there.
|
||||
|
||||
## systemd hardening to watch out for
|
||||
|
||||
`/etc/systemd/system/cis490-dashboard.service` runs as user
|
||||
`cis490` with:
|
||||
|
||||
- `ProtectSystem=strict` — entire filesystem mounted read-only
|
||||
except `/dev`, `/proc`, `/sys`, and any `ReadWritePaths`. There
|
||||
are currently no `ReadWritePaths`. So inside the service, **you
|
||||
can READ `/var/lib/cis490` but you can't write anywhere outside
|
||||
`/tmp`**.
|
||||
- `ProtectHome=true` — `/home` is invisible. Don't reference paths
|
||||
there from inside this service.
|
||||
- `User=cis490` — same identity as the receiver, so file
|
||||
permissions for everything under `/var/lib/cis490` work uniformly.
|
||||
|
||||
What this means for you:
|
||||
|
||||
- **Reading episode tarballs from `/var/lib/cis490/episodes/` is
|
||||
fine.** Your inference code can open them.
|
||||
- **Writing inference caches / model checkpoints needs a writable
|
||||
path.** If you want a cache under `/var/lib/cis490/models/` we
|
||||
can add `ReadWritePaths=/var/lib/cis490` to the unit. File a
|
||||
request with the path you need.
|
||||
- **Loading model weights from disk works** as long as they're
|
||||
baked into `/opt/cis490` (e.g. `/opt/cis490/training/models/weights/`).
|
||||
- **The `.venv` at `/opt/cis490/.venv`** is shared with the
|
||||
receiver and bootstrap services. New Python deps go in
|
||||
`pyproject.toml` and someone with write access to `/opt/cis490`
|
||||
re-runs `uv sync` (or `pip install`).
|
||||
|
||||
If your code runs as a separate `cis490-models.service`, copy the
|
||||
hardening template from `etc/cis490-dashboard.service` and adjust
|
||||
`ReadWritePaths` for your needs.
|
||||
|
||||
## Worked examples
|
||||
|
||||
```bash
|
||||
# Quick smoke test from a Pi shell
|
||||
curl -s http://127.0.0.1:8447/publish \
|
||||
-H 'content-type: application/json' \
|
||||
-d '{"type":"model_metric","model":"lstm","accuracy":0.928}'
|
||||
# → {"delivered": <N>}
|
||||
```
|
||||
|
||||
```python
|
||||
# Re-publishing model state every 20 seconds so reconnects stay warm
|
||||
import time
|
||||
from training.dashboard.client import Publisher
|
||||
|
||||
pub = Publisher()
|
||||
state = {"rnn": 0.872, "gru": 0.911, "lstm": 0.928, "bert": 0.954}
|
||||
|
||||
while True:
|
||||
for model, acc in state.items():
|
||||
pub.publish({"type": "model_metric", "model": model, "accuracy": acc})
|
||||
time.sleep(20)
|
||||
```
|
||||
|
||||
```python
|
||||
# Async producer streaming embeddings as a UMAP runs
|
||||
import asyncio
|
||||
from training.dashboard.client import Publisher
|
||||
|
||||
async def stream_embeddings(points):
|
||||
pub = Publisher()
|
||||
for x, y, phase in points:
|
||||
await asyncio.to_thread(pub.publish, {
|
||||
"type": "embedding", "x": x, "y": y, "phase": phase,
|
||||
})
|
||||
await asyncio.sleep(0.02) # let the UI breathe
|
||||
```
|
||||
|
|
@ -35,6 +35,21 @@ curl -s http://127.0.0.1:8447/publish \
|
|||
The `/publish` endpoint is loopback-only (403 otherwise) and is **not**
|
||||
reverse-proxied by Caddy, so it cannot be hit from the WG mesh.
|
||||
|
||||
## Producing events from your own code
|
||||
|
||||
If you're writing code that should drive a dashboard widget (model
|
||||
inference, training-loop metrics, profile envelopes), see
|
||||
[`PRODUCERS.md`](./PRODUCERS.md) — it documents every event type
|
||||
the widgets subscribe to, the loopback-only `/publish` contract, the
|
||||
reconnect gotcha, and the systemd hardening that constrains
|
||||
producers running on the Pi. There's a stdlib-only Python helper at
|
||||
[`client.py`](./client.py):
|
||||
|
||||
```python
|
||||
from training.dashboard.client import Publisher
|
||||
Publisher().publish({"type": "model_metric", "model": "lstm", "accuracy": 0.928})
|
||||
```
|
||||
|
||||
## Customizing the page
|
||||
|
||||
The default `static/index.html` exposes `window.dashboard`:
|
||||
|
|
|
|||
57
training/dashboard/client.py
Normal file
57
training/dashboard/client.py
Normal file
|
|
@ -0,0 +1,57 @@
|
|||
"""Tiny HTTP client for publishing events to the dashboard.
|
||||
|
||||
For producer code outside the dashboard process. See PRODUCERS.md
|
||||
for the event contract and the loopback-only constraint.
|
||||
|
||||
Stdlib-only on purpose so adding a producer doesn't pull a new
|
||||
dependency into ``pyproject.toml``. Sync API; wrap in
|
||||
``asyncio.to_thread`` if you're calling from an event loop.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import urllib.error
|
||||
import urllib.request
|
||||
from typing import Any
|
||||
|
||||
|
||||
log = logging.getLogger("cis490.dashboard.client")
|
||||
|
||||
DEFAULT_URL = "http://127.0.0.1:8447/publish"
|
||||
|
||||
|
||||
class Publisher:
|
||||
"""One-shot publisher. Reuses no connection; the dashboard's
|
||||
upstream is uvicorn on loopback so the per-call overhead is
|
||||
sub-millisecond. If you're publishing >100 events/s, switch to
|
||||
the in-process pattern documented in PRODUCERS.md instead."""
|
||||
|
||||
def __init__(self, url: str = DEFAULT_URL, timeout: float = 2.0) -> None:
|
||||
self.url = url
|
||||
self.timeout = timeout
|
||||
|
||||
def publish(self, msg: dict[str, Any]) -> dict[str, Any]:
|
||||
"""POST ``msg`` to /publish. Returns the parsed response
|
||||
body (``{"delivered": N}`` on success). Raises
|
||||
``urllib.error.HTTPError`` on non-2xx; producers usually
|
||||
want to log + continue rather than abort, so wrap as
|
||||
appropriate."""
|
||||
body = json.dumps(msg).encode("utf-8")
|
||||
req = urllib.request.Request(
|
||||
self.url, data=body,
|
||||
headers={"content-type": "application/json"},
|
||||
method="POST",
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=self.timeout) as resp:
|
||||
return json.loads(resp.read())
|
||||
|
||||
def try_publish(self, msg: dict[str, Any]) -> int:
|
||||
"""Like ``publish`` but swallows exceptions and returns the
|
||||
delivered count (0 on failure). Convenience for producers
|
||||
that want fire-and-forget semantics."""
|
||||
try:
|
||||
return int(self.publish(msg).get("delivered", 0))
|
||||
except (urllib.error.URLError, OSError, ValueError):
|
||||
log.exception("publish to %s failed for %r", self.url, msg.get("type"))
|
||||
return 0
|
||||
Loading…
Add table
Reference in a new issue