producers: phase-aware attack envelopes + tickable KNN metric/perf

profiles.py — non-shortcut fit:

  Old: pick one accepted episode per profile, emit its raw
       fraction-of-duration curve. Confounded by single-episode noise,
       phase-budget timing variance, and the cumulative-counter
       startup-spike artifact.

  New: aggregate up to N=100 accepted episodes per profile, slice each
       by labels.jsonl phase events, resample EACH PHASE to a fixed
       budget so the median across episodes captures the canonical
       per-phase shape rather than smearing peaks across the timeline.
       Save median + p25/p75 band to data/processed/attack_profiles_v1.parquet.
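
  A minimal sketch of the median + p25/p75 step, assuming each episode
  has already been reduced to an equal-length curve and that NaN marks
  "no data at this index". The helper names (nan_quantile, envelope)
  are illustrative and are not the names used in profiles.py, which
  relies on NumPy's nanmedian/nanpercentile instead.

```python
import math

def nan_quantile(values, q):
    """Linearly interpolated quantile of the non-NaN entries; NaN if none."""
    xs = sorted(x for x in values if not math.isnan(x))
    if not xs:
        return math.nan
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def envelope(curves):
    """curves: one equal-length list per episode.
    Returns (median, p25, p75), each with one value per curve index."""
    columns = list(zip(*curves))  # transpose: one tuple per index
    median = [nan_quantile(c, 0.50) for c in columns]
    p25 = [nan_quantile(c, 0.25) for c in columns]
    p75 = [nan_quantile(c, 0.75) for c in columns]
    return median, p25, p75

# Three toy episodes, four points each; one episode lacks the last point.
curves = [
    [0.0, 10.0, 100.0, 5.0],
    [0.0, 12.0, 90.0, math.nan],
    [1.0, 8.0, 110.0, 3.0],
]
median, p25, p75 = envelope(curves)
print(median)  # [0.0, 10.0, 100.0, 4.0]
```

  Ignoring NaN per index, rather than dropping the whole episode, keeps
  an episode's usable phases in the aggregate even when one phase had
  no samples.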

  Per-phase point budget (sums to 80):
       clean_lead 10, armed 5, infecting 10, infected_running 40,
       clean_tail 15. dormant (when present) folded into infected_running.
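
  The per-phase resampling can be sketched as follows. This is a
  pure-Python illustration under stated assumptions, not the code in
  profiles.py: BUDGET, interp and phase_aligned are hypothetical names,
  and the phase boundaries are passed in directly instead of being read
  from labels.jsonl.

```python
import math

# Points per phase, as listed in the budget above (sums to 80).
BUDGET = [("clean_lead", 10), ("armed", 5), ("infecting", 10),
          ("infected_running", 40), ("clean_tail", 15)]

def interp(x, ts, vs):
    """Linear interpolation of (ts, vs) at x; NaN outside the sampled range."""
    if x < ts[0] or x > ts[-1]:
        return math.nan
    for i in range(1, len(ts)):
        if x <= ts[i]:
            span = ts[i] - ts[i - 1]
            w = 0.0 if span == 0 else (x - ts[i - 1]) / span
            return vs[i - 1] + w * (vs[i] - vs[i - 1])
    return vs[-1]

def phase_aligned(ts, vs, bounds):
    """bounds: {phase: (t_start, t_end)} in episode-relative seconds.
    Each phase is resampled to its own budget, then all are concatenated."""
    out = []
    for phase, n in BUDGET:
        t0, t1 = bounds[phase]
        for k in range(n):
            out.append(interp(t0 + (t1 - t0) * k / (n - 1), ts, vs))
    return out

def toy_episode(run_end):
    """1 Hz samples; the channel sits at 100 while infected_running is active."""
    ts = [float(t) for t in range(run_end + 6)]
    vs = [100.0 if 15 <= t <= run_end else 0.0 for t in ts]
    bounds = {"clean_lead": (0, 10), "armed": (10, 12), "infecting": (12, 15),
              "infected_running": (15, run_end),
              "clean_tail": (run_end, run_end + 5)}
    return ts, vs, bounds

curve_a = phase_aligned(*toy_episode(40))  # 25 s of infected_running
curve_b = phase_aligned(*toy_episode(60))  # 45 s of infected_running
print(len(curve_a), curve_a[25:65] == curve_b[25:65])
```

  Both toy episodes place their plateau on indices 25..64 even though
  the phase lasts 25 s in one and 45 s in the other, which is the
  property a per-index median across episodes depends on.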

  Channel swap: io-walk uses proc.cpu_sys_jiffies, NOT
  proc.io_write_bytes. Host /proc on QEMU doesn't see virtio-blk
  writes via io.write_bytes (writes go through KVM's I/O path, not
  write() syscalls); cpu_sys_jiffies tracks kernel time which spikes
  during heavy I/O scheduling.

  Concrete result: cpu-saturate now shows the proper plateau-during-
  infected_running with peak at 100 j/s (was 30 j/s spike at idx 0
  then mostly zero); low-and-slow shows its distinctive low-amplitude
  profile (peak 21 vs cpu-saturate's 100); io-walk shows the
  rapid-rise-then-decay shape consistent with dd finishing mid-phase.
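
  The old idx-0 spike is consistent with the cumulative-counter startup
  artifact named above. A small sketch of how such a spike can arise
  when a cumulative counter is turned into a per-second rate; it
  assumes the first rate is computed against an implicit zero baseline,
  which is one plausible reading and not a description of what
  channel_arrays() does internally. Both function names are
  hypothetical.

```python
def rates_zero_seeded(ts, counts):
    """Rate per sample, treating the counter as 0 at t=0.
    The first value absorbs everything accumulated before sampling began."""
    rates, prev_t, prev_c = [], 0.0, 0
    for t, c in zip(ts, counts):
        rates.append((c - prev_c) / (t - prev_t))
        prev_t, prev_c = t, c
    return rates

def rates_diff_only(ts, counts):
    """Rate between successive samples only; no synthetic first point."""
    return [(counts[i] - counts[i - 1]) / (ts[i] - ts[i - 1])
            for i in range(1, len(ts))]

# A process that had already accumulated 3000 jiffies before the first
# sample and then runs at a steady 100 jiffies per second.
ts = [1.0, 2.0, 3.0, 4.0]
counts = [3000, 3100, 3200, 3300]

print(rates_zero_seeded(ts, counts))  # [3000.0, 100.0, 100.0, 100.0]
print(rates_diff_only(ts, counts))    # [100.0, 100.0, 100.0]
```

  Slicing each phase by time, as the new fit does, avoids a leading
  artifact of this kind for every phase that starts after t=0.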

knn.py — sticky model_metric / model_perf:

  Stream subcommand gains --also-metric / --also-perf-latency-us
  flags. When set, each cycle publishes a model_metric event
  (tagged model=knn) for scene-8 (model bars) and a model_perf
  event for scene-12 (accuracy vs inference cost). Republishing on
  the cycle keeps reconnecting browsers populated without depending
  on the dashboard's not-yet-built sticky-event cache.
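
  A stripped-down sketch of the republish-per-cycle idea, with the
  publisher replaced by an in-memory recorder. The stream() signature
  and the record() helper are illustrative assumptions; the event
  fields match the ones in the diff.

```python
import asyncio

async def stream(publish, *, accuracy=None, latency_us=None,
                 cycles=1, pause_s=0.0):
    """Publish the optional sticky events at the top of every cycle."""
    for _ in range(cycles):
        if accuracy is not None:
            await publish({"type": "model_metric", "model": "knn",
                           "accuracy": float(accuracy)})
        if latency_us is not None:
            await publish({"type": "model_perf", "model": "knn",
                           "latency_us": float(latency_us),
                           "accuracy": float(accuracy or 0.0)})
        # ... the per-window prediction events of the cycle would go here ...
        await asyncio.sleep(pause_s)

sent = []

async def record(msg):
    """Stand-in publisher that keeps events in memory."""
    sent.append(msg)

asyncio.run(stream(record, accuracy=0.762, latency_us=3410, cycles=2))
print(len(sent))  # 4: one model_metric and one model_perf per cycle
```

  Note that passing latency_us without accuracy yields a model_perf
  event with accuracy 0.0, so the two flags are best set together.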

  Measured KNN inference latency on the 150k-trained classifier:
      single-window predict: 61.5 ms (sklearn brute-force at 230 D)
      per-window in batch=64: 3.4 ms (the production-realistic number)
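
  One way to obtain numbers of this kind is sketched below. It is not
  the measurement script behind the figures above: the classifier is
  replaced by a hypothetical pure-Python 1-NN stand-in, so the absolute
  values (and the single-vs-batch ratio) will not resemble the sklearn
  results. Only the timing arithmetic is the point.

```python
import time

def per_window_latency_us(predict, windows, repeats=5):
    """Best-of-`repeats` wall time of one predict(windows) call,
    divided by the number of windows, in microseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(windows)
        best = min(best, time.perf_counter() - start)
    return best / len(windows) * 1e6

# Hypothetical stand-in for the trained classifier: brute-force 1-NN.
TRAIN = [([float(i % 7), float(i % 11)], i % 2) for i in range(200)]

def predict(windows):
    labels = []
    for w in windows:
        nearest = min(
            TRAIN,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], w)),
        )
        labels.append(nearest[1])
    return labels

windows = [[float(i % 5), float(i % 3)] for i in range(64)]
single_us = per_window_latency_us(predict, windows[:1])  # batch of 1
batched_us = per_window_latency_us(predict, windows)     # batch of 64
print(f"single={single_us:.1f} us  per-window in batch={batched_us:.1f} us")
```

  The per-window batch figure is the one that maps onto the latency_us
  field of model_perf.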

  Streamer published: model_metric{knn, 0.762} +
                      model_perf{knn, latency_us=3410, accuracy=0.762}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max 2026-05-08 14:07:53 -05:00
parent 3783fabe86
commit 05bccac29f
2 changed files with 450 additions and 99 deletions


@ -445,6 +445,18 @@ async def _stream(args) -> int:
while True:
started = time.monotonic()
n_emit = 0
# Optional sticky model_metric / model_perf for scenes 8 & 12.
# Republished each cycle so reconnecting browsers see them
# without waiting for a separate metric tick.
if args.also_metric is not None:
await publisher({"type": "model_metric", "model": "knn",
"accuracy": float(args.also_metric)})
if args.also_perf_latency_us is not None:
await publisher({"type": "model_perf", "model": "knn",
"latency_us": float(args.also_perf_latency_us),
"accuracy": float(args.also_metric or 0.0)})
# Fan out the publishes in batches via asyncio.gather. Each
# publish is its own loopback HTTP POST, but gather lets ~burst_size
# of them be in flight concurrently — turns sequential ~5 ms/event
@ -592,6 +604,15 @@ def main() -> int:
"stay populated")
ps.add_argument("--cycle-pause-s", type=float, default=15.0,
help="pause between cycles when --loop is set")
ps.add_argument("--also-metric", type=float, default=None,
help="if set, also publish a model_metric event "
"for scene-8 (model bars) with this accuracy "
"value tagged model=knn, every cycle")
ps.add_argument("--also-perf-latency-us", type=float, default=None,
help="if set, also publish a model_perf event "
"for scene-12 (perf scatter) with this "
"latency_us tagged model=knn, every cycle. "
"Pairs with --also-metric for the accuracy.")
ps.add_argument("--seed", type=int, default=0)
ps.add_argument("--dry-run", action="store_true")
ps.add_argument("--log-level", default="INFO")


@ -1,18 +1,38 @@
"""Emit `attack_profile` events — canonical envelope per profile.
"""Emit `attack_profile` events for scene-6 (/proc signatures per profile).
For each known profile (cpu-saturate, scan-and-dial, ...) pick a
representative episode from the validated set, extract one observable
channel that reflects the profile's shape, and publish a normalized
80-point curve as `attack_profile`.
Each event names a profile and ships an 80-point curve representing the
canonical /proc signature of that profile across the dataset.
Channel choice per profile is defensible:
cpu-saturate guest.cpu_user (sustained 1-vCPU peg)
scan-and-dial netflow.syn_count (SYN bursts)
io-walk guest.eth0_tx_bytes? actually use proc.io_write_bytes
since IO is the loud signal
bursty-c2 netflow.bytes_out (idle + spikes)
low-and-slow guest.mem_available (slow memory churn)
shell-resident netflow.tcp_count (one persistent flow)
The honest version (this file): aggregate the channel timeseries
across N (default 100) accepted episodes per profile, compute the
per-timestep MEDIAN, save median + p25/p75 band to disk, stream the
median. Episodes are aligned *per phase*: each episode is sliced by
its labels.jsonl phase events and every phase is resampled to a fixed
point budget (PHASE_POINT_BUDGET); the phases together fill the
80-point grid.
The shortcut (NOT this file): pick one accepted episode per profile
and render its raw curve. Confounded by single-episode noise,
phase-budget timing variance, and hardware-host heterogeneity.
Channels are /proc-derived per the slide title:
cpu-saturate proc.cpu_user_jiffies sustained 1-vCPU peg
io-walk proc.cpu_sys_jiffies fs walk + urandom writes
bursty-c2 proc.cpu_user_jiffies long idle + bursts
low-and-slow proc.cpu_user_jiffies low baseline + bumps
scan-and-dial proc.cpu_sys_jiffies many short network syscalls
shell-resident proc.cpu_sys_jiffies persistent socket + ticks
(Counter channels: channel_arrays converts them to per-second rates.)
Two subcommands:
fit aggregate N episodes per profile, save the median + band
to data/processed/attack_profiles_v1.parquet
stream load the saved parquet, publish attack_profile events
on a tick (with --loop for long-running)
Same producer pattern as knn.py: fit-once, stream-from-disk.
"""
from __future__ import annotations
@ -24,11 +44,12 @@ import sys
from pathlib import Path
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
sys.path.insert(0, str(Path(__file__).resolve().parents[3]))
from training._episode_io import open_episode
from training._features import ALL_CHANNELS, channel_arrays
from training._features import channel_arrays, episode_t0_wall_ns
from training.producers._publish import (
PublishFn, http_publisher, null_publisher,
)
@ -37,118 +58,427 @@ from training.producers._publish import (
log = logging.getLogger("cis490.dashboard.producers.profiles")
PROFILE_TO_CHANNEL = {
"cpu-saturate": ("guest.cpu_user", "sustained 1-vCPU peg (XMRig)"),
"scan-and-dial": ("netflow.syn_count", "SYN-style probes + dial-home"),
"io-walk": ("proc.io_write_bytes", "fs traversal + 4 KiB urandom writes"),
"bursty-c2": ("netflow.bytes_out", "long idle + 3-packet egress bursts"),
"low-and-slow": ("guest.mem_available", "minimal CPU + periodic memory churn"),
"shell-resident": ("netflow.tcp_count", "one persistent TCP socket + ticks"),
# Slide-title is "/proc signature per profile", so we use proc-side
# channels exclusively. After channel_arrays() these become per-second
# rates (counter-diff).
#
# io-walk uses proc.cpu_sys_jiffies, NOT proc.io_write_bytes, because
# the host /proc on the QEMU process doesn't see virtio-blk writes
# via the io.write_bytes counter (writes go through KVM's I/O path,
# not write() syscalls). cpu_sys_jiffies tracks kernel time, which
# does spike during heavy host-side I/O scheduling.
PROFILE_TO_CHANNEL: dict[str, tuple[str, str]] = {
"cpu-saturate": ("proc.cpu_user_jiffies", "sustained 1-vCPU peg (XMRig)"),
"io-walk": ("proc.cpu_sys_jiffies", "fs traversal + urandom writes"),
"bursty-c2": ("proc.cpu_user_jiffies", "long idle + 3-packet egress bursts"),
"low-and-slow": ("proc.cpu_user_jiffies", "minimal CPU + periodic memory churn"),
"scan-and-dial": ("proc.cpu_sys_jiffies", "SYN-style probes + dial-home"),
"shell-resident": ("proc.cpu_sys_jiffies", "persistent TCP socket + command ticks"),
}
def _resample(t: np.ndarray, v: np.ndarray, n: int = 80) -> list[float]:
"""Fixed-length curve via linear resample on uniform t-grid."""
# Phase-aware alignment: per-phase points in the concatenated 80-point
# curve. Sums to 80. The canonical schedule is
# clean(10s) → armed(2s) → infecting(3s) → infected_running(25s) → clean(5s)
# but the per-phase point allocation is shape-driven not time-driven —
# we want infected_running (the operationally distinctive phase) to
# dominate the visual.
PHASE_POINT_BUDGET = {
"clean": 10, # opening clean; the baseline
"armed": 5,
"infecting": 10,
"infected_running": 40, # the signature phase
"dormant": 10, # not read by _phase_aligned_curve: when an episode
# walks through dormant, that time is folded into
# the infected_running slice and shares its budget
"_tail_clean": 15, # closing clean (after infected_running);
# kept separate from leading clean
}
# Total: 10 + 5 + 10 + 40 + 15 = 80 (the dormant entry is not counted)
# ─────────────────────────────────────────────────────────────────────
# Fit
# ─────────────────────────────────────────────────────────────────────
def _resample_to_grid(t: np.ndarray, v: np.ndarray, n: int,
*, t_start: float | None = None,
t_end: float | None = None) -> np.ndarray | None:
"""Resample (t, v) to n points across [t_start, t_end] (or the
[min, max] of the finite samples' times if not given). Linear
interpolation over finite samples; grid points outside the sampled
range come back as NaN. Returns None when there are fewer than two
finite samples or the window is empty, and the caller pads with NaN.
The first sample of a counter-diff'd rate is often a startup-artifact
spike (cumulative-since-process-start divided by first sample dt).
It is not dropped here; that is left to the caller. The slice-by-time
approach used by phase-aware alignment naturally avoids it when
t_start > 0.
"""
if len(t) < 2:
return [0.0] * n
grid = np.linspace(t.min(), t.max(), n)
return None
finite = np.isfinite(v)
if finite.sum() < 2:
return [0.0] * n
out = np.interp(grid, t[finite], v[finite])
# Normalize to [0, 1] for the dashboard's curve renderer
lo, hi = float(np.min(out)), float(np.max(out))
if int(finite.sum()) < 2:
return None
t_f = t[finite]; v_f = v[finite]
lo = t_start if t_start is not None else float(t_f.min())
hi = t_end if t_end is not None else float(t_f.max())
if hi - lo < 1e-9:
return [0.0] * n
return ((out - lo) / (hi - lo)).astype(float).tolist()
return None
grid = np.linspace(lo, hi, n)
return np.interp(grid, t_f, v_f, left=np.nan, right=np.nan)
def _pick_episode_per_profile(validation_path: Path, store_root: Path
) -> dict[str, tuple[Path, str]]:
"""Return {profile: (tarball_path, host_id)} for the first accepted
episode we find for each profile."""
out: dict[str, tuple[Path, str]] = {}
val = pq.read_table(validation_path,
columns=["episode_id", "host_id", "profile", "status"]
).to_pylist()
for r in val:
if r["status"] != "accepted":
def _phase_segments(epi) -> list[tuple[str, float, float]]:
"""Return [(phase, t_start_s, t_end_s)] sequence for one episode.
Uses labels.jsonl events. The last segment ends at the episode's
duration (from meta.json). Times are episode-relative:
(t_wall_ns - first label's t_wall_ns) / 1e9. This is the same clock
the feature extractor uses.
"""
if not epi.labels:
return []
t0 = int(epi.labels[0]["t_wall_ns"])
duration = (epi.meta.get("result") or {}).get("duration_observed_s") or 0.0
out: list[tuple[str, float, float]] = []
labels = epi.labels
for i, L in enumerate(labels):
ph = L.get("phase")
if ph is None:
continue
prof = r["profile"]
if not prof or prof in out:
continue
path = store_root / r["host_id"] / f"{r['episode_id']}.tar.zst"
if path.exists():
out[prof] = (path, r["host_id"])
if len(out) == len(PROFILE_TO_CHANNEL):
break
t_s = (L["t_wall_ns"] - t0) / 1e9
if i + 1 < len(labels):
t_e = (labels[i + 1]["t_wall_ns"] - t0) / 1e9
else:
t_e = duration if duration > t_s else t_s + 1.0
if t_e > t_s:
out.append((ph, t_s, t_e))
return out
async def emit_profiles(*, publish: PublishFn, validation_path: Path,
store_root: Path) -> int:
picks = _pick_episode_per_profile(validation_path, store_root)
log.info("found example episodes for: %s", sorted(picks.keys()))
n = 0
for prof, (path, host_id) in picks.items():
cfg = PROFILE_TO_CHANNEL.get(prof)
if not cfg:
def _phase_aligned_curve(epi, channel_name: str, n_points: int = 80
) -> np.ndarray | None:
"""Build a fixed-length per-phase-aligned curve for one episode.
The output is ``n_points`` long with each phase taking the budget
in PHASE_POINT_BUDGET. Layout (canonical 5-phase walk):
idx 0..9 clean (leading)
idx 10..14 armed
idx 15..24 infecting
idx 25..64 infected_running (the distinctive phase)
idx 65..79 clean (trailing)
If the episode has a `dormant` phase (longer walks), it is folded
into the infected_running budget: `infected_running` and `dormant`
are resampled together, since both are post-armed activity and the
median across episodes loses precision otherwise.
Returns None if the episode's labels don't include the canonical walk.
"""
from training._features import channel_arrays, episode_t0_wall_ns
segs = _phase_segments(epi)
if len(segs) < 4:
return None
# Find the first occurrence of each canonical phase
phase_starts: dict[str, float] = {}
phase_ends: dict[str, float] = {}
for ph, ts, te in segs:
if ph not in phase_starts:
phase_starts[ph] = ts
# Last end-time we see for this phase (may be reset on re-entry)
phase_ends[ph] = te
# Fold dormant into infected_running's slice if present
if "infected_running" in phase_starts and "dormant" in phase_starts:
# Extend infected_running's end through any dormant occurrences
last_relevant = max(
phase_ends.get("infected_running", 0.0),
phase_ends.get("dormant", 0.0),
)
phase_ends["infected_running"] = last_relevant
# We want: leading_clean, armed, infecting, infected_running, trailing_clean
# Find where leading_clean ends (= first armed) and where trailing_clean
# starts (= the LAST clean's start, after infected_running/dormant).
if "armed" not in phase_starts or "infected_running" not in phase_starts:
return None
leading_clean_start = 0.0
leading_clean_end = phase_starts["armed"]
armed_end = phase_starts.get("infecting", phase_starts["armed"] + 0.1)
infecting_end = phase_starts["infected_running"]
inf_run_end = phase_ends["infected_running"]
duration = (epi.meta.get("result") or {}).get("duration_observed_s") or 0.0
# Trailing clean = anything after infected_running's end
trailing_clean_start = inf_run_end
trailing_clean_end = duration if duration > inf_run_end else inf_run_end + 0.1
arrs = channel_arrays(epi, episode_t0_wall_ns(epi))
t, v = arrs.get(channel_name, (np.zeros(0), np.zeros(0)))
if len(t) < 2:
return None
# Per-phase resample then concat. Non-finite samples are dropped before
# interpolation; grid points that fall outside the channel's sampled
# range come back as NaN (np.interp left/right=NaN) and are absorbed by
# the nanmedian/nanpercentile aggregation downstream.
pieces: list[np.ndarray] = []
for (label, t_s, t_e, n) in [
("clean_lead", leading_clean_start, leading_clean_end, PHASE_POINT_BUDGET["clean"]),
("armed", leading_clean_end, armed_end, PHASE_POINT_BUDGET["armed"]),
("infecting", armed_end, infecting_end, PHASE_POINT_BUDGET["infecting"]),
("infected_running", infecting_end, inf_run_end, PHASE_POINT_BUDGET["infected_running"]),
("clean_tail", trailing_clean_start, trailing_clean_end, PHASE_POINT_BUDGET["_tail_clean"]),
]:
if t_e <= t_s:
pieces.append(np.full(n, np.nan, dtype=np.float64))
continue
seg = _resample_to_grid(t, v, n=n, t_start=t_s, t_end=t_e)
if seg is None:
pieces.append(np.full(n, np.nan, dtype=np.float64))
else:
pieces.append(seg)
out = np.concatenate(pieces, axis=0)
if out.shape[0] != n_points:
# Defensive: pad/truncate to expected length
if out.shape[0] < n_points:
out = np.concatenate(
[out, np.full(n_points - out.shape[0], np.nan)]
)
else:
out = out[:n_points]
return out
def aggregate_envelope(
*, validation_path: Path, store_root: Path,
profile: str, channel_name: str,
n_episodes_target: int = 100, n_points: int = 80,
seed: int = 0,
) -> dict | None:
"""Aggregate the channel timeseries across up to ``n_episodes_target``
accepted episodes of ``profile``. Returns a dict with the canonical
median curve + p25/p75 band, plus diagnostics; ``None`` if no
episodes were usable.
Aggregation method:
1. Random-sample up to N accepted episodes of this profile
2. For each, decompress the tarball, extract channel via
channel_arrays() (counter-diff applied)
3. Resample each phase to its fixed point budget and concatenate
(phase-aware alignment via _phase_aligned_curve)
4. Stack (n_used, n_points)
5. Per-timestep median + p25/p75 across episodes
6. Normalize median to [0, 1] for the dashboard renderer
"""
rng = np.random.default_rng(seed)
val = pq.read_table(
validation_path,
columns=["episode_id", "host_id", "profile", "status"],
).to_pylist()
candidates = [r for r in val
if r["status"] == "accepted" and r["profile"] == profile]
if not candidates:
log.warning("no accepted episodes for profile %s", profile)
return None
if len(candidates) > n_episodes_target:
idx = rng.choice(len(candidates), n_episodes_target, replace=False)
candidates = [candidates[i] for i in idx]
curves: list[np.ndarray] = []
n_attempted = 0
n_failed = 0
for r in candidates:
n_attempted += 1
path = Path(store_root) / r["host_id"] / f"{r['episode_id']}.tar.zst"
if not path.exists():
n_failed += 1
continue
ch_name, shape_text = cfg
try:
epi = open_episode(path, host_id=host_id)
epi = open_episode(path, host_id=r["host_id"])
except Exception as e:
log.warning("open %s failed: %s", path, e)
log.debug("open %s failed: %s", path, e)
n_failed += 1
continue
if not epi.labels:
n_failed += 1
continue
t0 = int(epi.labels[0]["t_mono_ns"])
arrs = channel_arrays(epi, t0)
t, v = arrs.get(ch_name, (np.zeros(0), np.zeros(0)))
curve = _resample(t, v, n=80)
await publish({
"type": "attack_profile",
"name": prof, "shape": shape_text, "curve": curve,
})
n += 1
return n
# Phase-aware alignment: each episode's curve uses its phase
# boundaries from labels.jsonl, resampled per phase to a fixed
# budget so the median across episodes captures the canonical
# per-phase shape (vs the fraction-of-duration shortcut, which
# washes out plateaus when phase timing varies).
curve = _phase_aligned_curve(epi, channel_name, n_points=n_points)
if curve is None:
n_failed += 1
continue
curves.append(curve)
if not curves:
log.warning("no usable episodes for profile %s "
"(attempted=%d, failed=%d)",
profile, n_attempted, n_failed)
return None
arr = np.asarray(curves, dtype=np.float64) # (n_used, n_points)
# nanmedian/percentile because individual phase segments may have
# produced NaN where the channel had no data in that phase.
median = np.nanmedian(arr, axis=0)
median = np.where(np.isnan(median), 0.0, median)
p25 = np.nan_to_num(np.nanpercentile(arr, 25, axis=0), nan=0.0)
p75 = np.nan_to_num(np.nanpercentile(arr, 75, axis=0), nan=0.0)
# Normalize the median for the renderer; keep raw values so the band
# can be plotted in the same scale on a future widget.
m_lo, m_hi = float(np.nanmin(median)), float(np.nanmax(median))
if m_hi - m_lo < 1e-9:
median_norm = np.zeros_like(median)
else:
median_norm = (median - m_lo) / (m_hi - m_lo)
log.info("%-15s channel=%-26s n_used=%d/%d range=[%.3g, %.3g]",
profile, channel_name, len(curves), n_attempted, m_lo, m_hi)
return {
"profile": profile,
"channel": channel_name,
"n_episodes_used": len(curves),
"n_episodes_attempted": n_attempted,
"median": median_norm.astype(np.float32).tolist(),
"median_raw": median.astype(np.float32).tolist(),
"p25_raw": p25.astype(np.float32).tolist(),
"p75_raw": p75.astype(np.float32).tolist(),
"median_min_raw": m_lo,
"median_max_raw": m_hi,
}
async def _run(args: argparse.Namespace) -> int:
logging.basicConfig(level=logging.INFO,
format="%(asctime)s %(levelname)s %(name)s %(message)s")
def _save_fit(rows: list[dict], path: Path) -> None:
"""Save aggregation result as a parquet (one row per profile)."""
if not rows:
raise ValueError("nothing to save")
n_points = len(rows[0]["median"])
schema = pa.schema([
("profile", pa.string()),
("channel", pa.string()),
("n_episodes_used", pa.int32()),
("n_episodes_attempted", pa.int32()),
("median", pa.list_(pa.float32(), n_points)),
("median_raw", pa.list_(pa.float32(), n_points)),
("p25_raw", pa.list_(pa.float32(), n_points)),
("p75_raw", pa.list_(pa.float32(), n_points)),
("median_min_raw", pa.float64()),
("median_max_raw", pa.float64()),
])
cols = {n: [r[n] for r in rows] for n in schema.names}
tbl = pa.table(cols, schema=schema)
path.parent.mkdir(parents=True, exist_ok=True)
pq.write_table(tbl, path, compression="zstd")
async def _fit(args) -> int:
rows: list[dict] = []
for prof, (ch_name, _) in PROFILE_TO_CHANNEL.items():
res = aggregate_envelope(
validation_path=args.validation,
store_root=args.store,
profile=prof, channel_name=ch_name,
n_episodes_target=args.n_episodes_per_profile,
n_points=args.n_points,
seed=args.seed,
)
if res is not None:
rows.append(res)
if not rows:
log.error("no profiles had usable episodes")
return 1
_save_fit(rows, args.fit_out)
log.info("wrote %s (%d profile rows)", args.fit_out, len(rows))
print(json.dumps({
"saved_to": str(args.fit_out),
"n_profiles": len(rows),
"profiles": [r["profile"] for r in rows],
}, indent=2))
return 0
# ─────────────────────────────────────────────────────────────────────
# Stream
# ─────────────────────────────────────────────────────────────────────
async def _stream(args) -> int:
if not args.load_fit.exists():
log.error("fit parquet not found: %s — run `profiles fit --fit-out` first",
args.load_fit)
return 1
tbl = pq.read_table(args.load_fit).to_pylist()
log.info("loaded %d profile aggregates from %s", len(tbl), args.load_fit)
publisher = (null_publisher() if args.dry_run
else http_publisher(args.publish_url))
# Sample episodes once; their envelopes are static. Cache and
# re-publish on a tick for reconnects.
cached: list[dict] = []
async def cached_publish(msg: dict) -> None:
cached.append(msg)
await publisher(msg)
await emit_profiles(publish=cached_publish,
validation_path=args.validation,
store_root=args.store)
if args.interval <= 0 or not cached:
return 0
while True:
await asyncio.sleep(args.interval)
for msg in cached:
await publisher(msg)
for r in tbl:
shape_text = PROFILE_TO_CHANNEL.get(r["profile"], (None, ""))[1]
await publisher({
"type": "attack_profile",
"name": r["profile"],
"shape": shape_text,
"curve": list(r["median"]),
})
log.info("cycle complete: %d profiles", len(tbl))
if not args.loop:
return 0
await asyncio.sleep(args.cycle_pause_s)
# ─────────────────────────────────────────────────────────────────────
# CLI
# ─────────────────────────────────────────────────────────────────────
def main() -> int:
ap = argparse.ArgumentParser()
ap.add_argument("--validation", required=True, type=Path)
ap.add_argument("--store", required=True, type=Path)
ap.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
ap.add_argument("--interval", type=float, default=30.0,
help="re-publish cached profile curves every N seconds; "
"0 = one-shot.")
ap.add_argument("--dry-run", action="store_true")
args = ap.parse_args()
return asyncio.run(_run(args))
p = argparse.ArgumentParser()
sub = p.add_subparsers(dest="cmd", required=True)
pf = sub.add_parser("fit",
help="aggregate N episodes per profile and save "
"the canonical envelope to disk")
pf.add_argument("--validation", required=True, type=Path)
pf.add_argument("--store", required=True, type=Path,
help="receiver episode store root, e.g. "
"/var/lib/cis490/episodes")
pf.add_argument("--fit-out", type=Path,
default=Path("data/processed/attack_profiles_v1.parquet"))
pf.add_argument("--n-episodes-per-profile", type=int, default=100)
pf.add_argument("--n-points", type=int, default=80,
help="resampled curve length (default 80, what "
"the dashboard's profile thumbnail expects)")
pf.add_argument("--seed", type=int, default=0)
pf.add_argument("--log-level", default="INFO")
pf.set_defaults(func=_fit)
ps = sub.add_parser("stream",
help="publish attack_profile events from a saved "
"fit parquet")
ps.add_argument("--load-fit", type=Path,
default=Path("data/processed/attack_profiles_v1.parquet"))
ps.add_argument("--publish-url", default="http://127.0.0.1:8447/publish")
ps.add_argument("--loop", action="store_true",
help="cycle indefinitely so reconnecting browsers "
"stay populated")
ps.add_argument("--cycle-pause-s", type=float, default=20.0,
help="pause between cycles when --loop is set")
ps.add_argument("--dry-run", action="store_true")
ps.add_argument("--log-level", default="INFO")
ps.set_defaults(func=_stream)
args = p.parse_args()
logging.basicConfig(level=args.log_level,
format="%(asctime)s %(levelname)s %(name)s %(message)s")
return asyncio.run(args.func(args))
if __name__ == "__main__":