Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
- perf stat -j writes to stderr by default with -p; we read stdout.
  Add --log-fd 1 so JSON reaches stdout where the parser sees it.
- Event names come back annotated with the privilege scope perf
  actually measured ("cycles:u" under perf_event_paranoid=2). Strip
  the suffix so _build_row's plain-name lookups hit. Without this
  every metric was None even when perf reported real numbers.
- tests/test_collectors_emit.py covers the regression with a real
  busy-loop fixture; emit-test discipline per §4.4.
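
The suffix stripping described above can be sketched as follows. This is a minimal illustration, not the actual perf_qemu code: the helper names `strip_event_modifier` and `parse_perf_line`, the modifier set in the regex, and the sample JSON line are all assumptions; only the `event` / `counter-value` keys follow the general shape of `perf stat -j` output.

```python
from __future__ import annotations

import json
import re

# Hypothetical modifier set (u=user, k=kernel, h=hypervisor, G/H=guest/host,
# p/P=precision). Assumed for illustration; the real collector may differ.
_MODIFIER_SUFFIX = re.compile(r":[ukhGHpP]+$")


def strip_event_modifier(event: str) -> str:
    """Return the event name without a trailing privilege-scope modifier."""
    return _MODIFIER_SUFFIX.sub("", event)


def parse_perf_line(line: str) -> tuple[str, float | None]:
    """Parse one JSON line into (plain event name, value or None)."""
    rec = json.loads(line)
    name = strip_event_modifier(rec["event"])
    try:
        value = float(rec["counter-value"])
    except (KeyError, TypeError, ValueError):
        # Placeholders such as "<not counted>" are not numeric.
        value = None
    return name, value


# Hypothetical sample line in the shape of interval-mode JSON output.
sample = (
    '{"interval": 0.2, "counter-value": "1234.000000", '
    '"unit": "", "event": "cycles:u"}'
)
print(parse_perf_line(sample))
```

Anchoring the regex at the end of the string, rather than splitting on the first colon, leaves tracepoint-style names such as `sched:sched_switch` untouched.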

guest-agent collector (rows_guest=0 on 100% of episodes):
- Alpine cloud image doesn't ship python3, so the in-guest agent's
  `#!/usr/bin/env python3` shebang silently fails. Add packages:
  [python3] to cidata user-data so cloud-init installs it before
  the OpenRC service starts.
- Guest agent now exits nonzero (was: silent stdout fallback) when
  /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
  reports the failure to /var/log/cis490-agent.log instead of the
  bytes vanishing into the void. Refs §1.
- Host-side collector emits guest_agent_connected /
  guest_agent_first_byte / guest_agent_silent_window into the
  orchestrator's events.jsonl. Future episodes show the in-guest
  failure mode per-episode instead of inferring from rows_guest=0.
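
The fail-loud behaviour for a missing virtio port might look roughly like this. It is a sketch under assumptions: the function name `open_agent_port` and the message wording are hypothetical, and only the device path comes from the text above.

```python
import os
import sys

# Device path as named in the commit message.
PORT_PATH = "/dev/virtio-ports/cis490.guest.agent"


def open_agent_port(path: str = PORT_PATH):
    """Open the virtio-serial port, or exit nonzero with a reason on stderr.

    Hypothetical helper: exiting with status 1 lets the service supervisor
    record the failure instead of the agent quietly writing to stdout.
    """
    if not os.path.exists(path):
        print(
            f"guest agent: {path} is missing; refusing to fall back to stdout",
            file=sys.stderr,
        )
        raise SystemExit(1)
    # Unbuffered binary writes so each record reaches the host promptly.
    return open(path, "wb", buffering=0)
```

Raising `SystemExit(1)` rather than returning a fallback stream is the point of the change: the failure becomes visible in the service log instead of surfacing only as rows_guest=0.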

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
Tier-3 episodes; was misclassified as host divergence):
- tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
  qmp_socket / guest_agent_socket / bridge_iface, even though
  launch_target.sh creates the underlying chardevs and BRIDGE
  supplies the iface. tools/run_real_vm_demo.py wires them
  correctly; Tier-3 had a copy-paste gap.
- tests/test_collectors_emit.py adds a source-grep regression so
  the wiring stays honest.

samba_usermap_script never lands session (0/67 in §3 probe):
- Bind handler default WfsDelay (~5s) gives up before bind_perl on
  Metasploitable2 has finished forking + binding LPORT under
  SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
  exploits/driver.py so framework + driver agree on the wait
  budget. Add ConnectTimeout=15 so the handler's bind connect has
  retry budget instead of one-shot.
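
As a sketch of keeping the two wait budgets aligned, the handler options could be built from the driver's timeout. The helper name `bind_handler_options`, the payload path, and the way options reach the framework are assumptions for illustration; only the option names and values come from the text above.

```python
# Value the commit message gives for session_open_timeout_s in exploits/driver.py.
SESSION_OPEN_TIMEOUT_S = 30


def bind_handler_options(lport: int, wait_s: int = SESSION_OPEN_TIMEOUT_S) -> dict:
    """Hypothetical helper: options for a bind-payload handler.

    Deriving WfsDelay from the driver's timeout keeps framework and driver
    on the same wait budget, so one cannot drift from the other.
    """
    return {
        "PAYLOAD": "cmd/unix/bind_perl",  # assumed payload path
        "LPORT": lport,
        "WfsDelay": wait_s,       # how long to wait for a session to appear
        "ConnectTimeout": 15,     # budget for the handler's connect to LPORT
    }


opts = bind_handler_options(4444)
print(opts["WfsDelay"], opts["ConnectTimeout"])
```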

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
unconditional, so:
- With BRIDGE set, requires_bridge modules were still being
  dropped; the picker only ever returned samba_usermap_script across
  every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
  failure on HEAD).
- env.pop("BRIDGE") fired even when BRIDGE was the operator's
  explicit setup, breaking modules that need bridge mode (vsftpd
  backdoor on hardcoded port 6200, distccd, etc.).
Both are now conditional on bridge_set, so the picker walks the full
catalog under bridge mode and SLIRP-only modules still get a
clean SLIRP env when BRIDGE is unset.
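
The conditional behaviour can be sketched as below. This is not the fleet.py implementation: the `Module` dataclass, the `episode_env` helper, and the `vsftpd_backdoor` module name are hypothetical, and how `bridge_set` is derived is left as an input.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical catalog entry."""
    name: str
    requires_bridge: bool = False


def usable_modules(catalog: list[Module], bridge_set: bool) -> list[Module]:
    """Full catalog under bridge mode; otherwise only SLIRP-capable modules."""
    if bridge_set:
        return list(catalog)
    return [m for m in catalog if not m.requires_bridge]


def episode_env(base_env: dict[str, str], bridge_set: bool) -> dict[str, str]:
    """Copy the environment, dropping BRIDGE only when bridge mode is off."""
    env = dict(base_env)
    if not bridge_set:
        env.pop("BRIDGE", None)
    return env


catalog = [
    Module("samba_usermap_script"),
    Module("vsftpd_backdoor", requires_bridge=True),  # hypothetical name
]
print([m.name for m in usable_modules(catalog, bridge_set=True)])
print([m.name for m in usable_modules(catalog, bridge_set=False)])
```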

receiver/app.py: half-pregnant v2 schema state in HEAD. app.py was
calling store.ingest_stream(episode_type=..., benign_profile=...) with
kwargs whose matching store.py change was still in the WIP stash.
Removed v2 awareness from app.py so v1 episodes (what the producer
ships today) are accepted again. SCHEMA_VERSION default reset to 1 to
match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
174 lines
6.8 KiB
Python
"""§4.4 collector emit tests: each collector MUST produce >=1 row when
run for a few seconds against a synthesized busy workload. A collector
that fails this is removed from the active set (PIPELINE.md §4.4); no
silent zero-row inclusion.

These tests intentionally invoke the real collector binaries (perf,
tcpdump) against real subprocesses. They will skip on environments
where the binary or capability is unavailable, but they will fail,
not skip, when the binary IS present and the collector still emits
zero rows. The whole point is to catch the "collector silently
disabled" failure mode.
"""

from __future__ import annotations

import json
import os
import shutil
import socket
import subprocess
import threading
import time
from pathlib import Path

import pytest

from collectors import perf_qemu


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _spawn_busy_loop() -> subprocess.Popen:
    """Spawn a CPU-burning child whose PID we can hand to a collector.
    `exec yes` so the captured PID IS the busy process; without exec,
    the captured PID is the wrapping shell that sits parked waiting on
    its child, and perf samples an idle process."""
    return subprocess.Popen(
        ["sh", "-c", "exec yes >/dev/null"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def _run_collector_briefly(target, *, seconds: float, **kw) -> int:
    """Spin a collector run_loop in a thread for `seconds`, then stop it.
    Returns the row count the collector reports."""
    stop = threading.Event()
    result: dict[str, int] = {}

    def _go() -> None:
        result["rows"] = target(stop_event=stop, **kw)

    th = threading.Thread(target=_go, daemon=True)
    th.start()
    time.sleep(seconds)
    stop.set()
    th.join(timeout=seconds + 5.0)
    return result.get("rows", 0)


# ---------------------------------------------------------------------------
# perf
# ---------------------------------------------------------------------------


@pytest.mark.skipif(
    shutil.which("perf") is None,
    reason="perf binary not on PATH; this host can't host the perf collector",
)
def test_perf_emits_rows_against_busy_pid(tmp_path: Path) -> None:
    """The perf collector must emit at least one row when pointed at a
    busy PID for a few seconds. Software events (page-faults,
    context-switches, cpu-clock) are used so the test is portable
    across CPUs that lack hardware performance counters; the production
    DEFAULT_EVENTS adds hardware events on top, which is fine where
    they're available and degrades gracefully where they're not.

    Regression for: perf stat -j writes to stderr by default with -p,
    so reading proc.stdout silently gives 0 lines and 0 rows. Fixed
    by passing --log-fd 1 in the perf invocation.
    """
    busy = _spawn_busy_loop()
    try:
        out = tmp_path / "telemetry-perf.jsonl"
        rows = _run_collector_briefly(
            perf_qemu.run_loop,
            seconds=2.0,
            pid=busy.pid,
            output_path=out,
            t_mono_origin_ns=0,
            interval_ms=200,
            events=("page-faults", "context-switches", "cpu-clock"),
        )
    finally:
        busy.terminate()
        try:
            busy.wait(timeout=2.0)
        except subprocess.TimeoutExpired:
            busy.kill()
            busy.wait(timeout=1.0)

    assert rows >= 1, (
        f"perf collector wrote 0 rows against a busy PID; see "
        f"PIPELINE.md §4.4. File: {out}, exists={out.exists()}, "
        f"size={out.stat().st_size if out.exists() else 'n/a'}"
    )
    # Sanity-check the on-disk file matches what run_loop reported.
    on_disk = out.read_text().splitlines() if out.exists() else []
    assert len(on_disk) == rows, (
        f"row count mismatch: run_loop returned {rows} but "
        f"{len(on_disk)} lines on disk"
    )
    # Spot-check the row shape: one parsed row should have the
    # expected schema.
    sample = json.loads(on_disk[0])
    assert sample["source"] == "host_perf"
    assert sample["available_in_deployment"] is False
    assert "t_mono_ns" in sample and "interval_s" in sample
    # At least one row must have a populated metric; if every metric
    # is None on every row, the parser is dropping values. Regression
    # for: event names come back as "cycles:u" / "instructions:u"
    # under perf_event_paranoid=2 (userspace-only), but `_build_row`
    # looks up plain "cycles" / "instructions", so every metric was
    # silently null even when perf reported real numbers. The mapped
    # fields in the row schema are cycles, instructions, page_faults,
    # context_switches, branches, branch_misses, cache_references,
    # cache_misses; we only need ANY of them populated to confirm the
    # parser is wiring values into the row.
    parsed = [json.loads(line) for line in on_disk]
    metric_keys = ("cycles", "instructions", "page_faults",
                   "context_switches", "branches")
    assert any(r.get(k) is not None for r in parsed for k in metric_keys), (
        f"every metric is None on every row; perf parser is dropping "
        f"values. Sample row: {parsed[0]}"
    )


# ---------------------------------------------------------------------------
# Tier-3 demo wiring regression
# ---------------------------------------------------------------------------


def test_run_tier3_demo_wires_collector_sockets_into_episode_config() -> None:
    """`run_tier3_demo.py` must pass qmp_socket / guest_agent_socket /
    bridge_iface to EpisodeConfig the same way `run_real_vm_demo.py`
    does. Without these, those collectors silently emit zero rows on
    every Tier-3 episode even though launch_target.sh creates the
    underlying chardevs. Regression for: bug found 2026-05-03 against
    elliott-thinkpad + k-gamingcom (rows_qmp=0 / rows_guest=0 / pcap=0
    on 100% of Tier-3 episodes).

    This is a source-grep test rather than an exec test because
    run_tier3_demo.py boots qemu + msfrpcd, neither of which is
    available in CI. The grep keeps the wiring honest with no
    runtime cost."""
    src = (Path(__file__).resolve().parent.parent
           / "tools" / "run_tier3_demo.py").read_text()
    # The exact fragments that, if absent, mean the collectors will
    # silently never start. Each must appear as a keyword arg of the
    # EpisodeConfig(...) constructor call site.
    for needle in (
        "qmp_socket=qmp_sock",
        "guest_agent_socket=agent_sock",
        "bridge_iface=os.environ.get(\"BRIDGE\")",
    ):
        assert needle in src, (
            f"run_tier3_demo.py is missing `{needle}` on its "
            f"EpisodeConfig; see PIPELINE.md §4.4. Tier-3 episodes "
            f"will silently produce 0 rows for the corresponding "
            f"collector."
        )