Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
- perf stat -j writes to stderr by default with -p; we read stdout.
  Add --log-fd 1 so JSON reaches stdout where the parser sees it.
- Event names come back annotated with the privilege scope perf
  actually measured ("cycles:u" under perf_event_paranoid=2). Strip
  the suffix so _build_row's plain-name lookups hit. Without this
  every metric was None even when perf reported real numbers.
- tests/test_collectors_emit.py covers the regression with a real
  busy-loop fixture; emit-test discipline per §4.4.
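
The suffix stripping described above can be sketched as follows. This is a
minimal illustration, not the collector's actual code: the function names,
the set of modifier letters, and the JSON keys ("event", "counter-value")
are assumptions about what perf stat -j emits and should be checked against
the perf version in use.

```python
from __future__ import annotations

import json
import re

# Assumed set of perf event modifier letters (u = user, k = kernel, ...).
# A trailing ":<letters>" made up only of these is treated as a modifier.
_MODIFIER_RE = re.compile(r":[ukhHGIpPSDWeb]+$")


def strip_event_modifier(event: str) -> str:
    """Return the event name without a trailing modifier such as ':u'."""
    return _MODIFIER_RE.sub("", event)


def parse_perf_json_lines(text: str) -> dict[str, float | None]:
    """Map plain event names to counter values from JSON-lines output.

    Lines that are not JSON objects are skipped; counters that perf could
    not measure (non-numeric values) are recorded as None.
    """
    counters: dict[str, float | None] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        event = rec.get("event")
        if not isinstance(event, str):
            continue
        try:
            value = float(rec.get("counter-value"))
        except (TypeError, ValueError):
            value = None
        counters[strip_event_modifier(event)] = value
    return counters
```

A tracepoint-style name such as "sched:sched_switch" is left alone because
its suffix contains letters outside the assumed modifier set; a tracepoint
whose name consists only of modifier letters would be mis-stripped.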

guest-agent collector (rows_guest=0 on 100% of episodes):
- Alpine cloud image doesn't ship python3, so the in-guest agent's
  `#!/usr/bin/env python3` shebang silently fails. Add packages:
  [python3] to cidata user-data so cloud-init installs it before
  the OpenRC service starts.
- Guest agent now exits nonzero (was: silent stdout fallback) when
  /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
  reports the failure to /var/log/cis490-agent.log instead of the
  bytes vanishing into the void. Refs §1.
- Host-side collector emits guest_agent_connected /
  guest_agent_first_byte / guest_agent_silent_window into the
  orchestrator's events.jsonl. Future episodes show the in-guest
  failure mode per-episode instead of inferring from rows_guest=0.
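
One way the three host-side events could be derived is sketched below. The
class name, the silence threshold, and the event payload fields are all
hypothetical; only the three event names come from this change.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GuestAgentWatch:
    """Track connect / first-byte / silence for one guest-agent socket."""

    silent_after_s: float = 5.0          # hypothetical silence threshold
    connected_at: float | None = None
    last_byte_at: float | None = None
    silent_reported: bool = False
    events: list[dict] = field(default_factory=list)

    def _emit(self, name: str, now: float, **extra: float) -> None:
        self.events.append({"event": name, "t": now, **extra})

    def on_connect(self, now: float) -> None:
        self.connected_at = now
        self._emit("guest_agent_connected", now)

    def on_bytes(self, now: float) -> None:
        if self.last_byte_at is None:
            self._emit("guest_agent_first_byte", now)
        self.last_byte_at = now
        self.silent_reported = False     # re-arm the silence detector

    def on_tick(self, now: float) -> None:
        # Measure silence from the last byte, or from connect if no byte
        # has arrived yet. Report each silent stretch only once.
        ref = self.last_byte_at if self.last_byte_at is not None else self.connected_at
        if ref is None or self.silent_reported:
            return
        if now - ref >= self.silent_after_s:
            self._emit("guest_agent_silent_window", now, silent_s=now - ref)
            self.silent_reported = True
```

Each dict in `events` would then be written as one line of events.jsonl, so
an episode with a connected but mute agent shows a connect event followed
by a silent window and no first byte.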

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
Tier-3 episodes; was misclassified as host divergence):
- tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
  qmp_socket / guest_agent_socket / bridge_iface, even though
  launch_target.sh creates the underlying chardevs and BRIDGE
  supplies the iface. tools/run_real_vm_demo.py wires them
  correctly; Tier-3 had a copy-paste gap.
- tests/test_collectors_emit.py adds a source-grep regression so
  the wiring stays honest.

samba_usermap_script never lands session (0/67 in §3 probe):
- Bind handler default WfsDelay (~5s) gives up before bind_perl on
  Metasploitable2 has finished forking + binding LPORT under
  SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
  exploits/driver.py so framework + driver agree on the wait
  budget. Add ConnectTimeout=15 so the handler's bind connect has
  retry budget instead of one-shot.

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
unconditional, so:
- With BRIDGE set, requires_bridge modules were still being
  dropped; the picker only ever returned samba_usermap_script across
  every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
  failure on HEAD).
- env.pop("BRIDGE") fired even when BRIDGE was the operator's
  explicit setup, breaking modules that need bridge mode (vsftpd
  backdoor on hardcoded port 6200, distccd, etc.).
Both made conditional on bridge_set so the picker walks the full
catalog under bridge mode and SLIRP-only modules still get a
clean SLIRP env when BRIDGE is unset.
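
The conditional behaviour can be sketched roughly as below. This is an
assumption-laden illustration rather than the code in orchestrator/fleet.py:
the function signatures, the dict-shaped catalog entries, the module name
"vsftpd_backdoor", and the rule that an empty BRIDGE counts as unset are
all hypothetical.

```python
from __future__ import annotations


def bridge_is_set(env: dict[str, str]) -> bool:
    """Assumed rule: BRIDGE counts as set only if it is non-blank."""
    return bool(env.get("BRIDGE", "").strip())


def usable_modules(catalog: list[dict], bridge_set: bool) -> list[dict]:
    """Drop requires_bridge modules only when bridge mode is NOT active."""
    return [
        m for m in catalog
        if bridge_set or not m.get("requires_bridge", False)
    ]


def episode_env(base_env: dict[str, str], bridge_set: bool) -> dict[str, str]:
    """Copy the environment; scrub BRIDGE only for SLIRP-mode episodes."""
    env = dict(base_env)
    if not bridge_set:
        env.pop("BRIDGE", None)
    return env


# Hypothetical two-entry catalog for illustration.
catalog = [
    {"name": "samba_usermap_script"},
    {"name": "vsftpd_backdoor", "requires_bridge": True},
]
operator_env = {"BRIDGE": "br0", "PATH": "/usr/bin"}
flag = bridge_is_set(operator_env)
picked = [m["name"] for m in usable_modules(catalog, flag)]
child_env = episode_env(operator_env, flag)
```

With BRIDGE set, `picked` contains both modules and `child_env` still
carries BRIDGE; with it unset, only the SLIRP-capable module remains.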

receiver/app.py: HEAD was left in a half-migrated v2 schema state.
app.py called store.ingest_stream(episode_type=..., benign_profile=...)
with kwargs whose matching store.py change was still in the WIP
stash. Removed v2 awareness from app.py so v1 episodes (what the
producer ships today) get accepted again. SCHEMA_VERSION default
reset to 1 to match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-migrated v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
286 lines
9.3 KiB
Python
#!/usr/bin/env python3
"""In-guest telemetry agent: runs INSIDE the VM.

Writes one JSON-lines row per tick to a virtio-serial port that the
host has wired up as ``cis490.guest.agent``. The host-side collector
(`collectors.guest_agent`) reads these rows and stamps them with the
host's monotonic clock before persisting to ``telemetry-guest.jsonl``.

Stdlib only: no `psutil`, no extra deps to bake into the guest. Every
field is read from /proc on the guest, so this works on busybox-based
Alpine, on Cirros, and on Metasploitable2 unchanged.

Wire path inside the guest:
    /dev/virtio-ports/cis490.guest.agent

The host side opens the matching unix socket on the hypervisor.
The protocol is intentionally trivial: the agent emits
newline-delimited JSON; the host emits nothing back. One direction.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import argparse
import json
import os
import platform
import sys
import time
from typing import Any


SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True
DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
DEFAULT_INTERVAL_MS = 100   # 10 Hz
DEFAULT_TOP_N = 8


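# ---------- /proc parsers ---------------------------------------------------


def _read(path: str) -> str | None:
    """Return the file's contents, or None if it cannot be read."""
    try:
        with open(path, "rb") as f:
            return f.read().decode("ascii", errors="replace")
    except OSError:
        # Covers FileNotFoundError and PermissionError, and also
        # ProcessLookupError, which /proc/<pid>/* can raise when the
        # process exits between open() and read().
        return None

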
def read_loadavg() -> tuple[float, float, float] | None:
    text = _read("/proc/loadavg")
    if text is None:
        return None
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def read_meminfo() -> dict[str, int]:
    text = _read("/proc/meminfo")
    out: dict[str, int] = {}
    if text is None:
        return out
    for line in text.splitlines():
        k, _, rest = line.partition(":")
        v = rest.strip()
        if v.endswith(" kB"):
            try:
                out[k] = int(v[:-3]) * 1024
            except ValueError:
                pass
    return out


def read_cpu_total() -> dict[str, int] | None:
    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
    in jiffies since boot."""
    text = _read("/proc/stat")
    if text is None:
        return None
    line = text.splitlines()[0]
    fields = line.split()
    # cpu  user nice system idle iowait irq softirq steal guest guest_nice
    if not fields or fields[0] != "cpu":
        return None
    nums = [int(x) for x in fields[1:]]
    pad = nums + [0] * max(0, 10 - len(nums))
    return {
        "user": pad[0],
        "nice": pad[1],
        "system": pad[2],
        "idle": pad[3],
        "iowait": pad[4],
        "irq": pad[5],
        "softirq": pad[6],
        "steal": pad[7],
        "guest": pad[8],
        "guest_nice": pad[9],
    }


def read_thermal_milli_c() -> int | None:
    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
    text = _read("/sys/class/thermal/thermal_zone0/temp")
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_net_devs() -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
    text = _read("/proc/net/dev")
    out: dict[str, dict[str, int]] = {}
    if text is None:
        return out
    lines = text.splitlines()
    for line in lines[2:]:   # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if name == "lo":
            continue
        cols = rest.split()
        if len(cols) < 16:
            continue
        out[name] = {
            "rx_bytes": int(cols[0]),
            "rx_pkts": int(cols[1]),
            "tx_bytes": int(cols[8]),
            "tx_pkts": int(cols[9]),
        }
    return out


def read_listen_ports() -> list[int]:
    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
    out: set[int] = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        text = _read(path)
        if not text:
            continue
        for line in text.splitlines()[1:]:
            cols = line.split()
            if len(cols) < 4:
                continue
            if cols[3] != "0A":
                continue
            local = cols[1]   # "ADDR:PORT" with PORT in hex
            _, _, port_hex = local.rpartition(":")
            try:
                out.add(int(port_hex, 16))
            except ValueError:
                pass
    return sorted(out)


def read_top_procs(top_n: int) -> list[dict[str, Any]]:
    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
    procs: list[dict[str, Any]] = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return procs
    for ent in entries:
        if not ent.isdigit():
            continue
        pid = int(ent)
        stat = _read(f"/proc/{pid}/stat")
        if stat is None:
            continue
        try:
            # comm may itself contain ")" or spaces, so split on the
            # last ")" and index the remaining fields from there.
            rparen = stat.rindex(")")
            comm = stat[stat.index("(") + 1 : rparen]
            fields = stat[rparen + 2:].split()
            utime = int(fields[11])
            stime = int(fields[12])
            rss_pages = int(fields[21])
        except (ValueError, IndexError):
            continue
        procs.append({
            "pid": pid,
            "comm": comm[:32],
            "cpu_jiffies": utime + stime,
            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
        })
    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
    return procs[:top_n]


# ---------- one tick --------------------------------------------------------


def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
    mem = read_meminfo()
    cpu = read_cpu_total()
    load = read_loadavg()
    return {
        "t_guest_mono_ns": time.monotonic_ns(),
        "t_guest_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "kernel": platform.release(),
        "cpu_total_jiffies": cpu,
        "load_1m_5m_15m": list(load) if load else None,
        "mem_total_bytes": (mem.get("MemTotal") or 0),
        "mem_available_bytes": (mem.get("MemAvailable") or 0),
        "mem_buffers_bytes": (mem.get("Buffers") or 0),
        "mem_cached_bytes": (mem.get("Cached") or 0),
        "swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
        "thermal_milli_c": read_thermal_milli_c(),
        "net": read_net_devs(),
        "listen_ports": read_listen_ports(),
        "top_procs": read_top_procs(top_n),
    }


# ---------- main loop -------------------------------------------------------


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-guest-agent")
    p.add_argument("--port", default=DEFAULT_PORT,
                   help="virtio-serial port path inside the guest")
    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
    p.add_argument("--once", action="store_true",
                   help="emit a single row and exit (for smoke tests)")
    args = p.parse_args(argv)

    if args.once:
        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
        sys.stdout.flush()
        return 0

    # Open the virtio-serial port. The host wires this up via QEMU's
    # virtserialport device; if it's missing, either virtio_console
    # isn't loaded in the guest kernel, the device wasn't included on
    # the QEMU command line, or udev hasn't created the symlink yet.
    # Exit loudly so OpenRC re-runs us (per service config) and so
    # the failure is visible in /var/log/cis490-agent.log instead of
    # being absorbed by a silent stdout fallback. Refs PIPELINE.md
    # §1: a host that can't meet the bar must say so loudly, not
    # silently downgrade to a half-running state.
    if not os.path.exists(args.port):
        sys.stderr.write(
            f"[cis490-agent] FATAL: virtio-serial port {args.port} not "
            "present. Check (a) virtio_console kernel module is loaded "
            "inside the guest, (b) the QEMU command line includes "
            "-device virtserialport,name=cis490.guest.agent, (c) udev "
            "is creating /dev/virtio-ports/* symlinks. Exiting nonzero "
            "so this failure is observable rather than silently lost.\n"
        )
        return 2
    out_fp = open(args.port, "wb", buffering=0)

    interval_ns = args.interval_ms * 1_000_000
    next_tick = time.monotonic_ns()
    try:
        while True:
            row = collect_once(args.top_n)
            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
            try:
                out_fp.flush()
            except (AttributeError, OSError):
                pass
            # Fixed-rate schedule: advance the deadline by one interval;
            # if we have fallen behind, resynchronise to "now" instead
            # of bursting to catch up.
            next_tick += interval_ns
            sleep_ns = next_tick - time.monotonic_ns()
            if sleep_ns > 0:
                time.sleep(sleep_ns / 1_000_000_000)
            else:
                next_tick = time.monotonic_ns()
    except KeyboardInterrupt:
        return 0
    except OSError as e:   # includes BrokenPipeError
        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
        return 1


if __name__ == "__main__":
    sys.exit(main())
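
Each row carries cumulative `cpu_total_jiffies` rather than a percentage, so
a consumer has to difference two rows to get utilisation. The helper below
is a minimal sketch of that step; the function name is hypothetical and it
is not part of the agent. It leaves `guest` and `guest_nice` out of the
total because the kernel already counts them inside `user` and `nice`.

```python
from __future__ import annotations

# Fields that partition CPU time; guest/guest_nice are excluded because
# they are already included in user/nice.
_COUNTED = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")
_IDLE = ("idle", "iowait")


def cpu_busy_fraction(prev: dict[str, int], cur: dict[str, int]) -> float | None:
    """Fraction of CPU time spent non-idle between two agent rows.

    Both arguments are the "cpu_total_jiffies" dicts from consecutive
    rows. Returns None when no jiffies elapsed between the samples.
    """
    total = sum(cur[k] - prev[k] for k in _COUNTED)
    idle = sum(cur[k] - prev[k] for k in _IDLE)
    if total <= 0:
        return None
    return (total - idle) / total
```

Rows whose `cpu_total_jiffies` is None (the agent could not read
/proc/stat) need to be filtered out before calling this.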