CIS490/vm/guest-agent/cis490_agent.py
Max Gorog 4ab5477226 PIPELINE §5 step 1: fix four root-cause defects
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.

perf collector (rows_perf=0 on 100% of episodes):
  - perf stat -j writes to stderr by default with -p; we read stdout.
    Add --log-fd 1 so JSON reaches stdout where the parser sees it.
  - Event names come back annotated with the privilege scope perf
    actually measured ("cycles:u" under perf_event_paranoid=2). Strip
    the suffix so _build_row's plain-name lookups hit. Without this
    every metric was None even when perf reported real numbers.
  - tests/test_collectors_emit.py covers the regression with a real
    busy-loop fixture; emit-test discipline per §4.4.
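
The suffix stripping described above could look roughly like this. It is a minimal sketch, not the collector's actual code: `strip_scope_suffix` and `parse_perf_line` are hypothetical names, only the `u`, `k`, and `h` modifiers are handled, and the JSON field names (`event`, `counter-value`) are assumed from typical `perf stat -j` output.

```python
from __future__ import annotations

import json
import re

# Hypothetical helper: drop a trailing privilege-scope modifier such as
# ":u" or ":k" from an event name. Handling only u/k/h is an assumption
# of this sketch, not something the commit specifies.
_SCOPE_SUFFIX = re.compile(r":[ukh]+$")


def strip_scope_suffix(event: str) -> str:
    """Map "cycles:u" to "cycles"; leave plain names untouched."""
    return _SCOPE_SUFFIX.sub("", event)


def parse_perf_line(line: str) -> tuple[str, float] | None:
    """Parse one JSON line into (plain_event_name, value), or None."""
    try:
        obj = json.loads(line)
        return strip_scope_suffix(obj["event"]), float(obj["counter-value"])
    except (ValueError, KeyError, TypeError):
        # Non-JSON noise, missing keys, or a non-numeric counter value.
        return None
```

Anchoring the pattern at the end of the string leaves tracepoint-style names such as `sched:sched_switch` unchanged, because the part after the colon is not made up solely of modifier letters.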

guest-agent collector (rows_guest=0 on 100% of episodes):
  - Alpine cloud image doesn't ship python3, so the in-guest agent's
    `#!/usr/bin/env python3` shebang silently fails. Add packages:
    [python3] to cidata user-data so cloud-init installs it before
    the OpenRC service starts.
  - Guest agent now exits nonzero (was: silent stdout fallback) when
    /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
    reports the failure to /var/log/cis490-agent.log instead of the
    bytes vanishing into the void. Refs §1.
  - Host-side collector emits guest_agent_connected /
    guest_agent_first_byte / guest_agent_silent_window into the
    orchestrator's events.jsonl. Future episodes show the in-guest
    failure mode per-episode instead of inferring from rows_guest=0.
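
One way the three host-side events could be derived is sketched below. Only the event names come from this commit; the class name `GuestAgentWatch`, the `silent_after_s` threshold, and the event dict layout are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GuestAgentWatch:
    """Tracks socket liveness and returns event dicts for events.jsonl."""
    silent_after_s: float = 5.0          # assumed threshold, not from the source
    connected_at: float | None = None
    last_byte_at: float | None = None
    silent_reported: bool = False

    def on_connect(self, now: float) -> dict:
        self.connected_at = now
        return {"event": "guest_agent_connected", "t": now}

    def on_bytes(self, now: float) -> dict | None:
        first = self.last_byte_at is None
        self.last_byte_at = now
        self.silent_reported = False     # data arrived, so re-arm the alarm
        return {"event": "guest_agent_first_byte", "t": now} if first else None

    def on_tick(self, now: float) -> dict | None:
        # Measure silence from the last byte, or from connect if none yet.
        ref = self.last_byte_at if self.last_byte_at is not None else self.connected_at
        if ref is None or self.silent_reported:
            return None
        if now - ref >= self.silent_after_s:
            self.silent_reported = True  # report each silent window once
            return {"event": "guest_agent_silent_window", "t": now,
                    "silent_s": now - ref}
        return None
```

Passing `now` in explicitly keeps the logic testable without a real socket or clock.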

k-gamingcom missing qmp/netflow/pcap (also affected elliott on
  Tier-3 episodes — was misclassified as host divergence):
  - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
    qmp_socket / guest_agent_socket / bridge_iface — even though
    launch_target.sh creates the underlying chardevs and BRIDGE
    supplies the iface. tools/run_real_vm_demo.py wires them
    correctly; Tier-3 had a copy-paste gap.
  - tests/test_collectors_emit.py adds a source-grep regression so
    the wiring stays honest.
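
A source-grep regression of this kind might be written as follows. This is a sketch under assumptions: `missing_wiring` is a hypothetical helper and the real test in tests/test_collectors_emit.py may be structured differently. Only the three kwarg names come from this commit.

```python
from __future__ import annotations

import re

# Kwargs that the Tier-3 demo must pass to EpisodeConfig (per the commit).
REQUIRED_KWARGS = ("qmp_socket", "guest_agent_socket", "bridge_iface")


def missing_wiring(source_text: str) -> list[str]:
    """Return the required kwargs that never appear as `name=` in the source."""
    return [
        name for name in REQUIRED_KWARGS
        if not re.search(rf"\b{re.escape(name)}\s*=", source_text)
    ]
```

A test could read the text of tools/run_tier3_demo.py and assert that `missing_wiring(...)` returns an empty list. A grep is deliberately crude: it cannot prove the values are correct, only that the wiring has not been dropped again.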

samba_usermap_script never lands session (0/67 in §3 probe):
  - Bind handler default WfsDelay (~5s) gives up before bind_perl on
    Metasploitable2 has finished forking + binding LPORT under
    SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
    exploits/driver.py so framework + driver agree on the wait
    budget. Add ConnectTimeout=15 so the handler's bind connect has
    retry budget instead of one-shot.
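
To keep the framework and driver wait budgets from drifting apart again, the handler options could be derived from the driver's timeout instead of being hard-coded twice. The sketch below assumes a hypothetical `handler_options` helper; the option names and values are the ones given above.

```python
from __future__ import annotations


def handler_options(session_open_timeout_s: int = 30,
                    connect_timeout_s: int = 15) -> dict[str, int]:
    """Build bind-handler options so WfsDelay tracks the driver timeout."""
    return {
        "WfsDelay": session_open_timeout_s,   # framework waits as long as the driver
        "ConnectTimeout": connect_timeout_s,  # retry budget for the bind connect
    }


def as_set_commands(options: dict[str, int]) -> list[str]:
    """Render the options as console-style `set NAME VALUE` lines."""
    return [f"set {name} {value}" for name, value in options.items()]
```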

orchestrator/fleet.py: usable_modules + BRIDGE handling were both
  unconditional, so:
  - With BRIDGE set, requires_bridge modules were still being
    dropped — picker only ever returned samba_usermap_script across
    every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
    failure on HEAD).
  - env.pop("BRIDGE") fired even when BRIDGE was the operator's
    explicit setup, breaking modules that need bridge mode (vsftpd
    backdoor on hardcoded port 6200, distccd, etc.).
  Both made conditional on bridge_set so the picker walks the full
  catalog under bridge mode and SLIRP-only modules still get a
  clean SLIRP env when BRIDGE is unset.
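
The conditional behaviour can be sketched like this. The `Module` dataclass, the `child_env` helper, and the module name `vsftpd_backdoor` are hypothetical stand-ins; `usable_modules`, `requires_bridge`, and `bridge_set` are the names used above, though the real signatures in orchestrator/fleet.py may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    requires_bridge: bool = False


def usable_modules(catalog: list[Module], bridge_set: bool) -> list[Module]:
    """Full catalog under bridge mode; SLIRP-compatible modules otherwise."""
    if bridge_set:
        return list(catalog)
    return [m for m in catalog if not m.requires_bridge]


def child_env(env: dict[str, str], bridge_set: bool) -> dict[str, str]:
    """Copy env, dropping BRIDGE only when the operator did not set it."""
    out = dict(env)
    if not bridge_set:
        out.pop("BRIDGE", None)
    return out
```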

receiver/app.py: HEAD was in a half-pregnant v2 schema state: app.py
  called store.ingest_stream(episode_type=..., benign_profile=...)
  with kwargs whose matching store.py change was still in the WIP
  stash. Removed v2 awareness from app.py so v1 episodes (what the
  producer ships today) get accepted again. SCHEMA_VERSION default
  reset to 1 to match.

229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:05:25 -05:00


#!/usr/bin/env python3
"""In-guest telemetry agent: runs INSIDE the VM.

Writes one JSON-lines row per tick to a virtio-serial port that the
host has wired up as ``cis490.guest.agent``. The host-side collector
(`collectors.guest_agent`) reads these rows and stamps them with the
host's monotonic clock before persisting to ``telemetry-guest.jsonl``.

Stdlib only: no `psutil`, no extra deps to bake into the guest. Every
metric is read from /proc or /sys on the guest, so this works on
busybox-based Alpine, on Cirros, and on Metasploitable2 unchanged.

Wire path inside the guest:

    /dev/virtio-ports/cis490.guest.agent

The host side opens the matching unix socket on the hypervisor.
The protocol is intentionally trivial: the agent emits
newline-delimited JSON; the host emits nothing back. One direction.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""
from __future__ import annotations

import argparse
import json
import os
import platform
import sys
import time
from typing import Any

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True
DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
DEFAULT_INTERVAL_MS = 100  # 10 Hz
DEFAULT_TOP_N = 8


# ---------- /proc parsers ---------------------------------------------------

def _read(path: str) -> str | None:
    """Return the file's text, or None if it cannot be read."""
    try:
        with open(path, "rb") as f:
            return f.read().decode("ascii", errors="replace")
    except OSError:
        # Covers FileNotFoundError and PermissionError, plus
        # ProcessLookupError when a /proc/<pid> entry vanishes between
        # open() and read().
        return None


def read_loadavg() -> tuple[float, float, float] | None:
    text = _read("/proc/loadavg")
    if text is None:
        return None
    parts = text.split()
    try:
        return float(parts[0]), float(parts[1]), float(parts[2])
    except (IndexError, ValueError):
        # Truncated or malformed file: report "unknown" rather than crash.
        return None


def read_meminfo() -> dict[str, int]:
    text = _read("/proc/meminfo")
    out: dict[str, int] = {}
    if text is None:
        return out
    for line in text.splitlines():
        k, _, rest = line.partition(":")
        v = rest.strip()
        if v.endswith(" kB"):
            try:
                out[k] = int(v[:-3]) * 1024
            except ValueError:
                pass
    return out


def read_cpu_total() -> dict[str, int] | None:
    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
    in jiffies since boot."""
    text = _read("/proc/stat")
    if not text:
        # Missing or empty file; splitlines()[0] would raise on "".
        return None
    fields = text.splitlines()[0].split()
    # cpu user nice system idle iowait irq softirq steal guest guest_nice
    if not fields or fields[0] != "cpu":
        return None
    try:
        nums = [int(x) for x in fields[1:]]
    except ValueError:
        return None
    # Older kernels report fewer columns; pad the missing ones with 0.
    pad = nums + [0] * max(0, 10 - len(nums))
    return {
        "user": pad[0],
        "nice": pad[1],
        "system": pad[2],
        "idle": pad[3],
        "iowait": pad[4],
        "irq": pad[5],
        "softirq": pad[6],
        "steal": pad[7],
        "guest": pad[8],
        "guest_nice": pad[9],
    }


def read_thermal_milli_c() -> int | None:
    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
    text = _read("/sys/class/thermal/thermal_zone0/temp")
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_net_devs() -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
    text = _read("/proc/net/dev")
    out: dict[str, dict[str, int]] = {}
    if text is None:
        return out
    lines = text.splitlines()
    for line in lines[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if name == "lo":
            continue
        cols = rest.split()
        if len(cols) < 16:
            continue
        # Columns 0-7 are receive counters, 8-15 are transmit counters.
        out[name] = {
            "rx_bytes": int(cols[0]),
            "rx_pkts": int(cols[1]),
            "tx_bytes": int(cols[8]),
            "tx_pkts": int(cols[9]),
        }
    return out


def read_listen_ports() -> list[int]:
    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
    out: set[int] = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        text = _read(path)
        if not text:
            continue
        for line in text.splitlines()[1:]:
            cols = line.split()
            if len(cols) < 4:
                continue
            if cols[3] != "0A":
                continue
            local = cols[1]  # "ADDR:PORT" with PORT in hex
            _, _, port_hex = local.rpartition(":")
            try:
                out.add(int(port_hex, 16))
            except ValueError:
                pass
    return sorted(out)


def read_top_procs(top_n: int) -> list[dict[str, Any]]:
    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
    procs: list[dict[str, Any]] = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return procs
    page_size = os.sysconf("SC_PAGESIZE")  # constant; look it up once
    for ent in entries:
        if not ent.isdigit():
            continue
        pid = int(ent)
        stat = _read(f"/proc/{pid}/stat")
        if stat is None:
            continue
        try:
            # comm may itself contain spaces or parentheses, so split
            # on the LAST ")" and index the remaining fields from there.
            rparen = stat.rindex(")")
            comm = stat[stat.index("(") + 1 : rparen]
            fields = stat[rparen + 2:].split()
            utime = int(fields[11])
            stime = int(fields[12])
            rss_pages = int(fields[21])
        except (ValueError, IndexError):
            continue
        procs.append({
            "pid": pid,
            "comm": comm[:32],
            "cpu_jiffies": utime + stime,
            "rss_bytes": rss_pages * page_size,
        })
    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
    return procs[:top_n]


# ---------- one tick --------------------------------------------------------

def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
    mem = read_meminfo()
    cpu = read_cpu_total()
    load = read_loadavg()
    return {
        "t_guest_mono_ns": time.monotonic_ns(),
        "t_guest_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "kernel": platform.release(),
        "cpu_total_jiffies": cpu,
        "load_1m_5m_15m": list(load) if load else None,
        "mem_total_bytes": (mem.get("MemTotal") or 0),
        "mem_available_bytes": (mem.get("MemAvailable") or 0),
        "mem_buffers_bytes": (mem.get("Buffers") or 0),
        "mem_cached_bytes": (mem.get("Cached") or 0),
        "swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
        "thermal_milli_c": read_thermal_milli_c(),
        "net": read_net_devs(),
        "listen_ports": read_listen_ports(),
        "top_procs": read_top_procs(top_n),
    }


# ---------- main loop -------------------------------------------------------

def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-guest-agent")
    p.add_argument("--port", default=DEFAULT_PORT,
                   help="virtio-serial port path inside the guest")
    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
    p.add_argument("--once", action="store_true",
                   help="emit a single row and exit (for smoke tests)")
    args = p.parse_args(argv)

    if args.once:
        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
        sys.stdout.flush()
        return 0

    # Open the virtio-serial port. The host wires this up via QEMU's
    # virtserialport device; if it's missing, either virtio_console
    # isn't loaded in the guest kernel, the device wasn't included on
    # the QEMU command line, or udev hasn't created the symlink yet.
    # Exit loudly so OpenRC re-runs us (per service config) and so
    # the failure is visible in /var/log/cis490-agent.log instead of
    # being absorbed by a silent stdout fallback. Refs PIPELINE.md
    # §1: a host that can't meet the bar must say so loudly, not
    # silently downgrade to a half-running state.
    if not os.path.exists(args.port):
        sys.stderr.write(
            f"[cis490-agent] FATAL: virtio-serial port {args.port} not "
            f"present. Check (a) virtio_console kernel module is loaded "
            f"inside the guest, (b) the QEMU command line includes "
            f"-device virtserialport,name=cis490.guest.agent, (c) udev "
            f"is creating /dev/virtio-ports/* symlinks. Exiting nonzero "
            f"so this failure is observable rather than silently lost.\n"
        )
        return 2

    out_fp = open(args.port, "wb", buffering=0)
    interval_ns = args.interval_ms * 1_000_000
    next_tick = time.monotonic_ns()
    try:
        while True:
            row = collect_once(args.top_n)
            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
            try:
                out_fp.flush()
            except (AttributeError, OSError):
                pass
            # Schedule against absolute deadlines so per-tick work does
            # not accumulate as drift; if we overran, resync to now.
            next_tick += interval_ns
            sleep_ns = next_tick - time.monotonic_ns()
            if sleep_ns > 0:
                time.sleep(sleep_ns / 1_000_000_000)
            else:
                next_tick = time.monotonic_ns()
    except KeyboardInterrupt:
        return 0
    except (BrokenPipeError, OSError) as e:
        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
        return 1


if __name__ == "__main__":
    sys.exit(main())
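
The module docstring says the host-side collector reads these rows and stamps them with the host's monotonic clock. A minimal sketch of that consumer is shown below. It is not the actual `collectors.guest_agent` code: the function name `stamp_rows` and the field name `t_host_mono_ns` are assumptions made for illustration.

```python
from __future__ import annotations

import json
import time
from typing import Any, Callable, Iterable, Iterator


def stamp_rows(lines: Iterable[bytes],
               clock: Callable[[], int] = time.monotonic_ns,
               ) -> Iterator[dict[str, Any]]:
    """Decode newline-delimited JSON rows and add a host-side timestamp."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            row = json.loads(raw)
        except ValueError:
            # Partial or corrupt line (JSON or UTF-8 decode error): skip it.
            continue
        if not isinstance(row, dict):
            continue
        row["t_host_mono_ns"] = clock()  # hypothetical field name
        yield row
```

Because the protocol is one-directional, the reader only needs an iterable of byte lines, for example a socket wrapped with `makefile("rb")`. Injecting `clock` makes the stamping deterministic in tests.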