CIS490/vm/guest-agent/cis490_agent.py
max 1b6c7b2f4a Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts
This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py          — source 2 (oracle). Tiny synchronous QMP
                               client + row builder + run loop. Tolerates
                               older qemu without query-stats.
  collectors/guest_agent.py  — source 5 (deployable). Reads the
                               virtio-serial host-side socket, parses
                               agent JSON-lines, re-stamps to the host
                               monotonic clock, persists.
  collectors/pcap.py         — source 4 (deployable). tcpdump capture
                               + pure-Python pcap reader + 100 ms
                               netflow.jsonl bucketizer. Decodes
                               Ethernet/IPv4/TCP/UDP enough for the
                               schema in docs/data-model.md.
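
The bucketizer itself is not shown in this view. As a rough illustration
of fixed-window bucketing only, assuming a hypothetical (t_ns, proto,
length) tuple per decoded packet and hypothetical output field names (the
real schema is in docs/data-model.md and may differ):

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms window, as stated in the commit message


def bucketize(packets, bucket_ns=BUCKET_NS):
    """Aggregate packets into fixed windows keyed by (window start, proto).

    `packets` is an iterable of (t_ns, proto, length) tuples. That tuple
    shape is an assumption made for this sketch.
    """
    buckets = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for t_ns, proto, length in packets:
        # Floor the timestamp to the start of its window.
        key = (t_ns // bucket_ns * bucket_ns, proto)
        buckets[key]["pkts"] += 1
        buckets[key]["bytes"] += length
    return [
        {"t_bucket_ns": t, "proto": proto, **counts}
        for (t, proto), counts in sorted(buckets.items())
    ]
```

Flooring to the window start keeps bucket boundaries identical across
episodes, so rows from different runs line up on the same 100 ms grid.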

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.
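
For orientation, the host-side handling of the agent stream (parse
JSON-lines, skip malformed lines, stamp with the host monotonic clock)
might look roughly like the sketch below. The function name restamp_lines
and the field t_host_mono_ns are assumptions for illustration, not names
taken from collectors/guest_agent.py.

```python
import json
import time


def restamp_lines(lines, clock=time.monotonic_ns):
    """Yield agent rows with a host timestamp added; skip malformed lines.

    `lines` is any iterable of bytes or str, one JSON document per item.
    `clock` is injectable so the behaviour can be tested deterministically.
    """
    for raw in lines:
        try:
            row = json.loads(raw)
        except ValueError:
            # JSONDecodeError and UnicodeDecodeError are both ValueErrors.
            continue
        if not isinstance(row, dict):
            continue
        row["t_host_mono_ns"] = clock()  # hypothetical field name
        yield row
```

Stamping on arrival means rows from every source share one clock, at the
cost of including the virtio-serial transit delay in the timestamp.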

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.
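
The capacity math is not visible in this view. One plausible shape for it,
sketched under loud assumptions: the field names, the 512 MiB per-VM
budget, and the one-core reserve below are all invented for illustration
and are not taken from orchestrator/fleet.py.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FleetCapacity:
    # Field names are assumptions; only the class name is from the commit.
    cores: int
    mem_available_bytes: int
    load_1m: float
    max_concurrent: int


def estimate_capacity(cores, mem_available_bytes, load_1m,
                      per_vm_bytes=512 * 1024**2, reserve_cores=1):
    """Take the tightest of three limits: CPU, RAM, and load headroom."""
    by_cpu = cores - reserve_cores
    by_mem = mem_available_bytes // per_vm_bytes
    by_load = math.floor(cores - load_1m)
    slots = max(0, min(by_cpu, by_mem, by_load))
    return FleetCapacity(cores, mem_available_bytes, load_1m, slots)


def to_meta(cap):
    """Serialize for meta.json; FleetCapacity(**json.loads(s)) restores it."""
    return json.dumps(asdict(cap))
```

Taking the minimum of independent limits keeps the detector conservative:
whichever resource is scarcest decides how many VMs run.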

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.
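
A coordinator-free selection of this kind can be sketched with a per-host
hash ordering. This is only an illustration of the idea; the actual
select() in samples/manifest.py may work differently, and the `names`
parameter is an assumption.

```python
import hashlib


def select(names, host_id, slot, episode_index):
    """Pick a sample name deterministically from (host_id, slot, episode).

    Each host sorts the catalog by a hash salted with its own id, so
    different hosts generally walk it in different orders with no shared
    state. Within one episode, consecutive slots get consecutive entries,
    so slots differ as long as there are at least as many names as slots.
    """
    if not names:
        raise ValueError("empty catalog")
    order = sorted(
        names,
        key=lambda n: hashlib.sha256(f"{host_id}:{n}".encode("utf-8")).hexdigest(),
    )
    return order[(episode_index + slot) % len(order)]
```

hashlib is used rather than the built-in hash() because the latter is
randomized per process for strings and would not be reproducible.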

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py       — fake QMP server, capability handshake,
                            blockstats, async-event interleaving,
                            5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
                              re-stamp, malformed-line tolerance
  tests/test_pcap.py      — synthetic pcap with TCP/UDP/ARP frames,
                            bucketize correctness across windows
  tests/test_fleet.py     — capacity math (8-core idle / low-RAM /
                            high-load / Pi5 / 1-core box), manifest
                            selection determinism + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:02:27 -05:00


#!/usr/bin/env python3
"""In-guest telemetry agent: runs INSIDE the VM.

Writes one JSON-lines row per tick to a virtio-serial port that the
host has wired up as ``cis490.guest.agent``. The host-side collector
(`collectors.guest_agent`) reads these rows and stamps them with the
host's monotonic clock before persisting to ``telemetry-guest.jsonl``.

Stdlib only: no `psutil`, no extra deps to bake into the guest. Every
field is read from /proc on the guest, so this works on busybox-based
Alpine, on Cirros, and on Metasploitable2 unchanged.

Wire path inside the guest:

    /dev/virtio-ports/cis490.guest.agent

The host side opens the matching unix socket on the hypervisor.
The protocol is intentionally trivial: the agent emits newline-
delimited JSON; the host emits nothing back. One direction.

This source is the **deployable** side; every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import argparse
import json
import os
import platform
import sys
import time
from typing import Any

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True
DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
DEFAULT_INTERVAL_MS = 100  # 10 Hz
DEFAULT_TOP_N = 8


# ---------- /proc parsers ---------------------------------------------------
def _read(path: str) -> str | None:
    try:
        with open(path, "rb") as f:
            return f.read().decode("ascii", errors="replace")
    except OSError:
        # Covers a missing or unreadable file, and a process that exits
        # mid-read (ProcessLookupError on /proc/<pid>/...).
        return None


def read_loadavg() -> tuple[float, float, float] | None:
    text = _read("/proc/loadavg")
    if text is None:
        return None
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def read_meminfo() -> dict[str, int]:
    text = _read("/proc/meminfo")
    out: dict[str, int] = {}
    if text is None:
        return out
    for line in text.splitlines():
        k, _, rest = line.partition(":")
        v = rest.strip()
        if v.endswith(" kB"):
            try:
                out[k] = int(v[:-3]) * 1024
            except ValueError:
                pass
    return out


def read_cpu_total() -> dict[str, int] | None:
    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
    in jiffies since boot."""
    text = _read("/proc/stat")
    if text is None:
        return None
    line = text.splitlines()[0]
    fields = line.split()
    # cpu user nice system idle iowait irq softirq steal guest guest_nice
    if not fields or fields[0] != "cpu":
        return None
    nums = [int(x) for x in fields[1:]]
    pad = nums + [0] * max(0, 10 - len(nums))
    return {
        "user": pad[0],
        "nice": pad[1],
        "system": pad[2],
        "idle": pad[3],
        "iowait": pad[4],
        "irq": pad[5],
        "softirq": pad[6],
        "steal": pad[7],
        "guest": pad[8],
        "guest_nice": pad[9],
    }


def read_thermal_milli_c() -> int | None:
    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
    text = _read("/sys/class/thermal/thermal_zone0/temp")
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_net_devs() -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
    text = _read("/proc/net/dev")
    out: dict[str, dict[str, int]] = {}
    if text is None:
        return out
    lines = text.splitlines()
    for line in lines[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if name == "lo":
            continue
        cols = rest.split()
        if len(cols) < 16:
            continue
        out[name] = {
            "rx_bytes": int(cols[0]),
            "rx_pkts": int(cols[1]),
            "tx_bytes": int(cols[8]),
            "tx_pkts": int(cols[9]),
        }
    return out


def read_listen_ports() -> list[int]:
    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
    out: set[int] = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        text = _read(path)
        if not text:
            continue
        for line in text.splitlines()[1:]:
            cols = line.split()
            if len(cols) < 4:
                continue
            if cols[3] != "0A":
                continue
            local = cols[1]  # "ADDR:PORT" with PORT in hex
            _, _, port_hex = local.rpartition(":")
            try:
                out.add(int(port_hex, 16))
            except ValueError:
                pass
    return sorted(out)


def read_top_procs(top_n: int) -> list[dict[str, Any]]:
    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
    procs: list[dict[str, Any]] = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return procs
    for ent in entries:
        if not ent.isdigit():
            continue
        pid = int(ent)
        stat = _read(f"/proc/{pid}/stat")
        if stat is None:
            continue
        try:
            # comm may itself contain spaces or parentheses, so split on
            # the last ")" and index the remaining fields from "state".
            rparen = stat.rindex(")")
            comm = stat[stat.index("(") + 1 : rparen]
            fields = stat[rparen + 2:].split()
            utime = int(fields[11])
            stime = int(fields[12])
            rss_pages = int(fields[21])
        except (ValueError, IndexError):
            continue
        procs.append({
            "pid": pid,
            "comm": comm[:32],
            "cpu_jiffies": utime + stime,
            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
        })
    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
    return procs[:top_n]


# ---------- one tick --------------------------------------------------------
def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
    mem = read_meminfo()
    cpu = read_cpu_total()
    load = read_loadavg()
    return {
        "t_guest_mono_ns": time.monotonic_ns(),
        "t_guest_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "kernel": platform.release(),
        "cpu_total_jiffies": cpu,
        "load_1m_5m_15m": list(load) if load else None,
        "mem_total_bytes": (mem.get("MemTotal") or 0),
        "mem_available_bytes": (mem.get("MemAvailable") or 0),
        "mem_buffers_bytes": (mem.get("Buffers") or 0),
        "mem_cached_bytes": (mem.get("Cached") or 0),
        "swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
        "thermal_milli_c": read_thermal_milli_c(),
        "net": read_net_devs(),
        "listen_ports": read_listen_ports(),
        "top_procs": read_top_procs(top_n),
    }


# ---------- main loop -------------------------------------------------------
def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-guest-agent")
    p.add_argument("--port", default=DEFAULT_PORT,
                   help="virtio-serial port path inside the guest")
    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
    p.add_argument("--once", action="store_true",
                   help="emit a single row and exit (for smoke tests)")
    args = p.parse_args(argv)

    if args.once:
        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
        sys.stdout.flush()
        return 0

    # Open the virtio-serial port. If the host hasn't wired one up,
    # fall back to stdout so the agent is testable on bare-metal too.
    out_fp: Any
    if os.path.exists(args.port):
        out_fp = open(args.port, "wb", buffering=0)
    else:
        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
        out_fp = sys.stdout.buffer

    interval_ns = args.interval_ms * 1_000_000
    next_tick = time.monotonic_ns()
    try:
        while True:
            row = collect_once(args.top_n)
            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
            try:
                out_fp.flush()
            except (AttributeError, OSError):
                pass
            next_tick += interval_ns
            sleep_ns = next_tick - time.monotonic_ns()
            if sleep_ns > 0:
                time.sleep(sleep_ns / 1_000_000_000)
            else:
                # Overran the tick: resynchronise instead of bursting.
                next_tick = time.monotonic_ns()
    except KeyboardInterrupt:
        return 0
    except (BrokenPipeError, OSError) as e:
        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
        return 1


if __name__ == "__main__":
    sys.exit(main())
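
The agent emits cumulative jiffy counters in cpu_total_jiffies, so a
consumer has to difference two consecutive rows to get utilisation. A
minimal sketch of that step follows; the function name cpu_busy_fraction
is hypothetical and this helper is not part of the agent.

```python
# Keys that make up total CPU time. "guest" and "guest_nice" are left out
# because /proc/stat already counts them inside "user" and "nice".
TOTAL_KEYS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")
IDLE_KEYS = ("idle", "iowait")


def cpu_busy_fraction(prev, cur):
    """Fraction of CPU time spent non-idle between two agent rows.

    `prev` and `cur` are the cpu_total_jiffies dicts from consecutive
    rows. Returns None when no time elapsed or a counter went backwards
    (for example across a guest reboot).
    """
    d_total = sum(cur[k] - prev[k] for k in TOTAL_KEYS)
    d_idle = sum(cur[k] - prev[k] for k in IDLE_KEYS)
    if d_total <= 0:
        return None
    return (d_total - d_idle) / d_total
```

At the default 10 Hz tick, a single interval covers only a few jiffies on
typical kernels, so averaging over several rows gives a steadier figure.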