This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP client +
      row builder + run loop. Tolerates older qemu without query-stats.
  collectors/guest_agent.py — source 5 (deployable). Reads the
      virtio-serial host-side socket, parses agent JSON-lines, re-stamps
      to the host monotonic clock, persists.
  collectors/pcap.py — source 4 (deployable). tcpdump capture +
      pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes
      Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md.
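
The 100 ms bucketing step described for collectors/pcap.py can be pictured roughly as follows. This is a minimal sketch, not the collector's actual code: the `Packet` tuple, the `bucketize` name, and the output field names (`t_bucket_ns`, `packets`, `bytes`) are hypothetical, and the real row schema is the one in docs/data-model.md.

```python
from collections import defaultdict
from typing import NamedTuple

BUCKET_NS = 100_000_000  # 100 ms window, as stated in the commit message


class Packet(NamedTuple):
    # Hypothetical decoded-packet shape; not the collector's real type.
    t_ns: int     # capture timestamp in nanoseconds
    proto: str    # "tcp", "udp", ...
    length: int   # bytes on the wire


def bucketize(packets, bucket_ns=BUCKET_NS):
    """Group packets into fixed windows and sum counts/bytes per protocol."""
    acc = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        start = (p.t_ns // bucket_ns) * bucket_ns  # floor to the window start
        slot = acc[(start, p.proto)]
        slot["packets"] += 1
        slot["bytes"] += p.length
    return [
        {"t_bucket_ns": start, "proto": proto, **counts}
        for (start, proto), counts in sorted(acc.items(), key=lambda kv: kv[0])
    ]


rows = bucketize([
    Packet(10_000_000, "tcp", 60),
    Packet(90_000_000, "tcp", 1500),
    Packet(120_000_000, "udp", 80),
])
```

Flooring the timestamp to a multiple of the window size means a packet's bucket depends only on its own timestamp, so the result does not depend on the order packets arrive in.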

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
      /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
      thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
      user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
      the agent socket; SLOT env support so multiple VMs run without
      socket / port collisions; PORT_BASE on launch_target so multiple
      target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
      no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
      headroom) + concurrent-slot runner. Per-slot ENV selects the
      sample. FleetCapacity dataclass round-trips into meta.json so
      "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
      waves of (max_concurrent) episodes each, every slot with a
      different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
      Restart=always so each invocation runs one wave and respawns,
      giving a continuous stream.
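
To make the capacity idea concrete, here is a rough sketch of a cores / RAM / load headroom calculation and a dataclass that survives a JSON round trip. It is not the code in orchestrator/fleet.py: `CapacitySketch`, `detect_capacity`, the field names, the 1024 MiB per-VM budget, the one reserved core, and the "always allow one VM" floor are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapacitySketch:
    # Hypothetical stand-in for FleetCapacity; field names are assumptions.
    cores: int
    mem_available_mib: int
    load_1m: float
    max_concurrent: int


def detect_capacity(cores, mem_available_mib, load_1m,
                    mib_per_vm=1024, reserved_cores=1):
    """Take the tightest of three limits: free cores, free RAM, load headroom."""
    by_cores = max(cores - reserved_cores, 0)
    by_mem = mem_available_mib // mib_per_vm
    by_load = max(int(cores - load_1m), 0)  # cores not already busy
    # Assumed policy: always allow one VM so a small box can still contribute.
    n = max(min(by_cores, by_mem, by_load), 1)
    return CapacitySketch(cores, mem_available_mib, load_1m, n)


cap = detect_capacity(cores=8, mem_available_mib=16384, load_1m=0.2)

# Round trip through JSON, the way a value stored in meta.json would travel.
meta = json.loads(json.dumps({"fleet_capacity": asdict(cap)}))
restored = CapacitySketch(**meta["fleet_capacity"])
```

On a real host the inputs could come from `os.cpu_count()`, `os.getloadavg()`, and the `MemAvailable` line of /proc/meminfo; they are passed in explicitly here so the arithmetic stays testable.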

Samples:
  samples/manifest.toml — six profiles spanning the five major
      behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
      categories) + deterministic select(host_id, slot, episode_index)
      so different hosts on the network walk the catalog in different
      orders without any coordinator.
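
One way a coordinator-free, deterministic selection like this could work is sketched below. The scheme is an assumption, not the one in samples/manifest.py: each host derives its own ordering of the catalog from a keyed hash, then indexes into it with `slot + episode_index`. The `host_order` helper and the sample names are hypothetical.

```python
import hashlib


def host_order(samples, host_id):
    """Per-host permutation: sort sample names by a hash keyed on host_id."""
    def key(name):
        return hashlib.sha256(f"{host_id}:{name}".encode("utf-8")).digest()
    return sorted(samples, key=key)


def select(samples, host_id, slot, episode_index):
    """Deterministically pick one sample for (host_id, slot, episode_index)."""
    if not samples:
        raise ValueError("empty manifest")
    order = host_order(samples, host_id)
    return order[(episode_index + slot) % len(order)]


catalog = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
wave = [select(catalog, "lab-host-1", slot, 0) for slot in range(6)]
```

`hashlib` is used rather than the built-in `hash()` because string hashing is randomized per process by default, which would break reproducibility across runs and hosts. With at least as many samples as slots, every slot in a wave gets a different sample.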

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
      fields on EpisodeConfig; when set, additional collector threads
      run alongside proc_qemu. EpisodeResult now carries rows_qmp +
      rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
      the package manager has it, generates a strong password into
      /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
      127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
      once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
      from the operator (Rapid7 download is registration-walled), pulls,
      verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py — fake QMP server, capability handshake,
      blockstats, async-event interleaving, 5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
      re-stamp, malformed-line tolerance
  tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames,
      bucketize correctness across windows
  tests/test_fleet.py — capacity math (8-core idle / low-RAM /
      high-load / Pi5 / 1-core box), manifest selection determinism +
      diversity
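
For readers unfamiliar with the QMP handshake that tests/test_qmp.py exercises, the sketch below shows its general shape against a fake server: read the greeting, send `qmp_capabilities`, and skip any asynchronous event that arrives before the reply. It is not the code in collectors/qmp.py; `qmp_execute` and `fake_server` are hypothetical names, and the sketch assumes one JSON message per line.

```python
import json
import socket
import threading


def qmp_execute(rfile, wfile, command):
    """Send one command and return its reply, skipping async events."""
    wfile.write(json.dumps({"execute": command}).encode() + b"\n")
    wfile.flush()
    while True:
        line = rfile.readline()
        if not line:
            raise ConnectionError("QMP peer closed the connection")
        msg = json.loads(line)
        if "event" in msg:
            continue  # asynchronous notification, not our reply
        if "error" in msg:
            raise RuntimeError(msg["error"].get("desc", "QMP error"))
        return msg["return"]


def fake_server(sock):
    # Stand-in for QEMU: greeting, then an event ahead of the real reply.
    f = sock.makefile("rwb")
    f.write(b'{"QMP": {"version": {}, "capabilities": []}}\n')
    f.flush()
    f.readline()  # consume the client's qmp_capabilities request
    f.write(b'{"event": "RESUME"}\n')
    f.write(b'{"return": {}}\n')
    f.flush()
    f.close()
    sock.close()


client, server = socket.socketpair()
t = threading.Thread(target=fake_server, args=(server,))
t.start()
rfile = client.makefile("rb")
wfile = client.makefile("wb")
greeting = json.loads(rfile.readline())
reply = qmp_execute(rfile, wfile, "qmp_capabilities")
t.join()
rfile.close()
wfile.close()
client.close()
```

Because an unsupported command comes back as an `error` object rather than a dropped connection, a caller could catch the `RuntimeError` for something like `query-stats` and carry on without that metric, which is one plausible way to tolerate older qemu builds.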

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
119 lines
4 KiB
Python
"""Source 5 (feature, deployable): in-guest agent reader.

QEMU exposes a virtio-serial channel two ways:
  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
  - on the host: a unix socket at ``$RUN_DIR/agent.sock``

The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
JSON-lines row per tick into the guest-side device. Bytes traverse the
virtio bus and surface on the host socket. This collector reads them,
re-stamps with the host's monotonic clock (so rows align with all
other telemetry on a single timeline), and persists to
``telemetry-guest.jsonl``.

Why re-stamp? The agent's clock is the *guest* clock, which can drift
from the host (rare in KVM, but happens during live-migration tests
and on heavy host load). The original guest timestamps stay in the row
under ``t_guest_*`` so analysts can quantify drift if they care.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import json
import logging
import socket
import threading
import time
from pathlib import Path


log = logging.getLogger("cis490.collectors.guest_agent")

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True


def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
    """Retry connecting to the host-side unix socket until ``timeout_s``
    elapses. Returns the connected socket, or ``None`` if it never came up."""
    deadline = time.monotonic() + timeout_s
    last_err: OSError | None = None
    while time.monotonic() < deadline:
        try:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                s.settimeout(2.0)
                s.connect(str(socket_path))
            except OSError:
                s.close()  # don't leak the fd across retries
                raise
            return s
        except OSError as e:
            last_err = e
            time.sleep(0.5)
    if last_err is not None:
        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
    return None


def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
    """Replace the agent's wall-only timestamps with host-clock ones,
    keeping the originals under ``t_guest_*`` for drift analysis."""
    out = dict(row)
    # If the agent already supplied t_guest_*, setdefault leaves them
    # untouched; a missing key is filled with None.
    out.setdefault("t_guest_mono_ns", row.get("t_guest_mono_ns"))
    out.setdefault("t_guest_wall_ns", row.get("t_guest_wall_ns"))
    # Host-side stamps: monotonic time relative to t_mono_origin_ns,
    # plus host wall-clock time.
    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    out["t_wall_ns"] = time.time_ns()
    out.setdefault("source", SOURCE)
    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
    return out


def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    stop_event: threading.Event,
    *,
    connect_timeout_s: float = 30.0,
) -> int:
    """Read agent JSON-lines from the host-side virtio-serial unix
    socket. Re-stamp each row with the host clock and persist.
    Returns the number of rows written."""
    sock_path = Path(socket_path)
    sock = _connect(sock_path, connect_timeout_s)
    if sock is None:
        return 0

    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    buf = b""
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    # Short timeout so stop_event is re-checked regularly.
                    sock.settimeout(0.5)
                    chunk = sock.recv(8192)
                except socket.timeout:
                    continue
                except OSError as e:
                    log.warning("guest-agent recv failed: %s", e)
                    break
                if not chunk:
                    log.info("guest-agent socket closed")
                    break
                buf += chunk
                # A chunk may end mid-line; only complete lines are parsed
                # and the remainder stays in buf for the next recv.
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except ValueError as e:
                        # Covers JSONDecodeError and UnicodeDecodeError
                        # (invalid UTF-8); both subclass ValueError.
                        log.warning("dropping malformed guest-agent line: %s", e)
                        continue
                    if not isinstance(row, dict):
                        log.warning("dropping non-object guest-agent line")
                        continue
                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
                    rows += 1
    finally:
        try:
            sock.close()
        except OSError:
            pass
    return rows
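
The module docstring notes that the preserved ``t_guest_*`` fields let analysts quantify drift. A minimal sketch of that analysis follows, assuming rows shaped like the output of ``_stamp``. The ``drift_ns`` helper is hypothetical and not part of the collector. Because the guest and host monotonic clocks have unrelated origins, the sketch tracks how the host-minus-guest offset changes relative to the first usable row rather than comparing absolute values.

```python
import json


def drift_ns(lines):
    """Return (t_mono_ns, drift) pairs, where drift is the change in the
    host-minus-guest offset since the first row that has a guest stamp."""
    base = None
    out = []
    for line in lines:
        row = json.loads(line)
        guest = row.get("t_guest_mono_ns")
        if guest is None:
            continue  # agent did not supply a guest timestamp
        offset = row["t_mono_ns"] - guest
        if base is None:
            base = offset
        out.append((row["t_mono_ns"], offset - base))
    return out


sample = [
    '{"t_mono_ns": 1000, "t_guest_mono_ns": 500}',
    '{"t_mono_ns": 1500, "t_guest_mono_ns": null}',
    '{"t_mono_ns": 2000, "t_guest_mono_ns": 1490}',
]
result = drift_ns(sample)
```

The measured offset also includes delivery latency (virtio transit plus the collector's read scheduling), so small fluctuations are likely jitter rather than clock drift; a sustained trend is the more meaningful signal.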