This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py: source 2 (oracle). Tiny synchronous QMP client +
    row builder + run loop. Tolerates older qemu without query-stats.
  collectors/guest_agent.py: source 5 (deployable). Reads the
    virtio-serial host-side socket, parses agent JSON-lines, re-stamps
    to the host monotonic clock, persists.
  collectors/pcap.py: source 4 (deployable). tcpdump capture +
    pure-Python pcap reader + 100 ms netflow.jsonl bucketizer. Decodes
    Ethernet/IPv4/TCP/UDP enough for the schema in docs/data-model.md.
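
As a rough illustration of the pcap reader + 100 ms bucketizer idea, the sketch below builds a tiny classic-format pcap in memory and counts packets per bucket. It is not the code in collectors/pcap.py; the helper names (build_pcap, iter_records, packets_per_bucket) are hypothetical, and it assumes a little-endian, microsecond-resolution file.

```python
import io
import struct
from collections import Counter

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def build_pcap(packets):
    """Build pcap bytes from (ts_sec, ts_usec, frame_bytes) tuples (hypothetical helper)."""
    buf = io.BytesIO()
    # Global header: magic, version 2.4, thiszone, sigfigs, snaplen, linktype 1 (Ethernet).
    buf.write(struct.pack("<IHHiIII", PCAP_MAGIC_USEC, 2, 4, 0, 0, 65535, 1))
    for ts_sec, ts_usec, frame in packets:
        # Record header: ts_sec, ts_usec, captured length, original length.
        buf.write(struct.pack("<IIII", ts_sec, ts_usec, len(frame), len(frame)))
        buf.write(frame)
    return buf.getvalue()


def iter_records(data):
    """Yield (t_ns, frame) for each complete record; stop quietly on truncation."""
    stream = io.BytesIO(data)
    hdr = stream.read(24)
    if len(hdr) < 24 or struct.unpack("<I", hdr[:4])[0] != PCAP_MAGIC_USEC:
        return
    while True:
        rec = stream.read(16)
        if len(rec) < 16:
            return
        ts_sec, ts_usec, caplen, _origlen = struct.unpack("<IIII", rec)
        frame = stream.read(caplen)
        if len(frame) < caplen:
            return
        yield ts_sec * 1_000_000_000 + ts_usec * 1_000, frame


def packets_per_bucket(data, bucket_ms=100):
    """Map bucket start time (ns) to the number of packets in that bucket."""
    bucket_ns = bucket_ms * 1_000_000
    counts = Counter()
    for t_ns, _frame in iter_records(data):
        counts[t_ns - t_ns % bucket_ns] += 1
    return dict(sorted(counts.items()))


demo = build_pcap([
    (10, 0, b"\x00" * 60),
    (10, 50_000, b"\x00" * 60),   # +50 ms, same bucket as the first packet
    (10, 150_000, b"\x00" * 60),  # +150 ms, next bucket
])
buckets = packets_per_bucket(demo)
```

Here the first two packets share the bucket starting at t = 10 s and the third lands in the bucket starting 100 ms later.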

In-guest agent:
  vm/guest-agent/cis490_agent.py: stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py: embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.
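
To make the "read /proc, write JSON-lines" shape concrete, here is a minimal stdlib-only sketch. It parses a sample /proc/meminfo text rather than the real file so it runs anywhere; the function names and the row fields (t_guest_ns, mem_total_kb, mem_available_kb) are assumptions for illustration, not the agent's actual schema.

```python
import json
import time

# Sample text in /proc/meminfo format, so the sketch does not need a Linux host.
SAMPLE_MEMINFO = """MemTotal:        2035480 kB
MemFree:          812340 kB
MemAvailable:    1500000 kB
HugePages_Total:       0
"""


def parse_meminfo(text):
    """Parse 'Key:   value [kB]' lines into a dict of ints."""
    out = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a key/value line
        fields = rest.split()
        if fields and fields[0].isdigit():
            out[key.strip()] = int(fields[0])
    return out


def make_row(meminfo_text, t_guest_ns):
    """Build one JSON-lines record (hypothetical field names)."""
    mem = parse_meminfo(meminfo_text)
    row = {
        "t_guest_ns": t_guest_ns,
        "mem_total_kb": mem.get("MemTotal"),
        "mem_available_kb": mem.get("MemAvailable"),
    }
    # One compact JSON object per line, newline-terminated.
    return json.dumps(row, sort_keys=True) + "\n"


line = make_row(SAMPLE_MEMINFO, t_guest_ns=time.monotonic_ns())
```

In a real agent the line would be written to the virtio-serial port; the host-side collector can then re-stamp each row against its own monotonic clock.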

Launchers:
  vm/launch_demo.sh / launch_target.sh: second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh: creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py: capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py: CLI. --capacity prints a report; --waves N runs
    N waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service: now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.
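
The capacity math might look something like the sketch below. This is an assumption-laden illustration, not orchestrator/fleet.py: the class name CapacitySketch, the per-VM RAM figure, the reserved core and RAM amounts, and the "always allow at least one slot" floor are all hypothetical choices.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapacitySketch:
    cores: int
    ram_mb: int
    load_1m: float
    max_concurrent: int


def estimate_capacity(cores, ram_mb, load_1m, *, vm_ram_mb=512,
                      reserve_ram_mb=1024, reserve_cores=1):
    """Take the tighter of the CPU and RAM limits (all thresholds are assumptions)."""
    # Cores left after reserving some for the host and subtracting current load.
    free_cores = max(0, cores - reserve_cores - int(load_1m))
    # VMs that fit in RAM after reserving some for the host.
    by_ram = max(0, (ram_mb - reserve_ram_mb) // vm_ram_mb)
    # Assumed floor: always allow one slot, even on a constrained box.
    max_concurrent = max(1, min(free_cores, by_ram))
    return CapacitySketch(cores, ram_mb, load_1m, max_concurrent)


cap = estimate_capacity(cores=8, ram_mb=16384, load_1m=0.2)
# A dataclass converts to a plain dict, so it can be embedded in a JSON document.
meta_fragment = json.dumps({"fleet_capacity": asdict(cap)})
restored = CapacitySketch(**json.loads(meta_fragment)["fleet_capacity"])
```

Keeping the inputs (cores, RAM, load) next to the derived max_concurrent is what makes the decision auditable later: the recorded values are enough to recompute it.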

Samples:
  samples/manifest.toml: six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256
    distinguishes).
  samples/manifest.py: strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.
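
One way to get coordinator-free, deterministic selection is to derive a per-host ordering of the catalog from a hash of the host id. The sketch below shows that technique only; it is not the logic in samples/manifest.py, and the catalog names, the host_order helper, and the extra sample_names parameter are hypothetical.

```python
import hashlib

# Hypothetical catalog; real entries would come from the manifest.
CATALOG = [f"sample-{i}" for i in range(6)]


def host_order(host_id, sample_names):
    """Return a permutation of the catalog that depends only on host_id."""
    def key(name):
        # Hashing host_id together with the name gives each host its own order.
        return hashlib.sha256(f"{host_id}:{name}".encode()).hexdigest()
    return sorted(sample_names, key=key)


def select(host_id, slot, episode_index, sample_names):
    """Pick a sample; same inputs always give the same answer."""
    order = host_order(host_id, sample_names)
    # Slots within one episode index map to distinct positions
    # as long as there are no more slots than samples.
    return order[(episode_index + slot) % len(order)]


picks_wave_3 = [select("host-a", slot, 3, CATALOG) for slot in range(6)]
```

Because the ordering is a pure function of host_id, two hosts never need to talk to each other, and any past selection can be reproduced from (host_id, slot, episode_index) alone.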

EpisodeRunner:
  orchestrator/episode.py: optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh: installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh: accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.
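
The verify step is the part worth getting right, so here is a small sketch of streaming SHA-256 verification. The actual script is shell; this Python version only illustrates the check, and the function names (sha256_file, verify_image) and the demo file are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_image(path, expected_sha256):
    """Return True only if the file's digest matches the operator-supplied one."""
    expected = expected_sha256.strip().lower()
    return hmac.compare_digest(sha256_file(path), expected)


# Demo on a throwaway file standing in for a downloaded image.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "demo.img")
    with open(image, "wb") as f:
        f.write(b"demo image bytes")
    expected = hashlib.sha256(b"demo image bytes").hexdigest()
    ok = verify_image(image, expected)
    bad = verify_image(image, "0" * 64)
```

A fetch script would typically refuse to convert or install the image unless this check passes.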

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py: fake QMP server, capability handshake,
    blockstats, async-event interleaving, 5-failure backoff
  tests/test_guest_agent.py: fake virtio socket, JSON-lines read +
    re-stamp, malformed-line tolerance
  tests/test_pcap.py: synthetic pcap with TCP/UDP/ARP frames,
    bucketize correctness across windows
  tests/test_fleet.py: capacity math (8-core idle / low-RAM /
    high-load / Pi5 / 1-core box), manifest selection determinism +
    diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
288 lines
9.3 KiB
Python
"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.

Captures packets on the host-only ``br-malware`` bridge during an
episode, writes the raw pcap, and produces a bucketed JSONL file the
trainer can consume directly.

The capture is **gateway-side**: the orchestrator sees the same
packets a real upstream router/gateway would see in deployment, so
features derived here transfer 1:1 to the deployment-time gateway
observer.

Implementation:

- ``run_capture()`` spawns
  ``tcpdump -i <bridge> -U -s <snaplen> -w <out.pcap>`` as a
  subprocess for the episode duration. ``-U`` flushes per packet so
  the file is consumable mid-flight.

- ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
  rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
  dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
  the schema in docs/data-model.md.

The pure-Python parser is intentionally minimal: it does NOT do
fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
the cases that occur on a host-only bridge for malware behaviour:
plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
the byte/packet level but skipped for protocol-specific stats.
"""

from __future__ import annotations

import json
import logging
import os
import struct
import subprocess
import threading
import time
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


log = logging.getLogger("cis490.collectors.pcap")

SOURCE = "bridge_pcap"
AVAILABLE_IN_DEPLOYMENT = True

# Pcap file-level header
_PCAP_GLOBAL_HDR = "<IHHiIII"
_PCAP_GLOBAL_HDR_SIZE = 24
_PCAP_REC_HDR = "<IIII"
_PCAP_REC_HDR_SIZE = 16
_PCAP_MAGIC_USEC = 0xa1b2c3d4
_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant


# ---------------------------------------------------------------------------
# Capture
# ---------------------------------------------------------------------------


@dataclass
class CaptureHandle:
    proc: subprocess.Popen
    pcap_path: Path
    bridge: str
    started_mono_ns: int


def run_capture(
    *,
    bridge: str,
    pcap_path: Path,
    snaplen: int = 256,
    bpf: str | None = None,
) -> CaptureHandle:
    """Start a tcpdump capture on ``bridge``. Returns a handle the
    caller stops via ``stop_capture()``."""
    pcap_path.parent.mkdir(parents=True, exist_ok=True)
    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
    if bpf:
        args.append(bpf)
    log.info("starting pcap: %s", " ".join(args))
    proc = subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
    )
    return CaptureHandle(
        proc=proc, pcap_path=pcap_path, bridge=bridge,
        started_mono_ns=time.monotonic_ns(),
    )


def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
    """SIGINT tcpdump (the Right Signal: flushes buffers + exits 0).
    Returns the process exit code."""
    proc = handle.proc
    if proc.poll() is None:
        proc.send_signal(2)  # SIGINT
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait(timeout=timeout_s)
    return proc.returncode


# ---------------------------------------------------------------------------
# Pure-Python pcap parser
# ---------------------------------------------------------------------------


def _iter_pcap(path: Path):
    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
    file. Tolerates either microsecond or nanosecond magics."""
    with path.open("rb") as f:
        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
            return
        magic = struct.unpack("<I", hdr[:4])[0]
        if magic == _PCAP_MAGIC_USEC:
            sub_mult = 1000  # us → ns
        elif magic == _PCAP_MAGIC_NSEC:
            sub_mult = 1
        else:
            log.warning("unknown pcap magic %#x in %s", magic, path)
            return
        while True:
            rec = f.read(_PCAP_REC_HDR_SIZE)
            if len(rec) < _PCAP_REC_HDR_SIZE:
                return
            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
            data = f.read(caplen)
            if len(data) < caplen:
                return
            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
            yield t_ns, data


def _decode(frame: bytes) -> dict:
    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
    protocols return only the ethertype + lengths."""
    out: dict = {"size": len(frame)}
    if len(frame) < 14:
        return out
    ethertype = struct.unpack(">H", frame[12:14])[0]
    out["ethertype"] = ethertype
    if ethertype != 0x0800:  # not IPv4: count, don't decode further
        return out
    ip = frame[14:]
    if len(ip) < 20:
        return out
    ihl = (ip[0] & 0x0F) * 4
    if ihl < 20 or len(ip) < ihl:
        return out
    proto = ip[9]
    src = ip[12:16]
    dst = ip[16:20]
    out["ip_proto"] = proto
    out["src_ip"] = ".".join(str(b) for b in src)
    out["dst_ip"] = ".".join(str(b) for b in dst)
    payload = ip[ihl:]
    if proto == 6 and len(payload) >= 20:  # TCP
        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
        flags = off_flags & 0x003F
        out["src_port"] = sport
        out["dst_port"] = dport
        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
    elif proto == 17 and len(payload) >= 8:  # UDP
        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
        out["src_port"] = sport
        out["dst_port"] = dport
    return out


def bucketize(
    pcap_path: Path,
    netflow_path: Path,
    *,
    bucket_ms: int = 100,
    t_mono_origin_ns: int = 0,
    bridge_ip: str = "10.200.0.1",
) -> int:
    """Read a pcap and emit one row per ``bucket_ms`` window into
    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
    perspective (host = ``bridge_ip``):

        out = packet whose src is the host-side address (host → guest)
        in  = anything else seen on the bridge (guest → host or
              guest-to-guest)

    Returns the number of rows written."""
    if not pcap_path.exists():
        return 0
    bucket_ns = bucket_ms * 1_000_000
    netflow_path.parent.mkdir(parents=True, exist_ok=True)

    rows = 0
    bucket_start: int | None = None
    agg: dict = _empty_bucket()
    with netflow_path.open("a", buffering=1) as out:
        for t_pkt_ns, frame in _iter_pcap(pcap_path):
            d = _decode(frame)
            # Establish first bucket origin on first packet.
            if bucket_start is None:
                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
            while t_pkt_ns >= bucket_start + bucket_ns:
                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
                rows += 1
                agg = _empty_bucket()
                bucket_start += bucket_ns
            _accumulate(agg, d, bridge_ip)
        if bucket_start is not None and any(v for v in agg.values() if v):
            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
            rows += 1
    return rows


def _empty_bucket() -> dict:
    return {
        "pkts_in": 0, "pkts_out": 0,
        "bytes_in": 0, "bytes_out": 0,
        "syn_count": 0, "fin_count": 0, "rst_count": 0,
        "udp_count": 0, "tcp_count": 0,
        "dns_query_count": 0,
        "dst_ips": set(), "dst_ports": set(),
        "tcp_new_flows": 0,
    }


def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
    sz = d.get("size", 0)
    is_out = d.get("src_ip") == bridge_ip
    if is_out:
        agg["pkts_out"] += 1
        agg["bytes_out"] += sz
    else:
        agg["pkts_in"] += 1
        agg["bytes_in"] += sz

    proto = d.get("ip_proto")
    if proto == 6:
        agg["tcp_count"] += 1
        flags = d.get("tcp_flags", 0)
        if flags & 0x02:  # SYN
            agg["syn_count"] += 1
            if not (flags & 0x10):  # SYN without ACK = new flow
                agg["tcp_new_flows"] += 1
        if flags & 0x01:
            agg["fin_count"] += 1
        if flags & 0x04:
            agg["rst_count"] += 1
    elif proto == 17:
        agg["udp_count"] += 1
        if d.get("dst_port") == 53:
            agg["dns_query_count"] += 1

    dst = d.get("dst_ip")
    if dst:
        agg["dst_ips"].add(dst)
    dport = d.get("dst_port")
    if dport is not None:
        agg["dst_ports"].add(dport)


def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
    row = {
        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
        "t_wall_ns": bucket_start_ns,
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "bucket_ms": bucket_ns // 1_000_000,
        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
        "syn_count": agg["syn_count"],
        "fin_count": agg["fin_count"],
        "rst_count": agg["rst_count"],
        "udp_count": agg["udp_count"],
        "tcp_count": agg["tcp_count"],
        "dns_query_count": agg["dns_query_count"],
        "unique_dst_ips": len(agg["dst_ips"]),
        "unique_dst_ports": len(agg["dst_ports"]),
        "tcp_new_flows": agg["tcp_new_flows"],
    }
    out.write(json.dumps(row) + "\n")