Collectors 2/4/5 + fleet runner + sample manifest + Tier-3 setup scripts

This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.

Collectors landed:
  collectors/qmp.py          — source 2 (oracle). Tiny synchronous QMP
                               client + row builder + run loop. Tolerates
                               older qemu without query-stats.
  collectors/guest_agent.py  — source 5 (deployable). Reads the
                               virtio-serial host-side socket, parses
                               agent JSON-lines, re-stamps to the host
                               monotonic clock, persists.
  collectors/pcap.py         — source 4 (deployable). tcpdump capture
                               + pure-Python pcap reader + 100 ms
                               netflow.jsonl bucketizer. Decodes
                               Ethernet/IPv4/TCP/UDP enough for the
                               schema in docs/data-model.md.

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.
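
orchestrator/fleet.py is not shown on this page, so the following is only a rough sketch of the idea behind a capacity detector: take the tighter of a CPU bound and a RAM bound, then serialize the result the way a dataclass can round-trip into meta.json. The names `CapacitySketch` and `estimate_capacity`, and every reserve and per-VM size below, are assumptions made for illustration, not values from the commit.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapacitySketch:
    """Hypothetical stand-in for FleetCapacity; field names are assumed."""
    cores: int
    mem_available_mb: int
    load1: float
    max_concurrent: int


def estimate_capacity(
    cores: int,
    mem_available_mb: int,
    load1: float,
    *,
    reserve_cores: int = 1,      # assumed: keep one core for the host
    reserve_mem_mb: int = 1024,  # assumed: keep 1 GiB for the host
    per_vm_cores: int = 1,       # assumed per-VM CPU cost
    per_vm_mem_mb: int = 512,    # assumed per-VM RAM cost
) -> CapacitySketch:
    # CPU bound: cores left after the reserve and the current 1-minute load.
    free_cores = max(0.0, cores - reserve_cores - load1)
    by_cpu = int(free_cores // per_vm_cores)
    # RAM bound: memory left after the reserve, in whole VMs.
    by_mem = max(0, (mem_available_mb - reserve_mem_mb) // per_vm_mem_mb)
    return CapacitySketch(cores, mem_available_mb, load1, min(by_cpu, by_mem))


def to_meta_json(cap: CapacitySketch) -> str:
    """Serialize so the slot count can be audited after the episode."""
    return json.dumps(asdict(cap), sort_keys=True)
```

With these assumed defaults an idle 8-core host is CPU-bound, while a host with little free RAM or a high load average gets zero slots regardless of core count.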

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.
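
samples/manifest.py is not shown on this page. One way such a coordinator-free selection could work, offered as a sketch rather than as the commit's actual algorithm, is to derive a per-host ordering of the catalog from a hash and index into it by slot and episode. The function body and the example host id are assumptions; only the `select(host_id, slot, episode_index)` argument triple comes from the commit message.

```python
import hashlib


def select(names: list[str], host_id: str, slot: int, episode_index: int) -> str:
    """Pick a sample name deterministically from (host_id, slot, episode_index).

    Sketch only: each host ranks the catalog by sha256(host_id:name), which
    gives it a stable private ordering, then walks that ordering.
    """
    if not names:
        raise ValueError("empty sample catalog")

    def rank(name: str) -> str:
        return hashlib.sha256(f"{host_id}:{name}".encode("utf-8")).hexdigest()

    order = sorted(names, key=rank)
    # Slots in the same wave land on consecutive positions, so they differ
    # as long as there are no more slots than catalog entries.
    return order[(slot + episode_index) % len(order)]
```

Because the ordering depends only on the host id and the catalog contents, two runs on the same host agree with each other without any shared state.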

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py          — fake QMP server, capability handshake,
                               blockstats, async-event interleaving,
                               5-failure backoff
  tests/test_guest_agent.py  — fake virtio socket, JSON-lines read +
                               re-stamp, malformed-line tolerance
  tests/test_pcap.py         — synthetic pcap with TCP/UDP/ARP frames,
                               bucketize correctness across windows
  tests/test_fleet.py        — capacity math (8-core idle / low-RAM /
                               high-load / Pi5 / 1-core box), manifest
                               selection determinism + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 2579683efb
commit 1b6c7b2f4a
22 changed files with 2825 additions and 40 deletions

README.md (16 lines changed)
@@ -94,15 +94,19 @@ tools/show_envelope.sh data/episodes/<episode_id>
 
 ## Status
 
-- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
+- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — running on Pi5 via Caddy + mTLS (wg-pki client CA)
 - ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
-- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
+- ✅ Host /proc oracle collector (source 1) @ 10 Hz
+- ✅ **QMP collector** (source 2) — query-status / query-blockstats / query-stats, 1 Hz
+- ✅ **Bridge pcap** (source 4) — pure-Python pcap parser + 100 ms-bucketed netflow.jsonl
+- ✅ **In-guest agent** (source 5) — virtio-serial; cidata-embedded for first-boot install on Alpine; host-side reader re-stamps to host clock
 - ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
-- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
+- ✅ Real VM (Alpine 3.21 cloud-init under KVM)
 - ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
-- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed (`exploits/`); end-to-end run against a live `msfrpcd` + Metasploitable2 image still pending.
-- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
-- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
+- 🟡 **Tier 3 — exploit driver:** `MSFExploitDriver` + msfrpc client + first module config landed; `scripts/install-msfrpcd.sh` automates msfrpcd setup; `scripts/fetch-metasploitable2.sh` pulls + verifies the target image (URL+sha256 from operator). Driver v2 (sample-profile-driven workloads) is the next step for ML diversity.
+- ✅ **Shipper** — lab-host ↔ Pi receiver via tar+zstd PUT over WG with mTLS; `--ping` smoke mode
+- ✅ **Fleet runner** — host-capacity-aware concurrency (`tools/run_fleet.py`); resource detector reserves cores + RAM headroom; sample manifest with deterministic per-(host, slot, episode) selection so every host on the network produces *novel, varied, labeled* data
+- ✅ **Sample manifest** — six initial profiles (cryptominer / botnet / ransomware / banking-trojan / fileless / RAT). Real-malware fetch from MalwareBazaar is the Tier-4 follow-up.
 
 > **Topology note:** in this project the **Pi5 is the WireGuard-side
 > *collector*** that receives episode tarballs from one or more lab hosts.

collectors/guest_agent.py (new file, 119 lines)
@@ -0,0 +1,119 @@
"""Source 5 (feature, deployable): in-guest agent reader.

QEMU exposes a virtio-serial channel two ways:
  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
  - on the host: a unix socket at ``$RUN_DIR/agent.sock``

The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
JSON-lines row per tick into the guest-side device. Bytes traverse the
virtio bus and surface on the host socket. This collector reads them,
re-stamps with the host's monotonic clock (so rows align with all
other telemetry on a single timeline), and persists to
``telemetry-guest.jsonl``.

Why re-stamp? The agent's clock is the *guest* clock, which can drift
from the host (rare in KVM, but happens during live-migration tests
and on heavy host load). The original guest timestamps stay in the row
under ``t_guest_*`` so analysts can quantify drift if they care.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import json
import logging
import socket
import threading
import time
from pathlib import Path


log = logging.getLogger("cis490.collectors.guest_agent")

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True


def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
    deadline = time.monotonic() + timeout_s
    last_err: OSError | None = None
    while time.monotonic() < deadline:
        try:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            s.settimeout(2.0)
            s.connect(str(socket_path))
            return s
        except OSError as e:
            last_err = e
            time.sleep(0.5)
    if last_err is not None:
        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
    return None


def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
    """Replace the agent's wall-only timestamps with host-clock ones,
    keeping the originals under ``t_guest_*`` for drift analysis."""
    out = dict(row)
    out.setdefault("t_guest_mono_ns", row.get("t_guest_mono_ns"))
    out.setdefault("t_guest_wall_ns", row.get("t_guest_wall_ns"))
    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    out["t_wall_ns"] = time.time_ns()
    out.setdefault("source", SOURCE)
    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
    return out


def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    stop_event: threading.Event,
    *,
    connect_timeout_s: float = 30.0,
) -> int:
    """Read agent JSON-lines from the host-side virtio-serial unix
    socket. Re-stamp each row with the host clock and persist."""
    sock_path = Path(socket_path)
    sock = _connect(sock_path, connect_timeout_s)
    if sock is None:
        return 0

    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    buf = b""
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    sock.settimeout(0.5)
                    chunk = sock.recv(8192)
                except socket.timeout:
                    continue
                except OSError as e:
                    log.warning("guest-agent recv failed: %s", e)
                    break
                if not chunk:
                    log.info("guest-agent socket closed")
                    break
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except json.JSONDecodeError as e:
                        log.warning("dropping malformed guest-agent line: %s", e)
                        continue
                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
                    rows += 1
    finally:
        try:
            sock.close()
        except OSError:
            pass
    return rows
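
The read loop in collectors/guest_agent.py has to cope with socket chunks that end mid-row and with lines that are not valid JSON. The sketch below isolates that framing step so it can be exercised without a socket. The helper name `feed` is an assumption for illustration and does not exist in the commit.

```python
import json


def feed(buf: bytes, chunk: bytes) -> tuple[list[dict], bytes]:
    """Append ``chunk`` to ``buf`` and split out every complete JSON line.

    Returns the parsed rows plus the leftover bytes of a partial line,
    which the caller passes back in with the next chunk. Blank and
    malformed lines are dropped, as in the collector's loop.
    """
    buf += chunk
    rows: list[dict] = []
    while b"\n" in buf:
        line, _, buf = buf.partition(b"\n")
        line = line.strip()
        if not line:
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate garbage on the channel
    return rows, buf


# A row split across two recv() calls is reassembled on the second call.
first_rows, leftover = feed(b"", b'{"cpu": 1}\n{"cp')
second_rows, leftover = feed(leftover, b'u": 2}\nnot json\n\n')
```

Keeping the partial tail in `buf` is what lets a fixed-size `recv(8192)` be used safely on a stream that has no message boundaries.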

collectors/pcap.py (new file, 288 lines)
@@ -0,0 +1,288 @@
"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.

Captures packets on the host-only ``br-malware`` bridge during an
episode, writes the raw pcap, and produces a bucketed JSONL file the
trainer can consume directly.

The capture is **gateway-side** — the orchestrator sees the same
packets a real upstream router/gateway would see in deployment, so
features derived here transfer 1:1 to the deployment-time gateway
observer.

Implementation:

  - ``run_capture()`` spawns ``tcpdump -i <bridge> -U -w <out.pcap>``
    as a subprocess for the episode duration. ``-U`` flushes per
    packet so the file is consumable mid-flight.

  - ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
    rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
    dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
    the schema in docs/data-model.md.

The pure-Python parser is intentionally minimal — it does NOT do
fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
the cases that occur on a host-only bridge for malware behaviour:
plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
the byte/packet level but skipped for protocol-specific stats.
"""

from __future__ import annotations

import json
import logging
import signal
import struct
import subprocess
import threading
import time
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


log = logging.getLogger("cis490.collectors.pcap")

SOURCE = "bridge_pcap"
AVAILABLE_IN_DEPLOYMENT = True

# Pcap file-level header
_PCAP_GLOBAL_HDR = "<IHHiIII"
_PCAP_GLOBAL_HDR_SIZE = 24
_PCAP_REC_HDR = "<IIII"
_PCAP_REC_HDR_SIZE = 16
_PCAP_MAGIC_USEC = 0xa1b2c3d4
_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant


# ---------------------------------------------------------------------------
# Capture
# ---------------------------------------------------------------------------


@dataclass
class CaptureHandle:
    proc: subprocess.Popen
    pcap_path: Path
    bridge: str
    started_mono_ns: int


def run_capture(
    *,
    bridge: str,
    pcap_path: Path,
    snaplen: int = 256,
    bpf: str | None = None,
) -> CaptureHandle:
    """Start a tcpdump capture on ``bridge``. Returns a handle the
    caller stops via ``stop_capture()``."""
    pcap_path.parent.mkdir(parents=True, exist_ok=True)
    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
    if bpf:
        args.append(bpf)
    log.info("starting pcap: %s", " ".join(args))
    proc = subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
    )
    return CaptureHandle(
        proc=proc, pcap_path=pcap_path, bridge=bridge,
        started_mono_ns=time.monotonic_ns(),
    )


def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
    """SIGINT tcpdump (the Right Signal — flushes buffers + exits 0).
    Returns the process exit code."""
    proc = handle.proc
    if proc.poll() is None:
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait(timeout=timeout_s)
    return proc.returncode


# ---------------------------------------------------------------------------
# Pure-Python pcap parser
# ---------------------------------------------------------------------------


def _iter_pcap(path: Path):
    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
    file. Tolerates either microsecond or nanosecond magics."""
    with path.open("rb") as f:
        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
            return
        magic = struct.unpack("<I", hdr[:4])[0]
        if magic == _PCAP_MAGIC_USEC:
            sub_mult = 1000  # us → ns
        elif magic == _PCAP_MAGIC_NSEC:
            sub_mult = 1
        else:
            log.warning("unknown pcap magic %#x in %s", magic, path)
            return
        while True:
            rec = f.read(_PCAP_REC_HDR_SIZE)
            if len(rec) < _PCAP_REC_HDR_SIZE:
                return
            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
            data = f.read(caplen)
            if len(data) < caplen:
                return
            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
            yield t_ns, data


def _decode(frame: bytes) -> dict:
    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
    protocols return only the ethertype + lengths."""
    out: dict = {"size": len(frame)}
    if len(frame) < 14:
        return out
    ethertype = struct.unpack(">H", frame[12:14])[0]
    out["ethertype"] = ethertype
    if ethertype != 0x0800:  # not IPv4 — count, don't decode further
        return out
    ip = frame[14:]
    if len(ip) < 20:
        return out
    ihl = (ip[0] & 0x0F) * 4
    if ihl < 20 or len(ip) < ihl:
        return out
    proto = ip[9]
    src = ip[12:16]
    dst = ip[16:20]
    out["ip_proto"] = proto
    out["src_ip"] = ".".join(str(b) for b in src)
    out["dst_ip"] = ".".join(str(b) for b in dst)
    payload = ip[ihl:]
    if proto == 6 and len(payload) >= 20:  # TCP
        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
        flags = off_flags & 0x003F
        out["src_port"] = sport
        out["dst_port"] = dport
        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
    elif proto == 17 and len(payload) >= 8:  # UDP
        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
        out["src_port"] = sport
        out["dst_port"] = dport
    return out


def bucketize(
    pcap_path: Path,
    netflow_path: Path,
    *,
    bucket_ms: int = 100,
    t_mono_origin_ns: int = 0,
    bridge_ip: str = "10.200.0.1",
) -> int:
    """Read a pcap and emit one row per ``bucket_ms`` window into
    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
    perspective (host = ``bridge_ip``):

      out = packet whose src is the host-side address (host → guest)
      in  = anything else seen on the bridge (guest → host or
            guest-to-guest)

    Returns the number of rows written."""
    if not pcap_path.exists():
        return 0
    bucket_ns = bucket_ms * 1_000_000
    netflow_path.parent.mkdir(parents=True, exist_ok=True)

    rows = 0
    bucket_start: int | None = None
    agg: dict = _empty_bucket()
    with netflow_path.open("a", buffering=1) as out:
        for t_pkt_ns, frame in _iter_pcap(pcap_path):
            d = _decode(frame)
            # Establish first bucket origin on first packet.
            if bucket_start is None:
                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
            while t_pkt_ns >= bucket_start + bucket_ns:
                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
                rows += 1
                agg = _empty_bucket()
                bucket_start += bucket_ns
            _accumulate(agg, d, bridge_ip)
        if bucket_start is not None and any(agg.values()):
            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
            rows += 1
    return rows


def _empty_bucket() -> dict:
    return {
        "pkts_in": 0, "pkts_out": 0,
        "bytes_in": 0, "bytes_out": 0,
        "syn_count": 0, "fin_count": 0, "rst_count": 0,
        "udp_count": 0, "tcp_count": 0,
        "dns_query_count": 0,
        "dst_ips": set(), "dst_ports": set(),
        "tcp_new_flows": 0,
    }


def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
    sz = d.get("size", 0)
    is_out = d.get("src_ip") == bridge_ip
    if is_out:
        agg["pkts_out"] += 1
        agg["bytes_out"] += sz
    else:
        agg["pkts_in"] += 1
        agg["bytes_in"] += sz

    proto = d.get("ip_proto")
    if proto == 6:
        agg["tcp_count"] += 1
        flags = d.get("tcp_flags", 0)
        if flags & 0x02:  # SYN
            agg["syn_count"] += 1
            if not (flags & 0x10):  # SYN without ACK = new flow
                agg["tcp_new_flows"] += 1
        if flags & 0x01:
            agg["fin_count"] += 1
        if flags & 0x04:
            agg["rst_count"] += 1
    elif proto == 17:
        agg["udp_count"] += 1
        if d.get("dst_port") == 53:
            agg["dns_query_count"] += 1

    dst = d.get("dst_ip")
    if dst:
        agg["dst_ips"].add(dst)
    dport = d.get("dst_port")
    if dport is not None:
        agg["dst_ports"].add(dport)


def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
    row = {
        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
        "t_wall_ns": bucket_start_ns,
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "bucket_ms": bucket_ns // 1_000_000,
        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
        "syn_count": agg["syn_count"],
        "fin_count": agg["fin_count"],
        "rst_count": agg["rst_count"],
        "udp_count": agg["udp_count"],
        "tcp_count": agg["tcp_count"],
        "dns_query_count": agg["dns_query_count"],
        "unique_dst_ips": len(agg["dst_ips"]),
        "unique_dst_ports": len(agg["dst_ports"]),
        "tcp_new_flows": agg["tcp_new_flows"],
    }
    out.write(json.dumps(row) + "\n")
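
To make the byte layout that collectors/pcap.py relies on concrete, the sketch below builds a one-packet pcap in memory and decodes it again with the same struct formats the collector uses. The helper names (`build_syn_frame`, `build_pcap`, `read_first_packet`), the MAC addresses, the ports and the timestamps are invented for illustration, and the checksums are left at zero because the decoder never reads them.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # little-endian, microsecond-resolution pcap


def build_syn_frame(src_ip, dst_ip, sport, dport) -> bytes:
    """Ethernet II + IPv4 + TCP SYN with no options (54 bytes)."""
    eth = b"\x02" * 6 + b"\x04" * 6 + struct.pack(">H", 0x0800)
    ip = struct.pack(">BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
                     bytes(src_ip), bytes(dst_ip))
    # Data offset of 5 words in the top nibble, SYN (0x02) in the flag bits.
    tcp = struct.pack(">HHIIHHHH", sport, dport, 0, 0, (5 << 12) | 0x02, 65535, 0, 0)
    return eth + ip + tcp


def build_pcap(records) -> bytes:
    """records: iterable of (ts_sec, ts_usec, frame_bytes)."""
    blob = struct.pack("<IHHiIII", PCAP_MAGIC_USEC, 2, 4, 0, 0, 256, 1)
    for ts_sec, ts_usec, frame in records:
        blob += struct.pack("<IIII", ts_sec, ts_usec, len(frame), len(frame)) + frame
    return blob


def read_first_packet(blob: bytes, bucket_ms: int = 100) -> dict:
    """Decode the first record and compute its bucket origin."""
    magic = struct.unpack("<I", blob[:4])[0]
    sub_mult = 1000 if magic == PCAP_MAGIC_USEC else 1
    ts_sec, ts_sub, caplen, _orig_len = struct.unpack("<IIII", blob[24:40])
    frame = blob[40:40 + caplen]
    t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
    bucket_ns = bucket_ms * 1_000_000
    ip = frame[14:]                      # skip the Ethernet header
    ihl = (ip[0] & 0x0F) * 4             # IPv4 header length in bytes
    sport, dport, _seq, _ack, off_flags = struct.unpack(">HHIIH", ip[ihl:ihl + 14])
    flags = off_flags & 0x3F
    return {
        "t_ns": t_ns,
        "bucket_start_ns": t_ns - (t_ns % bucket_ns),
        "src_ip": ".".join(str(b) for b in ip[12:16]),
        "src_port": sport,
        "dst_port": dport,
        "tcp_flags": flags,
        "new_flow": bool(flags & 0x02) and not (flags & 0x10),
    }


frame = build_syn_frame((10, 200, 0, 5), (10, 200, 0, 1), 40000, 4444)
info = read_first_packet(build_pcap([(1, 250_000, frame)]))
```

A packet stamped 1.25 s after the epoch falls in the bucket that starts at 1.2 s, and a SYN without ACK is what the collector counts as a new TCP flow.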

collectors/qmp.py (new file, 244 lines)
@@ -0,0 +1,244 @@
"""Source 2 (oracle): QEMU QMP sampler.

Connects to the QEMU monitor protocol socket exposed by the launcher
($RUN_DIR/qmp.sock) and periodically queries the hypervisor for
per-VM stats that don't show up in /proc/<qemu_pid>:

  - per-disk block I/O (rd_bytes, wr_bytes, rd_ops, wr_ops)
  - VM run state (running / paused / shutdown)
  - per-netdev tx/rx counters (when available)
  - KVM stat counters (when available; introspection differs by qemu
    version, so anything we can't read is skipped silently)

This source is **oracle-only** — it does not exist on a deployed
device. Every row carries ``available_in_deployment: false``.

Wire format: QMP is line-delimited JSON. The handshake is fixed:

    server → {"QMP": {capabilities: [...], version: ...}}
    client → {"execute": "qmp_capabilities"}
    server → {"return": {}}
    (client may now issue commands)

We use a dedicated synchronous client because QMP is request/response
and we don't need pipelining; one query batch per tick keeps the
on-disk schema simple.
"""

from __future__ import annotations

import json
import logging
import socket
import threading
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any


log = logging.getLogger("cis490.collectors.qmp")

SOURCE = "host_qmp"
AVAILABLE_IN_DEPLOYMENT = False


class QMPError(RuntimeError):
    pass


@dataclass
class _SockReader:
    sock: socket.socket
    buf: bytes = b""

    def read_line(self, timeout_s: float = 5.0) -> str:
        deadline = time.monotonic() + timeout_s
        while b"\n" not in self.buf:
            self.sock.settimeout(max(0.1, deadline - time.monotonic()))
            try:
                chunk = self.sock.recv(8192)
            except socket.timeout as e:
                raise QMPError(f"QMP read timed out: {e}") from e
            if not chunk:
                raise QMPError("QMP connection closed by peer")
            self.buf += chunk
        line, _, rest = self.buf.partition(b"\n")
        self.buf = rest
        return line.decode("utf-8", errors="replace")


class QMPClient:
    """Tiny synchronous QMP client over a unix socket."""

    def __init__(self, socket_path: str | Path) -> None:
        self.path = str(socket_path)
        self._sock: socket.socket | None = None
        self._reader: _SockReader | None = None

    def connect(self, timeout_s: float = 5.0) -> dict[str, Any]:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(timeout_s)
        s.connect(self.path)
        self._sock = s
        self._reader = _SockReader(s)
        # Read greeting.
        greeting = json.loads(self._reader.read_line(timeout_s=timeout_s))
        if "QMP" not in greeting:
            raise QMPError(f"unexpected QMP greeting: {greeting!r}")
        # Negotiate capabilities (no flags requested).
        self.execute("qmp_capabilities")
        return greeting["QMP"]

    def execute(self, command: str, **arguments: Any) -> Any:
        if self._sock is None or self._reader is None:
            raise QMPError("not connected")
        msg: dict[str, Any] = {"execute": command}
        if arguments:
            msg["arguments"] = arguments
        body = (json.dumps(msg) + "\n").encode("utf-8")
        self._sock.sendall(body)
        # QMP can interleave async events with the response — drain
        # until we see the matching {"return": ...} or {"error": ...}.
        for _ in range(64):  # bounded to avoid an infinite loop on bugs
            line = self._reader.read_line()
            if not line.strip():
                continue
            resp = json.loads(line)
            if "return" in resp:
                return resp["return"]
            if "error" in resp:
                raise QMPError(f"{command}: {resp['error']}")
            # Otherwise it's an async event; ignore and keep reading.
        raise QMPError(f"{command}: too many async events without a response")

    def close(self) -> None:
        if self._sock is not None:
            try:
                self._sock.close()
            except OSError:
                pass
            self._sock = None
            self._reader = None


# ---- row builders ----------------------------------------------------------


def _flatten_blockstats(blockstats: list[dict] | None) -> dict[str, dict[str, int]]:
    """Compact ``query-blockstats`` to ``{device: {rd_ops, wr_ops, ...}}``."""
    out: dict[str, dict[str, int]] = {}
    for entry in blockstats or []:
        name = entry.get("device") or entry.get("qdev") or "unknown"
        s = entry.get("stats") or {}
        out[name] = {
            "rd_ops": int(s.get("rd_operations", 0)),
            "wr_ops": int(s.get("wr_operations", 0)),
            "rd_bytes": int(s.get("rd_bytes", 0)),
            "wr_bytes": int(s.get("wr_bytes", 0)),
            "flush_ops": int(s.get("flush_operations", 0)),
        }
    return out


def collect_once(client: QMPClient, t_mono_origin_ns: int) -> dict[str, Any]:
    row: dict[str, Any] = {
        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
        "t_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
    }

    # query-status is dirt cheap and tells us whether the guest is
    # paused (rare) or running.
    try:
        status = client.execute("query-status")
        row["vm_status"] = status.get("status")
        row["vm_running"] = bool(status.get("running"))
    except QMPError as e:
        log.debug("query-status failed: %s", e)

    try:
        bs = client.execute("query-blockstats")
        row["blockstats"] = _flatten_blockstats(bs)
    except QMPError as e:
        log.debug("query-blockstats failed: %s", e)

    # query-stats is QEMU 7.1+ and the schema varies across versions.
    # We only ask for KVM stats and tolerate any subset of fields.
    try:
        stats = client.execute("query-stats", target="vm")
        row["kvm_stats"] = _summarize_query_stats(stats)
    except QMPError as e:
        log.debug("query-stats not supported: %s", e)

    return row


def _summarize_query_stats(stats_resp: list[dict] | dict) -> dict[str, int]:
    """Reduce ``query-stats`` to a flat name→value map of integer
    counters. The full payload is verbose and version-specific; we only
    ever want individual scalar counters downstream."""
    flat: dict[str, int] = {}
    items = stats_resp if isinstance(stats_resp, list) else [stats_resp]
    for entry in items:
        for s in entry.get("stats", []) or []:
            name = s.get("name")
            value = s.get("value")
            if isinstance(name, str) and isinstance(value, int):
                flat[name] = value
    return flat


# ---- run loop --------------------------------------------------------------


def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    interval_ms: int,
    stop_event: threading.Event,
) -> int:
    """Connect to ``socket_path`` and sample at ``interval_ms`` until
    ``stop_event``. Returns the number of rows written.

    A single missed sample (transient QMP error) is logged and skipped;
    repeated failures terminate the loop so the episode finishes cleanly
    rather than hanging on a dead hypervisor."""
    interval_ns = interval_ms * 1_000_000
    client = QMPClient(socket_path)
    try:
        client.connect(timeout_s=5.0)
    except (OSError, QMPError) as e:
        log.warning("QMP connect to %s failed: %s — collector exits cleanly", socket_path, e)
        return 0

    rows = 0
    consecutive_failures = 0
    next_tick = time.monotonic_ns()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    row = collect_once(client, t_mono_origin_ns)
                    f.write(json.dumps(row) + "\n")
                    rows += 1
                    consecutive_failures = 0
                except (QMPError, OSError, ValueError) as e:  # ValueError: malformed JSON reply
                    consecutive_failures += 1
                    log.warning("QMP sample %d failed: %s", rows, e)
                    if consecutive_failures >= 5:
                        log.warning("5 consecutive QMP failures; bailing")
                        break

                next_tick += interval_ns
                sleep_ns = next_tick - time.monotonic_ns()
                if sleep_ns > 0:
                    stop_event.wait(sleep_ns / 1_000_000_000)
                else:
                    next_tick = time.monotonic_ns()
    finally:
        client.close()
    return rows
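
The handshake and the "drain async events until the reply arrives" rule described in collectors/qmp.py can be tried without QEMU. The sketch below runs a toy server on one end of a `socket.socketpair()` and a minimal client on the other. The fake server, its canned replies and the helper names are assumptions for illustration only; a real QEMU greeting and real responses carry more fields.

```python
import json
import socket
import threading


def _send(f, obj) -> None:
    f.write((json.dumps(obj) + "\n").encode("utf-8"))
    f.flush()


def fake_qmp_server(conn: socket.socket) -> None:
    """Toy peer: greet, then answer each command with a canned reply."""
    f = conn.makefile("rwb")
    _send(f, {"QMP": {"version": {}, "capabilities": []}})
    while True:
        raw = f.readline()
        if not raw:  # client hung up
            break
        req = json.loads(raw)
        if req["execute"] == "query-status":
            # An async event arrives before the actual reply.
            _send(f, {"event": "RESUME"})
            _send(f, {"return": {"status": "running", "running": True}})
        else:
            _send(f, {"return": {}})
    f.close()
    conn.close()


def qmp_call(f, command: str):
    """Send one command and skip events until its return or error shows up."""
    _send(f, {"execute": command})
    while True:
        raw = f.readline()
        if not raw:
            raise ConnectionError("peer closed the QMP channel")
        resp = json.loads(raw)
        if "return" in resp:
            return resp["return"]
        if "error" in resp:
            raise RuntimeError(f"{command}: {resp['error']}")


def demo():
    client_sock, server_sock = socket.socketpair()
    t = threading.Thread(target=fake_qmp_server, args=(server_sock,), daemon=True)
    t.start()
    f = client_sock.makefile("rwb")
    greeting = json.loads(f.readline())       # server speaks first
    caps = qmp_call(f, "qmp_capabilities")    # must precede any other command
    status = qmp_call(f, "query-status")
    f.close()
    client_sock.close()
    t.join(timeout=2.0)
    return greeting, caps, status
```

The client never sees the `RESUME` event in its return value, which is the behaviour `QMPClient.execute()` implements with its bounded drain loop.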

etc/cis490-orchestrator.service
@@ -1,8 +1,8 @@
 [Unit]
-Description=CIS490 lab-host episode orchestrator (queue mode)
+Description=CIS490 lab-host episode orchestrator (fleet mode)
 Documentation=https://maxgit.wg/spectral/CIS490
-# Episodes need KVM and (for Tier 3+) msfrpcd up. msfrpcd is brought
-# up out-of-band; this unit only requires the kernel + WG.
+# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
+# by cis490-msfrpcd.service when installed.
 After=network-online.target wg-quick@wg0.service
 Wants=network-online.target
 
@@ -11,13 +11,18 @@ Type=simple
 User=cis490
 Group=cis490
 WorkingDirectory=/opt/cis490
-# Queue mode is currently a TODO — the binary takes a job-spec file
-# and runs episodes in a loop. Until that lands, this unit stays
-# disabled by default; lab-host operators kick off episodes by hand
-# via tools/run_*.py and let the shipper pick them up.
-ExecStart=/opt/cis490/.venv/bin/python -m orchestrator --queue /var/lib/cis490/data/queue
-Restart=on-failure
-RestartSec=10
+EnvironmentFile=-/etc/cis490/lab-host.toml.env
+# Fleet mode: detect host capacity, run that many concurrent episodes
+# per wave with samples drawn from the manifest. Each invocation runs
+# one wave and exits; systemd respawns per Restart= below, giving us
+# a continuous stream of fresh-sample episodes per host. The shipper
+# picks them up as `done.marker` files appear.
+ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
+    --data-root /var/lib/cis490/data \
+    --manifest /opt/cis490/samples/manifest.toml \
+    --waves 1
+Restart=always
+RestartSec=15
 
 # Hardening
 NoNewPrivileges=true
|
|
|
|||
|
|
@@ -36,7 +36,7 @@ from datetime import datetime, timezone
|
|||
from pathlib import Path
|
||||
from typing import Callable
|
||||
|
||||
from collectors import proc_qemu
|
||||
from collectors import guest_agent, proc_qemu, qmp
|
||||
|
||||
from .ulid import new_ulid
|
||||
|
||||
|
|
@@ -61,6 +61,11 @@ class EpisodeConfig:
|
|||
# When set, walk this schedule and ignore duration_s for sleep timing.
|
||||
# ``duration_s`` still goes in meta.schedule for record-keeping.
|
||||
phase_schedule: PhaseSchedule | None = None
|
||||
# Optional: paths to QEMU sockets exposed by the launcher. When
|
||||
# set, EpisodeRunner spins up additional collector threads.
|
||||
qmp_socket: Path | None = None
|
||||
qmp_interval_ms: int = 1000 # QMP queries are heavier than /proc reads
|
||||
guest_agent_socket: Path | None = None
|
||||
|
||||
|
||||
@dataclass
|
||||
|
|
@@ -68,8 +73,10 @@ class EpisodeResult:
|
|||
episode_id: str
|
||||
episode_dir: Path
|
||||
rows_proc: int
|
||||
pid_disappeared: bool
|
||||
duration_observed_s: float
|
||||
rows_qmp: int = 0
|
||||
rows_guest: int = 0
|
||||
pid_disappeared: bool = False
|
||||
duration_observed_s: float = 0.0
|
||||
phases_observed: list[str] = field(default_factory=list)
|
||||
|
||||
|
||||
|
|
@@ -102,10 +109,10 @@ class EpisodeRunner:
|
|||
|
||||
self.emit_event("snapshot_load", snapshot=self.cfg.snapshot_name)
|
||||
|
||||
rows_holder: dict[str, int] = {"rows": 0}
|
||||
rows_holder: dict[str, int] = {"proc": 0, "qmp": 0, "guest": 0}
|
||||
|
||||
def _collector() -> None:
|
||||
rows_holder["rows"] = proc_qemu.run_loop(
|
||||
def _proc_collector() -> None:
|
||||
rows_holder["proc"] = proc_qemu.run_loop(
|
||||
pid=self.cfg.target_pid,
|
||||
output_path=self.episode_dir / "telemetry-proc.jsonl",
|
||||
t_mono_origin_ns=self._t_mono_origin_ns,
|
||||
|
|
@@ -113,7 +120,32 @@ class EpisodeRunner:
|
|||
stop_event=self._stop,
|
||||
)
|
||||
|
||||
t = threading.Thread(target=_collector, daemon=True, name="proc_qemu")
|
||||
def _qmp_collector() -> None:
|
||||
assert self.cfg.qmp_socket is not None
|
||||
rows_holder["qmp"] = qmp.run_loop(
|
||||
socket_path=self.cfg.qmp_socket,
|
||||
output_path=self.episode_dir / "telemetry-qmp.jsonl",
|
||||
t_mono_origin_ns=self._t_mono_origin_ns,
|
||||
interval_ms=self.cfg.qmp_interval_ms,
|
||||
stop_event=self._stop,
|
||||
)
|
||||
|
||||
def _guest_collector() -> None:
|
||||
assert self.cfg.guest_agent_socket is not None
|
||||
rows_holder["guest"] = guest_agent.run_loop(
|
||||
socket_path=self.cfg.guest_agent_socket,
|
||||
output_path=self.episode_dir / "telemetry-guest.jsonl",
|
||||
t_mono_origin_ns=self._t_mono_origin_ns,
|
||||
stop_event=self._stop,
|
||||
)
|
||||
|
||||
threads: list[threading.Thread] = []
|
||||
threads.append(threading.Thread(target=_proc_collector, daemon=True, name="proc_qemu"))
|
||||
if self.cfg.qmp_socket is not None:
|
||||
threads.append(threading.Thread(target=_qmp_collector, daemon=True, name="qmp"))
|
||||
if self.cfg.guest_agent_socket is not None:
|
||||
threads.append(threading.Thread(target=_guest_collector, daemon=True, name="guest_agent"))
|
||||
for t in threads:
|
||||
t.start()
|
||||
|
||||
phases_observed: list[str] = []
|
||||
|
|
@@ -126,7 +158,8 @@ class EpisodeRunner:
|
|||
self._stop.wait(timeout=self.cfg.duration_s)
|
||||
finally:
|
||||
self._stop.set()
|
||||
t.join(timeout=2.0)
|
||||
for t in threads:
|
||||
t.join(timeout=3.0)
|
||||
|
||||
pid_alive = _pid_alive(self.cfg.target_pid)
|
||||
self.emit_event("episode_end", target_pid_alive=pid_alive)
|
||||
|
|
@@ -135,7 +168,9 @@ class EpisodeRunner:
|
|||
meta["ended_at_wall"] = datetime.now(timezone.utc).isoformat()
|
||||
meta["result"] = {
|
||||
"phases_observed": phases_observed,
|
||||
"rows_proc": rows_holder["rows"],
|
||||
"rows_proc": rows_holder["proc"],
|
||||
"rows_qmp": rows_holder["qmp"],
|
||||
"rows_guest": rows_holder["guest"],
|
||||
"pid_alive_at_end": pid_alive,
|
||||
"duration_observed_s": end_mono_ns / 1_000_000_000,
|
||||
}
|
||||
|
|
@@ -143,16 +178,18 @@ class EpisodeRunner:
|
|||
(self.episode_dir / "done.marker").touch()
|
||||
|
||||
log.info(
|
||||
"episode %s complete: rows=%d duration=%.2fs phases=%s",
|
||||
"episode %s complete: proc=%d qmp=%d guest=%d duration=%.2fs phases=%s",
|
||||
self.episode_id,
|
||||
rows_holder["rows"],
|
||||
rows_holder["proc"], rows_holder["qmp"], rows_holder["guest"],
|
||||
end_mono_ns / 1e9,
|
||||
phases_observed,
|
||||
)
|
||||
return EpisodeResult(
|
||||
episode_id=self.episode_id,
|
||||
episode_dir=self.episode_dir,
|
||||
rows_proc=rows_holder["rows"],
|
||||
rows_proc=rows_holder["proc"],
|
||||
rows_qmp=rows_holder["qmp"],
|
||||
rows_guest=rows_holder["guest"],
|
||||
pid_disappeared=not pid_alive,
|
||||
duration_observed_s=end_mono_ns / 1_000_000_000,
|
||||
phases_observed=phases_observed,
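Note on the collector wiring in the hunks above: each optional collector runs in its own daemon thread, writes its row count into a shared dict, and is joined with a timeout once a common stop event is set. A minimal sketch of that shape follows, assuming hypothetical collector callables that take the stop event and return a row count. `run_collectors` and `fake_collector` are illustrative names, not part of the orchestrator.

```python
import threading
from typing import Callable

Collector = Callable[[threading.Event], int]


def run_collectors(
    collectors: dict[str, Collector],
    run_for_s: float,
    join_timeout_s: float = 3.0,
) -> dict[str, int]:
    """Run every collector in its own thread; return rows per collector."""
    stop = threading.Event()
    rows = {name: 0 for name in collectors}

    def _make_target(name: str, fn: Collector) -> Callable[[], None]:
        def _target() -> None:
            rows[name] = fn(stop)
        return _target

    threads = [
        threading.Thread(target=_make_target(name, fn), daemon=True, name=name)
        for name, fn in collectors.items()
    ]
    for t in threads:
        t.start()
    try:
        stop.wait(timeout=run_for_s)
    finally:
        stop.set()
        for t in threads:
            # A collector that misses the timeout keeps its initial count of 0.
            t.join(timeout=join_timeout_s)
    return rows


def fake_collector(stop: threading.Event) -> int:
    """Stand-in collector: counts one row per 10 ms until stopped."""
    n = 0
    while True:
        n += 1
        if stop.wait(0.01):
            break
    return n


counts = run_collectors({"proc": fake_collector, "qmp": fake_collector}, run_for_s=0.05)
```

Keeping one dict key per collector means a disabled collector simply reports zero rows, which matches the defaulted `rows_qmp` and `rows_guest` fields in the result.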
|
||||
|
|
|
|||
362
orchestrator/fleet.py
Normal file
|
|
@@ -0,0 +1,362 @@
|
|||
"""Fleet runner — concurrent VM episodes with resource awareness.
|
||||
|
||||
The lab host detects its own capacity, picks how many VMs to run in
|
||||
parallel without driving the box into swap or starving the host
|
||||
itself, and runs that many episodes simultaneously. Each slot gets a
|
||||
``Sample`` from the manifest (deterministically chosen by
|
||||
host_id + slot + episode index), so concurrent VMs produce varied,
|
||||
labelable data.
|
||||
|
||||
Capacity heuristic — defaults documented inline so they're auditable:
|
||||
|
||||
cores_total = os.cpu_count()
|
||||
cores_reserved = max(1, cores_total // 8) # host + collectors
|
||||
ram_per_vm_mib = 320 # Alpine fits in 256
|
||||
# but leave 64 for
|
||||
# overhead (qemu+ovmf)
|
||||
ram_headroom_mib = max(1024, ram_total // 8) # never starve host
|
||||
max_by_cores = cores_total - cores_reserved
|
||||
max_by_ram = (ram_available - ram_headroom) // ram_per_vm
|
||||
max_by_load = max(1, max_by_cores // 2) if load_1m / cores_total > 0.75 else max_by_cores
|
||||
|
||||
The smallest of these wins. The reasoning string is logged + saved
|
||||
into each episode's meta.json under ``fleet`` so post-hoc analysis
|
||||
can correlate "this episode was run when 6 VMs were concurrent" with
|
||||
its observed envelope.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import shutil
|
||||
import signal
|
||||
import subprocess
|
||||
import threading
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
from samples.manifest import Sample, SampleManifest
|
||||
|
||||
|
||||
log = logging.getLogger("cis490.fleet")
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class FleetCapacity:
|
||||
cores_total: int
|
||||
cores_reserved: int
|
||||
ram_total_mib: int
|
||||
ram_available_mib: int
|
||||
ram_per_vm_mib: int
|
||||
ram_headroom_mib: int
|
||||
load_1m: float
|
||||
max_by_cores: int
|
||||
max_by_ram: int
|
||||
max_by_load: int
|
||||
max_concurrent: int
|
||||
rationale: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"cores_total": self.cores_total,
|
||||
"cores_reserved": self.cores_reserved,
|
||||
"ram_total_mib": self.ram_total_mib,
|
||||
"ram_available_mib": self.ram_available_mib,
|
||||
"ram_per_vm_mib": self.ram_per_vm_mib,
|
||||
"ram_headroom_mib": self.ram_headroom_mib,
|
||||
"load_1m": self.load_1m,
|
||||
"max_by_cores": self.max_by_cores,
|
||||
"max_by_ram": self.max_by_ram,
|
||||
"max_by_load": self.max_by_load,
|
||||
"max_concurrent": self.max_concurrent,
|
||||
"rationale": self.rationale,
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class FleetConfig:
|
||||
host_id: str
|
||||
repo_root: Path
|
||||
data_root: Path
|
||||
manifest: SampleManifest
|
||||
# VM resource shape — must match what the launcher requests.
|
||||
ram_per_vm_mib: int = 320
|
||||
# Cap concurrency below the calculated max (e.g. for a smoke test).
|
||||
max_concurrent_override: int | None = None
|
||||
# Skip episodes whose sample requires a real binary that's not present.
|
||||
require_real_samples: bool = False
|
||||
|
||||
|
||||
def _read_meminfo() -> dict[str, int]:
|
||||
out: dict[str, int] = {}
|
||||
try:
|
||||
with open("/proc/meminfo") as f:
|
||||
for line in f:
|
||||
k, _, rest = line.partition(":")
|
||||
v = rest.strip()
|
||||
if v.endswith(" kB"):
|
||||
try:
|
||||
out[k] = int(v[:-3]) * 1024
|
||||
except ValueError:
|
||||
pass
|
||||
except OSError:
|
||||
pass
|
||||
return out
|
||||
|
||||
|
||||
def _read_loadavg() -> float:
|
||||
try:
|
||||
with open("/proc/loadavg") as f:
|
||||
return float(f.read().split()[0])
|
||||
except (OSError, ValueError, IndexError):
|
||||
return 0.0
|
||||
|
||||
|
||||
def detect_capacity(*, ram_per_vm_mib: int = 320) -> FleetCapacity:
|
||||
cores_total = os.cpu_count() or 1
|
||||
# Reserve at least 1 core, more if the host has many.
|
||||
cores_reserved = max(1, cores_total // 8)
|
||||
|
||||
mem = _read_meminfo()
|
||||
ram_total_b = mem.get("MemTotal", 0)
|
||||
ram_avail_b = mem.get("MemAvailable", ram_total_b)
|
||||
ram_total_mib = ram_total_b // (1024 * 1024)
|
||||
ram_available_mib = ram_avail_b // (1024 * 1024)
|
||||
# Always leave the host at least 1/8 of its memory (minimum 1 GiB).
|
||||
ram_headroom_mib = max(1024, ram_total_mib // 8)
|
||||
|
||||
load_1m = _read_loadavg()
|
||||
|
||||
max_by_cores = max(0, cores_total - cores_reserved)
|
||||
if ram_per_vm_mib <= 0:
|
||||
max_by_ram = max_by_cores
|
||||
else:
|
||||
max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
|
||||
|
||||
# Load-based cap: if the host is already busy, run fewer VMs.
|
||||
if cores_total and load_1m / cores_total > 0.75:
|
||||
# Halve, floor 1.
|
||||
max_by_load = max(1, max_by_cores // 2)
|
||||
else:
|
||||
max_by_load = max_by_cores
|
||||
|
||||
candidates = [max_by_cores, max_by_ram, max_by_load]
|
||||
max_concurrent = max(0, min(candidates))
|
||||
|
||||
binding = ["cores", "ram", "load"][candidates.index(max_concurrent)] \
|
||||
if max_concurrent < max_by_cores else "cores"
|
||||
rationale = (
|
||||
f"cores_total={cores_total} reserved={cores_reserved} "
|
||||
f"ram_avail_mib={ram_available_mib} headroom={ram_headroom_mib} "
|
||||
f"per_vm={ram_per_vm_mib} load_1m={load_1m:.2f} "
|
||||
f"-> max_concurrent={max_concurrent} (binding={binding})"
|
||||
)
|
||||
log.info("capacity: %s", rationale)
|
||||
|
||||
return FleetCapacity(
|
||||
cores_total=cores_total,
|
||||
cores_reserved=cores_reserved,
|
||||
ram_total_mib=ram_total_mib,
|
||||
ram_available_mib=ram_available_mib,
|
||||
ram_per_vm_mib=ram_per_vm_mib,
|
||||
ram_headroom_mib=ram_headroom_mib,
|
||||
load_1m=load_1m,
|
||||
max_by_cores=max_by_cores,
|
||||
max_by_ram=max_by_ram,
|
||||
max_by_load=max_by_load,
|
||||
max_concurrent=max_concurrent,
|
||||
rationale=rationale,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Per-slot episode execution
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class SlotResult:
|
||||
slot: int
|
||||
sample_name: str
|
||||
sample_kind: str
|
||||
episode_id: str | None
|
||||
rc: int
|
||||
duration_s: float
|
||||
error: str | None = None
|
||||
extra: dict = field(default_factory=dict)
|
||||
|
||||
|
||||
def _run_slot(
|
||||
cfg: FleetConfig,
|
||||
slot: int,
|
||||
sample: Sample,
|
||||
episode_index: int,
|
||||
capacity: FleetCapacity,
|
||||
) -> SlotResult:
|
||||
"""Run one Tier-2-shaped episode in a dedicated slot.
|
||||
|
||||
For now the per-slot driver shells out to ``tools/run_real_vm_demo.py``
|
||||
with SLOT and SAMPLE_PROFILE env so the launcher gives us a unique RUN_DIR
|
||||
and the load mimic varies by sample. When the Tier-3/4 paths land,
|
||||
add a sample-kind dispatch here."""
|
||||
env = os.environ.copy()
|
||||
env["SLOT"] = str(slot)
|
||||
env["RUN_DIR"] = f"/tmp/cis490-vm-fleet-{slot}"
|
||||
env["SAMPLE_NAME"] = sample.name
|
||||
env["SAMPLE_PROFILE"] = sample.profile
|
||||
env["SAMPLE_KIND"] = sample.kind
|
||||
env["FLEET_HOST_ID"] = cfg.host_id
|
||||
env["FLEET_EPISODE_INDEX"] = str(episode_index)
|
||||
env["FLEET_MAX_CONCURRENT"] = str(capacity.max_concurrent)
|
||||
|
||||
log_dir = cfg.data_root / "fleet-logs"
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_log = log_dir / f"slot-{slot}-ep-{episode_index}.log"
|
||||
|
||||
started = time.monotonic()
|
||||
try:
|
||||
with out_log.open("ab") as logf:
|
||||
proc = subprocess.run(
|
||||
[
|
||||
"/usr/bin/env", "python3",
|
||||
str(cfg.repo_root / "tools" / "run_real_vm_demo.py"),
|
||||
"--data-root", str(cfg.data_root),
|
||||
],
|
||||
cwd=str(cfg.repo_root),
|
||||
env=env,
|
||||
stdout=logf,
|
||||
stderr=subprocess.STDOUT,
|
||||
check=False,
|
||||
)
|
||||
rc = proc.returncode
|
||||
err = None
|
||||
except (OSError, subprocess.SubprocessError) as e:
|
||||
rc = -1
|
||||
err = str(e)
|
||||
duration = time.monotonic() - started
|
||||
|
||||
return SlotResult(
|
||||
slot=slot,
|
||||
sample_name=sample.name,
|
||||
sample_kind=sample.kind,
|
||||
episode_id=None, # parsed from the log later by the driver
|
||||
rc=rc,
|
||||
duration_s=duration,
|
||||
error=err,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FleetRunner
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class FleetRunResult:
|
||||
capacity: FleetCapacity
|
||||
slots: list[SlotResult]
|
||||
total_duration_s: float
|
||||
|
||||
|
||||
class FleetRunner:
|
||||
def __init__(self, cfg: FleetConfig) -> None:
|
||||
self.cfg = cfg
|
||||
self._stop = threading.Event()
|
||||
|
||||
def stop(self) -> None:
|
||||
self._stop.set()
|
||||
|
||||
def run(
|
||||
self,
|
||||
*,
|
||||
episodes: int = 1,
|
||||
episode_index_base: int = 0,
|
||||
capacity_override: FleetCapacity | None = None,
|
||||
) -> FleetRunResult:
|
||||
capacity = capacity_override or detect_capacity(
|
||||
ram_per_vm_mib=self.cfg.ram_per_vm_mib,
|
||||
)
|
||||
n_slots = capacity.max_concurrent
|
||||
if self.cfg.max_concurrent_override is not None:
|
||||
n_slots = min(n_slots, self.cfg.max_concurrent_override)
|
||||
if n_slots <= 0:
|
||||
log.warning(
|
||||
"fleet capacity is zero (%s); cannot run", capacity.rationale,
|
||||
)
|
||||
return FleetRunResult(
|
||||
capacity=capacity, slots=[], total_duration_s=0.0,
|
||||
)
|
||||
|
||||
log.info(
|
||||
"fleet host=%s slots=%d episodes=%d manifest_size=%d",
|
||||
self.cfg.host_id, n_slots, episodes, len(self.cfg.manifest),
|
||||
)
|
||||
|
||||
all_results: list[SlotResult] = []
|
||||
t_start = time.monotonic()
|
||||
for ep in range(episodes):
|
||||
if self._stop.is_set():
|
||||
break
|
||||
episode_index = episode_index_base + ep
|
||||
slot_samples = [
|
||||
self.cfg.manifest.select(
|
||||
host_id=self.cfg.host_id,
|
||||
slot=slot,
|
||||
episode_index=episode_index,
|
||||
)
|
||||
for slot in range(n_slots)
|
||||
]
|
||||
if self.cfg.require_real_samples:
|
||||
slot_samples = [s for s in slot_samples if s.kind == "real"]
|
||||
if not slot_samples:
|
||||
log.warning("require_real_samples: no real samples selected for this wave; skipping")
|
||||
continue
|
||||
|
||||
log.info(
|
||||
"wave %d/%d: %s",
|
||||
ep + 1, episodes,
|
||||
[(i, s.name, s.kind) for i, s in enumerate(slot_samples)],
|
||||
)
|
||||
|
||||
with ThreadPoolExecutor(max_workers=n_slots) as pool:
|
||||
futures = [
|
||||
pool.submit(
|
||||
_run_slot, self.cfg, slot, sample, episode_index, capacity,
|
||||
)
|
||||
for slot, sample in enumerate(slot_samples)
|
||||
]
|
||||
for fut in as_completed(futures):
|
||||
res = fut.result()
|
||||
log.info(
|
||||
"slot %d sample=%s rc=%d duration=%.1fs",
|
||||
res.slot, res.sample_name, res.rc, res.duration_s,
|
||||
)
|
||||
all_results.append(res)
|
||||
|
||||
total = time.monotonic() - t_start
|
||||
return FleetRunResult(
|
||||
capacity=capacity,
|
||||
slots=all_results,
|
||||
total_duration_s=total,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Friendly capacity report (used by tools/run_fleet.py --capacity)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def capacity_report() -> str:
|
||||
c = detect_capacity()
|
||||
return (
|
||||
f"cores: {c.cores_total} (reserve {c.cores_reserved})\n"
|
||||
f"ram: {c.ram_total_mib} MiB total, {c.ram_available_mib} MiB available "
|
||||
f"(headroom {c.ram_headroom_mib} MiB, per-vm {c.ram_per_vm_mib} MiB)\n"
|
||||
f"load: 1m={c.load_1m:.2f}\n"
|
||||
f"caps: by_cores={c.max_by_cores}, by_ram={c.max_by_ram}, "
|
||||
f"by_load={c.max_by_load}\n"
|
||||
f"--> max_concurrent VMs: {c.max_concurrent}\n"
|
||||
)
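Worked example of the capacity heuristic in orchestrator/fleet.py: the sketch below restates the arithmetic of `detect_capacity` as a pure function so the caps can be checked by hand. The function name `max_concurrent_vms` is illustrative, and the sketch assumes `cores_total >= 1` and `ram_per_vm_mib > 0`. The real code also handles missing /proc data and records a rationale string.

```python
def max_concurrent_vms(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> int:
    """Smallest of the core, RAM and load caps, mirroring the heuristic."""
    cores_reserved = max(1, cores_total // 8)
    ram_headroom_mib = max(1024, ram_total_mib // 8)

    max_by_cores = max(0, cores_total - cores_reserved)
    max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
    host_is_busy = load_1m / cores_total > 0.75
    max_by_load = max(1, max_by_cores // 2) if host_is_busy else max_by_cores

    return max(0, min(max_by_cores, max_by_ram, max_by_load))


# 8 cores, 16 GiB, idle: cores bind (8 - 1 = 7).
idle = max_concurrent_vms(8, 16384, 14000, load_1m=0.0)
# Same box at load 7.0: 7.0 / 8 > 0.75, so the load cap halves 7 to 3.
loaded = max_concurrent_vms(8, 16384, 14000, load_1m=7.0)
# 4 GiB box with 2000 MiB free: (2000 - 1024) // 320 = 3, so RAM binds.
tight = max_concurrent_vms(8, 4096, 2000, load_1m=0.0)
```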
|
||||
0
samples/__init__.py
Normal file
105
samples/manifest.py
Normal file
|
|
@@ -0,0 +1,105 @@
|
|||
"""Sample manifest loader + per-(host, slot) deterministic selection.
|
||||
|
||||
The manifest at ``samples/manifest.toml`` defines the catalog of
|
||||
samples (real or mimic) the fleet draws from. Selection is
|
||||
**deterministic** given ``(host_id, slot, episode_index)``, so two lab
|
||||
hosts on the same fleet usually pick *different* samples for the same
|
||||
slot index. Selection is hash-based, so occasional repeats are possible.
|
||||
|
||||
This gives us "all hosts on the network generating novel data" without
|
||||
needing a coordinator: every host's `host_id` seeds its own
|
||||
sample-rotation order, and the orderings spread across the catalog.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import tomllib
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
_VALID_CATEGORIES = {
|
||||
"cryptominer", "botnet", "ransomware", "banking-trojan",
|
||||
"fileless", "rat", "worm", "loader", "wiper", "other",
|
||||
}
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Sample:
|
||||
name: str
|
||||
family: str
|
||||
category: str
|
||||
profile: str
|
||||
description: str = ""
|
||||
source: str | None = None
|
||||
sha256: str | None = None
|
||||
url: str | None = None
|
||||
|
||||
@property
|
||||
def kind(self) -> str:
|
||||
"""``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
|
||||
Trainers filter on this so the realistic-model pipeline only
|
||||
consumes real-malware episodes."""
|
||||
return "real" if self.sha256 else "mimic"
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SampleManifest:
|
||||
samples: list[Sample] = field(default_factory=list)
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self.samples)
|
||||
|
||||
def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
|
||||
"""Deterministic selection. The host_id mixes into the seed so
|
||||
different hosts visit the catalog in different orders; slot +
|
||||
episode_index tick within a host. Same inputs always give the
|
||||
same sample — replay-friendly for debugging."""
|
||||
if not self.samples:
|
||||
raise ValueError("manifest is empty")
|
||||
# SHA-256 of the seed gives a uniformly distributed integer.
|
||||
seed = f"{host_id}|{slot}|{episode_index}".encode()
|
||||
h = hashlib.sha256(seed).digest()
|
||||
idx = int.from_bytes(h[:8], "big") % len(self.samples)
|
||||
return self.samples[idx]
|
||||
|
||||
@classmethod
|
||||
def load(cls, path: str | Path) -> "SampleManifest":
|
||||
with open(path, "rb") as f:
|
||||
data = tomllib.load(f)
|
||||
raw = data.get("sample") or []
|
||||
if not isinstance(raw, list):
|
||||
raise ValueError(f"{path}: 'sample' must be an array of tables")
|
||||
|
||||
samples: list[Sample] = []
|
||||
for i, entry in enumerate(raw):
|
||||
if not isinstance(entry, dict):
|
||||
raise ValueError(f"{path}: sample[{i}] is not a table")
|
||||
for key in ("name", "family", "category", "profile"):
|
||||
if not isinstance(entry.get(key), str) or not entry[key]:
|
||||
raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
|
||||
if entry["category"] not in _VALID_CATEGORIES:
|
||||
raise ValueError(
|
||||
f"{path}: sample[{i}] category {entry['category']!r} "
|
||||
f"not in {sorted(_VALID_CATEGORIES)}"
|
||||
)
|
||||
samples.append(Sample(
|
||||
name=entry["name"],
|
||||
family=entry["family"],
|
||||
category=entry["category"],
|
||||
profile=entry["profile"],
|
||||
description=entry.get("description", ""),
|
||||
source=entry.get("source"),
|
||||
sha256=entry.get("sha256"),
|
||||
url=entry.get("url"),
|
||||
))
|
||||
|
||||
# Reject duplicate names — trainers join on this.
|
||||
seen: set[str] = set()
|
||||
for s in samples:
|
||||
if s.name in seen:
|
||||
raise ValueError(f"{path}: duplicate sample name {s.name!r}")
|
||||
seen.add(s.name)
|
||||
|
||||
return cls(samples=samples)
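Note on `SampleManifest.select`: the index is a SHA-256 hash of the `(host_id, slot, episode_index)` triple reduced modulo the catalog size. That makes selection reproducible and spreads hosts across the catalog, but it is not a permutation, so two slots in one wave can land on the same sample. The standalone sketch below (the name `select_index` is illustrative) shows the same computation and counts how many distinct samples one wave actually gets.

```python
import hashlib


def select_index(host_id: str, slot: int, episode_index: int, catalog_size: int) -> int:
    """Hash the triple and reduce it modulo the catalog size."""
    if catalog_size <= 0:
        raise ValueError("catalog is empty")
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % catalog_size


CATALOG_SIZE = 6  # the sample manifest in this change has six entries
picks_a = [select_index("lab-a", slot, 0, CATALOG_SIZE) for slot in range(6)]
picks_b = [select_index("lab-b", slot, 0, CATALOG_SIZE) for slot in range(6)]

# Fewer than 6 distinct values means at least two slots share a sample.
distinct_in_wave_a = len(set(picks_a))
```

If strictly distinct samples per wave were required, one option would be to shuffle the catalog with a generator seeded from `host_id` and `episode_index` and index it by slot. That is a possible alternative, not what the code in this change does.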
|
||||
61
samples/manifest.toml
Normal file
|
|
@@ -0,0 +1,61 @@
|
|||
# Sample manifest — what each fleet slot picks from.
|
||||
#
|
||||
# Each entry has three things:
|
||||
# - identity (name, family, category) for labeling
|
||||
# - acquisition (source, sha256, url) for reproducibility
|
||||
# - behaviour (profile) so the synthetic load mimic can run a
|
||||
# reasonable proxy until the real sample lands at vm/images/
|
||||
#
|
||||
# When the real malware binary is present at samples/store/<sha256>,
|
||||
# the orchestrator runs THAT inside the guest. When it's absent, the
|
||||
# orchestrator falls back to running tools/load_mimic.py with the
|
||||
# matching profile so the fleet still produces *labeled, varied* data
|
||||
# while we collect the real samples. Either way, meta.json records
|
||||
# which path the episode took, so trainers can filter on
|
||||
# meta.sample.kind ∈ {real, mimic}.
|
||||
|
||||
[[sample]]
|
||||
name = "xmrig-cryptominer"
|
||||
family = "XMRig"
|
||||
category = "cryptominer"
|
||||
profile = "cpu-saturate"
|
||||
# A real XMRig fetch goes here when MalwareBazaar pull is wired up:
|
||||
# source = "MalwareBazaar"
|
||||
# sha256 = "TBD"
|
||||
# url = "https://bazaar.abuse.ch/sample/TBD/"
|
||||
description = "Sustained 1-vCPU saturation, very low IO/net. Pure compute."
|
||||
|
||||
[[sample]]
|
||||
name = "mirai-class-bot"
|
||||
family = "Mirai"
|
||||
category = "botnet"
|
||||
profile = "scan-and-dial"
|
||||
description = "SYN scans across the bridge IP space + periodic dial-home. High net, low CPU."
|
||||
|
||||
[[sample]]
|
||||
name = "ransomware-mimic"
|
||||
family = "Cryptolocker-class"
|
||||
category = "ransomware"
|
||||
profile = "io-walk"
|
||||
description = "Heavy disk write + filesystem walk producing a per-file overwrite envelope."
|
||||
|
||||
[[sample]]
|
||||
name = "dridex-class-trojan"
|
||||
family = "Dridex"
|
||||
category = "banking-trojan"
|
||||
profile = "bursty-c2"
|
||||
description = "Long idle, periodic short bursts of TCP egress to a fixed peer (C2 beacon shape)."
|
||||
|
||||
[[sample]]
|
||||
name = "kovter-class-stealth"
|
||||
family = "Kovter"
|
||||
category = "fileless"
|
||||
profile = "low-and-slow"
|
||||
description = "Low CPU, periodic memory churn, no persistent on-disk artifacts. Hardest to label from /proc alone."
|
||||
|
||||
[[sample]]
|
||||
name = "reverse-shell-resident"
|
||||
family = "Reverse-Shell"
|
||||
category = "rat"
|
||||
profile = "shell-resident"
|
||||
description = "Single TCP socket pinned to an attacker IP, occasional command bursts."
|
||||
69
scripts/fetch-metasploitable2.sh
Executable file
|
|
@@ -0,0 +1,69 @@
|
|||
#!/usr/bin/env bash
|
||||
# Fetch + sha256-verify the Metasploitable2 disk image.
|
||||
#
|
||||
# Rapid7's official download is gated behind a registration form, so
|
||||
# we take the URL + sha256 from env vars; both are required and there
|
||||
# are no built-in defaults. Operators run this once per lab host.
|
||||
#
|
||||
# Inputs (env):
|
||||
# IMAGE_URL — direct download URL for the metasploitable2 archive
|
||||
# IMAGE_SHA256 — expected sha256 of the archive
|
||||
# OUT_DIR — where to drop the qcow2 (default vm/images/)
|
||||
#
|
||||
# Outputs:
|
||||
# $OUT_DIR/metasploitable2.qcow2 — converted from the original VMDK
|
||||
# if needed.
|
||||
#
|
||||
# We do NOT bake an image url+hash into the repo because the canonical
|
||||
# distribution is a registration-walled zip on Rapid7. Operators must
|
||||
# supply both; the rest is mechanical.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
IMAGE_URL="${IMAGE_URL:-}"
|
||||
IMAGE_SHA256="${IMAGE_SHA256:-}"
|
||||
# Resolve the repo root first so a missing vm/images/ does not abort under `set -e`.
OUT_DIR="${OUT_DIR:-$(cd "$(dirname "$0")/.." && pwd)/vm/images}"
|
||||
WORK_DIR="${WORK_DIR:-/tmp/cis490-metasploitable-fetch}"
|
||||
|
||||
log() { printf '[fetch-metasploitable2] %s\n' "$*" >&2; }
|
||||
die() { log "FATAL: $*"; exit 1; }
|
||||
|
||||
[[ -n "$IMAGE_URL" ]] || die "set IMAGE_URL to the Metasploitable2 download URL"
|
||||
[[ -n "$IMAGE_SHA256" ]] || die "set IMAGE_SHA256 to the expected sha256 of the archive"
|
||||
|
||||
mkdir -p "$OUT_DIR" "$WORK_DIR"
|
||||
|
||||
ARCHIVE="$WORK_DIR/$(basename "$IMAGE_URL")"
|
||||
log "downloading $IMAGE_URL → $ARCHIVE"
|
||||
if [[ -f "$ARCHIVE" ]]; then
|
||||
log "archive already present; skipping download"
|
||||
else
|
||||
curl -fL --retry 3 --retry-delay 5 -o "$ARCHIVE.partial" "$IMAGE_URL"
|
||||
mv "$ARCHIVE.partial" "$ARCHIVE"
|
||||
fi
|
||||
|
||||
log "verifying sha256"
|
||||
ACTUAL="$(sha256sum "$ARCHIVE" | awk '{print $1}')"
|
||||
if [[ "$ACTUAL" != "$IMAGE_SHA256" ]]; then
|
||||
die "sha256 mismatch: expected $IMAGE_SHA256, got $ACTUAL"
|
||||
fi
|
||||
log "sha256 ok"
|
||||
|
||||
# Extract — handle either zip or 7z, since various mirrors choose one
|
||||
# or the other.
|
||||
case "$ARCHIVE" in
|
||||
*.zip) ( cd "$WORK_DIR" && unzip -o "$ARCHIVE" ) ;;
|
||||
*.7z|*.7zip) command -v 7z >/dev/null || die "7z not installed"; \
|
||||
( cd "$WORK_DIR" && 7z x -y "$ARCHIVE" ) ;;
|
||||
*) die "unsupported archive type: $ARCHIVE" ;;
|
||||
esac
|
||||
|
||||
VMDK="$(find "$WORK_DIR" -name 'Metasploitable*.vmdk' -print -quit)"
|
||||
[[ -n "$VMDK" ]] || die "no Metasploitable*.vmdk in extracted archive"
|
||||
|
||||
log "converting $VMDK → qcow2"
|
||||
command -v qemu-img >/dev/null || die "qemu-img required (apt install qemu-utils)"
|
||||
qemu-img convert -O qcow2 "$VMDK" "$OUT_DIR/metasploitable2.qcow2"
|
||||
|
||||
log "done: $OUT_DIR/metasploitable2.qcow2"
|
||||
log "Tier-3 ready when msfrpcd is up. See scripts/install-msfrpcd.sh."
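Note on the verification step above: the script hashes the whole archive with `sha256sum` and refuses to continue on a mismatch. For callers that want the same check from Python, a minimal sketch follows. `sha256_of` and `verify_archive` are illustrative names, and the file contents are a stand-in, not a real Metasploitable2 archive. Unlike the shell comparison, this sketch lowercases the expected digest before comparing.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large images need little memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> bool:
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected_sha256.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.zip"
    payload = b"stand-in bytes, not a real archive"
    archive.write_bytes(payload)
    matches = verify_archive(archive, hashlib.sha256(payload).hexdigest())
    rejects = verify_archive(archive, "0" * 64)
```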
|
||||
124
scripts/install-msfrpcd.sh
Executable file
|
|
@@ -0,0 +1,124 @@
|
|||
#!/usr/bin/env bash
|
||||
# Install + configure ``msfrpcd`` for the Tier-3 exploit driver.
|
||||
#
|
||||
# Idempotent: re-running on a host that already has msfrpcd refreshes
|
||||
# the systemd unit, keeps the existing password, and doesn't reinstall the framework.
|
||||
#
|
||||
# Steps:
|
||||
# 1. Install metasploit-framework via the host package manager (or
|
||||
# report the right one-liner for that distro). Big download —
|
||||
# ~1 GiB and several minutes.
|
||||
# 2. Generate a strong password and store at /etc/cis490/msfrpc.env
|
||||
# (mode 0640, owner root:cis490).
|
||||
# 3. Drop /etc/systemd/system/cis490-msfrpcd.service that runs
|
||||
# msfrpcd bound to 127.0.0.1:55553 with the generated password.
|
||||
# 4. Enable + start.
|
||||
#
|
||||
# After this runs, ``MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env;
|
||||
# echo $MSFRPC_PASSWORD)`` makes tools/run_tier3_demo.py work zero-touch.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
ETC_ROOT="/etc/cis490"
|
||||
ENV_FILE="$ETC_ROOT/msfrpc.env"
|
||||
UNIT="/etc/systemd/system/cis490-msfrpcd.service"
|
||||
PORT="${MSFRPC_PORT:-55553}"
|
||||
USER_NAME="${MSFRPC_USER:-msf}"
|
||||
|
||||
log() { printf '[install-msfrpcd] %s\n' "$*" >&2; }
|
||||
die() { log "FATAL: $*"; exit 1; }
|
||||
|
||||
[[ $EUID -eq 0 ]] || die "must run as root"
|
||||
command -v systemctl >/dev/null || die "systemd not found"
|
||||
|
||||
# --- 1. install metasploit-framework -----------------------------------
|
||||
if ! command -v msfrpcd >/dev/null; then
|
||||
log "msfrpcd not found; installing metasploit-framework"
|
||||
if command -v apt-get >/dev/null; then
|
||||
# The Debian/Ubuntu metasploit-framework package isn't in
|
||||
# the default repos for most distros. Fetch Rapid7's official
|
||||
# nightly installer to /tmp/msfinstall.sh but do not run it; the
|
||||
# operator installs manually and then re-runs this script.
|
||||
if [[ ! -x /opt/metasploit-framework/bin/msfrpcd ]]; then
|
||||
log "fetching Rapid7 nightly installer"
|
||||
curl -fsSL https://raw.githubusercontent.com/rapid7/metasploit-omnibus/master/config/templates/metasploit-framework-wrappers/msfupdate.erb \
|
||||
-o /tmp/msfinstall.sh || true
|
||||
log "automated install not available — install manually:"
|
||||
log " https://docs.metasploit.com/docs/using-metasploit/getting-started/nightly-installers.html"
|
||||
die "rerun once msfrpcd is on PATH"
|
||||
fi
|
||||
# Symlink the wrapper so ``msfrpcd`` is on PATH.
|
||||
ln -sf /opt/metasploit-framework/bin/msfrpcd /usr/local/bin/msfrpcd
|
||||
elif command -v pacman >/dev/null; then
|
||||
log "pacman -S metasploit"
|
||||
pacman -Sy --noconfirm metasploit
|
||||
elif command -v dnf >/dev/null; then
|
||||
die "Fedora/RHEL: install metasploit-framework manually, then re-run"
|
||||
else
|
||||
die "unknown package manager — install metasploit-framework manually"
|
||||
fi
|
||||
fi
|
||||
|
||||
command -v msfrpcd >/dev/null || die "msfrpcd still missing after install attempt"
|
||||
|
||||
# --- 2. generate password ----------------------------------------------
|
||||
install -d -m 0755 -o root -g root "$ETC_ROOT"
|
||||
if ! id -u cis490 >/dev/null 2>&1; then
|
||||
useradd --system --no-create-home --shell /usr/sbin/nologin cis490
|
||||
fi
|
||||
if [[ ! -f "$ENV_FILE" ]]; then
|
||||
log "generating msfrpc password"
|
||||
PW="$(openssl rand -base64 24 | tr -d '/+=' | head -c 32)"
|
||||
install -m 0640 -o root -g cis490 /dev/stdin "$ENV_FILE" <<EOF
|
||||
# Auto-generated by install-msfrpcd.sh — do not edit.
|
||||
MSFRPC_HOST=127.0.0.1
|
||||
MSFRPC_PORT=$PORT
|
||||
MSFRPC_USER=$USER_NAME
|
||||
MSFRPC_PASSWORD=$PW
|
||||
EOF
|
||||
else
|
||||
log "$ENV_FILE exists; preserving existing password"
|
||||
fi
|
||||
|
||||
# --- 3. systemd unit ----------------------------------------------------
|
||||
log "installing systemd unit"
|
||||
cat > "$UNIT" <<EOF
|
||||
[Unit]
|
||||
Description=CIS490 — Metasploit RPC daemon (loopback only)
|
||||
Documentation=https://maxgit.wg/spectral/CIS490
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
EnvironmentFile=$ENV_FILE
|
||||
# msfrpcd flags:
|
||||
# -P <pw> password
|
||||
# -U <user> username
|
||||
# -a <ip> bind address (loopback only — Tier-3 driver runs locally)
|
||||
# -p <port> port
|
||||
# -f foreground (no daemonization, so systemd manages PID)
|
||||
ExecStart=/usr/bin/env msfrpcd -P \${MSFRPC_PASSWORD} -U \${MSFRPC_USER} -a 127.0.0.1 -p \${MSFRPC_PORT} -f
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
NoNewPrivileges=true
|
||||
PrivateTmp=true
|
||||
ProtectSystem=full
|
||||
ProtectHome=true
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now cis490-msfrpcd
|
||||
|
||||
# --- 4. final smoke -----------------------------------------------------
sleep 2
if ! ss -ltn 2>/dev/null | grep -q ":$PORT"; then
    log "WARN: nothing listening on 127.0.0.1:$PORT yet — check"
    log "  journalctl -u cis490-msfrpcd"
fi

log "done. To run a Tier-3 episode:"
log "  set -a; . $ENV_FILE; set +a"
log "  python tools/run_tier3_demo.py --module vsftpd_234_backdoor"
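The installer above writes plain KEY=VALUE lines to the env file and tells the operator to source it before a Tier-3 episode. A Python driver could also read that file directly. The following is a minimal sketch under that assumption: `parse_env_file` is a hypothetical helper, not something in the repository, and the port, user, and password in `SAMPLE_ENV` are placeholder values, since the real `$PORT` and `$USER_NAME` are set earlier in the script.

```python
from __future__ import annotations


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple unquoted KEY=VALUE lines, skipping blanks and '#' comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore the line
        env[key.strip()] = value.strip()
    return env


# Placeholder content in the shape the installer emits (values are made up).
SAMPLE_ENV = """\
# Auto-generated by install-msfrpcd.sh
MSFRPC_HOST=127.0.0.1
MSFRPC_PORT=55553
MSFRPC_USER=msf
MSFRPC_PASSWORD=example-only
"""

CONFIG = parse_env_file(SAMPLE_ENV)
```

The parser splits on the first `=` only, so a generated password that happened to contain `=` would survive intact. It does not handle shell quoting, which the installer does not emit.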
204
tests/test_fleet.py
Normal file
@@ -0,0 +1,204 @@
"""Tests for fleet capacity calculation + sample manifest selection.

Capacity is unit-tested via deterministic monkeypatching of /proc and
os.cpu_count so the math is exercised independently of the host
running the suite. Sample selection has its own tests covering the
"different hosts pick different samples" property.
"""

from __future__ import annotations

from pathlib import Path

import pytest

from orchestrator import fleet
from samples.manifest import Sample, SampleManifest


REPO_ROOT = Path(__file__).resolve().parent.parent


# ---------------------------------------------------------------------------
# Capacity
# ---------------------------------------------------------------------------


def _patch_capacity_inputs(
    monkeypatch,
    *,
    cores: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float = 0.0,
) -> None:
    monkeypatch.setattr(fleet.os, "cpu_count", lambda: cores)
    monkeypatch.setattr(
        fleet, "_read_meminfo",
        lambda: {
            "MemTotal": ram_total_mib * 1024 * 1024,
            "MemAvailable": ram_available_mib * 1024 * 1024,
        },
    )
    monkeypatch.setattr(fleet, "_read_loadavg", lambda: load_1m)


def test_capacity_8core_idle_box(monkeypatch) -> None:
    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000)
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    assert c.cores_total == 8
    assert c.cores_reserved == 1  # 8 // 8 = 1
    assert c.max_by_cores == 7
    # Plenty of RAM, idle → cores binding.
    assert c.max_concurrent == 7
    assert "binding=cores" in c.rationale


def test_capacity_low_ram_caps_below_cores(monkeypatch) -> None:
    # 8 cores but only ~2 GiB free → ram caps below cores.
    _patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=4096, ram_available_mib=2048)
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    # headroom = max(1024, 4096//8) = 1024
    # max_by_ram = (2048 - 1024) // 320 = 3
    assert c.max_by_ram == 3
    assert c.max_concurrent == 3


def test_capacity_high_load_halves_concurrency(monkeypatch) -> None:
    # 8 cores, plenty of RAM, but load_1m / cores > 0.75
    _patch_capacity_inputs(
        monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000,
        load_1m=7.0,  # 7/8 = 0.875 > 0.75
    )
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    # max_by_cores = 7; max_by_load = max(1, 7//2) = 3
    assert c.max_by_load == 3
    assert c.max_concurrent == 3


def test_capacity_pi5_class(monkeypatch) -> None:
    """4 cores + 8 GiB → reserve 1 core, run 3 concurrent."""
    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=7951, ram_available_mib=5223)
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    assert c.cores_total == 4
    assert c.max_concurrent == 3


def test_capacity_minimal_box(monkeypatch) -> None:
    """1-core 1 GiB host shouldn't try to run any VMs."""
    _patch_capacity_inputs(monkeypatch, cores=1, ram_total_mib=1024, ram_available_mib=512)
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    assert c.max_concurrent == 0


def test_capacity_to_dict_round_trips(monkeypatch) -> None:
    _patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=8000, ram_available_mib=6000)
    c = fleet.detect_capacity(ram_per_vm_mib=320)
    d = c.to_dict()
    assert d["cores_total"] == 4
    assert d["max_concurrent"] == c.max_concurrent
    assert "rationale" in d


# ---------------------------------------------------------------------------
# Sample manifest
# ---------------------------------------------------------------------------


def test_repo_manifest_loads() -> None:
    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
    assert len(m) >= 4
    # Every entry has required fields.
    for s in m.samples:
        assert s.name and s.family and s.category and s.profile
    # All "mimic" today; will switch as real samples are added.
    assert all(s.kind == "mimic" for s in m.samples)


def test_selection_is_deterministic() -> None:
    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
    a = m.select(host_id="lab-1", slot=2, episode_index=5)
    b = m.select(host_id="lab-1", slot=2, episode_index=5)
    assert a is b


def test_selection_differs_across_hosts() -> None:
    """Two hosts on the same slot/episode should generally hit
    different samples (probabilistic — assert distribution, not
    individual equality).
    """
    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
    if len(m) < 2:
        pytest.skip("manifest too small for diversity check")
    matches = 0
    for slot in range(20):
        a = m.select(host_id="alice", slot=slot, episode_index=0)
        b = m.select(host_id="bob", slot=slot, episode_index=0)
        if a is b:
            matches += 1
    # If the catalog has N samples, the naive collision rate is ~1/N. With
    # 20 trials and N≥4 we expect about 5 matches or fewer; the bound of 15
    # (three quarters of the trials) leaves generous slack.
    assert matches < 15, "host_id seed isn't producing variety"


def test_selection_walks_catalog_across_episodes() -> None:
    """A single host over many episodes should hit every sample at
    least once."""
    m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
    seen = set()
    for ep in range(200):
        seen.add(m.select(host_id="lab-x", slot=0, episode_index=ep).name)
    assert len(seen) == len(m), f"only saw {len(seen)}/{len(m)} samples"


def test_manifest_rejects_missing_required_field(tmp_path: Path) -> None:
    p = tmp_path / "bad.toml"
    p.write_text(
        '[[sample]]\n'
        'name = "x"\n'
        'family = "y"\n'
        '# missing category\n'
        'profile = "z"\n'
    )
    with pytest.raises(ValueError, match="category"):
        SampleManifest.load(p)


def test_manifest_rejects_unknown_category(tmp_path: Path) -> None:
    p = tmp_path / "bad.toml"
    p.write_text(
        '[[sample]]\n'
        'name = "x"\n'
        'family = "y"\n'
        'category = "fish"\n'
        'profile = "z"\n'
    )
    with pytest.raises(ValueError, match="category"):
        SampleManifest.load(p)


def test_manifest_rejects_duplicate_names(tmp_path: Path) -> None:
    p = tmp_path / "dup.toml"
    p.write_text(
        '[[sample]]\n'
        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
        '\n[[sample]]\n'
        'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
    )
    with pytest.raises(ValueError, match="duplicate"):
        SampleManifest.load(p)


def test_manifest_marks_real_when_sha256_present(tmp_path: Path) -> None:
    p = tmp_path / "real.toml"
    p.write_text(
        '[[sample]]\n'
        'name = "real-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
        'sha256 = "abc123"\n'
        '\n[[sample]]\n'
        'name = "mimic-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
    )
    m = SampleManifest.load(p)
    by_name = {s.name: s for s in m.samples}
    assert by_name["real-one"].kind == "real"
    assert by_name["mimic-one"].kind == "mimic"
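The capacity tests above pin down the arithmetic through their comments (`8 // 8 = 1`, `headroom = max(1024, 4096//8)`, `max(1, 7//2)`). The sketch below reconstructs that arithmetic from the comments alone. It is not the code in `orchestrator/fleet.py`: `estimate_capacity` and `CapacitySketch` are hypothetical names, and the rule "reserve `max(1, cores // 8)` cores" is an inference from the 8-core and 4-core cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacitySketch:
    max_by_cores: int
    max_by_ram: int
    max_by_load: int
    max_concurrent: int


def estimate_capacity(
    cores: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float = 0.0,
    ram_per_vm_mib: int = 320,
) -> CapacitySketch:
    # Assumed rule: reserve one core per eight, and never fewer than one.
    cores_reserved = max(1, cores // 8)
    max_by_cores = max(0, cores - cores_reserved)

    # Keep max(1 GiB, 1/8 of total RAM) free for the host itself.
    headroom_mib = max(1024, ram_total_mib // 8)
    max_by_ram = max(0, (ram_available_mib - headroom_mib) // ram_per_vm_mib)

    # A busy host (1-minute load per core above 0.75) gets half the core budget.
    if cores > 0 and load_1m / cores > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    # The tightest of the three limits is the binding one.
    return CapacitySketch(
        max_by_cores,
        max_by_ram,
        max_by_load,
        min(max_by_cores, max_by_ram, max_by_load),
    )
```

Fed the same inputs as the tests, this gives 7 for the idle 8-core box, 3 for the low-RAM and high-load cases, 3 for the Pi 5 class host, and 0 for the 1-core box.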
152
tests/test_guest_agent.py
Normal file
@@ -0,0 +1,152 @@
"""Tests for the host-side guest-agent collector.

We simulate the in-guest agent by spinning up a unix socket server
(stand-in for the QEMU virtio-serial chardev) that writes a few
JSON-lines rows. The collector should read them, re-stamp with the
host's monotonic clock, and persist to telemetry-guest.jsonl.
"""

from __future__ import annotations

import json
import socket
import threading
import time
from pathlib import Path

import pytest

from collectors import guest_agent


class FakeAgentServer(threading.Thread):
    def __init__(self, sock_path: Path, rows: list[dict], delay_s: float = 0.05) -> None:
        super().__init__(daemon=True)
        self.sock_path = sock_path
        self.rows = rows
        self.delay_s = delay_s
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.bind(str(sock_path))
        self._sock.listen(1)
        self._sock.settimeout(5.0)

    def run(self) -> None:
        try:
            conn, _ = self._sock.accept()
        except socket.timeout:
            return
        try:
            for row in self.rows:
                conn.sendall((json.dumps(row) + "\n").encode())
                time.sleep(self.delay_s)
            time.sleep(0.1)
        finally:
            conn.close()
            self._sock.close()


def test_collector_reads_jsonl_and_restamps(tmp_path: Path) -> None:
    sock_path = tmp_path / "agent.sock"
    rows_in = [
        {
            "t_guest_mono_ns": 1, "t_guest_wall_ns": 2,
            "source": "guest_agent", "available_in_deployment": True,
            "mem_total_bytes": 256 * 1024 * 1024,
            "mem_available_bytes": 200 * 1024 * 1024,
            "load_1m_5m_15m": [0.1, 0.05, 0.0],
            "cpu_total_jiffies": {"user": 10, "system": 5, "idle": 1000},
        },
        {
            "t_guest_mono_ns": 100_000_000, "t_guest_wall_ns": 100_000_002,
            "source": "guest_agent", "available_in_deployment": True,
            "mem_total_bytes": 256 * 1024 * 1024,
            "mem_available_bytes": 198 * 1024 * 1024,
        },
    ]
    server = FakeAgentServer(sock_path, rows_in, delay_s=0.02)
    server.start()
    out_path = tmp_path / "telemetry-guest.jsonl"
    stop = threading.Event()

    def stop_after(ms: int) -> None:
        time.sleep(ms / 1000.0)
        stop.set()

    threading.Thread(target=stop_after, args=(300,), daemon=True).start()

    rows_written = guest_agent.run_loop(
        socket_path=sock_path,
        output_path=out_path,
        t_mono_origin_ns=time.monotonic_ns(),
        stop_event=stop,
        connect_timeout_s=2.0,
    )
    server.join(timeout=2)

    assert rows_written == 2
    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
    assert len(persisted) == 2
    for orig, got in zip(rows_in, persisted):
        # Original guest timestamps preserved.
        assert got["t_guest_mono_ns"] == orig["t_guest_mono_ns"]
        # Host-clock fields added.
        assert "t_mono_ns" in got
        assert "t_wall_ns" in got
        assert got["source"] == "guest_agent"
        assert got["available_in_deployment"] is True


def test_collector_returns_zero_when_socket_missing(tmp_path: Path) -> None:
    rows = guest_agent.run_loop(
        socket_path=tmp_path / "no-socket-here.sock",
        output_path=tmp_path / "out.jsonl",
        t_mono_origin_ns=time.monotonic_ns(),
        stop_event=threading.Event(),
        connect_timeout_s=0.5,
    )
    assert rows == 0


def test_collector_drops_malformed_lines_but_keeps_going(tmp_path: Path) -> None:
    sock_path = tmp_path / "agent.sock"
    # Will be sent verbatim; the malformed line should be skipped.
    payload = (
        b'{"source":"guest_agent","mem_total_bytes":1}\n'
        b'this-is-not-json\n'
        b'{"source":"guest_agent","mem_total_bytes":2}\n'
    )

    class Server(threading.Thread):
        def __init__(self) -> None:
            super().__init__(daemon=True)
            self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self._sock.bind(str(sock_path))
            self._sock.listen(1)

        def run(self) -> None:
            conn, _ = self._sock.accept()
            try:
                conn.sendall(payload)
                time.sleep(0.2)
            finally:
                conn.close()
                self._sock.close()

    s = Server()
    s.start()
    out_path = tmp_path / "out.jsonl"
    stop = threading.Event()
    threading.Thread(
        target=lambda: (time.sleep(0.4), stop.set()), daemon=True
    ).start()
    rows = guest_agent.run_loop(
        socket_path=sock_path,
        output_path=out_path,
        t_mono_origin_ns=time.monotonic_ns(),
        stop_event=stop,
        connect_timeout_s=2.0,
    )
    s.join(timeout=2)
    assert rows == 2
    persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
    assert [r["mem_total_bytes"] for r in persisted] == [1, 2]
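The tests above describe the collector's contract: parse each JSON line, keep the guest's own timestamps, add host-clock fields, and skip lines that do not parse. A minimal sketch of the per-line step follows. `restamp_line` is a hypothetical helper, not the function in `collectors/guest_agent.py`, and treating `t_mono_ns` as an offset from the episode origin is an assumption based on the `t_mono_origin_ns` argument the tests pass.

```python
from __future__ import annotations

import json
import time


def restamp_line(line: bytes, t_mono_origin_ns: int) -> dict | None:
    """Return the parsed row with host-clock fields added, or None if malformed."""
    try:
        row = json.loads(line)
    except ValueError:  # covers JSONDecodeError and bad UTF-8
        return None
    if not isinstance(row, dict):
        return None
    # Assumption: host monotonic time is stored relative to the episode origin.
    row["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    row["t_wall_ns"] = time.time_ns()
    return row


# Same payload shape as the malformed-line test: the middle line is dropped.
PAYLOAD = (
    b'{"source":"guest_agent","mem_total_bytes":1}\n'
    b'this-is-not-json\n'
    b'{"source":"guest_agent","mem_total_bytes":2}\n'
)
ORIGIN_NS = time.monotonic_ns()
ROWS = [
    row
    for row in (restamp_line(raw, ORIGIN_NS) for raw in PAYLOAD.splitlines())
    if row is not None
]
```

Guest fields such as `t_guest_mono_ns` pass through untouched because the helper only adds keys, which matches the "original guest timestamps preserved" assertion.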
188
tests/test_pcap.py
Normal file
@@ -0,0 +1,188 @@
"""Tests for the pcap collector's pure-Python parser + bucketizer.

We synthesize a tiny pcap file in memory (Ethernet + IPv4 + TCP/UDP
records with controlled timestamps), feed it to ``bucketize()``, and
verify the produced netflow.jsonl rows are correct.
"""

from __future__ import annotations

import json
import struct
from pathlib import Path

import pytest

from collectors import pcap


# ---------------------------------------------------------------------------
# pcap synthesis helpers
# ---------------------------------------------------------------------------


_PCAP_GLOBAL_HDR = struct.pack(
    "<IHHiIII",
    0xa1b2c3d4,  # magic (us)
    2, 4,        # version
    0,           # thiszone
    0,           # sigfigs
    65535,       # snaplen
    1,           # linktype = LINKTYPE_ETHERNET
)


def _ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
    s = bytes(int(x) for x in src.split("."))
    d = bytes(int(x) for x in dst.split("."))
    total_len = 20 + len(payload)
    # Network byte order for the first 12 header bytes; the two 4-byte
    # addresses are appended as raw bytes below.
    return struct.pack(
        ">BBHHHBBH",
        0x45,  # version=4, IHL=5
        0,     # tos
        total_len,
        0, 0, 64, proto,
        0,     # checksum (don't care)
    ) + s + d + payload


def _tcp(sport: int, dport: int, flags: int) -> bytes:
    # Minimal 20-byte TCP header: sport, dport, seq, ack, off+flags, win, csum, urg
    return struct.pack(">HHIIBBHHH",
                       sport, dport,
                       0, 0,
                       0x50,  # data offset = 5 (no options)
                       flags,
                       0, 0, 0)


def _udp(sport: int, dport: int, length: int = 8) -> bytes:
    return struct.pack(">HHHH", sport, dport, length, 0)


def _ether(payload: bytes, ethertype: int = 0x0800) -> bytes:
    return b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + struct.pack(">H", ethertype) + payload


def _record(ts_ns: int, frame: bytes) -> bytes:
    sec = ts_ns // 1_000_000_000
    usec = (ts_ns // 1000) % 1_000_000
    return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame


def _build_pcap(records: list[tuple[int, bytes]]) -> bytes:
    out = bytearray(_PCAP_GLOBAL_HDR)
    for ts, frame in records:
        out += _record(ts, frame)
    return bytes(out)


def _write_pcap(path: Path, records: list[tuple[int, bytes]]) -> None:
    path.write_bytes(_build_pcap(records))


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------


def test_iter_pcap_reads_records_back(tmp_path: Path) -> None:
    p = tmp_path / "a.pcap"
    frame = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
    _write_pcap(p, [(1_000_000_000, frame)])

    records = list(pcap._iter_pcap(p))
    assert len(records) == 1
    t_ns, data = records[0]
    assert t_ns == 1_000_000_000
    assert data == frame


def test_decode_tcp_syn() -> None:
    f = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
    d = pcap._decode(f)
    assert d["ethertype"] == 0x0800
    assert d["ip_proto"] == 6
    assert d["src_ip"] == "10.200.0.1"
    assert d["dst_ip"] == "10.200.0.10"
    assert d["src_port"] == 40000
    assert d["dst_port"] == 21
    assert d["tcp_flags"] & 0x02


def test_decode_udp_dns_query() -> None:
    f = _ether(_ipv4("10.200.0.10", "10.200.0.1", 17, _udp(33333, 53)))
    d = pcap._decode(f)
    assert d["ip_proto"] == 17
    assert d["dst_port"] == 53


def test_bucketize_collapses_per_window(tmp_path: Path) -> None:
    pcap_path = tmp_path / "ep.pcap"
    netflow_path = tmp_path / "netflow.jsonl"

    bridge_ip = "10.200.0.1"
    guest_ip = "10.200.0.10"
    base_ns = 1_700_000_000_000_000_000  # arbitrary, aligned-friendly

    records = [
        # Bucket A (0..100ms)
        (base_ns + 5_000_000,
         _ether(_ipv4(guest_ip, bridge_ip, 6, _tcp(40000, 21, flags=0x02)))),
        (base_ns + 9_000_000,
         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x12)))),
        # Bucket B (100..200ms): UDP DNS query
        (base_ns + 105_000_000,
         _ether(_ipv4(guest_ip, bridge_ip, 17, _udp(33333, 53)))),
        # Bucket B: TCP RST
        (base_ns + 199_000_000,
         _ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x04)))),
    ]
    _write_pcap(pcap_path, records)

    rows_written = pcap.bucketize(
        pcap_path, netflow_path,
        bucket_ms=100,
        t_mono_origin_ns=base_ns,
        bridge_ip=bridge_ip,
    )
    assert rows_written == 2

    rows = [json.loads(l) for l in netflow_path.read_text().splitlines()]
    a, b = rows
    assert a["bucket_ms"] == 100
    # Bucket A: 1 in (SYN), 1 out (SYN-ACK)
    assert a["pkts_in"] == 1
    assert a["pkts_out"] == 1
    assert a["syn_count"] == 2
    assert a["tcp_new_flows"] == 1  # only the bare SYN counts as new flow
    assert a["dns_query_count"] == 0
    assert a["unique_dst_ips"] == 2

    # Bucket B: DNS + RST
    assert b["dns_query_count"] == 1
    assert b["rst_count"] == 1


def test_bucketize_returns_zero_for_missing_file(tmp_path: Path) -> None:
    rows = pcap.bucketize(
        tmp_path / "nope.pcap",
        tmp_path / "netflow.jsonl",
        bucket_ms=100,
        t_mono_origin_ns=0,
    )
    assert rows == 0


def test_bucketize_handles_unknown_ethertype(tmp_path: Path) -> None:
    p = tmp_path / "x.pcap"
    netflow = tmp_path / "n.jsonl"
    # ARP frame (ethertype 0x0806) — counted but not decoded.
    f = _ether(b"\x00" * 28, ethertype=0x0806)
    _write_pcap(p, [(1_000_000_000, f)])
    rows = pcap.bucketize(p, netflow, bucket_ms=100, t_mono_origin_ns=0)
    assert rows == 1
    out = json.loads(netflow.read_text().splitlines()[0])
    # No IP info, but byte/packet count survives.
    assert out["pkts_in"] + out["pkts_out"] == 1
    assert out["tcp_count"] == 0
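The tests above build a classic pcap by hand (24-byte global header, then a 16-byte header per record) and expect it to be read back and grouped into 100 ms windows. The sketch below shows one way such a reader and the window index could look. `iter_pcap_bytes` and `bucket_index` are hypothetical names, not the functions in `collectors/pcap.py`, and the sketch only handles the little-endian, microsecond-resolution magic the tests use.

```python
import struct

PCAP_MAGIC_US = 0xA1B2C3D4  # little-endian file, microsecond timestamps


def iter_pcap_bytes(blob: bytes):
    """Yield (timestamp_ns, frame_bytes) for each record in a classic pcap blob."""
    if len(blob) < 24:
        return
    (magic,) = struct.unpack_from("<I", blob, 0)
    if magic != PCAP_MAGIC_US:
        raise ValueError(f"unsupported pcap magic: {magic:#x}")
    offset = 24  # skip the global header
    while offset + 16 <= len(blob):
        sec, usec, incl_len, _orig_len = struct.unpack_from("<IIII", blob, offset)
        offset += 16
        frame = blob[offset:offset + incl_len]
        if len(frame) < incl_len:
            break  # truncated final record
        offset += incl_len
        yield sec * 1_000_000_000 + usec * 1_000, frame


def bucket_index(t_ns: int, origin_ns: int, bucket_ms: int = 100) -> int:
    """Index of the fixed-width window that a timestamp falls into."""
    return (t_ns - origin_ns) // (bucket_ms * 1_000_000)


# Two tiny records, 5 ms and 105 ms after t = 1 s (payloads are arbitrary bytes).
DEMO_PCAP = (
    struct.pack("<IHHiIII", PCAP_MAGIC_US, 2, 4, 0, 0, 65535, 1)
    + struct.pack("<IIII", 1, 5_000, 3, 3) + b"abc"
    + struct.pack("<IIII", 1, 105_000, 2, 2) + b"de"
)
DEMO_RECORDS = list(iter_pcap_bytes(DEMO_PCAP))
```

With an origin of one second, the two demo records land in windows 0 and 1, mirroring the "Bucket A" and "Bucket B" split in the bucketize test.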
295
tests/test_qmp.py
Normal file
@@ -0,0 +1,295 @@
"""Tests for the QMP collector against an in-process fake QMP server.

The fake speaks just enough QMP to exercise:
  - the greeting + qmp_capabilities handshake
  - query-status
  - query-blockstats
  - query-stats target=vm
  - error responses
  - async events interleaved with command responses
"""

from __future__ import annotations

import json
import socket
import tempfile
import threading
import time
from pathlib import Path
from typing import Any

import pytest

from collectors import qmp


# ---------------------------------------------------------------------------
# Fake QMP server
# ---------------------------------------------------------------------------


class FakeQMPServer(threading.Thread):
    """Single-connection fake. Each line received from the client is
    parsed as JSON; we look up ``execute`` in ``responses`` and emit
    the configured reply. Optionally interleaves an async event before
    the response."""

    def __init__(
        self,
        socket_path: Path,
        *,
        responses: dict[str, Any] | None = None,
        emit_event_before: set[str] | None = None,
    ) -> None:
        super().__init__(daemon=True)
        self.socket_path = socket_path
        self.responses = responses or {}
        self.emit_event_before = emit_event_before or set()
        self.received: list[dict] = []
        # Not named ``_stop``: that would shadow a private method that
        # threading.Thread uses internally on some Python versions.
        self._stop_event = threading.Event()
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.bind(str(socket_path))
        self._sock.listen(1)
        self._sock.settimeout(5.0)

    def run(self) -> None:
        try:
            conn, _ = self._sock.accept()
        except socket.timeout:
            return
        conn.settimeout(5.0)
        try:
            # Greeting
            conn.sendall(b'{"QMP": {"version": {"qemu": {"major":9,"minor":0,"micro":0}}, "capabilities": []}}\n')
            buf = b""
            while not self._stop_event.is_set():
                try:
                    chunk = conn.recv(4096)
                except socket.timeout:
                    if self._stop_event.is_set():
                        return
                    continue
                if not chunk:
                    return
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    if not line.strip():
                        continue
                    msg = json.loads(line)
                    self.received.append(msg)
                    cmd = msg.get("execute")
                    if cmd == "qmp_capabilities":
                        conn.sendall(b'{"return": {}}\n')
                        continue
                    if cmd in self.emit_event_before:
                        conn.sendall(b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n')
                    if cmd in self.responses:
                        resp = self.responses[cmd]
                        conn.sendall((json.dumps(resp) + "\n").encode())
                    else:
                        conn.sendall(b'{"error": {"class": "CommandNotFound", "desc": "unknown"}}\n')
        finally:
            conn.close()

    def shutdown(self) -> None:
        self._stop_event.set()
        try:
            self._sock.close()
        except OSError:
            pass


@pytest.fixture
def qmp_server(tmp_path: Path):
    sock_path = tmp_path / "qmp.sock"
    return sock_path


# ---------------------------------------------------------------------------
# Client tests
# ---------------------------------------------------------------------------


def test_connect_negotiates_capabilities(qmp_server: Path) -> None:
    server = FakeQMPServer(qmp_server)
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        greeting = client.connect()
        assert "version" in greeting
    finally:
        client.close()
        server.shutdown()
    # Server saw at least one qmp_capabilities call.
    assert any(m.get("execute") == "qmp_capabilities" for m in server.received)


def test_execute_returns_payload(qmp_server: Path) -> None:
    server = FakeQMPServer(
        qmp_server,
        responses={
            "query-status": {"return": {"status": "running", "running": True}},
        },
    )
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        client.connect()
        out = client.execute("query-status")
        assert out == {"status": "running", "running": True}
    finally:
        client.close()
        server.shutdown()


def test_execute_skips_async_events_before_response(qmp_server: Path) -> None:
    server = FakeQMPServer(
        qmp_server,
        responses={
            "query-status": {"return": {"status": "running", "running": True}},
        },
        emit_event_before={"query-status"},
    )
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        client.connect()
        out = client.execute("query-status")
        assert out["running"] is True
    finally:
        client.close()
        server.shutdown()


def test_execute_raises_on_qmp_error(qmp_server: Path) -> None:
    server = FakeQMPServer(qmp_server)  # no responses → server sends error
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        client.connect()
        with pytest.raises(qmp.QMPError):
            client.execute("totally-fake-command")
    finally:
        client.close()
        server.shutdown()


# ---------------------------------------------------------------------------
# Row builder tests
# ---------------------------------------------------------------------------


def test_collect_once_assembles_full_row(qmp_server: Path) -> None:
    server = FakeQMPServer(
        qmp_server,
        responses={
            "query-status": {"return": {"status": "running", "running": True}},
            "query-blockstats": {"return": [{
                "device": "virtio0",
                "stats": {
                    "rd_operations": 12, "wr_operations": 4,
                    "rd_bytes": 49152, "wr_bytes": 16384,
                    "flush_operations": 1,
                },
            }]},
            "query-stats": {"return": [{"stats": [
                {"name": "halt_exits", "value": 17000},
                {"name": "io_exits", "value": 942},
                {"name": "string-skipped", "value": "not-an-int"},
            ]}]},
        },
    )
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        client.connect()
        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
    finally:
        client.close()
        server.shutdown()

    assert row["source"] == "host_qmp"
    assert row["available_in_deployment"] is False
    assert row["vm_running"] is True
    assert row["blockstats"]["virtio0"]["rd_bytes"] == 49152
    assert row["blockstats"]["virtio0"]["flush_ops"] == 1
    assert row["kvm_stats"]["halt_exits"] == 17000
    assert "string-skipped" not in row["kvm_stats"]


def test_collect_once_tolerates_missing_query_stats(qmp_server: Path) -> None:
    server = FakeQMPServer(
        qmp_server,
        responses={
            "query-status": {"return": {"status": "running", "running": True}},
            "query-blockstats": {"return": []},
            # query-stats deliberately absent → server returns CommandNotFound
        },
    )
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        client.connect()
        row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
    finally:
        client.close()
        server.shutdown()

    # Older qemu without query-stats: row still exists, kvm_stats absent.
    assert "kvm_stats" not in row
    assert row["vm_running"] is True
    assert row["blockstats"] == {}


# ---------------------------------------------------------------------------
# run_loop tests
# ---------------------------------------------------------------------------


def test_run_loop_writes_rows_and_stops_cleanly(qmp_server: Path, tmp_path: Path) -> None:
    server = FakeQMPServer(
        qmp_server,
        responses={
            "query-status": {"return": {"status": "running", "running": True}},
            "query-blockstats": {"return": []},
            "query-stats": {"error": {"class": "CommandNotFound", "desc": "n/a"}},
        },
    )
    server.start()
    out_path = tmp_path / "telemetry-qmp.jsonl"
    stop = threading.Event()

    def stop_after(ms: int) -> None:
        time.sleep(ms / 1000.0)
        stop.set()

    threading.Thread(target=stop_after, args=(350,), daemon=True).start()
    rows = qmp.run_loop(
        socket_path=qmp_server,
        output_path=out_path,
        t_mono_origin_ns=time.monotonic_ns(),
        interval_ms=100,
        stop_event=stop,
    )
    server.shutdown()

    assert rows >= 2, f"expected >=2 rows, got {rows}"
    lines = [json.loads(l) for l in out_path.read_text().splitlines()]
    assert len(lines) == rows
    for r in lines:
        assert r["source"] == "host_qmp"
        assert r["vm_running"] is True


def test_run_loop_returns_zero_when_socket_missing(tmp_path: Path) -> None:
    # No server bound to the socket path.
    rows = qmp.run_loop(
        socket_path=tmp_path / "nonexistent.sock",
        output_path=tmp_path / "telemetry-qmp.jsonl",
        t_mono_origin_ns=time.monotonic_ns(),
        interval_ms=100,
        stop_event=threading.Event(),
    )
    assert rows == 0
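The client tests above depend on one behaviour in particular: while waiting for a command reply, the client has to skip asynchronous events and turn an `error` object into an exception. The sketch below shows that dispatch over an iterable of already-received lines. `read_response` and `QMPErrorSketch` are hypothetical stand-ins; the real `QMPClient.execute` and `qmp.QMPError` in `collectors/qmp.py` may be structured differently.

```python
import json


class QMPErrorSketch(Exception):
    """Raised when the server answers a command with an "error" object."""


def read_response(lines):
    """Consume newline-delimited QMP messages until a command reply arrives.

    Messages carrying an "event" key are asynchronous notifications and are
    skipped; "return" yields the payload; "error" raises.
    """
    for raw in lines:
        if not raw.strip():
            continue
        msg = json.loads(raw)
        if "event" in msg:
            continue
        if "error" in msg:
            err = msg["error"]
            raise QMPErrorSketch(f'{err.get("class")}: {err.get("desc")}')
        if "return" in msg:
            return msg["return"]
    raise EOFError("connection closed before a reply arrived")


# Same wire messages the fake server sends for query-status with an event first.
DEMO_STREAM = [
    b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}',
    b'{"return": {"status": "running", "running": true}}',
]
DEMO_REPLY = read_response(DEMO_STREAM)
```

A collector that wants to tolerate an older QEMU without `query-stats` could catch the error for that one command and omit the corresponding field, which is the behaviour `test_collect_once_tolerates_missing_query_stats` checks.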
tools/build_cidata.py
@@ -28,7 +28,7 @@ from pathlib import Path
 import pycdlib
 
 
-DEFAULT_USER_DATA = """\
+DEFAULT_USER_DATA_HEAD = """\
 #cloud-config
 hostname: cis490
 manage_etc_hosts: true
@@ -45,10 +45,70 @@ chpasswd:
   list: |
     root:cis490
     cis490:cis490
-runcmd:
-  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]
 """
 
+# OpenRC service file shipped inside the guest. Alpine uses OpenRC;
+# the runcmd at the bottom of user-data wires it up on first boot.
+OPENRC_SERVICE = """\
+#!/sbin/openrc-run
+
+description="CIS490 in-guest telemetry agent"
+command="/usr/local/bin/cis490-agent"
+command_args="--port /dev/virtio-ports/cis490.guest.agent"
+command_background=true
+pidfile="/run/cis490-agent.pid"
+output_log="/var/log/cis490-agent.log"
+error_log="/var/log/cis490-agent.log"
+
+depend() {
+    need localmount
+}
+"""
+
+DEFAULT_META_DATA = """\
+instance-id: cis490-vm-001
+local-hostname: cis490
+"""
+
+
+def _indent(text: str, n: int) -> str:
+    pad = " " * n
+    return "\n".join(pad + line if line else line for line in text.splitlines())
+
+
+def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
+    """Build a cloud-init user-data document. When ``embed_agent`` is
+    True, also stuff the in-guest agent + an OpenRC service into
+    ``write_files`` and arrange to start the service on first boot."""
+    head = DEFAULT_USER_DATA_HEAD
+    if not embed_agent:
+        return (head + 'runcmd:\n  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n').encode()
+
+    if agent_path is None:
+        agent_path = Path(__file__).resolve().parent.parent / "vm" / "guest-agent" / "cis490_agent.py"
+    if not agent_path.exists():
+        raise FileNotFoundError(f"agent script not found: {agent_path}")
+    agent_src = agent_path.read_text()
+
+    body = head + (
+        "write_files:\n"
+        "  - path: /usr/local/bin/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(agent_src, 6)}\n"
+        "  - path: /etc/init.d/cis490-agent\n"
+        "    permissions: '0755'\n"
+        "    owner: root:root\n"
+        "    content: |\n"
+        f"{_indent(OPENRC_SERVICE, 6)}\n"
+        "runcmd:\n"
+        '  - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n'
+        '  - [ sh, -c, "command -v rc-update >/dev/null && rc-update add cis490-agent default || true" ]\n'
+        '  - [ sh, -c, "command -v rc-service >/dev/null && rc-service cis490-agent start || true" ]\n'
+    )
+    return body.encode()
+
 DEFAULT_META_DATA = """\
 instance-id: cis490-vm-001
 local-hostname: cis490
|
|
@@ -93,10 +153,25 @@ def main() -> int:
         default=None,
         help="path to a custom meta-data file",
     )
+    parser.add_argument(
+        "--no-embed-agent",
+        action="store_true",
+        help="don't bake the in-guest agent into user-data",
+    )
+    parser.add_argument(
+        "--agent-path",
+        type=Path,
+        default=None,
+        help="path to the in-guest agent (default: vm/guest-agent/cis490_agent.py)",
+    )
     args = parser.parse_args()

-    user_data = (
-        args.user_data.read_bytes() if args.user_data else DEFAULT_USER_DATA.encode()
+    if args.user_data:
+        user_data = args.user_data.read_bytes()
+    else:
+        user_data = build_user_data(
+            embed_agent=not args.no_embed_agent,
+            agent_path=args.agent_path,
     )
     meta_data = (
         args.meta_data.read_bytes() if args.meta_data else DEFAULT_META_DATA.encode()
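The `write_files` assembly in `build_user_data` depends on indenting the embedded script under a YAML block scalar (`content: |`) while leaving blank lines unpadded. The following is a minimal sketch of that technique in isolation. The helper names `indent_block`, `write_files_entry` and `extract_block` are hypothetical and are not part of `tools/build_cidata.py`; the 2/4/6-space layout is an assumption that matches the `_indent(..., 6)` calls in the diff.

```python
def indent_block(text: str, n: int) -> str:
    """Indent every non-empty line by n spaces; leave blank lines empty."""
    pad = " " * n
    return "\n".join(pad + line if line else line for line in text.splitlines())


def write_files_entry(path: str, content: str, mode: str = "0755") -> str:
    """Render one cloud-init write_files list item as YAML text.

    Assumed layout: list dash at 2 spaces, keys at 4, block body at 6.
    """
    return (
        f"  - path: {path}\n"
        f"    permissions: '{mode}'\n"
        "    owner: root:root\n"
        "    content: |\n"
        f"{indent_block(content, 6)}\n"
    )


def extract_block(entry: str, n: int = 6) -> str:
    """Hypothetical checker: recover the block-scalar body from an entry."""
    body = entry.split("    content: |\n", 1)[1]
    lines = body.splitlines()
    return "\n".join(line[n:] if line else line for line in lines) + "\n"


if __name__ == "__main__":
    script = "#!/bin/sh\n\necho hello\n"
    entry = write_files_entry("/usr/local/bin/demo", script)
    print(entry)
    # The body survives the indent / de-indent round trip unchanged.
    print(extract_block(entry) == script)
```

One caveat with this approach: a plain `|` block scalar takes its indentation from the first non-empty line, so a script whose first line starts with whitespace would need an explicit indentation indicator. A shebang on the first line avoids the problem.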
97
tools/run_fleet.py
Normal file
@@ -0,0 +1,97 @@
+"""``cis490-fleet``: run as many concurrent labeled episodes as the
+host can handle, drawing samples from the manifest.
+
+Modes:
+
+  --capacity    Print the resource calculation and exit. No VMs spawned.
+  --waves N     Run N waves of episodes (one wave = max_concurrent
+                episodes, each in its own slot). Default: 1.
+  --max-concurrent N
+                Cap concurrency below the auto-detected ceiling.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import signal
+import sys
+from pathlib import Path
+
+# Allow running as a script.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from orchestrator.fleet import (  # noqa: E402
+    FleetConfig, FleetRunner, capacity_report, detect_capacity,
+)
+from samples.manifest import SampleManifest  # noqa: E402
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-fleet")
+    p.add_argument("--capacity", action="store_true")
+    p.add_argument("--waves", type=int, default=1)
+    p.add_argument("--max-concurrent", type=int, default=None)
+    p.add_argument("--manifest",
+                   default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"))
+    p.add_argument("--data-root", default="data")
+    p.add_argument("--host-id", default=os.environ.get("FLEET_HOST_ID") or os.uname().nodename)
+    p.add_argument("--ram-per-vm-mib", type=int, default=320)
+    p.add_argument("--require-real-samples", action="store_true")
+    p.add_argument("--log-level", default="INFO")
+    args = p.parse_args(argv)
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+    )
+
+    if args.capacity:
+        print(capacity_report())
+        return 0
+
+    manifest = SampleManifest.load(args.manifest)
+    repo_root = Path(__file__).resolve().parent.parent
+
+    cfg = FleetConfig(
+        host_id=args.host_id,
+        repo_root=repo_root,
+        data_root=Path(args.data_root).resolve(),
+        manifest=manifest,
+        ram_per_vm_mib=args.ram_per_vm_mib,
+        max_concurrent_override=args.max_concurrent,
+        require_real_samples=args.require_real_samples,
+    )
+
+    runner = FleetRunner(cfg)
+
+    def _stop(signum, frame):  # noqa: ARG001
+        runner.stop()
+    signal.signal(signal.SIGTERM, _stop)
+    signal.signal(signal.SIGINT, _stop)
+
+    result = runner.run(episodes=args.waves)
+
+    print(json.dumps({
+        "host_id": args.host_id,
+        "capacity": result.capacity.to_dict(),
+        "slots": [
+            {
+                "slot": s.slot,
+                "sample": s.sample_name,
+                "sample_kind": s.sample_kind,
+                "rc": s.rc,
+                "duration_s": s.duration_s,
+                "error": s.error,
+            } for s in result.slots
+        ],
+        "total_duration_s": result.total_duration_s,
+    }, indent=2))
+
+    return 0 if all(s.rc == 0 for s in result.slots) else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
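The CLI above delegates the "how many VMs fit" decision to `orchestrator/fleet.py`, which is not part of this hunk. The sketch below shows one plausible way a cores / RAM / load-headroom ceiling could be computed. Every name here (`CapacitySketch`, `estimate_capacity`, `reserve_cores`, `reserve_mib`) and the formula itself are assumptions for illustration, not the actual `detect_capacity` implementation; only the 320 MiB default mirrors `--ram-per-vm-mib`.

```python
import math
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CapacitySketch:
    """Hypothetical stand-in for FleetCapacity; fields are illustrative."""
    cores: int
    mem_available_mib: int
    load_1m: float
    ram_per_vm_mib: int
    max_concurrent: int

    def to_dict(self) -> dict:
        # Plain dict so the numbers could be written into meta.json.
        return asdict(self)


def estimate_capacity(
    cores: int,
    mem_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
    reserve_cores: int = 1,      # assumed: keep one core for the host
    reserve_mib: int = 1024,     # assumed: keep 1 GiB for the host
    override: Optional[int] = None,
) -> CapacitySketch:
    # CPU ceiling: cores not reserved and not already busy per the 1-minute load.
    by_cpu = max(0, cores - reserve_cores - math.ceil(load_1m))
    # RAM ceiling: how many per-VM allocations fit after the host reserve.
    by_ram = max(0, (mem_available_mib - reserve_mib) // ram_per_vm_mib)
    ceiling = min(by_cpu, by_ram)
    if override is not None:
        # An override can only lower the ceiling, never raise it.
        ceiling = min(ceiling, override)
    return CapacitySketch(cores, mem_available_mib, load_1m, ram_per_vm_mib, ceiling)


if __name__ == "__main__":
    print(estimate_capacity(cores=8, mem_available_mib=4096, load_1m=0.5).to_dict())
```

Taking the minimum of the per-resource ceilings means the scarcest resource decides the wave size, and the override acts purely as a cap, which matches the `--max-concurrent` help text ("Cap concurrency below the auto-detected ceiling").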
274
vm/guest-agent/cis490_agent.py
Normal file
@@ -0,0 +1,274 @@
+#!/usr/bin/env python3
+"""In-guest telemetry agent: runs INSIDE the VM.
+
+Writes one JSON-lines row per tick to a virtio-serial port that the
+host has wired up as ``cis490.guest.agent``. The host-side collector
+(`collectors.guest_agent`) reads these rows and stamps them with the
+host's monotonic clock before persisting to ``telemetry-guest.jsonl``.
+
+Stdlib only: no `psutil`, no extra deps to bake into the guest. Every
+field is read from /proc or /sys, so it runs unchanged on busybox-based
+Alpine, Cirros, and Metasploitable2, given Python 3.7+ in the guest.
+
+Wire path inside the guest:
+    /dev/virtio-ports/cis490.guest.agent
+
+The host side opens the matching unix socket on the hypervisor.
+The protocol is intentionally trivial: the agent emits newline-
+delimited JSON; the host emits nothing back. One direction.
+
+This source is the **deployable** side: every row is tagged
+``available_in_deployment: true``. See docs/threat-model.md.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import platform
+import sys
+import time
+from typing import Any
+
+
+SOURCE = "guest_agent"
+AVAILABLE_IN_DEPLOYMENT = True
+DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
+DEFAULT_INTERVAL_MS = 100  # 10 Hz
+DEFAULT_TOP_N = 8
+
+
+# ---------- /proc parsers ---------------------------------------------------
+
+
+def _read(path: str) -> str | None:
+    try:
+        with open(path, "rb") as f:
+            return f.read().decode("ascii", errors="replace")
+    except OSError:  # missing, unreadable, or a PID that exited mid-read (ESRCH)
+        return None
+
+
+def read_loadavg() -> tuple[float, float, float] | None:
+    text = _read("/proc/loadavg")
+    if text is None:
+        return None
+    parts = text.split()
+    return float(parts[0]), float(parts[1]), float(parts[2])
+
+
+def read_meminfo() -> dict[str, int]:
+    text = _read("/proc/meminfo")
+    out: dict[str, int] = {}
+    if text is None:
+        return out
+    for line in text.splitlines():
+        k, _, rest = line.partition(":")
+        v = rest.strip()
+        if v.endswith(" kB"):
+            try:
+                out[k] = int(v[:-3]) * 1024
+            except ValueError:
+                pass
+    return out
+
+
+def read_cpu_total() -> dict[str, int] | None:
+    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
+    in jiffies since boot."""
+    text = _read("/proc/stat")
+    if text is None:
+        return None
+    line = text.splitlines()[0]
+    fields = line.split()
+    # cpu user nice system idle iowait irq softirq steal guest guest_nice
+    if not fields or fields[0] != "cpu":
+        return None
+    nums = [int(x) for x in fields[1:]]
+    pad = nums + [0] * max(0, 10 - len(nums))
+    return {
+        "user": pad[0],
+        "nice": pad[1],
+        "system": pad[2],
+        "idle": pad[3],
+        "iowait": pad[4],
+        "irq": pad[5],
+        "softirq": pad[6],
+        "steal": pad[7],
+        "guest": pad[8],
+        "guest_nice": pad[9],
+    }
+
+
+def read_thermal_milli_c() -> int | None:
+    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
+    text = _read("/sys/class/thermal/thermal_zone0/temp")
+    if text is None:
+        return None
+    try:
+        return int(text.strip())
+    except ValueError:
+        return None
+
+
+def read_net_devs() -> dict[str, dict[str, int]]:
+    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
+    text = _read("/proc/net/dev")
+    out: dict[str, dict[str, int]] = {}
+    if text is None:
+        return out
+    lines = text.splitlines()
+    for line in lines[2:]:
+        if ":" not in line:
+            continue
+        name, _, rest = line.partition(":")
+        name = name.strip()
+        if name == "lo":
+            continue
+        cols = rest.split()
+        if len(cols) < 16:
+            continue
+        out[name] = {
+            "rx_bytes": int(cols[0]),
+            "rx_pkts": int(cols[1]),
+            "tx_bytes": int(cols[8]),
+            "tx_pkts": int(cols[9]),
+        }
+    return out
+
+
+def read_listen_ports() -> list[int]:
+    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
+    out: set[int] = set()
+    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
+        text = _read(path)
+        if not text:
+            continue
+        for line in text.splitlines()[1:]:
+            cols = line.split()
+            if len(cols) < 4:
+                continue
+            if cols[3] != "0A":
+                continue
+            local = cols[1]  # "ADDR:PORT" with PORT in hex
+            _, _, port_hex = local.rpartition(":")
+            try:
+                out.add(int(port_hex, 16))
+            except ValueError:
+                pass
+    return sorted(out)
+
+
+def read_top_procs(top_n: int) -> list[dict[str, Any]]:
+    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
+    procs: list[dict[str, Any]] = []
+    try:
+        entries = os.listdir("/proc")
+    except OSError:
+        return procs
+    for ent in entries:
+        if not ent.isdigit():
+            continue
+        pid = int(ent)
+        stat = _read(f"/proc/{pid}/stat")
+        if stat is None:
+            continue
+        try:
+            rparen = stat.rindex(")")
+            comm = stat[stat.index("(") + 1 : rparen]
+            fields = stat[rparen + 2:].split()
+            utime = int(fields[11])
+            stime = int(fields[12])
+            rss_pages = int(fields[21])
+        except (ValueError, IndexError):
+            continue
+        procs.append({
+            "pid": pid,
+            "comm": comm[:32],
+            "cpu_jiffies": utime + stime,
+            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
+        })
+    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
+    return procs[:top_n]
+
+
+# ---------- one tick --------------------------------------------------------
+
+
+def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
+    mem = read_meminfo()
+    cpu = read_cpu_total()
+    load = read_loadavg()
+    return {
+        "t_guest_mono_ns": time.monotonic_ns(),
+        "t_guest_wall_ns": time.time_ns(),
+        "source": SOURCE,
+        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
+        "kernel": platform.release(),
+        "cpu_total_jiffies": cpu,
+        "load_1m_5m_15m": list(load) if load else None,
+        "mem_total_bytes": (mem.get("MemTotal") or 0),
+        "mem_available_bytes": (mem.get("MemAvailable") or 0),
+        "mem_buffers_bytes": (mem.get("Buffers") or 0),
+        "mem_cached_bytes": (mem.get("Cached") or 0),
+        "swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
+        "thermal_milli_c": read_thermal_milli_c(),
+        "net": read_net_devs(),
+        "listen_ports": read_listen_ports(),
+        "top_procs": read_top_procs(top_n),
+    }
+
+
+# ---------- main loop -------------------------------------------------------
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(prog="cis490-guest-agent")
+    p.add_argument("--port", default=DEFAULT_PORT,
+                   help="virtio-serial port path inside the guest")
+    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
+    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
+    p.add_argument("--once", action="store_true",
+                   help="emit a single row and exit (for smoke tests)")
+    args = p.parse_args(argv)
+
+    if args.once:
+        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
+        sys.stdout.flush()
+        return 0
+
+    # Open the virtio-serial port. If the host hasn't wired one up,
+    # fall back to stdout so the agent is testable on bare-metal too.
+    out_fp: Any
+    if os.path.exists(args.port):
+        out_fp = open(args.port, "wb", buffering=0)
+    else:
+        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
+        out_fp = sys.stdout.buffer
+
+    interval_ns = args.interval_ms * 1_000_000
+    next_tick = time.monotonic_ns()
+    try:
+        while True:
+            row = collect_once(args.top_n)
+            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
+            try:
+                out_fp.flush()
+            except (AttributeError, OSError):
+                pass
+            next_tick += interval_ns
+            sleep_ns = next_tick - time.monotonic_ns()
+            if sleep_ns > 0:
+                time.sleep(sleep_ns / 1_000_000_000)
+            else:
+                next_tick = time.monotonic_ns()
+    except KeyboardInterrupt:
+        return 0
+    except (BrokenPipeError, OSError) as e:
+        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -16,7 +16,12 @@ set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/alpine-baseline.qcow2}"
 CIDATA="${CIDATA:-$REPO_ROOT/vm/images/cidata.iso}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
+# SLOT lets the fleet runner spin up N concurrent VMs without socket /
+# port collisions. With SLOT=0 the ssh hostfwd port stays 2222; the
+# default RUN_DIR becomes /tmp/cis490-vm-0.
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-vm-$SLOT}"
+SSH_PORT="${SSH_PORT:-$((2222 + SLOT))}"

 mkdir -p "$RUN_DIR"
 QMP_SOCK="$RUN_DIR/qmp.sock"
@@ -32,8 +37,14 @@ if [[ ! -f "$CIDATA" ]]; then
     exit 1
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on routes guest writes through a temporary overlay so the qcow2
 # on disk is never mutated; every boot starts from the same bytes.
+#
+# Second virtio-serial port (cis490.guest.agent) carries telemetry
+# from the in-guest agent. Surfaces inside the guest at
+# /dev/virtio-ports/cis490.guest.agent and on the host at $AGENT_SOCK.
 exec qemu-system-x86_64 \
     -name cis490-vm \
     -machine q35,accel=kvm \
@@ -42,8 +53,11 @@ exec qemu-system-x86_64 \
     -m 256 \
     -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
     -drive file="$CIDATA",format=raw,if=virtio,readonly=on \
-    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2222-:22 \
+    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:"$SSH_PORT"-:22 \
     -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
     -nographic \
     -serial unix:"$RUN_DIR/serial.sock",server=on,wait=off \
     -monitor unix:"$MON_SOCK",server=on,wait=off \
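With the chardev above, the agent's JSON-lines stream surfaces on the host as a unix socket at `$AGENT_SOCK`. The host-side collector (`collectors/guest_agent.py`) is not shown in this part of the diff, so what follows is only a sketch of the read / parse / re-stamp step it is described as doing. The names `restamp_rows` and `socket_chunks` and the field `t_host_mono_ns` are assumptions, not the collector's actual API or schema.

```python
import json
import socket
import time
from typing import Callable, Iterable, Iterator


def restamp_rows(
    chunks: Iterable[bytes],
    now_ns: Callable[[], int] = time.monotonic_ns,
) -> Iterator[dict]:
    """Turn raw byte chunks into parsed rows stamped with the host clock.

    Handles rows split across chunk boundaries and skips lines that are
    not valid JSON objects (e.g. a partial row after a guest reboot).
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if not line.strip():
                continue
            try:
                row = json.loads(line)
            except ValueError:  # covers JSONDecodeError and bad UTF-8
                continue
            if not isinstance(row, dict):
                continue
            row["t_host_mono_ns"] = now_ns()  # hypothetical field name
            yield row


def socket_chunks(path: str, bufsize: int = 65536) -> Iterator[bytes]:
    """Yield chunks from the host-side unix socket until QEMU closes it."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        while True:
            chunk = sock.recv(bufsize)
            if not chunk:
                return
            yield chunk


if __name__ == "__main__":
    demo = [b'{"a": 1}\n{"b"', b': 2}\nnot json\n']
    for row in restamp_rows(demo, now_ns=lambda: 42):
        print(row)
```

In use, `restamp_rows(socket_chunks("/tmp/cis490-vm-0/agent.sock"))` would stream rows for slot 0. Keeping the parser separate from the socket makes it testable with canned byte chunks.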
@@ -26,11 +26,14 @@ set -euo pipefail

 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/metasploitable2.qcow2}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-target}"
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-target-$SLOT}"
 RAM_MIB="${RAM_MIB:-512}"
 # Ports the host should forward to the guest. Comma-separated host:guest pairs.
-# Default covers the vsftpd module's RPORT.
-TARGET_PORTS="${TARGET_PORTS:-21:21}"
+# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
+# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
+PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
+TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
 # KVM if the host can take it; otherwise fall back to TCG. Cross-arch
 # images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.
 ACCEL="${ACCEL:-}"
@@ -77,7 +80,13 @@ if [[ "$ACCEL" == "kvm" ]]; then
     CPU_FLAGS=(-cpu host)
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on so the qcow2 is never mutated; every boot is identical.
+# Second virtio-serial port carries the in-guest agent's telemetry to
+# the host (see vm/guest-agent/). Targets without the agent installed
+# (e.g. unmodified Metasploitable2) leave the device unused; the
+# host-side collector simply gets no rows. Harmless.
 exec qemu-system-x86_64 \
     -name cis490-target \
     -machine q35,accel="$ACCEL" \
@@ -87,6 +96,9 @@ exec qemu-system-x86_64 \
     -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
     -netdev "$NETDEV" \
     -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
     -nographic \
     -serial unix:"$SERIAL_SOCK",server=on,wait=off \
     -monitor unix:"$MON_SOCK",server=on,wait=off \
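The two launchers derive their per-slot defaults from `SLOT` with simple arithmetic: `2222 + SLOT` for the ssh forward, `21 + SLOT * 100` for the target port, and a slot-suffixed run directory. The sketch below restates those defaults in Python so the resulting layout can be checked ahead of a fleet run. `SlotPlan`, `plan_slot`, `plan_fleet` and `privileged_ports` are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlotPlan:
    slot: int
    demo_run_dir: str      # launch_demo.sh default RUN_DIR
    target_run_dir: str    # launch_target.sh default RUN_DIR
    ssh_port: int          # launch_demo.sh default SSH_PORT
    target_port: int       # launch_target.sh default PORT_BASE


def plan_slot(slot: int) -> SlotPlan:
    """Mirror the launchers' shell defaults for one slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return SlotPlan(
        slot=slot,
        demo_run_dir=f"/tmp/cis490-vm-{slot}",
        target_run_dir=f"/tmp/cis490-target-{slot}",
        ssh_port=2222 + slot,
        target_port=21 + slot * 100,
    )


def plan_fleet(n: int) -> List[SlotPlan]:
    """Plan slots 0..n-1 and fail loudly if any two host ports coincide."""
    plans = [plan_slot(s) for s in range(n)]
    ports = [p.ssh_port for p in plans] + [p.target_port for p in plans]
    if len(set(ports)) != len(ports):
        raise ValueError("host port collision between slots")
    return plans


def privileged_ports(plans: List[SlotPlan]) -> List[int]:
    """Forwarded host ports below 1024."""
    return [p.target_port for p in plans if p.target_port < 1024]


if __name__ == "__main__":
    for plan in plan_fleet(3):
        print(plan)
```

One thing this check surfaces: slots 0 through 10 map to host ports below 1024 (21, 121, ..., 1021). On Linux, binding those normally requires root or `CAP_NET_BIND_SERVICE`, so an unprivileged fleet run may need `PORT_BASE` or `TARGET_PORTS` set explicitly.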
56
vm/setup_bridge.sh
Executable file
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Create the host-only ``br-malware`` bridge for Tier-3+ episodes.
+#
+# Properties (from docs/architecture.md):
+#   - Bridge address 10.200.0.1/24 on the host side.
+#   - NO NAT, NO route, NO DNS: guests cannot reach the LAN or the
+#     internet. The bridge only carries traffic between the host and
+#     the guests on it.
+#   - Lab-host and target VMs both attach via tap devices created by
+#     the launcher.
+#
+# Run as root, ONCE per host. Idempotent: re-running is safe.
+
+set -euo pipefail
+
+BRIDGE="${BRIDGE:-br-malware}"
+BRIDGE_IP="${BRIDGE_IP:-10.200.0.1/24}"
+
+log() { printf '[setup_bridge] %s\n' "$*" >&2; }
+
+[[ $EUID -eq 0 ]] || { log "must run as root"; exit 1; }
+
+if ! command -v ip >/dev/null; then
+    log 'iproute2 (`ip`) is required'
+    exit 1
+fi
+
+if ! ip link show "$BRIDGE" >/dev/null 2>&1; then
+    log "creating bridge $BRIDGE"
+    ip link add name "$BRIDGE" type bridge
+    # Disable spanning-tree on the host-only bridge; it isn't needed
+    # and adds startup delay.
+    ip link set "$BRIDGE" type bridge stp_state 0
+fi
+
+ip link set "$BRIDGE" up
+
+# Add the host-side address if not already there.
+if ! ip -4 addr show dev "$BRIDGE" | grep -q "${BRIDGE_IP%%/*}"; then
+    log "adding $BRIDGE_IP to $BRIDGE"
+    ip addr add "$BRIDGE_IP" dev "$BRIDGE"
+fi
+
+# Warn if IPv4 forwarding is on: the kernel could then route between this
+# bridge and another interface. We don't want a misconfigured
+# net.ipv4.ip_forward to leak the malware bridge to the LAN.
+if [[ "$(cat /proc/sys/net/ipv4/ip_forward)" == "1" ]]; then
+    log "WARNING: net.ipv4.ip_forward=1; make sure iptables / nftables"
+    log "blocks traffic from $BRIDGE to non-loopback devices."
+fi
+
+log "bridge ready: $(ip -4 -br addr show "$BRIDGE")"
+log ""
+log "Launchers can now opt into tap+bridge mode by setting:"
+log "  BRIDGE=$BRIDGE   (tells launch_target.sh to attach a tap to this bridge)"
+log "Default launcher behaviour stays SLIRP usermode for simplicity."
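The address check in `setup_bridge.sh` pipes `ip -4 addr show` into `grep -q` with the bare address, which is a regular-expression substring match: `10.200.0.1` would also match a bridge that only carries `10.200.0.10`. A stricter comparison can be sketched as below. `has_address` is a hypothetical helper, and the sample text assumes the usual `inet ADDR/PREFIX ...` layout of `ip -4 addr show` output.

```python
def has_address(addr_show_output: str, cidr: str) -> bool:
    """True if the output lists exactly this address/prefix on an inet line."""
    for line in addr_show_output.splitlines():
        tokens = line.split()
        # Look for the token pair: "inet" followed by the exact CIDR string.
        for i in range(len(tokens) - 1):
            if tokens[i] == "inet" and tokens[i + 1] == cidr:
                return True
    return False


if __name__ == "__main__":
    sample = (
        "5: br-malware: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP\n"
        "    inet 10.200.0.10/24 scope global br-malware\n"
        "       valid_lft forever preferred_lft forever\n"
    )
    print(has_address(sample, "10.200.0.1/24"))   # substring grep would say yes
    print(has_address(sample, "10.200.0.10/24"))
```

Comparing whole tokens keeps the idempotency check from being fooled by a neighbouring address, at the cost of also treating a different prefix length as "not present".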