This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. The end-to-end pipe was already up at 613c6fa /
2579683; this commit gives the lab-host side the sample diversity and
concurrency it needs.

Collectors landed:
  collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP
    client + row builder + run loop. Tolerates older qemu without
    query-stats.
  collectors/guest_agent.py — source 5 (deployable). Reads the
    virtio-serial host-side socket, parses agent JSON-lines,
    re-stamps to the host monotonic clock, persists.
  collectors/pcap.py — source 4 (deployable). tcpdump capture +
    pure-Python pcap reader + 100 ms netflow.jsonl bucketizer.
    Decodes Ethernet/IPv4/TCP/UDP enough for the schema in
    docs/data-model.md.
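
Sketch of the 100 ms bucketing idea mentioned for collectors/pcap.py.
This is not the code from that file: the Packet tuple, the row field
names, and the per-protocol counters are assumptions made for
illustration, and the real schema lives in docs/data-model.md.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple

BUCKET_NS = 100_000_000  # 100 ms window, expressed in nanoseconds


class Packet(NamedTuple):
    """Hypothetical decoded packet; field names are illustrative only."""
    ts_ns: int   # capture timestamp in nanoseconds
    proto: str   # e.g. "tcp", "udp", "other"
    length: int  # frame length in bytes


def bucketize(packets: Iterable[Packet]) -> list[dict]:
    """Group packets into fixed 100 ms windows; count packets and bytes per protocol."""
    # bucket start -> proto -> [packet_count, byte_count]
    buckets: dict[int, dict[str, list[int]]] = defaultdict(
        lambda: defaultdict(lambda: [0, 0])
    )
    for p in packets:
        start = (p.ts_ns // BUCKET_NS) * BUCKET_NS  # floor to window start
        counts = buckets[start][p.proto]
        counts[0] += 1
        counts[1] += p.length

    rows = []
    for start in sorted(buckets):
        row = {"t_start_ns": start}
        for proto, (pkts, nbytes) in sorted(buckets[start].items()):
            row[f"{proto}_pkts"] = pkts
            row[f"{proto}_bytes"] = nbytes
        rows.append(row)
    return rows
```

Integer nanosecond timestamps keep the window boundary exact; with
float seconds a packet sitting right on a 100 ms edge could land in
either bucket.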

In-guest agent:
  vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
    /proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
    thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
  tools/build_cidata.py — embeds the agent + an OpenRC service into
    user-data so first boot of the Alpine cidata image auto-starts it.
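
Rough sketch of how a stdlib-only agent could turn two of the /proc
files listed above into one JSON line. The function names and row keys
(load1, mem_total_kb, t_guest_mono_ns, ...) are hypothetical and not
taken from cis490_agent.py; the parsers take text so they can be
exercised off-guest.

```python
import json
import time


def parse_loadavg(text: str) -> dict:
    """Parse the first three fields of /proc/loadavg (1, 5, 15 minute load)."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text: str) -> dict:
    """Parse 'Key:   value kB' lines from /proc/meminfo into {key: value}."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            out[key.strip()] = int(fields[0])
    return out


def build_row(loadavg_text: str, meminfo_text: str) -> str:
    """Serialise one sample as a single newline-terminated JSON line."""
    mem = parse_meminfo(meminfo_text)
    row = {
        # Guest-side monotonic stamp; the host collector re-stamps on receipt.
        "t_guest_mono_ns": time.monotonic_ns(),
        **parse_loadavg(loadavg_text),
        "mem_total_kb": mem.get("MemTotal"),
        "mem_available_kb": mem.get("MemAvailable"),
    }
    return json.dumps(row, separators=(",", ":")) + "\n"
```

On a guest, the two text arguments would come from reading
/proc/loadavg and /proc/meminfo, and the returned line would be written
to the virtio port.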

Launchers:
  vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
    the agent socket; SLOT env support so multiple VMs run without
    socket / port collisions; PORT_BASE on launch_target so multiple
    target VMs hostfwd different host ports.
  vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
    no NAT). Idempotent.

Fleet:
  orchestrator/fleet.py — capacity detector (cores / RAM / load
    headroom) + concurrent-slot runner. Per-slot ENV selects the
    sample. FleetCapacity dataclass round-trips into meta.json so
    "this episode ran with 6 concurrent VMs" is auditable post-hoc.
  tools/run_fleet.py — CLI: --capacity report; --waves N runs N
    waves of (max_concurrent) episodes each, every slot with a
    different sample.
  etc/cis490-orchestrator.service — now drives the fleet runner with
    Restart=always so each invocation runs one wave and respawns,
    giving a continuous stream.

Samples:
  samples/manifest.toml — six profiles spanning the five major
    behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
  samples/manifest.py — strict TOML loader (rejects dups, unknown
    categories) + deterministic select(host_id, slot, episode_index)
    so different hosts on the network walk the catalog in different
    orders without any coordinator.

EpisodeRunner:
  orchestrator/episode.py — optional qmp_socket + guest_agent_socket
    fields on EpisodeConfig; when set, additional collector threads
    run alongside proc_qemu. EpisodeResult now carries rows_qmp +
    rows_guest counters.

Tier-3 setup automation:
  scripts/install-msfrpcd.sh — installs metasploit-framework where
    the package manager has it, generates a strong password into
    /etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
    127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
    once MSFRPC_PASSWORD is sourced.
  scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
    from the operator (Rapid7 download is registration-walled), pulls,
    verifies, converts vmdk → qcow2, lands at vm/images/.

Tests: 82 pass (was 51). New suites:
  tests/test_qmp.py — fake QMP server, capability handshake,
    blockstats, async-event interleaving, 5-failure backoff
  tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
    re-stamp, malformed-line tolerance
  tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames,
    bucketize correctness across windows
  tests/test_fleet.py — capacity math (8-core idle / low-RAM /
    high-load / Pi5 / 1-core box), manifest selection determinism
    + diversity

What's queued for the next commit (already discussed in convo):
  - MSFExploitDriver v2: map sample.profile → distinct in-session
    workload so Tier-3 episodes don't all produce the same yes-loop
    envelope. Critical for ML to learn varied malware shapes.
  - Real-sample fetch from MalwareBazaar by sha256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
105 lines
3.8 KiB
Python
"""Sample manifest loader + per-(host, slot) deterministic selection.

The manifest at ``samples/manifest.toml`` defines the catalog of
samples (real or mimic) the fleet draws from. Selection is
**deterministic** given ``(host_id, slot, episode_index)``, so two lab
hosts on the same fleet usually pick *different* samples for the same
slot index. Each pick is an independent hash of its inputs, so a host
can repeat a sample before it has visited the whole catalog.

This gives us "all hosts on the network generating novel data" without
needing a coordinator: every host's `host_id` seeds its own
sample-rotation order, and the orderings spread across the catalog.
"""

from __future__ import annotations

import hashlib
import tomllib
from dataclasses import dataclass, field
from pathlib import Path


_VALID_CATEGORIES = {
    "cryptominer", "botnet", "ransomware", "banking-trojan",
    "fileless", "rat", "worm", "loader", "wiper", "other",
}


@dataclass(frozen=True)
class Sample:
    name: str
    family: str
    category: str
    profile: str
    description: str = ""
    source: str | None = None
    sha256: str | None = None
    url: str | None = None

    @property
    def kind(self) -> str:
        """``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
        Trainers filter on this so the realistic-model pipeline only
        consumes real-malware episodes."""
        return "real" if self.sha256 else "mimic"


@dataclass(frozen=True)
class SampleManifest:
    samples: list[Sample] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.samples)

    def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
        """Deterministic selection. The host_id mixes into the seed so
        different hosts visit the catalog in different orders; slot +
        episode_index tick within a host. Same inputs always give the
        same sample — replay-friendly for debugging."""
        if not self.samples:
            raise ValueError("manifest is empty")
        # SHA-256 of the seed gives a uniformly distributed integer.
        seed = f"{host_id}|{slot}|{episode_index}".encode()
        h = hashlib.sha256(seed).digest()
        idx = int.from_bytes(h[:8], "big") % len(self.samples)
        return self.samples[idx]

    @classmethod
    def load(cls, path: str | Path) -> "SampleManifest":
        with open(path, "rb") as f:
            data = tomllib.load(f)
        raw = data.get("sample") or []
        if not isinstance(raw, list):
            raise ValueError(f"{path}: 'sample' must be an array of tables")

        samples: list[Sample] = []
        for i, entry in enumerate(raw):
            if not isinstance(entry, dict):
                raise ValueError(f"{path}: sample[{i}] is not a table")
            for key in ("name", "family", "category", "profile"):
                if not isinstance(entry.get(key), str) or not entry[key]:
                    raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
            if entry["category"] not in _VALID_CATEGORIES:
                raise ValueError(
                    f"{path}: sample[{i}] category {entry['category']!r} "
                    f"not in {sorted(_VALID_CATEGORIES)}"
                )
            samples.append(Sample(
                name=entry["name"],
                family=entry["family"],
                category=entry["category"],
                profile=entry["profile"],
                description=entry.get("description", ""),
                source=entry.get("source"),
                sha256=entry.get("sha256"),
                url=entry.get("url"),
            ))

        # Reject duplicate names — trainers join on this.
        seen: set[str] = set()
        for s in samples:
            if s.name in seen:
                raise ValueError(f"{path}: duplicate sample name {s.name!r}")
            seen.add(s.name)

        return cls(samples=samples)
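
A small sketch to make the selection behaviour concrete. pick_index
repeats the arithmetic from SampleManifest.select but returns only the
index. rotation_index is a hypothetical alternative, not something in
this file: it orders the catalog by a per-(host, slot) hash and walks
that order, which would guarantee no repeat until every sample has been
visited.

```python
import hashlib


def pick_index(host_id: str, slot: int, episode_index: int, n: int) -> int:
    """Same arithmetic as SampleManifest.select, returning only the index."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    return int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % n


def rotation_index(host_id: str, slot: int, episode_index: int, n: int) -> int:
    """Hypothetical variant: a hash-ordered permutation per (host, slot).

    Each catalog position i gets the sort key sha256("host|slot|i"), so
    the order is deterministic and differs across hosts and slots.
    """
    order = sorted(
        range(n),
        key=lambda i: hashlib.sha256(f"{host_id}|{slot}|{i}".encode()).digest(),
    )
    # Walk the permutation; wrap around once the catalog is exhausted.
    return order[episode_index % n]


if __name__ == "__main__":
    n = 6  # catalog size used in the commit message
    for host in ("host-a", "host-b"):
        hashed = [pick_index(host, 0, e, n) for e in range(n)]
        rotated = [rotation_index(host, 0, e, n) for e in range(n)]
        print(host, "hash picks:", hashed, "rotation:", rotated)
```

The hash picks are independent draws, so duplicates within the first n
episodes are expected; the rotation always prints a permutation of
0..n-1. The trade-off is that the rotation has to know the catalog size
and order up front, while the hash pick only needs the modulus.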