The user caught it: I shipped the theZoo path without running it
end-to-end. A real fetch on the Pi exposed two bugs:

1. The family-name matcher required the full name as a substring.
   "Cryptolocker-class" wouldn't match the dir "CryptoLocker_22Jan2014"
   because "-class" isn't in the dir name. The matcher now expands the
   name into a sequence of tokens (full, head-of-dash, head-of-dot,
   head-of-underscore) and tries each. First match wins.
2. The extraction picker was "largest non-text", a bad heuristic for
   theZoo, where each Linux.* zip often contains MULTIPLE binaries
   for different platforms (Linux i386, x86-64, ARM, FreeBSD, sometimes
   even Windows PE). The largest is rarely the i386 Linux ELF that
   would actually run on Metasploitable2. The picker now sniffs ELF
   magic bytes using only the stdlib and picks by tier:
     1. Linux i386 ELF (largest first)
     2. any other ELF (best-effort, may not execute)
     3. largest non-text (Wine fallback)
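
For reference, a minimal sketch of the tiered picker described above. It is not the code in tools/auto_fetch_samples.py: the function names, the choice to treat EI_OSABI 0 (System V) and 3 (GNU/Linux) as "Linux", and the NUL-byte test for "non-text" are all assumptions made for illustration.

```python
import struct
from typing import Dict, Optional

ELF_MAGIC = b"\x7fELF"
ELFCLASS32 = 1       # EI_CLASS: 32-bit objects
ELFDATA2LSB = 1      # EI_DATA: little-endian
EM_386 = 3           # e_machine: Intel 80386
# Assumption: OSABI 0 (System V / unspecified) and 3 (GNU/Linux) count as Linux.
LINUX_OSABIS = {0, 3}


def is_elf(data: bytes) -> bool:
    """True if the buffer starts with the 4-byte ELF magic."""
    return data[:4] == ELF_MAGIC


def looks_like_linux_i386_elf(data: bytes) -> bool:
    """Check class, byte order, OSABI and e_machine in the ELF header."""
    if len(data) < 20 or not is_elf(data):
        return False
    ei_class, ei_data, osabi = data[4], data[5], data[7]
    if ei_class != ELFCLASS32 or ei_data != ELFDATA2LSB:
        return False
    if osabi not in LINUX_OSABIS:
        return False
    # e_machine is a 16-bit field at offset 18 of the header.
    (e_machine,) = struct.unpack_from("<H", data, 18)
    return e_machine == EM_386


def looks_like_text(data: bytes) -> bool:
    """Crude heuristic (assumption): no NUL byte in the first 512 bytes."""
    return b"\x00" not in data[:512]


def pick_sample(candidates: Dict[str, bytes]) -> Optional[str]:
    """Return the member name chosen by the three tiers, largest first."""
    by_size = sorted(candidates.items(), key=lambda kv: len(kv[1]), reverse=True)
    tiers = (
        looks_like_linux_i386_elf,          # tier 1: Linux i386 ELF
        is_elf,                             # tier 2: any other ELF
        lambda d: not looks_like_text(d),   # tier 3: largest non-text
    )
    for accepts in tiers:
        for name, data in by_size:
            if accepts(data):
                return name
    return None
```

Each tier scans the members in descending size order, so a small i386 ELF still beats a larger x86-64 or FreeBSD build in the same zip.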

Verified end-to-end on the Pi against a real theZoo clone (~500 MB,
263 family dirs, 2026-05-01 fresh pull):

  linux-encoder-ransomware → ELF 32-bit Intel i386 SYSV (278 KB)
  linux-wirenet-rat        → ELF 32-bit Intel i386 SYSV (64 KB)
  linux-rex-ransomware     → ELF 32-bit Intel i386 SYSV Go (7.6 MB)
  linux-neurevt-bot        → ELF 32-bit Intel i386 SYSV (3.0 MB)
  linux-earthkrahang-apt   → ELF 32-bit Intel i386 GNU/Linux (5.8 MB)

5/5 picks are runnable Linux i386 ELFs. The in-place manifest rewrite
adds source/sha256/url; meta.sample.kind goes to "real" automatically.
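
A rough sketch of how meta.sample.kind could follow from the manifest entry, assuming the orchestrator only checks for a staged file at samples/store/<sha256>. The function names are hypothetical, and re-hashing the staged file before trusting it is an extra safeguard added here, not something the orchestrator is known to do.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_sample_kind(entry: dict, store: Path) -> str:
    """Return "real" if the entry's binary is staged in the store, else "mimic"."""
    expected = entry.get("sha256")
    if not expected:
        return "mimic"      # manifest entry was never rewritten by the fetcher
    staged = store / expected
    # Assumption: verify the content hash instead of trusting the file name.
    if staged.is_file() and sha256_of(staged) == expected:
        return "real"
    return "mimic"
```

With this shape, the mimic-only entries (no sha256 key) resolve to "mimic" without any special-casing.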

Manifest rewritten:
- Old families (XMRig, Mirai, Cryptolocker-class, Dridex, Kovter,
  Reverse-Shell) were mostly absent from theZoo's Linux catalog or
  matched the wrong arch.
- New families chosen against a verified theZoo presence list:
  Linux.Encoder, Linux.Wirenet, Ransomware.Rex, Neurevt,
  EarthKrahang.
- XMRig + Kovter remain as mimic-only fallbacks (theZoo lacks a
  runnable Linux i386 binary for these; the orchestrator falls back
  to the mimic profile).

Tests added (tests/test_auto_fetch_samples.py): 13 cases covering
ELF magic detection (i386 accepted, FreeBSD/x86-64/ARM/PE32/text
all rejected), family-token expansion (the "-class" suffix bug),
the extraction picker (prefers Linux i386 over larger non-Linux ELFs),
and the manifest in-place rewrite (preserves mode, skips entries that
already have sha256).
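
The family-token expansion these tests exercise can be sketched roughly as follows. The function names are hypothetical, and the case-insensitive substring comparison is an assumption inferred from the "Cryptolocker" vs "CryptoLocker_22Jan2014" example, not taken from the actual matcher.

```python
from typing import Iterable, List, Optional


def family_tokens(family: str) -> List[str]:
    """Expand a family name into: full, head-of-dash, head-of-dot, head-of-underscore."""
    tokens = [family]
    for sep in ("-", ".", "_"):
        head = family.split(sep, 1)[0]
        if head and head not in tokens:     # skip duplicates and empty heads
            tokens.append(head)
    return tokens


def match_family_dir(family: str, dir_names: Iterable[str]) -> Optional[str]:
    """Try each token in order against every dir name; first match wins."""
    names = list(dir_names)
    for token in family_tokens(family):
        needle = token.lower()              # assumption: case-insensitive match
        for name in names:
            if needle in name.lower():
                return name
    return None
```

Because the full name is tried before any shortened head, "Linux.Encoder" matches its own directory before the much broader "Linux" token gets a chance.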

What's still NOT verified end-to-end (requires a lab host with
KVM x86):
- Metasploitable2 boot under QEMU
- vsftpd_234_backdoor exploit fire via msfrpcd
- chunked binary upload through a real shell session
- real binary executing inside a Metasploitable2 guest

The Pi is ARM64, so it can't run Metasploitable2. install-tier-3-4.sh's
verify step (run_tier3_demo.py) covers all four on a real lab host;
the deploy verifies them on first run there.

171/171 tests pass.
77 lines
3.1 KiB
TOML
# Sample manifest: what each fleet slot picks from.
#
# Each entry has three things:
#   - identity     (name, family, category)  for labeling
#   - acquisition  (source, sha256, url)     for reproducibility
#   - behaviour    (profile)                 so the synthetic load mimic can run a
#                  reasonable proxy until the real sample lands at samples/store/.
#
# When the real malware binary is present at samples/store/<sha256>,
# the orchestrator runs THAT inside the guest. When it's absent, the
# orchestrator falls back to the mimic workload with the matching
# profile so the fleet still produces *labeled, varied* data while
# we collect the real samples. Either way, meta.json records which
# path the episode took, so trainers can filter on
# meta.sample.kind ∈ {real, mimic}.
#
# Families below are CHOSEN AND TESTED to match theZoo entries that
# contain a Linux 32-bit Intel 80386 ELF binary, i.e. binaries that
# will execute natively inside our Metasploitable2 (Ubuntu 8.04 i386)
# target VM. Verified against a fresh theZoo clone on 2026-05-01;
# tools/auto_fetch_samples.py prefers the Linux-i386 ELF in each
# multi-binary zip via `_is_linux_i386_elf` magic-byte sniffing.

[[sample]]
name = "linux-encoder-ransomware"
family = "Linux.Encoder"
category = "ransomware"
profile = "io-walk"
description = "Linux.Encoder.1 (Linux i386 ELF). The first known Linux ransomware. Heavy disk write + fs walk producing a per-file overwrite envelope."

[[sample]]
name = "linux-wirenet-rat"
family = "Linux.Wirenet"
category = "rat"
profile = "shell-resident"
description = "Linux.Wirenet (Linux i386 ELF). RAT with a long-lived TCP socket pinned to a fixed peer; occasional command bursts."

[[sample]]
name = "linux-rex-ransomware"
family = "Ransomware.Rex"
category = "ransomware"
profile = "io-walk"
description = "Ransomware.Rex (Linux i386 ELF, written in Go). File-walk encryption envelope with periodic CPU spikes during AES."

[[sample]]
name = "linux-neurevt-bot"
family = "Neurevt"
category = "botnet"
profile = "scan-and-dial"
description = "Neurevt 1.7 (Linux i386 ELF). Botnet panel binary; SYN scans + periodic dial-home pattern."

[[sample]]
name = "linux-earthkrahang-apt"
family = "EarthKrahang"
category = "rat"
profile = "bursty-c2"
description = "EarthKrahang 2024 (Linux i386 ELF). APT backdoor; long idle + periodic small TCP egress bursts."

# Mimic-only fallback families. theZoo doesn't have a clean Linux i386
# binary for these; auto_fetch_samples.py logs a warning and the
# orchestrator stays on the mimic workload until a real binary is
# staged manually at samples/store/<sha256>. Kept here so the trainer
# can still collect cpu-saturate and low-and-slow envelopes (those
# profiles' theZoo coverage is sparse).
[[sample]]
name = "xmrig-cryptominer"
family = "XMRig"
category = "cryptominer"
profile = "cpu-saturate"
description = "Mimic only on Metasploitable2 (no Linux-i386 XMRig in theZoo)."

[[sample]]
name = "kovter-class-stealth"
family = "Kovter"
category = "fileless"
profile = "low-and-slow"
description = "Mimic only: Kovter is Windows-native; theZoo's binary won't run on Metasploitable2 i386."