Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.

What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
  module.execute, job/session lifecycle) so we don't depend on a
  third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
  session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
  TOML module configs with {{ target_ip }} placeholders, replacing the
  imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
  intentionally-vulnerable target VM (host can reach guest via
  hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
  including an integration test that drives a real EpisodeRunner.

Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
  share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
  before run() (driver_setup fires pre-schedule).

Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Sources & Works Cited

Every external thing this project depends on, leans on for design, or pulls
samples from. Grouped by category. Where relevant, we note the role each
thing plays in our pipeline.

---

## Prior work / academic

- **A Trust Model for Resource-Constrained IoT Devices Based on Performance
  Metrics.** IEEE Document 9881803.
  https://ieeexplore.ieee.org/document/9881803
  *Role:* prerequisite paper for this project. Frames detection as a
  trust-over-time score rather than a single-snapshot classifier.

- **Understanding the Mirai Botnet** (Antonakakis et al., USENIX Security
  2017).
  https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/antonakakis
  *Role:* analysis of Mirai, the Linux/IoT botnet that spread through weak
  Telnet/SSH credentials. It is the canonical real-world Linux compromise
  pattern that motivates our chosen attack vector ("SSH weak creds → drop
  payload"). The behavioral envelope our model targets is shaped by
  Mirai-class workloads.

- **Linux man pages, `proc(5)`** — kernel ABI for /proc.
  https://man7.org/linux/man-pages/man5/proc.5.html
  *Role:* canonical reference for `/proc/<pid>/{stat,io,status,schedstat}`
  field layout used by `collectors/proc_qemu.py`.

- **Linux `perf_event_open(2)` man page.**
  https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  *Role:* the syscall that backs `perf stat` and any in-process
  hardware-counter reads. Both the planned host-side `perf_qemu` collector
  and the in-guest agent will read from this surface.
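
As an illustration of the `proc(5)` field layout cited above, here is a minimal sketch of parsing a `/proc/<pid>/stat` line. It is not the code in `collectors/proc_qemu.py`; the names `ProcStat` and `parse_proc_stat` are hypothetical, and only a handful of fields are pulled out. The one subtlety worth showing is that field 2 (`comm`) is wrapped in parentheses and may itself contain spaces or `)`, so the line is split on the last `)` rather than on whitespace alone.

```python
from typing import NamedTuple


class ProcStat(NamedTuple):
    """A few fields from /proc/<pid>/stat (hypothetical container)."""
    pid: int
    comm: str
    state: str
    utime_ticks: int   # field 14: user-mode CPU time, in clock ticks
    stime_ticks: int   # field 15: kernel-mode CPU time, in clock ticks
    num_threads: int   # field 20


def parse_proc_stat(raw: str) -> ProcStat:
    # comm may contain spaces and parentheses, so locate the first '('
    # and the LAST ')' instead of splitting the whole line on whitespace.
    lparen = raw.index("(")
    rparen = raw.rindex(")")
    pid = int(raw[:lparen])
    comm = raw[lparen + 1:rparen]
    rest = raw[rparen + 1:].split()

    def field(n: int) -> str:
        # rest[0] is field 3 (state), so proc(5) field n lives at rest[n - 3].
        return rest[n - 3]

    return ProcStat(
        pid=pid,
        comm=comm,
        state=field(3),
        utime_ticks=int(field(14)),
        stime_ticks=int(field(15)),
        num_threads=int(field(20)),
    )


# Usage on a live system (Linux only):
#   with open("/proc/self/stat") as f:
#       print(parse_proc_stat(f.read()))
```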

---

## Virtualization & operating system

- **QEMU** (10.2.0 in our lab). https://www.qemu.org
  *License:* GPL-2.0-or-later. *Role:* the hypervisor running guest VMs;
  we drive it via QMP for oracle telemetry.

- **KVM** (Linux kernel module). https://www.linux-kvm.org
  *License:* GPL-2.0. *Role:* hardware-accelerated virtualization backend
  for QEMU.

- **Linux kernel** (6.18.x lab host). https://www.kernel.org
  *License:* GPL-2.0. *Role:* host kernel; supplies /proc, /sys, perf, KVM,
  cgroups, virtio-serial.

- **systemd** — service supervisor for the receiver and the (planned)
  orchestrator and shipper daemons. https://systemd.io
  *License:* LGPL-2.1-or-later.

---

## VMs and intentionally-vulnerable images

- **Alpine Linux 3.21 cloud-init nocloud image** (current tier-2 guest).
  https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/
  File: `nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2`
  SHA-512: `bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98`
  *License:* MIT for image, GPL/various for contents. *Role:* small
  (~180 MiB) Linux image with cloud-init that picks up our NoCloud
  cidata ISO at first boot. SSH-pwauth and a known root password are
  set via `runcmd`. Used as the in-guest workload host for tier-2 runs
  and as the post-snapshot baseline for the qcow2 snapshot loop.

- **CirrOS 0.6.3** (initially tried, currently unused).
  https://download.cirros-cloud.net/0.6.3/
  SHA-256 of `cirros-0.6.3-x86_64-disk.img`:
  `7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b`
  *License:* GPL. *Role:* tiny (~21 MiB) test image; abandoned for this
  project because CirrOS 0.6.x's `cirros-init` checks the EC2 metadata
  service before NoCloud and the failure-retry loop took ~17 minutes to
  fall through. Kept in the manifest in case the simpler image is
  useful for a later size-constrained scenario.

- **Metasploitable 2** (planned for the exploit phase, Rapid7).
  https://information.rapid7.com/download-metasploitable-2017.html
  *Role:* purposely vulnerable Linux VM whose services have stable
  Metasploit modules (vsftpd 2.3.4 backdoor, distccd RCE,
  Samba `usermap_script`, PHP CGI arg injection, etc.) — gives us
  reproducible exploit fire for the `armed → infecting` transition.

- **Metasploitable 3** (Rapid7, optional later, Vagrant-built).
  https://github.com/rapid7/metasploitable3
  *Role:* heavier, with Windows and Linux variants; reserved for adding
  diversity to the dataset if time allows.
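
The image entries above are pinned by SHA-512 or SHA-256 digest. A minimal sketch of how such a pin could be checked before an image is used is shown below; the function names `file_digest` and `verify_pinned` are hypothetical and not taken from this repo. The file is read in chunks so a multi-hundred-MiB qcow2 never has to fit in memory, and a mismatch is reported as `False` so the caller can decide whether to abort.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha512",
                chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def verify_pinned(path: str, expected_hex: str,
                  algorithm: str = "sha512") -> bool:
    """True only if the file's digest matches the pinned value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Usage (the path is illustrative; the digest would be the one pinned above):
#   ok = verify_pinned("images/alpine.qcow2", PINNED_SHA512)
#   if not ok:
#       raise SystemExit("image digest mismatch, refusing to boot")
```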

---

## Exploitation framework

- **Metasploit Framework** (Rapid7).
  https://github.com/rapid7/metasploit-framework
  *License:* BSD-3-Clause. *Role:* drives the exploit fire step
  programmatically via `msfrpc`, so episodes label `armed → infecting`
  transitions on `session_open` rather than guessing from metrics.

- **Exploit-DB** (Offensive Security).
  https://www.exploit-db.com
  *Role:* cross-reference for CVE → public PoC, where Metasploit doesn't
  cover a vulnerability we want.

---

## Public malware sample sources

> **All samples used in this project are pre-existing, public, and
> hash-pinned.** We do not author novel malware or exploits.
> Sample binaries are NEVER committed to the repo — see
> [`samples/README.md`](../samples/README.md) for safety rules.

- **MalwareBazaar** (abuse.ch). https://bazaar.abuse.ch
  *Role:* primary sample fetch source. Provides API + sha256 lookup. Used
  for cryptominers (XMRig variants), webshells, and Linux ELF samples.

- **theZoo** (a public live-malware repository). https://thezoo.morirt.com
  https://github.com/ytisf/theZoo
  *Role:* secondary source for older/rarer samples. Categorized by family.

- **vx-underground** (collection of malware research artifacts).
  https://vx-underground.org
  *Role:* tertiary source; useful for academic context and
  family-attribution metadata.

---

## Standards & specifications

- **ULID — Universally Unique Lexicographically Sortable Identifier.**
  https://github.com/ulid/spec
  *Role:* episode IDs. 26-char Crockford base32, time-sortable, no
  coordinator. Implemented in `orchestrator/ulid.py`.

- **JSON Lines.** https://jsonlines.org
  *Role:* on-disk telemetry, label, and event format. Append-only,
  crash-safe, trivially loadable as a DataFrame.

- **PEP 735 — dependency groups.**
  https://peps.python.org/pep-0735/
  *Role:* `pyproject.toml` dependency grouping (the `dev` group).

- **Crockford base32.** https://www.crockford.com/base32.html
  *Role:* alphabet for ULIDs.
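
To make the ULID and JSON Lines entries above concrete, the following sketch encodes a ULID as the spec describes (48-bit millisecond timestamp followed by 80 random bits, rendered as 26 Crockford base32 characters) and appends one JSON object per line to an event file. It is an independent illustration, not the contents of `orchestrator/ulid.py`; the function names and the record fields (`episode_id`, `event`, `t_mono`) are assumptions made for the example. Because the timestamp occupies the leading characters and the alphabet is in ASCII order, plain string sorting orders IDs by creation time.

```python
import json
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit ms timestamp and 10 random bytes as a 26-char ULID."""
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):               # 26 * 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))   # most significant character first


def new_ulid() -> str:
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))


def append_event(path: str, episode_id: str, event: str, t_mono: float) -> None:
    """Append one event as a single JSON line (hypothetical record shape)."""
    record = {"episode_id": episode_id, "event": event, "t_mono": t_mono}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_events(path: str) -> list:
    """Load every non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage:
#   eid = new_ulid()
#   append_event("events.jsonl", eid, "exploit_fire", time.monotonic())
```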

---

## Python runtime & libraries

- **Python 3.11+** — runtime requirement. https://www.python.org
- **uv** (Astral) — Python project + dependency manager.
  https://github.com/astral-sh/uv
- **Starlette** — ASGI framework for the receiver.
  https://www.starlette.io
- **Uvicorn** — ASGI server. https://www.uvicorn.org
- **httptools, websockets, watchfiles, python-dotenv, pyyaml** — Uvicorn
  `[standard]` extras.
- **pytest** — test runner. https://docs.pytest.org
- **pytest-asyncio** — async test support.
  https://github.com/pytest-dev/pytest-asyncio
- **httpx** — async HTTP client used for receiver tests via ASGITransport.
  https://www.python-httpx.org
- **matplotlib** + **numpy** — plotting (envelope visualization only).
  https://matplotlib.org / https://numpy.org
- **tornado** — required by matplotlib's WebAgg interactive backend.
  https://www.tornadoweb.org
- **paramiko** — SSH client used for in-guest control on cloud images
  that support it. https://www.paramiko.org
- **pycdlib** — pure-Python ISO9660/Joliet/Rock Ridge builder. Used to
  produce the NoCloud cidata ISO without depending on system
  mkisofs/xorriso. https://clalancette.github.io/pycdlib/
- **msgpack** — binary serialization used by Metasploit's RPC API. The
  Tier-3 driver speaks msfrpcd's native msgpack-over-HTTPS so we don't
  pull in a higher-level Metasploit Python client.
  https://msgpack.org

---

## Lab infrastructure (spectral org, .wg overlay)

These are not part of this repo's code, but they are the platform the
pipeline runs on. See [`reference_wg_infra` memory] for context.

- **WireGuard** — VPN tunnel for the `.wg` overlay.
  https://www.wireguard.com *License:* GPL-2.0.
- **Caddy** — reverse proxy in front of the receiver, terminates internal
  TLS via `tls internal`. https://caddyserver.com
  *License:* Apache-2.0.
- **Forgejo** — self-hosted git host at `maxgit.wg`.
  https://forgejo.org *License:* GPL-3.0+.
- **Raspberry Pi 5** — central WG-side collector hardware (the receiver
  + dataset store run here). NOT the deployment target for the model.

---

## How to cite this dataset (placeholder)

When the dataset reaches a publishable form, the canonical citation will be
added here. Until then, a short course-project citation is fine:

> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.

---

## Maintenance

When you add a new dependency, sample source, or external tool, add it
here in the same session. A "works cited" file with stale citations is
worse than none.