CIS490/docs/sources.md
max 613c6fa223 Tier 3: msfrpc-driven exploit driver + first module config
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.

What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
  module.execute, job/session lifecycle) so we don't depend on a
  third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
  session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
  TOML module configs with {{ target_ip }} placeholders, replacing the
  imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
  intentionally-vulnerable target VM (host can reach guest via
  hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
  including an integration test that drives a real EpisodeRunner.

Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
  share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
  before run() (driver_setup fires pre-schedule).

Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:52 -05:00


# Sources & Works Cited
Every external resource this project depends on, draws on for design, or
pulls samples from, grouped by category. Where relevant, we note the role
each one plays in our pipeline.

---
## Prior work / academic
- **A Trust Model for Resource-Constrained IoT Devices Based on Performance
Metrics.** IEEE Document 9881803.
https://ieeexplore.ieee.org/document/9881803
*Role:* prerequisite paper for this project. Frames detection as a
trust-over-time score rather than a single-snapshot classifier.
- **Understanding the Mirai Botnet**
(Antonakakis et al., USENIX Security 2017).
https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/antonakakis
*Role:* analysis of Mirai, the Linux/IoT botnet that spread through weak
SSH/Telnet credentials. It is the canonical real-world Linux compromise
pattern that motivates our chosen attack vector ("SSH weak creds → drop
payload"). The behavioral envelope our model targets is shaped by
Mirai-class workloads.
- **Linux man pages, `proc(5)`** — kernel ABI for /proc.
https://man7.org/linux/man-pages/man5/proc.5.html
*Role:* canonical reference for `/proc/<pid>/{stat,io,status,schedstat}`
field layout used by `collectors/proc_qemu.py`.
- **Linux `perf_event_open(2)` man page.**
https://man7.org/linux/man-pages/man2/perf_event_open.2.html
*Role:* the syscall that backs `perf stat` and any in-process hardware-
counter reads. Both the planned host-side `perf_qemu` collector and the
in-guest agent will read from this surface.
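
To illustrate why the `proc(5)` field layout matters, here is a minimal sketch of parsing one `/proc/<pid>/stat` line. It is not the code in `collectors/proc_qemu.py`; the helper name `parse_proc_stat` and the returned keys are hypothetical. The subtlety it shows is that the `comm` field may contain spaces or parentheses, so the line is split at the last `)` rather than on whitespace alone.

```python
import os


def parse_proc_stat(line: str) -> dict:
    """Parse one /proc/<pid>/stat line into a few named fields (hypothetical helper)."""
    # comm (field 2) is wrapped in parentheses and may itself contain spaces
    # or ')' characters, so split on the LAST ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state)
    # proc(5) numbers fields from 1; field N (N >= 3) lives at rest[N - 3].
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": rest[0],
        "utime_ticks": int(rest[14 - 3]),  # field 14: user-mode CPU time
        "stime_ticks": int(rest[15 - 3]),  # field 15: kernel-mode CPU time
        "num_threads": int(rest[20 - 3]),  # field 20
        "rss_pages": int(rest[24 - 3]),    # field 24: resident set size, in pages
    }


def ticks_to_seconds(ticks: int) -> float:
    """Convert clock ticks to seconds using the host's SC_CLK_TCK (Unix only)."""
    return ticks / os.sysconf("SC_CLK_TCK")
```

A collector would typically sample this at a fixed interval and difference `utime_ticks` and `stime_ticks` between samples to get CPU usage over the window.
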
---
## Virtualization & operating system
- **QEMU** (10.2.0 in our lab). https://www.qemu.org
*License:* GPL-2.0 for QEMU as a whole (many individual files are
GPL-2.0-or-later). *Role:* the hypervisor running guest VMs; we drive it
via QMP for oracle telemetry.
- **KVM** (Linux kernel module). https://www.linux-kvm.org
*License:* GPL-2.0. *Role:* hardware-accelerated virtualization backend
for QEMU.
- **Linux kernel** (6.18.x lab host). https://www.kernel.org
*License:* GPL-2.0. *Role:* host kernel; supplies /proc, /sys, perf, KVM,
cgroups, virtio-serial.
- **systemd** — service supervisor for the receiver and the (planned)
orchestrator and shipper daemons. https://systemd.io
*License:* LGPL-2.1-or-later.
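
For readers unfamiliar with QMP, the following is a minimal sketch of the wire protocol: newline-delimited JSON over a socket, a server greeting, a mandatory `qmp_capabilities` handshake, then commands. The function names `qmp_connect` and `qmp_execute` are hypothetical and are not taken from this repo; the sketch only assumes an already-connected socket.

```python
import json
import socket


def qmp_execute(stream, command, arguments=None):
    """Send one QMP command and return the payload of its "return" reply."""
    request = {"execute": command}
    if arguments is not None:
        request["arguments"] = arguments
    stream.write(json.dumps(request).encode("utf-8") + b"\r\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        reply = json.loads(line)
        if "event" in reply:
            continue  # asynchronous events may arrive before the reply; skip them
        if "error" in reply:
            raise RuntimeError(reply["error"].get("desc", "QMP error"))
        return reply["return"]


def qmp_connect(sock: socket.socket):
    """Read the QMP greeting and leave capabilities-negotiation mode."""
    stream = sock.makefile("rwb")
    greeting = json.loads(stream.readline())
    if "QMP" not in greeting:
        raise ConnectionError("unexpected QMP greeting")
    qmp_execute(stream, "qmp_capabilities")
    return stream
```

With a VM exposing a QMP Unix socket, one would connect an `AF_UNIX` socket to that path, call `qmp_connect(sock)`, and then issue something like `qmp_execute(stream, "query-status")`. Which commands the pipeline actually uses for oracle telemetry is not specified here.
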
---
## VMs and intentionally-vulnerable images
- **Alpine Linux 3.21 cloud-init nocloud image** (current tier-2 guest).
https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/
File: `nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2`
SHA-512: `bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98`
*License:* MIT for image, GPL/various for contents. *Role:* small
(~180 MiB) Linux image with cloud-init that picks up our NoCloud
cidata ISO at first boot. SSH-pwauth and a known root password are
set via `runcmd`. Used as the in-guest workload host for tier-2 runs
and as the post-snapshot baseline for the qcow2 snapshot loop.
- **Cirros 0.6.3** (initially tried, currently unused).
https://download.cirros-cloud.net/0.6.3/
SHA-256 of `cirros-0.6.3-x86_64-disk.img`:
`7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b`
*License:* GPL. *Role:* tiny (~21 MiB) test image; abandoned for this
project because Cirros 0.6.x's `cirros-init` checks the EC2 metadata
service before NoCloud and the failure-retry loop took ~17 minutes to
fall through. Kept in the manifest in case the simpler image is
useful for a later size-constrained scenario.
- **Metasploitable 2** (planned for the exploit phase, Rapid7).
https://information.rapid7.com/download-metasploitable-2017.html
*Role:* purposely vulnerable Linux VM whose services have stable
Metasploit modules (vsftpd 2.3.4 backdoor, distccd RCE,
Samba `usermap_script`, PHP CGI arg injection, etc.) — gives us
reproducible exploit fire for the `armed → infecting` transition.
- **Metasploitable 3** (Rapid7, optional later, Vagrant-built).
https://github.com/rapid7/metasploitable3
*Role:* heavier, with Windows and Linux variants; reserved for adding
diversity to the dataset if time allows.
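
The image entries above are pinned by SHA-512 or SHA-256. A minimal sketch of how such a pin can be checked before a download is used is shown here; `file_digest` and `verify_image` are hypothetical helper names, not functions from this repo.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1024 * 1024) -> str:
    """Hash a file in chunks so large qcow2 images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path, expected_hex, algorithm="sha512") -> bool:
    """Return True only if the file's digest matches the pinned hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For the Alpine image the call would use `algorithm="sha512"` with the digest listed above; for the Cirros image, `algorithm="sha256"`. A fetch script would normally refuse to proceed when `verify_image` returns `False`.
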
---
## Exploitation framework
- **Metasploit Framework** (Rapid7).
https://github.com/rapid7/metasploit-framework
*License:* BSD-3-Clause. *Role:* drives the exploit fire step
programmatically via `msfrpc`, so episodes label `armed → infecting`
transitions on `session_open` rather than guessing from metrics.
- **Exploit-DB** (Offensive Security).
https://www.exploit-db.com
*Role:* cross-reference for CVE → public PoC, where Metasploit doesn't
cover a vulnerability we want.
---
## Public malware sample sources
> **All samples used in this project are pre-existing, public, and
> hash-pinned.** We do not author novel malware or exploits.
> Sample binaries are NEVER committed to the repo — see
> [`samples/README.md`](../samples/README.md) for safety rules.
- **MalwareBazaar** (abuse.ch). https://bazaar.abuse.ch
*Role:* primary sample fetch source. Provides API + sha256 lookup. Used
for cryptominers (XMRig variants), webshells, and Linux ELF samples.
- **theZoo** (a public live-malware repository). https://thezoo.morirt.com
https://github.com/ytisf/theZoo
*Role:* secondary source for older/rarer samples. Categorized by family.
- **vx-underground** (collection of malware research artifacts).
https://vx-underground.org
*Role:* tertiary source; useful for academic context and
family-attribution metadata.
---
## Standards & specifications
- **ULID — Universally Unique Lexicographically Sortable Identifier.**
https://github.com/ulid/spec
*Role:* episode IDs. 26-char Crockford base32, time-sortable, no
coordinator. Implemented in `orchestrator/ulid.py`.
- **JSON Lines.** https://jsonlines.org
*Role:* on-disk telemetry, label, and event format. Append-only,
tolerant of crashes (at worst the final line is truncated), and trivially
loadable as a DataFrame.
- **PEP 735 — dependency groups.**
https://peps.python.org/pep-0735/
*Role:* `pyproject.toml` dependency grouping (the `dev` group).
- **Crockford base32.** https://www.crockford.com/base32.html
*Role:* alphabet for ULIDs.
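
To make the ULID layout concrete, here is a minimal sketch following the public spec: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is an illustration only and is not the contents of `orchestrator/ulid.py`; the function name `new_ulid` and its parameters are hypothetical.

```python
import os
import time

# Crockford base32 alphabet: digits plus letters, excluding I, L, O, and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None, randomness=None) -> str:
    """Build a 26-character ULID from a 48-bit timestamp and 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    # Timestamp occupies the high bits, so string order follows time order.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits; the top 2 bits are always zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp leads, sorting episode directories by name also sorts them by creation time, and no central coordinator is needed to hand out IDs.
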
---
## Python runtime & libraries
- **Python 3.11+** — runtime requirement. https://www.python.org
- **uv** (Astral) — Python project + dependency manager.
https://github.com/astral-sh/uv
- **Starlette** — ASGI framework for the receiver.
https://www.starlette.io
- **Uvicorn** — ASGI server. https://www.uvicorn.org
- **httptools, websockets, watchfiles, python-dotenv, pyyaml** — Uvicorn
`[standard]` extras.
- **pytest** — test runner. https://docs.pytest.org
- **pytest-asyncio** — async test support.
https://github.com/pytest-dev/pytest-asyncio
- **httpx** — async HTTP client used for receiver tests via ASGITransport.
https://www.python-httpx.org
- **matplotlib** + **numpy** — plotting (envelope visualization only).
https://matplotlib.org / https://numpy.org
- **tornado** — required by matplotlib's WebAgg interactive backend.
https://www.tornadoweb.org
- **paramiko** — SSH client used for in-guest control on cloud images
that support it. https://www.paramiko.org
- **pycdlib** — pure-Python ISO9660/Joliet/Rock Ridge builder. Used to
produce the NoCloud cidata ISO without depending on system mkisofs/
xorriso. https://clalancette.github.io/pycdlib/
- **msgpack** — binary serialization used by Metasploit's RPC API. The
Tier-3 driver speaks msfrpcd's native msgpack-over-HTTPS so we don't
pull in a higher-level Metasploit Python client.
https://msgpack.org
---
## Lab infrastructure (spectral org, .wg overlay)
These are not part of this repo's code, but they are the platform the
pipeline runs on. See [`reference_wg_infra` memory] for context.
- **WireGuard** — VPN tunnel for the `.wg` overlay.
https://www.wireguard.com *License:* GPL-2.0.
- **Caddy** — reverse proxy in front of the receiver, terminates internal
TLS via `tls internal`. https://caddyserver.com
*License:* Apache-2.0.
- **Forgejo** — self-hosted git host at `maxgit.wg`.
https://forgejo.org *License:* GPL-3.0-or-later.
- **Raspberry Pi 5** — central WG-side collector hardware (the receiver
+ dataset store run here). NOT the deployment target for the model.
---
## How to cite this dataset (placeholder)
When the dataset reaches a publishable form, the canonical citation will be
added here. Until then, a short course-project citation is fine:
> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.
---
## Maintenance
When you add a new dependency, sample source, or external tool, add it
here in the same session. A "works cited" file with stale citations is
worse than none.