CIS490/docs/sources.md
max 613c6fa223 Tier 3: msfrpc-driven exploit driver + first module config
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.

What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
  module.execute, job/session lifecycle) so we don't depend on a
  third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
  session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
  TOML module configs with {{ target_ip }} placeholders, replacing the
  imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
  intentionally-vulnerable target VM (host can reach guest via
  hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
  including an integration test that drives a real EpisodeRunner.

Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
  share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
  before run() (driver_setup fires pre-schedule).

Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:52 -05:00


Sources & Works Cited

This file lists every external resource this project depends on, leans on for design, or pulls samples from, grouped by category. Where relevant, each entry notes the role it plays in our pipeline.


Prior work / academic


Virtualization & operating system

  • QEMU (10.2.0 in our lab). https://www.qemu.org License: GPL-2.0-or-later. Role: the hypervisor running guest VMs; we drive it via QMP for oracle telemetry.

  • KVM (Linux kernel module). https://www.linux-kvm.org License: GPL-2.0. Role: hardware-accelerated virtualization backend for QEMU.

  • Linux kernel (6.18.x lab host). https://www.kernel.org License: GPL-2.0. Role: host kernel; supplies /proc, /sys, perf, KVM, cgroups, virtio-serial.

  • systemd. https://systemd.io License: LGPL-2.1-or-later. Role: service supervisor for the receiver and the (planned) orchestrator and shipper daemons.


VMs and intentionally-vulnerable images

  • Alpine Linux 3.21 cloud-init nocloud image (current tier-2 guest). https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/ File: nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2 SHA-512: bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98 License: MIT for image, GPL/various for contents. Role: small (~180 MiB) Linux image with cloud-init that picks up our NoCloud cidata ISO at first boot. SSH-pwauth and a known root password are set via runcmd. Used as the in-guest workload host for tier-2 runs and as the post-snapshot baseline for the qcow2 snapshot loop.

  • Cirros 0.6.3 (initially tried, currently unused). https://download.cirros-cloud.net/0.6.3/ SHA-256 of cirros-0.6.3-x86_64-disk.img: 7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b License: GPL. Role: tiny (~21 MiB) test image; abandoned for this project because Cirros 0.6.x's cirros-init checks the EC2 metadata service before NoCloud and the failure-retry loop took ~17 minutes to fall through. Kept in the manifest in case the simpler image is useful for a later size-constrained scenario.

  • Metasploitable 2 (planned for the exploit phase, Rapid7). https://information.rapid7.com/download-metasploitable-2017.html Role: purposely vulnerable Linux VM whose services have stable Metasploit modules (vsftpd 2.3.4 backdoor, distccd RCE, Samba usermap_script, PHP CGI arg injection, etc.) — gives us reproducible exploit fire for the armed → infecting transition.

  • Metasploitable 3 (Rapid7, optional later, Vagrant-built). https://github.com/rapid7/metasploitable3 Role: heavier image with Windows and Linux variants; reserved for adding diversity to the dataset if time allows.
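
The digests pinned in the entries above are only useful if they are checked before an image is booted. The following is a minimal sketch of such a check, not the project's actual tooling: the helper names `file_digest` and `verify_pinned` and the `PINNED_IMAGES` table are hypothetical, and only the file names and hex digests are taken from this file.

```python
import hashlib
from pathlib import Path

# Pins copied from the entries above: file name -> (algorithm, hex digest).
# The table itself is a hypothetical structure for illustration.
PINNED_IMAGES = {
    "nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2": (
        "sha512",
        "bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e"
        "53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98",
    ),
    "cirros-0.6.3-x86_64-disk.img": (
        "sha256",
        "7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b",
    ),
}


def file_digest(path: Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through hashlib and return the lowercase hex digest."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so a multi-hundred-MiB image never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, algorithm: str, expected_hex: str) -> bool:
    """Return True only when the file's digest matches the pinned value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A caller would look up `PINNED_IMAGES[path.name]` and refuse to launch the VM when `verify_pinned` returns False. Keeping the algorithm next to each digest lets SHA-256 and SHA-512 pins share one code path.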


Exploitation framework

  • Metasploit Framework (Rapid7). https://github.com/rapid7/metasploit-framework License: BSD-3-Clause. Role: drives the exploit fire step programmatically via msfrpc, so episodes label armed → infecting transitions on session_open rather than guessing from metrics.

  • Exploit-DB (Offensive Security). https://www.exploit-db.com Role: cross-reference for CVE → public PoC, where Metasploit doesn't cover a vulnerability we want.
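
Labeling the armed → infecting transition on session_open, as described in the Metasploit entry, amounts to polling the session list after the exploit fires and timestamping the first new session on a monotonic clock. The sketch below illustrates that idea under stated assumptions. `wait_for_session` and `phase_at` are hypothetical names, and `list_sessions` stands in for whatever callable wraps the RPC session listing (assumed to return a mapping keyed by session id). It is not the project's driver code.

```python
import time
from typing import Callable, Dict, Optional, Set

Event = Dict[str, object]


def wait_for_session(
    list_sessions: Callable[[], Dict[object, dict]],
    known_ids: Set[object],
    timeout_s: float,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Event]:
    """Poll until a session id outside known_ids appears, or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        # Sessions that existed before the exploit fired are not ours.
        new_ids = sorted(set(list_sessions()) - known_ids, key=str)
        if new_ids:
            return {
                "event": "session_open",
                "session_id": new_ids[0],
                "t_mono": clock(),
            }
        if clock() >= deadline:
            return None  # no session landed within the timeout
        sleep(poll_s)


def phase_at(t_mono: float, session_open_t: Optional[float]) -> str:
    """Assumed labeling rule: 'armed' before session_open, 'infecting' from it on."""
    if session_open_t is None or t_mono < session_open_t:
        return "armed"
    return "infecting"
```

Injecting `clock` and `sleep` keeps the polling loop testable against a fake client without waiting in real time. The label boundary then comes from an observed event rather than from a guess based on metrics.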


Public malware sample sources

All samples used in this project are pre-existing, public, and hash-pinned. We do not author novel malware or exploits. Sample binaries are NEVER committed to the repo — see samples/README.md for safety rules.

  • MalwareBazaar (abuse.ch). https://bazaar.abuse.ch Role: primary sample fetch source. Provides API + sha256 lookup. Used for cryptominers (XMRig variants), webshells, and Linux ELF samples.

  • theZoo (a public live-malware repository). https://thezoo.morirt.com https://github.com/ytisf/theZoo Role: secondary source for older/rarer samples. Categorized by family.

  • vx-underground (collection of malware research artifacts). https://vx-underground.org Role: tertiary source; useful for academic context and family-attribution metadata.


Standards & specifications


Python runtime & libraries


Lab infrastructure (spectral org, .wg overlay)

These are not part of this repo's code, but they are the platform the pipeline runs on. See [reference_wg_infra memory] for context.

  • WireGuard — VPN tunnel for the .wg overlay. https://www.wireguard.com License: GPL-2.0.
  • Caddy — reverse proxy in front of the receiver, terminates internal TLS via tls internal. https://caddyserver.com License: Apache-2.0.
  • Forgejo — self-hosted git host at maxgit.wg. https://forgejo.org License: GPL-3.0+.
  • Raspberry Pi 5: central WG-side collector hardware (the receiver + dataset store run here). NOT the deployment target for the model.

How to cite this dataset (placeholder)

When the dataset reaches a publishable form, the canonical citation will be added here. Until then, a short course-project citation is fine:

Gorog, M. CIS490 Behavioral Malware Detection Dataset (in progress). Spectral lab, 2026.


Maintenance

When you add a new dependency, sample source, or external tool, add it here in the same session. A "works cited" file with stale citations is worse than none.