End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.
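
A minimal sketch of what a per-phase dispatch of this kind might look like. The phase names follow the schedule above; the `PHASE_COMMANDS` table, the exact shell command lines, and the `send` callable are assumptions for illustration, not the contents of tools/vm_load_controller.py.

```python
from typing import Callable, Dict, List

# Hypothetical command that stops any load left over from a previous phase.
STOP_LOAD = "killall yes dd 2>/dev/null; true"

# Hypothetical phase -> shell command table. Phase names come from the
# commit text; the command lines themselves are illustrative assumptions.
PHASE_COMMANDS: Dict[str, List[str]] = {
    "clean": [STOP_LOAD],
    "armed": [STOP_LOAD],
    "infecting": ["dd if=/dev/urandom of=/tmp/burst bs=1M count=64"],
    "infected_running": ["yes > /dev/null &"],
    "dormant": [STOP_LOAD],
}


def set_phase(phase: str, send: Callable[[str], None]) -> List[str]:
    """Pass each command for `phase` to `send` and return what was sent."""
    if phase not in PHASE_COMMANDS:
        raise ValueError(f"unknown phase: {phase!r}")
    commands = PHASE_COMMANDS[phase]
    for command in commands:
        send(command)
    return list(commands)
```

Keeping the table separate from the transport means the same mapping could be exercised against a fake `send` in tests and against the serial console in a real run.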
Components:
  vm/launch_demo.sh
      Now boots Alpine 3.21 nocloud-cloudinit. (Cirros 0.6.x's cirros-init
      blocks on the EC2 metadata service for ~17 min before falling through
      to NoCloud, so it was abandoned.) Mounts a cidata ISO as a second
      drive.
  tools/build_cidata.py
      Pure-Python NoCloud ISO builder (pycdlib). Sets root password and
      ssh_pwauth via runcmd so we don't depend on a specific cloud-init
      version's plain_text_passwd handling.
  tools/vm_serial.py
      Serial-console client (stdlib socket). Idempotent login (detects
      already-in-shell state), sentinel-bracketed run() that distinguishes
      shell output from the TTY echo of input by requiring a leading \r\n
      boundary on the marker.
  tools/vm_load_controller.py
      In-guest load controller. set_phase() dispatches the per-phase shell
      command over the serial connection.
  tools/run_real_vm_demo.py
      Ties it all together: boot VM, wait for cloud-init runcmd, log in,
      run the EpisodeRunner with on_phase=controller, shut down VM.
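
The sentinel-bracketing idea described for run() can be sketched as follows. This is an illustration under assumptions, not the code in tools/vm_serial.py: the `BEGIN_`/`END_` marker names and both helper functions are hypothetical. The key point is that the TTY echo of the typed command contains the marker text preceded by `echo `, while the real output has the marker at the start of a line, so requiring a leading `\r\n` separates the two.

```python
from typing import Optional


def wrap_command(command: str, token: str) -> str:
    """Bracket `command` with begin/end sentinel echoes (hypothetical names)."""
    return f"echo BEGIN_{token}; {command}; echo END_{token}"


def extract_output(stream: str, token: str) -> Optional[str]:
    """Return the command output found in `stream`, or None if incomplete.

    Markers only count when they follow a \\r\\n boundary, so the copy of
    the marker inside the echoed input line is ignored.
    """
    begin = f"\r\nBEGIN_{token}"
    end = f"\r\nEND_{token}"
    begin_at = stream.find(begin)
    if begin_at == -1:
        return None  # only the echo (or nothing) has arrived so far
    body_start = begin_at + len(begin)
    end_at = stream.find(end, body_start)
    if end_at == -1:
        return None  # command is still running
    # Drop the line break that terminates the BEGIN marker line.
    return stream[body_start:end_at].removeprefix("\r\n")
```

In use, the token would be unique per call (for example `uuid.uuid4().hex`), and the caller would keep reading from the socket and re-running `extract_output` on the accumulated buffer until it returns something other than `None`.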
Deps: paramiko, pycdlib added.
docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.
README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sources & Works Cited
Every external thing this project depends on, leans on for design, or pulls samples from. Grouped by category. Where relevant, we note the role each thing plays in our pipeline.
Prior work / academic
- A Trust Model for Resource-Constrained IoT Devices Based on Performance Metrics. IEEE Document 9881803. https://ieeexplore.ieee.org/document/9881803 Role: prerequisite paper for this project. Frames detection as a trust-over-time score rather than a single-snapshot classifier.
- Mirai: original Linux/IoT botnet using SSH/Telnet weak credentials (Antonakakis et al., USENIX Security 2017). https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/antonakakis Role: the canonical real-world Linux compromise pattern that motivates our chosen attack vector ("SSH weak creds → drop payload"). The behavioral envelope our model targets is shaped by Mirai-class workloads.
- Linux man pages, `proc(5)` — kernel ABI for /proc. https://man7.org/linux/man-pages/man5/proc.5.html Role: canonical reference for `/proc/<pid>/{stat,io,status,schedstat}` field layout used by `collectors/proc_qemu.py`.
- Linux `perf_event_open(2)` man page. https://man7.org/linux/man-pages/man2/perf_event_open.2.html Role: the syscall that backs `perf stat` and any in-process hardware-counter reads. Both the planned host-side `perf_qemu` collector and the in-guest agent will read from this surface.
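
For the `proc(5)` entry above, a minimal sketch of reading CPU time from the text of `/proc/<pid>/stat`. The field positions (utime is field 14, stime is field 15) come from the man page; the `ProcStat` type and both function names are hypothetical and are not taken from `collectors/proc_qemu.py`.

```python
from typing import NamedTuple


class ProcStat(NamedTuple):
    pid: int
    comm: str
    state: str
    utime_ticks: int
    stime_ticks: int


def parse_proc_stat(text: str) -> ProcStat:
    """Parse the contents of /proc/<pid>/stat.

    comm (field 2) is wrapped in parentheses and may itself contain spaces
    or ')', so the split point is the LAST ')' in the line.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return ProcStat(
        pid=int(text[:lparen]),
        comm=text[lparen + 1:rparen],
        state=rest[0],
        utime_ticks=int(rest[14 - 3]),
        stime_ticks=int(rest[15 - 3]),
    )


def cpu_percent(prev: ProcStat, curr: ProcStat, elapsed_s: float,
                clk_tck: int = 100) -> float:
    """CPU usage between two samples; 100.0 means one fully busy core."""
    ticks = (curr.utime_ticks + curr.stime_ticks) - (
        prev.utime_ticks + prev.stime_ticks)
    return 100.0 * ticks / (clk_tck * elapsed_s)
```

On a real host the tick rate would come from `os.sysconf("SC_CLK_TCK")` rather than the default of 100 assumed here.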
Virtualization & operating system
- QEMU (10.2.0 in our lab). https://www.qemu.org License: GPL-2.0-or-later. Role: the hypervisor running guest VMs; we drive it via QMP for oracle telemetry.
- KVM (Linux kernel module). https://www.linux-kvm.org License: GPL-2.0. Role: hardware-accelerated virtualization backend for QEMU.
- Linux kernel (6.18.x lab host). https://www.kernel.org License: GPL-2.0. Role: host kernel; supplies /proc, /sys, perf, KVM, cgroups, virtio-serial.
- systemd — service supervisor for the receiver and the (planned) orchestrator and shipper daemons. https://systemd.io License: LGPL-2.1-or-later.
VMs and intentionally-vulnerable images
- Alpine Linux 3.21 cloud-init nocloud image (current tier-2 guest). https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/ File: `nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2` SHA-512: `bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98` License: MIT for image, GPL/various for contents. Role: small (~180 MiB) Linux image with cloud-init that picks up our NoCloud cidata ISO at first boot. SSH-pwauth and a known root password are set via `runcmd`. Used as the in-guest workload host for tier-2 runs and as the post-snapshot baseline for the qcow2 snapshot loop.
- Cirros 0.6.3 (initially tried, currently unused). https://download.cirros-cloud.net/0.6.3/ SHA-256 of `cirros-0.6.3-x86_64-disk.img`: `7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b` License: GPL. Role: tiny (~21 MiB) test image; abandoned for this project because Cirros 0.6.x's `cirros-init` checks the EC2 metadata service before NoCloud and the failure-retry loop took ~17 minutes to fall through. Kept in the manifest in case the simpler image is useful for a later size-constrained scenario.
- Metasploitable 2 (planned for the exploit phase, Rapid7). https://information.rapid7.com/download-metasploitable-2017.html Role: purposely vulnerable Linux VM whose services have stable Metasploit modules (vsftpd 2.3.4 backdoor, distccd RCE, Samba `usermap_script`, PHP CGI arg injection, etc.), which gives us reproducible exploit fire for the `armed → infecting` transition.
- Metasploitable 3 (Rapid7, optional later, Vagrant-built). https://github.com/rapid7/metasploitable3 Role: heavier, Win + Linux variants; reserved for adding diversity to the dataset if time allows.
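
The pinned digests above are only useful if they are checked before a guest boots from the image. A minimal sketch of such a check, assuming the image has already been downloaded; `sha512_of` and `verify_pinned` are hypothetical helper names, not functions from this repo.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large images are not read at once."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_hex: str) -> None:
    """Raise ValueError if the file's SHA-512 differs from the pinned value."""
    actual = sha512_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"{path.name}: SHA-512 mismatch (expected {expected_hex}, got {actual})"
        )
```

A launcher could call `verify_pinned(image_path, pinned_digest)` with the value recorded in this file and refuse to boot on failure.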
Exploitation framework
- Metasploit Framework (Rapid7). https://github.com/rapid7/metasploit-framework License: BSD-3-Clause. Role: drives the exploit fire step programmatically via `msfrpc`, so episodes label `armed → infecting` transitions on `session_open` rather than guessing from metrics.
- Exploit-DB (Offensive Security). https://www.exploit-db.com Role: cross-reference for CVE → public PoC, where Metasploit doesn't cover a vulnerability we want.
Public malware sample sources
All samples used in this project are pre-existing, public, and hash-pinned. We do not author novel malware or exploits. Sample binaries are NEVER committed to the repo; see `samples/README.md` for safety rules.
- MalwareBazaar (abuse.ch). https://bazaar.abuse.ch Role: primary sample fetch source. Provides API + sha256 lookup. Used for cryptominers (XMRig variants), webshells, and Linux ELF samples.
- theZoo (a public live-malware repository). https://thezoo.morirt.com https://github.com/ytisf/theZoo Role: secondary source for older/rarer samples. Categorized by family.
- vx-underground (collection of malware research artifacts). https://vx-underground.org Role: tertiary source; useful for academic context and family-attribution metadata.
Standards & specifications
- ULID — Universally Unique Lexicographically Sortable Identifier. https://github.com/ulid/spec Role: episode IDs. 26-char Crockford base32, time-sortable, no coordinator. Implemented in `orchestrator/ulid.py`.
- JSON Lines. https://jsonlines.org Role: on-disk telemetry, label, and event format. Append-only, crash-safe, trivially loadable as a DataFrame.
- PEP 735 — dependency groups. https://peps.python.org/pep-0735/ Role: `pyproject.toml` dependency grouping (the `dev` group).
- Crockford base32. https://www.crockford.com/base32.html Role: alphabet for ULIDs.
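
To make the ULID and Crockford base32 entries above concrete, here is a minimal sketch of ULID generation as laid out in the spec: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It illustrates the format only and is not the code in `orchestrator/ulid.py`; `new_ulid` and `_encode` are hypothetical names, and the spec's optional same-millisecond monotonicity is not implemented.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet: digits plus letters, excluding I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def _encode(value: int, length: int) -> str:
    """Encode `value` as `length` base32 characters, most significant first."""
    chars = []
    for _ in range(length):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid(timestamp_ms: Optional[int] = None,
             randomness: Optional[bytes] = None) -> str:
    """Build a 26-character ULID: 10 timestamp chars + 16 randomness chars."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)  # 80 bits
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    return _encode(timestamp_ms, 10) + _encode(
        int.from_bytes(randomness, "big"), 16)
```

Because the timestamp occupies the leading characters, plain string sorting orders IDs by creation time at millisecond resolution, which is what makes them usable as time-sortable episode IDs without a coordinator.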
Python runtime & libraries
- Python 3.11+ — runtime requirement. https://www.python.org
- uv (Astral) — Python project + dependency manager. https://github.com/astral-sh/uv
- Starlette — ASGI framework for the receiver. https://www.starlette.io
- Uvicorn — ASGI server. https://www.uvicorn.org
- httptools, websockets, watchfiles, python-dotenv, pyyaml — Uvicorn `[standard]` extras.
- pytest — test runner. https://docs.pytest.org
- pytest-asyncio — async test support. https://github.com/pytest-dev/pytest-asyncio
- httpx — async HTTP client used for receiver tests via ASGITransport. https://www.python-httpx.org
- matplotlib + numpy — plotting (envelope visualization only). https://matplotlib.org / https://numpy.org
- tornado — required by matplotlib's WebAgg interactive backend. https://www.tornadoweb.org
- paramiko — SSH client used for in-guest control on cloud images that support it. https://www.paramiko.org
- pycdlib — pure-Python ISO9660/Joliet/Rock Ridge builder. Used to produce the NoCloud cidata ISO without depending on system mkisofs/xorriso. https://clalancette.github.io/pycdlib/
Lab infrastructure (spectral org, .wg overlay)
These are not part of this repo's code, but they are the platform the
pipeline runs on. See [reference_wg_infra memory] for context.
- WireGuard — VPN tunnel for the `.wg` overlay. https://www.wireguard.com License: GPL-2.0.
- Caddy — reverse proxy in front of the receiver, terminates internal TLS via `tls internal`. https://caddyserver.com License: Apache-2.0.
- Forgejo — self-hosted git host at `maxgit.wg`. https://forgejo.org License: GPL-3.0+.
- Raspberry Pi 5 — central WG-side collector hardware (the receiver + dataset store run here). NOT the deployment target for the model.
How to cite this dataset (placeholder)
When the dataset reaches a publishable form, the canonical citation will be added here. Until then, a short course-project citation is fine:
Gorog, M. CIS490 Behavioral Malware Detection Dataset (in progress). Spectral lab, 2026.
Maintenance
When you add a new dependency, sample source, or external tool, add it here in the same session. A "works cited" file with stale citations is worse than none.