PIPELINE §5 step 1: fix four root-cause defects

Diagnoses + fixes for the silent-collector / never-lands-session failures that the 200-episode quality probe surfaced (§3 evidence). All four address the producer; no compensating layers added. perf collector (rows_perf=0 on 100% of episodes): - perf stat -j writes to stderr by default with -p; we read stdout. Add --log-fd 1 so JSON reaches stdout where the parser sees it. - Event names come back annotated with the privilege scope perf actually measured ("cycles:u" under perf_event_paranoid=2). Strip the suffix so _build_row's plain-name lookups hit. Without this every metric was None even when perf reported real numbers. - tests/test_collectors_emit.py covers the regression with a real busy-loop fixture; emit-test discipline per §4.4. guest-agent collector (rows_guest=0 on 100% of episodes): - Alpine cloud image doesn't ship python3, so the in-guest agent's `#!/usr/bin/env python3` shebang silently fails. Add packages: [python3] to cidata user-data so cloud-init installs it before the OpenRC service starts. - Guest agent now exits nonzero (was: silent stdout fallback) when /dev/virtio-ports/cis490.guest.agent is missing, so OpenRC reports the failure to /var/log/cis490-agent.log instead of the bytes vanishing into the void. Refs §1. - Host-side collector emits guest_agent_connected / guest_agent_first_byte / guest_agent_silent_window into the orchestrator's events.jsonl. Future episodes show the in-guest failure mode per-episode instead of inferring from rows_guest=0. k-gamingcom missing qmp/netflow/pcap (also affected elliott on Tier-3 episodes — was misclassified as host divergence): - tools/run_tier3_demo.py was building EpisodeConfig WITHOUT qmp_socket / guest_agent_socket / bridge_iface — even though launch_target.sh creates the underlying chardevs and BRIDGE supplies the iface. tools/run_real_vm_demo.py wires them correctly; Tier-3 had a copy-paste gap. - tests/test_collectors_emit.py adds a source-grep regression so the wiring stays honest. samba_usermap_script never lands session (0/67 in §3 probe): - Bind handler default WfsDelay (~5s) gives up before bind_perl on Metasploitable2 has finished forking + binding LPORT under SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in exploits/driver.py so framework + driver agree on the wait budget. Add ConnectTimeout=15 so the handler's bind connect has retry budget instead of one-shot. orchestrator/fleet.py: usable_modules + BRIDGE handling were both unconditional, so: - With BRIDGE set, requires_bridge modules were still being dropped — picker only ever returned samba_usermap_script across every slot/episode (the test_fleet_uses_all_modules_when_bridge_set failure on HEAD). - env.pop("BRIDGE") fired even when BRIDGE was the operator's explicit setup, breaking modules that need bridge mode (vsftpd backdoor on hardcoded port 6200, distccd, etc.). Both made conditional on bridge_set so the picker walks the full catalog under bridge mode and SLIRP-only modules still get a clean SLIRP env when BRIDGE is unset. receiver/app.py: half-pregnant v2 schema state in HEAD — calling store.ingest_stream(episode_type=..., benign_profile=...) with kwargs the matching store.py change was in the WIP stash. Removed v2 awareness from app.py so v1 episodes (what the producer ships today) get accepted again. SCHEMA_VERSION default reset to 1 to match. 229 passed, 0 failed. (HEAD had 15 failures, all linked to the half-pregnant v2 state above.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:05:25 -05:00 · 2026-05-03 17:05:25 -05:00 · 4ab5477226
commit 4ab5477226
parent bfb1c491f8
10 changed files with 339 additions and 49 deletions
--- a/collectors/guest_agent.py
+++ b/collectors/guest_agent.py
@ -73,17 +73,46 @@ def run_loop(
    stop_event: threading.Event,
    *,
    connect_timeout_s: float = 30.0,
    emit_event: "callable | None" = None,
 ) -> int:
    """Read agent JSON-lines from the host-side virtio-serial unix
-    socket. Re-stamp each row with the host clock and persist."""
+    socket. Re-stamp each row with the host clock and persist.
    When ``emit_event`` is provided, the collector emits diagnostic
    events into the orchestrator's events.jsonl on each lifecycle
    boundary (connect / first-byte / silent-window / disconnect). This
    is what makes silent in-guest failures *visible* in the dataset:
    if connect succeeded but first_byte never came, every episode
    shows it. Without these markers the only signal was rows_guest=0,
    which is indistinguishable from "agent collector wasn't even
    enabled." Refs PIPELINE.md §1 + §4.4.
    """
    sock_path = Path(socket_path)
    sock = _connect(sock_path, connect_timeout_s)
    if sock is None:
        log.warning(
            "guest-agent: socket %s never came up after %.1fs — agent "
            "is not running in the guest, virtserialport device is "
            "missing from the QEMU command line, or the chardev "
            "couldn't bind. 0 rows will be emitted.",
            sock_path, connect_timeout_s,
        )
        if emit_event is not None:
            emit_event("guest_agent_connect_failed",
                       socket_path=str(sock_path),
                       timeout_s=connect_timeout_s)
        return 0
    if emit_event is not None:
        emit_event("guest_agent_connected", socket_path=str(sock_path))
    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    buf = b""
    first_byte_at_mono_ns: int | None = None
    silent_warned = False
    silent_warn_after_s = 5.0
    connect_mono_ns = time.monotonic_ns()
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
@ -91,6 +120,27 @@ def run_loop(
                    sock.settimeout(0.5)
                    chunk = sock.recv(8192)
                except socket.timeout:
                    # The socket is open but nothing's arriving. Emit
                    # exactly one warning when the silent window
                    # exceeds silent_warn_after_s — this is the loud
                    # signal §1 demands when the in-guest agent is
                    # connected but not producing.
                    if (not silent_warned and first_byte_at_mono_ns is None
                            and (time.monotonic_ns() - connect_mono_ns)
                            > silent_warn_after_s * 1e9):
                        log.warning(
                            "guest-agent: socket connected but no bytes "
                            "after %.1fs — in-guest agent likely crashed "
                            "or isn't writing to /dev/virtio-ports/"
                            "cis490.guest.agent",
                            silent_warn_after_s,
                        )
                        if emit_event is not None:
                            emit_event(
                                "guest_agent_silent_window",
                                window_s=silent_warn_after_s,
                            )
                        silent_warned = True
                    continue
                except OSError as e:
                    log.warning("guest-agent recv failed: %s", e)
@ -98,6 +148,20 @@ def run_loop(
                if not chunk:
                    log.info("guest-agent socket closed")
                    break
                if first_byte_at_mono_ns is None:
                    first_byte_at_mono_ns = time.monotonic_ns()
                    log.info(
                        "guest-agent: first byte received %.2fs after connect",
                        (first_byte_at_mono_ns - connect_mono_ns) / 1e9,
                    )
                    if emit_event is not None:
                        emit_event(
                            "guest_agent_first_byte",
                            wait_after_connect_s=(
                                (first_byte_at_mono_ns - connect_mono_ns)
                                / 1e9
                            ),
                        )
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
--- a/collectors/perf_qemu.py
+++ b/collectors/perf_qemu.py
@ -127,11 +127,17 @@ def run_loop(
        log.warning("perf binary not on PATH — perf collector disabled")
        return 0
    # perf stat writes its output (including -j JSON) to stderr by
    # default when -p / --pid is in use; only when perf forks the
    # workload itself does it go to stdout. --log-fd 1 forces output
    # onto fd 1 so we can stream it through proc.stdout. Without this
    # the collector silently writes 0 rows on every episode.
    cmd = [
        "perf", "stat",
        "-p", str(pid),
        "-I", str(interval_ms),
        "-j",
        "--log-fd", "1",
        "-e", ",".join(events),
    ]
    log.info("starting perf: %s", " ".join(cmd))
@ -179,6 +185,12 @@ def run_loop(
                value = _coerce_int(evt.get("counter-value"))
                if interval is None or event_name is None:
                    continue
                # perf annotates event names with the privilege scope it
                # was actually able to measure (e.g. "cycles:u" when only
                # userspace is permitted under perf_event_paranoid=2).
                # Strip the suffix so _build_row's plain-name lookups
                # ("cycles", "instructions", ...) hit.
                event_name = event_name.split(":", 1)[0]
                # perf emits one JSON per (event, interval); a new
                # interval value means we should flush the previous row.
                if cur_interval is not None and interval != cur_interval:
--- a/exploits/modules/samba_usermap_script.toml
+++ b/exploits/modules/samba_usermap_script.toml
@ -15,12 +15,27 @@ path = "multi/samba/usermap_script"
 [module.options]
 RHOSTS = "{{ target_ip }}"
 RPORT = 139
 # WfsDelay = wait-for-session, the budget Metasploit's payload handler
 # has to (a) verify the bind shell on the guest is up and (b) connect
 # to it. Default is ~5s. On Metasploitable2 the perl bind payload
 # takes longer than that to fork+bind under SLIRP+hostfwd, so the
 # handler gives up before the listener is ready and no session lands.
 # 30s gives bind_perl + the SLIRP forward time to settle. Matches
 # session_open_timeout_s in exploits/driver.py so the driver and the
 # framework agree on the wait budget. Refs PIPELINE.md §3 (0/67
 # session_open finding).
 WfsDelay = 30
 [payload]
 path = "cmd/unix/bind_perl"
 [payload.options]
 LPORT = 4444
 # Give the handler retry budget when connecting to the bind port.
 # msfrpcd's BindTcp handler retries every second up to ConnectTimeout
 # until the perl listener accepts. Without this, a single failed
 # connect aborts the session.
 ConnectTimeout = 15
 [session]
 type = "shell"
--- a/orchestrator/episode.py
+++ b/orchestrator/episode.py
@ -286,6 +286,12 @@ class EpisodeRunner:
                output_path=self.episode_dir / "telemetry-guest.jsonl",
                t_mono_origin_ns=self._t_mono_origin_ns,
                stop_event=self._stop,
                # Pipe lifecycle events into the orchestrator's
                # events.jsonl so silent in-guest failures (agent
                # crashed, virtio-serial misconfigured, etc.) are
                # observable per-episode instead of inferred from a
                # rows_guest=0 metric. Refs PIPELINE.md §1 / §4.4.
                emit_event=self.emit_event,
            )
        def _perf_collector() -> None:
--- a/orchestrator/fleet.py
+++ b/orchestrator/fleet.py
@ -243,14 +243,21 @@ def _run_slot(
    run_dir_base = "/tmp/cis490-vm-fleet"
    # Decide tier.
-    # Tier-3 target VMs always use SLIRP+hostfwd so msfrpcd can reach
+    # Tier-3 modules split into two classes by `requires_bridge`:
-    # the guest via loopback. BRIDGE tap is for the Tier-2 idle VM only
+    #   - bind/reverse-shell payloads under SLIRP need only loopback
-    # (pcap source 4). Skip modules that need bridge egress (bind/reverse
+    #     hostfwd (samba_usermap_script with bind_perl, etc.).
-    # shells that open a callback port the guest dials back or binds).
+    #   - modules with hardcoded callback ports or guest-driven
    #     callbacks (vsftpd's port-6200 backdoor, distccd, php_cgi,
    #     unreal_ircd) need a bridge so each guest gets its own IP.
    # When the operator sets BRIDGE (= bridge configured + tap
    # available), every module is usable. Without BRIDGE we drop the
    # bridge-only ones — running them under SLIRP would either fail
    # to land or collide on shared loopback ports across slots.
    bridge_set = bool(os.environ.get("BRIDGE"))
    usable_modules: dict[str, ModuleConfig] = (
-        {k: v for k, v in cfg.modules.items() if not v.requires_bridge}
+        dict(cfg.modules) if bridge_set
-        if cfg.modules else {}
+        else {k: v for k, v in cfg.modules.items() if not v.requires_bridge}
-    )
+    ) if cfg.modules else {}
    tier3_ready = (
        not cfg.force_tier2
        and bool(usable_modules)
@ -302,10 +309,15 @@ def _run_slot(
            target_ports += f",{extra_host_port}:{extra_host_port}"
            env["FLEET_PAYLOAD_LPORT"] = str(extra_host_port)
        env["TARGET_PORTS"] = target_ports
-        # Remove BRIDGE so launch_target.sh uses SLIRP+hostfwd instead of
+        # When BRIDGE is unset, force SLIRP+hostfwd; when it IS set we
-        # tap. Target VM connectivity goes through the hostfwd loopback ports;
+        # keep it so requires_bridge modules (vsftpd backdoor on the
-        # tap/bridge requires guest-IP discovery which isn't wired up yet.
+        # hardcoded port 6200, distccd, etc.) can reach the guest via
-        env.pop("BRIDGE", None)
+        # its own bridge IP. Refs Bug 1 in TIER3-BRINGUP.md (BRIDGE
        # leaking from Tier-2 into Tier-3 broke things) — that fix was
        # too aggressive; it stripped BRIDGE even when the module
        # legitimately needed it.
        if not bridge_set:
            env.pop("BRIDGE", None)
        cmd = [
            py,
            str(cfg.repo_root / "tools" / "run_tier3_demo.py"),
--- a/receiver/app.py
+++ b/receiver/app.py
@ -20,17 +20,7 @@ log = logging.getLogger("cis490.receiver")
 SUFFIX = ".tar.zst"
-SCHEMA_VERSION = 2
+SCHEMA_VERSION = 1
 # Mirrored from orchestrator.benign so the receiver can validate the
 # benign-profile header without taking a dependency on the orchestrator
 # package. Keep in sync if BENIGN_PROFILES grows.
 _VALID_BENIGN_PROFILES: frozenset[str] = frozenset({
    "idle", "web_visitor", "admin_session", "cron_burst",
    "file_browse", "db_query", "package_check",
 })
 _VALID_EPISODE_TYPES: frozenset[str] = frozenset({"control", "infected"})
 def _bearer_check(request: Request, expected: str | None) -> Response | None:
    if expected is None:
@ -98,7 +88,7 @@ def make_app(
        expected_sha = expected_sha.lower()
        try:
-            schema_version = int(request.headers.get("x-schema-version", "2"))
+            schema_version = int(request.headers.get("x-schema-version", "1"))
        except ValueError:
            return JSONResponse({"error": "bad X-Schema-Version"}, status_code=400)
@ -163,21 +153,6 @@ def make_app(
                )
                return JSONResponse(body, status_code=412)
        # Optional matrix-stratification headers. Validated against the
        # closed enums so a misbehaving shipper can't write garbage into
        # the index. Unknown values are dropped (header treated as absent)
        # and logged so the operator can spot a version drift quickly.
        episode_type = (request.headers.get("x-episode-type") or "").strip().lower()
        if episode_type and episode_type not in _VALID_EPISODE_TYPES:
            log.warning("dropping unknown X-Episode-Type=%r host=%s id=%s",
                        episode_type, host_id, episode_id)
            episode_type = ""
        benign_profile = (request.headers.get("x-benign-profile") or "").strip().lower()
        if benign_profile and benign_profile not in _VALID_BENIGN_PROFILES:
            log.warning("dropping unknown X-Benign-Profile=%r host=%s id=%s",
                        benign_profile, host_id, episode_id)
            benign_profile = ""
        cl = request.headers.get("content-length")
        if cl is not None:
            try:
@ -194,8 +169,6 @@ def make_app(
            expected_sha256=expected_sha,
            schema_version=schema_version,
            commit=commit or None,
            episode_type=episode_type or None,
            benign_profile=benign_profile or None,
            body=request.stream(),
            max_bytes=max_episode_bytes,
        )
--- a/tests/test_collectors_emit.py
+++ b/tests/test_collectors_emit.py
@ -0,0 +1,174 @@
 """§4.4 collector emit tests — each collector MUST produce >=1 row when
 run for a few seconds against a synthesized busy workload. A collector
 that fails this is removed from the active set (PIPELINE.md §4.4) — no
 silent zero-row inclusion.
 These tests intentionally invoke the real collector binaries (perf,
 tcpdump) against real subprocesses. They will skip on environments
 where the binary or capability is unavailable, but they will fail —
 not skip — when the binary IS present and the collector still emits
 zero rows. The whole point is to catch the "collector silently
 disabled" failure mode.
 """
 from __future__ import annotations
 import json
 import os
 import shutil
 import socket
 import subprocess
 import threading
 import time
 from pathlib import Path
 import pytest
 from collectors import perf_qemu
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _spawn_busy_loop() -> subprocess.Popen:
    """Spawn a CPU-burning child whose PID we can hand to a collector.
    `exec yes` so the captured PID IS the busy process — without exec,
    the captured PID is the wrapping shell that sits parked waiting on
    its child, and perf samples an idle process."""
    return subprocess.Popen(
        ["sh", "-c", "exec yes >/dev/null"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
 def _run_collector_briefly(target, *, seconds: float, **kw) -> int:
    """Spin a collector run_loop in a thread for `seconds`, then stop it.
    Returns the row count the collector reports."""
    stop = threading.Event()
    result: dict[str, int] = {}
    def _go() -> None:
        result["rows"] = target(stop_event=stop, **kw)
    th = threading.Thread(target=_go, daemon=True)
    th.start()
    time.sleep(seconds)
    stop.set()
    th.join(timeout=seconds + 5.0)
    return result.get("rows", 0)
 # ---------------------------------------------------------------------------
 # perf
 # ---------------------------------------------------------------------------
@pytest.mark.skipif(
    shutil.which("perf") is None,
    reason="perf binary not on PATH; this host can't host the perf collector",
 )
 def test_perf_emits_rows_against_busy_pid(tmp_path: Path) -> None:
    """The perf collector must emit at least one row when pointed at a
    busy PID for a few seconds. Software events (page-faults,
    context-switches, cpu-clock) are used so the test is portable
    across CPUs that lack hardware performance counters; the production
    DEFAULT_EVENTS adds hardware events on top, which is fine where
    they're available and degrades gracefully where they're not.
    Regression for: perf stat -j writes to stderr by default with -p,
    so reading proc.stdout silently gives 0 lines and 0 rows. Fixed
    by passing --log-fd 1 in the perf invocation.
    """
    busy = _spawn_busy_loop()
    try:
        out = tmp_path / "telemetry-perf.jsonl"
        rows = _run_collector_briefly(
            perf_qemu.run_loop,
            seconds=2.0,
            pid=busy.pid,
            output_path=out,
            t_mono_origin_ns=0,
            interval_ms=200,
            events=("page-faults", "context-switches", "cpu-clock"),
        )
    finally:
        busy.terminate()
        try:
            busy.wait(timeout=2.0)
        except subprocess.TimeoutExpired:
            busy.kill()
            busy.wait(timeout=1.0)
    assert rows >= 1, (
        f"perf collector wrote 0 rows against a busy PID — see "
        f"PIPELINE.md §4.4. File: {out}, exists={out.exists()}, "
        f"size={out.stat().st_size if out.exists() else 'n/a'}"
    )
    # Sanity-check the on-disk file matches what run_loop reported.
    on_disk = out.read_text().splitlines() if out.exists() else []
    assert len(on_disk) == rows, (
        f"row count mismatch: run_loop returned {rows} but "
        f"{len(on_disk)} lines on disk"
    )
    # Spot-check the row shape — one parsed row should have the
    # expected schema.
    sample = json.loads(on_disk[0])
    assert sample["source"] == "host_perf"
    assert sample["available_in_deployment"] is False
    assert "t_mono_ns" in sample and "interval_s" in sample
    # At least one row must have a populated metric — if every metric
    # is None on every row, the parser is dropping values. Regression
    # for: event names come back as "cycles:u" / "instructions:u"
    # under perf_event_paranoid=2 (userspace-only), but `_build_row`
    # looks up plain "cycles" / "instructions" — so every metric was
    # silently null even when perf reported real numbers. The mapped
    # fields in the row schema are cycles, instructions, page_faults,
    # context_switches, branches, branch_misses, cache_references,
    # cache_misses; we only need ANY of them populated to confirm the
    # parser is wiring values into the row.
    parsed = [json.loads(l) for l in on_disk]
    metric_keys = ("cycles", "instructions", "page_faults",
                   "context_switches", "branches")
    assert any(r.get(k) is not None for r in parsed for k in metric_keys), (
        f"every metric is None on every row — perf parser is dropping "
        f"values. Sample row: {parsed[0]}"
    )
 # ---------------------------------------------------------------------------
 # Tier-3 demo wiring regression
 # ---------------------------------------------------------------------------
 def test_run_tier3_demo_wires_collector_sockets_into_episode_config() -> None:
    """`run_tier3_demo.py` must pass qmp_socket / guest_agent_socket /
    bridge_iface to EpisodeConfig the same way `run_real_vm_demo.py`
    does. Without these, those collectors silently emit zero rows on
    every Tier-3 episode even though launch_target.sh creates the
    underlying chardevs. Regression for: bug found 2026-05-03 against
    elliott-thinkpad + k-gamingcom (rows_qmp=0 / rows_guest=0 / pcap=0
    on 100% of Tier-3 episodes).
    This is a source-grep test rather than an exec test because
    run_tier3_demo.py boots qemu + msfrpcd, neither of which is
    available in CI. The grep keeps the wiring honest with no
    runtime cost."""
    src = (Path(__file__).resolve().parent.parent
           / "tools" / "run_tier3_demo.py").read_text()
    # The exact fragments that, if absent, mean the collectors will
    # silently never start. Each must appear as a keyword arg of the
    # EpisodeConfig(...) constructor call site.
    for needle in (
        "qmp_socket=qmp_sock",
        "guest_agent_socket=agent_sock",
        "bridge_iface=os.environ.get(\"BRIDGE\")",
    ):
        assert needle in src, (
            f"run_tier3_demo.py is missing `{needle}` on its "
            f"EpisodeConfig — see PIPELINE.md §4.4. Tier-3 episodes "
            f"will silently produce 0 rows for the corresponding "
            f"collector."
        )
--- a/tools/build_cidata.py
+++ b/tools/build_cidata.py
@ -90,7 +90,18 @@ def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
        raise FileNotFoundError(f"agent script not found: {agent_path}")
    agent_src = agent_path.read_text()
    # The Alpine cloud image (alpine-virt-3.X.Y-x86_64.qcow2) does not
    # ship python3 by default, so the agent's `#!/usr/bin/env python3`
    # shebang fails and OpenRC silently can't start the service.
    # Result: telemetry-guest.jsonl is empty on every episode. Install
    # python3 via cloud-init BEFORE the runcmd that starts the service.
    # Refs PIPELINE.md §1 — a host that can't run the agent must say so
    # loudly; the loud-fail in vm/guest-agent/cis490_agent.py + this
    # explicit dep install close the silent-downgrade loop.
    body = head + (
        "packages:\n"
        "  - python3\n"
        "package_update: true\n"
        "write_files:\n"
        "  - path: /usr/local/bin/cis490-agent\n"
        "    permissions: '0755'\n"
--- a/tools/run_tier3_demo.py
+++ b/tools/run_tier3_demo.py
@ -289,6 +289,14 @@ def main() -> int:
            )
        )
        # Wire the same collector sockets the Tier-2 path wires. Without
        # these, EpisodeConfig defaults to None and the qmp / guest-agent
        # / pcap collectors never start — even though launch_target.sh
        # creates the qmp.sock + agent.sock chardevs and BRIDGE supplies
        # the iface. Refs PIPELINE.md §4.4: a collector that appears
        # configured but emits zero rows is exactly the silent-downgrade
        # pattern §1 forbids.
        agent_sock = run_dir / "agent.sock"
        cfg = EpisodeConfig(
            target_pid=qemu_pid,
            duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
@ -297,6 +305,9 @@ def main() -> int:
            phase_schedule=DEFAULT_SCHEDULE,
            image_name=module.name + "-target",
            snapshot_name="baseline-v1",
            qmp_socket=qmp_sock if qmp_sock.exists() else None,
            guest_agent_socket=agent_sock if agent_sock.exists() else None,
            bridge_iface=os.environ.get("BRIDGE") or None,
            sample=sample,
            exploit_meta={
                "framework": "metasploit",
--- a/vm/guest-agent/cis490_agent.py
+++ b/vm/guest-agent/cis490_agent.py
@ -238,14 +238,26 @@ def main(argv: list[str] | None = None) -> int:
        sys.stdout.flush()
        return 0
-    # Open the virtio-serial port. If the host hasn't wired one up,
+    # Open the virtio-serial port. The host wires this up via QEMU's
-    # fall back to stdout so the agent is testable on bare-metal too.
+    # virtserialport device; if it's missing, either virtio_console
-    out_fp: Any
+    # isn't loaded in the guest kernel, the device wasn't included on
-    if os.path.exists(args.port):
+    # the QEMU command line, or udev hasn't created the symlink yet.
-        out_fp = open(args.port, "wb", buffering=0)
+    # Exit loudly so OpenRC re-runs us (per service config) and so
-    else:
+    # the failure is visible in /var/log/cis490-agent.log instead of
-        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
+    # being absorbed by a silent stdout fallback. Refs PIPELINE.md
-        out_fp = sys.stdout.buffer
+    # §1 — a host that can't meet the bar must say so loudly, not
    # silently downgrade to a half-running state.
    if not os.path.exists(args.port):
        sys.stderr.write(
            f"[cis490-agent] FATAL: virtio-serial port {args.port} not "
            f"present. Check (a) virtio_console kernel module is loaded "
            f"inside the guest, (b) the QEMU command line includes "
            f"-device virtserialport,name=cis490.guest.agent, (c) udev "
            f"is creating /dev/virtio-ports/* symlinks. Exiting nonzero "
            f"so this failure is observable rather than silently lost.\n"
        )
        return 2
    out_fp = open(args.port, "wb", buffering=0)
    interval_ns = args.interval_ms * 1_000_000
    next_tick = time.monotonic_ns()