shipper: crashes on startup when ca_bundle path does not exist — should wait/retry #11

Closed
opened 2026-04-30 16:08:16 -05:00 by elliott · 1 comment
Owner

## Context

First bring-up of the elliott-thinkpad lab host via `install-lab-host.sh`. The certs have not yet been provisioned by the Pi (that requires running `deploy-cis490-cert.sh` on the Pi side). The shipper service starts immediately on `systemctl enable --now` but crashes because the CA bundle configured in `lab-host.toml` does not exist on disk yet.

## What happened

```
FileNotFoundError: [Errno 2] No such file or directory
  File "/opt/cis490/shipper/transport.py", line 60, in _build_ssl_context
    ctx = ssl.create_default_context(
        cafile=str(rcv.ca_bundle) if rcv.ca_bundle else None,
```

`cis490-shipper` enters a fast crash/restart loop (every 5 s via `RestartSec=5`) until the certs land.
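The failure is easy to reproduce outside the service. `ssl.create_default_context` loads `cafile` eagerly, so a configured path that is not on disk raises `FileNotFoundError` while the context is being built, before any connection is attempted. Here is a minimal sketch. `try_build_context` is a hypothetical helper that mirrors the call in the traceback; it is not the shipper's actual code.

```python
import ssl
import tempfile
from pathlib import Path
from typing import Optional


def try_build_context(ca_bundle: Optional[Path]) -> ssl.SSLContext:
    # Same call shape as the traceback: cafile is passed straight through,
    # so a configured-but-missing path raises before any network I/O.
    return ssl.create_default_context(
        cafile=str(ca_bundle) if ca_bundle else None,
    )


if __name__ == "__main__":
    # The temp directory exists, but the bundle file inside it does not.
    missing = Path(tempfile.mkdtemp()) / "wg-ca.pem"
    try:
        try_build_context(missing)
    except FileNotFoundError as exc:
        print(f"startup would crash here: {exc}")
```

Because the error surfaces when the context is constructed, any code path that builds the context during object initialization turns a not-yet-delivered cert into a startup crash.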

## What was tried

No workaround applied. The service will self-heal once `/etc/cis490/certs/wg-ca.pem` is deployed by the Pi.

## Suggested next step

In `shipper/transport.py`, `_build_ssl_context`: if `cafile` is configured but the path does not exist, log a clear warning and raise a retriable error (or, better, sleep and retry in `__main__`) instead of crashing immediately on startup. This makes first-boot bring-up order-independent, so the shipper can be enabled before the certs are delivered.
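Below is one possible shape for the suggestion. It uses a hypothetical `CertNotReadyError` and a hypothetical `wait_for_ssl_context` helper; neither comes from the shipper codebase. The approach is to pre-flight the path, raise a retriable error if it is missing, and have `__main__` sleep with capped exponential backoff instead of exiting.

```python
import logging
import ssl
import time
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger("shipper")


class CertNotReadyError(Exception):
    """Hypothetical retriable error: a configured cert path is not on disk yet."""


def build_ssl_context(ca_bundle: Optional[Path]) -> ssl.SSLContext:
    # Pre-flight: turn a configured-but-missing CA bundle into a retriable
    # error with a clear message instead of a bare FileNotFoundError.
    if ca_bundle is not None and not ca_bundle.is_file():
        raise CertNotReadyError(f"CA bundle {ca_bundle} not present yet")
    return ssl.create_default_context(
        cafile=str(ca_bundle) if ca_bundle else None,
    )


def wait_for_ssl_context(
    build: Callable[[], ssl.SSLContext],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> ssl.SSLContext:
    # Sleep-and-retry loop intended for __main__: the process stays alive
    # (no systemd restart) and backs off exponentially, capped at max_delay.
    delay = initial_delay
    warned = False
    while True:
        try:
            return build()
        except CertNotReadyError as exc:
            if not warned:  # log once, not on every retry
                log.warning("%s; waiting for cert delivery", exc)
                warned = True
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Usage, with a hypothetical path: `ctx = wait_for_ssl_context(lambda: build_ssl_context(Path("/etc/cis490/certs/wg-ca.pem")))`. Staying alive in-process avoids the 5 s restart churn. The `is_file()` check is inherently racy, though. A bundle that appears but is still being written can still fail inside `ssl`, so a real implementation might also treat that case as retriable.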

Related: `scripts/install-lab-host.sh` step 7 already auto-fetches the cert via `bootstrap.wg` if `host_id` is set; the race only happens if `host_id = REPLACE_ME` at install time (as it was here). Fixing that is a separate issue.

Refs spectral/CIS490 Dev_REL1_043026 bring-up.

max closed this issue 2026-04-30 16:14:04 -05:00
Owner

Fixed in 86a088c. Transport now pre-flights the `ca_bundle`, `client_cert`, and `client_key` paths and raises a recoverable `_CertNotReadyError` instead of letting `ssl.create_default_context` throw `FileNotFoundError` out of `__init__`. Each `ping` / `ship_tarball` call retries the build, so the systemd unit no longer crash-loops during first-boot bring-up: it logs a warning once and self-heals when the cert lands.

Regression test: `tests/test_shipper.py::test_transport_defers_when_ca_bundle_missing`.
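The fix described above defers the build rather than blocking: the context is no longer built in `__init__`, and each call retries until the cert lands. The sketch below shows that shape. `LazyTlsTransport`, `_context`, and the bool return from `ping` are hypothetical; only `_CertNotReadyError` and `_build_ssl_context` are names taken from this issue. Loading `client_cert` / `client_key` with `load_cert_chain` is an assumption about how those paths are used.

```python
import logging
import ssl
from pathlib import Path
from typing import Optional

log = logging.getLogger("shipper.transport")


class _CertNotReadyError(Exception):
    """Raised when a configured cert/key path is not on disk yet."""


class LazyTlsTransport:
    """Hypothetical transport: the SSL context is built on first use, not in
    __init__, and the build is retried on each call until it succeeds."""

    def __init__(self, ca_bundle: Optional[Path],
                 client_cert: Optional[Path] = None,
                 client_key: Optional[Path] = None) -> None:
        self._paths = {"ca_bundle": ca_bundle,
                       "client_cert": client_cert,
                       "client_key": client_key}
        self._ctx: Optional[ssl.SSLContext] = None
        self._warned = False  # warn once, not on every retry

    def _build_ssl_context(self) -> ssl.SSLContext:
        # Pre-flight every configured path before touching ssl.
        missing = [f"{name}={p}" for name, p in self._paths.items()
                   if p is not None and not p.is_file()]
        if missing:
            raise _CertNotReadyError("not on disk yet: " + ", ".join(missing))
        ca = self._paths["ca_bundle"]
        ctx = ssl.create_default_context(cafile=str(ca) if ca else None)
        cert, key = self._paths["client_cert"], self._paths["client_key"]
        if cert is not None:  # assumed mTLS client identity
            ctx.load_cert_chain(certfile=str(cert),
                                keyfile=str(key) if key else None)
        return ctx

    def _context(self) -> Optional[ssl.SSLContext]:
        # Returns None while certs are missing; caches once built.
        if self._ctx is None:
            try:
                self._ctx = self._build_ssl_context()
            except _CertNotReadyError as exc:
                if not self._warned:
                    log.warning("TLS not ready, deferring: %s", exc)
                    self._warned = True
                return None
        return self._ctx

    def ping(self) -> bool:
        # Report "not ready" instead of raising, so the service loop keeps
        # running and retries on the next call.
        return self._context() is not None
```

In this sketch, a `ship_tarball` method would call `self._context()` the same way and skip or requeue the upload while it returns `None`.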

Reference: bolyai/CIS490#11