● CI is green. Issue #484 deadline pass complete. #488

Merged
navicore merged 2 commits from issue-484 into main 2026-05-14 15:29:12 +00:00
Owner

#484

Summary of what landed:

TCP connect timeout (crates/runtime/src/tcp.rs)

  • Default 10s, SEQ_TCP_CONNECT_TIMEOUT_MS override
  • connect_to_addrs now uses TcpStream::connect_timeout per address

TLS handshake timeout (crates/runtime/src/tls.rs)

  • Default 10s, SEQ_TLS_HANDSHAKE_TIMEOUT_MS override
  • Per-IO budget on the underlying may stream during complete_io; cleared after handshake
  • #[cfg(test)] override hook for deterministic test timing

HTTP request/response timeout (crates/runtime/src/http_client/)

  • HttpStream trait extended with set_read_timeout/set_write_timeout so both plain TCP and TLS-wrapped streams
    expose the underlying may deadline
  • Default 30s, SEQ_HTTP_REQUEST_TIMEOUT_MS override
  • Bounds every read/write inside run_once; cleared before returning to pool
  • #[cfg(test)] override hook

Tests (crates/runtime/src/http_client/integration_tests.rs)

  • http_request_timeout_fires_on_silent_server: silent server stalls; client errors in <2s with 200ms override
  • tls_handshake_timeout_fires_on_silent_peer: TCP accepts but never speaks TLS; handshake errors in <2s

Docs (docs/STDLIB_REFERENCE.md): TCP, TLS, HTTP sections updated; the "no connect timeout" / "no per-request
timeout" v1 limitations replaced with the new behavior and env-var names.

https://git.navicore.tech/navicore/patch-seq/issues/484 Summary of what landed: TCP connect timeout (crates/runtime/src/tcp.rs) - Default 10s, SEQ_TCP_CONNECT_TIMEOUT_MS override - connect_to_addrs now uses TcpStream::connect_timeout per address TLS handshake timeout (crates/runtime/src/tls.rs) - Default 10s, SEQ_TLS_HANDSHAKE_TIMEOUT_MS override - Per-IO budget on the underlying may stream during complete_io; cleared after handshake - #[cfg(test)] override hook for deterministic test timing HTTP request/response timeout (crates/runtime/src/http_client/) - HttpStream trait extended with set_read_timeout/set_write_timeout so both plain TCP and TLS-wrapped streams expose the underlying may deadline - Default 30s, SEQ_HTTP_REQUEST_TIMEOUT_MS override - Bounds every read/write inside run_once; cleared before returning to pool - #[cfg(test)] override hook Tests (crates/runtime/src/http_client/integration_tests.rs) - http_request_timeout_fires_on_silent_server: silent server stalls; client errors in <2s with 200ms override - tls_handshake_timeout_fires_on_silent_peer: TCP accepts but never speaks TLS; handshake errors in <2s Docs (docs/STDLIB_REFERENCE.md): TCP, TLS, HTTP sections updated; the "no connect timeout" / "no per-request timeout" v1 limitations replaced with the new behavior and env-var names.
● CI is green. Issue #484 deadline pass complete.
All checks were successful
CI - Linux / CI - Linux x86_64 (pull_request) Successful in 8m53s
8f58149406
Summary of what landed:

  TCP connect timeout (crates/runtime/src/tcp.rs)
  - Default 10s, SEQ_TCP_CONNECT_TIMEOUT_MS override
  - connect_to_addrs now uses TcpStream::connect_timeout per address

  TLS handshake timeout (crates/runtime/src/tls.rs)
  - Default 10s, SEQ_TLS_HANDSHAKE_TIMEOUT_MS override
  - Per-IO budget on the underlying may stream during complete_io; cleared after handshake
  - #[cfg(test)] override hook for deterministic test timing

  HTTP request/response timeout (crates/runtime/src/http_client/)
  - HttpStream trait extended with set_read_timeout/set_write_timeout so both plain TCP and TLS-wrapped streams
  expose the underlying may deadline
  - Default 30s, SEQ_HTTP_REQUEST_TIMEOUT_MS override
  - Bounds every read/write inside run_once; cleared before returning to pool
  - #[cfg(test)] override hook

  Tests (crates/runtime/src/http_client/integration_tests.rs)
  - http_request_timeout_fires_on_silent_server: silent server stalls; client errors in <2s with 200ms override
  - tls_handshake_timeout_fires_on_silent_peer: TCP accepts but never speaks TLS; handshake errors in <2s

  Docs (docs/STDLIB_REFERENCE.md): TCP, TLS, HTTP sections updated; the "no connect timeout" / "no per-request
  timeout" v1 limitations replaced with the new behavior and env-var names.
Author
Owner

Review — PR #488 (networking deadline pass, closes #484)

Reviewed against origin/issue-484 on top of the issue-483 branch (PR #487 commits omitted from this review's scope). Three IO layers (TCP connect, TLS handshake, HTTP request) each got a deadline via the same shape: env-var override + LazyLock-cached default + per-op timeout on the underlying may TcpStream. The HttpStream trait extension is the right abstraction for the HTTP layer to drive both plain-TCP and TLS-wrapped streams uniformly. Architecture is coherent. One real flake risk to fix, one doc framing issue worth tightening, and a couple of smaller observations.

What works well

  • Three layers, one mechanism. All three timeouts cash out to set_read_timeout/set_write_timeout on the underlying may::net::TcpStream. May's per-op deadline already plumbs through to the cooperative scheduler, so no new "may-aware deadline primitive" was needed — the existing one is reused via the trait. The PR's choice to extend HttpStream rather than introduce a new abstraction is the smallest viable change.
  • Per-op state hygiene in run_once. The set→IO→clear shape with an IIFE-captured Result ensures the clear runs whether wire::write_request / wire::read_response succeeded or returned an Err. A pool entry returned after a timeout has its deadline cleared, so the next consumer of that connection starts clean. Same hygiene in build_tls's success path.
  • TLS-handshake-timeout-then-clear-for-app-IO. The comment ("leaving the handshake's short deadline in place would cap every subsequent read/write at the handshake's bound, which is the wrong budget for app traffic") is correct and worth keeping. The HTTP layer overrides the timeout anyway, but the clear is the right defensive choice for non-HTTP callers (net.tls.client followed by raw net.tcp.read/net.tcp.write).
  • Per-connect timeout, not per-call. The doc explicitly notes that connect_to_addrs walking N addresses can take N × timeout total. Honest about the cost; consistent with the no-happy-eyeballs limitation that follows.
  • Test-only override hooks (set_test_*_timeout). Standard Mutex<Option<Duration>> indirection via current_*_timeout()#[cfg(test)]-gated so production has zero footprint, and the runtime call site reads identically in both builds. Same pattern as the TLS-config override from PR #487. RAII guards (HttpRequestTimeoutGuard, TlsHandshakeTimeoutGuard) restore state on panic.
  • The two new integration tests run on std::thread / std::net, not may. Consistent with the same-process rustls test from PR #487, with the same scheduler-congestion rationale. Right call.

Flake risk — HTTP request timeout test lacks a serial tag

http_request_timeout_fires_on_silent_server mutates the process-wide HTTP_REQUEST_TIMEOUT_OVERRIDE to 200ms, but it isn't #[serial_test::serial(...)]-tagged. Every other HTTP integration test calls perform_request_with_targetrun_oncehttp_request_timeout(), which reads that same global override.

Concrete flake shape: if http_request_timeout_fires_on_silent_server is mid-run (override = 200ms) while http_chunked_response_decodes_end_to_end is doing its multi-chunk read, the chunked test sees a 200ms per-read budget instead of 30s. On a loaded CI host with a slow scheduler tick, that's a flake. The window is small (the timeout test finishes in <2s) but cargo-test's default parallelism (NUM_CPUS) means several tests overlap.

The TLS-side analogue avoided this by tagging the handshake test #[serial_test::serial(tls_global_state)], which collides with the HTTPS test. The HTTP test needs the same treatment:

#[test]
#[serial_test::serial(http_global_state)]
fn http_request_timeout_fires_on_silent_server() { ... }

…and ideally every other HTTP test in this file picks up the same tag so all HTTP tests serialise as a group. The cost is parallelism within the HTTP suite; the benefit is no override-leak races. Worth the trade.

Alternative: scope the override to the test only by reading the env var afresh each call (no LazyLock) — but that changes the production read pattern, which is worse. The serial tag is the right fix.

Doc framing — slow-trickle defence is partly overstated

The new STDLIB_REFERENCE.md line:

"A peer that goes silent mid-response — including the slow-trickle / never-EOF case — surfaces as a wire error within the bound."

The "goes silent" half is exactly right. The "slow-trickle" half is true if the trickle is slower than the timeout (e.g., 1 byte every 35s with a 30s bound — the read times out). But a sub-timeout trickle (1 byte every 29s with a 30s bound) defeats per-op timeouts: every read() returns 1 byte just before the deadline fires, and the loop keeps going.

For EOF-framed bodies, wire.rs:174 uses .take(MAX_BODY_SIZE + 1).read_to_end(&mut buf). read_to_end calls read repeatedly until it returns 0. With a 1-byte-per-29s trickle, each read returns 1 byte; the loop runs MAX_BODY_SIZE + 1 = 10 MiB + 1 times. At 29s per byte that's ~9 years of wall time before MAX_BODY_SIZE kicks in. Not a practical attack at that rate, but it does mean the per-IO timeout isn't a total-time bound.

This is the well-known residual risk of per-op-only timeouts (Slowloris-shaped). The standard defence is to accept it because the alternative — a hard total deadline — fails legitimate slow-but-progressing responses (large bodies on slow links).

I'd tighten the doc framing to:

"A peer that goes silent mid-response — no bytes for timeout — surfaces as a wire error within the bound. Note: this is per-IO, not total. A peer that sends one byte every timeout − ε defeats the bound on a per-read basis; in practice the residual risk is bounded by MAX_BODY_SIZE (10 MB) × timeout, but a total-time deadline (planned follow-up) would be the strict fix."

Or words to that effect. The current phrasing reads like "slow-trickle is fully defended" — which is reasonable to expect from "a deadline pass" but isn't quite what the code does.

Worth flagging — no test for the TCP connect timeout

The issue had three IO layers. Tests landed for HTTP request and TLS handshake but not TCP connect. The connect_timeout switch is a one-line API change, but a regression that drops back to TcpStream::connect (e.g., a future refactor adding happy-eyeballs and missing the timeout) wouldn't be caught.

The test shape is similar to the existing ones: drop the timeout via env (or a new test override hook on TCP_CONNECT_TIMEOUT), dial a non-responsive IP, measure elapsed. The trick is what address to dial — a port on 127.0.0.1 with no listener completes connect almost instantly (ECONNREFUSED), not the silent-peer behaviour you need. Three options:

  1. A blackhole multicast IP, similar to the SSRF rebinding test's 224.0.0.1 — but multicast typically fails fast (ENETUNREACH), not slow-SYN. Wrong failure mode.
  2. TEST-NET-1 (192.0.2.0/24) — slow SYN on default-route boxes. The SSRF test explicitly avoids this because the test would take the full SYN timeout. Inverted here — with a 200ms override it WOULD finish fast, but only after explicitly setting a connect-timeout override.
  3. Connect to a port that's iptables -j DROP'd — heavy CI setup, probably out of scope.

The cleanest fix is to add a test-only set_test_tcp_connect_timeout hook (matching the TLS pattern), then use TEST-NET-1 with a 200ms override. The override hook is ~10 lines.

Defer-or-do is the author's call — the API is small and well-trodden, so the regression risk is genuinely lower than the HTTP/TLS sides. Just flagging that the third leg of the deadline tripod has no anchor.

Smaller observations

  • build_tls's success-path clear is let _ = tcp.set_read_timeout(None). If the clear fails on a successful handshake, the returned TLS stream carries a 10s handshake timeout into the application IO phase. The HTTP layer overwrites it via run_once, so users of the HTTP client are safe. But for direct net.tls.client + raw net.tcp.read callers, a clear-failure leaves the read bounded by 10s. Edge case (setsockopt rarely fails on a healthy fd), but the silent-failure choice is worth a one-liner: "ignoring the clear-failure path because a returned handshake error surfaces the underlying state more clearly than a forced cleanup error" or whatever the author's actual reasoning is.

  • http_request_timeout() and tls_handshake_timeout() lock a mutex on every call in #[cfg(test)] builds. The lock is uncontended in production-shape tests, but if the deadline pass ever lands in a hot inner loop (it doesn't today), the test-only lock would be a noticeable cost vs. the LazyLock cache. Not a problem at current usage; flagging only.

  • HttpStream trait gets set_read_timeout / set_write_timeout with Option<Duration>. None clears; Some(dur) sets. Consistent with std and may's APIs. Trait surface is now three methods (raw_fd + 2 timeouts) — still narrow. Good.

  • The silent-server background threads sleep for 60s before dropping the stream. Test passes in <2s, but the thread sits for 58+s after the test returns. cargo-test waits for the test fn, not the spawned thread; the process exits when all tests finish and the kernel reaps the thread. So no functional issue, just a paranoid-aesthetic observation: a Drop-on-channel-close pattern would be tidier, but the current shape is fine for a test that runs once per CI invocation.

  • SEQ_*_TIMEOUT_MS env vars treat 0 as fallback-to-default. Both the doc and the filter(|n| *n > 0) capture this. Reasonable — 0 is ambiguous between "as fast as possible" and "no timeout," and falling back to the default is the safer default. Worth noting that this means there's no env-var-only way to disable the timeout (which is itself defensible — disabling would re-introduce the hazard the PR closes).

Verdict

Land after:

  • Adding #[serial_test::serial(http_global_state)] to http_request_timeout_fires_on_silent_server (and ideally to the other HTTP integration tests in the same file, so they form one serialised group).
  • Tightening the slow-trickle doc framing so it doesn't overclaim against the sub-timeout-trickle attack shape.

Recommended (optional for this PR):

  • A TCP-connect-timeout integration test via a new set_test_tcp_connect_timeout hook + TEST-NET-1.
  • A comment on build_tls's let _ = tcp.set_read_timeout(None) swallowing the clear failure on the happy path.

The architecture is the right shape and the code is clean. The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog. Good closeout of issue #484.

## Review — PR #488 (networking deadline pass, closes #484) Reviewed against `origin/issue-484` on top of the `issue-483` branch (PR #487 commits omitted from this review's scope). Three IO layers (TCP connect, TLS handshake, HTTP request) each got a deadline via the same shape: env-var override + LazyLock-cached default + per-op timeout on the underlying may TcpStream. The `HttpStream` trait extension is the right abstraction for the HTTP layer to drive both plain-TCP and TLS-wrapped streams uniformly. Architecture is coherent. One **real flake risk** to fix, one **doc framing issue** worth tightening, and a couple of smaller observations. ### What works well - **Three layers, one mechanism.** All three timeouts cash out to `set_read_timeout`/`set_write_timeout` on the underlying `may::net::TcpStream`. May's per-op deadline already plumbs through to the cooperative scheduler, so no new "may-aware deadline primitive" was needed — the existing one is reused via the trait. The PR's choice to extend `HttpStream` rather than introduce a new abstraction is the smallest viable change. - **Per-op state hygiene in `run_once`.** The set→IO→clear shape with an IIFE-captured `Result` ensures the clear runs whether `wire::write_request` / `wire::read_response` succeeded or returned an `Err`. A pool entry returned after a timeout has its deadline cleared, so the next consumer of that connection starts clean. Same hygiene in `build_tls`'s success path. - **TLS-handshake-timeout-then-clear-for-app-IO.** The comment ("leaving the handshake's short deadline in place would cap every subsequent read/write at the handshake's bound, which is the wrong budget for app traffic") is correct and worth keeping. The HTTP layer overrides the timeout anyway, but the clear is the right defensive choice for non-HTTP callers (`net.tls.client` followed by raw `net.tcp.read`/`net.tcp.write`). - **Per-connect timeout, not per-call.** The doc explicitly notes that `connect_to_addrs` walking N addresses can take `N × timeout` total. Honest about the cost; consistent with the no-happy-eyeballs limitation that follows. - **Test-only override hooks (`set_test_*_timeout`).** Standard `Mutex<Option<Duration>>` indirection via `current_*_timeout()` — `#[cfg(test)]`-gated so production has zero footprint, and the runtime call site reads identically in both builds. Same pattern as the TLS-config override from PR #487. RAII guards (`HttpRequestTimeoutGuard`, `TlsHandshakeTimeoutGuard`) restore state on panic. - **The two new integration tests run on `std::thread` / `std::net`, not may.** Consistent with the same-process rustls test from PR #487, with the same scheduler-congestion rationale. Right call. ### Flake risk — HTTP request timeout test lacks a serial tag `http_request_timeout_fires_on_silent_server` mutates the process-wide `HTTP_REQUEST_TIMEOUT_OVERRIDE` to 200ms, but it isn't `#[serial_test::serial(...)]`-tagged. Every other HTTP integration test calls `perform_request_with_target` → `run_once` → `http_request_timeout()`, which reads that same global override. Concrete flake shape: if `http_request_timeout_fires_on_silent_server` is mid-run (override = 200ms) while `http_chunked_response_decodes_end_to_end` is doing its multi-chunk read, the chunked test sees a 200ms per-read budget instead of 30s. On a loaded CI host with a slow scheduler tick, that's a flake. The window is small (the timeout test finishes in <2s) but cargo-test's default parallelism (NUM_CPUS) means several tests overlap. The TLS-side analogue avoided this by tagging the handshake test `#[serial_test::serial(tls_global_state)]`, which collides with the HTTPS test. The HTTP test needs the same treatment: ```rust #[test] #[serial_test::serial(http_global_state)] fn http_request_timeout_fires_on_silent_server() { ... } ``` …and ideally every other HTTP test in this file picks up the same tag so all HTTP tests serialise as a group. The cost is parallelism within the HTTP suite; the benefit is no override-leak races. Worth the trade. Alternative: scope the override to the *test only* by reading the env var afresh each call (no LazyLock) — but that changes the production read pattern, which is worse. The serial tag is the right fix. ### Doc framing — slow-trickle defence is partly overstated The new STDLIB_REFERENCE.md line: > "A peer that goes silent mid-response — including the slow-trickle / never-EOF case — surfaces as a wire error within the bound." The "goes silent" half is exactly right. The "slow-trickle" half is true if the trickle is *slower than the timeout* (e.g., 1 byte every 35s with a 30s bound — the read times out). But a sub-timeout trickle (1 byte every 29s with a 30s bound) defeats per-op timeouts: every `read()` returns 1 byte just before the deadline fires, and the loop keeps going. For EOF-framed bodies, `wire.rs:174` uses `.take(MAX_BODY_SIZE + 1).read_to_end(&mut buf)`. `read_to_end` calls `read` repeatedly until it returns 0. With a 1-byte-per-29s trickle, each `read` returns 1 byte; the loop runs `MAX_BODY_SIZE + 1 = 10 MiB + 1` times. At 29s per byte that's ~9 years of wall time before MAX_BODY_SIZE kicks in. Not a practical attack at that rate, but it does mean the per-IO timeout isn't a total-time bound. This is the well-known residual risk of per-op-only timeouts (Slowloris-shaped). The standard defence is to accept it because the alternative — a hard total deadline — fails legitimate slow-but-progressing responses (large bodies on slow links). I'd tighten the doc framing to: > "A peer that goes silent mid-response — no bytes for `timeout` — surfaces as a wire error within the bound. **Note: this is per-IO, not total.** A peer that sends one byte every `timeout − ε` defeats the bound on a per-`read` basis; in practice the residual risk is bounded by `MAX_BODY_SIZE` (10 MB) × `timeout`, but a total-time deadline (planned follow-up) would be the strict fix." Or words to that effect. The current phrasing reads like "slow-trickle is fully defended" — which is reasonable to expect from "a deadline pass" but isn't quite what the code does. ### Worth flagging — no test for the TCP connect timeout The issue had three IO layers. Tests landed for HTTP request and TLS handshake but not TCP connect. The connect_timeout switch is a one-line API change, but a regression that drops back to `TcpStream::connect` (e.g., a future refactor adding happy-eyeballs and missing the timeout) wouldn't be caught. The test shape is similar to the existing ones: drop the timeout via env (or a new test override hook on `TCP_CONNECT_TIMEOUT`), dial a non-responsive IP, measure elapsed. The trick is what address to dial — a port on `127.0.0.1` with no listener completes connect almost instantly (ECONNREFUSED), not the silent-peer behaviour you need. Three options: 1. **A blackhole multicast IP**, similar to the SSRF rebinding test's 224.0.0.1 — but multicast typically fails fast (ENETUNREACH), not slow-SYN. Wrong failure mode. 2. **TEST-NET-1 (192.0.2.0/24)** — slow SYN on default-route boxes. The SSRF test explicitly avoids this because the test would take the full SYN timeout. Inverted here — with a 200ms override it WOULD finish fast, but only after explicitly setting a connect-timeout override. 3. **Connect to a port that's `iptables -j DROP`'d** — heavy CI setup, probably out of scope. The cleanest fix is to add a test-only `set_test_tcp_connect_timeout` hook (matching the TLS pattern), then use TEST-NET-1 with a 200ms override. The override hook is ~10 lines. Defer-or-do is the author's call — the API is small and well-trodden, so the regression risk is genuinely lower than the HTTP/TLS sides. Just flagging that the third leg of the deadline tripod has no anchor. ### Smaller observations - **`build_tls`'s success-path clear is `let _ = tcp.set_read_timeout(None)`.** If the clear fails on a successful handshake, the returned TLS stream carries a 10s handshake timeout into the application IO phase. The HTTP layer overwrites it via `run_once`, so users of the HTTP client are safe. But for direct `net.tls.client` + raw `net.tcp.read` callers, a clear-failure leaves the read bounded by 10s. Edge case (setsockopt rarely fails on a healthy fd), but the silent-failure choice is worth a one-liner: "ignoring the clear-failure path because a returned handshake error surfaces the underlying state more clearly than a forced cleanup error" or whatever the author's actual reasoning is. - **`http_request_timeout()` and `tls_handshake_timeout()` lock a mutex on every call** in `#[cfg(test)]` builds. The lock is uncontended in production-shape tests, but if the deadline pass ever lands in a hot inner loop (it doesn't today), the test-only lock would be a noticeable cost vs. the LazyLock cache. Not a problem at current usage; flagging only. - **`HttpStream` trait gets `set_read_timeout` / `set_write_timeout` with `Option<Duration>`.** `None` clears; `Some(dur)` sets. Consistent with std and may's APIs. Trait surface is now three methods (raw_fd + 2 timeouts) — still narrow. Good. - **The silent-server background threads sleep for 60s before dropping the stream.** Test passes in <2s, but the thread sits for 58+s after the test returns. cargo-test waits for the test fn, not the spawned thread; the process exits when all tests finish and the kernel reaps the thread. So no functional issue, just a paranoid-aesthetic observation: a `Drop`-on-channel-close pattern would be tidier, but the current shape is fine for a test that runs once per CI invocation. - **`SEQ_*_TIMEOUT_MS` env vars treat 0 as fallback-to-default.** Both the doc and the `filter(|n| *n > 0)` capture this. Reasonable — `0` is ambiguous between "as fast as possible" and "no timeout," and falling back to the default is the safer default. Worth noting that this means there's no env-var-only way to *disable* the timeout (which is itself defensible — disabling would re-introduce the hazard the PR closes). ### Verdict **Land after:** - Adding `#[serial_test::serial(http_global_state)]` to `http_request_timeout_fires_on_silent_server` (and ideally to the other HTTP integration tests in the same file, so they form one serialised group). - Tightening the slow-trickle doc framing so it doesn't overclaim against the sub-timeout-trickle attack shape. **Recommended (optional for this PR):** - A TCP-connect-timeout integration test via a new `set_test_tcp_connect_timeout` hook + TEST-NET-1. - A comment on `build_tls`'s `let _ = tcp.set_read_timeout(None)` swallowing the clear failure on the happy path. The architecture is the right shape and the code is clean. The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog. Good closeout of issue #484.
● CI green. Review fixes complete:
All checks were successful
CI - Linux / CI - Linux x86_64 (pull_request) Successful in 8m53s
05ba59c399
- Serial tag on HTTP tests (integration_tests.rs): every test that calls perform_request* and reaches run_once
  now carries #[serial_test::serial(http_global_state)]. HTTPS test stacks tls_global_state + http_global_state
  because it reads both globals. DNS rebinding and TLS-handshake-timeout tests keep their existing single tags
  because they fail before run_once (no HTTP timeout read).
  - Slow-trickle doc framing (STDLIB_REFERENCE.md): replaced the "including the slow-trickle / never-EOF case"
  claim with explicit per-IO vs total framing and a note that a sub-timeout trickle defeats the per-read bound;
  residual exposure capped at MAX_BODY_SIZE × timeout.
  - TCP connect timeout test + override (tcp.rs + integration_tests.rs): added #[cfg(test)]
  set_test_tcp_connect_timeout mirroring the TLS/HTTP hooks; new tcp_connect_timeout_fires_on_silent_route test
  dials 192.0.2.1 (TEST-NET-1) with a 200ms override and asserts <2s elapsed. Documented the no-default-route
  caveat.
  - build_tls clear-failure comment (tls.rs): one-block explanation of why we swallow the set_*_timeout(None)
  errors on the happy path (setsockopt on a healthy fd ~can't fail; HTTP layer re-sets per request anyway).

  Declined: mutex-cost-in-tests observation, drop-on-channel-close aesthetics, and the 0-means-default convention
  — reviewer flagged those as non-issues.
Author
Owner

Second pass — review-driven fixes look right

Walked the delta from 8f58149..05ba59c. All four items landed cleanly.

Serial-tag coverage across the HTTP test suite

The tagging matrix is consistent: every test that calls perform_request* and reaches run_once (which reads HTTP_REQUEST_TIMEOUT_OVERRIDE) now carries #[serial_test::serial(http_global_state)]. I walked the call paths to confirm the carve-outs are right:

  • DNS rebinding test — stays dns_global_state only. Path: perform_request → validate_url → resolve_to_ips returns [224.0.0.1] → SSRF passes → dial_fresh → connect_to_addrs(224.0.0.1, 80) → kernel returns ENETUNREACH immediately (multicast). dial_fresh returns Err; run_once is never reached. So HTTP_REQUEST_TIMEOUT_OVERRIDE is never read. ✓
  • TLS handshake timeout test — stays tls_global_state only. Path: dial_fresh → connect succeeds on 127.0.0.1 → upgrade_tcp_in_place → build_tls → complete_io times out at the handshake bound → dial_fresh returns Err; run_once never reached. ✓
  • HTTPS round-trip test — correctly stacks both tls_global_state AND http_global_state. It needs TLS config installed AND it does reach run_once for the actual HTTPS round-trip. The comment captures the reasoning ("this test reaches run_once and so reads HTTP_REQUEST_TIMEOUT_OVERRIDE").

The stacked-tag pattern is a less-common serial_test shape — worth a one-line acknowledgement that serial_test::serial composes (a test in two groups is serialised against any test in either group). Cargo docs confirm this composes correctly; the runtime behaviour matches the comment's promise.

The header comment on http_get_end_to_end_against_local_server (the first tagged test) explains the rationale for the cohort tag in one place. Good — a future test author adding an HTTP test now has a discoverable note explaining why to follow the pattern.

Slow-trickle doc framing

The rewritten line is exactly the framing I asked for. Explicit per-IO vs total, the MAX_BODY_SIZE × timeout residual-exposure bound is quantified, and "a total-time deadline is a planned follow-up" sets the expectation that this isn't the final word. The "today the per-IO bound covers the 'goes fully silent' hazard, not the slow-trickle-but-progressing one" sentence is honest about what's covered and what isn't. Closes the over-claim cleanly.

TCP connect timeout test + override

Override hook mirrors the TLS/HTTP shape (Mutex<Option<Duration>> + set_test_tcp_connect_timeout + tcp_connect_timeout() helper that prefers override over LazyLock). The internal connect_to_addrs call switched from *TCP_CONNECT_TIMEOUT to tcp_connect_timeout() so the test override is plumbed through. Production path is unchanged.

The test uses 192.0.2.1 (TEST-NET-1) with a 200ms override. The honest no-default-route caveat in the comment is the right shape — the test still passes on hosts where the IP returns ENETUNREACH instantly, but on hosts with the slow-SYN shape (typical CI containers/VMs), a regression dropping back to plain connect would balloon to 60s+ and trigger the < 2s assertion. Pin is environment-dependent but specifically designed to catch the most likely regression on the most likely CI shape.

One minor observation, not for fixing: the test is tagged http_global_state only. I traced whether anything else cares about TCP_CONNECT_TIMEOUT_OVERRIDE:

  • DNS rebinding test dials multicast — kernel fast-fail before timeout matters.
  • TLS handshake test dials 127.0.0.1 — connects in microseconds before timeout matters.
  • All HTTP tests share http_global_state and so are serialised against this one.

So the existing tag is sufficient — no other concurrent test is sensitive to the override. A tcp_global_state group would be more semantically pure, but adds a group with no current second member. Defer-or-do is the author's call.

TcpConnectTimeoutGuard is the right RAII guard — clears on drop including panic. Consistent with the HTTP and TLS guards.

build_tls clear-failure comment

The new block is the right rationale: setsockopt on a healthy just-handshook fd ~can't fail; the only realistic failure mode is concurrent fd-close, where the returned stream is already dead and the next read/write will surface it; propagating the clear-failure would mask the underlying state with a synthetic error; HTTP-client callers re-set per request in run_once anyway, so a stale handshake deadline can't leak into app IO via the HTTP path.

Captures everything a future reader would want to know in five lines. Good.

Declined items

All three correctly judged non-issues:

  • Mutex cost in test-only overrides — only paid in #[cfg(test)] builds, not in production.
  • Drop-on-channel-close aesthetics — the 60s sleep-then-drop pattern is fine for tests that run once per CI invocation.
  • 0-means-default convention — defensible default; disabling would re-introduce the hazard.

No issue.

Verdict

LGTM. Ship it.

The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog, the three-layer architecture is coherent and the per-op state hygiene is right, every regression shape now has a test anchor, and the doc is honest about what's covered and what's deferred to a future total-time-deadline pass.

## Second pass — review-driven fixes look right Walked the delta from `8f58149..05ba59c`. All four items landed cleanly. ### Serial-tag coverage across the HTTP test suite The tagging matrix is consistent: every test that calls `perform_request*` and reaches `run_once` (which reads `HTTP_REQUEST_TIMEOUT_OVERRIDE`) now carries `#[serial_test::serial(http_global_state)]`. I walked the call paths to confirm the carve-outs are right: - **DNS rebinding test** — stays `dns_global_state` only. Path: `perform_request → validate_url → resolve_to_ips` returns `[224.0.0.1]` → SSRF passes → `dial_fresh → connect_to_addrs(224.0.0.1, 80)` → kernel returns ENETUNREACH immediately (multicast). `dial_fresh` returns Err; `run_once` is never reached. So `HTTP_REQUEST_TIMEOUT_OVERRIDE` is never read. ✓ - **TLS handshake timeout test** — stays `tls_global_state` only. Path: dial_fresh → connect succeeds on 127.0.0.1 → `upgrade_tcp_in_place → build_tls → complete_io` times out at the handshake bound → dial_fresh returns Err; `run_once` never reached. ✓ - **HTTPS round-trip test** — correctly stacks both `tls_global_state` AND `http_global_state`. It needs TLS config installed AND it does reach `run_once` for the actual HTTPS round-trip. The comment captures the reasoning ("this test reaches `run_once` and so reads `HTTP_REQUEST_TIMEOUT_OVERRIDE`"). The stacked-tag pattern is a less-common `serial_test` shape — worth a one-line acknowledgement that `serial_test::serial` composes (a test in two groups is serialised against any test in either group). Cargo docs confirm this composes correctly; the runtime behaviour matches the comment's promise. The header comment on `http_get_end_to_end_against_local_server` (the first tagged test) explains the rationale for the cohort tag in one place. Good — a future test author adding an HTTP test now has a discoverable note explaining why to follow the pattern. ### Slow-trickle doc framing The rewritten line is exactly the framing I asked for. Explicit per-IO vs total, the `MAX_BODY_SIZE × timeout` residual-exposure bound is quantified, and "a total-time deadline is a planned follow-up" sets the expectation that this isn't the final word. The "today the per-IO bound covers the 'goes fully silent' hazard, not the slow-trickle-but-progressing one" sentence is honest about what's covered and what isn't. Closes the over-claim cleanly. ### TCP connect timeout test + override Override hook mirrors the TLS/HTTP shape (`Mutex<Option<Duration>>` + `set_test_tcp_connect_timeout` + `tcp_connect_timeout()` helper that prefers override over LazyLock). The internal `connect_to_addrs` call switched from `*TCP_CONNECT_TIMEOUT` to `tcp_connect_timeout()` so the test override is plumbed through. Production path is unchanged. The test uses 192.0.2.1 (TEST-NET-1) with a 200ms override. The honest no-default-route caveat in the comment is the right shape — the test still passes on hosts where the IP returns ENETUNREACH instantly, but on hosts with the slow-SYN shape (typical CI containers/VMs), a regression dropping back to plain `connect` would balloon to 60s+ and trigger the `< 2s` assertion. Pin is environment-dependent but specifically designed to catch the most likely regression on the most likely CI shape. One minor observation, not for fixing: the test is tagged `http_global_state` only. I traced whether anything else cares about `TCP_CONNECT_TIMEOUT_OVERRIDE`: - DNS rebinding test dials multicast — kernel fast-fail before timeout matters. - TLS handshake test dials 127.0.0.1 — connects in microseconds before timeout matters. - All HTTP tests share `http_global_state` and so are serialised against this one. So the existing tag is sufficient — no other concurrent test is sensitive to the override. A `tcp_global_state` group would be more semantically pure, but adds a group with no current second member. Defer-or-do is the author's call. `TcpConnectTimeoutGuard` is the right RAII guard — clears on drop including panic. Consistent with the HTTP and TLS guards. ### build_tls clear-failure comment The new block is the right rationale: setsockopt on a healthy just-handshook fd ~can't fail; the only realistic failure mode is concurrent fd-close, where the returned stream is already dead and the next read/write will surface it; propagating the clear-failure would *mask* the underlying state with a synthetic error; HTTP-client callers re-set per request in `run_once` anyway, so a stale handshake deadline can't leak into app IO via the HTTP path. Captures everything a future reader would want to know in five lines. Good. ### Declined items All three correctly judged non-issues: - **Mutex cost in test-only overrides** — only paid in `#[cfg(test)]` builds, not in production. - **Drop-on-channel-close aesthetics** — the 60s sleep-then-drop pattern is fine for tests that run once per CI invocation. - **0-means-default convention** — defensible default; disabling would re-introduce the hazard. No issue. ### Verdict LGTM. Ship it. The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog, the three-layer architecture is coherent and the per-op state hygiene is right, every regression shape now has a test anchor, and the doc is honest about what's covered and what's deferred to a future total-time-deadline pass.
navicore deleted branch issue-484 2026-05-14 15:29:12 +00:00
Sign in to join this conversation.
No description provided.