● CI is green. Issue #484 deadline pass complete. #488
No reviewers
Labels
No labels
bug
dependencies
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
refactor
rust
technical-debt
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
navicore/patch-seq!488
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "issue-484"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
#484
Summary of what landed:
TCP connect timeout (crates/runtime/src/tcp.rs)
TLS handshake timeout (crates/runtime/src/tls.rs)
HTTP request/response timeout (crates/runtime/src/http_client/)
expose the underlying may deadline
Tests (crates/runtime/src/http_client/integration_tests.rs)
Docs (docs/STDLIB_REFERENCE.md): TCP, TLS, HTTP sections updated; the "no connect timeout" / "no per-request
timeout" v1 limitations replaced with the new behavior and env-var names.
Review — PR #488 (networking deadline pass, closes #484)
Reviewed against
origin/issue-484on top of theissue-483branch (PR #487 commits omitted from this review's scope). Three IO layers (TCP connect, TLS handshake, HTTP request) each got a deadline via the same shape: env-var override + LazyLock-cached default + per-op timeout on the underlying may TcpStream. TheHttpStreamtrait extension is the right abstraction for the HTTP layer to drive both plain-TCP and TLS-wrapped streams uniformly. Architecture is coherent. One real flake risk to fix, one doc framing issue worth tightening, and a couple of smaller observations.What works well
set_read_timeout/set_write_timeouton the underlyingmay::net::TcpStream. May's per-op deadline already plumbs through to the cooperative scheduler, so no new "may-aware deadline primitive" was needed — the existing one is reused via the trait. The PR's choice to extendHttpStreamrather than introduce a new abstraction is the smallest viable change.run_once. The set→IO→clear shape with an IIFE-capturedResultensures the clear runs whetherwire::write_request/wire::read_responsesucceeded or returned anErr. A pool entry returned after a timeout has its deadline cleared, so the next consumer of that connection starts clean. Same hygiene inbuild_tls's success path.net.tls.clientfollowed by rawnet.tcp.read/net.tcp.write).connect_to_addrswalking N addresses can takeN × timeouttotal. Honest about the cost; consistent with the no-happy-eyeballs limitation that follows.set_test_*_timeout). StandardMutex<Option<Duration>>indirection viacurrent_*_timeout()—#[cfg(test)]-gated so production has zero footprint, and the runtime call site reads identically in both builds. Same pattern as the TLS-config override from PR #487. RAII guards (HttpRequestTimeoutGuard,TlsHandshakeTimeoutGuard) restore state on panic.std::thread/std::net, not may. Consistent with the same-process rustls test from PR #487, with the same scheduler-congestion rationale. Right call.Flake risk — HTTP request timeout test lacks a serial tag
http_request_timeout_fires_on_silent_servermutates the process-wideHTTP_REQUEST_TIMEOUT_OVERRIDEto 200ms, but it isn't#[serial_test::serial(...)]-tagged. Every other HTTP integration test callsperform_request_with_target→run_once→http_request_timeout(), which reads that same global override.Concrete flake shape: if
http_request_timeout_fires_on_silent_serveris mid-run (override = 200ms) whilehttp_chunked_response_decodes_end_to_endis doing its multi-chunk read, the chunked test sees a 200ms per-read budget instead of 30s. On a loaded CI host with a slow scheduler tick, that's a flake. The window is small (the timeout test finishes in <2s) but cargo-test's default parallelism (NUM_CPUS) means several tests overlap.The TLS-side analogue avoided this by tagging the handshake test
#[serial_test::serial(tls_global_state)], which collides with the HTTPS test. The HTTP test needs the same treatment:…and ideally every other HTTP test in this file picks up the same tag so all HTTP tests serialise as a group. The cost is parallelism within the HTTP suite; the benefit is no override-leak races. Worth the trade.
Alternative: scope the override to the test only by reading the env var afresh each call (no LazyLock) — but that changes the production read pattern, which is worse. The serial tag is the right fix.
Doc framing — slow-trickle defence is partly overstated
The new STDLIB_REFERENCE.md line:
The "goes silent" half is exactly right. The "slow-trickle" half is true if the trickle is slower than the timeout (e.g., 1 byte every 35s with a 30s bound — the read times out). But a sub-timeout trickle (1 byte every 29s with a 30s bound) defeats per-op timeouts: every
read()returns 1 byte just before the deadline fires, and the loop keeps going.For EOF-framed bodies,
wire.rs:174uses.take(MAX_BODY_SIZE + 1).read_to_end(&mut buf).read_to_endcallsreadrepeatedly until it returns 0. With a 1-byte-per-29s trickle, eachreadreturns 1 byte; the loop runsMAX_BODY_SIZE + 1 = 10 MiB + 1times. At 29s per byte that's ~9 years of wall time before MAX_BODY_SIZE kicks in. Not a practical attack at that rate, but it does mean the per-IO timeout isn't a total-time bound.This is the well-known residual risk of per-op-only timeouts (Slowloris-shaped). The standard defence is to accept it because the alternative — a hard total deadline — fails legitimate slow-but-progressing responses (large bodies on slow links).
I'd tighten the doc framing to:
Or words to that effect. The current phrasing reads like "slow-trickle is fully defended" — which is reasonable to expect from "a deadline pass" but isn't quite what the code does.
Worth flagging — no test for the TCP connect timeout
The issue had three IO layers. Tests landed for HTTP request and TLS handshake but not TCP connect. The connect_timeout switch is a one-line API change, but a regression that drops back to
TcpStream::connect(e.g., a future refactor adding happy-eyeballs and missing the timeout) wouldn't be caught.The test shape is similar to the existing ones: drop the timeout via env (or a new test override hook on
TCP_CONNECT_TIMEOUT), dial a non-responsive IP, measure elapsed. The trick is what address to dial — a port on127.0.0.1with no listener completes connect almost instantly (ECONNREFUSED), not the silent-peer behaviour you need. Three options:iptables -j DROP'd — heavy CI setup, probably out of scope.The cleanest fix is to add a test-only
set_test_tcp_connect_timeouthook (matching the TLS pattern), then use TEST-NET-1 with a 200ms override. The override hook is ~10 lines.Defer-or-do is the author's call — the API is small and well-trodden, so the regression risk is genuinely lower than the HTTP/TLS sides. Just flagging that the third leg of the deadline tripod has no anchor.
Smaller observations
build_tls's success-path clear islet _ = tcp.set_read_timeout(None). If the clear fails on a successful handshake, the returned TLS stream carries a 10s handshake timeout into the application IO phase. The HTTP layer overwrites it viarun_once, so users of the HTTP client are safe. But for directnet.tls.client+ rawnet.tcp.readcallers, a clear-failure leaves the read bounded by 10s. Edge case (setsockopt rarely fails on a healthy fd), but the silent-failure choice is worth a one-liner: "ignoring the clear-failure path because a returned handshake error surfaces the underlying state more clearly than a forced cleanup error" or whatever the author's actual reasoning is.http_request_timeout()andtls_handshake_timeout()lock a mutex on every call in#[cfg(test)]builds. The lock is uncontended in production-shape tests, but if the deadline pass ever lands in a hot inner loop (it doesn't today), the test-only lock would be a noticeable cost vs. the LazyLock cache. Not a problem at current usage; flagging only.HttpStreamtrait getsset_read_timeout/set_write_timeoutwithOption<Duration>.Noneclears;Some(dur)sets. Consistent with std and may's APIs. Trait surface is now three methods (raw_fd + 2 timeouts) — still narrow. Good.The silent-server background threads sleep for 60s before dropping the stream. Test passes in <2s, but the thread sits for 58+s after the test returns. cargo-test waits for the test fn, not the spawned thread; the process exits when all tests finish and the kernel reaps the thread. So no functional issue, just a paranoid-aesthetic observation: a
Drop-on-channel-close pattern would be tidier, but the current shape is fine for a test that runs once per CI invocation.SEQ_*_TIMEOUT_MSenv vars treat 0 as fallback-to-default. Both the doc and thefilter(|n| *n > 0)capture this. Reasonable —0is ambiguous between "as fast as possible" and "no timeout," and falling back to the default is the safer default. Worth noting that this means there's no env-var-only way to disable the timeout (which is itself defensible — disabling would re-introduce the hazard the PR closes).Verdict
Land after:
#[serial_test::serial(http_global_state)]tohttp_request_timeout_fires_on_silent_server(and ideally to the other HTTP integration tests in the same file, so they form one serialised group).Recommended (optional for this PR):
set_test_tcp_connect_timeouthook + TEST-NET-1.build_tls'slet _ = tcp.set_read_timeout(None)swallowing the clear failure on the happy path.The architecture is the right shape and the code is clean. The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog. Good closeout of issue #484.
Second pass — review-driven fixes look right
Walked the delta from
8f58149..05ba59c. All four items landed cleanly.Serial-tag coverage across the HTTP test suite
The tagging matrix is consistent: every test that calls
perform_request*and reachesrun_once(which readsHTTP_REQUEST_TIMEOUT_OVERRIDE) now carries#[serial_test::serial(http_global_state)]. I walked the call paths to confirm the carve-outs are right:dns_global_stateonly. Path:perform_request → validate_url → resolve_to_ipsreturns[224.0.0.1]→ SSRF passes →dial_fresh → connect_to_addrs(224.0.0.1, 80)→ kernel returns ENETUNREACH immediately (multicast).dial_freshreturns Err;run_onceis never reached. SoHTTP_REQUEST_TIMEOUT_OVERRIDEis never read. ✓tls_global_stateonly. Path: dial_fresh → connect succeeds on 127.0.0.1 →upgrade_tcp_in_place → build_tls → complete_iotimes out at the handshake bound → dial_fresh returns Err;run_oncenever reached. ✓tls_global_stateANDhttp_global_state. It needs TLS config installed AND it does reachrun_oncefor the actual HTTPS round-trip. The comment captures the reasoning ("this test reachesrun_onceand so readsHTTP_REQUEST_TIMEOUT_OVERRIDE").The stacked-tag pattern is a less-common
serial_testshape — worth a one-line acknowledgement thatserial_test::serialcomposes (a test in two groups is serialised against any test in either group). Cargo docs confirm this composes correctly; the runtime behaviour matches the comment's promise.The header comment on
http_get_end_to_end_against_local_server(the first tagged test) explains the rationale for the cohort tag in one place. Good — a future test author adding an HTTP test now has a discoverable note explaining why to follow the pattern.Slow-trickle doc framing
The rewritten line is exactly the framing I asked for. Explicit per-IO vs total, the
MAX_BODY_SIZE × timeoutresidual-exposure bound is quantified, and "a total-time deadline is a planned follow-up" sets the expectation that this isn't the final word. The "today the per-IO bound covers the 'goes fully silent' hazard, not the slow-trickle-but-progressing one" sentence is honest about what's covered and what isn't. Closes the over-claim cleanly.TCP connect timeout test + override
Override hook mirrors the TLS/HTTP shape (
Mutex<Option<Duration>>+set_test_tcp_connect_timeout+tcp_connect_timeout()helper that prefers override over LazyLock). The internalconnect_to_addrscall switched from*TCP_CONNECT_TIMEOUTtotcp_connect_timeout()so the test override is plumbed through. Production path is unchanged.The test uses 192.0.2.1 (TEST-NET-1) with a 200ms override. The honest no-default-route caveat in the comment is the right shape — the test still passes on hosts where the IP returns ENETUNREACH instantly, but on hosts with the slow-SYN shape (typical CI containers/VMs), a regression dropping back to plain
connectwould balloon to 60s+ and trigger the< 2sassertion. Pin is environment-dependent but specifically designed to catch the most likely regression on the most likely CI shape.One minor observation, not for fixing: the test is tagged
http_global_stateonly. I traced whether anything else cares aboutTCP_CONNECT_TIMEOUT_OVERRIDE:http_global_stateand so are serialised against this one.So the existing tag is sufficient — no other concurrent test is sensitive to the override. A
tcp_global_stategroup would be more semantically pure, but adds a group with no current second member. Defer-or-do is the author's call.TcpConnectTimeoutGuardis the right RAII guard — clears on drop including panic. Consistent with the HTTP and TLS guards.build_tls clear-failure comment
The new block is the right rationale: setsockopt on a healthy just-handshook fd ~can't fail; the only realistic failure mode is concurrent fd-close, where the returned stream is already dead and the next read/write will surface it; propagating the clear-failure would mask the underlying state with a synthetic error; HTTP-client callers re-set per request in
run_onceanyway, so a stale handshake deadline can't leak into app IO via the HTTP path.Captures everything a future reader would want to know in five lines. Good.
Declined items
All three correctly judged non-issues:
#[cfg(test)]builds, not in production.No issue.
Verdict
LGTM. Ship it.
The deadline pass closes the largest remaining v1 hazard from the PR1–PR5 review backlog, the three-layer architecture is coherent and the per-op state hygiene is right, every regression shape now has a test anchor, and the doc is honest about what's covered and what's deferred to a future total-time-deadline pass.