Networking: deadline pass across TCP connect, TLS handshake, HTTP request #484

Closed
opened 2026-05-13 17:04:01 +00:00 by navicore · 1 comment
Owner

Networking — deadline pass across TCP / TLS / HTTP

Largest single follow-up from the PR1–PR5 networking arc. Three IO layers each have an unbounded-park hazard today; they want one coherent deadline mechanism rather than three independent ones.

The three hazards

  • Connect timeout (deferred from PR2 #478)
    may::net::TcpStream::connect parks the strand for the full OS SYN timeout (~60–130s on Linux) against a silent peer. No caller-side bound.

  • TLS handshake timeout (deferred from PR3 #479)
    ClientConnection::complete_io drives reads/writes until handshake succeeds, fails, or the peer goes mute forever. Stacks on top of connect timeout — a partly-broken peer can park a strand for SYN_timeout + handshake_indefinite.

    PR3 reviewer:

    "Combined with PR2's missing connect timeout, a strand can park for OS-SYN-timeout + indefinite-handshake-time on a partly broken peer. Same shape as the PR2 follow-up — worth grouping into a 'deadline' pass once."

  • HTTP per-request timeout / EOF-framed body hang (deferred from PR4 #480)
    No deadline at all. The worst-case shape is a response with neither Content-Length nor chunked encoding: the client reads until EOF, which an attacker-controlled server can stretch indefinitely. Currently documented in STDLIB_REFERENCE.md v1 limitations:

    "The worst-case shape under this gap is a response with neither Content-Length nor Transfer-Encoding: chunked (EOF-framed body): the client reads until EOF, which an attacker-controlled server can stretch indefinitely. Content-Length and chunked responses are bounded by their own framing and by MAX_BODY_SIZE (10 MB)."

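To make the EOF-framed hazard concrete, here is a minimal sketch of a body read bounded by an absolute deadline. It uses the blocking `std::net::TcpStream` purely for illustration; the real fix would need the may-aware equivalent so the strand parks cooperatively. The function name `read_body_until_eof` is hypothetical, and applying the 10 MB `MAX_BODY_SIZE` cap to EOF-framed bodies is an assumption of this sketch.

```rust
use std::io::{self, Read};
use std::net::TcpStream;
use std::time::Instant;

/// Assumed cap for this sketch, mirroring the 10 MB MAX_BODY_SIZE quoted above.
const MAX_BODY_SIZE: usize = 10 * 1024 * 1024;

/// Hypothetical helper: read an EOF-framed body, giving up once `deadline`
/// passes or the size cap is exceeded.
fn read_body_until_eof(stream: &mut TcpStream, deadline: Instant) -> io::Result<Vec<u8>> {
    let mut body = Vec::new();
    let mut buf = [0u8; 8192];
    loop {
        // Budget left for this read; zero means the deadline has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(io::Error::new(io::ErrorKind::TimedOut, "body read deadline exceeded"));
        }
        // set_read_timeout rejects a zero Duration, which the check above rules out.
        stream.set_read_timeout(Some(remaining))?;
        match stream.read(&mut buf) {
            // EOF: the peer closed the connection, so the body is complete.
            Ok(0) => return Ok(body),
            Ok(n) => {
                if body.len() + n > MAX_BODY_SIZE {
                    return Err(io::Error::new(io::ErrorKind::InvalidData, "body exceeds MAX_BODY_SIZE"));
                }
                body.extend_from_slice(&buf[..n]);
            }
            // A timed-out read surfaces as WouldBlock on Unix and TimedOut on Windows.
            Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
                return Err(io::Error::new(io::ErrorKind::TimedOut, "body read deadline exceeded"));
            }
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Because the read timeout is recomputed from the same absolute instant on every iteration, a server that trickles one byte at a time cannot extend the total wait beyond the deadline.
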
Why one design, not three

The hazards share machinery: each needs a deadline plumbed through cooperative IO that the may scheduler can use to wake the strand on expiry. The reviewers explicitly grouped them across PR2/3/4:

"A deadline pass across the networking stack is a planned follow-up (alongside the connect-timeout gap inherited from PR2 and the handshake-timeout gap from PR3)."
— PR4 STDLIB_REFERENCE doc

Open design questions (resolve via /design before any code)

  • API shape: per-call argument? Per-strand deadline that all IO inside the strand inherits? Both?
  • May-aware deadline primitive: do we need a new crate::time::deadline helper, or compose existing strand-cancel mechanisms (strand.weave-cancel and friends)?
  • Granularity: connect + handshake + request as one bound, or distinct bounds per phase?
  • Behaviour on expiry: drop the strand mid-IO and return an error map, or surface a typed timeout error distinct from connection failures? How does this interact with the HTTP idempotent-retry path?
  • TLS specifics: rustls's complete_io may need a wrapping loop that checks deadline between rounds, since rustls itself has no deadline parameter.
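
One possible shape for the answers above, offered as a sketch for the /design discussion rather than a decision: a single absolute deadline shared by all phases, a typed timeout error kept distinct from connection failures, and a wrapping loop that checks the deadline between rounds. Every name here (`Deadline`, `Phase`, `NetError`, `drive_with_deadline`) is hypothetical. For TLS, the `round` closure would wrap one `complete_io` call; that part is left out so the sketch stays std-only.

```rust
use std::io;
use std::time::{Duration, Instant};

/// Hypothetical phase tag so a timeout can report where the budget ran out.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Connect,
    TlsHandshake,
    HttpRequest,
}

/// Hypothetical typed error: timeouts stay distinct from connection failures.
#[derive(Debug)]
enum NetError {
    Timeout(Phase),
    Io(io::Error),
}

/// Hypothetical deadline: one absolute instant shared by every phase.
#[derive(Debug, Clone, Copy)]
struct Deadline(Instant);

impl Deadline {
    /// Build a deadline `budget` from now.
    fn after(budget: Duration) -> Self {
        Deadline(Instant::now() + budget)
    }

    /// Time left, or a typed timeout for `phase` once the instant has passed.
    fn remaining(&self, phase: Phase) -> Result<Duration, NetError> {
        let left = self.0.saturating_duration_since(Instant::now());
        if left.is_zero() {
            Err(NetError::Timeout(phase))
        } else {
            Ok(left)
        }
    }
}

/// Drive `round` until it reports completion (`Ok(true)`), checking the
/// deadline before each round. `round` receives the remaining budget so it
/// can bound its own IO; the between-rounds check alone cannot interrupt a
/// round that blocks forever.
fn drive_with_deadline<F>(deadline: Deadline, phase: Phase, mut round: F) -> Result<(), NetError>
where
    F: FnMut(Duration) -> io::Result<bool>,
{
    loop {
        let left = deadline.remaining(phase)?;
        if round(left).map_err(NetError::Io)? {
            return Ok(());
        }
    }
}
```

Keeping `Timeout` separate from `Io` would let an idempotent-retry path decide explicitly whether an expired deadline is retryable, instead of inferring it from a generic connection error.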

Rough scope

Multi-PR.

  1. /design cycle to settle the API + the may-aware deadline primitive.
  2. First PR: deadline primitive + net.tcp.connect timeout (smallest consumer).
  3. Subsequent PRs: migrate TLS handshake (net.tls.client), then HTTP request (net.http.*) to the same primitive.
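
As an illustration of the smallest consumer in step 2, a connect bounded by an absolute deadline might look roughly like this. The sketch uses `std::net::TcpStream::connect_timeout`, which blocks the OS thread; it is not the may API, and the actual `net.tcp.connect` change would need a may-aware equivalent. The name `connect_by` is hypothetical.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Instant;

/// Hypothetical helper: connect to `addr`, giving up at `deadline` instead of
/// waiting out the OS SYN timeout.
fn connect_by(addr: &SocketAddr, deadline: Instant) -> io::Result<TcpStream> {
    let remaining = deadline.saturating_duration_since(Instant::now());
    if remaining.is_zero() {
        // connect_timeout rejects a zero Duration, so report expiry here.
        return Err(io::Error::new(io::ErrorKind::TimedOut, "connect deadline exceeded"));
    }
    TcpStream::connect_timeout(addr, remaining)
}
```

Taking an absolute `Instant` rather than a per-call `Duration` means the same value can be handed on to the handshake and request phases, so the total wait stays bounded by one budget.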

Probably 1–2 weeks of focused work plus design discussion.

Author
Owner
https://git.navicore.tech/navicore/patch-seq/pulls/488