● PR1 summary — DNS layer #477

Merged
navicore merged 2 commits from net-pr1 into main 2026-05-12 14:21:29 +00:00
Owner

#475

Added:

  • crates/runtime/src/dns.rs (~200 lines): worker-thread pool calling getaddrinfo, std::mpsc job queue with
    Arc<Mutex> for MPSC, may::sync::mpsc reply channel so the requesting strand yields cooperatively, TTL
    cache (60s, 256 entries) collapsing fanout. SEQ_DNS_WORKERS env var (default 8, cap 64).
  • crates/compiler/src/builtins/dns.rs: net.dns.resolve ( a String -- a V Bool ).
  • crates/compiler/src/codegen/runtime/dns.rs: extern decl + symbol mapping.
  • tests/integration/src/test-net-dns.seq: 5 tests covering localhost, IP literal, cache reuse, concurrent fanout
    via strand.spawn, empty-string fast-fail.

Touched:

  • crates/runtime/src/lib.rs, crates/compiler/src/builtins.rs, crates/compiler/src/codegen/runtime.rs: module
    declarations + CATEGORIES/slices entries.
  • crates/compiler/src/ast/program.rs: added "net.dns.resolve" to the validation list.
  • crates/compiler/src/error_flag_lint/state.rs: registered as fallible op so unchecked Bool returns get linted.
  • docs/STDLIB_REFERENCE.md: new "Networking — net.dns.*" section.

just ci is green: fmt, clippy -D warnings, all unit tests, integration tests, seqc lint (0 errors/warnings),
no_dead_code check.

PR2 (net.tcp.connect + the WouldBlock + yield_now() cleanup) is the natural next step. Ready when you want it.

https://git.navicore.tech/navicore/patch-seq/issues/475 Added: - crates/runtime/src/dns.rs (~200 lines): worker-thread pool calling getaddrinfo, std::mpsc job queue with Arc<Mutex<Receiver>> for MPSC, may::sync::mpsc reply channel so the requesting strand yields cooperatively, TTL cache (60s, 256 entries) collapsing fanout. SEQ_DNS_WORKERS env var (default 8, cap 64). - crates/compiler/src/builtins/dns.rs: net.dns.resolve ( a String -- a V Bool ). - crates/compiler/src/codegen/runtime/dns.rs: extern decl + symbol mapping. - tests/integration/src/test-net-dns.seq: 5 tests covering localhost, IP literal, cache reuse, concurrent fanout via strand.spawn, empty-string fast-fail. Touched: - crates/runtime/src/lib.rs, crates/compiler/src/builtins.rs, crates/compiler/src/codegen/runtime.rs: module declarations + CATEGORIES/slices entries. - crates/compiler/src/ast/program.rs: added "net.dns.resolve" to the validation list. - crates/compiler/src/error_flag_lint/state.rs: registered as fallible op so unchecked Bool returns get linted. - docs/STDLIB_REFERENCE.md: new "Networking — net.dns.*" section. just ci is green: fmt, clippy -D warnings, all unit tests, integration tests, seqc lint (0 errors/warnings), no_dead_code check. PR2 (net.tcp.connect + the WouldBlock + yield_now() cleanup) is the natural next step. Ready when you want it.
● PR1 summary — DNS layer
All checks were successful
CI - Linux / CI - Linux x86_64 (pull_request) Successful in 9m2s
789b7795d7
Added:

  - crates/runtime/src/dns.rs (~200 lines): worker-thread pool calling getaddrinfo, std::mpsc job queue with
  Arc<Mutex<Receiver>> for MPSC, may::sync::mpsc reply channel so the requesting strand yields cooperatively, TTL
  cache (60s, 256 entries) collapsing fanout. SEQ_DNS_WORKERS env var (default 8, cap 64).
  - crates/compiler/src/builtins/dns.rs: net.dns.resolve ( a String -- a V Bool ).
  - crates/compiler/src/codegen/runtime/dns.rs: extern decl + symbol mapping.
  - tests/integration/src/test-net-dns.seq: 5 tests covering localhost, IP literal, cache reuse, concurrent fanout
   via strand.spawn, empty-string fast-fail.

  Touched:

  - crates/runtime/src/lib.rs, crates/compiler/src/builtins.rs, crates/compiler/src/codegen/runtime.rs: module
  declarations + CATEGORIES/slices entries.
  - crates/compiler/src/ast/program.rs: added "net.dns.resolve" to the validation list.
  - crates/compiler/src/error_flag_lint/state.rs: registered as fallible op so unchecked Bool returns get linted.
  - docs/STDLIB_REFERENCE.md: new "Networking — net.dns.*" section.

  just ci is green: fmt, clippy -D warnings, all unit tests, integration tests, seqc lint (0 errors/warnings),
  no_dead_code check.

  PR2 (net.tcp.connect + the WouldBlock + yield_now() cleanup) is the natural next step. Ready when you want it.
Author
Owner

Review — PR1 DNS layer

Reviewed the diff end-to-end (crates/runtime/src/dns.rs, the builtins/codegen wiring, the doc additions, and the integration tests). Overall: the design is sound and the FFI shape lines up with the rest of the net.* surface. A few correctness/robustness items worth thinking about before PR2 lands on top.

Design — looks right

  • std::mpsc for the OS-side job queue + may::sync::mpsc for the per-call reply is the right split. Workers stay ignorant of may; the strand parks cooperatively on reply_rx.recv(). This is the cleanest version of "blocking syscall offloaded to a thread pool, coroutine awaits result" that I've seen in this tree.
  • Arc<Mutex<Receiver>> for the worker pool is correct (only one worker inside recv() at a time; the others queue on the mutex, lock is released the instant a job is taken). Worth keeping the comment at dns.rs:112-116 — it answers the obvious "isn't this serialising the pool?" question.
  • TTL cache shape (60 s, 256 entries, bounded eviction) is right-sized for v1. The eviction comment correctly acknowledges it's not LRU.
  • Tests stay offline (localhost via /etc/hosts, IP literal). Good.

Concerns / things to revisit

1. No single-flight on cache miss. The PR description says the cache "collapses fanout", but it only collapses sequential fanout. N concurrent first-resolves of the same uncached host enqueue N worker jobs — each does its own getaddrinfo, each writes the cache on return. Not a correctness bug, just wasted work under bursty load (e.g., a connection-pool warm-up that resolves the same host from many strands at once). The classic fix is an in-flight map host -> Vec<reply_tx> so duplicates attach to the existing job. Reasonable to defer, but worth flagging in the doc or as a follow-up issue rather than implying current behaviour already covers it.

2. Worker-spawn failure is fatal-forever. thread::Builder::…spawn(…).expect("spawn DNS worker thread") inside the LazyLock init means: if any of the 8 (or N) worker spawns fails, the LazyLock panics, the static is left in poisoned state, and every subsequent net.dns.resolve panics for the life of the process. In a resource-starved sandbox/container this turns a soft failure into a hard one. Options:

  • Collect successes, log/ignore failures, proceed with however many workers actually started (require >= 1).
  • If zero workers, leave the static unset and return (empty-list, false) from the FFI fast path.

3. Re-export duplication. crates/runtime/src/dns.rs:227 has pub use patch_seq_dns_resolve as dns_resolve; and crates/runtime/src/lib.rs:268 has pub use dns::patch_seq_dns_resolve as dns_resolve;. Both compile, but the in-module re-export is redundant — lib.rs is the canonical re-export point for the other net.* modules (tcp, udp only re-export from lib.rs). Worth dropping for consistency.

4. Cache get requires &mut self because of the expiry-eviction side-effect, so every cache check takes an exclusive Mutex lock. Fine at this scale (256 entries, ~no contention) but it does mean the cache lock becomes a serialisation point if the worker pool grows. A RwLock + separate scrub pass would defer that worry; not necessary now.

5. hostname is cloned twice on the slow path. Once into the Job (dns.rs:184) and once into the cache (dns.rs:126). The worker could move job.hostname into the cache after resolve_blocking since it doesn't need the string afterward. Micro-optimisation, not blocking.

6. SEQ_DNS_WORKERS=0 silently uses the default. The .filter(|n| *n > 0) swallows a literal 0. If anyone sets that expecting "disable the pool", they'll be surprised. Either document explicitly or reject 0 loudly. (Disabling makes no sense given the architecture — the syscall has to run somewhere — so documenting is probably enough.)

7. test-net-dns-cache-second-lookup doesn't actually test the cache path. It validates that two calls return the same first IP, which is true regardless of whether the second hit the cache or did a fresh getaddrinfo. To genuinely assert the cache path you'd need a counter (e.g., SEQ_DNS_DEBUG_STATS=1 exposing hit/miss) or a slow hostname whose second-call timing is observably faster. v1 is fine, but worth a small TODO so we don't believe we have coverage we don't.

8. test-net-dns-concurrent-fanout spawns strands but never joins. Whether the test asserts anything depends on the integration harness waiting for outstanding strands at exit. If it doesn't, this test passes even if all four spawned bodies fail (their os.exit 1 may run after the test runner has already moved on). Worth confirming the harness behaviour, or adding an explicit join/barrier.

9. Lock-poisoning on unwrap. CACHE.lock().unwrap() and rx.lock().unwrap() will panic the strand/worker if poisoned. Realistic only if a worker panics inside getaddrinfo (very unlikely) or inside Vec::contains (impossible). Mentioning so it's a deliberate choice rather than an oversight.

10. Observability. No counters for cache hit rate, queue depth, or worker saturation. PR2 (tcp.connect) is going to want these for the same reasons; might be worth adding now (a dns::stats() returning a snapshot, or env-gated debug counters) so PR2 can mirror the shape.

Nits

  • dns.rs:103format!("seq-dns-{}", i) could be format!("seq-dns-{i}") to match the inline-format style used elsewhere in the runtime (clippy will eventually pick this up under uninlined_format_args).
  • dns.rs:208, :219global_string("List".to_string()) is allocated per call. The "List" tag is constant; consider a LazyLock<SeqString> for the variant tag, mirroring whatever the rest of the variant-emitting builtins do. (If they all allocate per-call too, leave it — consistency wins.)
  • The doc table in STDLIB_REFERENCE.md uses V for the list type which matches existing list.* entries — good. But the worker count line says "default 8, max 64" while the env-var section in the module doc says "default 8, cap 64". Minor wording drift.

Verdict

I'd land this with at least #2 (worker-spawn fatal-forever) addressed — the rest can be folded into PR2 or a follow-up. The single-flight gap (#1) is the only one with user-visible semantics worth being honest about in the docs even if not fixed now.

## Review — PR1 DNS layer Reviewed the diff end-to-end (`crates/runtime/src/dns.rs`, the builtins/codegen wiring, the doc additions, and the integration tests). Overall: the design is sound and the FFI shape lines up with the rest of the `net.*` surface. A few correctness/robustness items worth thinking about before PR2 lands on top. ### Design — looks right - `std::mpsc` for the OS-side job queue + `may::sync::mpsc` for the per-call reply is the right split. Workers stay ignorant of `may`; the strand parks cooperatively on `reply_rx.recv()`. This is the cleanest version of "blocking syscall offloaded to a thread pool, coroutine awaits result" that I've seen in this tree. - `Arc<Mutex<Receiver>>` for the worker pool is correct (only one worker inside `recv()` at a time; the others queue on the mutex, lock is released the instant a job is taken). Worth keeping the comment at `dns.rs:112-116` — it answers the obvious "isn't this serialising the pool?" question. - TTL cache shape (60 s, 256 entries, bounded eviction) is right-sized for v1. The eviction comment correctly acknowledges it's not LRU. - Tests stay offline (localhost via `/etc/hosts`, IP literal). Good. ### Concerns / things to revisit **1. No single-flight on cache miss.** The PR description says the cache "collapses fanout", but it only collapses *sequential* fanout. N concurrent first-resolves of the same uncached host enqueue N worker jobs — each does its own `getaddrinfo`, each writes the cache on return. Not a correctness bug, just wasted work under bursty load (e.g., a connection-pool warm-up that resolves the same host from many strands at once). The classic fix is an in-flight map `host -> Vec<reply_tx>` so duplicates attach to the existing job. Reasonable to defer, but worth flagging in the doc or as a follow-up issue rather than implying current behaviour already covers it. **2. Worker-spawn failure is fatal-forever.** `thread::Builder::…spawn(…).expect("spawn DNS worker thread")` inside the `LazyLock` init means: if *any* of the 8 (or N) worker spawns fails, the `LazyLock` panics, the static is left in poisoned state, and every subsequent `net.dns.resolve` panics for the life of the process. In a resource-starved sandbox/container this turns a soft failure into a hard one. Options: - Collect successes, log/ignore failures, proceed with however many workers actually started (require >= 1). - If zero workers, leave the static unset and return `(empty-list, false)` from the FFI fast path. **3. Re-export duplication.** `crates/runtime/src/dns.rs:227` has `pub use patch_seq_dns_resolve as dns_resolve;` and `crates/runtime/src/lib.rs:268` has `pub use dns::patch_seq_dns_resolve as dns_resolve;`. Both compile, but the in-module re-export is redundant — `lib.rs` is the canonical re-export point for the other `net.*` modules (`tcp`, `udp` only re-export from `lib.rs`). Worth dropping for consistency. **4. Cache `get` requires `&mut self`** because of the expiry-eviction side-effect, so every cache check takes an exclusive `Mutex` lock. Fine at this scale (256 entries, ~no contention) but it does mean the cache lock becomes a serialisation point if the worker pool grows. A `RwLock` + separate scrub pass would defer that worry; not necessary now. **5. `hostname` is cloned twice on the slow path.** Once into the `Job` (`dns.rs:184`) and once into the cache (`dns.rs:126`). The worker could move `job.hostname` into the cache after `resolve_blocking` since it doesn't need the string afterward. Micro-optimisation, not blocking. **6. `SEQ_DNS_WORKERS=0` silently uses the default.** The `.filter(|n| *n > 0)` swallows a literal `0`. If anyone sets that expecting "disable the pool", they'll be surprised. Either document explicitly or reject `0` loudly. (Disabling makes no sense given the architecture — the syscall has to run *somewhere* — so documenting is probably enough.) **7. `test-net-dns-cache-second-lookup` doesn't actually test the cache path.** It validates that two calls return the same first IP, which is true regardless of whether the second hit the cache or did a fresh `getaddrinfo`. To genuinely assert the cache path you'd need a counter (e.g., `SEQ_DNS_DEBUG_STATS=1` exposing hit/miss) or a slow hostname whose second-call timing is observably faster. v1 is fine, but worth a small TODO so we don't believe we have coverage we don't. **8. `test-net-dns-concurrent-fanout` spawns strands but never joins.** Whether the test asserts anything depends on the integration harness waiting for outstanding strands at exit. If it doesn't, this test passes even if all four spawned bodies fail (their `os.exit 1` may run after the test runner has already moved on). Worth confirming the harness behaviour, or adding an explicit join/barrier. **9. Lock-poisoning on `unwrap`.** `CACHE.lock().unwrap()` and `rx.lock().unwrap()` will panic the strand/worker if poisoned. Realistic only if a worker panics inside `getaddrinfo` (very unlikely) or inside `Vec::contains` (impossible). Mentioning so it's a deliberate choice rather than an oversight. **10. Observability.** No counters for cache hit rate, queue depth, or worker saturation. PR2 (`tcp.connect`) is going to want these for the same reasons; might be worth adding now (a `dns::stats()` returning a snapshot, or env-gated debug counters) so PR2 can mirror the shape. ### Nits - `dns.rs:103` — `format!("seq-dns-{}", i)` could be `format!("seq-dns-{i}")` to match the inline-format style used elsewhere in the runtime (clippy will eventually pick this up under `uninlined_format_args`). - `dns.rs:208`, `:219` — `global_string("List".to_string())` is allocated per call. The "List" tag is constant; consider a `LazyLock<SeqString>` for the variant tag, mirroring whatever the rest of the variant-emitting builtins do. (If they all allocate per-call too, leave it — consistency wins.) - The doc table in `STDLIB_REFERENCE.md` uses `V` for the list type which matches existing `list.*` entries — good. But the worker count line says "default 8, max 64" while the env-var section in the module doc says "default 8, cap 64". Minor wording drift. ### Verdict I'd land this with at least #2 (worker-spawn fatal-forever) addressed — the rest can be folded into PR2 or a follow-up. The single-flight gap (#1) is the only one with user-visible semantics worth being honest about in the docs even if not fixed now.
● All review-driven fixes applied and CI green.
All checks were successful
CI - Linux / CI - Linux x86_64 (pull_request) Successful in 8m59s
60987e2019
Changes in this round:

  1. #2 worker-spawn fatal-forever — dns.rs:84-117: JOB_QUEUE is now LazyLock<Option<Sender<Job>>>. Failed thread
  spawns are counted-and-ignored; if zero workers start, the FFI fast-fails to (empty-list, false) instead of
  permanently poisoning the static. The doc comment explains the invariant.
  2. #3 redundant re-export — dropped pub use patch_seq_dns_resolve as dns_resolve; from dns.rs (no in-module
  consumer since there's no tests submodule; lib.rs is canonical).
  3. #8 fanout join barrier — extracted spawn-resolver-body helper, added a chan.make completion channel, four
  chan.receive test.assert joins, plus the trailing drop. The test now genuinely waits for spawned strands; any
  failure path actually fires.
  4. #1 single-flight limitation documented — new "Known limitation" sections in both dns.rs module doc and
  STDLIB_REFERENCE.md. Honest about what the cache does/doesn't deduplicate, follow-up flagged.
  5. #6 SEQ_DNS_WORKERS=0 behavior documented — both module doc and STDLIB_REFERENCE.md now spell out that 0 falls
   back to default.
  6. #7 cache test comment — added a note in test-net-dns-cache-second-lookup that it verifies equality, not
  cache-hit, and that genuine cache-path assertion needs a hit/miss counter (planned).
  7. Nits — format!("seq-dns-{i}") inline form (folded into #2); "max 64" / "cap 64" wording aligned to "max 64"
  in both places.

  Side effect of #8: also closed two latent lint warnings (unchecked-chan-send, unchecked-chan-receive) by
  asserting the success bools — test.assert semantics naturally fit since we control both ends of the channel.
navicore deleted branch net-pr1 2026-05-12 14:21:29 +00:00
Sign in to join this conversation.
No description provided.