● PR1 summary — DNS layer #477
No reviewers
Labels
No labels
bug
dependencies
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
refactor
rust
technical-debt
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
navicore/patch-seq!477
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "net-pr1"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
#475
Added:
Arc<Mutex> for MPSC, may::sync::mpsc reply channel so the requesting strand yields cooperatively, TTL
cache (60s, 256 entries) collapsing fanout. SEQ_DNS_WORKERS env var (default 8, cap 64).
via strand.spawn, empty-string fast-fail.
Touched:
declarations + CATEGORIES/slices entries.
just ci is green: fmt, clippy -D warnings, all unit tests, integration tests, seqc lint (0 errors/warnings),
no_dead_code check.
PR2 (net.tcp.connect + the WouldBlock + yield_now() cleanup) is the natural next step. Ready when you want it.
Review — PR1 DNS layer
Reviewed the diff end-to-end (
crates/runtime/src/dns.rs, the builtins/codegen wiring, the doc additions, and the integration tests). Overall: the design is sound and the FFI shape lines up with the rest of thenet.*surface. A few correctness/robustness items worth thinking about before PR2 lands on top.Design — looks right
std::mpscfor the OS-side job queue +may::sync::mpscfor the per-call reply is the right split. Workers stay ignorant ofmay; the strand parks cooperatively onreply_rx.recv(). This is the cleanest version of "blocking syscall offloaded to a thread pool, coroutine awaits result" that I've seen in this tree.Arc<Mutex<Receiver>>for the worker pool is correct (only one worker insiderecv()at a time; the others queue on the mutex, lock is released the instant a job is taken). Worth keeping the comment atdns.rs:112-116— it answers the obvious "isn't this serialising the pool?" question./etc/hosts, IP literal). Good.Concerns / things to revisit
1. No single-flight on cache miss. The PR description says the cache "collapses fanout", but it only collapses sequential fanout. N concurrent first-resolves of the same uncached host enqueue N worker jobs — each does its own
getaddrinfo, each writes the cache on return. Not a correctness bug, just wasted work under bursty load (e.g., a connection-pool warm-up that resolves the same host from many strands at once). The classic fix is an in-flight maphost -> Vec<reply_tx>so duplicates attach to the existing job. Reasonable to defer, but worth flagging in the doc or as a follow-up issue rather than implying current behaviour already covers it.2. Worker-spawn failure is fatal-forever.
thread::Builder::…spawn(…).expect("spawn DNS worker thread")inside theLazyLockinit means: if any of the 8 (or N) worker spawns fails, theLazyLockpanics, the static is left in poisoned state, and every subsequentnet.dns.resolvepanics for the life of the process. In a resource-starved sandbox/container this turns a soft failure into a hard one. Options:(empty-list, false)from the FFI fast path.3. Re-export duplication.
crates/runtime/src/dns.rs:227haspub use patch_seq_dns_resolve as dns_resolve;andcrates/runtime/src/lib.rs:268haspub use dns::patch_seq_dns_resolve as dns_resolve;. Both compile, but the in-module re-export is redundant —lib.rsis the canonical re-export point for the othernet.*modules (tcp,udponly re-export fromlib.rs). Worth dropping for consistency.4. Cache
getrequires&mut selfbecause of the expiry-eviction side-effect, so every cache check takes an exclusiveMutexlock. Fine at this scale (256 entries, ~no contention) but it does mean the cache lock becomes a serialisation point if the worker pool grows. ARwLock+ separate scrub pass would defer that worry; not necessary now.5.
hostnameis cloned twice on the slow path. Once into theJob(dns.rs:184) and once into the cache (dns.rs:126). The worker could movejob.hostnameinto the cache afterresolve_blockingsince it doesn't need the string afterward. Micro-optimisation, not blocking.6.
SEQ_DNS_WORKERS=0silently uses the default. The.filter(|n| *n > 0)swallows a literal0. If anyone sets that expecting "disable the pool", they'll be surprised. Either document explicitly or reject0loudly. (Disabling makes no sense given the architecture — the syscall has to run somewhere — so documenting is probably enough.)7.
test-net-dns-cache-second-lookupdoesn't actually test the cache path. It validates that two calls return the same first IP, which is true regardless of whether the second hit the cache or did a freshgetaddrinfo. To genuinely assert the cache path you'd need a counter (e.g.,SEQ_DNS_DEBUG_STATS=1exposing hit/miss) or a slow hostname whose second-call timing is observably faster. v1 is fine, but worth a small TODO so we don't believe we have coverage we don't.8.
test-net-dns-concurrent-fanoutspawns strands but never joins. Whether the test asserts anything depends on the integration harness waiting for outstanding strands at exit. If it doesn't, this test passes even if all four spawned bodies fail (theiros.exit 1may run after the test runner has already moved on). Worth confirming the harness behaviour, or adding an explicit join/barrier.9. Lock-poisoning on
unwrap.CACHE.lock().unwrap()andrx.lock().unwrap()will panic the strand/worker if poisoned. Realistic only if a worker panics insidegetaddrinfo(very unlikely) or insideVec::contains(impossible). Mentioning so it's a deliberate choice rather than an oversight.10. Observability. No counters for cache hit rate, queue depth, or worker saturation. PR2 (
tcp.connect) is going to want these for the same reasons; might be worth adding now (adns::stats()returning a snapshot, or env-gated debug counters) so PR2 can mirror the shape.Nits
dns.rs:103—format!("seq-dns-{}", i)could beformat!("seq-dns-{i}")to match the inline-format style used elsewhere in the runtime (clippy will eventually pick this up underuninlined_format_args).dns.rs:208,:219—global_string("List".to_string())is allocated per call. The "List" tag is constant; consider aLazyLock<SeqString>for the variant tag, mirroring whatever the rest of the variant-emitting builtins do. (If they all allocate per-call too, leave it — consistency wins.)STDLIB_REFERENCE.mdusesVfor the list type which matches existinglist.*entries — good. But the worker count line says "default 8, max 64" while the env-var section in the module doc says "default 8, cap 64". Minor wording drift.Verdict
I'd land this with at least #2 (worker-spawn fatal-forever) addressed — the rest can be folded into PR2 or a follow-up. The single-flight gap (#1) is the only one with user-visible semantics worth being honest about in the docs even if not fixed now.