binary in string #436
No reviewers
Labels
No labels
bug
dependencies
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
refactor
rust
technical-debt
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
navicore/patch-seq!436
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "encoder"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
All green. CP5 done — UDP and TCP read/receive_from are now byte-clean.
Stopping here as a logical checkpoint before CP6 (OSC encoder).
To answer your earlier worry directly:
The bonus capability that came along for free in this checkpoint:
The replaced test_udp_receive_from_rejects_non_utf8 test →
test_udp_receive_from_preserves_non_utf8_bytes now asserts the opposite
contract: a wire datagram with [0xFF, 0xFE, 0x00, b'x', 0xC0, 0x42] must
come back byte-for-byte identical. That's the load-bearing assertion
that protects against future regressions.
Code Review — PR #436: Binary-clean SeqString (CP5)
Overview
This PR removes the UTF-8 invariant from
SeqString, making UDP, TCP, and file I/O byte-clean so binary protocols (OSC, DNS, NTP, MessagePack) can round-trip through the runtime without being dropped. The core design is sound and the checkpoint is well-structured. A few concrete correctness issues are worth fixing before merging.What's working well
Dropfix (seqstring.rs): Switching fromString::from_raw_partstoVec::<u8>::from_raw_partsis the correct fix — the old code was UB on non-UTF-8 buffers.PartialEq: Changed to compareas_bytes()— correct.Clone: Changed toglobal_bytes(self.as_bytes().to_vec())— correct, no UTF-8 round-trip.seqstring.rs: Excellent regression coverage for the core type.REGISTRY_LOCKin UDP tests: Good fix for the flaky double-close races.global_string/arena_stringkept as wrappers: Clean backward-compatible API.Correctness Issues
1.
string.byte-lengthreturns 0 for non-UTF-8 strings (string_ops/basic.rs)as_str_or_empty()returns""for non-UTF-8 bytes, so byte-length of a binary buffer silently returns 0. This is the primary introspection primitive for binary data — it needs to useas_bytes().2.
string.empty?misreports non-empty binary strings as empty (string_ops/basic.rs)Any non-UTF-8, non-empty buffer will be reported as empty.
3.
string.equalincorrectly equates all non-UTF-8 strings (string_ops/basic.rs)Two different non-UTF-8 buffers both collapse to
"", sostring.equalreturnstruefor any pair of binary strings. ThePartialEqimpl was fixed correctly; the FFI op was not.4.
string.concatsilently drops binary content (string_ops/basic.rs)The module-level doc in
seqstring.rsexplicitly listsconcatas a byte-clean operation. The implementation doesn't match the contract.5.
io.rswrite/write-linedrop non-UTF-8 bytesWriting a binary response body to stdout silently truncates to empty for non-UTF-8 content. Since
libc::writealready takes a raw pointer and length, there's no reason to go through&str.6. Crypto ops use
as_str_or_empty()for binary plaintext (crypto/aes.rs,crypto/ed25519.rs,crypto/pbkdf.rs)These call
aes_gcm_encrypt(plaintext.as_str_or_empty(), ...)etc. If the plaintext is binary (e.g., a raw key or pre-processed blob), it silently encrypts""instead. The underlying crypto functions take&str— consider whether their signatures should be updated to&[u8], or document clearly that these ops only accept UTF-8 plaintext.Minor Observations
string.contains/string.starts-with/string.find(string_ops/access.rs): All useas_str_or_empty(). For binary strings, these return degenerate results (""contains""→ always true). If binary search is out of scope for now, a comment or failure-tuple return for non-UTF-8 input would be preferable to silent wrong answers.serialize.rsTypedValue: ConvertsSeqString→ ownedStringviaas_str_or_empty(). Serialization of binary values silently loses content. Worth a TODO or explicit note that serialization is text-only.encoding.rsdecode paths:base64_decode/hex_decodestill gate the result onString::from_utf8(bytes), so decoding binary-encoded binary data will fail (decoded bytes returned as empty withfalse). This may be intentional for CP5, but it's the natural next step — the encode side is now byte-clean, the decode side isn't.Summary
The core
SeqStringredesign is correct and the I/O layer (TCP/UDP/file) is properly byte-clean. The main issue is that many string-level operations (byte-length,empty?,equal,concat,write) were updated to useas_str_or_empty()when they should useas_bytes()— causing silent data loss or wrong answers for binary strings. Items 1–5 above are the highest-priority fixes before CP6 (the OSC encoder) relies on these ops for binary payloads.Code Review - CP5 SeqString Byte-Cleanliness (placeholder)
Updating placeholder with full review...
Code Review - CP5: SeqString Byte-Cleanliness
Overview: This PR removes the UTF-8 invariant from SeqString, making Seq string type a transparent byte buffer. The change is consistent, well-motivated, and clearly documented. Binary protocols (UDP/OSC, TCP binary frames, raw crypto payloads, base64/hex round-trips) that previously silently failed or produced undefined behavior now work correctly. The scope is large (~1200 add / 540 del) but mechanically uniform.
What the PR does well
Core design is sound. The four-way split -- as_bytes() (always succeeds), as_str() returning Option (validates UTF-8), as_str_lossy() (human display), as_str_or_empty() (text-ops with existing empty-input failure path) -- has clear call-site guidance. The invariant table in the module doc matches the design doc.
Unsafe memory management is correct. Replacing String::from_raw_parts with Vec-u8::from_raw_parts in Drop is the right fix. Using String::from_raw_parts on non-UTF-8 bytes was undefined behavior. Capacity bookkeeping is unchanged.
PartialEq now uses bytes. Two non-UTF-8 buffers with different bytes were previously both None under the old as_str() and would have compared equal. Fixed.
Test sentinel coverage is thorough. Every mutated subsystem gets a sentinel-byte regression test. The REGISTRY_LOCK in UDP tests correctly addresses id-reuse flakiness.
Concerns
1. as_str_or_empty() silently degrades text operations (medium): The design doc says text-level ops should fail loudly with the (value Bool) failure tuple on non-UTF-8 input, but those call sites use as_str_or_empty() which silently returns empty string. For example, string.length of a 7-byte non-UTF-8 buffer silently returns 0, indistinguishable from empty string. The design doc puts these in the Text-required bucket and says to call as_str() and return a failure tuple on None. The code uses as_str_or_empty() instead. Code and doc contradict each other. Suggestion: Either update the doc to say silent-empty is the chosen strategy, or add TODO CP6 comments at representative call sites.
2. symbol_equal fallback uses as_str_or_empty() instead of as_bytes() (minor): In the conversion.rs fallback path, s1.as_str_or_empty() == s2.as_str_or_empty() -- two symbols with different non-UTF-8 bytes both collapse to empty string and incorrectly compare equal. The hot path (pointer equality) is fine, but this fallback can silently lie. Suggestion: Change to s1.as_bytes() == s2.as_bytes().
3. string.split with empty delimiter produces extra elements (minor): The hand-rolled byte-split pushes empty leading and trailing elements around each individual byte. This differs from Rust str::split behavior and may differ from Seq documented contract. No test covers this case. Suggestion: Add a test splitting abc on empty string and document whether the extra-element behaviour is intentional.
4. serialize.rs / test.rs silently drop non-UTF-8 strings (minor): as_str_or_empty() in serialization converts a binary SeqString to empty string. The design doc notes SON does not round-trip binary, so this may be intentional -- but a comment at those call sites would make that explicit.
5. HTTP body bytes silently discarded (low, likely by design): http_client.rs calls as_str_or_empty() for the request body. A binary POST body silently becomes empty. If HTTP binary bodies are out of scope, a comment would prevent future confusion.
6. path_str() helper inconsistently applied (low): file.rs introduces path_str() with good documentation. os.rs calls as_str_or_empty() directly without explanation. Consider reusing the helper or the comment pattern.
Safety audit: Old Drop with String::from_raw_parts on non-UTF-8 bytes was UB -- fixed correctly with Vec-u8::from_raw_parts. The as_bytes() safety comment correctly ties the slice lifetime to self. unsafe impl Send/Sync unchanged and valid. Arena strings use capacity=0 to signal do-not-free; invariant preserved.
Summary: The architectural decision is correct and the implementation is largely faithful to it. Three actionable items before merge: (1) Fix symbol_equal fallback to s1.as_bytes() == s2.as_bytes() -- one-liner. (2) Resolve design-doc / code contradiction on text-ops: update the doc or add TODOs. (3) Add empty-delimiter split test documenting the extra-element behaviour. Everything else is non-blocking and can be followed up in CP6.
Review by Claude Sonnet 4.6
Code Review
Overview
This PR (CP5) completes the byte-cleanliness initiative for the Seq runtime. It drops the UTF-8 invariant from
SeqString, migrates every runtime module to useas_bytes()for byte-clean operations andas_str()/as_str_or_empty()only at text-processing boundaries, and covers the work with a comprehensive regression test suite. The design is principled and the execution is thorough.Strengths
as_bytes,as_str,as_str_lossy,as_str_or_empty— maps cleanly to the different caller categories.DropfromString::from_raw_partstoVec::<u8>::from_raw_partsremoves a now-invalid UTF-8 requirement from memory deallocation without changing the allocation size, which is correct.as_strusedfrom_utf8_unchecked, which would have been UB on any non-UTF-8 buffer constructible through other paths. Replacing it with a checkedfrom_utf8(...).ok()is the right fix.seqstring,encoding,string_ops,crypto, andhttp_clienttest modules makes regressions easy to catch.Issues
Moderate
string.joinstill converts viaas_str_or_empty— silently loses binary parts.In
access.rs:302, thejoinmap closure converts eachValue::Stringthroughas_str_or_empty(), which silently collapses non-UTF-8 parts to"". This is inconsistent:splitis now byte-clean butjoinis not. If you split a binary buffer on a delimiter then reassemble withjoinon the same delimiter, the parts are silently corrupted. Eitherjoinshould be byte-clean (byte-concatenate with separator bytes), or the limitation should be clearly documented.Regex on non-UTF-8 input silently degenerates via
as_str_or_empty.regex.rsroutes both the pattern and the text throughas_str_or_empty(). A non-UTF-8 input degenerates to matching""against"", returningtruefor any pattern that matches the empty string — or returning no matches when the actual bytes contain a match. The fix is either to validate UTF-8 upfront and return the failure tuple, or to use theregexcrate'sbytessub-module. Silent degeneration is worse than an explicit failure.test.assert-eq-stris broken for binary strings.test.rs:412doesas_str_or_empty()on both sides. Two different non-UTF-8 strings both become""and compare equal, making binary test assertions vacuous. Programs usingtest.assert-eq-strwith binary content will silently pass. Consider adding atest.assert-eq-bytesop, or routing throughas_bytes()directly.string.splitempty-delimiter shape should be documented at the user level.The implementation mirrors Rust's
&str::split("")shape (empty leading and trailing pieces), which is non-obvious to Seq programmers. The test pins the behaviour, but the docstring inaccess.rsand the language guide should say explicitly what("abc" "" string.split)returns.Minor
path_stris duplicated betweenfile.rsandos.rs.Both files define an identical
fn path_str(s: &SeqString) -> &strhelper with copy-pasted comments. This could live in a shared utility or as aSeqStringmethod.read_response_bytesbody-size check is off-by-one.http_client.rsreads up toMAX_BODY_SIZE + 1bytes, then checksif body.len() > MAX_BODY_SIZE. A response of exactlyMAX_BODY_SIZEbytes therefore reads one extra byte beyond the declared limit. Worth cleaning up while the area is touched.as_str_or_emptyis easy to misuse.The helper intentionally returns
""on non-UTF-8 input and is meant only for operations that already have a degenerate-input path. A doc comment warning future callers of this contract would help prevent accidental misuse.Test Coverage
Coverage is strong. The sentinel-based unit tests in
seqstring,crypto,encoding,string_ops, andhttp_clientform a coherent regression guard. The UDP loopback test polarity flip (from "rejects" to "preserves") is exactly right. Gaps worth closing before the next checkpoint:join+splitround-trip with binary content (would expose thejoinissue above).regex.*on non-UTF-8 input to pin the degeneration behaviour.test.assert-eq-strwith two distinct non-UTF-8 strings (currently passes vacuously).Security
No new security issues introduced.
constant_time_eqand HMAC correctly switch toas_bytes(), maintaining their constant-time properties. The SSRF check still validates the URL viaas_str_or_empty(), so a non-UTF-8 URL silently becomes""and fails validation — fine in practice, but worth a comment.Summary
The core change is correct and well-tested. The two most actionable items before merge:
string.jointo be byte-clean, or explicitly document the asymmetry withsplit.test.assert-eq-strto useas_bytes— vacuous passing assertions undermine the test framework.The regex silent-degeneration is less urgent but should be tracked as a follow-up.