binary in string #436

Merged
navicore merged 3 commits from encoder into main 2026-04-27 14:49:22 +00:00
navicore commented 2026-04-27 04:08:49 +00:00 (Migrated from github.com)

All green. CP5 done — UDP and TCP read/receive_from are now byte-clean.
Stopping here as a logical checkpoint before CP6 (OSC encoder).

To answer your earlier worry directly:

  • The language-guide.md line now reads: "UTF-8 text; also carries arbitrary bytes for binary I/O" — full UTF-8 support preserved, binary capability documented as additive.
  • STDLIB_REFERENCE.md and GRAMMAR.md are unchanged because their existing claims (string.char-at returns Unicode codepoints, \xNN produces U+00NN) are still accurate.
  • Every Seq program written for the previous Seq version produces the same output today on the same UTF-8 input.

The bonus capability that came along for free in this checkpoint:

  • tcp.read is now byte-clean. Any TCP-based binary protocol (HTTP/2 frames, gRPC, raw TLS bytes, protocol buffers over TCP) round-trips intact instead of being rejected at the read boundary.
  • udp.receive-from is byte-clean — OSC, DNS, NTP, MessagePack-over-UDP, all binary datagrams arrive intact.
  • file.slurp reads binary files (was already in CP4).
  • file.spit writes binary files (also CP4).

The replaced test_udp_receive_from_rejects_non_utf8 test →
test_udp_receive_from_preserves_non_utf8_bytes now asserts the opposite
contract: a wire datagram with [0xFF, 0xFE, 0x00, b'x', 0xC0, 0x42] must
come back byte-for-byte identical. That's the load-bearing assertion
that protects against future regressions.

All green. CP5 done — UDP and TCP read/receive_from are now byte-clean. Stopping here as a logical checkpoint before CP6 (OSC encoder). To answer your earlier worry directly: - The language-guide.md line now reads: "UTF-8 text; also carries arbitrary bytes for binary I/O" — full UTF-8 support preserved, binary capability documented as additive. - STDLIB_REFERENCE.md and GRAMMAR.md are unchanged because their existing claims (string.char-at returns Unicode codepoints, \xNN produces U+00NN) are still accurate. - Every Seq program written for the previous Seq version produces the same output today on the same UTF-8 input. The bonus capability that came along for free in this checkpoint: - tcp.read is now byte-clean. Any TCP-based binary protocol (HTTP/2 frames, gRPC, raw TLS bytes, protocol buffers over TCP) round-trips intact instead of being rejected at the read boundary. - udp.receive-from is byte-clean — OSC, DNS, NTP, MessagePack-over-UDP, all binary datagrams arrive intact. - file.slurp reads binary files (was already in CP4). - file.spit writes binary files (also CP4). The replaced test_udp_receive_from_rejects_non_utf8 test → test_udp_receive_from_preserves_non_utf8_bytes now asserts the opposite contract: a wire datagram with [0xFF, 0xFE, 0x00, b'x', 0xC0, 0x42] must come back byte-for-byte identical. That's the load-bearing assertion that protects against future regressions.
claude[bot] commented 2026-04-27 04:11:07 +00:00 (Migrated from github.com)

Code Review — PR #436: Binary-clean SeqString (CP5)

Overview

This PR removes the UTF-8 invariant from SeqString, making UDP, TCP, and file I/O byte-clean so binary protocols (OSC, DNS, NTP, MessagePack) can round-trip through the runtime without being dropped. The core design is sound and the checkpoint is well-structured. A few concrete correctness issues are worth fixing before merging.


What's working well

  • Drop fix (seqstring.rs): Switching from String::from_raw_parts to Vec::<u8>::from_raw_parts is the correct fix — the old code was UB on non-UTF-8 buffers.
  • PartialEq: Changed to compare as_bytes() — correct.
  • Clone: Changed to global_bytes(self.as_bytes().to_vec()) — correct, no UTF-8 round-trip.
  • TCP/UDP/file I/O: All three now use raw byte buffers — exactly right.
  • Sentinel tests in seqstring.rs: Excellent regression coverage for the core type.
  • REGISTRY_LOCK in UDP tests: Good fix for the flaky double-close races.
  • global_string/arena_string kept as wrappers: Clean backward-compatible API.

Correctness Issues

1. string.byte-length returns 0 for non-UTF-8 strings (string_ops/basic.rs)

// current — wrong
let len = s.as_str_or_empty().len() as i64;

// should be
let len = s.as_bytes().len() as i64;

as_str_or_empty() returns "" for non-UTF-8 bytes, so byte-length of a binary buffer silently returns 0. This is the primary introspection primitive for binary data — it needs to use as_bytes().

2. string.empty? misreports non-empty binary strings as empty (string_ops/basic.rs)

// current — wrong
let is_empty = s.as_str_or_empty().is_empty();

// should be
let is_empty = s.as_bytes().is_empty();

Any non-UTF-8, non-empty buffer will be reported as empty.

3. string.equal incorrectly equates all non-UTF-8 strings (string_ops/basic.rs)

// current — wrong
let equal = s1.as_str_or_empty() == s2.as_str_or_empty();

// should be (consistent with PartialEq, which already uses as_bytes)
let equal = s1.as_bytes() == s2.as_bytes();

Two different non-UTF-8 buffers both collapse to "", so string.equal returns true for any pair of binary strings. The PartialEq impl was fixed correctly; the FFI op was not.

4. string.concat silently drops binary content (string_ops/basic.rs)

// current — loses non-UTF-8 bytes
let result = format!("{}{}", s1.as_str_or_empty(), s2.as_str_or_empty());

// should be
let mut result = Vec::with_capacity(s1.as_bytes().len() + s2.as_bytes().len());
result.extend_from_slice(s1.as_bytes());
result.extend_from_slice(s2.as_bytes());
let stack = unsafe { push(stack, Value::String(global_bytes(result))) };

The module-level doc in seqstring.rs explicitly lists concat as a byte-clean operation. The implementation doesn't match the contract.

5. io.rs write/write-line drop non-UTF-8 bytes

// current — wrong
let str_slice = s.as_str_or_empty();
libc::write(1, str_slice.as_ptr() as *const _, str_slice.len());

// should be
let bytes = s.as_bytes();
libc::write(1, bytes.as_ptr() as *const _, bytes.len());

Writing a binary response body to stdout silently truncates to empty for non-UTF-8 content. Since libc::write already takes a raw pointer and length, there's no reason to go through &str.

6. Crypto ops use as_str_or_empty() for binary plaintext (crypto/aes.rs, crypto/ed25519.rs, crypto/pbkdf.rs)

These call aes_gcm_encrypt(plaintext.as_str_or_empty(), ...) etc. If the plaintext is binary (e.g., a raw key or pre-processed blob), it silently encrypts "" instead. The underlying crypto functions take &str — consider whether their signatures should be updated to &[u8], or document clearly that these ops only accept UTF-8 plaintext.


Minor Observations

  • string.contains / string.starts-with / string.find (string_ops/access.rs): All use as_str_or_empty(). For binary strings, these return degenerate results ("" contains "" → always true). If binary search is out of scope for now, a comment or failure-tuple return for non-UTF-8 input would be preferable to silent wrong answers.

  • serialize.rs TypedValue: Converts SeqString → owned String via as_str_or_empty(). Serialization of binary values silently loses content. Worth a TODO or explicit note that serialization is text-only.

  • encoding.rs decode paths: base64_decode / hex_decode still gate the result on String::from_utf8(bytes), so decoding binary-encoded binary data will fail (decoded bytes returned as empty with false). This may be intentional for CP5, but it's the natural next step — the encode side is now byte-clean, the decode side isn't.


Summary

The core SeqString redesign is correct and the I/O layer (TCP/UDP/file) is properly byte-clean. The main issue is that many string-level operations (byte-length, empty?, equal, concat, write) were updated to use as_str_or_empty() when they should use as_bytes() — causing silent data loss or wrong answers for binary strings. Items 1–5 above are the highest-priority fixes before CP6 (the OSC encoder) relies on these ops for binary payloads.

## Code Review — PR #436: Binary-clean SeqString (CP5) ### Overview This PR removes the UTF-8 invariant from `SeqString`, making UDP, TCP, and file I/O byte-clean so binary protocols (OSC, DNS, NTP, MessagePack) can round-trip through the runtime without being dropped. The core design is sound and the checkpoint is well-structured. A few concrete correctness issues are worth fixing before merging. --- ### What's working well - **`Drop` fix** (`seqstring.rs`): Switching from `String::from_raw_parts` to `Vec::<u8>::from_raw_parts` is the correct fix — the old code was UB on non-UTF-8 buffers. - **`PartialEq`**: Changed to compare `as_bytes()` — correct. - **`Clone`**: Changed to `global_bytes(self.as_bytes().to_vec())` — correct, no UTF-8 round-trip. - **TCP/UDP/file I/O**: All three now use raw byte buffers — exactly right. - **Sentinel tests** in `seqstring.rs`: Excellent regression coverage for the core type. - **`REGISTRY_LOCK`** in UDP tests: Good fix for the flaky double-close races. - **`global_string`/`arena_string` kept as wrappers**: Clean backward-compatible API. --- ### Correctness Issues #### 1. `string.byte-length` returns 0 for non-UTF-8 strings (`string_ops/basic.rs`) ```rust // current — wrong let len = s.as_str_or_empty().len() as i64; // should be let len = s.as_bytes().len() as i64; ``` `as_str_or_empty()` returns `""` for non-UTF-8 bytes, so byte-length of a binary buffer silently returns 0. This is the primary introspection primitive for binary data — it needs to use `as_bytes()`. #### 2. `string.empty?` misreports non-empty binary strings as empty (`string_ops/basic.rs`) ```rust // current — wrong let is_empty = s.as_str_or_empty().is_empty(); // should be let is_empty = s.as_bytes().is_empty(); ``` Any non-UTF-8, non-empty buffer will be reported as empty. #### 3. `string.equal` incorrectly equates all non-UTF-8 strings (`string_ops/basic.rs`) ```rust // current — wrong let equal = s1.as_str_or_empty() == s2.as_str_or_empty(); // should be (consistent with PartialEq, which already uses as_bytes) let equal = s1.as_bytes() == s2.as_bytes(); ``` Two different non-UTF-8 buffers both collapse to `""`, so `string.equal` returns `true` for any pair of binary strings. The `PartialEq` impl was fixed correctly; the FFI op was not. #### 4. `string.concat` silently drops binary content (`string_ops/basic.rs`) ```rust // current — loses non-UTF-8 bytes let result = format!("{}{}", s1.as_str_or_empty(), s2.as_str_or_empty()); // should be let mut result = Vec::with_capacity(s1.as_bytes().len() + s2.as_bytes().len()); result.extend_from_slice(s1.as_bytes()); result.extend_from_slice(s2.as_bytes()); let stack = unsafe { push(stack, Value::String(global_bytes(result))) }; ``` The module-level doc in `seqstring.rs` explicitly lists `concat` as a byte-clean operation. The implementation doesn't match the contract. #### 5. `io.rs` `write`/`write-line` drop non-UTF-8 bytes ```rust // current — wrong let str_slice = s.as_str_or_empty(); libc::write(1, str_slice.as_ptr() as *const _, str_slice.len()); // should be let bytes = s.as_bytes(); libc::write(1, bytes.as_ptr() as *const _, bytes.len()); ``` Writing a binary response body to stdout silently truncates to empty for non-UTF-8 content. Since `libc::write` already takes a raw pointer and length, there's no reason to go through `&str`. #### 6. Crypto ops use `as_str_or_empty()` for binary plaintext (`crypto/aes.rs`, `crypto/ed25519.rs`, `crypto/pbkdf.rs`) These call `aes_gcm_encrypt(plaintext.as_str_or_empty(), ...)` etc. If the plaintext is binary (e.g., a raw key or pre-processed blob), it silently encrypts `""` instead. The underlying crypto functions take `&str` — consider whether their signatures should be updated to `&[u8]`, or document clearly that these ops only accept UTF-8 plaintext. --- ### Minor Observations - **`string.contains` / `string.starts-with` / `string.find`** (`string_ops/access.rs`): All use `as_str_or_empty()`. For binary strings, these return degenerate results (`""` contains `""` → always true). If binary search is out of scope for now, a comment or failure-tuple return for non-UTF-8 input would be preferable to silent wrong answers. - **`serialize.rs` `TypedValue`**: Converts `SeqString` → owned `String` via `as_str_or_empty()`. Serialization of binary values silently loses content. Worth a TODO or explicit note that serialization is text-only. - **`encoding.rs` decode paths**: `base64_decode` / `hex_decode` still gate the result on `String::from_utf8(bytes)`, so decoding binary-encoded binary data will fail (decoded bytes returned as empty with `false`). This may be intentional for CP5, but it's the natural next step — the encode side is now byte-clean, the decode side isn't. --- ### Summary The core `SeqString` redesign is correct and the I/O layer (TCP/UDP/file) is properly byte-clean. The main issue is that many string-level operations (`byte-length`, `empty?`, `equal`, `concat`, `write`) were updated to use `as_str_or_empty()` when they should use `as_bytes()` — causing silent data loss or wrong answers for binary strings. Items 1–5 above are the highest-priority fixes before CP6 (the OSC encoder) relies on these ops for binary payloads.
claude[bot] commented 2026-04-27 04:49:47 +00:00 (Migrated from github.com)

Code Review - CP5 SeqString Byte-Cleanliness (placeholder)

Code Review - CP5 SeqString Byte-Cleanliness (placeholder)
claude[bot] commented 2026-04-27 04:50:24 +00:00 (Migrated from github.com)

Updating placeholder with full review...

Updating placeholder with full review...
claude[bot] commented 2026-04-27 04:51:12 +00:00 (Migrated from github.com)

Code Review - CP5: SeqString Byte-Cleanliness

Overview: This PR removes the UTF-8 invariant from SeqString, making Seq string type a transparent byte buffer. The change is consistent, well-motivated, and clearly documented. Binary protocols (UDP/OSC, TCP binary frames, raw crypto payloads, base64/hex round-trips) that previously silently failed or produced undefined behavior now work correctly. The scope is large (~1200 add / 540 del) but mechanically uniform.


What the PR does well

Core design is sound. The four-way split -- as_bytes() (always succeeds), as_str() returning Option (validates UTF-8), as_str_lossy() (human display), as_str_or_empty() (text-ops with existing empty-input failure path) -- has clear call-site guidance. The invariant table in the module doc matches the design doc.

Unsafe memory management is correct. Replacing String::from_raw_parts with Vec-u8::from_raw_parts in Drop is the right fix. Using String::from_raw_parts on non-UTF-8 bytes was undefined behavior. Capacity bookkeeping is unchanged.

PartialEq now uses bytes. Two non-UTF-8 buffers with different bytes were previously both None under the old as_str() and would have compared equal. Fixed.

Test sentinel coverage is thorough. Every mutated subsystem gets a sentinel-byte regression test. The REGISTRY_LOCK in UDP tests correctly addresses id-reuse flakiness.


Concerns

1. as_str_or_empty() silently degrades text operations (medium): The design doc says text-level ops should fail loudly with the (value Bool) failure tuple on non-UTF-8 input, but those call sites use as_str_or_empty() which silently returns empty string. For example, string.length of a 7-byte non-UTF-8 buffer silently returns 0, indistinguishable from empty string. The design doc puts these in the Text-required bucket and says to call as_str() and return a failure tuple on None. The code uses as_str_or_empty() instead. Code and doc contradict each other. Suggestion: Either update the doc to say silent-empty is the chosen strategy, or add TODO CP6 comments at representative call sites.

2. symbol_equal fallback uses as_str_or_empty() instead of as_bytes() (minor): In the conversion.rs fallback path, s1.as_str_or_empty() == s2.as_str_or_empty() -- two symbols with different non-UTF-8 bytes both collapse to empty string and incorrectly compare equal. The hot path (pointer equality) is fine, but this fallback can silently lie. Suggestion: Change to s1.as_bytes() == s2.as_bytes().

3. string.split with empty delimiter produces extra elements (minor): The hand-rolled byte-split pushes empty leading and trailing elements around each individual byte. This differs from Rust str::split behavior and may differ from Seq documented contract. No test covers this case. Suggestion: Add a test splitting abc on empty string and document whether the extra-element behaviour is intentional.

4. serialize.rs / test.rs silently drop non-UTF-8 strings (minor): as_str_or_empty() in serialization converts a binary SeqString to empty string. The design doc notes SON does not round-trip binary, so this may be intentional -- but a comment at those call sites would make that explicit.

5. HTTP body bytes silently discarded (low, likely by design): http_client.rs calls as_str_or_empty() for the request body. A binary POST body silently becomes empty. If HTTP binary bodies are out of scope, a comment would prevent future confusion.

6. path_str() helper inconsistently applied (low): file.rs introduces path_str() with good documentation. os.rs calls as_str_or_empty() directly without explanation. Consider reusing the helper or the comment pattern.


Safety audit: Old Drop with String::from_raw_parts on non-UTF-8 bytes was UB -- fixed correctly with Vec-u8::from_raw_parts. The as_bytes() safety comment correctly ties the slice lifetime to self. unsafe impl Send/Sync unchanged and valid. Arena strings use capacity=0 to signal do-not-free; invariant preserved.


Summary: The architectural decision is correct and the implementation is largely faithful to it. Three actionable items before merge: (1) Fix symbol_equal fallback to s1.as_bytes() == s2.as_bytes() -- one-liner. (2) Resolve design-doc / code contradiction on text-ops: update the doc or add TODOs. (3) Add empty-delimiter split test documenting the extra-element behaviour. Everything else is non-blocking and can be followed up in CP6.

Review by Claude Sonnet 4.6

**Code Review - CP5: SeqString Byte-Cleanliness** **Overview:** This PR removes the UTF-8 invariant from SeqString, making Seq string type a transparent byte buffer. The change is consistent, well-motivated, and clearly documented. Binary protocols (UDP/OSC, TCP binary frames, raw crypto payloads, base64/hex round-trips) that previously silently failed or produced undefined behavior now work correctly. The scope is large (~1200 add / 540 del) but mechanically uniform. --- **What the PR does well** Core design is sound. The four-way split -- as_bytes() (always succeeds), as_str() returning Option (validates UTF-8), as_str_lossy() (human display), as_str_or_empty() (text-ops with existing empty-input failure path) -- has clear call-site guidance. The invariant table in the module doc matches the design doc. Unsafe memory management is correct. Replacing String::from_raw_parts with Vec-u8::from_raw_parts in Drop is the right fix. Using String::from_raw_parts on non-UTF-8 bytes was undefined behavior. Capacity bookkeeping is unchanged. PartialEq now uses bytes. Two non-UTF-8 buffers with different bytes were previously both None under the old as_str() and would have compared equal. Fixed. Test sentinel coverage is thorough. Every mutated subsystem gets a sentinel-byte regression test. The REGISTRY_LOCK in UDP tests correctly addresses id-reuse flakiness. --- **Concerns** **1. as_str_or_empty() silently degrades text operations (medium):** The design doc says text-level ops should fail loudly with the (value Bool) failure tuple on non-UTF-8 input, but those call sites use as_str_or_empty() which silently returns empty string. For example, string.length of a 7-byte non-UTF-8 buffer silently returns 0, indistinguishable from empty string. The design doc puts these in the Text-required bucket and says to call as_str() and return a failure tuple on None. The code uses as_str_or_empty() instead. Code and doc contradict each other. Suggestion: Either update the doc to say silent-empty is the chosen strategy, or add TODO CP6 comments at representative call sites. **2. symbol_equal fallback uses as_str_or_empty() instead of as_bytes() (minor):** In the conversion.rs fallback path, s1.as_str_or_empty() == s2.as_str_or_empty() -- two symbols with different non-UTF-8 bytes both collapse to empty string and incorrectly compare equal. The hot path (pointer equality) is fine, but this fallback can silently lie. Suggestion: Change to s1.as_bytes() == s2.as_bytes(). **3. string.split with empty delimiter produces extra elements (minor):** The hand-rolled byte-split pushes empty leading and trailing elements around each individual byte. This differs from Rust str::split behavior and may differ from Seq documented contract. No test covers this case. Suggestion: Add a test splitting abc on empty string and document whether the extra-element behaviour is intentional. **4. serialize.rs / test.rs silently drop non-UTF-8 strings (minor):** as_str_or_empty() in serialization converts a binary SeqString to empty string. The design doc notes SON does not round-trip binary, so this may be intentional -- but a comment at those call sites would make that explicit. **5. HTTP body bytes silently discarded (low, likely by design):** http_client.rs calls as_str_or_empty() for the request body. A binary POST body silently becomes empty. If HTTP binary bodies are out of scope, a comment would prevent future confusion. **6. path_str() helper inconsistently applied (low):** file.rs introduces path_str() with good documentation. os.rs calls as_str_or_empty() directly without explanation. Consider reusing the helper or the comment pattern. --- **Safety audit:** Old Drop with String::from_raw_parts on non-UTF-8 bytes was UB -- fixed correctly with Vec-u8::from_raw_parts. The as_bytes() safety comment correctly ties the slice lifetime to self. unsafe impl Send/Sync unchanged and valid. Arena strings use capacity=0 to signal do-not-free; invariant preserved. --- **Summary:** The architectural decision is correct and the implementation is largely faithful to it. Three actionable items before merge: (1) Fix symbol_equal fallback to s1.as_bytes() == s2.as_bytes() -- one-liner. (2) Resolve design-doc / code contradiction on text-ops: update the doc or add TODOs. (3) Add empty-delimiter split test documenting the extra-element behaviour. Everything else is non-blocking and can be followed up in CP6. _Review by Claude Sonnet 4.6_
claude[bot] commented 2026-04-27 14:41:02 +00:00 (Migrated from github.com)

Code Review

Overview

This PR (CP5) completes the byte-cleanliness initiative for the Seq runtime. It drops the UTF-8 invariant from SeqString, migrates every runtime module to use as_bytes() for byte-clean operations and as_str()/as_str_or_empty() only at text-processing boundaries, and covers the work with a comprehensive regression test suite. The design is principled and the execution is thorough.


Strengths

  • Correct core abstraction. Removing the UTF-8 invariant from the type and validating at the text-operation boundary (rather than at construction time) is the right approach. The four-method surface — as_bytes, as_str, as_str_lossy, as_str_or_empty — maps cleanly to the different caller categories.
  • Unsafe block narrowed. Moving Drop from String::from_raw_parts to Vec::<u8>::from_raw_parts removes a now-invalid UTF-8 requirement from memory deallocation without changing the allocation size, which is correct.
  • UB eliminated. The old as_str used from_utf8_unchecked, which would have been UB on any non-UTF-8 buffer constructible through other paths. Replacing it with a checked from_utf8(...).ok() is the right fix.
  • Consistent sentinel testing. Using the same 6–7 byte non-UTF-8 sentinel across seqstring, encoding, string_ops, crypto, and http_client test modules makes regressions easy to catch.
  • TCP and UDP are now actually byte-clean. The old code rejected non-UTF-8 datagrams at receive time. Both are now correct.

Issues

Moderate

string.join still converts via as_str_or_empty — silently loses binary parts.
In access.rs:302, the join map closure converts each Value::String through as_str_or_empty(), which silently collapses non-UTF-8 parts to "". This is inconsistent: split is now byte-clean but join is not. If you split a binary buffer on a delimiter then reassemble with join on the same delimiter, the parts are silently corrupted. Either join should be byte-clean (byte-concatenate with separator bytes), or the limitation should be clearly documented.

Regex on non-UTF-8 input silently degenerates via as_str_or_empty.
regex.rs routes both the pattern and the text through as_str_or_empty(). A non-UTF-8 input degenerates to matching "" against "", returning true for any pattern that matches the empty string — or returning no matches when the actual bytes contain a match. The fix is either to validate UTF-8 upfront and return the failure tuple, or to use the regex crate's bytes sub-module. Silent degeneration is worse than an explicit failure.

test.assert-eq-str is broken for binary strings.
test.rs:412 does as_str_or_empty() on both sides. Two different non-UTF-8 strings both become "" and compare equal, making binary test assertions vacuous. Programs using test.assert-eq-str with binary content will silently pass. Consider adding a test.assert-eq-bytes op, or routing through as_bytes() directly.

string.split empty-delimiter shape should be documented at the user level.
The implementation mirrors Rust's &str::split("") shape (empty leading and trailing pieces), which is non-obvious to Seq programmers. The test pins the behaviour, but the docstring in access.rs and the language guide should say explicitly what ("abc" "" string.split) returns.

Minor

path_str is duplicated between file.rs and os.rs.
Both files define an identical fn path_str(s: &SeqString) -> &str helper with copy-pasted comments. This could live in a shared utility or as a SeqString method.

read_response_bytes body-size check is off-by-one.
http_client.rs reads up to MAX_BODY_SIZE + 1 bytes, then checks if body.len() > MAX_BODY_SIZE. A response of exactly MAX_BODY_SIZE bytes therefore reads one extra byte beyond the declared limit. Worth cleaning up while the area is touched.

as_str_or_empty is easy to misuse.
The helper intentionally returns "" on non-UTF-8 input and is meant only for operations that already have a degenerate-input path. A doc comment warning future callers of this contract would help prevent accidental misuse.


Test Coverage

Coverage is strong. The sentinel-based unit tests in seqstring, crypto, encoding, string_ops, and http_client form a coherent regression guard. The UDP loopback test polarity flip (from "rejects" to "preserves") is exactly right. Gaps worth closing before the next checkpoint:

  • No test for the join + split round-trip with binary content (would expose the join issue above).
  • No test for regex.* on non-UTF-8 input to pin the degeneration behaviour.
  • No test for test.assert-eq-str with two distinct non-UTF-8 strings (currently passes vacuously).

Security

No new security issues introduced. constant_time_eq and HMAC correctly switch to as_bytes(), maintaining their constant-time properties. The SSRF check still validates the URL via as_str_or_empty(), so a non-UTF-8 URL silently becomes "" and fails validation — fine in practice, but worth a comment.


Summary

The core change is correct and well-tested. The two most actionable items before merge:

  1. Fix string.join to be byte-clean, or explicitly document the asymmetry with split.
  2. Fix test.assert-eq-str to use as_bytes — vacuous passing assertions undermine the test framework.

The regex silent-degeneration is less urgent but should be tracked as a follow-up.

## Code Review ### Overview This PR (CP5) completes the byte-cleanliness initiative for the Seq runtime. It drops the UTF-8 invariant from `SeqString`, migrates every runtime module to use `as_bytes()` for byte-clean operations and `as_str()`/`as_str_or_empty()` only at text-processing boundaries, and covers the work with a comprehensive regression test suite. The design is principled and the execution is thorough. --- ### Strengths - **Correct core abstraction.** Removing the UTF-8 invariant from the type and validating at the text-operation boundary (rather than at construction time) is the right approach. The four-method surface — `as_bytes`, `as_str`, `as_str_lossy`, `as_str_or_empty` — maps cleanly to the different caller categories. - **Unsafe block narrowed.** Moving `Drop` from `String::from_raw_parts` to `Vec::<u8>::from_raw_parts` removes a now-invalid UTF-8 requirement from memory deallocation without changing the allocation size, which is correct. - **UB eliminated.** The old `as_str` used `from_utf8_unchecked`, which would have been UB on any non-UTF-8 buffer constructible through other paths. Replacing it with a checked `from_utf8(...).ok()` is the right fix. - **Consistent sentinel testing.** Using the same 6–7 byte non-UTF-8 sentinel across `seqstring`, `encoding`, `string_ops`, `crypto`, and `http_client` test modules makes regressions easy to catch. - **TCP and UDP are now actually byte-clean.** The old code rejected non-UTF-8 datagrams at receive time. Both are now correct. --- ### Issues #### Moderate **`string.join` still converts via `as_str_or_empty` — silently loses binary parts.** In `access.rs:302`, the `join` map closure converts each `Value::String` through `as_str_or_empty()`, which silently collapses non-UTF-8 parts to `""`. This is inconsistent: `split` is now byte-clean but `join` is not. If you split a binary buffer on a delimiter then reassemble with `join` on the same delimiter, the parts are silently corrupted. Either `join` should be byte-clean (byte-concatenate with separator bytes), or the limitation should be clearly documented. **Regex on non-UTF-8 input silently degenerates via `as_str_or_empty`.** `regex.rs` routes both the pattern and the text through `as_str_or_empty()`. A non-UTF-8 input degenerates to matching `""` against `""`, returning `true` for any pattern that matches the empty string — or returning no matches when the actual bytes contain a match. The fix is either to validate UTF-8 upfront and return the failure tuple, or to use the `regex` crate's `bytes` sub-module. Silent degeneration is worse than an explicit failure. **`test.assert-eq-str` is broken for binary strings.** `test.rs:412` does `as_str_or_empty()` on both sides. Two different non-UTF-8 strings both become `""` and compare equal, making binary test assertions vacuous. Programs using `test.assert-eq-str` with binary content will silently pass. Consider adding a `test.assert-eq-bytes` op, or routing through `as_bytes()` directly. **`string.split` empty-delimiter shape should be documented at the user level.** The implementation mirrors Rust's `&str::split("")` shape (empty leading and trailing pieces), which is non-obvious to Seq programmers. The test pins the behaviour, but the docstring in `access.rs` and the language guide should say explicitly what `("abc" "" string.split)` returns. #### Minor **`path_str` is duplicated between `file.rs` and `os.rs`.** Both files define an identical `fn path_str(s: &SeqString) -> &str` helper with copy-pasted comments. This could live in a shared utility or as a `SeqString` method. **`read_response_bytes` body-size check is off-by-one.** `http_client.rs` reads up to `MAX_BODY_SIZE + 1` bytes, then checks `if body.len() > MAX_BODY_SIZE`. A response of exactly `MAX_BODY_SIZE` bytes therefore reads one extra byte beyond the declared limit. Worth cleaning up while the area is touched. **`as_str_or_empty` is easy to misuse.** The helper intentionally returns `""` on non-UTF-8 input and is meant only for operations that already have a degenerate-input path. A doc comment warning future callers of this contract would help prevent accidental misuse. --- ### Test Coverage Coverage is strong. The sentinel-based unit tests in `seqstring`, `crypto`, `encoding`, `string_ops`, and `http_client` form a coherent regression guard. The UDP loopback test polarity flip (from "rejects" to "preserves") is exactly right. Gaps worth closing before the next checkpoint: - No test for the `join` + `split` round-trip with binary content (would expose the `join` issue above). - No test for `regex.*` on non-UTF-8 input to pin the degeneration behaviour. - No test for `test.assert-eq-str` with two distinct non-UTF-8 strings (currently passes vacuously). --- ### Security No new security issues introduced. `constant_time_eq` and HMAC correctly switch to `as_bytes()`, maintaining their constant-time properties. The SSRF check still validates the URL via `as_str_or_empty()`, so a non-UTF-8 URL silently becomes `""` and fails validation — fine in practice, but worth a comment. --- ### Summary The core change is correct and well-tested. The two most actionable items before merge: 1. **Fix `string.join`** to be byte-clean, or explicitly document the asymmetry with `split`. 2. **Fix `test.assert-eq-str`** to use `as_bytes` — vacuous passing assertions undermine the test framework. The regex silent-degeneration is less urgent but should be tracked as a follow-up.
Sign in to join this conversation.
No description provided.