cp6 #437

Merged
navicore merged 2 commits from cp6 into main 2026-04-27 21:02:00 +00:00
navicore commented 2026-04-27 20:19:14 +00:00 (Migrated from github.com)
No description provided.
claude[bot] commented 2026-04-27 20:21:49 +00:00 (Migrated from github.com)

Code Review — cp6: Byte-clean string literals + OSC byte primitives

Overview

This PR lands two related features:

  1. Byte-clean StringLiteral — changes the AST node payload from String to Vec<u8> so \xNN escapes produce the literal byte 0xNN rather than the UTF-8 encoding of U+00NN. High-byte values (0x80–0xFF) no longer silently expand to two-byte UTF-8 sequences.
  2. Binary protocol builtinsint.to-bytes-i32-be and float.to-bytes-f32-be produce 4-byte big-endian strings, plus a real OSC 1.0 encoder as a motivating example.

The change is internally consistent: the migration from StringVec<u8> touches the parser, AST, codegen, and all test files uniformly.


Code Quality

Strengths

  • unescape_string change is exactly right: result.push(byte_val) vs. the old result.push(byte_val as char) which silently went through UTF-8 encoding.
  • escape_llvm_string refactored to byte-by-byte matching — cleaner and correct for the new payload type.
  • Non-escape source characters are emitted via ch.encode_utf8(&mut buf), which preserves valid UTF-8 for human-readable text while treating escape sequences as raw bytes. Good layering.
  • patch_seq_push_string_bytes null-pointer guard for the len == 0 case is correct.
  • The OSC encoder is a solid integration demonstration with byte-exact expected outputs.

Concerns

1. Backslash escape in escape_llvm_string — possible pre-existing bug carried forward

crates/compiler/src/codegen/globals.rs:

b'\\' => result.push_str(r"\\"),

This emits \\ in LLVM IR. Per the LLVM Language Reference, the only recognized string escape inside an assembly string constant is \XX (two hex digits). \\ is \ followed by \ — the second \ begins a new escape expecting two hex chars. The correct encoding of a literal backslash would be \5C.

This was present before this PR, but the refactoring is a good time to fix it. If the tests currently pass, it likely means the LLVM assembler is lenient here, but relying on undocumented tolerance is fragile.

Suggestion: change the arm to b'\\' => result.push_str(r"\5C") to match the hex-escape convention used for every other non-printable byte.

2. patch_seq_push_string is now silently more permissive

crates/runtime/src/io.rs — before:

CStr::from_ptr(c_str).to_str().expect("push_string: invalid UTF-8")

After:

let bytes = unsafe { CStr::from_ptr(c_str).to_bytes() };

The old code would panic loudly on non-UTF-8 C strings; the new code accepts them silently. The comment documents the intended callers (variant tags, NULL-FFI fallbacks — all ASCII), so this is unlikely to matter in practice. But the loss of the diagnostic is worth noting: if a future codegen path accidentally routes binary data through push_string instead of push_string_bytes, it will produce wrong output silently rather than panicking.

Suggestion: add a debug_assert that verifies UTF-8 in debug builds, or at minimum note the accepted bytes are expected ASCII in the safety contract.

3. Hardcoded binary path in justfile

target/examples/projects-live-coding-csound-test_osc

If build-examples changes the output naming convention, or runs on a platform where the binary has an extension, this line silently produces a "no such file" error that might be swallowed depending on the shell. Consider using just run-example live-coding-csound/test_osc (if such a recipe exists) or at least adding a -f existence check with a clear error message.

4. osc.zero-pad — out-of-range silent fallthrough

: osc.zero-pad ( Int -- String )
  dup 1 = [ drop "\x00" ]
  [ dup 2 = [ drop "\x00\x00" ]
    [ dup 3 = [ drop "\x00\x00\x00" ]
              [ drop "\x00\x00\x00\x00" ] if ] if ] if
;

The else-of-else arm silently handles any value ≥ 4 (including 0) as "four NULs". Given the calling context 4 - (len % 4) always yields 1–4, this is currently safe. Adding a comment or an assertion that the expected range is 1–4 would make the constraint explicit.


Test Coverage

Good coverage overall:

  • Parser tests confirm byte-level round-tripping for \x00\xFF.
  • test_hex_escape_full_byte_range is a useful new sanity test.
  • bytes_ops/tests.rs covers positive, negative, truncation, zero, and canonical OSC cases for both builtins.
  • OSC encoder has byte-exact expected outputs for all message shapes.

Gaps worth considering:

  • No codegen test that verifies the LLVM IR output for a string literal containing an embedded NUL (i.e., confirming \00 appears in the .str global and push_string_bytes is called with the correct length). A snapshot/golden test for the LLVM IR of "\x00abc" would lock in the end-to-end behavior.
  • No test for osc.string on a string whose byte-length is already a multiple of 4 plus 0 remainder (e.g. len=4: should produce 4 NUL bytes total, not 0).

Performance

Each call to int_to_bytes_i32_be / float_to_bytes_f32_be allocates a 4-element Vec<u8> before passing it to global_bytes. For live-coding OSC at audio rates (potentially thousands of messages per second), this is fine. If this primitive later lands in a tight loop at much higher rates, consider a fixed-size array path.


Summary

This is a well-motivated and cleanly executed change. The migration is thorough and consistent. The main actionable items are:

  1. (Low risk, worth fixing)\5C vs \\ in escape_llvm_string for backslash bytes.
  2. (Documentation) — Add a debug assertion or safety note to patch_seq_push_string about its ASCII-only intended callers.
  3. (Robustness) — Guard the hardcoded binary path in justfile.
  4. (Nice to have) — A LLVM IR snapshot test for strings with embedded NULs.

None of these are blockers. The core semantics change is correct.

## Code Review — cp6: Byte-clean string literals + OSC byte primitives ### Overview This PR lands two related features: 1. **Byte-clean `StringLiteral`** — changes the AST node payload from `String` to `Vec<u8>` so `\xNN` escapes produce the literal byte `0xNN` rather than the UTF-8 encoding of U+00NN. High-byte values (0x80–0xFF) no longer silently expand to two-byte UTF-8 sequences. 2. **Binary protocol builtins** — `int.to-bytes-i32-be` and `float.to-bytes-f32-be` produce 4-byte big-endian strings, plus a real OSC 1.0 encoder as a motivating example. The change is internally consistent: the migration from `String` → `Vec<u8>` touches the parser, AST, codegen, and all test files uniformly. --- ### Code Quality **Strengths** - `unescape_string` change is exactly right: `result.push(byte_val)` vs. the old `result.push(byte_val as char)` which silently went through UTF-8 encoding. - `escape_llvm_string` refactored to byte-by-byte matching — cleaner and correct for the new payload type. - Non-escape source characters are emitted via `ch.encode_utf8(&mut buf)`, which preserves valid UTF-8 for human-readable text while treating escape sequences as raw bytes. Good layering. - `patch_seq_push_string_bytes` null-pointer guard for the `len == 0` case is correct. - The OSC encoder is a solid integration demonstration with byte-exact expected outputs. **Concerns** #### 1. Backslash escape in `escape_llvm_string` — possible pre-existing bug carried forward `crates/compiler/src/codegen/globals.rs`: ```rust b'\\' => result.push_str(r"\\"), ``` This emits `\\` in LLVM IR. Per the LLVM Language Reference, the only recognized string escape inside an assembly string constant is `\XX` (two hex digits). `\\` is `\` followed by `\` — the second `\` begins a new escape expecting two hex chars. The correct encoding of a literal backslash would be `\5C`. This was present before this PR, but the refactoring is a good time to fix it. If the tests currently pass, it likely means the LLVM assembler is lenient here, but relying on undocumented tolerance is fragile. **Suggestion:** change the arm to `b'\\' => result.push_str(r"\5C")` to match the hex-escape convention used for every other non-printable byte. #### 2. `patch_seq_push_string` is now silently more permissive `crates/runtime/src/io.rs` — before: ```rust CStr::from_ptr(c_str).to_str().expect("push_string: invalid UTF-8") ``` After: ```rust let bytes = unsafe { CStr::from_ptr(c_str).to_bytes() }; ``` The old code would panic loudly on non-UTF-8 C strings; the new code accepts them silently. The comment documents the intended callers (variant tags, NULL-FFI fallbacks — all ASCII), so this is unlikely to matter in practice. But the loss of the diagnostic is worth noting: if a future codegen path accidentally routes binary data through `push_string` instead of `push_string_bytes`, it will produce wrong output silently rather than panicking. **Suggestion:** add a `debug_assert` that verifies UTF-8 in debug builds, or at minimum note the accepted bytes are expected ASCII in the safety contract. #### 3. Hardcoded binary path in `justfile` ```makefile target/examples/projects-live-coding-csound-test_osc ``` If `build-examples` changes the output naming convention, or runs on a platform where the binary has an extension, this line silently produces a "no such file" error that might be swallowed depending on the shell. Consider using `just run-example live-coding-csound/test_osc` (if such a recipe exists) or at least adding a `-f` existence check with a clear error message. #### 4. `osc.zero-pad` — out-of-range silent fallthrough ``` : osc.zero-pad ( Int -- String ) dup 1 = [ drop "\x00" ] [ dup 2 = [ drop "\x00\x00" ] [ dup 3 = [ drop "\x00\x00\x00" ] [ drop "\x00\x00\x00\x00" ] if ] if ] if ; ``` The else-of-else arm silently handles any value ≥ 4 (including 0) as "four NULs". Given the calling context `4 - (len % 4)` always yields 1–4, this is currently safe. Adding a comment or an assertion that the expected range is 1–4 would make the constraint explicit. --- ### Test Coverage Good coverage overall: - Parser tests confirm byte-level round-tripping for `\x00`–`\xFF`. - `test_hex_escape_full_byte_range` is a useful new sanity test. - `bytes_ops/tests.rs` covers positive, negative, truncation, zero, and canonical OSC cases for both builtins. - OSC encoder has byte-exact expected outputs for all message shapes. **Gaps worth considering:** - No codegen test that verifies the LLVM IR output for a string literal containing an embedded NUL (i.e., confirming `\00` appears in the `.str` global and `push_string_bytes` is called with the correct length). A snapshot/golden test for the LLVM IR of `"\x00abc"` would lock in the end-to-end behavior. - No test for `osc.string` on a string whose byte-length is already a multiple of 4 plus 0 remainder (e.g. len=4: should produce 4 NUL bytes total, not 0). --- ### Performance Each call to `int_to_bytes_i32_be` / `float_to_bytes_f32_be` allocates a 4-element `Vec<u8>` before passing it to `global_bytes`. For live-coding OSC at audio rates (potentially thousands of messages per second), this is fine. If this primitive later lands in a tight loop at much higher rates, consider a fixed-size array path. --- ### Summary This is a well-motivated and cleanly executed change. The migration is thorough and consistent. The main actionable items are: 1. **(Low risk, worth fixing)** — `\5C` vs `\\` in `escape_llvm_string` for backslash bytes. 2. **(Documentation)** — Add a debug assertion or safety note to `patch_seq_push_string` about its ASCII-only intended callers. 3. **(Robustness)** — Guard the hardcoded binary path in `justfile`. 4. **(Nice to have)** — A LLVM IR snapshot test for strings with embedded NULs. None of these are blockers. The core semantics change is correct.
claude[bot] commented 2026-04-27 20:56:31 +00:00 (Migrated from github.com)

Code Review — PR #437: cp6 (Byte-clean strings + binary protocol builtins)

Overview

This PR makes Seq strings genuinely byte-clean by changing StringLiteral(String)StringLiteral(Vec<u8>) throughout the compiler, fixing \xNN escapes to produce literal bytes (not UTF-8-encoded codepoints), switching the codegen target for string literals from patch_seq_push_string (C-string, NUL-terminates) to a new patch_seq_push_string_bytes (explicit length), and adding two new builtins (int.to-bytes-i32-be, float.to-bytes-f32-be) along with a working OSC 1.0 encoder as a proof-of-concept.

This is a clean, well-motivated refactoring. The core change is correct, the diff is mechanically thorough, and the new tests are well-chosen.


Strengths

  • Complete migration: Every StringLiteral(String) call site was updated; the compiler would have caught any missed ones, but it's good that none slipped through.
  • LLVM IR correctness fix: Encoding backslash as \5C instead of \\ is technically more correct per LLVM IR spec, and the comment explaining why is exactly the right level of detail.
  • Good test layering: escape_llvm_string unit tests, bytes_ops unit tests, updated parser tests, and end-to-end OSC byte-format tests together give strong coverage of the new behaviour.
  • Null-safe push_string_bytes: The if len == 0 { Vec::new() } guard correctly handles zero-length with a potentially-null pointer.
  • test_hex_escape_full_byte_range: Nice sanity test that covers the full 0x000xFF range explicitly.

Issues and Suggestions

1. Debug-only assertion in patch_seq_push_string (io.rs)

debug_assert!(
    std::str::from_utf8(bytes).is_ok(),
    "push_string: input must be valid UTF-8 ...",
);

In release builds this silently accepts non-UTF-8 bytes, which could cause silent data corruption if a future codegen path incorrectly routes binary data here. The comment says "the comment above is the contract" — but that relies on the codegen never making a mistake. Consider using #[cfg(debug_assertions)] with an actual panic path (not just assert), or noting in the function's # Safety doc that callers must guarantee UTF-8 and any violation is UB/undefined behaviour from the ABI contract perspective.

2. Redundant pub use aliases at the bottom of bytes_ops.rs

pub use patch_seq_float_to_bytes_f32_be as float_to_bytes_f32_be;
pub use patch_seq_int_to_bytes_i32_be as int_to_bytes_i32_be;

lib.rs already re-exports these directly from their full names (patch_seq_float_to_bytes_f32_be). These aliases inside the module are unused externally and don't match the pattern of other runtime modules (e.g., float_ops, string_ops don't have internal re-export aliases). They add noise without benefit — remove them or move the alias responsibility entirely to lib.rs.

3. LLVM ABI: i64 vs isize for len in push_string_bytes

declare ptr @patch_seq_push_string_bytes(ptr, ptr, i64)

The Rust signature uses usize. On 64-bit (the only real target today) this is fine, but the hardcoded i64 in the LLVM declaration will silently misbehave on 32-bit systems where usize = i32. If 32-bit is a non-goal, a single comment to that effect would head off future confusion. If 32-bit ever needs to work, this needs a target-width check.

4. No integration test for \xNN high-byte round-trip through the full pipeline

The parser tests verify the Vec<u8> payload; the codegen tests verify LLVM IR emission; the bytes_ops unit tests verify the runtime functions. But there's no integration test that compiles and runs a Seq program containing "\xDC\xFF" and verifies the bytes arrive intact at the runtime (e.g. via string.byte-length or comparing against a known pattern). The OSC tests cover this indirectly, but a targeted integration test makes the guarantee explicit and will catch regressions in the codegen path independently of OSC semantics.

5. osc.zero-pad — silent fallback for out-of-range inputs

: osc.zero-pad ( Int -- String )
  dup 1 = [ drop "\x00" ]
  [ dup 2 = [ drop "\x00\x00" ]
    [ dup 3 = [ drop "\x00\x00\x00" ]
              [ drop "\x00\x00\x00\x00" ] if ] if ] if
;

The comment says "passing values outside 1..4 silently produces 4 NULs." For any caller other than osc.string this is a silent wrong answer, not a helpful error. Since the current only caller is osc.string (and the formula is statically bounded to 1..4), this isn't a runtime risk today — but if osc.zero-pad is ever used directly by user code, the failure mode is non-obvious. Consider an assertion/panic for the out-of-range case, or rename it to make the contract more obvious (e.g., osc.zero-pad-1-to-4).

6. OSC encoder: seq:allow(unchecked-mod) placement

# The Bool from `i.%` is dropped intentionally — the divisor is the
# literal `4`, so divide-by-zero is statically impossible.
# seq:allow(unchecked-mod)
: osc.string ( String -- String )

The annotation is placed inside the doc comment block, which is the correct Seq convention, but it's worth double-checking that seq:allow placed on the word definition suppresses the warning for the drop on the Bool result (as opposed to the division itself). If the linter attaches to the drop call-site rather than the word, the annotation may not fire.

7. Little-endian variants

OSC mandates big-endian, so the current scope is correct. But if bytes_ops is intended as the start of a general binary-encoding module (as the module-level doc implies: "OSC, DNS, NTP, MessagePack, Protobuf, etc."), little-endian variants will be needed shortly — MessagePack uses big-endian, but DNS, NTP, and most "native" protocols vary. No action needed now, just worth a TODO note in the module doc or a tracked issue so this doesn't get forgotten.


Minor Nits

  • The // ===... banner comment in io.rs for the two-line push_string_bytes section is heavier than necessary given the function-level doc is already clear.
  • In bytes_ops/tests.rs, pop_string_bytes calls s.as_bytes().to_vec() — since SeqString is already byte-backed, check whether as_bytes() copies unnecessarily. A borrow might suffice.
  • The OSC test comment # Int32 + float32: ,if. The Csound POC's bread-and-butter: trigger has a stray backtick that breaks the inline formatting.

Test verification

Manually verified the OSC test vectors:

  • /synth/play (11 bytes) → pad by 1 → 12 bytes ✓
  • ,if (3 bytes) → pad by 1 → 4 bytes ✓
  • 10000x000003E8
  • 220.0 → f32 0x435C0000
  • Total 24 bytes matches test-osc-msg-if-length-aligned

All other test vectors check out.


Overall: Approve with the pub use redundancy and LLVM ABI comment as the only changes I'd call required before merge. The integration test gap and out-of-range osc.zero-pad behavior are worth tracking but not blockers.

## Code Review — PR #437: cp6 (Byte-clean strings + binary protocol builtins) ### Overview This PR makes Seq strings genuinely byte-clean by changing `StringLiteral(String)` → `StringLiteral(Vec<u8>)` throughout the compiler, fixing `\xNN` escapes to produce literal bytes (not UTF-8-encoded codepoints), switching the codegen target for string literals from `patch_seq_push_string` (C-string, NUL-terminates) to a new `patch_seq_push_string_bytes` (explicit length), and adding two new builtins (`int.to-bytes-i32-be`, `float.to-bytes-f32-be`) along with a working OSC 1.0 encoder as a proof-of-concept. This is a clean, well-motivated refactoring. The core change is correct, the diff is mechanically thorough, and the new tests are well-chosen. --- ### Strengths - **Complete migration**: Every `StringLiteral(String)` call site was updated; the compiler would have caught any missed ones, but it's good that none slipped through. - **LLVM IR correctness fix**: Encoding backslash as `\5C` instead of `\\` is technically more correct per LLVM IR spec, and the comment explaining why is exactly the right level of detail. - **Good test layering**: `escape_llvm_string` unit tests, `bytes_ops` unit tests, updated parser tests, and end-to-end OSC byte-format tests together give strong coverage of the new behaviour. - **Null-safe `push_string_bytes`**: The `if len == 0 { Vec::new() }` guard correctly handles zero-length with a potentially-null pointer. - **`test_hex_escape_full_byte_range`**: Nice sanity test that covers the full `0x00`–`0xFF` range explicitly. --- ### Issues and Suggestions #### 1. Debug-only assertion in `patch_seq_push_string` (io.rs) ```rust debug_assert!( std::str::from_utf8(bytes).is_ok(), "push_string: input must be valid UTF-8 ...", ); ``` In release builds this silently accepts non-UTF-8 bytes, which could cause silent data corruption if a future codegen path incorrectly routes binary data here. The comment says "the comment above is the contract" — but that relies on the codegen never making a mistake. Consider using `#[cfg(debug_assertions)]` with an actual panic path (not just assert), or noting in the function's `# Safety` doc that callers must guarantee UTF-8 and any violation is UB/undefined behaviour from the ABI contract perspective. #### 2. Redundant `pub use` aliases at the bottom of `bytes_ops.rs` ```rust pub use patch_seq_float_to_bytes_f32_be as float_to_bytes_f32_be; pub use patch_seq_int_to_bytes_i32_be as int_to_bytes_i32_be; ``` `lib.rs` already re-exports these directly from their full names (`patch_seq_float_to_bytes_f32_be`). These aliases inside the module are unused externally and don't match the pattern of other runtime modules (e.g., `float_ops`, `string_ops` don't have internal re-export aliases). They add noise without benefit — remove them or move the alias responsibility entirely to `lib.rs`. #### 3. LLVM ABI: `i64` vs `isize` for `len` in `push_string_bytes` ``` declare ptr @patch_seq_push_string_bytes(ptr, ptr, i64) ``` The Rust signature uses `usize`. On 64-bit (the only real target today) this is fine, but the hardcoded `i64` in the LLVM declaration will silently misbehave on 32-bit systems where `usize` = `i32`. If 32-bit is a non-goal, a single comment to that effect would head off future confusion. If 32-bit ever needs to work, this needs a target-width check. #### 4. No integration test for `\xNN` high-byte round-trip through the full pipeline The parser tests verify the `Vec<u8>` payload; the codegen tests verify LLVM IR emission; the `bytes_ops` unit tests verify the runtime functions. But there's no integration test that compiles and runs a Seq program containing `"\xDC\xFF"` and verifies the bytes arrive intact at the runtime (e.g. via `string.byte-length` or comparing against a known pattern). The OSC tests cover this indirectly, but a targeted integration test makes the guarantee explicit and will catch regressions in the codegen path independently of OSC semantics. #### 5. `osc.zero-pad` — silent fallback for out-of-range inputs ```seq : osc.zero-pad ( Int -- String ) dup 1 = [ drop "\x00" ] [ dup 2 = [ drop "\x00\x00" ] [ dup 3 = [ drop "\x00\x00\x00" ] [ drop "\x00\x00\x00\x00" ] if ] if ] if ; ``` The comment says "passing values outside 1..4 silently produces 4 NULs." For any caller other than `osc.string` this is a silent wrong answer, not a helpful error. Since the current only caller is `osc.string` (and the formula is statically bounded to 1..4), this isn't a runtime risk today — but if `osc.zero-pad` is ever used directly by user code, the failure mode is non-obvious. Consider an assertion/panic for the out-of-range case, or rename it to make the contract more obvious (e.g., `osc.zero-pad-1-to-4`). #### 6. OSC encoder: `seq:allow(unchecked-mod)` placement ```seq # The Bool from `i.%` is dropped intentionally — the divisor is the # literal `4`, so divide-by-zero is statically impossible. # seq:allow(unchecked-mod) : osc.string ( String -- String ) ``` The annotation is placed inside the doc comment block, which is the correct Seq convention, but it's worth double-checking that `seq:allow` placed on the `word definition` suppresses the warning for the `drop` on the Bool result (as opposed to the division itself). If the linter attaches to the `drop` call-site rather than the word, the annotation may not fire. #### 7. Little-endian variants OSC mandates big-endian, so the current scope is correct. But if `bytes_ops` is intended as the start of a general binary-encoding module (as the module-level doc implies: "OSC, DNS, NTP, MessagePack, Protobuf, etc."), little-endian variants will be needed shortly — MessagePack uses big-endian, but DNS, NTP, and most "native" protocols vary. No action needed now, just worth a TODO note in the module doc or a tracked issue so this doesn't get forgotten. --- ### Minor Nits - The `// ===...` banner comment in `io.rs` for the two-line `push_string_bytes` section is heavier than necessary given the function-level doc is already clear. - In `bytes_ops/tests.rs`, `pop_string_bytes` calls `s.as_bytes().to_vec()` — since `SeqString` is already byte-backed, check whether `as_bytes()` copies unnecessarily. A borrow might suffice. - The OSC test comment `# Int32 + float32: `,if`. The Csound POC's bread-and-butter: trigger` has a stray backtick that breaks the inline formatting. --- ### Test verification Manually verified the OSC test vectors: - `/synth/play` (11 bytes) → pad by 1 → 12 bytes ✓ - `,if` (3 bytes) → pad by 1 → 4 bytes ✓ - `1000` → `0x000003E8` ✓ - `220.0` → f32 `0x435C0000` ✓ - Total 24 bytes matches `test-osc-msg-if-length-aligned` ✓ All other test vectors check out. --- **Overall**: Approve with the `pub use` redundancy and LLVM ABI comment as the only changes I'd call required before merge. The integration test gap and out-of-range `osc.zero-pad` behavior are worth tracking but not blockers.
Sign in to join this conversation.
No description provided.