just ci now runs: Seq tests → build → lisp tests (489) → reader tests… #65

Merged
navicore merged 2 commits from self-hosting-tokenizer into main 2026-03-27 17:44:54 +00:00
navicore commented 2026-03-27 17:13:48 +00:00 (Migrated from github.com)

… (32) → LSP tests (8). The reader-test recipe will grow naturally as we

add parser tests in Phase 2.

… (32) → LSP tests (8). The reader-test recipe will grow naturally as we add parser tests in Phase 2.
claude[bot] commented 2026-03-27 17:21:47 +00:00 (Migrated from github.com)

PR Review: Self-hosted Tokenizer + Reader Tests

Nice addition! This is a solid Phase 2 foundation. The code is well-structured, clearly commented, and the test suite gives good coverage. A few observations below.


Potential Issues

1. Duplicate accessor definitions in test file

tests/lisp/reader/tokenizer.slisp redefines token-text, token-start-line, and token-start-col — all three already defined in lib/reader/tokenizer.slisp. Since the files are concatenated before evaluation, the test-file definitions silently shadow the library ones. In practice they are identical so no test breaks today, but if the library accessors ever change the test-file copies will drift silently. Consider removing the duplicates from the test file and relying on the library exports.

2. String token text stores processed (unescaped) content

scan-string accumulates unescaped characters, so the token text for "a\nb" contains a literal newline rather than the two-character source sequence \n. This means:

  • Token text cannot reconstruct the original source.
  • end-col may be incorrect for strings with escape sequences that change logical vs. source length.

If downstream phases (parser, LSP hover, error reporting) need source-faithful text, storing the raw source slice is usually safer. Worth a deliberate design note either way.

3. Unterminated string closes itself silently

The recovery path in scan-string appends a synthetic closing quote, so the resulting token looks like a well-formed string. A downstream parser has no way to know the input was malformed. Consider emitting a distinct error token type so the parser can surface a useful diagnostic.


Performance Observations

4. O(n^2) token list construction via append

Each state-emit call appends to the full accumulated token list. For n tokens this is O(n^2). Standard fix: cons tokens onto the front, then reverse once inside the public tokenize function.

5. O(n^2) string accumulation in scan-atom / scan-string

(string-append accum ch) on every character is quadratic for long atoms/strings. Negligible for typical source files, but worth noting if there is a known input-size ceiling.


Minor / Nits

6. Missing tests for token-end-line / token-end-col accessors

The span tests only exercise start positions. A couple of end-span assertions would give the accessor wiring better coverage.

7. No test for backslash at end of input inside a string

The escape-at-EOF branch exists in scan-string but has no corresponding test — easy to regress in future refactors.

8. Unquoted trap variable in reader-test recipe

trap "rm -f $tmp" EXIT leaves $tmp unquoted. Fine in practice since mktemp never produces spaces, but quoting it is safer and more consistent with set -euo pipefail strictness.


What is Working Well

  • State-threading approach is clean and makes the pure-functional style legible.
  • Character-level helpers (char-at, whitespace?, delimiter?) are well-factored.
  • tokenize-loop handles all reader macro prefixes correctly, including the peek-ahead for ,@.
  • 32 test cases covering atoms, parens, quotes, strings, comments, whitespace, and spans is a great starting point.
  • CI integration is clean and composes naturally with the existing lisp-test pattern.
## PR Review: Self-hosted Tokenizer + Reader Tests Nice addition! This is a solid Phase 2 foundation. The code is well-structured, clearly commented, and the test suite gives good coverage. A few observations below. --- ### Potential Issues **1. Duplicate accessor definitions in test file** `tests/lisp/reader/tokenizer.slisp` redefines `token-text`, `token-start-line`, and `token-start-col` — all three already defined in `lib/reader/tokenizer.slisp`. Since the files are concatenated before evaluation, the test-file definitions silently shadow the library ones. In practice they are identical so no test breaks today, but if the library accessors ever change the test-file copies will drift silently. Consider removing the duplicates from the test file and relying on the library exports. **2. String token text stores processed (unescaped) content** `scan-string` accumulates unescaped characters, so the token text for `"a\nb"` contains a literal newline rather than the two-character source sequence `\n`. This means: - Token text cannot reconstruct the original source. - `end-col` may be incorrect for strings with escape sequences that change logical vs. source length. If downstream phases (parser, LSP hover, error reporting) need source-faithful text, storing the raw source slice is usually safer. Worth a deliberate design note either way. **3. Unterminated string closes itself silently** The recovery path in `scan-string` appends a synthetic closing quote, so the resulting token looks like a well-formed string. A downstream parser has no way to know the input was malformed. Consider emitting a distinct error token type so the parser can surface a useful diagnostic. --- ### Performance Observations **4. O(n^2) token list construction via `append`** Each `state-emit` call appends to the full accumulated token list. For n tokens this is O(n^2). Standard fix: `cons` tokens onto the front, then `reverse` once inside the public `tokenize` function. **5. O(n^2) string accumulation in `scan-atom` / `scan-string`** `(string-append accum ch)` on every character is quadratic for long atoms/strings. Negligible for typical source files, but worth noting if there is a known input-size ceiling. --- ### Minor / Nits **6. Missing tests for `token-end-line` / `token-end-col` accessors** The span tests only exercise start positions. A couple of end-span assertions would give the accessor wiring better coverage. **7. No test for backslash at end of input inside a string** The escape-at-EOF branch exists in `scan-string` but has no corresponding test — easy to regress in future refactors. **8. Unquoted trap variable in `reader-test` recipe** `trap "rm -f $tmp" EXIT` leaves `$tmp` unquoted. Fine in practice since `mktemp` never produces spaces, but quoting it is safer and more consistent with `set -euo pipefail` strictness. --- ### What is Working Well - State-threading approach is clean and makes the pure-functional style legible. - Character-level helpers (`char-at`, `whitespace?`, `delimiter?`) are well-factored. - `tokenize-loop` handles all reader macro prefixes correctly, including the peek-ahead for `,@`. - 32 test cases covering atoms, parens, quotes, strings, comments, whitespace, and spans is a great starting point. - CI integration is clean and composes naturally with the existing `lisp-test` pattern.
claude[bot] commented 2026-03-27 17:46:17 +00:00 (Migrated from github.com)

Follow-up Review: Address Verification + New Findings

All five items from the first pass have been addressed. A couple of new observations below.

Confirmed Fixed

  • 1: Duplicate accessor definitions in test file - Removed
  • 4: O(n^2) append in state-emit - Fixed: now cons + single reverse in tokenize
  • 6: Missing token-end-line / token-end-col tests - 3 new end-span tests added
  • 7: No backslash-at-EOF test - Test added
  • 8: Unquoted $tmp in trap - Quoted

New Finding: string end-col is off by one (lib/reader/tokenizer.slisp line 127)

The codebase uses an exclusive end-column convention (one past the last character). For example, ( at col 1 gives end-col = 2, and foo at col 1 gives end-col = 4. scan-string breaks this: when it hits the closing quote, st is still positioned AT the quote, so (state-col st) is the column of the closing quote itself, not one past it (line 127). For a 5-char string like "foo" starting at col 1: atoms give end-col = start + length, strings give end-col = start + length - 1.

The same off-by-one exists in the unterminated-string path at line 120.

The end-span tests do not cover strings, which is why this slipped through. Fix: change both sites to (+ (state-col st) 1) and add an end-span test for a string token.

Minor: escape-at-EOF recovery text (lib/reader/tokenizer.slisp line 132)

For a string ending with a lone backslash, the recovery path at line 132 appends a backslash plus a synthetic closing quote to accum. The closing quote is intentional (matching the unterminated-string convention), but it is worth verifying the backslash prefix is intended rather than just appending a closing quote. The existing backslash-at-EOF test only checks token-count = 1, not the token text. Low impact since this only fires on malformed input, but worth noting for whoever implements the error-token path later.

Deferred Items - Rationale Accepted

  • #2 raw vs unescaped string content: Matching the Seq tokenizer is the right call for now.
  • #3 unterminated string error token: Fine to defer until the parser needs diagnostics.
  • #5 O(n^2) string accumulation: Negligible for realistic source sizes.

The tokenizer logic is solid and the test suite is well-organized. The end-col inconsistency is the only thing worth fixing before span data is consumed downstream.

Follow-up Review: Address Verification + New Findings All five items from the first pass have been addressed. A couple of new observations below. **Confirmed Fixed** - 1: Duplicate accessor definitions in test file - Removed - 4: O(n^2) append in state-emit - Fixed: now cons + single reverse in tokenize - 6: Missing token-end-line / token-end-col tests - 3 new end-span tests added - 7: No backslash-at-EOF test - Test added - 8: Unquoted $tmp in trap - Quoted **New Finding: string end-col is off by one (lib/reader/tokenizer.slisp line 127)** The codebase uses an exclusive end-column convention (one past the last character). For example, `(` at col 1 gives end-col = 2, and `foo` at col 1 gives end-col = 4. `scan-string` breaks this: when it hits the closing quote, `st` is still positioned AT the quote, so `(state-col st)` is the column of the closing quote itself, not one past it (line 127). For a 5-char string like `"foo"` starting at col 1: atoms give end-col = start + length, strings give end-col = start + length - 1. The same off-by-one exists in the unterminated-string path at line 120. The end-span tests do not cover strings, which is why this slipped through. Fix: change both sites to `(+ (state-col st) 1)` and add an end-span test for a string token. **Minor: escape-at-EOF recovery text (lib/reader/tokenizer.slisp line 132)** For a string ending with a lone backslash, the recovery path at line 132 appends a backslash plus a synthetic closing quote to accum. The closing quote is intentional (matching the unterminated-string convention), but it is worth verifying the backslash prefix is intended rather than just appending a closing quote. The existing backslash-at-EOF test only checks token-count = 1, not the token text. Low impact since this only fires on malformed input, but worth noting for whoever implements the error-token path later. **Deferred Items - Rationale Accepted** - #2 raw vs unescaped string content: Matching the Seq tokenizer is the right call for now. - #3 unterminated string error token: Fine to defer until the parser needs diagnostics. - #5 O(n^2) string accumulation: Negligible for realistic source sizes. The tokenizer logic is solid and the test suite is well-organized. The end-col inconsistency is the only thing worth fixing before span data is consumed downstream.
Sign in to join this conversation.
No description provided.