Phase 2 delivered: parser #66

Merged
navicore merged 3 commits from self-hosting-parser into main 2026-03-27 22:56:04 +00:00
navicore commented 2026-03-27 22:15:00 +00:00 (Migrated from github.com)
  • lib/reader/parser.slisp — self-hosted recursive descent parser (~160 lines)
  • tests/lisp/reader/parser.slisp — 36 tests covering atoms, lists, quote sugar, dot notation, multi-expression parsing, edge cases
  • Handles: integers, floats, booleans, strings, symbols, nested lists, '/`/,/,@ sugar, dot notation, comments
  • Uses higher-order function (parser parameter) to break mutual recursion without forward references
  • Uses eval-with-errors for string->symbol and string->float conversion (these would be good candidates for native builtins later)
  • Checkpoint met: (parse "(if #t 1 2)") returns (if #t 1 2) as data
- lib/reader/parser.slisp — self-hosted recursive descent parser (~160 lines) - tests/lisp/reader/parser.slisp — 36 tests covering atoms, lists, quote sugar, dot notation, multi-expression parsing, edge cases - Handles: integers, floats, booleans, strings, symbols, nested lists, '/`/,/,@ sugar, dot notation, comments - Uses higher-order function (parser parameter) to break mutual recursion without forward references - Uses eval-with-errors for string->symbol and string->float conversion (these would be good candidates for native builtins later) - Checkpoint met: (parse "(if #t 1 2)") returns (if #t 1 2) as data
claude[bot] commented 2026-03-27 22:20:37 +00:00 (Migrated from github.com)

PR Review: Phase 2 - Self-Hosted Parser

Nice work - a clean recursive descent parser in ~200 lines with solid test coverage. The higher-order function trick to break mutual recursion between parse-expr and parse-list-items is elegant. A few things worth flagging:

strip-quotes - verify substring semantics are clear

The function is: (define (strip-quotes text) (substring text 1 (- (string-length text) 2))). substring takes (str start length). For token hello (7 chars with surrounding quotes): (substring text 1 5) gives hello. Correct. But easy to misread - a comment clarifying that the tokenizer always produces token text starting and ending with a double-quote would help future readers.

Unmatched closing paren is silently consumed as an atom

In parse-expr, a closing paren token appearing outside a list context falls through to the atom branch and becomes the symbol ) via string->symbol. No error is raised. Similarly, an unclosed opening paren in parse-list-items silently returns whatever was accumulated when tokens run out. Acceptable for Phase 2, but worth documenting as a known gap for Phase 3 / error recovery.

eval-with-errors creates an evaluator dependency

The PR description acknowledges this already. Two specific concerns: (1) Fallback type mismatch: If string->symbol fails, it returns name as a string instead of a symbol. Any code expecting a symbol from text-to-atom would get a string, causing silent type errors downstream. (2) Ordering dependency: The parser can only be loaded after the evaluator is initialized - worth a comment since it constrains bootstrapping order.

string-to-integer - minus sign handling assumes pos=0

The minus branch resets acc to 0 and negates the result. Works correctly when called as (string-to-integer text 0 0) since minus only appears at position 0 for valid number tokens. A comment noting this assumption would prevent misuse if the function is reused later.

Test gaps

A few cases not covered: (1) Negative floats: (parse -3.14) - exercises number-token? and string->float-val on a sign-prefixed string. (2) Strings with escape sequences - verifies strip-quotes works correctly after the tokenizer resolves escapes. (3) parse-all with empty input: (parse-all ) - edge case for the parse-all-exprs accumulator path. (4) Resilience to unclosed paren: A test documenting current behavior (partial result, no crash) would catch regressions when error handling is added.

Minor: digit-char? and digit-value duplication

Both implement 0-9 membership via nested if chains. Noted in the PR as a limitation without a native char->int builtin - the two are consistent. Worth combining once that builtin lands.

Overall

Architecture is sound, happy-path test coverage is comprehensive, and the HOF pattern for mutual recursion is clean. The main things to address before production use: silent error swallowing (unmatched tokens become atoms, unclosed lists return partial results silently) and the string->symbol fallback returning a string where a symbol is expected.

## PR Review: Phase 2 - Self-Hosted Parser Nice work - a clean recursive descent parser in ~200 lines with solid test coverage. The higher-order function trick to break mutual recursion between parse-expr and parse-list-items is elegant. A few things worth flagging: **strip-quotes - verify substring semantics are clear** The function is: (define (strip-quotes text) (substring text 1 (- (string-length text) 2))). substring takes (str start length). For token hello (7 chars with surrounding quotes): (substring text 1 5) gives hello. Correct. But easy to misread - a comment clarifying that the tokenizer always produces token text starting and ending with a double-quote would help future readers. **Unmatched closing paren is silently consumed as an atom** In parse-expr, a closing paren token appearing outside a list context falls through to the atom branch and becomes the symbol ) via string->symbol. No error is raised. Similarly, an unclosed opening paren in parse-list-items silently returns whatever was accumulated when tokens run out. Acceptable for Phase 2, but worth documenting as a known gap for Phase 3 / error recovery. **eval-with-errors creates an evaluator dependency** The PR description acknowledges this already. Two specific concerns: (1) Fallback type mismatch: If string->symbol fails, it returns name as a string instead of a symbol. Any code expecting a symbol from text-to-atom would get a string, causing silent type errors downstream. (2) Ordering dependency: The parser can only be loaded after the evaluator is initialized - worth a comment since it constrains bootstrapping order. **string-to-integer - minus sign handling assumes pos=0** The minus branch resets acc to 0 and negates the result. Works correctly when called as (string-to-integer text 0 0) since minus only appears at position 0 for valid number tokens. A comment noting this assumption would prevent misuse if the function is reused later. **Test gaps** A few cases not covered: (1) Negative floats: (parse -3.14) - exercises number-token? and string->float-val on a sign-prefixed string. (2) Strings with escape sequences - verifies strip-quotes works correctly after the tokenizer resolves escapes. (3) parse-all with empty input: (parse-all ) - edge case for the parse-all-exprs accumulator path. (4) Resilience to unclosed paren: A test documenting current behavior (partial result, no crash) would catch regressions when error handling is added. **Minor: digit-char? and digit-value duplication** Both implement 0-9 membership via nested if chains. Noted in the PR as a limitation without a native char->int builtin - the two are consistent. Worth combining once that builtin lands. **Overall** Architecture is sound, happy-path test coverage is comprehensive, and the HOF pattern for mutual recursion is clean. The main things to address before production use: silent error swallowing (unmatched tokens become atoms, unclosed lists return partial results silently) and the string->symbol fallback returning a string where a symbol is expected.
claude[bot] commented 2026-03-27 22:35:27 +00:00 (Migrated from github.com)

Follow-up: Corrections to the previous review

The prior review's "Test gaps" section contains inaccuracies. The following tests it flagged as missing are already present in tests/lisp/reader/parser.slisp:

  • Negative floats: (test "parse negative float" (assert-eq (parse "-3.14") -3.14))
  • Strings with escape sequences: (test "parse string with escape" ...)
  • parse-all with empty input: (test "parse-all empty" (assert-eq (parse-all "") '()))
  • Resilience to unclosed paren: (test "unclosed paren returns partial" (assert-eq (parse "(a b") '(a b)))

Test coverage is actually quite good. Two genuine issues not previously called out:

Dead code: string-starts-with? is defined but never used

Defined in parser.slisp but never called. Was probably a stepping stone during development. Safe to remove.

"extra close paren" test has a likely wrong expected value

The test named "extra close paren parsed as symbol" expects (list) which equals '() (empty list). But tracing the code: parse-expr hits the atom branch, calls text-to-atom ")", which calls string->symbol ")", which calls eval-with-errors on the string "')". Since ') is invalid syntax, eval-with-errors returns an error result and string->symbol falls back to returning the string ")". So parse of ")" would return the string ")", not '(). The test name, expected value, or string->symbol fallback behavior needs clarification.

## Follow-up: Corrections to the previous review The prior review's **"Test gaps"** section contains inaccuracies. The following tests it flagged as missing are already present in `tests/lisp/reader/parser.slisp`: - Negative floats: `(test "parse negative float" (assert-eq (parse "-3.14") -3.14))` - Strings with escape sequences: `(test "parse string with escape" ...)` - parse-all with empty input: `(test "parse-all empty" (assert-eq (parse-all "") '()))` - Resilience to unclosed paren: `(test "unclosed paren returns partial" (assert-eq (parse "(a b") '(a b)))` Test coverage is actually quite good. Two genuine issues not previously called out: **Dead code: `string-starts-with?` is defined but never used** Defined in parser.slisp but never called. Was probably a stepping stone during development. Safe to remove. **"extra close paren" test has a likely wrong expected value** The test named "extra close paren parsed as symbol" expects `(list)` which equals `'()` (empty list). But tracing the code: `parse-expr` hits the atom branch, calls `text-to-atom ")"`, which calls `string->symbol ")"`, which calls `eval-with-errors` on the string `"')"`. Since `')` is invalid syntax, `eval-with-errors` returns an error result and `string->symbol` falls back to returning the string `")"`. So `parse` of `")"` would return the string `")"`, not `'()`. The test name, expected value, or `string->symbol` fallback behavior needs clarification.
claude[bot] commented 2026-03-27 22:50:59 +00:00 (Migrated from github.com)

Code Review - Phase 2: Self-Hosted Parser

Overall this is solid, well-structured work. The recursive descent approach is clean, the higher-order function trick to break mutual recursion is elegant, and the test suite is comprehensive. A few things worth looking at.

Potential Bug 1: strip-quotes off-by-one risk

The strip-quotes function uses: (substring text 1 (- (string-length text) 2)). This is correct if substring takes (str start length), which appears to be the convention based on char-at in the tokenizer. But the semantics are non-obvious - a future reader or a refactor to an end-based substring could silently introduce an off-by-one. A brief inline comment noting the calling convention would help.

Potential Bug 2: Test expectation mismatch on extra close paren

The test asserts: (assert-eq (parse close-paren) empty-list). Tracing through: close-paren is not open-paren, not a quote/quasiquote/unquote token, so it falls to the atom branch. text-to-atom calls string->symbol which calls eval-with-errors with quote+close-paren. Evaluating that in the host is malformed (quote with no argument), so string->symbol falls back and returns the string close-paren. The actual return would likely be the string, not the empty list. This test may be asserting the wrong expected value and masking a latent issue.

Potential Bug 3: string->symbol and string->float-val silent fallback to string

Both functions fall back to returning a raw string when eval-with-errors fails. If eval-with-errors fails for any identifier (e.g. a symbol name that is reserved or malformed in the host), downstream consumers expecting a symbol silently get a string instead. This could produce confusing behavior that is hard to debug. At minimum, a comment noting the known failure modes would help.

Design Consideration: Dot Notation

The choice to treat dot as a literal symbol rather than construct a proper cons pair is documented and intentional, and the tests reflect it. Worth noting this means the self-hosted parser cannot round-trip code that relies on proper pair structure. Documenting this as a known compatibility gap would be useful.

Minor notes

float-token? will return #t for strings like 1.2.3 (passes number-token? and contains a dot). string->float-val would then fail and silently return the text as a string. Low risk given the tokenizer produces well-formed tokens, but worth noting. The digit-char? / digit-value nested-if chains are verbose but appropriate given the absence of cond at this stage. The PR description acknowledging these as future builtin candidates is appreciated.

Test Coverage

36 tests is solid for a first pass. A few gaps worth filling eventually: symbols with special characters like !, ->, set! as bare atoms (not just in list position); float variants: 0.5, -0.5, 1.0; strings with multiple escape sequences in sequence; parse-all with mixed multi-level input.

Summary

The architecture is clean and consistent with the tokenizer. The higher-order parser parameter to break the parse-expr/parse-list-items mutual recursion without forward references is the highlight of this PR. The main actionable items are verifying the close-paren edge case test expectation and documenting the string->symbol fallback behavior.

## Code Review - Phase 2: Self-Hosted Parser Overall this is solid, well-structured work. The recursive descent approach is clean, the higher-order function trick to break mutual recursion is elegant, and the test suite is comprehensive. A few things worth looking at. **Potential Bug 1: strip-quotes off-by-one risk** The strip-quotes function uses: (substring text 1 (- (string-length text) 2)). This is correct if substring takes (str start length), which appears to be the convention based on char-at in the tokenizer. But the semantics are non-obvious - a future reader or a refactor to an end-based substring could silently introduce an off-by-one. A brief inline comment noting the calling convention would help. **Potential Bug 2: Test expectation mismatch on extra close paren** The test asserts: (assert-eq (parse close-paren) empty-list). Tracing through: close-paren is not open-paren, not a quote/quasiquote/unquote token, so it falls to the atom branch. text-to-atom calls string->symbol which calls eval-with-errors with quote+close-paren. Evaluating that in the host is malformed (quote with no argument), so string->symbol falls back and returns the string close-paren. The actual return would likely be the string, not the empty list. This test may be asserting the wrong expected value and masking a latent issue. **Potential Bug 3: string->symbol and string->float-val silent fallback to string** Both functions fall back to returning a raw string when eval-with-errors fails. If eval-with-errors fails for any identifier (e.g. a symbol name that is reserved or malformed in the host), downstream consumers expecting a symbol silently get a string instead. This could produce confusing behavior that is hard to debug. At minimum, a comment noting the known failure modes would help. **Design Consideration: Dot Notation** The choice to treat dot as a literal symbol rather than construct a proper cons pair is documented and intentional, and the tests reflect it. Worth noting this means the self-hosted parser cannot round-trip code that relies on proper pair structure. Documenting this as a known compatibility gap would be useful. **Minor notes** float-token? will return #t for strings like 1.2.3 (passes number-token? and contains a dot). string->float-val would then fail and silently return the text as a string. Low risk given the tokenizer produces well-formed tokens, but worth noting. The digit-char? / digit-value nested-if chains are verbose but appropriate given the absence of cond at this stage. The PR description acknowledging these as future builtin candidates is appreciated. **Test Coverage** 36 tests is solid for a first pass. A few gaps worth filling eventually: symbols with special characters like !, ->, set! as bare atoms (not just in list position); float variants: 0.5, -0.5, 1.0; strings with multiple escape sequences in sequence; parse-all with mixed multi-level input. **Summary** The architecture is clean and consistent with the tokenizer. The higher-order parser parameter to break the parse-expr/parse-list-items mutual recursion without forward references is the highlight of this PR. The main actionable items are verifying the close-paren edge case test expectation and documenting the string->symbol fallback behavior.
Sign in to join this conversation.
No description provided.