just ci now runs: Seq tests → build → lisp tests (489) → reader tests… #65
No reviewers
Labels
No labels
bug
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
seq-lisp-2-backport
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
navicore/seq-lisp!65
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "self-hosting-tokenizer"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
… (32) → LSP tests (8). The reader-test recipe will grow naturally as we
add parser tests in Phase 2.
PR Review: Self-hosted Tokenizer + Reader Tests
Nice addition! This is a solid Phase 2 foundation. The code is well-structured, clearly commented, and the test suite gives good coverage. A few observations below.
Potential Issues
1. Duplicate accessor definitions in test file
tests/lisp/reader/tokenizer.slispredefinestoken-text,token-start-line, andtoken-start-col— all three already defined inlib/reader/tokenizer.slisp. Since the files are concatenated before evaluation, the test-file definitions silently shadow the library ones. In practice they are identical so no test breaks today, but if the library accessors ever change the test-file copies will drift silently. Consider removing the duplicates from the test file and relying on the library exports.2. String token text stores processed (unescaped) content
scan-stringaccumulates unescaped characters, so the token text for"a\nb"contains a literal newline rather than the two-character source sequence\n. This means:end-colmay be incorrect for strings with escape sequences that change logical vs. source length.If downstream phases (parser, LSP hover, error reporting) need source-faithful text, storing the raw source slice is usually safer. Worth a deliberate design note either way.
3. Unterminated string closes itself silently
The recovery path in
scan-stringappends a synthetic closing quote, so the resulting token looks like a well-formed string. A downstream parser has no way to know the input was malformed. Consider emitting a distinct error token type so the parser can surface a useful diagnostic.Performance Observations
4. O(n^2) token list construction via
appendEach
state-emitcall appends to the full accumulated token list. For n tokens this is O(n^2). Standard fix:constokens onto the front, thenreverseonce inside the publictokenizefunction.5. O(n^2) string accumulation in
scan-atom/scan-string(string-append accum ch)on every character is quadratic for long atoms/strings. Negligible for typical source files, but worth noting if there is a known input-size ceiling.Minor / Nits
6. Missing tests for
token-end-line/token-end-colaccessorsThe span tests only exercise start positions. A couple of end-span assertions would give the accessor wiring better coverage.
7. No test for backslash at end of input inside a string
The escape-at-EOF branch exists in
scan-stringbut has no corresponding test — easy to regress in future refactors.8. Unquoted trap variable in
reader-testrecipetrap "rm -f $tmp" EXITleaves$tmpunquoted. Fine in practice sincemktempnever produces spaces, but quoting it is safer and more consistent withset -euo pipefailstrictness.What is Working Well
char-at,whitespace?,delimiter?) are well-factored.tokenize-loophandles all reader macro prefixes correctly, including the peek-ahead for,@.lisp-testpattern.Follow-up Review: Address Verification + New Findings
All five items from the first pass have been addressed. A couple of new observations below.
Confirmed Fixed
New Finding: string end-col is off by one (lib/reader/tokenizer.slisp line 127)
The codebase uses an exclusive end-column convention (one past the last character). For example,
(at col 1 gives end-col = 2, andfooat col 1 gives end-col = 4.scan-stringbreaks this: when it hits the closing quote,stis still positioned AT the quote, so(state-col st)is the column of the closing quote itself, not one past it (line 127). For a 5-char string like"foo"starting at col 1: atoms give end-col = start + length, strings give end-col = start + length - 1.The same off-by-one exists in the unterminated-string path at line 120.
The end-span tests do not cover strings, which is why this slipped through. Fix: change both sites to
(+ (state-col st) 1)and add an end-span test for a string token.Minor: escape-at-EOF recovery text (lib/reader/tokenizer.slisp line 132)
For a string ending with a lone backslash, the recovery path at line 132 appends a backslash plus a synthetic closing quote to accum. The closing quote is intentional (matching the unterminated-string convention), but it is worth verifying the backslash prefix is intended rather than just appending a closing quote. The existing backslash-at-EOF test only checks token-count = 1, not the token text. Low impact since this only fires on malformed input, but worth noting for whoever implements the error-token path later.
Deferred Items - Rationale Accepted
The tokenizer logic is solid and the test suite is well-organized. The end-col inconsistency is the only thing worth fixing before span data is consumed downstream.