Symbol interning optimization #166

Closed
opened 2026-01-03 04:11:18 +00:00 by navicore · 1 comment
navicore commented 2026-01-03 04:11:18 +00:00 (Migrated from github.com)

Context

Issue #148 implemented symbol syntax (:foo). Currently, symbols are stored as regular SeqString values - each symbol literal allocates new memory.

Current Behavior

:foo :foo   \ Two separate allocations

Each :foo creates a new SeqString, so:

  • Equality comparison is O(n) string comparison
  • Memory usage grows with each symbol use
  • No sharing between identical symbols

Desired Behavior

Implement compile-time symbol interning:

  • Symbol literals with the same name share a single allocation
  • Equality could be O(1) pointer comparison
  • Lower memory usage

Design Constraints (from user)

We'll want to design it carefully to make sure it is compile-time only sharing with no runtime locks or copying.

This means:

  • Interning table built at compile time, not runtime
  • No global HashMap with locks for runtime symbol creation
  • string->symbol may not participate in interning (or uses lock-free approach)

Implementation Ideas

  1. Compile-time string deduplication: Compiler emits each unique symbol literal once as a global constant. All uses reference the same constant.

  2. Static interning table: Generate a perfect hash or sorted array of all symbol literals at compile time.

  3. Runtime symbols stay uninterned: string->symbol creates uninterned symbols (like Clojure's symbol vs :keyword).

Pattern Matching Performance Regression (PR #168)

Issue #149 changed variant tags from u32 to SeqString. This caused a runtime performance regression in pattern matching:

Aspect Before After
Match dispatch O(1) LLVM switch O(n) cascading if-else
Tag comparison Integer equality String comparison

Impact: Match expressions with N arms now require up to N-1 string comparisons in worst case.

Symbol interning will fix this: With interned symbols, pattern matching can use O(1) pointer equality instead of O(n) string comparison. The codegen could potentially be optimized back to LLVM switch on the interned pointer value.

Non-blocking

This is a performance optimization. The current implementation is semantically correct and sufficient for SON.

  • #148 - Symbol syntax (implemented)
  • #149 - Dynamic variant construction (implemented, introduced pattern match regression)
  • PR #168 - Implementation PR for #149
  • SON implementation roadmap
## Context Issue #148 implemented symbol syntax (`:foo`). Currently, symbols are stored as regular `SeqString` values - each symbol literal allocates new memory. ## Current Behavior ```seq :foo :foo \ Two separate allocations ``` Each `:foo` creates a new `SeqString`, so: - Equality comparison is O(n) string comparison - Memory usage grows with each symbol use - No sharing between identical symbols ## Desired Behavior Implement **compile-time symbol interning**: - Symbol literals with the same name share a single allocation - Equality could be O(1) pointer comparison - Lower memory usage ## Design Constraints (from user) > We'll want to design it carefully to make sure it is **compile-time only sharing** with **no runtime locks or copying**. This means: - Interning table built at compile time, not runtime - No global `HashMap` with locks for runtime symbol creation - `string->symbol` may not participate in interning (or uses lock-free approach) ## Implementation Ideas 1. **Compile-time string deduplication**: Compiler emits each unique symbol literal once as a global constant. All uses reference the same constant. 2. **Static interning table**: Generate a perfect hash or sorted array of all symbol literals at compile time. 3. **Runtime symbols stay uninterned**: `string->symbol` creates uninterned symbols (like Clojure's `symbol` vs `:keyword`). ## Pattern Matching Performance Regression (PR #168) Issue #149 changed variant tags from `u32` to `SeqString`. This caused a **runtime performance regression** in pattern matching: | Aspect | Before | After | |--------|--------|-------| | Match dispatch | O(1) LLVM switch | O(n) cascading if-else | | Tag comparison | Integer equality | String comparison | **Impact:** Match expressions with N arms now require up to N-1 string comparisons in worst case. **Symbol interning will fix this:** With interned symbols, pattern matching can use O(1) pointer equality instead of O(n) string comparison. The codegen could potentially be optimized back to LLVM switch on the interned pointer value. ## Non-blocking This is a performance optimization. The current implementation is semantically correct and sufficient for SON. ## Related - #148 - Symbol syntax (implemented) - #149 - Dynamic variant construction (implemented, introduced pattern match regression) - PR #168 - Implementation PR for #149 - SON implementation roadmap
navicore commented 2026-01-04 00:54:23 +00:00 (Migrated from github.com)
https://github.com/navicore/patch-seq/pull/171
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
navicore/patch-seq#166
No description provided.