Optimization: Use NaN-boxing to reduce Value size from 40 to 8 bytes #188
Reference: navicore/patch-seq#188
Summary
Implement NaN-boxing to pack values into 8 bytes instead of the current 40-byte tagged union, reducing memory bandwidth and improving cache utilization.
Current Behavior
Values use a 40-byte struct:
This means:
Proposed Optimization
Use NaN-boxing to encode values in 8 bytes:
Encoding Scheme (example)
Trade-offs
Pros:
Cons:
Implementation Complexity
This is a significant refactor affecting:
- `StackValue` representation

Alternatives
References
Related
Part of performance optimization effort. See benchmarks in `benchmarks/compute/`.

Analysis Complete
I've done a comprehensive codebase analysis for the NaN-boxing migration. Key findings:
Scope
Critical Files
- `runtime/value.rs`
- `runtime/tagged_stack.rs`
- `runtime/stack.rs`
- `codegen/program.rs` (`%Value` LLVM type declaration)
- `codegen/inline/dispatch.rs` (`mul i64 %n, 40`)

New Complexity
Types currently using multiple 64-bit slots will need heap allocation:
- `QuotationData` struct
- `ClosureData` struct
- `WeaveCtxData` struct

Estimated Effort
~2-3 weeks for careful implementation:
Planning Document
Full analysis with migration strategy added to `docs/NANBOX_PLAN.md`.

Open Questions
Response to "Open Questions"
@claude can you review the analysis above and comment? do we have things covered in our plan in docs/NANBOX_PLAN.md?
Claude finished @navicore's task in 4m 46s
Review of NaN-boxing Analysis
I've completed a thorough review of your analysis and the codebase. Here's my assessment:
Critical Finding
The planning document `docs/NANBOX_PLAN.md` doesn't exist yet - you mentioned it was "added" in your analysis, but it's not in the repository. I'll review what such a plan should cover based on your analysis and the actual codebase.

Scope Validation ✅
Your scope analysis is accurate and well-researched:
Current Architecture (confirmed):
- `crates/runtime/src/value.rs:136-198`
- `crates/runtime/src/tagged_stack.rs:55-66`
- `%Value = type { i64, i64, i64, i64, i64 }` (`crates/compiler/src/codegen/program.rs:81`)
- `%Value` references and 73 `getelementptr %Value` operations

Hardcoded Constants (confirmed):
- `mul i64 %{}, 40` at `inline/dispatch.rs:1061` (roll operation)
- `40` appears ~30 times across codebase, e.g. `value.rs:325`, `tagged_stack.rs:90`

Types Requiring Heap Migration (confirmed from `value.rs:167-197`)

What the Plan Must Cover
Based on codebase analysis, here's what `docs/NANBOX_PLAN.md` should include:

1. Encoding Scheme Design ⚠️
Missing from analysis: Specific bit layout decisions needed:
Critical decisions needed:
2. New Heap Data Structures
Found in analysis, needs detailed design:
Plan should specify:
3. Runtime Migration Strategy
32 runtime files with FFI exports need updates:
High-impact files (from analysis):
- `crates/runtime/src/stack.rs`: 656 Value pattern matches across 24 files
- `crates/runtime/src/value.rs`: core enum definition
- `crates/runtime/src/tagged_stack.rs`: `StackValue` struct

Suggested phased approach (plan should detail):
- `#[cfg(feature = "nanbox")]` dual-mode support
- `sv.slot0 == DISC_INT` → `is_nanbox_int(sv.slot0)` (`stack.rs:224-299`)

4. Codegen Migration Strategy
8 codegen files need updates:
Critical changes:
- `program.rs:81`: `%Value = type { i64, i64, i64, i64, i64 }` → `%Value = type { i64 }`
- `inline/dispatch.rs:1061`: `mul i64 %{}, 40` → `mul i64 %{}, 8`
- `getelementptr %Value` operations: pointer stride changes 40→8

Biggest risk: LLVM IR generation correctness
5. Testing & Validation
Plan should include:
- Benchmarks in `benchmarks/compute/`

6. Migration Phases
Your 2-3 week estimate aligns with scope, but plan should detail:
Phase 1: Foundation (2-3 days)
Phase 2: Runtime Dual-Mode (3-5 days)
- `#[cfg(feature = "nanbox")]` scaffolding

Phase 3: Codegen (3-5 days, highest risk)
- `%Value` LLVM type declaration

Phase 4: FFI & Integration (2-3 days)
Phase 5: Cleanup (1-2 days)
Gaps & Recommendations
Missing from Your Analysis
Float NaN handling: Current code has Float values - what about actual NaN floats? Need special case?
Type system impact: Does the typechecker need updates? Overflow checks at compile-time?
ABI compatibility: FFI boundaries with C code - does anyone call these functions directly?
Performance assumptions: You assume 5x speedup, but:
Rollback plan: If NaN-boxing causes issues, how to revert? Feature flag helps but needs testing.
Recommendations
Answers to Your Questions
1. Integer overflow policy
You want compile-time failure. This requires:
Recommendation: a 48-bit range with compile-time literal checks is the most practical option.
2. Feature flag
Yes, absolutely use `#[cfg(feature = "nanbox")]` during migration. Essential for:

3. Arena allocator

Confirmed: Arena exists at `crates/runtime/src/arena.rs` with bump allocation.

Summary
Your analysis is thorough and accurate. The scope is well-understood. However:
✅ Covered well: Scope, critical files, heap type migration, effort estimate
⚠️ Needs detail: Encoding scheme, overflow policy, testing strategy, rollback plan
❌ Missing: The actual planning document, performance baseline, ABI compatibility analysis
Next step: Create `docs/NANBOX_PLAN.md` incorporating:

The 2-3 week estimate is reasonable given the scope, but success depends on careful encoding design and thorough testing at each phase.
@claude docs/NANBOX_PLAN.md exists now in main
Claude finished @navicore's task in 2m 43s
Review of `docs/NANBOX_PLAN.md`

Summary
The planning document is excellent and comprehensive. It addresses all critical gaps from my previous review and provides a solid foundation for implementation. I found only minor areas for enhancement.
✅ What's Covered Well
1. Encoding Scheme Design
Status: Complete
The document clearly specifies:
This addresses my previous concern about "missing encoding scheme details."
2. Heap Data Structures
Status: Complete
All three multi-slot types are documented (lines 137-152):
- `QuotationData`: 2×u64 (wrapper + impl_)
- `ClosureData`: fn_ptr + env pointer
- `WeaveCtxData`: 2×Arc pointers

The migration from inline storage to heap allocation is clear.
3. Impact Analysis
Status: Excellent
Thorough file-by-file breakdown:
This matches and extends my previous analysis. The line number references are particularly helpful.
4. Migration Strategy
Status: Complete
Five-phase plan with clear boundaries (lines 105-135):
- Feature flag (`--features nanbox`) for dual-mode support

This addresses my recommendation for "phased migration plan with feature flags."
5. Testing Strategy
Status: Good
Four-tier approach (lines 167-172):
6. Trade-offs Analysis
Status: Complete
Balanced presentation of pros/cons (lines 62-74):
7. Open Questions
Status: Addressed
The plan identifies four critical decisions (lines 174-179):
⚠️ Minor Enhancements Recommended
1. Integer Overflow Policy (Partially Resolved)
Current state (line 176): "Error at compile time? Runtime? Silent wrap?"
User's answer: Compile-time failure preferred
Recommendation: Document the implementation approach in the plan:
2. NaN Canonicalization Strategy
Current state (line 177): Question still open
Recommendation: Choose one approach and document it:
Option A: Reserve NaN Range (recommended)
- Canonicalize to `0x7FF8_0000_0000_0000`: `if is_nan(f) { f = CANONICAL_NAN }`

Option B: Store Real NaNs

- Reserve a tag (e.g. `0xF`) for "actual NaN values"

Suggest adding a decision to the plan.
3. Allocation Strategy Details
Current state (line 178): "Arena allocator? Per-quotation heap alloc?"
User's answer: Use existing arena allocator
Recommendation: Clarify which types use arena vs. Arc:
4. Testing Matrix
Enhancement: Add cross-architecture testing plan
The plan mentions ARM64 vs x86-64 pointer differences but doesn't specify testing:
5. Rollback Procedure
Missing: What if NaN-boxing causes production issues?
Recommendation: Add a rollback section:
🎯 Critical Success Factors
The plan correctly identifies the highest-risk phase (Phase 3: Codegen, lines 119-123). Key concerns:
LLVM IR Correctness
- `%Value` type change propagates to 118+ IR references
- `getelementptr` offsets change from 40→8

Size Constant Updates

- `40` literals in codebase (search revealed ~30 occurrences)

FFI Boundaries
- `extern "C"` functions

Recommendation: Add Phase 3 checkpoint - run full test suite after LLVM IR changes before proceeding to Phase 4.
🔍 Cross-Reference Check
I verified the plan against actual codebase:
- `runtime/value.rs:136-198`
- `tagged_stack.rs:55-66`
- `stack.rs` `DISC_*` constants
- `mul i64 %n, 40` at `inline/dispatch.rs:1061`
- `tagged_stack.rs:90`

All references are correct.
📊 Effort Estimate Review
Planned: 2-3 weeks (lines 183-190)
Assessment: Reasonable, with caveats:
Total realistic: 2.5-4 weeks with contingency
The plan wisely notes "careful implementation with testing" - this is critical.
✨ Overall Assessment
Grade: A
The plan is production-ready with only minor enhancements needed:
Strengths
Recommended Next Steps
Final Recommendation
Proceed with implementation using this plan. The foundation is solid. The only critical addition is documenting the three open decisions (overflow policy, NaN handling, allocation details) before starting Phase 1.
The user's answers provide clear direction:
Once these are documented, the plan is complete and ready for execution.
References checked: All file paths, line numbers, and code references verified against the current `main` branch.

Estimated plan completeness: 95% (minor documentation additions recommended)
We implemented it, but we don't have full f64 anymore, and the performance benefits are not totally clear - they seem negligible.
On the middle ground: The jump from 8 bytes (nanbox) to 40 bytes is indeed large. Some options that could be explored later:
The 40-byte size is driven by the largest inline variant (SeqString, WeaveCtx with two Arc pointers). There may be room to shrink by making more types heap-indirect.
On profiling: This is the key insight. We optimized based on intuition ("smaller values = faster") without data showing:
The 30% fib improvement and 20% leibniz_pi regression suggest different workloads stress different parts of the system.
Next step when revisiting performance: Add profiling infrastructure first. Flame graphs, cache miss counters, allocation tracking. Then optimize what the data says matters.