Skynet performance: optimization opportunities from profiling #110
Summary
Detailed profiling of the skynet benchmark revealed where time is spent and identified concrete optimization opportunities.
Profiling Results
Key insight: May runtime accounts for 9.6x slowdown vs Go. Seq adds 2.7x on top of that.
Cost Breakdown (per 1M skynet spawns)
Optimization Opportunities
1. Reduce May stack size (High Impact) - `set_stack_size(0x100000)` in crates/runtime/src/scheduler.rs:279
2. Inline common stack operations (Medium Impact) - crates/compiler/src/codegen.rs
3. Specialize for Int operations (Medium Impact) - the `Value` enum is pattern matched on every op; crates/runtime/src/arithmetic.rs
4. Cache channel handles (Low Impact) - crates/runtime/src/channel.rs
5. Reduce strand registry overhead (Low Impact) - crates/runtime/src/scheduler.rs:370

Benchmarking
Profiling tools created in `benchmarks/skynet/`:
- `may_bench/` - Cargo project for May profiling
- `stack_clone_profile.rs` - stack cloning overhead
- `stack_ops_profile.rs` - stack operation overhead
- `spawn_profile.rs` - spawn component breakdown

Run benchmarks:
References
Deep Dive Investigation Results
Current Performance
Root Cause: FFI Call Overhead
Analyzed the generated LLVM IR for skynet (`seqc build --keep-ir`). Every Seq operation becomes an FFI call:
FFI calls per skynet function: 251
Top operations by frequency:
- `push_int`
- `drop_op`
- `pick_op`
- `add`
- `swap`
- `spawn`
- `chan_receive`

With ~11K branch nodes executing the full function, that's ~1.1M FFI calls just for simple stack operations.
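The counting itself is easy to reproduce against a `--keep-ir` dump. A minimal sketch (the `@seq_` symbol prefix is an assumption for illustration, not necessarily the compiler's actual naming):

```rust
// Count out-of-line runtime calls in a dump of generated LLVM IR.
// The "@seq_" prefix is hypothetical; substitute the real runtime symbols.
fn count_ffi_calls(ir: &str, prefix: &str) -> usize {
    ir.lines()
        .filter(|l| l.contains("call") && l.contains(prefix))
        .count()
}

fn main() {
    // Tiny stand-in for `seqc build --keep-ir` output.
    let ir = "\
  %0 = call ptr @seq_push_int(ptr %stack, i64 5)
  %1 = call ptr @seq_push_int(ptr %0, i64 7)
  %2 = call ptr @seq_add(ptr %1)
  br label %next";
    println!("FFI calls: {}", count_ffi_calls(ir, "@seq_")); // prints "FFI calls: 3"
}
```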
What Each FFI Call Costs
Looking at `push_int` → `push` → `pool_allocate`, each call pays for a thread-local lookup (`NODE_POOL.with`) and a RefCell borrow check (`borrow_mut()`).

Changes Made (Marginal Improvements)
- `#[cfg(debug_assertions)]` checks (minimal impact)

Proposed Solution: Inline LLVM IR Generation
Instead of emitting a runtime call for every operation, generate the operation's loads and stores inline in the LLVM IR.
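The intended transformation can be sketched in Rust stand-ins (the real "before" is an FFI call and the real "after" is emitted LLVM IR; the bodies here are illustrative, not the actual runtime):

```rust
use std::cell::RefCell;

// BEFORE: every push_int is an out-of-line runtime call that pays for a
// thread-local lookup plus a RefCell borrow check on each operation.
thread_local! {
    static NODE_POOL: RefCell<Vec<i64>> = RefCell::new(Vec::new());
}

#[inline(never)] // stand-in for the FFI boundary: the callee is opaque to LLVM
pub extern "C" fn push_int(v: i64) {
    NODE_POOL.with(|p| p.borrow_mut().push(v)); // TLS access + borrow check
}

// AFTER: the same store emitted inline, with the stack threaded through as a
// plain value the optimizer can see - just a bounds check + store.
#[inline(always)]
fn push_int_inline(stack: &mut Vec<i64>, v: i64) {
    stack.push(v); // no TLS, no RefCell
}

fn main() {
    push_int(5);
    let tls_top = NODE_POOL.with(|p| *p.borrow().last().unwrap());

    let mut stack = Vec::new();
    push_int_inline(&mut stack, 5);

    assert_eq!(tls_top, stack[0]);
    println!("both paths pushed {}", tls_top);
}
```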
Implementation Roadmap
Phase 1: Inline Integer Operations (High Impact)
- `push_int` - just allocate node + store tag + store value
- `drop_op` - just read next pointer + return to pool
- `add`, `subtract`, `multiply`, `divide` - pop 2, compute, push result

Phase 2: Inline Stack Shuffles (Medium Impact)
- `dup` - read top value, push copy
- `swap` - swap top two node values (no alloc needed!)
- `over`, `pick` - traverse + copy

Phase 3: Specialize for Integers (Medium Impact)
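A sketch of what the inlined, integer-specialized operations from these phases could look like (illustrative Rust over a hypothetical node layout; the real change emits equivalent LLVM IR, and the actual `StackNode` layout may differ):

```rust
// Hypothetical node-stack layout: a singly linked list of boxed nodes.
struct Node { value: i64, next: Option<Box<Node>> }
struct Stack { top: Option<Box<Node>> }

impl Stack {
    fn push_int(&mut self, v: i64) { // Phase 1: allocate node + store value
        self.top = Some(Box::new(Node { value: v, next: self.top.take() }));
    }
    fn drop_op(&mut self) { // Phase 1: just read the next pointer
        self.top = self.top.take().and_then(|n| n.next);
    }
    fn add(&mut self) { // Phase 1: pop 2, compute, push (reusing a node)
        let a = self.top.take().unwrap();
        let av = a.value;            // read before consuming the popped node
        let mut b = a.next.unwrap(); // move the link out of the popped box
        b.value += av;               // reuse b's node: no fresh allocation
        self.top = Some(b);
    }
    fn swap(&mut self) { // Phase 2: swap the top two *values* - no alloc needed
        let top = self.top.as_mut().unwrap();
        let first = top.value;
        let below = top.next.as_mut().unwrap();
        let second = below.value;
        below.value = first;
        top.value = second;
    }
    fn top_int(&self) -> i64 { // Phase 3: unboxed i64, no enum match per op
        self.top.as_ref().unwrap().value
    }
}

fn main() {
    let mut s = Stack { top: None };
    s.push_int(2);
    s.push_int(40);
    s.swap(); // 2 on top, 40 below
    s.add();
    println!("top = {}", s.top_int()); // prints "top = 42"
}
```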
Estimated Impact
Combined potential: 6-10x improvement, bringing Seq to ~3-5x of Go (similar to native Rust + May baseline of 9.6x).
Files to Modify
- `crates/compiler/src/codegen.rs` - main codegen logic
- `crates/compiler/src/llvm_types.rs` - add StackNode/Value type definitions
- `crates/runtime/src/pool.rs` - export `pool_alloc_fast` for inline use

Alternative: Register-Based VM
A more radical change would be to keep values in LLVM SSA registers instead of a heap stack. This would eliminate allocation entirely for local computations but requires significant architectural changes.
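The contrast can be sketched like this (illustrative only: `Box` stands in for per-value node allocation, and plain locals stand in for LLVM SSA registers):

```rust
// Heap-stack style: every intermediate value lives in an allocated node.
fn eval_stack() -> i64 {
    let mut stack: Vec<Box<i64>> = Vec::new(); // Box = per-value allocation
    stack.push(Box::new(2));
    stack.push(Box::new(40));
    let a = *stack.pop().unwrap();
    let b = *stack.pop().unwrap();
    stack.push(Box::new(a + b)); // another allocation for the result
    *stack.pop().unwrap()
}

// Register style: the same computation as straight-line code; the compiler
// keeps these locals in machine registers and no allocation ever happens.
fn eval_registers() -> i64 {
    let a = 2;
    let b = 40;
    a + b
}

fn main() {
    assert_eq!(eval_stack(), eval_registers());
    println!("{}", eval_registers()); // prints "42"
}
```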
Thanks to this test we learned the huge cost of our architecture: codegen produces code that performs a malloc and a dealloc, plus an FFI call, on every stack operation.

The solution is https://github.com/navicore/patch-seq/issues/111

We're still slower than Go in the benchmark despite implementing https://github.com/navicore/patch-seq/pull/112
Update: Zero-Mutex Channel Implementation
What We Tried
Eliminated the mutex-protected channel registry. Channels are now passed directly as `Value::Channel(Arc<ChannelData>)` on the stack - zero mutex, zero HashMap lookup on send/receive.

Results
Analysis
Channel-heavy workloads improved significantly. Skynet regressed - likely due to Arc refcount overhead during stack cloning on spawn (1M spawns, minimal channel use).
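The suspected mechanism is easy to see in isolation. A sketch (`ChannelData` is a stand-in, not the runtime's real type): each spawn deep-clones the parent stack, and every `Arc`-backed channel value on it costs an atomic refcount increment.

```rust
use std::sync::Arc;

// Stand-ins for the runtime's stack values; ChannelData is a placeholder.
struct ChannelData;

#[derive(Clone)]
enum Value {
    Int(i64),
    Channel(Arc<ChannelData>),
}

fn main() {
    let chan = Arc::new(ChannelData);
    let stack = vec![Value::Int(1), Value::Channel(chan.clone())];
    assert_eq!(Arc::strong_count(&chan), 2); // `chan` + the stack's copy

    // What spawn does today: clone the stack for the child strand.
    // Each Channel value costs an atomic refcount bump - cheap once,
    // but skynet does this across 1M spawns.
    let child_stack = stack.clone();
    assert_eq!(Arc::strong_count(&chan), 3);

    drop(child_stack);
    println!("refcount after child exits: {}", Arc::strong_count(&chan)); // prints "refcount after child exits: 2"
}
```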
Corrections to Original Issue
Some items listed as "optimization opportunities" were already completed:
Remaining Bottlenecks to Investigate
- Strand registry overhead - every spawn calls `SystemTime::now()`, which is a syscall
- May coroutine overhead - original profiling showed May itself accounts for ~9.6x vs Go
- Arc clone on spawn - stack clone now clones Arc refcounts for Channel values
Suggested Next Steps
1. Put diagnostics behind a feature flag - the strand registry (`SystemTime::now()` + O(n) scans) is only needed for `kill -3` production debugging. Add a Cargo feature like `diagnostics` (on by default) that can be disabled for benchmarks.
2. Profile the spawn path to quantify each component's cost.
3. Investigate whether stack cloning can be optimized for spawn (lazy clone, copy-on-write, etc.).
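A sketch of how the `diagnostics` gate could look (the feature name comes from this issue; the registry body and `register_strand` signature are illustrative):

```rust
// In Cargo.toml (assumed):
//
//   [features]
//   default = ["diagnostics"]
//   diagnostics = []

/// Called on every spawn. Returns whether diagnostics were recorded.
pub fn register_strand(id: u64) -> bool {
    if cfg!(feature = "diagnostics") {
        // Diagnostic builds only: record spawn time (a syscall) so a
        // `kill -3` handler can dump live strands.
        let _spawned_at = std::time::SystemTime::now();
        // ... insert (id, _spawned_at) into the registry (omitted) ...
        true
    } else {
        let _ = id; // hot path: no syscall, no registry scan
        false
    }
}

fn main() {
    // Built with `--no-default-features`, the spawn path skips the syscall.
    println!("diagnostics recorded: {}", register_strand(1));
}
```

Benchmark builds would then use `cargo build --release --no-default-features` to strip the syscall from the spawn hot path while keeping it on by default for production debugging.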
Final Analysis: Spawn Overhead vs Message Passing
Root Cause Identified
The skynet slowdown is syscall overhead during spawn, not ongoing runtime overhead:
May's coroutine implementation uses mmap/munmap with guard pages for each stack. This requires syscalls at spawn/teardown time. Go's goroutines use segmented stacks with minimal syscalls.
Practical Implications
Long-lived actor systems are fine:
Spawn-heavy patterns are slow:
Tuning Attempted
- `rand_work_steal` feature

Changes Made (will merge)
- `diagnostics` feature flag - disables strand registry + SIGQUIT handler (useful for production)
- `rand_work_steal` enabled - may help with work-stealing contention

Recommendation: Won't Fix
The fundamental issue is May's stack allocation strategy vs Go's runtime. Fixing this would require:
For real-world actor systems, the current performance is acceptable. Skynet is a synthetic benchmark that specifically stress-tests spawn overhead.
Documentation
Will document:
- `diagnostics` feature flag for production tuning
- `SEQ_STACK_SIZE`, `SEQ_POOL_CAPACITY`

The long-lived actor use case is fully supported at Go-like levels of performance. The skynet challenge is interesting but not a priority for Seq.