Skynet performance: optimization opportunities from profiling #110

Closed
opened 2025-12-18 03:49:51 +00:00 by navicore · 6 comments
navicore commented 2025-12-18 03:49:51 +00:00 (Migrated from github.com)

Summary

Detailed profiling of the skynet benchmark revealed where time is spent and identified concrete optimization opportunities.

Profiling Results

| Runtime | Time | vs Go |
|---------|------|-------|
| Go | 169ms | 1.0x |
| Native Rust + May | 1,628ms | 9.6x |
| Seq | 4,426ms | 26x |

Key insight: the May runtime alone accounts for a 9.6x slowdown vs Go; Seq's codegen and FFI overhead add roughly another 2.7x on top of that.

Cost Breakdown (per 1M skynet spawns)

| Component | Time | % of Seq |
|-----------|------|----------|
| May coroutine spawn | ~1,200ms | 27% |
| Strand registry scan | ~220ms | 5% |
| Stack cloning on spawn | ~110ms | 2.5% |
| Stack operations | ~75ms | 1.7% |
| Channel registry lookups | ~31ms | 0.7% |
| Seq codegen/FFI overhead | ~2,800ms | 63% |

Optimization Opportunities

1. Reduce May stack size (High Impact)

  • Current: 1MB per coroutine (set_stack_size(0x100000))
  • Opportunity: May supports smaller stacks. Go uses 2KB initial with growth.
  • Location: crates/runtime/src/scheduler.rs:279
  • Risk: Stack overflow for deep recursion
  • Experiment: Try 64KB, 128KB, 256KB and benchmark
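
A minimal sketch of how that experiment could be wired up, assuming the scheduler init owns the `may::config().set_stack_size(...)` call referenced above and an env override along the lines of the `SEQ_STACK_SIZE` variable mentioned later in this thread:

```rust
// Sketch only: read an override so 64KB/128KB/256KB can be benchmarked
// without recompiling.
fn configure_stack_size() {
    let stack_size = std::env::var("SEQ_STACK_SIZE")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or(0x20000); // default 128KB instead of the 1MB (0x100000) above

    may::config().set_stack_size(stack_size);
}
```

If a smaller default overflows deeply recursive skynet trees, the override makes it easy to back off per benchmark run.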

2. Inline common stack operations (Medium Impact)

  • Current: Every push/pop/dup is an FFI call through LLVM
  • Opportunity: Generate inline LLVM IR for hot operations
  • Location: crates/compiler/src/codegen.rs
  • Benefit: Eliminate ~10M FFI round-trips in skynet

3. Specialize for Int operations (Medium Impact)

  • Current: All values wrapped in Value enum, pattern matched on every op
  • Opportunity: Fast path for Int-only arithmetic (skynet is all Ints)
  • Location: crates/runtime/src/arithmetic.rs
  • Approach: Type inference or runtime specialization
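
A hedged sketch of the idea, with an illustrative `Value` enum standing in for the runtime's real one: the generic path keeps the tag check on every operation, while a specialized variant the compiler could emit for proven-Int code skips the enum entirely.

```rust
// Illustrative types: the real definitions live in the runtime crate.
enum Value {
    Int(i64),
    Float(f64),
}

// Generic path: every add unwraps and re-wraps the enum.
fn add_generic(a: Value, b: Value) -> Value {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Value::Int(x.wrapping_add(y)),
        (Value::Float(x), Value::Float(y)) => Value::Float(x + y),
        _ => panic!("type error: add on mixed operands"),
    }
}

// Specialized path: what codegen could emit once type inference (or a runtime
// guard) proves both operands are Ints -- no tag, no match, just arithmetic.
#[inline(always)]
fn add_int(x: i64, y: i64) -> i64 {
    x.wrapping_add(y)
}
```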

4. Cache channel handles (Low Impact)

  • Current: Every send/receive locks mutex + HashMap lookup
  • Opportunity: Return channel handle directly, bypass registry on use
  • Location: crates/runtime/src/channel.rs
  • Benefit: Eliminate ~2M mutex acquisitions in skynet

5. Reduce strand registry overhead (Low Impact)

  • Current: O(1024) scan + SystemTime::now() per spawn
  • Opportunity: Make registry optional via env var for benchmarks
  • Location: crates/runtime/src/scheduler.rs:370
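
As a sketch of the "optional via env var" idea (the `SEQ_DISABLE_REGISTRY` name here is hypothetical; a Cargo feature flag was ultimately used instead, see the later comments), the per-spawn cost can drop to a single cached boolean check:

```rust
use std::sync::OnceLock;

// Hypothetical gate: read the env var once, then branch on a cached bool so a
// disabled registry costs one load per spawn instead of SystemTime::now()
// plus an O(1024) slot scan.
fn registry_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| std::env::var_os("SEQ_DISABLE_REGISTRY").is_none())
}

fn register_strand(strand_id: u64) {
    if !registry_enabled() {
        return; // benchmark mode: skip all bookkeeping
    }
    // ... existing SystemTime::now() + free-slot scan for `strand_id` ...
    let _ = strand_id;
}
```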

Benchmarking

Profiling tools created in benchmarks/skynet/:

  • may_bench/ - Cargo project for May profiling
  • stack_clone_profile.rs - Stack cloning overhead
  • stack_ops_profile.rs - Stack operation overhead
  • spawn_profile.rs - Spawn component breakdown

Run benchmarks:

cd benchmarks && ./run.sh skynet

References

  • Profiling session that identified these opportunities
  • May coroutine library: https://github.com/Xudong-Huang/may
navicore commented 2025-12-18 04:44:14 +00:00 (Migrated from github.com)

Deep Dive Investigation Results

Current Performance

  • Seq: 4.47s (with 128KB stack + stats disabled)
  • Go: 165ms
  • Ratio: ~27x slower

Root Cause: FFI Call Overhead

Analyzed the generated LLVM IR for skynet (seqc build --keep-ir):

Every Seq operation becomes an FFI call:

%33 = call ptr @patch_seq_push_int(ptr %32, i64 3)   ; push 3
%34 = call ptr @patch_seq_pick_op(ptr %33)           ; pick
%35 = call ptr @patch_seq_push_int(ptr %34, i64 10)  ; push 10  
%36 = call ptr @patch_seq_multiply(ptr %35)          ; multiply

FFI calls per skynet function: 251

Top operations by frequency:

| Operation | Calls | Could Inline? |
|-----------|-------|---------------|
| push_int | 56 | Yes |
| drop_op | 49 | Yes |
| pick_op | 30 | Yes |
| add | 19 | Yes |
| swap | 18 | Yes |
| spawn | 11 | No (complex) |
| chan_receive | 11 | No (complex) |

With ~11K branch nodes executing the full function, that's ~1.1M FFI calls just for simple stack operations.

What Each FFI Call Costs

Looking at the call chain push_int → push → pool_allocate:

  1. Function call overhead (stack frame, registers)
  2. Thread-local storage access (NODE_POOL.with)
  3. RefCell borrow (borrow_mut())
  4. Pool allocation logic
  5. Return value passing
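
A rough sketch of that path (names and layout are illustrative, not the actual pool.rs code), which makes it clear why the cost is paid on every single call and is invisible to LLVM's optimizer:

```rust
use std::cell::RefCell;

// Illustrative stand-ins for the runtime's real types.
struct StackNode { tag: i64, value: i64, next: *mut StackNode }
struct NodePool { free: Vec<*mut StackNode> }

thread_local! {
    // (2) thread-local storage access on every operation
    static NODE_POOL: RefCell<NodePool> = RefCell::new(NodePool { free: Vec::new() });
}

// (1) the call itself is an FFI boundary, so LLVM cannot inline or eliminate it
#[no_mangle]
pub extern "C" fn patch_seq_push_int(stack: *mut StackNode, value: i64) -> *mut StackNode {
    NODE_POOL.with(|pool| {
        let mut pool = pool.borrow_mut(); // (3) RefCell borrow check
        // (4) pool allocation logic; falls back to a fresh allocation when empty
        let node = pool.free.pop().unwrap_or_else(|| {
            Box::into_raw(Box::new(StackNode { tag: 0, value: 0, next: std::ptr::null_mut() }))
        });
        unsafe {
            (*node).tag = 0; // Int
            (*node).value = value;
            (*node).next = stack;
        }
        node // (5) return value passed back to the generated code
    })
}
```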

Changes Made (Marginal Improvements)

  1. Reduced default stack size: 1MB → 128KB (~16% faster)
  2. Made pool stats debug-only: #[cfg(debug_assertions)] (minimal impact)

Proposed Solution: Inline LLVM IR Generation

Instead of:

%1 = call ptr @patch_seq_push_int(ptr %0, i64 42)

Generate:

; Inline push_int - no FFI call
%node = call ptr @pool_alloc_fast()      ; or inline further
%tag_ptr = getelementptr %Value, ptr %node, i32 0, i32 0
store i64 0, ptr %tag_ptr                 ; tag = Int (0)
%val_ptr = getelementptr %Value, ptr %node, i32 0, i32 1
store i64 42, ptr %val_ptr                ; value = 42
%next_ptr = getelementptr %StackNode, ptr %node, i32 0, i32 1
store ptr %0, ptr %next_ptr               ; next = old stack

Implementation Roadmap

Phase 1: Inline Integer Operations (High Impact)

  • push_int - just allocate node + store tag + store value
  • drop_op - just read next pointer + return to pool
  • add, subtract, multiply, divide - pop 2, compute, push result

Phase 2: Inline Stack Shuffles (Medium Impact)

  • dup - read top value, push copy
  • swap - swap top two node values (no alloc needed!)
  • over, pick - traverse + copy

Phase 3: Specialize for Integers (Medium Impact)

  • Type inference pass to identify int-only code paths
  • Generate specialized versions without Value enum overhead

Estimated Impact

| Phase | Effort | Expected Speedup |
|-------|--------|------------------|
| Phase 1 | 2-3 days | 2-3x |
| Phase 2 | 1-2 days | 1.5x |
| Phase 3 | 1 week | 2x |

Combined potential: 6-10x improvement, bringing Seq to ~3-5x of Go (similar to native Rust + May baseline of 9.6x).

Files to Modify

  • crates/compiler/src/codegen.rs - Main codegen logic
  • crates/compiler/src/llvm_types.rs - Add StackNode/Value type definitions
  • crates/runtime/src/pool.rs - Export pool_alloc_fast for inline use

Alternative: Register-Based VM

A more radical change would be to keep values in LLVM SSA registers instead of a heap stack. This would eliminate allocation entirely for local computations but requires significant architectural changes.
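
To make the contrast concrete, a toy sketch (purely illustrative, not the Seq codegen) of evaluating `(1 + 2) * 3` under each model:

```rust
// Heap-stack model (current): every intermediate value is a heap node, so this
// tiny expression costs four allocations plus pointer chasing to pop operands.
struct Node { value: i64, next: Option<Box<Node>> }

fn push(stack: Option<Box<Node>>, value: i64) -> Option<Box<Node>> {
    Some(Box::new(Node { value, next: stack }))
}

fn eval_heap() -> i64 {
    let s = push(None, 1);
    let s = push(s, 2);
    let Node { value: b, next } = *s.unwrap();    // pop 2
    let Node { value: a, next } = *next.unwrap(); // pop 1
    let s = push(next, a + b);                    // push 3
    let s = push(s, 3);
    let Node { value: y, next } = *s.unwrap();
    let Node { value: x, .. } = *next.unwrap();
    x * y // 9
}

// Register model (the alternative): the same computation as plain locals, which
// LLVM lowers to SSA values living in machine registers -- no allocation at all.
fn eval_registers() -> i64 {
    let sum = 1 + 2;
    sum * 3 // 9
}
```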

navicore commented 2025-12-18 05:31:33 +00:00 (Migrated from github.com)

Thanks to this test we learned the huge cost of our architecture: codegen produces code that performs a malloc and dealloc, plus an FFI call, on every stack op.

The solution is https://github.com/navicore/patch-seq/issues/111

navicore commented 2025-12-21 04:48:29 +00:00 (Migrated from github.com)

We're now even worse in the skynet benchmark compared to Go, despite implementing https://github.com/navicore/patch-seq/pull/112

navicore commented 2025-12-21 05:01:40 +00:00 (Migrated from github.com)

Update: Zero-Mutex Channel Implementation

What We Tried

Eliminated the mutex-protected channel registry. Channels are now passed directly as Value::Channel(Arc<ChannelData>) on the stack - zero mutex, zero HashMap lookup on send/receive.
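
A sketch of the shape of that change (type and field names are illustrative; the real runtime presumably wraps May's channel primitives rather than std's):

```rust
use std::sync::{mpsc, Arc, Mutex};

// Illustrative stand-in for the runtime's channel payload.
struct ChannelData {
    sender: mpsc::Sender<Value>,
    receiver: Mutex<mpsc::Receiver<Value>>, // Receiver is !Sync, so guard it
}

enum Value {
    Int(i64),
    Channel(Arc<ChannelData>), // the handle itself lives on the Seq stack
}

// Before: send(chan_id) locked a global Mutex<HashMap<..>> to resolve the id.
// After: the Arc is popped straight off the stack -- no registry, no mutex.
fn chan_send(chan: &Arc<ChannelData>, v: Value) {
    let _ = chan.sender.send(v);
}

fn chan_receive(chan: &Arc<ChannelData>) -> Option<Value> {
    chan.receiver.lock().ok()?.recv().ok()
}
```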

Results

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| pingpong | Go 1.58x faster | Go 1.36x faster | +14% |
| fanout | Go 1.15x faster | Seq 1.18x faster | +35% |
| skynet | Go 28x faster | Go 35x faster | -24% |

Analysis

Channel-heavy workloads improved significantly. Skynet regressed - likely due to Arc refcount overhead during stack cloning on spawn (1M spawns, minimal channel use).

Corrections to Original Issue

Some items listed as "optimization opportunities" were already completed:

  • Stack size: Already 128KB (not 1MB as stated)
  • Inline stack ops: Arithmetic, comparisons, stack shuffling generate inline LLVM IR
  • Channel handles: Now zero-mutex with direct Arc handles

Remaining Bottlenecks to Investigate

  1. Strand registry overhead - Every spawn calls:

    • SystemTime::now() - this is a syscall
    • O(1024) scan to find free slot
    • O(1024) scan again on completion
  2. May coroutine overhead - Original profiling showed May itself accounts for ~9.6x vs Go.

  3. Arc clone on spawn - Stack clone now clones Arc refcounts for Channel values.

Suggested Next Steps

  1. Put diagnostics behind a feature flag - The strand registry (SystemTime::now() + O(n) scans) is only needed for kill -3 production debugging. Add a Cargo feature like diagnostics (on by default) that can be disabled for benchmarks (a runtime-side sketch of the gating follows this list):

    [features]
    default = ["diagnostics"]
    diagnostics = []
    
  2. Profile spawn path to quantify each component's cost

  3. Investigate if stack cloning can be optimized for spawn (lazy clone, copy-on-write, etc.)
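
The runtime-side counterpart of the `diagnostics` feature from point 1 might look like the following (function name is illustrative), paired with the Cargo feature above:

```rust
// With the feature on (the default), spawn registers the strand as today; with
// it off, the whole body compiles away and the O(1024) scans plus the
// SystemTime::now() syscall disappear from the spawn path.
#[cfg(feature = "diagnostics")]
fn register_strand(strand_id: u64) {
    let started = std::time::SystemTime::now(); // syscall
    // ... O(1024) scan for a free slot, record `strand_id` + `started` ...
    let _ = (strand_id, started);
}

#[cfg(not(feature = "diagnostics"))]
#[inline(always)]
fn register_strand(_strand_id: u64) {}
```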

navicore commented 2025-12-21 05:47:01 +00:00 (Migrated from github.com)

Final Analysis: Spawn Overhead vs Message Passing

Root Cause Identified

The skynet slowdown is syscall overhead during spawn, not ongoing runtime overhead:

| Benchmark | Pattern | Seq System Time | vs Go |
|-----------|---------|-----------------|-------|
| Pingpong | 2 strands, 1M messages | 3.3ms (1.4%) | 1.21x slower |
| Skynet | 100k strands, minimal work | 18,000ms (300%) | 35x slower |

May's coroutine implementation uses mmap/munmap with guard pages for each stack, which requires syscalls at spawn/teardown time. Go's goroutines start with small stacks allocated from the Go heap and grow by copying, so spawning a goroutine involves no per-goroutine mmap syscall.

Practical Implications

Long-lived actor systems are fine:

  • Spawn 1M actors at startup: ~60 seconds one-time cost
  • Send millions of messages: Nearly as fast as Go (1.2x)
  • Actors running for hours: No ongoing syscall overhead

Spawn-heavy patterns are slow:

  • Skynet-style "spawn, do minimal work, exit" patterns pay full syscall cost
  • This is a pathological benchmark, not representative of real actor workloads

Tuning Attempted

| Setting | Result |
|---------|--------|
| Pool capacity 1000→10000 | User time ↓66%, but syscalls still dominate |
| rand_work_steal feature | No improvement |
| Diagnostics feature flag | No measurable impact on skynet |

Changes Made (will merge)

  1. diagnostics feature flag - Gates the strand registry + SIGQUIT handler (kept on by default for kill -3 production debugging; disable for benchmarks)
  2. Pool capacity increased - 1000→10000 default reduces allocations
  3. rand_work_steal enabled - May help with work-stealing contention

Recommendation: Won't Fix

The fundamental issue is May's stack allocation strategy vs Go's runtime. Fixing this would require:

  • Forking/modifying May
  • Writing a custom coroutine library
  • Switching to a different approach entirely

For real-world actor systems, the current performance is acceptable. Skynet is a synthetic benchmark that specifically stress-tests spawn overhead.

Documentation

Will document:

  1. Performance characteristics (spawn overhead vs message passing)
  2. diagnostics feature flag for production tuning
  3. Environment variables: SEQ_STACK_SIZE, SEQ_POOL_CAPACITY
navicore commented 2025-12-21 06:10:55 +00:00 (Migrated from github.com)

The long-lived actor use case is well supported, at close to Go levels of performance. The skynet challenge is interesting but not a priority for Seq.
