Skynet performance: optimization opportunities from profiling #110

Closed
opened 2025-12-18 03:49:51 +00:00 by navicore · 6 comments
navicore commented 2025-12-18 03:49:51 +00:00 (Migrated from github.com)

Summary

Detailed profiling of the skynet benchmark revealed where time is spent and identified concrete optimization opportunities.

Profiling Results

| Runtime | Time | vs Go |
|---------|------|-------|
| Go | 169ms | 1.0x |
| Native Rust + May | 1,628ms | 9.6x |
| Seq | 4,426ms | 26x |

Key insight: the May runtime alone accounts for a 9.6x slowdown vs Go; Seq's codegen and FFI overhead add roughly another 2.7x on top of that.

Cost Breakdown (per 1M skynet spawns)

| Component | Time | % of Seq |
|-----------|------|----------|
| May coroutine spawn | ~1,200ms | 27% |
| Strand registry scan | ~220ms | 5% |
| Stack cloning on spawn | ~110ms | 2.5% |
| Stack operations | ~75ms | 1.7% |
| Channel registry lookups | ~31ms | 0.7% |
| Seq codegen/FFI overhead | ~2,800ms | 63% |

Optimization Opportunities

1. Reduce May stack size (High Impact)

  • Current: 1MB per coroutine (set_stack_size(0x100000))
  • Opportunity: May supports smaller stacks. Go uses 2KB initial with growth.
  • Location: crates/runtime/src/scheduler.rs:279
  • Risk: Stack overflow for deep recursion
  • Experiment: Try 64KB, 128KB, 256KB and benchmark
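
A minimal sketch of how that experiment could be wired up, assuming the scheduler init owns the `may::config().set_stack_size(...)` call referenced above and an env override along the lines of the `SEQ_STACK_SIZE` variable mentioned later in this thread:

```rust
// Sketch only: read an override so 64KB/128KB/256KB can be benchmarked
// without recompiling.
fn configure_stack_size() {
    let stack_size = std::env::var("SEQ_STACK_SIZE")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or(0x20000); // default 128KB instead of the 1MB (0x100000) above

    may::config().set_stack_size(stack_size);
}
```

If a smaller default overflows deeply recursive skynet trees, the override makes it easy to back off per benchmark run.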

2. Inline common stack operations (Medium Impact)

  • Current: Every push/pop/dup is an FFI call through LLVM
  • Opportunity: Generate inline LLVM IR for hot operations
  • Location: crates/compiler/src/codegen.rs
  • Benefit: Eliminate ~10M FFI round-trips in skynet

3. Specialize for Int operations (Medium Impact)

  • Current: All values wrapped in Value enum, pattern matched on every op
  • Opportunity: Fast path for Int-only arithmetic (skynet is all Ints)
  • Location: crates/runtime/src/arithmetic.rs
  • Approach: Type inference or runtime specialization
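
A hedged sketch of the idea, with an illustrative `Value` enum standing in for the runtime's real one: the generic path keeps the tag check on every operation, while a specialized variant the compiler could emit for proven-Int code skips the enum entirely.

```rust
// Illustrative types: the real definitions live in the runtime crate.
enum Value {
    Int(i64),
    Float(f64),
}

// Generic path: every add unwraps and re-wraps the enum.
fn add_generic(a: Value, b: Value) -> Value {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Value::Int(x.wrapping_add(y)),
        (Value::Float(x), Value::Float(y)) => Value::Float(x + y),
        _ => panic!("type error: add on mixed operands"),
    }
}

// Specialized path: what codegen could emit once type inference (or a runtime
// guard) proves both operands are Ints -- no tag, no match, just arithmetic.
#[inline(always)]
fn add_int(x: i64, y: i64) -> i64 {
    x.wrapping_add(y)
}
```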

4. Cache channel handles (Low Impact)

  • Current: Every send/receive locks mutex + HashMap lookup
  • Opportunity: Return channel handle directly, bypass registry on use
  • Location: crates/runtime/src/channel.rs
  • Benefit: Eliminate ~2M mutex acquisitions in skynet

5. Reduce strand registry overhead (Low Impact)

  • Current: O(1024) scan + SystemTime::now() per spawn
  • Opportunity: Make registry optional via env var for benchmarks
  • Location: crates/runtime/src/scheduler.rs:370
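
As a sketch of the "optional via env var" idea (the `SEQ_DISABLE_REGISTRY` name here is hypothetical; a Cargo feature flag was ultimately used instead, see the later comments), the per-spawn cost can drop to a single cached boolean check:

```rust
use std::sync::OnceLock;

// Hypothetical gate: read the env var once, then branch on a cached bool so a
// disabled registry costs one load per spawn instead of SystemTime::now()
// plus an O(1024) slot scan.
fn registry_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| std::env::var_os("SEQ_DISABLE_REGISTRY").is_none())
}

fn register_strand(strand_id: u64) {
    if !registry_enabled() {
        return; // benchmark mode: skip all bookkeeping
    }
    // ... existing SystemTime::now() + free-slot scan for `strand_id` ...
    let _ = strand_id;
}
```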

Benchmarking

Profiling tools created in benchmarks/skynet/:

  • may_bench/ - Cargo project for May profiling
  • stack_clone_profile.rs - Stack cloning overhead
  • stack_ops_profile.rs - Stack operation overhead
  • spawn_profile.rs - Spawn component breakdown

Run benchmarks:

cd benchmarks && ./run.sh skynet

References

  • Profiling session that identified these opportunities
  • May coroutine library: https://github.com/Xudong-Huang/may
navicore commented 2025-12-18 04:44:14 +00:00 (Migrated from github.com)

Deep Dive Investigation Results

Current Performance

  • Seq: 4.47s (with 128KB stack + stats disabled)
  • Go: 165ms
  • Ratio: ~27x slower

Root Cause: FFI Call Overhead

Analyzed the generated LLVM IR for skynet (seqc build --keep-ir):

Every Seq operation becomes an FFI call:

%33 = call ptr @patch_seq_push_int(ptr %32, i64 3)   ; push 3
%34 = call ptr @patch_seq_pick_op(ptr %33)           ; pick
%35 = call ptr @patch_seq_push_int(ptr %34, i64 10)  ; push 10  
%36 = call ptr @patch_seq_multiply(ptr %35)          ; multiply

FFI calls per skynet function: 251

Top operations by frequency:

| Operation | Calls | Could Inline? |
|-----------|-------|---------------|
| push_int | 56 | Yes |
| drop_op | 49 | Yes |
| pick_op | 30 | Yes |
| add | 19 | Yes |
| swap | 18 | Yes |
| spawn | 11 | No (complex) |
| chan_receive | 11 | No (complex) |

With ~11K branch nodes executing the full function, that's ~1.1M FFI calls just for simple stack operations.

What Each FFI Call Costs

Looking at the call chain push_int → push → pool_allocate:

  1. Function call overhead (stack frame, registers)
  2. Thread-local storage access (NODE_POOL.with)
  3. RefCell borrow (borrow_mut())
  4. Pool allocation logic
  5. Return value passing
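
A rough sketch of that path (names and layout are illustrative, not the actual pool.rs code), which makes it clear why the cost is paid on every single call and is invisible to LLVM's optimizer:

```rust
use std::cell::RefCell;

// Illustrative stand-ins for the runtime's real types.
struct StackNode { tag: i64, value: i64, next: *mut StackNode }
struct NodePool { free: Vec<*mut StackNode> }

thread_local! {
    // (2) thread-local storage access on every operation
    static NODE_POOL: RefCell<NodePool> = RefCell::new(NodePool { free: Vec::new() });
}

// (1) the call itself is an FFI boundary, so LLVM cannot inline or eliminate it
#[no_mangle]
pub extern "C" fn patch_seq_push_int(stack: *mut StackNode, value: i64) -> *mut StackNode {
    NODE_POOL.with(|pool| {
        let mut pool = pool.borrow_mut(); // (3) RefCell borrow check
        // (4) pool allocation logic; falls back to a fresh allocation when empty
        let node = pool.free.pop().unwrap_or_else(|| {
            Box::into_raw(Box::new(StackNode { tag: 0, value: 0, next: std::ptr::null_mut() }))
        });
        unsafe {
            (*node).tag = 0; // Int
            (*node).value = value;
            (*node).next = stack;
        }
        node // (5) return value passed back to the generated code
    })
}
```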

Changes Made (Marginal Improvements)

  1. Reduced default stack size: 1MB → 128KB (~16% faster)
  2. Made pool stats debug-only: #[cfg(debug_assertions)] (minimal impact)

Proposed Solution: Inline LLVM IR Generation

Instead of:

%1 = call ptr @patch_seq_push_int(ptr %0, i64 42)

Generate:

; Inline push_int - no FFI call
%node = call ptr @pool_alloc_fast()      ; or inline further
%tag_ptr = getelementptr %Value, ptr %node, i32 0, i32 0
store i64 0, ptr %tag_ptr                 ; tag = Int (0)
%val_ptr = getelementptr %Value, ptr %node, i32 0, i32 1
store i64 42, ptr %val_ptr                ; value = 42
%next_ptr = getelementptr %StackNode, ptr %node, i32 0, i32 1
store ptr %0, ptr %next_ptr               ; next = old stack

Implementation Roadmap

Phase 1: Inline Integer Operations (High Impact)

  • push_int - just allocate node + store tag + store value
  • drop_op - just read next pointer + return to pool
  • add, subtract, multiply, divide - pop 2, compute, push result

Phase 2: Inline Stack Shuffles (Medium Impact)

  • dup - read top value, push copy
  • swap - swap top two node values (no alloc needed!)
  • over, pick - traverse + copy

Phase 3: Specialize for Integers (Medium Impact)

  • Type inference pass to identify int-only code paths
  • Generate specialized versions without Value enum overhead

Estimated Impact

| Phase | Effort | Expected Speedup |
|-------|--------|------------------|
| Phase 1 | 2-3 days | 2-3x |
| Phase 2 | 1-2 days | 1.5x |
| Phase 3 | 1 week | 2x |

Combined potential: 6-10x improvement, bringing Seq to ~3-5x of Go (similar to native Rust + May baseline of 9.6x).

Files to Modify

  • crates/compiler/src/codegen.rs - Main codegen logic
  • crates/compiler/src/llvm_types.rs - Add StackNode/Value type definitions
  • crates/runtime/src/pool.rs - Export pool_alloc_fast for inline use

Alternative: Register-Based VM

A more radical change would be to keep values in LLVM SSA registers instead of a heap stack. This would eliminate allocation entirely for local computations but requires significant architectural changes.
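
To make the contrast concrete, a toy sketch (purely illustrative, not the Seq codegen) of evaluating `(1 + 2) * 3` under each model:

```rust
// Heap-stack model (current): every intermediate value is a heap node, so this
// tiny expression costs four allocations plus pointer chasing to pop operands.
struct Node { value: i64, next: Option<Box<Node>> }

fn push(stack: Option<Box<Node>>, value: i64) -> Option<Box<Node>> {
    Some(Box::new(Node { value, next: stack }))
}

fn eval_heap() -> i64 {
    let s = push(None, 1);
    let s = push(s, 2);
    let Node { value: b, next } = *s.unwrap();    // pop 2
    let Node { value: a, next } = *next.unwrap(); // pop 1
    let s = push(next, a + b);                    // push 3
    let s = push(s, 3);
    let Node { value: y, next } = *s.unwrap();
    let Node { value: x, .. } = *next.unwrap();
    x * y // 9
}

// Register model (the alternative): the same computation as plain locals, which
// LLVM lowers to SSA values living in machine registers -- no allocation at all.
fn eval_registers() -> i64 {
    let sum = 1 + 2;
    sum * 3 // 9
}
```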

navicore commented 2025-12-18 05:31:33 +00:00 (Migrated from github.com)

Thanks to this test we learned the huge cost of our architecture: codegen produces code that performs a malloc and dealloc, plus an FFI call, on every stack op.

The solution is https://github.com/navicore/patch-seq/issues/111

navicore commented 2025-12-21 04:48:29 +00:00 (Migrated from github.com)

We're now even worse in the skynet benchmark compared to Go, despite implementing https://github.com/navicore/patch-seq/pull/112

navicore commented 2025-12-21 05:01:40 +00:00 (Migrated from github.com)

Update: Zero-Mutex Channel Implementation

What We Tried

Eliminated the mutex-protected channel registry. Channels are now passed directly as Value::Channel(Arc<ChannelData>) on the stack - zero mutex, zero HashMap lookup on send/receive.
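
A sketch of the shape of that change (type and field names are illustrative; the real runtime presumably wraps May's channel primitives rather than std's):

```rust
use std::sync::{mpsc, Arc, Mutex};

// Illustrative stand-in for the runtime's channel payload.
struct ChannelData {
    sender: mpsc::Sender<Value>,
    receiver: Mutex<mpsc::Receiver<Value>>, // Receiver is !Sync, so guard it
}

enum Value {
    Int(i64),
    Channel(Arc<ChannelData>), // the handle itself lives on the Seq stack
}

// Before: send(chan_id) locked a global Mutex<HashMap<..>> to resolve the id.
// After: the Arc is popped straight off the stack -- no registry, no mutex.
fn chan_send(chan: &Arc<ChannelData>, v: Value) {
    let _ = chan.sender.send(v);
}

fn chan_receive(chan: &Arc<ChannelData>) -> Option<Value> {
    chan.receiver.lock().ok()?.recv().ok()
}
```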

Results

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| pingpong | Go 1.58x faster | Go 1.36x faster | +14% |
| fanout | Go 1.15x faster | Seq 1.18x faster | +35% |
| skynet | Go 28x faster | Go 35x faster | -24% |

Analysis

Channel-heavy workloads improved significantly. Skynet regressed - likely due to Arc refcount overhead during stack cloning on spawn (1M spawns, minimal channel use).

Corrections to Original Issue

Some items listed as "optimization opportunities" were already completed:

  • Stack size: Already 128KB (not 1MB as stated)
  • Inline stack ops: Arithmetic, comparisons, stack shuffling generate inline LLVM IR
  • Channel handles: Now zero-mutex with direct Arc handles

Remaining Bottlenecks to Investigate

  1. Strand registry overhead - Every spawn calls:

    • SystemTime::now() - this is a syscall
    • O(1024) scan to find free slot
    • O(1024) scan again on completion
  2. May coroutine overhead - Original profiling showed May itself accounts for ~9.6x vs Go.

  3. Arc clone on spawn - Stack clone now clones Arc refcounts for Channel values.

Suggested Next Steps

  1. Put diagnostics behind a feature flag - The strand registry (SystemTime::now() + O(n) scans) is only needed for kill -3 production debugging. Add a Cargo feature like diagnostics (on by default) that can be disabled for benchmarks (a runtime-side sketch of the gating follows this list):

    [features]
    default = ["diagnostics"]
    diagnostics = []
    
  2. Profile spawn path to quantify each component's cost

  3. Investigate if stack cloning can be optimized for spawn (lazy clone, copy-on-write, etc.)
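
The runtime-side counterpart of the `diagnostics` feature from point 1 might look like the following (function name is illustrative), paired with the Cargo feature above:

```rust
// With the feature on (the default), spawn registers the strand as today; with
// it off, the whole body compiles away and the O(1024) scans plus the
// SystemTime::now() syscall disappear from the spawn path.
#[cfg(feature = "diagnostics")]
fn register_strand(strand_id: u64) {
    let started = std::time::SystemTime::now(); // syscall
    // ... O(1024) scan for a free slot, record `strand_id` + `started` ...
    let _ = (strand_id, started);
}

#[cfg(not(feature = "diagnostics"))]
#[inline(always)]
fn register_strand(_strand_id: u64) {}
```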

navicore commented 2025-12-21 05:47:01 +00:00 (Migrated from github.com)

Final Analysis: Spawn Overhead vs Message Passing

Root Cause Identified

The skynet slowdown is syscall overhead during spawn, not ongoing runtime overhead:

| Benchmark | Pattern | Seq System Time | vs Go |
|-----------|---------|-----------------|-------|
| Pingpong | 2 strands, 1M messages | 3.3ms (1.4%) | 1.21x slower |
| Skynet | 100k strands, minimal work | 18,000ms (300%) | 35x slower |

May's coroutine implementation uses mmap/munmap with guard pages for each stack, which requires syscalls at spawn/teardown time. Go's goroutines start with small stacks allocated from the Go heap and grow by copying, so spawning a goroutine involves no per-goroutine mmap syscall.

Practical Implications

Long-lived actor systems are fine:

  • Spawn 1M actors at startup: ~60 seconds one-time cost
  • Send millions of messages: Nearly as fast as Go (1.2x)
  • Actors running for hours: No ongoing syscall overhead

Spawn-heavy patterns are slow:

  • Skynet-style "spawn, do minimal work, exit" patterns pay full syscall cost
  • This is a pathological benchmark, not representative of real actor workloads

Tuning Attempted

| Setting | Result |
|---------|--------|
| Pool capacity 1000→10000 | User time ↓66%, but syscalls still dominate |
| rand_work_steal feature | No improvement |
| Diagnostics feature flag | No measurable impact on skynet |

Changes Made (will merge)

  1. diagnostics feature flag - Gates the strand registry + SIGQUIT handler (kept on by default for kill -3 production debugging; disable for benchmarks)
  2. Pool capacity increased - 1000→10000 default reduces allocations
  3. rand_work_steal enabled - May help with work-stealing contention

Recommendation: Won't Fix

The fundamental issue is May's stack allocation strategy vs Go's runtime. Fixing this would require:

  • Forking/modifying May
  • Writing a custom coroutine library
  • Switching to a different approach entirely

For real-world actor systems, the current performance is acceptable. Skynet is a synthetic benchmark that specifically stress-tests spawn overhead.

Documentation

Will document:

  1. Performance characteristics (spawn overhead vs message passing)
  2. diagnostics feature flag for production tuning
  3. Environment variables: SEQ_STACK_SIZE, SEQ_POOL_CAPACITY
navicore commented 2025-12-21 06:10:55 +00:00 (Migrated from github.com)

The long-lived actor use case is well supported, at close to Go levels of performance. The skynet challenge is interesting but not a priority for Seq.
