Atomic.fetch_and_add on every CEK step added unnecessary overhead. The step counter is per-invocation (not shared across threads), so a plain int ref with incr is sufficient and faster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
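
For illustration only, a minimal sketch of the two counting styles this change swaps between. The function names run_atomic and run_plain are hypothetical and do not come from the evaluator; the loop stands in for the CEK step loop, and the sketch assumes the counter never escapes the invocation that created it.

```ocaml
(* Hypothetical sketch: run_atomic and run_plain are illustrative names,
   not functions from the actual evaluator. *)

(* Before: an atomic counter, bumped with an atomic read-modify-write
   on every step. *)
let run_atomic (n : int) : int =
  let steps = Atomic.make 0 in
  for _i = 1 to n do
    ignore (Atomic.fetch_and_add steps 1)
  done;
  Atomic.get steps

(* After: a plain mutable counter. This is only safe under the assumption
   that a single thread owns the counter for the whole invocation. *)
let run_plain (n : int) : int =
  let steps = ref 0 in
  for _i = 1 to n do
    incr steps
  done;
  !steps

(* Both variants count the same number of steps. *)
let () = assert (run_atomic 1000 = run_plain 1000)
```

Both versions produce the same count; the plain ref simply avoids the atomic operation when no other thread can observe the counter.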