PhoenixSim Job System — Gap Analysis & Roadmap
This document captures a gap analysis between PhoenixSim’s current job/threading stack and the modern work-stealing, fiber-aware design described in Job Systems for Game Engines (Mighty Professional, 2026). It is the planning artifact for a series of staged improvements; each section ends with a concrete next step.
Current architecture (baseline)
| Concern | Implementation |
|---|---|
| Worker pool | Phoenix::ThreadPool — min(hw_concurrency, 8) - 1 OS threads, one global instance, multiplexed across all worlds (src/PhoenixSim/Parallel.h:62, tests/TestRTS/app.cpp:313). |
| Submission queue | Single Vyukov bounded MPMC ring buffer (src/PhoenixSim/Containers/MPMCQueue.h). Capacity defaults to 1024. |
| Task payload | Phoenix::Task wrapping std::function<void()> + std::shared_ptr<TaskHandle> (Parallel.h:30). |
| DAG scheduler | ECS::JobScheduler with RemainingPredecessors atomic per node; auto-derives implicit edges from component Read/Write conflicts (src/PhoenixSim/ECS/JobScheduler.h, SystemJob.h:32-65). |
| Idle strategy | Exponential-backoff PHX_THREAD_PAUSE() on the worker side, plain std::this_thread::yield() on caller-side waits. No condition variables, no futex. |
| Range parallelism | ParallelForEach(n, fn) and ParallelRange(n, minRange, fn) (Parallel.h:123-185). |
| Fibers / mid-job wait | None. Waits are expressed by splitting into two jobs with a dependency edge. |
What the design gets right
- One worker per core minus one, OS threads only — matches the “threads about the machine, tasks about the work” rule.
- Vyukov MPMC is textbook lock-free and ABA-safe via per-cell sequence numbers.
- PHX_THREAD_PAUSE() exponential backoff on the worker idle path is correct.
- Component-access dependency derivation in the ECS scheduler is better than Unity’s runtime AtomicSafetyHandle: it’s a schedule-time guarantee, not an editor-only check.
- ParallelRange enforces a minimum granularity, the right way to express data-parallel work per the Cilk-5 work-first principle.
Gaps, ranked by performance impact
1. False sharing on the MPMC hotspot
Containers/MPMCQueue.h:101-102:
std::atomic<size_t> EnqueuePos;
std::atomic<size_t> DequeuePos;
Adjacent 8-byte atomics; every producer writes the first, every consumer writes the second. The cache line ping-pongs between cores on every operation. The same pattern exists on the ThreadPool hot atomics (Done, ActiveWorkerCount, SpinningWorkerCount — Parallel.h:85-88).
Fix: alignas(64) on each hot atomic. Pad Cell to a full line if sizeof(T) + sizeof(atomic<size_t>) < 64. Expect 1.5–3× throughput on the queue itself on x86; more on Apple Silicon (128-byte lines).
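A minimal sketch of the padding described above, assuming 64-byte cache lines. PaddedQueueCounters, PaddedCell, kCacheLine, and CounterSeparationBytes are hypothetical names for illustration, not the contents of MPMCQueue.h:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte lines (typical x86-64). Apple Silicon would want 128.
inline constexpr std::size_t kCacheLine = 64;

// Hypothetical stand-in for the two hot counters in the queue.
struct PaddedQueueCounters {
    alignas(kCacheLine) std::atomic<std::size_t> EnqueuePos{0};  // written by producers
    alignas(kCacheLine) std::atomic<std::size_t> DequeuePos{0};  // written by consumers
};

// Each counter now owns a full line, so producers and consumers stop
// invalidating each other's line on every operation.
static_assert(alignof(PaddedQueueCounters) == kCacheLine);
static_assert(sizeof(PaddedQueueCounters) == 2 * kCacheLine);

// Hypothetical cell layout: alignas on the struct rounds sizeof up to a
// multiple of the line size, so adjacent cells never share a line.
template <class T>
struct alignas(kCacheLine) PaddedCell {
    std::atomic<std::size_t> Sequence{0};
    T Value{};
};
static_assert(sizeof(PaddedCell<int>) == kCacheLine);

// Runtime check of the layout: distance in bytes between the two counters.
inline std::size_t CounterSeparationBytes() {
    PaddedQueueCounters counters;
    const auto enq = reinterpret_cast<std::uintptr_t>(&counters.EnqueuePos);
    const auto deq = reinterpret_cast<std::uintptr_t>(&counters.DequeuePos);
    assert(enq % kCacheLine == 0);
    return static_cast<std::size_t>(deq - enq);
}
```

Where the toolchain provides it, C++17's std::hardware_destructive_interference_size is the portable spelling of the line-size constant; the hard-coded 64 here is only an assumption for the sketch.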
2. No work stealing — single global queue is the contention point
Every submit and every dequeue races on the same EnqueuePos/DequeuePos. The Chase-Lev pattern replaces this with N per-worker deques where the owner uses near-zero-cost push/pop (relaxed bottom counter) and only thieves do CAS. Owner-side cost drops to “tens of nanoseconds” vs. “hundreds” for steals.
For PhoenixSim’s typical millisecond-scale physics batches, the gap is hidden. It becomes visible with:
- ParallelForEach(N, fn) and small-N fan-outs from one big archetype.
- Scaling beyond 8–16 cores.
Fix path: port a Chase-Lev deque (use the Lê et al. 2013 weak-memory-model variant — important for Switch/ARM targets). Keep the global MPMC as the submission inbox; drain it into per-worker deques.
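The owner/thief asymmetry can be sketched as follows. This is a simplified fixed-capacity illustration that follows the fence placement of the Lê et al. C11 formulation; it is not the repository's deque. ChaseLevSketch and its member names are hypothetical, the resize path is omitted (Push reports failure when full), and Steal does not distinguish "empty" from "lost the race":

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <type_traits>
#include <vector>

template <class T>
class ChaseLevSketch {
    static_assert(std::is_trivially_copyable_v<T>, "slots are std::atomic<T>");
public:
    explicit ChaseLevSketch(std::size_t capacityPow2)
        : Mask(static_cast<std::int64_t>(capacityPow2) - 1), Slots(capacityPow2) {
        assert(capacityPow2 > 0 && (capacityPow2 & (capacityPow2 - 1)) == 0);
    }

    // Owner only: no CAS on the push path.
    bool Push(T value) {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed);
        const std::int64_t t = Top.load(std::memory_order_acquire);
        if (b - t > Mask) return false;  // full; a real deque would grow here
        Slot(b).store(value, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        Bottom.store(b + 1, std::memory_order_relaxed);
        return true;
    }

    // Owner only: LIFO pop; CAS only when racing thieves for the last element.
    std::optional<T> Pop() {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed) - 1;
        Bottom.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = Top.load(std::memory_order_relaxed);
        if (t > b) {  // deque was empty; undo the reservation
            Bottom.store(b + 1, std::memory_order_relaxed);
            return std::nullopt;
        }
        const T value = Slot(b).load(std::memory_order_relaxed);
        if (t == b) {  // last element: settle ownership through Top
            const bool won = Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            Bottom.store(b + 1, std::memory_order_relaxed);
            if (!won) return std::nullopt;
        }
        return value;
    }

    // Any thread: FIFO steal from the top, always through a CAS.
    std::optional<T> Steal() {
        std::int64_t t = Top.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        const std::int64_t b = Bottom.load(std::memory_order_acquire);
        if (t >= b) return std::nullopt;  // empty
        const T value = Slot(t).load(std::memory_order_relaxed);
        if (!Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed))
            return std::nullopt;  // lost the race to the owner or another thief
        return value;
    }

private:
    std::atomic<T>& Slot(std::int64_t index) {
        return Slots[static_cast<std::size_t>(index & Mask)];
    }

    std::int64_t Mask;
    std::vector<std::atomic<T>> Slots;
    std::atomic<std::int64_t> Top{0};
    std::atomic<std::int64_t> Bottom{0};
};
```

Signed 64-bit indices are used so that Pop on an empty deque (Bottom temporarily at -1) compares correctly against Top. In a pool, each worker would own one such deque and drain the global inbox into it before falling back to stealing.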
3. ParallelForEach violates the granularity rule
Parallel.h:123-131 submits one task per element. The Cilk-5 work-first heuristic: each job should do 10–100× more work than the queue operation costs. Push/pop here is ~100 ns minimum (CAS + sequence store + shared_ptr alloc), so per-element jobs need ≥1 µs of real work to break even.
Fix: make ParallelForEach a shim that forwards to ParallelRange with a default minRange. Better yet, deprecate it in favor of ParallelRange everywhere.
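A sketch of the shim, under stated assumptions: the range callback is assumed to take (begin, end), kDefaultMinRange is a hypothetical default, and ParallelRange here is a serial stand-in that only counts the tasks a real pool would submit:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

namespace Sketch {

// Counts how many tasks the stand-in would have submitted to the pool.
inline std::size_t g_SubmittedTasks = 0;

// Serial stand-in for ParallelRange(n, minRange, fn).
// Assumption: fn is invoked as fn(begin, end) over half-open chunks.
template <class Fn>
void ParallelRange(std::size_t n, std::size_t minRange, Fn&& fn) {
    assert(minRange > 0);
    for (std::size_t begin = 0; begin < n; begin += minRange) {
        const std::size_t end = std::min(n, begin + minRange);
        ++g_SubmittedTasks;  // a real implementation would Submit() here
        fn(begin, end);
    }
}

// Hypothetical default granularity; the right value depends on per-element cost.
inline constexpr std::size_t kDefaultMinRange = 64;

// The shim: one task per chunk instead of one task per element.
template <class Fn>
void ParallelForEach(std::size_t n, Fn&& fn) {
    ParallelRange(n, kDefaultMinRange, [&fn](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            fn(i);
        }
    });
}

}  // namespace Sketch
```

With these assumed numbers, 1000 elements become 16 submissions rather than 1000, so the fixed queue cost is amortized over up to 64 elements per task.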
4. std::function + std::shared_ptr<TaskHandle> per task
Every Submit() performs:
- make_shared<TaskHandle>(): heap alloc + atomic refcount init.
- std::function move: heap alloc if capture exceeds the SBO (typically 16–32 B).
- TryEnqueue(Task): copies the Task (incl. another refcount bump).
- Return shared_ptr: refcount bump.
Three atomic ops and 1–2 allocations per task before any work runs. Reinalter’s Molecular Matters reference design uses per-thread linear allocators, intrusive sibling lists for dependents, and raw pointers tracked via a generational pool.
Fix path: per-worker linear allocator for Task instances, reset between frames. Replace std::function with an in-place 48-byte callable buffer (Delegates.h is the natural extension point). Replace shared_ptr<TaskHandle> with TaskHandle* + generation counter.
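One way the in-place callable could look, as a hedged sketch. InlineTask is a hypothetical name (not the type in Delegates.h); it is move-only, has no heap fallback, and rejects oversized captures at compile time:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

class InlineTask {
public:
    static constexpr std::size_t kCapacity = 48;  // buffer size from the fix path

    InlineTask() = default;

    template <class Fn, class F = std::decay_t<Fn>,
              class = std::enable_if_t<!std::is_same_v<F, InlineTask>>>
    InlineTask(Fn&& fn) {
        static_assert(sizeof(F) <= kCapacity, "capture exceeds the inline buffer");
        static_assert(alignof(F) <= alignof(std::max_align_t), "over-aligned capture");
        static_assert(std::is_nothrow_move_constructible_v<F>, "moves must not throw");
        ::new (static_cast<void*>(Storage)) F(std::forward<Fn>(fn));
        // Type-erased operations, stored as plain function pointers.
        InvokeFn = [](void* self) { (*std::launder(static_cast<F*>(self)))(); };
        RelocateFn = [](void* dst, void* src) {
            F* from = std::launder(static_cast<F*>(src));
            ::new (dst) F(std::move(*from));
            from->~F();
        };
        DestroyFn = [](void* self) { std::launder(static_cast<F*>(self))->~F(); };
    }

    InlineTask(InlineTask&& other) noexcept { MoveFrom(other); }
    InlineTask& operator=(InlineTask&& other) noexcept {
        if (this != &other) {
            Reset();
            MoveFrom(other);
        }
        return *this;
    }
    InlineTask(const InlineTask&) = delete;
    InlineTask& operator=(const InlineTask&) = delete;
    ~InlineTask() { Reset(); }

    explicit operator bool() const { return InvokeFn != nullptr; }

    void operator()() {
        assert(InvokeFn != nullptr);
        InvokeFn(Storage);
    }

private:
    void Reset() {
        if (DestroyFn != nullptr) DestroyFn(Storage);
        InvokeFn = nullptr;
        RelocateFn = nullptr;
        DestroyFn = nullptr;
    }

    // Moves the stored callable out of other and leaves other empty.
    void MoveFrom(InlineTask& other) {
        if (other.InvokeFn == nullptr) return;
        other.RelocateFn(Storage, other.Storage);
        InvokeFn = other.InvokeFn;
        RelocateFn = other.RelocateFn;
        DestroyFn = other.DestroyFn;
        other.InvokeFn = nullptr;
        other.RelocateFn = nullptr;
        other.DestroyFn = nullptr;
    }

    alignas(std::max_align_t) unsigned char Storage[kCapacity];
    void (*InvokeFn)(void*) = nullptr;
    void (*RelocateFn)(void*, void*) = nullptr;
    void (*DestroyFn)(void*) = nullptr;
};
```

The compile-time size check is the design point: a capture that would have silently triggered a std::function heap allocation becomes a build error at the submit site instead.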
5. No fibers / no inline WaitForCounter
TaskHandle::WaitForCompleted (Parallel.cpp:22-34) is a spin with yield(). If a worker thread enters it from inside a job body, the worker stalls — exactly the Naughty Dog pain point. PhoenixSim mitigates by forcing continuation-passing style via the JobScheduler DAG: split into two jobs joined by a dependency edge.
That’s a defensible choice. But it forces every “wait for X” pattern into two-job form, which is invasive for:
- PhysicsSystem’s iterative solver (currently uses bespoke scheduler instances per phase).
- Scripting (PhoenixLua) calling into engine code that wants to wait.
Fix: introduce stackful fibers (per Naughty Dog GDC 2015 / Marl) so WaitForCounter can park a fiber and switch to another runnable fiber on the same worker.
6. Caller-side waits don’t use PAUSE
The worker idle path uses PHX_THREAD_PAUSE(). The caller-side waits don’t:
- TaskHandle::WaitForCompleted (Parallel.cpp:22-34)
- ThreadPool::WaitIdle (Parallel.cpp:194-205)
- JobScheduler::Execute final drain (JobScheduler.cpp:282-283)
All are yield()-only. Adding a 64-iter PAUSE loop before each yield() cuts wake latency from microseconds to nanoseconds when the wait is short. Matters most at end-of-frame fences.
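A sketch of the spin-then-yield wait described above. CpuRelax and WaitUntilSet are hypothetical names; CpuRelax stands in for PHX_THREAD_PAUSE(), whose actual definition is not shown in this document:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void CpuRelax() { _mm_pause(); }  // x86 PAUSE hint
#elif defined(__aarch64__)
inline void CpuRelax() { __asm__ __volatile__("yield"); }  // AArch64 YIELD hint
#else
inline void CpuRelax() {}  // fallback: plain busy loop
#endif

// Spins with the CPU relax hint for a short budget before each yield().
// Returns how many times it fell back to yield(), which is handy for stats.
inline unsigned WaitUntilSet(const std::atomic<bool>& flag, int spinIterations = 64) {
    assert(spinIterations > 0);
    unsigned yields = 0;
    while (!flag.load(std::memory_order_acquire)) {
        for (int i = 0; i < spinIterations; ++i) {
            if (flag.load(std::memory_order_acquire)) {
                return yields;  // short wait satisfied without a yield
            }
            CpuRelax();
        }
        std::this_thread::yield();  // long wait: give the core away
        ++yields;
    }
    return yields;
}
```

Short waits complete inside the spin loop and never reach the OS scheduler; only waits that outlast the spin budget pay for yield().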
7. Memory ordering review for weak memory models
The MPMC is correct (textbook Vyukov). JobScheduler mixes orderings — RemainingPredecessors.store(..., memory_order_relaxed) on init and fetch_sub(..., memory_order_acq_rel) on decrement. Should be fine because the init store happens-before any worker observes the node via the queue, but worth a TSan run and (ideally) a test on an Apple M-series box before any Switch-class deployment.
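The ordering pattern under review can be reduced to the following sketch. NodeSketch, InitNode, and CompletePredecessor are hypothetical names that mirror the description above, not the JobScheduler source:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical reduction of a DAG node to its dependency counter.
struct NodeSketch {
    std::atomic<std::uint32_t> RemainingPredecessors{0};
};

// Runs during graph setup, before any worker can reach the node.
// Relaxed is enough only because publication to workers goes through a
// release/acquire edge elsewhere (assumed here to be the queue's enqueue/dequeue).
inline void InitNode(NodeSketch& node, std::uint32_t predecessors) {
    node.RemainingPredecessors.store(predecessors, std::memory_order_relaxed);
}

// Called by each predecessor when it finishes. Returns true for the caller
// that drops the count to zero and must therefore schedule the node.
// Release half: publishes this predecessor's writes.
// Acquire half: the final caller observes the writes of all earlier callers.
inline bool CompletePredecessor(NodeSketch& node) {
    const std::uint32_t before =
        node.RemainingPredecessors.fetch_sub(1, std::memory_order_acq_rel);
    assert(before > 0);  // more completions than predecessors is a scheduler bug
    return before == 1;
}
```

The part a TSan run and an ARM test would exercise is the assumption in InitNode: if a node ever became reachable by a worker without passing through a release/acquire edge, the relaxed store would no longer be ordered before the first fetch_sub.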
Smaller items
- No priority tiers (Naughty Dog uses low/normal/high).
- No Pipe-equivalent for arbitrary work (Unreal’s serialization primitive for non-thread-safe APIs).
- No per-worker stats exposed (counters for “jobs completed”, “jobs stolen”) — Tracy zones only.
- No std::execution-style sender/receiver vocabulary (long-term consideration only).
Implementation status
| Step | Commit(s) | Status |
|---|---|---|
| Docs + pinning tests | 35ec511 | ✅ landed |
| #1 Cache-line padding | 8684ae9 | ✅ landed |
| #6 PAUSE backoff on caller-side waits; real PAUSE on POSIX | 7f2ce68 | ✅ landed |
| #4 TInlineCallable<void(), 128> in Task body | a1c0fc4 | ✅ landed |
| #3 ParallelForEach shim + InFlight counter (fixes WaitIdle TOCTOU) | 2e20825 | ✅ landed |
| #2a TChaseLevDeque<T> template + tests | 794346f | ✅ landed |
| #2b/c Slab + per-worker deques + submission inbox in ThreadPool | af0069b | ✅ landed |
| #5a ECS::ParallelForEntities<TComponents...> primitive | — | open |
| #5b Fibers | — | deferred (see below) |
| #7 Memory-ordering audit on ARM hardware | — | open (recheck on real Apple Silicon / Switch box) |
45 tests / 15,505 assertions, 10+ consecutive clean runs as of af0069b.
#5 fibers — deferred, with reasoning
The initial roadmap entry justified fibers with “PhysicsSystem’s iterative solver and PhoenixLua reentrancy.” A close read of src/PhoenixPhysics/PhysicsSystem.cpp and src/PhoenixSim/ECS/JobScheduler.cpp shows that the PhysicsSystem half of that justification is wrong — fibers wouldn’t change anything there.
What PhysicsSystem actually does
PhysicsSystem holds three bespoke JobScheduler instances (IntegrateVelocitiesScheduler, CalculateContactPairsScheduler, IntegrateScheduler). Each registers exactly one job and never uses the multi-job DAG features. They exist because:
- IJob<TComponents...> gives free per-archetype batching and the per-entity Execute(world, e, cb, comp...) signature.
- These jobs need to run multiple times per frame inside OnPostWorldUpdate’s iteration loop. FeatureECS::RegisterJob only runs jobs once per frame, so the bespoke schedulers are the workaround.
OnPostWorldUpdate runs on the main thread, not on a worker. Its ExecuteScheduler and WorldTaskQueue::Flush calls block the main thread while workers process — no worker is ever blocked. The “I’m a worker stuck waiting for jobs” problem fibers solve doesn’t exist here.
The actual design smell is a missing primitive
What PhysicsSystem wants is a single function:
template <class... TComponents, class Fn>
void ECS::ParallelForEntities(WorldRef world, Fn&& fn);
It resolves matching archetypes via EntityQueryBuilder, fans batches into the thread pool, joins on WaitIdle. Same primitives JobScheduler::Execute already uses, minus the multi-job DAG bookkeeping.
With that primitive, PhysicsSystem::OnPostWorldUpdate becomes a flat loop of ParallelForEntities / ParallelRange calls. The three JobScheduler members and their OnWorldInitialize setup vanish, as do the IntegrateVelocitiesJob, CalculateContactPairsJob, IntegrateJob class definitions (they collapse to lambdas at the call site).
That’s tracked as #5a above.
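A toy sketch of the shape such a primitive could take. Everything in namespace Toy is hypothetical: the world and archetype layout are invented for illustration, every archetype carries both component columns (so the EntityQueryBuilder matching step is omitted), and batches run serially where real code would submit them to the thread pool and then join:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

namespace Toy {

struct Position { float X = 0.f; };
struct Velocity { float DX = 0.f; };

// Toy archetype: one column per component type, all columns the same length.
struct Archetype {
    std::tuple<std::vector<Position>, std::vector<Velocity>> Columns;

    std::size_t Count() const {
        assert(std::get<0>(Columns).size() == std::get<1>(Columns).size());
        return std::get<0>(Columns).size();
    }
};

struct World {
    std::vector<Archetype> Archetypes;
};

// Hypothetical batch size; plays the role of minRange in ParallelRange.
inline constexpr std::size_t kBatchSize = 256;

// Components are named explicitly at the call site; Fn is deduced.
template <class... TComponents, class Fn>
void ParallelForEntities(World& world, Fn&& fn) {
    for (Archetype& archetype : world.Archetypes) {
        const std::size_t count = archetype.Count();
        for (std::size_t begin = 0; begin < count; begin += kBatchSize) {
            const std::size_t end = std::min(count, begin + kBatchSize);
            // Real code would submit this batch as one pool task.
            for (std::size_t i = begin; i < end; ++i) {
                fn(std::get<std::vector<TComponents>>(archetype.Columns)[i]...);
            }
        }
    }
    // Real code would join here before returning to the caller.
}

}  // namespace Toy
```

With a primitive of this shape, a solver phase reduces to a call such as Toy::ParallelForEntities<Toy::Position, Toy::Velocity>(world, lambda) inside the iteration loop, which is the "flat loop" form described above.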
Where fibers would still earn their keep
| Use case | Fiber benefit |
|---|---|
| PhoenixLua scripts that want to await events | Scripts run on the main thread today; any blocking wait freezes the whole sim. A fiber-aware wait_for(event) from a script would park the script’s fiber and let the main thread keep ticking. |
| Asset streaming during a frame | “Wait for the texture I/O without burning a worker.” The canonical Naughty Dog example. |
| Cooperative multi-frame computation | Path planning, AI deliberation — tasks that pause across frame boundaries. |
None of these are PhysicsSystem use cases. #5b stays on the roadmap as deferred and is only worth committing to when one of the three use cases above becomes a concrete, scheduled need. Building fibers speculatively would incur a maintenance tax (per-platform asm stack switching, debugger friction, stack sizing) for no PhysicsSystem benefit.
References
- Mighty Professional (2026). Job Systems for Game Engines. (Source diagnosis blog.)
- Gyrling, C. (2015). Parallelizing the Naughty Dog Engine Using Fibers. GDC.
- Chase, D., Lev, Y. (2005). Dynamic Circular Work-Stealing Deque. SPAA.
- Lê et al. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP.
- Frigo, Leiserson, Randall (1998). The Implementation of the Cilk-5 Multithreaded Language. PLDI.
- Reinalter, S. (2015–2016). Job System 2.0: Lock-Free Work Stealing, parts 1–5. Molecular Musings.