PhoenixSim Job System — Gap Analysis & Roadmap

This document captures a gap analysis between PhoenixSim’s current job/threading stack and the modern work-stealing, fiber-aware design described in Job Systems for Game Engines (Mighty Professional, 2026). It is the planning artifact for a series of staged improvements; each section ends with a concrete next step.

Current architecture (baseline)

| Concern | Implementation |
|---|---|
| Worker pool | Phoenix::ThreadPool with min(hw_concurrency, 8) - 1 OS threads, one global instance, multiplexed across all worlds (src/PhoenixSim/Parallel.h:62, tests/TestRTS/app.cpp:313). |
| Submission queue | Single Vyukov bounded MPMC ring buffer (src/PhoenixSim/Containers/MPMCQueue.h). Capacity defaults to 1024. |
| Task payload | Phoenix::Task wrapping std::function<void()> + std::shared_ptr<TaskHandle> (Parallel.h:30). |
| DAG scheduler | ECS::JobScheduler with RemainingPredecessors atomic per node; auto-derives implicit edges from component Read/Write conflicts (src/PhoenixSim/ECS/JobScheduler.h, SystemJob.h:32-65). |
| Idle strategy | Exponential-backoff PHX_THREAD_PAUSE() on the worker side, plain std::this_thread::yield() on caller-side waits. No condition variables, no futex. |
| Range parallelism | ParallelForEach(n, fn) and ParallelRange(n, minRange, fn) (Parallel.h:123-185). |
| Fibers / mid-job wait | None. Waits are expressed by splitting into two jobs with a dependency edge. |

What the design gets right

Gaps, ranked by performance impact

1. False sharing on the MPMC hotspot

Containers/MPMCQueue.h:101-102:

std::atomic<size_t> EnqueuePos;
std::atomic<size_t> DequeuePos;

These are adjacent 8-byte atomics: every producer writes the first and every consumer writes the second, so the cache line they share ping-pongs between cores on every operation. The same pattern exists on the ThreadPool hot atomics (Done, ActiveWorkerCount, SpinningWorkerCount; Parallel.h:85-88).

Fix: alignas(64) on each hot atomic. Pad Cell to a full line if sizeof(T) + sizeof(atomic<size_t>) < 64. Expect 1.5–3× throughput on the queue itself on x86; more on Apple Silicon (128-byte lines).
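A minimal sketch of the padding described above, assuming 64-byte cache lines. QueueCursors, PaddedCell, and kCacheLine are hypothetical names used for illustration; they are not the actual members of MPMCQueue.h.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines (typical x86). Apple Silicon would want 128.
constexpr std::size_t kCacheLine = 64;

// Hypothetical stand-in for the two queue cursors. Each atomic starts on its
// own cache line, so producers and consumers stop invalidating each other.
struct QueueCursors {
    alignas(kCacheLine) std::atomic<std::size_t> EnqueuePos{0};
    alignas(kCacheLine) std::atomic<std::size_t> DequeuePos{0};
};

// Hypothetical cell layout. alignas rounds sizeof up to a multiple of the
// line size, so neighbouring cells in the ring never share a line.
template <class T>
struct alignas(kCacheLine) PaddedCell {
    std::atomic<std::size_t> Sequence{0};
    T Data{};
};

static_assert(sizeof(QueueCursors) == 2 * kCacheLine, "one line per cursor");
static_assert(sizeof(PaddedCell<int>) % kCacheLine == 0, "cells are line-sized");
```

The static_asserts make the layout a compile-time contract, so a later member reorder that reintroduces sharing fails the build rather than silently costing throughput.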

2. No work stealing — single global queue is the contention point

Every submit and every dequeue races on the same EnqueuePos/DequeuePos. The Chase-Lev pattern replaces this with N per-worker deques where the owner uses near-zero-cost push/pop (relaxed bottom counter) and only thieves do CAS. Owner-side cost drops to “tens of nanoseconds” vs. “hundreds” for steals.

For PhoenixSim’s typical millisecond-scale physics batches, the gap is hidden. It becomes visible with:

Fix path: port a Chase-Lev deque (use the Lê et al. 2013 weak-memory-model variant — important for Switch/ARM targets). Keep the global MPMC as the submission inbox; drain it into per-worker deques.
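To make the owner/thief asymmetry concrete, here is a fixed-capacity sketch of a Chase-Lev deque. It follows the memory orderings of the Lê et al. 2013 formulation but omits buffer growth. ChaseLevDequeSketch is a hypothetical name and is not the TChaseLevDeque<T> that landed in the repository.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <type_traits>

template <class T>
class ChaseLevDequeSketch {
    static_assert(std::is_trivially_copyable_v<T>, "slots are std::atomic<T>");
public:
    explicit ChaseLevDequeSketch(std::size_t capacityPow2)
        : Mask(capacityPow2 - 1), Buffer(new std::atomic<T>[capacityPow2]()) {
        assert(capacityPow2 != 0 && (capacityPow2 & Mask) == 0);
    }

    // Owner thread only. No CAS on this path.
    bool Push(T value) {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed);
        const std::int64_t t = Top.load(std::memory_order_acquire);
        if (b - t > static_cast<std::int64_t>(Mask)) return false;  // full
        Slot(b).store(value, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        Bottom.store(b + 1, std::memory_order_relaxed);
        return true;
    }

    // Owner thread only. LIFO; CAS only when racing for the last element.
    std::optional<T> Pop() {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed) - 1;
        Bottom.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = Top.load(std::memory_order_relaxed);
        if (t > b) {  // deque was empty: undo the reservation
            Bottom.store(b + 1, std::memory_order_relaxed);
            return std::nullopt;
        }
        const T value = Slot(b).load(std::memory_order_relaxed);
        if (t == b) {  // last element: compete with thieves
            const bool won = Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            Bottom.store(b + 1, std::memory_order_relaxed);
            if (!won) return std::nullopt;
        }
        return value;
    }

    // Any thread. FIFO from the opposite end; always pays for a CAS.
    std::optional<T> Steal() {
        std::int64_t t = Top.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        const std::int64_t b = Bottom.load(std::memory_order_acquire);
        if (t >= b) return std::nullopt;  // empty
        const T value = Slot(t).load(std::memory_order_relaxed);
        if (!Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed))
            return std::nullopt;  // lost the race to the owner or another thief
        return value;
    }

private:
    std::atomic<T>& Slot(std::int64_t i) {
        return Buffer[static_cast<std::size_t>(i) & Mask];
    }
    const std::size_t Mask;
    std::unique_ptr<std::atomic<T>[]> Buffer;
    std::atomic<std::int64_t> Top{0};
    std::atomic<std::int64_t> Bottom{0};
};
```

In a pool, each worker would own one such deque holding task pointers or indices, pop from its own deque first, and call Steal on a victim only when its own deque is empty.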

3. ParallelForEach violates the granularity rule

Parallel.h:123-131 submits one task per element. The Cilk-5 work-first heuristic says each job should do 10–100× more work than the queue operation costs. Push/pop here costs at least ~100 ns (CAS + sequence store + shared_ptr alloc), so per-element jobs need ≥1 µs of real work to break even.

Fix: make ParallelForEach a shim that forwards to ParallelRange with a default minRange. Better yet, deprecate it in favor of ParallelRange everywhere.
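The shim could look roughly like the sketch below. ParallelRangeSketch is a serial stand-in so the example runs on its own; the real ParallelRange would submit each chunk to the pool. The (begin, end) callback shape and the default minRange of 256 are assumptions, not the signatures in Parallel.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Serial stand-in for ParallelRange: splits [0, n) into chunks that each hold
// at least minRange elements (when n allows it) and calls fn(begin, end) once
// per chunk. A real pool would submit each call as one task.
template <class Fn>
void ParallelRangeSketch(std::size_t n, std::size_t minRange, Fn&& fn) {
    if (n == 0) return;
    const std::size_t chunks =
        std::max<std::size_t>(1, n / std::max<std::size_t>(1, minRange));
    const std::size_t base = n / chunks;
    const std::size_t extra = n % chunks;  // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t end = begin + base + (c < extra ? 1 : 0);
        fn(begin, end);
        begin = end;
    }
}

// Per-element API kept as a thin shim: one task per chunk, not per element.
// The default minRange is a placeholder to be tuned by measurement.
template <class Fn>
void ParallelForEachSketch(std::size_t n, Fn&& fn, std::size_t minRange = 256) {
    ParallelRangeSketch(n, minRange, [&fn](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) fn(i);
    });
}
```

Existing callers keep their per-element lambdas, while the number of queue operations drops from n to roughly n / minRange.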

4. std::function + std::shared_ptr<TaskHandle> per task

Every Submit() performs:

  1. make_shared<TaskHandle>() — heap alloc + atomic refcount init.
  2. std::function move — heap alloc if capture exceeds the SBO (typically 16–32 B).
  3. TryEnqueue(Task) — copies the Task (incl. another refcount bump).
  4. Return shared_ptr — refcount bump.

Three atomic ops and 1–2 allocations per task before any work runs. Reinalter’s Molecular Matters reference design uses per-thread linear allocators, intrusive sibling lists for dependents, and raw pointers tracked via a generational pool.

Fix path: per-worker linear allocator for Task instances, reset between frames. Replace std::function with an in-place 48-byte callable buffer (Delegates.h is the natural extension point). Replace shared_ptr<TaskHandle> with TaskHandle* + generation counter.
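As an illustration of the in-place callable idea, the sketch below stores the closure inside a fixed buffer and never touches the heap. InlineCallableSketch is a hypothetical name. It is neither the Delegates.h API nor the TInlineCallable that eventually landed, and it rejects oversized captures at compile time where std::function would fall back to an allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <std::size_t Capacity>
class InlineCallableSketch {
public:
    InlineCallableSketch() = default;

    template <class Fn, class = std::enable_if_t<
        !std::is_same_v<std::decay_t<Fn>, InlineCallableSketch>>>
    InlineCallableSketch(Fn&& fn) {
        using F = std::decay_t<Fn>;
        static_assert(sizeof(F) <= Capacity, "capture too large for the buffer");
        static_assert(alignof(F) <= alignof(std::max_align_t), "over-aligned");
        static_assert(std::is_nothrow_move_constructible_v<F>, "need noexcept move");
        ::new (static_cast<void*>(Storage)) F(std::forward<Fn>(fn));
        Invoke = [](void* p) { (*std::launder(static_cast<F*>(p)))(); };
        // Moves the closure into dst (when dst is non-null), then destroys src.
        Relocate = [](void* dst, void* src) {
            F* s = std::launder(static_cast<F*>(src));
            if (dst) ::new (dst) F(std::move(*s));
            s->~F();
        };
    }

    InlineCallableSketch(InlineCallableSketch&& other) noexcept { MoveFrom(other); }
    InlineCallableSketch& operator=(InlineCallableSketch&& other) noexcept {
        if (this != &other) { Reset(); MoveFrom(other); }
        return *this;
    }
    InlineCallableSketch(const InlineCallableSketch&) = delete;
    InlineCallableSketch& operator=(const InlineCallableSketch&) = delete;
    ~InlineCallableSketch() { Reset(); }

    explicit operator bool() const noexcept { return Invoke != nullptr; }
    void operator()() { assert(Invoke && "calling an empty callable"); Invoke(Storage); }

private:
    void Reset() noexcept {
        if (Relocate) Relocate(nullptr, Storage);  // destroy only
        Invoke = nullptr;
        Relocate = nullptr;
    }
    void MoveFrom(InlineCallableSketch& other) noexcept {
        if (!other.Relocate) return;
        other.Relocate(Storage, other.Storage);
        Invoke = other.Invoke;
        Relocate = other.Relocate;
        other.Invoke = nullptr;
        other.Relocate = nullptr;
    }

    alignas(std::max_align_t) unsigned char Storage[Capacity];
    void (*Invoke)(void*) = nullptr;
    void (*Relocate)(void*, void*) = nullptr;
};
```

The type is move-only on purpose: a task is handed to exactly one worker, so copy support would only add refcount-style overhead back in.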

5. No fibers / no inline WaitForCounter

TaskHandle::WaitForCompleted (Parallel.cpp:22-34) is a spin with yield(). If a worker thread enters it from inside a job body, the worker stalls — exactly the Naughty Dog pain point. PhoenixSim mitigates by forcing continuation-passing style via the JobScheduler DAG: split into two jobs joined by a dependency edge.

That’s a defensible choice. But it forces every “wait for X” pattern into two-job form, which is invasive for:

Fix: introduce stackful fibers (per Naughty Dog GDC 2015 / Marl) so WaitForCounter can park a fiber and switch to another runnable fiber on the same worker.

6. Caller-side waits don’t use PAUSE

The worker idle path uses PHX_THREAD_PAUSE(). The caller-side waits don’t:

All are yield()-only. Adding a 64-iter PAUSE loop before each yield() cuts wake latency from microseconds to nanoseconds when the wait is short. Matters most at end-of-frame fences.
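A sketch of the spin-then-yield pattern, assuming the wait condition is a single atomic flag. SKETCH_CPU_PAUSE and WaitUntilSet are hypothetical names standing in for PHX_THREAD_PAUSE() and the real wait functions, and the yield count is returned only to make the behaviour observable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for PHX_THREAD_PAUSE(): a CPU spin hint where one is
// known to exist, a no-op elsewhere.
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_CPU_PAUSE() _mm_pause()
#elif defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_CPU_PAUSE() __asm__ __volatile__("yield")
#else
#define SKETCH_CPU_PAUSE() ((void)0)
#endif

// Spins for up to spinIters pause hints before each yield(), so a short wait
// completes without a trip through the OS scheduler.
// Returns how many times the caller had to yield.
inline int WaitUntilSet(const std::atomic<bool>& flag, int spinIters = 64) {
    assert(spinIters >= 0);
    int yields = 0;
    while (!flag.load(std::memory_order_acquire)) {
        for (int i = 0; i < spinIters; ++i) {
            SKETCH_CPU_PAUSE();
            if (flag.load(std::memory_order_acquire)) return yields;
        }
        std::this_thread::yield();
        ++yields;
    }
    return yields;
}
```

Logging the returned count at end-of-frame fences would be one way to check whether 64 iterations is the right budget on a given target.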

7. Memory ordering review for weak memory models

The MPMC is correct (textbook Vyukov). JobScheduler mixes orderings: RemainingPredecessors.store(..., memory_order_relaxed) on init and fetch_sub(..., memory_order_acq_rel) on decrement. This should be fine because the init store happens-before any worker observes the node via the queue, but it is worth a TSan run and (ideally) a test on an Apple M-series box before any Switch-class deployment.
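The ordering argument can be written down as a small sketch. DagNodeSketch, InitNode, and ReleasePredecessor are hypothetical names and not the JobScheduler code. The sketch also assumes the node is published to workers through a release operation, such as the queue push.

```cpp
#include <atomic>
#include <cassert>

struct DagNodeSketch {
    std::atomic<int> RemainingPredecessors{0};
};

// Runs before the node is visible to any worker. Relaxed is enough only
// because the later publication (assumed: a release push into the queue)
// orders this store before any worker's first access to the node.
inline void InitNode(DagNodeSketch& node, int predecessorCount) {
    node.RemainingPredecessors.store(predecessorCount, std::memory_order_relaxed);
}

// Called once by each finishing predecessor. The release half publishes that
// predecessor's writes; the acquire half lets whoever performs the final
// decrement see the writes of every earlier predecessor.
// Returns true for the one caller that must schedule the node.
inline bool ReleasePredecessor(DagNodeSketch& node) {
    const int before =
        node.RemainingPredecessors.fetch_sub(1, std::memory_order_acq_rel);
    assert(before > 0 && "released more predecessors than were registered");
    return before == 1;
}
```

If the publication step were ever weakened to relaxed, the init store would lose its happens-before edge, which is the kind of regression a TSan run or a weak-memory target would expose.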

Smaller items

Implementation status

| Step | Commit(s) | Status |
|---|---|---|
| Docs + pinning tests | 35ec511 | ✅ landed |
| #1 Cache-line padding | 8684ae9 | ✅ landed |
| #6 PAUSE backoff on caller-side waits; real PAUSE on POSIX | 7f2ce68 | ✅ landed |
| #4 TInlineCallable<void(), 128> in Task body | a1c0fc4 | ✅ landed |
| #3 ParallelForEach shim + InFlight counter (fixes WaitIdle TOCTOU) | 2e20825 | ✅ landed |
| #2a TChaseLevDeque<T> template + tests | 794346f | ✅ landed |
| #2b/c Slab + per-worker deques + submission inbox in ThreadPool | af0069b | ✅ landed |
| #5a ECS::ParallelForEntities<TComponents...> primitive | | open |
| #5b Fibers | | deferred (see below) |
| #7 Memory-ordering audit on ARM hardware | | open (recheck on real Apple Silicon / Switch box) |

45 tests / 15,505 assertions, 10+ consecutive clean runs as of af0069b.

#5 fibers — deferred, with reasoning

The initial roadmap entry justified fibers with “PhysicsSystem’s iterative solver and PhoenixLua reentrancy.” A close read of src/PhoenixPhysics/PhysicsSystem.cpp and src/PhoenixSim/ECS/JobScheduler.cpp shows that the PhysicsSystem half of that justification is wrong — fibers wouldn’t change anything there.

What PhysicsSystem actually does

PhysicsSystem holds three bespoke JobScheduler instances (IntegrateVelocitiesScheduler, CalculateContactPairsScheduler, IntegrateScheduler). Each registers exactly one job and never uses the multi-job DAG features. They exist because:

OnPostWorldUpdate runs on the main thread, not on a worker. Its ExecuteScheduler and WorldTaskQueue::Flush calls block the main thread while workers process — no worker is ever blocked. The “I’m a worker stuck waiting for jobs” problem fibers solve doesn’t exist here.

The actual design smell is a missing primitive

What PhysicsSystem wants is a single function:

template <class... TComponents, class Fn>
void ECS::ParallelForEntities(WorldRef world, Fn&& fn);

It resolves matching archetypes via EntityQueryBuilder, fans batches into the thread pool, and joins on WaitIdle. These are the same primitives JobScheduler::Execute already uses, minus the multi-job DAG bookkeeping.

With that primitive, PhysicsSystem::OnPostWorldUpdate becomes a flat loop of ParallelForEntities / ParallelRange calls. The three JobScheduler members and their OnWorldInitialize setup vanish, as do the IntegrateVelocitiesJob, CalculateContactPairsJob, IntegrateJob class definitions (they collapse to lambdas at the call site).
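A toy sketch of how such a call site might read. ArchetypeSketch, ParallelForEntitiesSketch, Position, and Velocity are all hypothetical stand-ins: the batches run serially here, and the real primitive would take a WorldRef, resolve archetypes through EntityQueryBuilder, and submit each batch to the pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Toy archetype: one column (vector) per component type, Count rows.
template <class... TComponents>
struct ArchetypeSketch {
    std::tuple<std::vector<TComponents>...> Columns;
    std::size_t Count = 0;
};

// Walks every archetype in batches and calls fn once per entity with
// references to that entity's components.
template <class... TComponents, class Fn>
void ParallelForEntitiesSketch(
        std::vector<ArchetypeSketch<TComponents...>>& archetypes,
        std::size_t batchSize, Fn&& fn) {
    assert(batchSize > 0);
    for (auto& arch : archetypes) {
        for (std::size_t begin = 0; begin < arch.Count; begin += batchSize) {
            const std::size_t end = std::min(begin + batchSize, arch.Count);
            // A real implementation would submit [begin, end) as one pool task.
            for (std::size_t i = begin; i < end; ++i)
                fn(std::get<std::vector<TComponents>>(arch.Columns)[i]...);
        }
    }
    // A real implementation would join here (the WaitIdle step).
}

// Hypothetical components for the usage example.
struct Position { float X = 0.f; };
struct Velocity { float Dx = 0.f; };
```

With something of this shape, an integrate step becomes a single call taking a lambda over (Position&, const Velocity&), which is the sense in which the job classes collapse to lambdas at the call site.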

That’s tracked as #5a above.

Where fibers would still earn their keep

| Use case | Fiber benefit |
|---|---|
| PhoenixLua scripts that want to await events | Scripts run on the main thread today; any blocking wait freezes the whole sim. A fiber-aware wait_for(event) from a script would park the script’s fiber and let the main thread keep ticking. |
| Asset streaming during a frame | “Wait for the texture I/O without burning a worker.” The canonical Naughty Dog example. |
| Cooperative multi-frame computation | Path planning, AI deliberation: tasks that pause across frame boundaries. |

None of these are PhysicsSystem use cases. #5b stays open, but it is only worth committing to once one of the three use cases above becomes a concrete, scheduled need. Building fibers speculatively would impose a maintenance tax (per-platform asm stack switching, debugger friction, stack sizing) for no PhysicsSystem benefit.

References