PhoenixSim Job System — Gap Analysis & Roadmap

This document captures a gap analysis between PhoenixSim’s current job/threading stack and the modern work-stealing, fiber-aware design described in Job Systems for Game Engines (Mighty Professional, 2026). It is the planning artifact for a series of staged improvements; each section ends with a concrete next step.

Current architecture (baseline)

Concern	Implementation
Worker pool	`Phoenix::ThreadPool` — `min(hw_concurrency, 8) - 1` OS threads, one global instance, multiplexed across all worlds (`src/PhoenixSim/Parallel.h:62`, `tests/TestRTS/app.cpp:313`).
Submission queue	Single Vyukov bounded MPMC ring buffer (`src/PhoenixSim/Containers/MPMCQueue.h`). Capacity defaults to 1024.
Task payload	`Phoenix::Task` wrapping `std::function<void()>` + `std::shared_ptr<TaskHandle>` (`Parallel.h:30`).
DAG scheduler	`ECS::JobScheduler` with `RemainingPredecessors` atomic per node; auto-derives implicit edges from component Read/Write conflicts (`src/PhoenixSim/ECS/JobScheduler.h`, `SystemJob.h:32-65`).
Idle strategy	Exponential-backoff `PHX_THREAD_PAUSE()` on the worker side, plain `std::this_thread::yield()` on caller-side waits. No condition variables, no futex.
Range parallelism	`ParallelForEach(n, fn)` and `ParallelRange(n, minRange, fn)` (`Parallel.h:123-185`).
Fibers / mid-job wait	None. Waits are expressed by splitting into two jobs with a dependency edge.

What the design gets right

One worker per core minus one (capped at seven by the `min(hw_concurrency, 8) - 1` baseline), OS threads only: matches the “threads about the machine, tasks about the work” rule.
Vyukov MPMC is textbook lock-free and ABA-safe via per-cell sequence numbers.
PHX_THREAD_PAUSE() exponential backoff on the worker idle path is correct.
Component-access dependency derivation in the ECS scheduler is better than Unity’s runtime AtomicSafetyHandle — it’s a schedule-time guarantee, not an editor-only check.
ParallelRange enforces a minimum granularity, the right way to express data-parallel work per the Cilk-5 work-first principle.

Gaps, ranked by performance impact

1. False sharing on the MPMC hotspot

Containers/MPMCQueue.h:101-102:

std::atomic<size_t> EnqueuePos;
std::atomic<size_t> DequeuePos;

Adjacent 8-byte atomics; every producer writes the first, every consumer writes the second. The cache line ping-pongs between cores on every operation. The same pattern exists on the ThreadPool hot atomics (Done, ActiveWorkerCount, SpinningWorkerCount — Parallel.h:85-88).

Fix: alignas(64) on each hot atomic. Pad Cell to a full line if sizeof(T) + sizeof(atomic<size_t>) < 64. Expect 1.5–3× throughput on the queue itself on x86; more on Apple Silicon (128-byte lines).
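A minimal sketch of the padding described above, assuming a 64-byte line. `PaddedCursors`, `PaddedCell` and `kCacheLine` are hypothetical names for illustration, not the actual members of `MPMCQueue.h`.

```cpp
#include <atomic>
#include <cstddef>

// Assumed line size: 64 bytes on x86. Apple Silicon would want 128.
// Where the toolchain provides it, std::hardware_destructive_interference_size
// is the standard way to spell this constant.
constexpr std::size_t kCacheLine = 64;

// Hypothetical stand-in for the queue's two cursors. alignas on each member
// pushes them onto separate cache lines, so producers and consumers stop
// invalidating each other's line.
struct PaddedCursors {
    alignas(kCacheLine) std::atomic<std::size_t> EnqueuePos{0};
    alignas(kCacheLine) std::atomic<std::size_t> DequeuePos{0};
};

// Hypothetical cell layout. alignas on the struct rounds sizeof up to a
// multiple of the line size, so neighbouring cells never share a line.
template <class T>
struct alignas(kCacheLine) PaddedCell {
    std::atomic<std::size_t> Sequence{0};
    T Data{};
};

// Compile-time checks that the layout is what the comments claim.
static_assert(alignof(PaddedCursors) == kCacheLine, "cursors must be line-aligned");
static_assert(sizeof(PaddedCursors) == 2 * kCacheLine, "one line per cursor");
static_assert(offsetof(PaddedCursors, DequeuePos) == kCacheLine, "cursors on separate lines");
static_assert(sizeof(PaddedCell<int>) % kCacheLine == 0, "cell fills whole lines");
```

The cost is memory: a padded cell for a small `T` occupies a full line, so a 1024-entry queue grows to 64 KiB under this assumption.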

2. No work stealing — single global queue is the contention point

Every submit and every dequeue races on the same EnqueuePos/DequeuePos. The Chase-Lev pattern replaces this with N per-worker deques where the owner pushes and pops at near-zero cost (plain loads and stores on the bottom counter, with a CAS only when it races a thief for the last element), while thieves pay for a CAS on every steal. Owner-side cost drops to “tens of nanoseconds” vs. “hundreds” for steals.

For PhoenixSim’s typical millisecond-scale physics batches, the gap is hidden. It becomes visible with:

ParallelForEach(N, fn) and small-N fan-outs from one big archetype.
Scaling beyond 8–16 cores.

Fix path: port a Chase-Lev deque (use the Lê et al. 2013 weak-memory-model variant — important for Switch/ARM targets). Keep the global MPMC as the submission inbox; drain it into per-worker deques.
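For orientation, here is a fixed-capacity sketch of the deque with the memory orderings from Lê et al. (2013). It is a simplification under stated assumptions: `ChaseLevSketch` is a hypothetical name, the array never grows (a full `Push` just fails), and `T` must be trivially copyable (for example a job pointer). It is not the `TChaseLevDeque<T>` that landed in the repository.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

template <class T, std::size_t Capacity>
class ChaseLevSketch {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

public:
    // Owner thread only. Returns false when full (the paper resizes instead).
    bool Push(T item) {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed);
        const std::int64_t t = Top.load(std::memory_order_acquire);
        if (b - t >= static_cast<std::int64_t>(Capacity)) return false;
        Slots[Index(b)].store(item, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // publish the slot
        Bottom.store(b + 1, std::memory_order_relaxed);
        return true;
    }

    // Owner thread only. Takes from the bottom (LIFO).
    bool Pop(T& out) {
        const std::int64_t b = Bottom.load(std::memory_order_relaxed) - 1;
        Bottom.store(b, std::memory_order_relaxed);            // reserve slot b
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = Top.load(std::memory_order_relaxed);
        if (t > b) {                                           // empty: undo
            Bottom.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        const T item = Slots[Index(b)].load(std::memory_order_relaxed);
        if (t == b) {                                          // last item: race thieves
            const bool won = Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            Bottom.store(b + 1, std::memory_order_relaxed);
            if (!won) return false;
        }
        out = item;
        return true;
    }

    // Any thread. Takes from the top (FIFO). False means empty or lost race.
    bool Steal(T& out) {
        std::int64_t t = Top.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        const std::int64_t b = Bottom.load(std::memory_order_acquire);
        if (t >= b) return false;
        const T item = Slots[Index(t)].load(std::memory_order_relaxed);
        if (!Top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed)) {
            return false;
        }
        out = item;
        return true;
    }

private:
    static std::size_t Index(std::int64_t i) {
        return static_cast<std::size_t>(i) & (Capacity - 1);
    }

    alignas(64) std::atomic<std::int64_t> Top{0};     // thieves advance this
    alignas(64) std::atomic<std::int64_t> Bottom{0};  // owner advances this
    std::atomic<T> Slots[Capacity];                   // written before first read
};
```

Note that the owner's `Pop` still issues a sequentially consistent fence on every call and a CAS when it takes the last element; the common-case saving comes from not contending on a shared cursor.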

3. `ParallelForEach` violates the granularity rule

Parallel.h:123-131 submits one task per element. The Cilk-5 work-first heuristic: each job should do 10–100× more work than the queue operation costs. Push/pop here is ~100 ns minimum (CAS + sequence store + shared_ptr alloc), so per-element jobs need ≥1 µs of real work to break even.

Fix: make ParallelForEach a shim that forwards to ParallelRange with a default minRange. Better yet, deprecate it in favor of ParallelRange everywhere.
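The shim could look roughly like the sketch below. Everything here is an assumption for illustration: `ParallelRangeSketch` is a serial stand-in that takes a `fn(begin, end)` callback (the real `ParallelRange` signature is not shown in this document), and `kDefaultMinRange = 64` is a placeholder value that would need measuring.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for ParallelRange: splits [0, n) into chunks of
// minRange elements (the last chunk may be shorter) and calls fn(begin, end)
// once per chunk. A real implementation would submit each chunk as one pool
// task; this sketch runs the chunks inline so it stays self-contained.
template <class Fn>
std::size_t ParallelRangeSketch(std::size_t n, std::size_t minRange, Fn&& fn) {
    const std::size_t step = std::max<std::size_t>(minRange, 1);
    std::size_t chunks = 0;
    for (std::size_t begin = 0; begin < n; begin += step, ++chunks) {
        fn(begin, std::min(begin + step, n));
    }
    return chunks;  // number of tasks that would have been submitted
}

// Assumed default granularity; the right value depends on per-element cost.
constexpr std::size_t kDefaultMinRange = 64;

// ParallelForEach as a thin shim: one task per chunk, not one per element.
template <class Fn>
std::size_t ParallelForEachSketch(std::size_t n, Fn&& fn) {
    return ParallelRangeSketch(n, kDefaultMinRange,
        [&fn](std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i) fn(i);
        });
}
```

With these assumed numbers, 1000 elements become 16 submissions instead of 1000, so the per-task queue cost is amortized over up to 64 elements.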

4. `std::function` + `std::shared_ptr<TaskHandle>` per task

Every Submit() performs:

make_shared<TaskHandle>(): heap alloc + atomic refcount init.
std::function construction: heap alloc if the capture exceeds the SBO (typically 16–32 B).
TryEnqueue(Task): copies the Task (incl. another refcount bump).
Return shared_ptr: refcount bump.

Three atomic ops and 1–2 allocations per task before any work runs. Reinalter’s Molecular Matters reference design uses per-thread linear allocators, intrusive sibling lists for dependents, and raw pointers tracked via a generational pool.

Fix path: per-worker linear allocator for Task instances, reset between frames. Replace std::function with an in-place 48-byte callable buffer (Delegates.h is the natural extension point). Replace shared_ptr<TaskHandle> with TaskHandle* + generation counter.
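To make the in-place callable idea concrete, here is a minimal move-only sketch with a 48-byte buffer and no heap fallback. `InlineTaskSketch` is a hypothetical name and is not the contents of Delegates.h; a capture that does not fit is rejected at compile time rather than spilled to the heap.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <std::size_t N = 48>
class InlineTaskSketch {
public:
    InlineTaskSketch() = default;

    // Accepts any void() callable that fits in the buffer. The enable_if keeps
    // this constructor from hijacking the move constructor.
    template <class F, class = typename std::enable_if<
        !std::is_same<typename std::decay<F>::type, InlineTaskSketch>::value>::type>
    InlineTaskSketch(F&& f) {
        using Fn = typename std::decay<F>::type;
        static_assert(sizeof(Fn) <= N, "capture too large for the inline buffer");
        static_assert(alignof(Fn) <= alignof(std::max_align_t), "over-aligned capture");
        static_assert(std::is_nothrow_move_constructible<Fn>::value,
                      "callable must be nothrow-movable");
        ::new (static_cast<void*>(Storage)) Fn(std::forward<F>(f));
        Invoke = [](void* p) { (*static_cast<Fn*>(p))(); };
        // Move-construct into dst (if non-null), then destroy the source.
        Relocate = [](void* dst, void* src) {
            if (dst) ::new (dst) Fn(std::move(*static_cast<Fn*>(src)));
            static_cast<Fn*>(src)->~Fn();
        };
    }

    InlineTaskSketch(InlineTaskSketch&& other) noexcept { MoveFrom(other); }
    InlineTaskSketch& operator=(InlineTaskSketch&& other) noexcept {
        if (this != &other) { Reset(); MoveFrom(other); }
        return *this;
    }
    InlineTaskSketch(const InlineTaskSketch&) = delete;
    InlineTaskSketch& operator=(const InlineTaskSketch&) = delete;
    ~InlineTaskSketch() { Reset(); }

    explicit operator bool() const { return Invoke != nullptr; }
    void operator()() { Invoke(Storage); }  // precondition: non-empty

private:
    void Reset() {
        if (Relocate) Relocate(nullptr, Storage);  // destroy only
        Invoke = nullptr;
        Relocate = nullptr;
    }
    void MoveFrom(InlineTaskSketch& other) {
        if (!other.Invoke) return;
        other.Relocate(Storage, other.Storage);
        Invoke = other.Invoke;
        Relocate = other.Relocate;
        other.Invoke = nullptr;
        other.Relocate = nullptr;
    }

    alignas(std::max_align_t) unsigned char Storage[N];
    void (*Invoke)(void*) = nullptr;
    void (*Relocate)(void*, void*) = nullptr;
};
```

Failing at compile time is a deliberate choice in this sketch: an oversized lambda becomes a build error at the call site instead of a hidden allocation on the submit path.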

5. No fibers / no inline `WaitForCounter`

TaskHandle::WaitForCompleted (Parallel.cpp:22-34) is a spin with yield(). If a worker thread enters it from inside a job body, the worker stalls — exactly the Naughty Dog pain point. PhoenixSim mitigates by forcing continuation-passing style via the JobScheduler DAG: split into two jobs joined by a dependency edge.

That’s a defensible choice. But it forces every “wait for X” pattern into two-job form, which is invasive for:

PhysicsSystem’s iterative solver (currently uses bespoke scheduler instances per phase).
Scripting (PhoenixLua) calling into engine code that wants to wait.

Fix: introduce stackful fibers (per Naughty Dog GDC 2015 / Marl) so WaitForCounter can park a fiber and switch to another runnable fiber on the same worker.

6. Caller-side waits don’t use PAUSE

The worker idle path uses PHX_THREAD_PAUSE(). The caller-side waits don’t:

TaskHandle::WaitForCompleted (Parallel.cpp:22-34)
ThreadPool::WaitIdle (Parallel.cpp:194-205)
JobScheduler::Execute final drain (JobScheduler.cpp:282-283)

All are yield()-only. Adding a 64-iter PAUSE loop before each yield() cuts wake latency from microseconds to nanoseconds when the wait is short. Matters most at end-of-frame fences.
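A spin-then-yield wait might be sketched as follows. `SKETCH_PAUSE` and `WaitUntilSet` are hypothetical names (the real macro is `PHX_THREAD_PAUSE()`, whose definition is not shown here), and the 64-iteration budget is the figure quoted above, not a measured optimum.

```cpp
#include <atomic>
#include <thread>

// Assumed mapping of a CPU pause hint per architecture; falls back to a no-op.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  #define SKETCH_PAUSE() _mm_pause()
#elif defined(__aarch64__)
  #define SKETCH_PAUSE() __asm__ __volatile__("yield")
#else
  #define SKETCH_PAUSE() ((void)0)
#endif

// Spin briefly with the pause hint, then hand the core back with yield().
// Short waits finish inside the spin loop and never reach the scheduler.
inline void WaitUntilSet(const std::atomic<bool>& done, int spinIters = 64) {
    while (!done.load(std::memory_order_acquire)) {
        for (int i = 0; i < spinIters; ++i) {
            if (done.load(std::memory_order_acquire)) return;
            SKETCH_PAUSE();
        }
        std::this_thread::yield();
    }
}
```

The same shape would apply to each of the three call sites listed above, with the flag replaced by whatever completion condition that site already polls.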

7. Memory ordering review for weak memory models

The MPMC is correct (textbook Vyukov). JobScheduler mixes orderings: RemainingPredecessors.store(..., memory_order_relaxed) on init and fetch_sub(..., memory_order_acq_rel) on decrement. This should be fine because the init store happens-before any worker observes the node via the queue, but it is worth a TSan run and (ideally) a test on an Apple M-series box before any Switch-class deployment.
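The ordering pattern under review can be reduced to the sketch below. `NodeSketch`, `InitNode` and `ReleaseOne` are hypothetical names, not the JobScheduler's actual types; the point is only to show which operation carries which ordering and why.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical DAG node mirroring the counter described above.
struct NodeSketch {
    std::atomic<std::uint32_t> RemainingPredecessors{0};
    int Payload = 0;  // plain data written before the node is published
};

// Init: relaxed is sufficient only because the node is handed to workers
// afterwards through a release/acquire channel (assumed here: the queue's
// enqueue/dequeue pair). Without that channel this store would be a bug.
inline void InitNode(NodeSketch& n, std::uint32_t predecessors, int payload) {
    n.Payload = payload;
    n.RemainingPredecessors.store(predecessors, std::memory_order_relaxed);
}

// Called once by each finishing predecessor. Returns true for exactly one
// caller: the one that takes the counter from 1 to 0.
// The release half publishes this predecessor's writes; the acquire half lets
// the final decrementer see the writes of every earlier predecessor.
inline bool ReleaseOne(NodeSketch& n) {
    return n.RemainingPredecessors.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

x86 hardware is strongly ordered enough to hide many mistakes of this kind, which is why the audit calls for a run on ARM hardware as well as TSan.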

Smaller items

No priority tiers (Naughty Dog uses low/normal/high).
No Pipe-equivalent for arbitrary work (Unreal’s serialization primitive for non-thread-safe APIs).
No per-worker stats exposed (counters for “jobs completed”, “jobs stolen”) — Tracy zones only.
No std::execution-style sender/receiver vocabulary (long-term consideration only).

Implementation status

Step	Commit(s)	Status
Docs + pinning tests	`35ec511`	✅ landed
#1 Cache-line padding	`8684ae9`	✅ landed
#6 PAUSE backoff on caller-side waits; real PAUSE on POSIX	`7f2ce68`	✅ landed
#4 `TInlineCallable<void(), 128>` in Task body	`a1c0fc4`	✅ landed
#3 `ParallelForEach` shim + `InFlight` counter (fixes `WaitIdle` TOCTOU)	`2e20825`	✅ landed
#2a `TChaseLevDeque<T>` template + tests	`794346f`	✅ landed
#2b/c Slab + per-worker deques + submission inbox in `ThreadPool`	`af0069b`	✅ landed
#5a `ECS::ParallelForEntities<TComponents...>` primitive	—	open
#5b Fibers	—	deferred (see below)
#7 Memory-ordering audit on ARM hardware	—	open (recheck on real Apple Silicon / Switch box)

45 tests / 15,505 assertions, 10+ consecutive clean runs as of af0069b.

#5 fibers — deferred, with reasoning

The initial roadmap entry justified fibers with “PhysicsSystem’s iterative solver and PhoenixLua reentrancy.” A close read of src/PhoenixPhysics/PhysicsSystem.cpp and src/PhoenixSim/ECS/JobScheduler.cpp shows that the PhysicsSystem half of that justification is wrong — fibers wouldn’t change anything there.

What PhysicsSystem actually does

PhysicsSystem holds three bespoke JobScheduler instances (IntegrateVelocitiesScheduler, CalculateContactPairsScheduler, IntegrateScheduler). Each registers exactly one job and never uses the multi-job DAG features. They exist because:

IJob<TComponents...> gives free per-archetype batching and the per-entity Execute(world, e, cb, comp...) signature.
These jobs need to run multiple times per frame inside OnPostWorldUpdate’s iteration loop — FeatureECS::RegisterJob only runs jobs once per frame, so the bespoke schedulers are the workaround.

OnPostWorldUpdate runs on the main thread, not on a worker. Its ExecuteScheduler and WorldTaskQueue::Flush calls block the main thread while workers process — no worker is ever blocked. The “I’m a worker stuck waiting for jobs” problem fibers solve doesn’t exist here.

The actual design smell is a missing primitive

What PhysicsSystem wants is a single function:

template <class... TComponents, class Fn>
void ECS::ParallelForEntities(WorldRef world, Fn&& fn);

It resolves matching archetypes via EntityQueryBuilder, fans batches into the thread pool, joins on WaitIdle. Same primitives JobScheduler::Execute already uses, minus the multi-job DAG bookkeeping.

With that primitive, PhysicsSystem::OnPostWorldUpdate becomes a flat loop of ParallelForEntities / ParallelRange calls. The three JobScheduler members and their OnWorldInitialize setup vanish, as do the IntegrateVelocitiesJob, CalculateContactPairsJob, IntegrateJob class definitions (they collapse to lambdas at the call site).

That’s tracked as #5a above.

Where fibers would still earn their keep

Use case	Fiber benefit
PhoenixLua scripts that want to `await` events	Scripts run on the main thread today; any blocking wait freezes the whole sim. A fiber-aware `wait_for(event)` from a script would park the script’s fiber and let the main thread keep ticking.
Asset streaming during a frame	“Wait for the texture I/O without burning a worker.” The canonical Naughty Dog example.
Cooperative multi-frame computation	Path planning, AI deliberation — tasks that pause across frame boundaries.

None of these are PhysicsSystem use cases. #5b stays deferred and is only worth committing to when one of the three use cases above becomes a concrete, scheduled need. Building fibers speculatively would impose a maintenance tax (per-platform asm stack switching, debugger friction, stack sizing) for no PhysicsSystem benefit.

References

Mighty Professional (2026). Job Systems for Game Engines. (Source diagnosis blog.)
Gyrling, C. (2015). Parallelizing the Naughty Dog Engine Using Fibers. GDC.
Chase, D., Lev, Y. (2005). Dynamic Circular Work-Stealing Deque. SPAA.
Lê et al. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP.
Frigo, Leiserson, Randall (1998). The Implementation of the Cilk-5 Multithreaded Language. PLDI.
Reinalter, S. (2015–2016). Job System 2.0: Lock-Free Work Stealing, parts 1–5. Molecular Musings.
