A context switch in XNU, walked end to end
From the moment an interrupt fires to the moment a different thread is running on the core — trap, AST, thread_invoke, ASID switch, return.
A context switch on XNU is the single most-traveled code path in the kernel. Every preemption, every block-on-Mach-msg, every syscall return that finds a higher-priority thread runnable — they all bottom out in the same sequence. This article walks one of them, from the hardware interrupt that started it to the new thread's first user-mode instruction.
The setup: a single CPU core running thread A in userspace. Thread B is runnable at higher priority on the same core. We'll force the switch by firing the scheduler's preemption timer.
Step 1: timer interrupt
The kernel programs a per-CPU preemption timer at every dispatch — "if this thread is still running in N microseconds, take the CPU away". When the timer fires, the CPU traps into the kernel.
On Apple Silicon, that means executing an exception vector in EL1 (kernel mode). On Intel it's an interrupt through the IDT. Either way, control lands in XNU's per-architecture trap handler:
- apple-oss-distributions/xnu, osfmk/arm64/sleh.c: ARM64 synchronous exception + interrupt entry; every trap on Apple Silicon lands here.
- apple-oss-distributions/xnu, osfmk/i386/trap.c: x86_64 trap handler; same role for Intel Macs.
The trap saves thread A's user-mode register state to its kernel stack and jumps into the interrupt service routine. The ISR identifies the source as the preemption timer, calls the scheduler's timer hook, and sets AST_PREEMPT on the current thread's AST mask.
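The division of labor in this step (the interrupt only records a request, and the switch itself happens later) can be sketched in a few lines of C. This is a minimal user-space model, not XNU code: the `sk_` names, the flag values, and the tick-based deadline are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; XNU's real AST_* constants differ. */
typedef uint32_t sk_ast_t;
#define SK_AST_NONE    0x0u
#define SK_AST_PREEMPT 0x1u

struct sk_thread {
    sk_ast_t ast;          /* pending asynchronous-trap reasons */
    uint64_t quantum_end;  /* deadline armed at dispatch, in ticks */
};

/* Called at dispatch: arm the preemption deadline for the incoming thread. */
void sk_arm_quantum(struct sk_thread *t, uint64_t now, uint64_t quantum)
{
    t->quantum_end = now + quantum;
}

/* Called from the timer interrupt: record the request, do not switch here. */
void sk_timer_interrupt(struct sk_thread *current, uint64_t now)
{
    if (now >= current->quantum_end) {
        current->ast |= SK_AST_PREEMPT;
    }
}
```

Keeping the handler this small is the design point: the interrupt path stays short, and the expensive work is deferred to a place where the kernel is already about to leave kernel mode.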
Step 2: AST checkpoint on the way back
The ISR returns. Before the kernel resumes userspace, it checks the AST mask for the current thread. If anything is set — AST_PREEMPT, AST_BSD (signals), AST_DTRACE, anything — the kernel processes ASTs first.
For AST_PREEMPT, the AST handler calls thread_block with a continuation argument — "preempt me, then resume in userspace." This is the point of no return for thread A on this core.
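The checkpoint can be pictured as a loop that drains the mask before userspace is allowed to resume. The sketch below is an assumption-laden model: the `rt_` names and flag values are invented, the counters stand in for the real handlers, and the order in which reasons are handled is a choice of this sketch rather than a statement about XNU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values; not XNU's real AST_* constants. */
#define RT_AST_PREEMPT 0x1u
#define RT_AST_BSD     0x2u

struct rt_thread {
    uint32_t ast;     /* pending reasons */
    int signals;      /* stand-in for signal deliveries */
    int preempted;    /* stand-in for thread_block calls */
};

/* Drain every pending reason; returns how many handlers ran. */
int rt_return_to_user(struct rt_thread *t)
{
    int handled = 0;
    while (t->ast != 0) {
        if (t->ast & RT_AST_BSD) {
            t->ast &= ~RT_AST_BSD;
            t->signals++;            /* a real kernel would deliver signals */
            handled++;
        } else if (t->ast & RT_AST_PREEMPT) {
            t->ast &= ~RT_AST_PREEMPT;
            t->preempted++;          /* a real kernel would block here */
            handled++;
        } else {
            t->ast = 0;              /* unknown reasons are dropped in this sketch */
        }
    }
    return handled;                  /* mask is clear: safe to resume userspace */
}
```

The loop form matters because handling one reason can raise another; userspace resumes only once a full pass finds nothing pending.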
Step 3: thread_block — saying "I'm done for now"
thread_block is the common entry point for any reason a thread might stop running on a core: preempted, blocked on a Mach port wait, blocked on a ulock wait (XNU's futex-like primitive), or voluntarily yielded. The function:
- Removes thread A from the current processor's runq if it was on it.
- If the thread is still runnable (preemption case), puts it back on a runq for some processor.
- Calls thread_invoke with the next thread to run.
If thread_block was given a continuation, the kernel saves the continuation pointer in thread A's thread structure rather than on its stack, because the stack itself is about to be given up. When thread A is dispatched again later, the scheduler jumps directly to the continuation on a fresh stack instead of resuming the old call chain, which saves the cost of preserving a deep kernel call stack across a long block.
This continuation pattern is one of XNU's signature optimizations. A thread blocking on mach_msg_receive doesn't keep a full kernel stack reserved while it sleeps; it leaves only a continuation pointer, and the kernel stack can be reused for the next thread that needs one.
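A small model makes the trade visible. In the sketch below, blocking with a continuation hands the kernel stack back to a pool, and dispatch attaches a stack again before calling the continuation from the top. Everything here is hypothetical (the `ct_` names, the integer stack pool, the callback); it illustrates the pattern, not XNU's implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ct_continue_t)(void *param);

struct ct_thread {
    ct_continue_t continuation; /* where to resume, or NULL */
    void *parameter;            /* argument handed to the continuation */
    int has_kernel_stack;       /* 1 while a stack is attached */
};

/* Block the thread. With a continuation, its stack goes back to the pool. */
void ct_block(struct ct_thread *t, ct_continue_t k, void *param, int *free_stacks)
{
    t->continuation = k;
    t->parameter = param;
    if (k != NULL && t->has_kernel_stack) {
        t->has_kernel_stack = 0;
        (*free_stacks)++;
    }
}

/* Dispatch the thread. A pending continuation runs on a fresh stack. */
void ct_dispatch(struct ct_thread *t, int *free_stacks)
{
    if (t->continuation != NULL) {
        ct_continue_t k = t->continuation;
        if (!t->has_kernel_stack) {
            (*free_stacks)--;
            t->has_kernel_stack = 1;
        }
        t->continuation = NULL;
        k(t->parameter);
    }
    /* Otherwise the thread kept its stack and would resume where it blocked. */
}

/* Example continuation: counts how many times a receive completed. */
void ct_receive_done(void *param)
{
    *(int *)param += 1;
}
```

The cost of the pattern is that the continuation cannot rely on any local variables from before the block; whatever it needs has to travel through the parameter or the thread structure.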
Step 4: choosing thread B
The next thread handed to thread_invoke comes from the scheduler's thread_select, which thread_block calls first. It walks the per-processor runq, then the broader cluster runqs, looking for the highest-priority runnable thread. Tie-breaking biases for cache warmth (preferring threads recently on this core) and for the right cluster (P vs E based on QoS recommendation; see the scheduler article).
For our example, thread B is at the top of the per-processor runq. thread_select returns it.
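The selection rule described above (highest priority wins, ties go to the thread that last ran on this core) fits in one loop. This is a simplified sketch over a flat array: the `sel_` names and the `last_cpu` field are assumptions, and the real scheduler uses per-priority queues and cluster-aware policy that this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

struct sel_thread {
    int id;
    int priority;  /* larger value runs first */
    int last_cpu;  /* core the thread most recently ran on */
};

/* Return the best candidate for `cpu`, or NULL if the queue is empty. */
const struct sel_thread *sel_pick(const struct sel_thread *runq, size_t n, int cpu)
{
    const struct sel_thread *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const struct sel_thread *t = &runq[i];
        int higher = best != NULL && t->priority > best->priority;
        int warmer = best != NULL && t->priority == best->priority &&
                     t->last_cpu == cpu && best->last_cpu != cpu;
        if (best == NULL || higher || warmer) {
            best = t;
        }
    }
    return best;
}
```

With two threads at the same priority, the one whose `last_cpu` matches the selecting core is chosen, which is the cache-warmth bias in miniature.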
Step 5: machine_switch_context — the architecture-specific switch
Now thread_invoke calls machine_switch_context(old=A, continuation, new=B). This is the lowest-level part, hand-written per architecture:
- apple-oss-distributions/xnu, osfmk/arm64/cswitch.s: ARM64 context switch; saves A's callee-saved registers, loads B's, switches stack pointer.
- apple-oss-distributions/xnu, osfmk/i386/cswitch.s: x86_64 equivalent.
What the assembly does on ARM64:
- Save thread A's callee-saved general-purpose registers + FP/SIMD state to A's kernel stack frame.
- Save thread A's stack pointer into the thread structure.
- Load thread B's stack pointer from B's thread structure.
- Load B's callee-saved registers from B's stack frame.
- Return — the return address is B's resume point (either a continuation or wherever B was blocked).
On Apple Silicon, this is also where AMX state gets saved/restored if the thread used it, and where APRR/SPRR register state is reloaded for B.
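The save/load sequence listed above can be modeled in plain C by treating the core's registers as a struct. This is only a simulation under stated assumptions: the `cs_` names are invented, FP/SIMD and the Apple-specific state are omitted, and the real routine is assembly operating on actual registers.

```c
#include <assert.h>
#include <stdint.h>

#define CS_CALLEE_SAVED 10  /* x19..x28 in the AArch64 calling convention */

struct cs_regs {
    uint64_t x[CS_CALLEE_SAVED]; /* callee-saved general-purpose registers */
    uint64_t fp;                 /* frame pointer (x29) */
    uint64_t lr;                 /* link register (x30): where a return lands */
    uint64_t sp;                 /* kernel stack pointer */
};

struct cs_thread {
    struct cs_regs saved;        /* state captured when the thread left the core */
};

/*
 * Save the live registers into `old`, load `next` into the core, and
 * return the address at which `next` resumes.
 */
uint64_t cs_switch(struct cs_regs *cpu, struct cs_thread *old, struct cs_thread *next)
{
    old->saved = *cpu;   /* save A: registers and stack pointer */
    *cpu = next->saved;  /* load B: registers and stack pointer */
    return cpu->lr;      /* "return" now lands in B's resume point */
}
```

Only callee-saved registers need to be captured because the switch is an ordinary function call from the old thread's point of view; the caller-saved registers were already spilled by the compiler around the call.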
Step 6: address-space switch
Threads A and B might be in different tasks — which means different address spaces. Before B can execute its userspace code safely, the kernel has to install B's task's pmap:
- On ARM64, this means writing the base address of B's translation tables into TTBR0_EL1 together with the task's ASID (Address Space Identifier). ASIDs let the TLB hold entries for multiple address spaces simultaneously, so no flush is needed on switch as long as each live address space has a unique ASID.
- On x86_64, it's a write to CR3.
If thread A and B are in the same task, this step is a no-op. (Common case: a single multi-threaded app's threads switching among themselves.)
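To see why ASID tagging removes the flush, here is a toy TLB in C where every entry carries the ASID it was filled under. The structure names, the eight-slot direct-mapped layout, and the indexing hash are all assumptions of this sketch; real hardware TLBs are organized differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_SLOTS 8

struct tlb_entry { bool valid; uint16_t asid; uint64_t vpn; uint64_t ppn; };

struct mmu {
    uint64_t ttbr0;                 /* translation table base for the user half */
    uint16_t asid;                  /* current address space identifier */
    struct tlb_entry tlb[TLB_SLOTS];
    unsigned switches;              /* how many real address-space switches ran */
};

struct pmap_sk { uint64_t ttb; uint16_t asid; };  /* hypothetical per-task map */

/* Install a task's address space. Same task: nothing to do. No TLB flush. */
void mmu_switch(struct mmu *m, const struct pmap_sk *p)
{
    if (m->ttbr0 == p->ttb && m->asid == p->asid) {
        return;
    }
    m->ttbr0 = p->ttb;
    m->asid = p->asid;
    m->switches++;
}

/* Cache a translation, tagged with the current ASID. */
void tlb_fill(struct mmu *m, uint64_t vpn, uint64_t ppn)
{
    struct tlb_entry *e = &m->tlb[(vpn ^ m->asid) % TLB_SLOTS];
    e->valid = true;
    e->asid = m->asid;
    e->vpn = vpn;
    e->ppn = ppn;
}

/* Hit only if both the page number and the ASID tag match. */
bool tlb_lookup(const struct mmu *m, uint64_t vpn, uint64_t *ppn)
{
    const struct tlb_entry *e = &m->tlb[(vpn ^ m->asid) % TLB_SLOTS];
    if (e->valid && e->asid == m->asid && e->vpn == vpn) {
        *ppn = e->ppn;
        return true;
    }
    return false;
}
```

After switching from task A to task B and back, A's earlier translations are still usable, while B never sees them because the tag comparison fails. That is the behavior a flush-on-switch design gives up.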
Step 7: AST check on the new thread's entry
Before returning to userspace as thread B, the kernel checks B's AST mask. If B has pending signals (AST_BSD), they're delivered now: a signal trampoline is built and B's user PC is rewritten to enter the handler.
This is exactly the signal delivery flow, just on a thread that's about to start running rather than one that's about to return from a syscall.
Step 8: eret — back to userspace
Final step: eret (ARM64) or iret (x86). The CPU restores the user-mode register state from B's saved frame, switches privilege level back to EL0 / ring 3, and resumes at B's user PC.
Thread B is now running on the core where thread A was a microsecond ago.
What this costs
A bare context switch on Apple Silicon is in the low hundreds of nanoseconds. Most of that is:
- The cache miss on loading B's thread structure (~30-80 ns).
- The pmap switch (cheap on ARM64 thanks to ASIDs — a couple of register writes).
- The continuation jump or stack unwind.
Adding pressure: if A and B's working sets are in different L2 caches (different clusters), the next few hundred memory accesses by B will miss. That's why the scheduler's affinity bias matters — keeping a thread on the same cluster pays for itself many times over.
What surprises newcomers
- Most context switches are voluntary, not preemptions. A thread blocking on mach_msg_receive or a pthread_mutex_lock is the common case; the timer-driven preemption path is a fallback for CPU-bound threads.
- The continuation pattern means a sleeping thread's kernel stack is reusable. Memory pressure on the kernel stack pool is far lower than naively expected.
- ASIDs eliminate the TLB-flush penalty that's traditional on context switch. This is a huge win for Apple Silicon.
- The same thread_invoke runs whether you're preempting a CPU hog or returning from a syscall. One code path, dozens of triggers.
What to read next
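Read thread_block and thread_invoke in full (both are defined in osfmk/kern/sched_prim.c), then the thread and processor structures they operate on: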
- apple-oss-distributions/xnu, osfmk/kern/thread.c: Thread state machine (runnable, waiting, suspended, etc.).
- apple-oss-distributions/xnu, osfmk/kern/processor.c: Per-processor structures (runqs, current thread, idle thread).
Then re-read the scheduler article and notice how every claim about QoS turns into a runq-selection decision inside thread_select.