A context switch in XNU, walked end to end

A context switch on XNU is the single most-traveled code path in the kernel. Every preemption, every block on a Mach message, every syscall return that finds a higher-priority thread runnable: they all bottom out in the same sequence. This article walks through one such switch, from the hardware interrupt that started it to the new thread's first user-mode instruction.

The setup: a single CPU core running thread A in userspace. Thread B is runnable at higher priority on the same core. We'll force the switch by firing the scheduler's preemption timer.

Step 1: timer interrupt

The kernel programs a per-CPU preemption timer at every dispatch — "if this thread is still running in N microseconds, take the CPU away". When the timer fires, the CPU traps into the kernel.
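The "arm at dispatch, check on interrupt" idea can be sketched as a one-shot deadline. This is a toy model, not XNU's timer code: the struct, the function names, and the microsecond units are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-CPU timer state; field names are illustrative, not XNU's. */
struct toy_cpu_timer {
    uint64_t deadline_us;   /* absolute time at which the quantum expires */
    bool     armed;
};

/* Called at every dispatch: give the incoming thread one quantum. */
void toy_arm_preemption_timer(struct toy_cpu_timer *t, uint64_t now_us,
                              uint64_t quantum_us)
{
    assert(t != NULL && quantum_us > 0);
    t->deadline_us = now_us + quantum_us;
    t->armed = true;
}

/* Called from the timer interrupt: true if the running thread's quantum
 * has expired and the CPU should be taken away. */
bool toy_preemption_timer_expired(struct toy_cpu_timer *t, uint64_t now_us)
{
    assert(t != NULL);
    if (!t->armed || now_us < t->deadline_us) {
        return false;
    }
    t->armed = false;       /* one-shot: re-armed at the next dispatch */
    return true;
}
```

Because the timer is re-armed on every dispatch, a thread that blocks early never sees it fire; only a thread that runs for its whole quantum does.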

On Apple Silicon, that means executing an exception vector in EL1 (kernel mode). On Intel it's an interrupt through the IDT. Either way, control lands in XNU's per-architecture trap handler:

apple-oss-distributions/xnu, osfmk/arm64/sleh.c: ARM64 synchronous exception + interrupt entry; every trap on Apple Silicon lands here.
apple-oss-distributions/xnu, osfmk/i386/trap.c: x86_64 trap handler; same role for Intel Macs.

The trap saves thread A's user-mode register state to its kernel stack and jumps into the interrupt service routine. The ISR identifies the source as the preemption timer, calls the scheduler's timer hook, and sets AST_PREEMPT on the current thread's AST mask.

Step 2: AST checkpoint on the way back

The ISR returns. Before the kernel resumes userspace, it checks the AST mask for the current thread. If anything is set — AST_PREEMPT, AST_BSD (signals), AST_DTRACE, anything — the kernel processes ASTs first.
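A minimal sketch of the mask-and-checkpoint pattern follows. The AST_* names come from the text, but the bit values here are made up for illustration (the real definitions are in osfmk/kern/ast.h), and the thread struct and counters are hypothetical stand-ins so the effect is observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only; not XNU's real AST_* constants. */
typedef uint32_t toy_ast_t;
#define AST_NONE     0x00u
#define AST_PREEMPT  0x01u
#define AST_BSD      0x02u
#define AST_DTRACE   0x04u

struct toy_thread {
    toy_ast_t ast;          /* pending asynchronous system traps */
    int       preempted;    /* counters that make the sketch observable */
    int       signals_seen;
};

/* Set one or more reasons; in this model the interrupt handler calls it. */
void toy_ast_set(struct toy_thread *t, toy_ast_t reasons)
{
    assert(t != NULL);
    t->ast |= reasons;
}

/* Run on the return-to-user path: handle every pending reason first. */
void toy_ast_checkpoint(struct toy_thread *t)
{
    assert(t != NULL);
    toy_ast_t pending = t->ast;
    t->ast = AST_NONE;
    if (pending & AST_BSD) {
        t->signals_seen++;  /* signal delivery would happen here */
    }
    if (pending & AST_PREEMPT) {
        t->preempted++;     /* the real kernel would block the thread here */
    }
    /* AST_DTRACE and any other bits are simply cleared in this toy model. */
}
```

The interrupt handler only sets a bit; the expensive work is deferred to the checkpoint, which runs in thread context where blocking is allowed.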

apple-oss-distributions/xnu, osfmk/kern/ast.c: ast_taken_kernel / ast_taken_user, the dispatcher every AST goes through.

For AST_PREEMPT, the AST handler calls thread_block with a continuation argument — "preempt me, then resume in userspace." This is the point of no return for thread A on this core.

Step 3: thread_block — saying "I'm done for now"

thread_block is the common entry point for any reason a thread might stop running on a core: preempted, blocked on a Mach port wait, blocked on a userspace lock wait (XNU's ulock, its analogue of a futex), or voluntarily yielded. The function:

Removes thread A from the current processor's runq if it was on it.
If the thread is still runnable (preemption case), puts it back on a runq for some processor.
Calls thread_invoke with the next thread to run.

apple-oss-distributions/xnu, osfmk/kern/sched_prim.c: thread_block / thread_invoke, the heart of voluntary and involuntary context switching.
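The shape of that flow can be modeled in a few lines. This is a simplified sketch, not the code in sched_prim.c: the structs, the fixed-size unordered run queue, and every toy_* name are assumptions, and a plain pointer assignment stands in for thread_invoke.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RUNQ_CAP 8

struct toy_thread { int id; int pri; bool runnable; };

struct toy_processor {
    struct toy_thread *active;               /* thread currently on the core */
    struct toy_thread *runq[TOY_RUNQ_CAP];   /* unordered runnable threads */
    size_t             runq_len;
};

void toy_runq_enqueue(struct toy_processor *p, struct toy_thread *t)
{
    assert(p->runq_len < TOY_RUNQ_CAP);
    p->runq[p->runq_len++] = t;
}

/* Remove and return the highest-priority queued thread, or NULL if empty. */
struct toy_thread *toy_runq_dequeue_best(struct toy_processor *p)
{
    if (p->runq_len == 0) {
        return NULL;
    }
    size_t best = 0;
    for (size_t i = 1; i < p->runq_len; i++) {
        if (p->runq[i]->pri > p->runq[best]->pri) {
            best = i;
        }
    }
    struct toy_thread *t = p->runq[best];
    p->runq[best] = p->runq[--p->runq_len];  /* fill the hole with the tail */
    return t;
}

/* Requeue the outgoing thread if it is still runnable, pick a successor,
 * and make it the active thread. Returns NULL when the core would idle. */
struct toy_thread *toy_thread_block(struct toy_processor *p)
{
    struct toy_thread *self = p->active;
    if (self != NULL && self->runnable) {
        toy_runq_enqueue(p, self);           /* preempted: still wants the CPU */
    }
    struct toy_thread *next = toy_runq_dequeue_best(p);
    p->active = next;                        /* stand-in for thread_invoke */
    return next;
}
```

A preempted thread goes back on the queue and may run again later; a thread that blocked for real (runnable is false) simply disappears from scheduling until something wakes it.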

If thread_block was given a continuation, the kernel saves the continuation pointer in thread A's thread structure rather than on its stack. When thread A is dispatched again later, the scheduler jumps directly to the continuation on a fresh stack instead of restoring the old one, which saves the cost of preserving a deep kernel call stack across a long block.

This continuation pattern is one of XNU's signature optimizations. A thread blocking on mach_msg_receive doesn't keep a full kernel stack reserved while it sleeps; it leaves only a continuation pointer, and the kernel stack can be reused for the next thread that needs one.
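To make the stack-reuse point concrete, here is a hedged sketch of blocking with a continuation. None of these names are XNU's: toy_kstack, toy_block_with_continuation, and toy_dispatch are hypothetical, and a real kernel would switch stacks in assembly rather than call a function pointer directly.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_continuation_t)(void *param);

struct toy_kstack { int in_use; };

struct toy_thread {
    toy_continuation_t continuation; /* where to restart when dispatched */
    void              *param;        /* argument handed to the continuation */
    struct toy_kstack *kstack;       /* NULL while blocked with a continuation */
};

/* Block with a continuation: record the restart point in the thread and
 * give the kernel stack back so another thread can use it. */
struct toy_kstack *toy_block_with_continuation(struct toy_thread *t,
                                               toy_continuation_t k,
                                               void *param)
{
    assert(t != NULL && k != NULL);
    struct toy_kstack *freed = t->kstack;
    t->continuation = k;
    t->param = param;
    t->kstack = NULL;
    if (freed != NULL) {
        freed->in_use = 0;
    }
    return freed;
}

/* Dispatch: attach any free stack and start at the continuation. Nothing
 * from the old call chain is restored, because nothing was kept. */
void toy_dispatch(struct toy_thread *t, struct toy_kstack *stack)
{
    assert(t->continuation != NULL && stack != NULL && !stack->in_use);
    stack->in_use = 1;
    t->kstack = stack;
    toy_continuation_t k = t->continuation;
    t->continuation = NULL;
    k(t->param);
}

/* Example continuation: counts how many times the thread was resumed. */
void toy_receive_continue(void *param)
{
    int *wakeups = param;
    (*wakeups)++;
}
```

The trade-off is that the blocking function must pack everything it needs after wakeup into the thread (here, param), since its local variables do not survive the block.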

Step 4: choosing thread B

thread_block calls the scheduler's thread_select to pick the successor before handing it to thread_invoke. thread_select walks the per-processor runq, then the broader cluster runqs, looking for the highest-priority runnable thread. Tie-breaking favors cache warmth (preferring threads recently on this core) and the right cluster (P vs E based on QoS recommendation; see the scheduler article).

apple-oss-distributions/xnu, osfmk/kern/sched_clutch.c: the clutch scheduler, with bucket-based per-core selection and cross-cluster steal.

For our example, thread B is at the top of the per-processor runq. thread_select returns it.
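The selection rule described above (priority first, cache warmth as a tie-breaker) can be sketched as a single pass over the candidates. This deliberately ignores clutch buckets, cluster preference, and stealing; the struct fields and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_thread {
    int pri;        /* higher value means more urgent */
    int last_cpu;   /* core this thread most recently ran on */
};

/* Return the best candidate for `cpu`: highest priority wins, and among
 * equal priorities a thread that last ran on this core is preferred
 * because its cache lines are more likely to still be warm. */
struct toy_thread *toy_thread_select(struct toy_thread **runq, size_t n, int cpu)
{
    assert(runq != NULL || n == 0);
    struct toy_thread *best = NULL;
    for (size_t i = 0; i < n; i++) {
        struct toy_thread *t = runq[i];
        if (best == NULL ||
            t->pri > best->pri ||
            (t->pri == best->pri && t->last_cpu == cpu && best->last_cpu != cpu)) {
            best = t;
        }
    }
    return best;
}
```

Note that warmth never overrides priority in this sketch: a cold high-priority thread still beats a warm low-priority one.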

Step 5: machine_switch_context — the architecture-specific switch

Now thread_invoke calls machine_switch_context(old=A, continuation, new=B). This is the lowest-level part, hand-written per architecture:

apple-oss-distributions/xnu, osfmk/arm64/cswitch.s: ARM64 context switch; saves A's callee-saved registers, loads B's, switches stack pointer.
apple-oss-distributions/xnu, osfmk/i386/cswitch.s: x86_64 equivalent.

What the assembly does on ARM64:

Save thread A's callee-saved general-purpose registers + FP/SIMD state to A's kernel stack frame.
Save thread A's stack pointer into the thread structure.
Load thread B's stack pointer from B's thread structure.
Load B's callee-saved registers from B's stack frame.
Return — the return address is B's resume point (either a continuation or wherever B was blocked).
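The save/load/return sequence above can be modeled in C by treating the CPU's callee-saved registers as a struct. This is only a model of what the assembly does: the register set follows the AAPCS64 calling convention, but the structs, the toy_* names, and the use of memcpy in place of store/load-pair instructions are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Registers the AAPCS64 calling convention requires a callee to preserve. */
struct toy_callee_saved {
    uint64_t x19_x28[10];
    uint64_t fp;          /* x29 */
    uint64_t lr;          /* x30: where execution resumes after the switch */
    uint64_t sp;
    uint64_t d8_d15[8];   /* low 64 bits of v8..v15 */
};

struct toy_thread { struct toy_callee_saved saved; };
struct toy_cpu    { struct toy_callee_saved regs;  };

/* Save the outgoing thread's registers, load the incoming thread's, and
 * report the address the final "ret" would branch to. */
uint64_t toy_switch_context(struct toy_cpu *cpu, struct toy_thread *old_thread,
                            struct toy_thread *new_thread)
{
    assert(old_thread != new_thread);
    memcpy(&old_thread->saved, &cpu->regs, sizeof cpu->regs);  /* save A */
    memcpy(&cpu->regs, &new_thread->saved, sizeof cpu->regs);  /* load B */
    return cpu->regs.lr;                                       /* B's resume point */
}
```

Only callee-saved registers need handling here because the switch is an ordinary function call from the compiler's point of view; caller-saved registers were already spilled by whoever called it.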

On Apple Silicon, this is also where AMX state gets saved/restored if the thread used it, and where APRR/SPRR register state is reloaded for B.

Step 6: address-space switch

Threads A and B might be in different tasks — which means different address spaces. Before B can execute its userspace code safely, the kernel has to install B's task's pmap:

On ARM64, this means writing the base address of B's translation tables into TTBR0_EL1 together with the right ASID (Address Space Identifier). ASIDs let the TLB hold entries for multiple address spaces simultaneously, so no flush is needed on switch as long as each live address space has a unique ASID.
On x86_64, it's a write to CR3.

apple-oss-distributions/xnu, osfmk/arm/pmap.c: pmap_switch, which installs the new task's address-space registers.

If thread A and B are in the same task, this step is a no-op. (Common case: a single multi-threaded app's threads switching among themselves.)
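A small model of the ARM64 case, including the same-task shortcut, might look like the following. The ASID placement in bits [63:48] of the TTBR value follows the ARMv8-A register layout; everything else (the structs, the write counter, the toy_* names) is hypothetical, and a real pmap_switch also deals with ASID allocation and rollover, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pmap {
    uint64_t table_base;  /* physical address of the root translation table */
    uint16_t asid;        /* address space identifier tagged onto TLB entries */
};

struct toy_cpu {
    const struct toy_pmap *current;
    uint64_t ttbr0;       /* stand-in for TTBR0_EL1 */
    unsigned writes;      /* number of times the register was written */
};

/* Compose a TTBR value: ASID in bits [63:48], table base below it. */
uint64_t toy_make_ttbr(const struct toy_pmap *p)
{
    assert(p != NULL && (p->table_base >> 48) == 0);
    return ((uint64_t)p->asid << 48) | p->table_base;
}

/* Install `next` as the user address space. Returns true if anything
 * changed; switching between threads of one task is a no-op. */
bool toy_pmap_switch(struct toy_cpu *cpu, const struct toy_pmap *next)
{
    if (cpu->current == next) {
        return false;
    }
    cpu->ttbr0 = toy_make_ttbr(next);
    cpu->current = next;
    cpu->writes++;
    return true;
}
```

Because each TLB entry is tagged with the ASID that created it, entries belonging to the previous address space can stay resident and simply stop matching.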

Step 7: AST check on the new thread's entry

Before returning to userspace as thread B, the kernel checks B's AST mask. If B has pending signals (AST_BSD), they are delivered now: a signal trampoline is built and B's user PC is rewritten to enter the handler.

This is exactly the signal delivery flow, just on a thread that's about to start running rather than one that's about to return from a syscall.
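The "rewrite the user PC" step can be illustrated with a toy saved-state struct. This is a sketch under heavy simplification: the frame layout, the field names, and toy_sigreturn are invented for illustration, and real delivery also handles stack alignment, signal masks, and the full register context.

```c
#include <assert.h>
#include <stdint.h>

/* The slice of saved user state that matters for this sketch. */
struct toy_user_state { uint64_t pc; uint64_t sp; };

/* Stand-in for the frame the kernel places on the user stack. */
struct toy_sigframe {
    uint64_t interrupted_pc;
    uint64_t interrupted_sp;
    int      signo;
};

/* Redirect the thread to `handler`, remembering where it was interrupted. */
void toy_deliver_signal(struct toy_user_state *us, struct toy_sigframe *frame,
                        int signo, uint64_t handler)
{
    assert(handler != 0);
    frame->interrupted_pc = us->pc;
    frame->interrupted_sp = us->sp;
    frame->signo = signo;
    us->sp -= sizeof *frame;  /* the handler runs below the saved frame */
    us->pc = handler;         /* first user instruction is now the handler */
}

/* What returning from the handler does: restore the interrupted state. */
void toy_sigreturn(struct toy_user_state *us, const struct toy_sigframe *frame)
{
    us->pc = frame->interrupted_pc;
    us->sp = frame->interrupted_sp;
}
```

From the thread's point of view nothing special happened at the switch: it resumes in the handler, and after the handler returns it continues at the PC it had when it last left the CPU.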

Step 8: eret — back to userspace

Final step: eret (ARM64) or iret (x86). The CPU restores the user-mode register state from B's saved frame, switches privilege level back to EL0 / ring 3, and resumes at B's user PC.

Thread B is now running on the core where thread A was a microsecond ago.

What this costs

A bare context switch on Apple Silicon is in the low hundreds of nanoseconds. Most of that is:

The cache miss on loading B's thread structure (~30-80 ns).
The pmap switch (cheap on ARM64 thanks to ASIDs — a couple of register writes).
The continuation jump or stack unwind.

Adding pressure: if A and B's working sets are in different L2 caches (different clusters), the next few hundred memory accesses by B will miss. That's why the scheduler's affinity bias matters — keeping a thread on the same cluster pays for itself many times over.
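A back-of-the-envelope helper makes the cold-cache argument concrete. The formula is a deliberate oversimplification (it treats every cold access as a full miss and ignores overlap between misses), and every input is an estimate supplied by the caller, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Rough cost of a switch once cold-cache effects are included:
 * the bare switch plus one miss penalty per cold access. */
uint64_t toy_effective_switch_ns(uint64_t bare_switch_ns,
                                 uint64_t cold_accesses,
                                 uint64_t miss_penalty_ns)
{
    /* Guard against overflow in the multiplication. */
    assert(miss_penalty_ns == 0 ||
           cold_accesses <= UINT64_MAX / miss_penalty_ns);
    return bare_switch_ns + cold_accesses * miss_penalty_ns;
}
```

With purely hypothetical inputs of a 300 ns bare switch, 300 cold accesses, and a 100 ns miss penalty, the result is 30,300 ns, roughly a hundred times the bare switch. Under those assumptions the register shuffling is noise next to the cache refill.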

What surprises newcomers

Most context switches are voluntary, not preemptions. A thread blocking on mach_msg_receive or a pthread_mutex_lock is the common case; the timer-driven preemption path is a fallback for CPU-bound threads.
The continuation pattern means a sleeping thread's kernel stack is reusable. Memory pressure on the kernel stack pool is far lower than naively expected.
ASIDs remove the full TLB flush that a switch between address spaces traditionally required, so translations cached for other tasks can stay resident across the switch. This is a huge win for Apple Silicon.
The same thread_invoke runs whether you're preempting a CPU hog or returning from a syscall. One code path, dozens of triggers.

What to read next

Read thread_block and thread_invoke in full (both are in osfmk/kern/sched_prim.c, referenced in Step 3), then the structures they operate on:

apple-oss-distributions/xnu, osfmk/kern/thread.c: thread state machine (runnable, waiting, suspended, etc.).
apple-oss-distributions/xnu, osfmk/kern/processor.c: per-processor structures (runqs, current thread, idle thread).

Then re-read the scheduler article and notice how every claim about QoS turns into a runq-selection decision inside thread_select.