Apple Silicon and XNU: APRR, unified memory, AMX, Rosetta 2

When the M1 shipped in 2020, XNU was already running on ARM (on iOS, for over a decade). What was new was that Apple now controlled the chip, not just the OS. The kernel could assume hardware features that aren't in the ARM specification, in service of performance and security characteristics no other vendor could match.

This article walks the five most consequential of those: heterogeneous cores, APRR/SPRR, the AMX coprocessor, Rosetta 2, and unified memory.

P-cores and E-cores

Every Apple Silicon SoC has two kinds of CPU cores: performance (P) and efficiency (E). They share an ISA — both are ARMv8.5+ — but differ in pipeline width, frequency, and power. P-cores are big and hungry; E-cores are small and frugal.

The kernel sees them as just cores. The scheduler is where the policy lives:

apple-oss-distributions/xnu, osfmk/kern/sched_clutch.c: The clutch scheduler, which picks P vs E core per thread based on QoS + behavior.

A thread carries a recommendation that the scheduler uses to bias core selection: USER_INTERACTIVE and real-time threads bias to P, BACKGROUND to E, everything else to "anywhere." But the scheduler is allowed to override the recommendation when it would otherwise leave a core idle.
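The bias-then-override policy described above can be sketched as a toy model. This is not XNU's algorithm: the names `recommendation` and `pick_cluster`, the idle-core dictionary, and the choice to try P first for "anywhere" threads are all assumptions made for illustration.

```python
from enum import Enum

class Cluster(Enum):
    P = "performance"
    E = "efficiency"

class QoS(Enum):
    USER_INTERACTIVE = "user-interactive"
    DEFAULT = "default"
    BACKGROUND = "background"

def recommendation(qos, realtime=False):
    """Return the preferred cluster, or None for 'anywhere'."""
    if realtime or qos is QoS.USER_INTERACTIVE:
        return Cluster.P
    if qos is QoS.BACKGROUND:
        return Cluster.E
    return None

def pick_cluster(qos, idle_cores, realtime=False):
    """idle_cores maps each Cluster to its number of idle cores (hypothetical input)."""
    preferred = recommendation(qos, realtime)
    # Follow the recommendation when the preferred cluster has room.
    if preferred is not None and idle_cores[preferred] > 0:
        return preferred
    # Override (or "anywhere"): use any idle core rather than leave it idle.
    # Trying P before E is an assumption of this sketch.
    for cluster in (Cluster.P, Cluster.E):
        if idle_cores[cluster] > 0:
            return cluster
    # Nothing is idle: queue on the preferred cluster (P for "anywhere", assumed).
    return preferred if preferred is not None else Cluster.P
```

In this model a BACKGROUND thread lands on E when an E-core is free, but is allowed onto an idle P-core when the E-cluster is fully busy.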

The other interesting bit is cluster-level affinity. P-cores and E-cores live in separate clusters with separate L2 caches. The scheduler tries hard to keep a thread within its cluster (cache warmth), but it can migrate across clusters when load imbalance crosses a threshold.
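As a rough illustration of that trade-off, a migration check might look like the following. The function name, the use of runnable-thread counts as the load measure, and the threshold value are all hypothetical; the real scheduler's metric is more involved.

```python
MIGRATION_THRESHOLD = 2  # hypothetical value, chosen only for the example

def should_migrate(home_runnable, other_runnable, threshold=MIGRATION_THRESHOLD):
    """Stay in the home cluster (warm cache) unless the imbalance exceeds the threshold."""
    return home_runnable - other_runnable > threshold
```

A small imbalance keeps the thread where its cache is warm; only a large one justifies paying the cost of a cold L2 in the other cluster.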

Hardware features specific to the heterogeneous architecture:

CPU-private interrupt routing — the AIC (Apple Interrupt Controller) routes interrupts to specific clusters, so an E-cluster doesn't have to wake up to service a P-cluster's I/O.
Asymmetric performance counters — P and E cores expose different PMU events, so Instruments can compare them meaningfully.
pmgr driver controls per-cluster DVFS — voltage/frequency scaling done at the cluster level, not per-core.

APRR and SPRR: switching page permissions at hardware speed

A JIT compiler, such as the one in a JavaScript engine, needs to write to executable memory pages because it generates machine code on the fly. Standard W^X protection (a page is writable or executable but never both) makes this slow: every transition requires changing the page tables and shooting down TLB entries across every core.

Apple's solution is APRR (Apple Protection Register Remapping), which first appeared on A11-class chips, and its successor SPRR (System Protection Register Remapping), which the M1 and later use. Both are hardware features that let the CPU change a page's effective permissions per-thread by writing to a register, without touching the page table or invalidating the TLB.

apple-oss-distributions/xnu, osfmk/arm/pmap.c: Search for APRR / SPRR, XNU's interface to the hardware permission switching.

The mechanism, simplified:

A normally-mapped JIT page has a special permission group identifier (PGID) in its translation.
Two banks of permissions exist for that PGID — one read+execute, one read+write.
The thread switches which bank is active by writing a control register.
The MMU consults the active bank on every memory access; no TLB flush is needed.
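The four steps above can be modeled in a few lines. This is a conceptual sketch only: the names `PERMISSION_BANKS`, `PAGE_TABLE`, and `Thread.active_bank` are invented for illustration, and the real hardware encodes all of this in page-table bits and system registers rather than dictionaries.

```python
from dataclasses import dataclass

R, W, X = 1, 2, 4  # permission bits

JIT_GROUP = "jit"  # stands in for the PGID stored in the translation

# Two banks of permissions for the JIT group.
PERMISSION_BANKS = {JIT_GROUP: {"execute": R | X, "write": R | W}}

# The page table maps a page to its permission group and is never modified below.
PAGE_TABLE = {0x1000: JIT_GROUP}

@dataclass
class Thread:
    name: str
    active_bank: str = "execute"  # models the per-thread control register

def effective_permissions(thread, page):
    """What the MMU would allow for this thread on this page."""
    return PERMISSION_BANKS[PAGE_TABLE[page]][thread.active_bank]

def can(thread, page, access):
    return bool(effective_permissions(thread, page) & access)
```

Flipping `active_bank` on one thread changes what that thread may do to the page, while other threads keep their own view and the page table stays untouched, which is why no TLB invalidation is needed.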

JavaScriptCore in Safari uses this, as do Rosetta 2 and the WebKit content processes. So does any third-party app that maps its code region with MAP_JIT (and holds the com.apple.security.cs.allow-jit entitlement) and toggles write access with pthread_jit_write_protect_np, rather than calling mprotect on every transition.

On Intel Macs the equivalent operation cost a TLB shootdown across every core — visible as a several-microsecond pause. On Apple Silicon it's a register write, a few cycles.

AMX: the matrix coprocessor with no public ISA

Every Apple Silicon SoC has an AMX (Apple Matrix Extensions) coprocessor — undocumented in the ARM architecture reference, undocumented by Apple, but extensively used. It performs matrix multiplications dramatically faster than scalar or NEON code can.

What's known (from reverse engineering and the macOS Accelerate framework's symbol names):

AMX is per-cluster, not per-core — instructions are routed to a single coprocessor that serves all cores in the cluster.
It has dedicated registers — 8 X registers and 8 Y registers, each 512 bits, for the inputs.
It has a Z accumulator for the output: 64 rows of 64 bytes (4096 bytes in total) according to public reverse-engineering notes.
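Instructions are encoded in otherwise-unallocated A64 opcode space rather than as documented instructions, so they are opaque to a standard disassembler; each 32-bit instruction word encodes the operation and a general-purpose register whose value describes the source and destination registers.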
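To make the register layout concrete, here is a pure-Python model of the kind of operation such a unit performs. It assumes, as public reverse-engineering write-ups describe, that the basic step is an outer product of one X register and one Y register accumulated into Z; the function names are invented, and nothing here touches real AMX hardware.

```python
LANES = 512 // 32  # a 512-bit register holds 16 float32 lanes

def fma_outer(z, x, y):
    """Accumulate the outer product of y (rows) and x (columns) into z in place."""
    for i, yi in enumerate(y):
        for j, xj in enumerate(x):
            z[i][j] += yi * xj

def matmul_by_outer_products(a, b):
    """Multiply a (LANES x K) by b (K x LANES) with one outer product per k."""
    z = [[0.0] * LANES for _ in range(LANES)]
    for k in range(len(b)):
        y = [a[i][k] for i in range(LANES)]  # column k of a, loaded as a "Y register"
        x = b[k]                             # row k of b, loaded as an "X register"
        fma_outer(z, x, y)
    return z

# One float32 result tile is 16 x 16 values of 4 bytes each.
TILE_BYTES = LANES * LANES * 4
```

Each outer-product step performs 256 multiply-accumulates from just two 64-byte inputs, which is where the advantage over lane-by-lane NEON code comes from in this model.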

You don't write AMX code directly. You call Accelerate's BNNS/vDSP functions or use Core ML — Accelerate dispatches the right AMX sequence under the hood. Apple has been actively replacing AMX with its successor SME (Scalable Matrix Extension, ARM's standard equivalent) on more recent silicon, so this might end up being an interesting historical detour.

Why mention AMX in a kernel article: the kernel has to save/restore AMX state on context switch, just like FPU state. That code lives in XNU and is one of the few places the existence of AMX is acknowledged in public source.

apple-oss-distributions/xnu, osfmk/arm64/cpu_data.h: Per-CPU data including the saved AMX state on context switch.

Rosetta 2: a binary translator built into the OS

When Apple shipped the M1, every Mac app in existence was an x86_64 binary. Rosetta 2 was the bridge — Apple's binary translator that runs x86_64 code on ARM, transparently.

The architecture, briefly:

Ahead-of-time (AOT) translation for the bulk of code. When you launch an x86_64 binary the first time, a Rosetta daemon translates the binary's machine code to ARM and caches the result. Subsequent launches use the cached ARM binary directly.
Just-in-time (JIT) translation for code generated at runtime — JIT compilers, dynamic class loading, anything that produces code Rosetta hasn't seen.
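The split between a persistent AOT cache and on-demand JIT translation can be sketched as follows. Rosetta's runtime is closed source, so everything here is an assumption made for illustration: the class name, keying the cache by a SHA-256 of the code, and the `translate` callback that stands in for the real translator.

```python
import hashlib

class TranslationCache:
    """Toy model: translate once ahead of time, reuse on later launches."""

    def __init__(self, translate):
        self._translate = translate  # hypothetical x86_64 -> arm64 translator
        self._aot = {}               # stands in for the on-disk cache
        self.translations = 0        # counts how often real work was done

    def launch(self, x86_code):
        """AOT path: translate on first launch, hit the cache afterwards."""
        key = hashlib.sha256(x86_code).hexdigest()
        if key not in self._aot:
            self._aot[key] = self._translate(x86_code)
            self.translations += 1
        return self._aot[key]

    def jit(self, x86_fragment):
        """JIT path: code produced at runtime is translated on demand, not persisted."""
        self.translations += 1
        return self._translate(x86_fragment)
```

With a stand-in translator such as `lambda code: b"arm64:" + code`, launching the same binary twice performs one translation, while every runtime-generated fragment costs a fresh one.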

The translator runs in a sandboxed userspace daemon (runtime). The cached translations live under /var/db/oah/. The runtime itself is closed source but its userspace footprint is observable — you can see it spawn whenever an Intel binary launches.

Three kernel-side details:

Translated code uses the JIT page-permission mechanism (APRR/SPRR above) — Rosetta is one of the heaviest users of fast W↔X switching.
Hardware acceleration: M-series chips support a special total store ordering mode (TSO) at the CPU level. ARM is normally weakly ordered; x86 is strongly ordered. When a thread is running Rosetta-translated code, the kernel sets a per-thread flag that makes that thread's memory accesses behave with x86-style TSO. This is hardware support specifically for binary translation.
System call translation: an x86_64 binary calls syscall with x86 calling conventions; the kernel's syscall entry has to fish out the right arguments and dispatch. The translation layer is in bsd/dev/i386/munge.s and equivalents.
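Why the TSO mode matters can be shown with the classic message-passing litmus test: one thread writes data and then sets a flag, another reads the flag and then the data. The sketch below is a deliberately simplified model (per-thread reordering followed by a sequential interleaving), not a faithful model of either architecture; the function names and the "tso"/"weak" labels are this example's own.

```python
from itertools import permutations

# Each operation is (kind, address, value-or-register-name).
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]

def may_swap(earlier, later, model):
    """May `later` execute before `earlier` (both from the same thread)?"""
    if earlier[1] == later[1]:
        return False  # same address: keep program order
    if model == "weak":
        return True   # independent accesses may reorder freely
    # TSO: only a store followed by a later load may be reordered.
    return earlier[0] == "store" and later[0] == "load"

def allowed_orders(thread, model):
    orders = []
    for perm in permutations(range(len(thread))):
        if all(may_swap(thread[perm[b]], thread[perm[a]], model)
               for a in range(len(perm)) for b in range(a + 1, len(perm))
               if perm[a] > perm[b]):
            orders.append([thread[i] for i in perm])
    return orders

def interleavings(a, b):
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(model):
    """All (r_flag, r_data) pairs the reader can observe under the model."""
    results = set()
    for w in allowed_orders(WRITER, model):
        for r in allowed_orders(READER, model):
            for schedule in interleavings(w, r):
                memory, regs = {"data": 0, "flag": 0}, {}
                for kind, addr, arg in schedule:
                    if kind == "store":
                        memory[addr] = arg
                    else:
                        regs[arg] = memory[addr]
                results.add((regs["r_flag"], regs["r_data"]))
    return results
```

Under the "tso" model the reader can never see the flag set with stale data; under the "weak" model it can. x86 code is written assuming the former, so without the hardware TSO mode a translator would have to insert barriers around ordinary loads and stores to stay correct.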

apple-oss-distributions/xnu, osfmk/arm64/genassym.c: Search for TSO and x86, where Apple Silicon's x86-compatibility mode interacts with the scheduler.

Rosetta is being deprecated. Recent macOS releases warn that future versions will remove it. Apple is using the long deprecation runway to push remaining holdouts to ship native ARM builds.

Unified memory: GPU and CPU on one substrate

The other big architectural change Apple Silicon brought is unified memory — the GPU and CPU share the same DRAM, with the same physical addresses. There's no separate VRAM, no IOSurface copying between CPU and GPU, no PCIe transfers.

For the kernel, this changes who allocates pages and who maps them. An IOSurface allocated for GPU rendering is just a VM region that both the CPU's pmap and the GPU's translation hardware can map. Metal can issue a render command that operates directly on memory the CPU just wrote.

For drivers, this changes the layout of memory descriptors. The GPU driver doesn't have to maintain a CPU-side staging buffer + a GPU-side resident buffer; one mapping serves both. IOKit's IOSurface abstraction was retrofitted to take advantage.
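A loose analogy in plain Python shows the difference between the two driver models. The function names are invented, and a `memoryview` is only a stand-in for a second mapping of the same physical pages; no GPU is involved.

```python
def staged_upload(cpu_buffer):
    """Discrete-GPU style: the GPU side gets its own copy of the data."""
    return bytearray(cpu_buffer)

def shared_mapping(cpu_buffer):
    """Unified-memory style: a second view of the same storage, no copy."""
    return memoryview(cpu_buffer)
```

A write through the CPU buffer is immediately visible through the shared view, whereas the staged copy stays stale until it is uploaded again.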

The downside is contention — CPU and GPU compete for memory bandwidth on the same controller. Apple's solution is generous bandwidth (M-series chips have memory bandwidth several times higher than comparable Intel laptops) and aggressive prefetching.

What to read next

For the per-architecture initialization code:

apple-oss-distributions/xnu, osfmk/arm64/start.s: The very first ARM instructions XNU executes on boot.
apple-oss-distributions/xnu, osfmk/arm/cpu.c: Per-CPU bring-up; configures APRR/SPRR, sets up exception vectors.

See also the virtual memory article: Apple Silicon's permission tricks happen at the pmap layer, which that article walks through.