The Apple GPU and Metal driver: unified memory in practice

The Apple Silicon GPU is qualitatively different from a discrete GPU. It's a TBDR (tile-based deferred renderer) that shares main DRAM with the CPU, and its kernel-side drivers look and behave very differently from their Intel/AMD/Nvidia counterparts. This article walks through how Metal commands get from userspace to the silicon, what the kernel driver does, and why unified memory is the architectural feature that matters most.

The hardware: tile-based deferred rendering

A traditional immediate-mode renderer (most desktop GPUs historically) shades each triangle's fragments as the triangle arrives. Overdraw, where later triangles cover earlier ones, is wasted work: the shader for the covered pixel ran for nothing.

A TBDR splits the framebuffer into tiles (typically 16×16 or 32×32 pixels) and processes geometry in two phases:

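Tiling pass: for each tile, build a list of triangles that overlap it. No shading yet, just geometry sorting.
Shading pass: for each tile, resolve visibility among the triangles overlapping it first (hidden surface removal), then shade only the fragment that ends up visible at each pixel. Opaque geometry therefore incurs little or no overdraw, regardless of submission order.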
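The two phases can be sketched with a small model. This is an illustration under stated assumptions, not how the hardware is implemented: opaque axis-aligned rectangles stand in for triangles, and `Rect`, `bin_into_tiles`, `shade_deferred`, and `shade_immediate` are hypothetical names. It counts how many fragments each approach shades.

```python
from dataclasses import dataclass

TILE = 32  # tile edge in pixels; the text cites 16x16 or 32x32 as typical


@dataclass(frozen=True)
class Rect:
    """An opaque axis-aligned rectangle standing in for a triangle."""
    x0: int
    y0: int
    x1: int       # exclusive
    y1: int       # exclusive
    depth: float  # smaller means closer to the camera


def bin_into_tiles(prims, width, height, tile=TILE):
    """Tiling pass: map each tile coordinate to the primitives overlapping it."""
    bins = {}
    for p in prims:
        # Clamp the primitive to the framebuffer before binning.
        x0, y0 = max(p.x0, 0), max(p.y0, 0)
        x1, y1 = min(p.x1, width), min(p.y1, height)
        if x0 >= x1 or y0 >= y1:
            continue  # entirely off-screen
        for ty in range(y0 // tile, (y1 - 1) // tile + 1):
            for tx in range(x0 // tile, (x1 - 1) // tile + 1):
                bins.setdefault((tx, ty), []).append(p)
    return bins


def shade_deferred(bins, width, height, tile=TILE):
    """Shading pass: pick the nearest primitive per pixel, then shade once.

    Returns {(x, y): depth of the visible primitive}; its length is the
    number of fragments shaded.
    """
    visible = {}
    for (tx, ty), prims in bins.items():
        for y in range(ty * tile, min((ty + 1) * tile, height)):
            for x in range(tx * tile, min((tx + 1) * tile, width)):
                covering = [p for p in prims
                            if p.x0 <= x < p.x1 and p.y0 <= y < p.y1]
                if covering:
                    nearest = min(covering, key=lambda p: p.depth)
                    visible[(x, y)] = nearest.depth
    return visible


def shade_immediate(prims, width, height):
    """Immediate mode with a depth test, shading in submission order.

    A fragment is shaded whenever it passes the depth test on arrival,
    even if a later primitive covers it. Returns the shaded-fragment count.
    """
    depth = {}
    shaded = 0
    for p in prims:
        for y in range(max(p.y0, 0), min(p.y1, height)):
            for x in range(max(p.x0, 0), min(p.x1, width)):
                if p.depth < depth.get((x, y), float("inf")):
                    depth[(x, y)] = p.depth
                    shaded += 1
    return shaded


if __name__ == "__main__":
    # Three full-screen layers submitted far-to-near: the worst case for
    # an immediate-mode renderer with no sorting.
    layers = [Rect(0, 0, 64, 64, d) for d in (0.9, 0.5, 0.1)]
    bins = bin_into_tiles(layers, 64, 64)
    print("deferred :", len(shade_deferred(bins, 64, 64)))
    print("immediate:", shade_immediate(layers, 64, 64))
```

With three full-screen layers submitted far-to-near on a 64×64 target, the immediate-mode model shades every layer while the deferred model shades each pixel once.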

TBDR was pioneered by PowerVR (now Imagination); Apple's GPU inherits the lineage through PowerVR IP that Apple licensed from Imagination before moving to in-house designs. Every iPhone GPU since the iPhone 5s, and now every Apple Silicon Mac GPU, is TBDR.

The practical consequences:

Lower memory bandwidth for typical workloads: tile data stays on-chip in fast SRAM during shading rather than round-tripping to DRAM.
Some operations are cheaper than on immediate renderers: clears, MSAA resolves, and some blit patterns are close to free because they happen as part of the tile load or store that the render pass performs anyway.
Some shader patterns are more expensive: anything that breaks the tile boundary (e.g., reading from a framebuffer location far from the current tile) forces tile flushes.

This is why Apple's Metal best-practices guides emphasize different things than DirectX/Vulkan guides for desktop GPUs.

Unified memory — the architecture-defining feature

The single most consequential thing about Apple Silicon's GPU:

       Apple Silicon              Discrete GPU
       ─────────────              ────────────
                                   ┌──────────┐
   ┌──────┐   ┌───────┐           │  VRAM    │
   │ CPU  │←─→│       │           └────┬─────┘
   ├──────┤   │       │                │
   │ GPU  │←─→│ DRAM  │           ┌────┴─────┐
   ├──────┤   │       │           │  GPU     │
   │ NPU  │←─→│       │           └────┬─────┘
   └──────┘   └───────┘           ┌────┴─────┐
                                   │  PCIe    │
                                   └────┬─────┘
                                   ┌────┴─────┐
                                   │ CPU+DRAM │
                                   └──────────┘

CPU and GPU share the same physical DRAM. A buffer occupies the same physical pages for both, and each side reaches those pages through its own virtual mapping. There is no "upload to VRAM" step.

This isn't just a performance optimization — it changes the API. IOSurface is the kernel-level abstraction for memory that both CPU and GPU can map:

apple-oss-distributions/xnu, iokit/Kernel/IOMemoryDescriptor.cpp: IOMemoryDescriptor, the kernel-side representation of memory that can be mapped multiple ways.

An IOSurface is a chunk of DRAM that:

Has a CPU virtual mapping (read/write via IOSurfaceGetBaseAddress).
Has a GPU virtual mapping (referenced by Metal MTLBuffer/MTLTexture resources).
Has hardware coherency: writes from one side are visible to the other without explicit vkFlushMappedMemoryRanges-style calls.

For most workloads, the right pattern on Apple Silicon is: allocate an IOSurface, CPU populates it, GPU reads from it directly. Zero copies.
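The difference between sharing and uploading can be shown with a toy model. This is not Metal or IOSurface code: a `bytearray` stands in for the physical pages, two `memoryview` objects stand in for the CPU and GPU mappings, and a separate copy stands in for a discrete GPU's VRAM. The names `backing`, `cpu_view`, `gpu_view`, `vram_copy`, and `gpu_read` are all hypothetical. It shows aliasing versus copying only, and says nothing about how hardware coherency works.

```python
import struct

# One allocation stands in for the physical DRAM pages behind a shared buffer.
backing = bytearray(16)

# Two views over the same bytes stand in for the CPU and GPU virtual mappings.
cpu_view = memoryview(backing)
gpu_view = memoryview(backing)

# A discrete-GPU-style upload is modelled as a separate copy ("VRAM"),
# taken before the CPU writes anything.
vram_copy = bytearray(backing)

# The "CPU" writes four little-endian 32-bit floats through its mapping.
struct.pack_into("<4f", cpu_view, 0, 1.0, 2.0, 3.0, 4.0)


def gpu_read(buffer):
    """Read four floats the way a shader might read a vertex attribute."""
    return struct.unpack_from("<4f", buffer, 0)


if __name__ == "__main__":
    print("shared mapping:", gpu_read(gpu_view))   # sees the CPU's writes
    print("stale VRAM copy:", gpu_read(vram_copy))  # needs a re-upload
```

In the shared case the reader sees the write with no copy step; in the copy case the data stays stale until it is uploaded again.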

The Metal driver stack

From userspace down to silicon:

Metal API (MTLDevice, MTLCommandQueue, MTLCommandBuffer)
   ↓ encoded into command stream
MTLDriver.framework (userspace)
   ↓ IOUserClient via IOConnectCallMethod
Apple GPU driver kext (kernel)
   ↓ DMA-controllable command buffer
GPU firmware (in dedicated firmware processor on the GPU)
   ↓ hardware
GPU shader cores

The userspace Metal framework encodes drawing commands into a binary command stream — Apple's internal command format, not standardized. The command buffer is an IOSurface itself, which both the CPU (encoding) and GPU (consuming) can access.

When MTLCommandBuffer.commit runs, the userspace framework calls into the kernel via IOConnectCallMethod on the GPU's IOUserClient. The kernel driver:

Validates the command buffer (no out-of-bounds resource references, no privilege escalation attempts).
Hands it to the GPU firmware processor via a hardware queue.
Returns to userspace.

The GPU firmware processor — a small embedded controller on the GPU complex — actually parses the command stream and dispatches it to the shader cores. The OS-side kernel driver is surprisingly thin compared to historical desktop drivers; most of the heavy lifting moved into the GPU firmware.
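The division of labor described above can be modelled in a few lines. This is a sketch under stated assumptions, not Apple's driver: the real command format and driver are closed, so `Command`, `CommandBuffer`, `ToyKernelDriver`, and `ToyFirmware` are hypothetical stand-ins, and "validation" is reduced to a bounds check on resource references.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    resource_id: int  # index into the command buffer's resource table
    offset: int       # byte offset the command accesses
    length: int       # number of bytes the command accesses


@dataclass
class CommandBuffer:
    resources: dict   # resource_id -> allocation size in bytes
    commands: list    # encoded Command entries


class ToyKernelDriver:
    """Thin kernel side: validate, enqueue for firmware, return."""

    def __init__(self):
        self.firmware_queue = deque()

    def validate(self, cb):
        for cmd in cb.commands:
            size = cb.resources.get(cmd.resource_id)
            if size is None:
                raise ValueError(f"unknown resource {cmd.resource_id}")
            if (cmd.offset < 0 or cmd.length < 0
                    or cmd.offset + cmd.length > size):
                raise ValueError(
                    f"out-of-bounds access to resource {cmd.resource_id}")

    def submit(self, cb):
        self.validate(cb)               # step 1: reject bad references
        self.firmware_queue.append(cb)  # step 2: hand off to the firmware queue
        return True                     # step 3: return to userspace


class ToyFirmware:
    """Firmware side: parse queued command buffers and dispatch commands."""

    def __init__(self, driver):
        self.driver = driver
        self.dispatched = []

    def drain(self):
        """Consume everything queued; return the total commands dispatched."""
        while self.driver.firmware_queue:
            cb = self.driver.firmware_queue.popleft()
            self.dispatched.extend(cb.commands)
        return len(self.dispatched)
```

The point of the model is the shape: `submit` does a bounded amount of checking and returns, while parsing and dispatch happen later on the firmware side.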

What lives where in the source tree

The Apple GPU kernel driver is closed-source. The bits XNU contributes are the IOKit framework that hosts it:

apple-oss-distributions/xnu, iokit/Kernel/IOService.cpp: the base class the GPU driver subclasses, like every other IOKit driver.
apple-oss-distributions/xnu, iokit/Kernel/IOMemoryDescriptor.cpp: memory descriptors, the basis of IOSurface.

For the userspace Metal framework, headers are in /Applications/Xcode.app/.../Frameworks/Metal.framework/Headers/ once Xcode is installed.

Tile memory and persistent storage

Tile-local memory is fast SRAM on the GPU itself, accessible only to shaders running on that tile. It's much faster than DRAM but tiny (a few hundred KB). Apple exposes it via Metal's tile shaders and threadgroup memory APIs.

The tile-memory model is why Apple recommends:

Memoryless render targets for intermediate buffers — they live only in tile memory, never spill to DRAM, and disappear after the render pass.
Programmable Blending — read the previous fragment's color directly from tile memory in the shader, no DRAM round-trip.

These are first-class on Apple's GPU; emulating them on a discrete GPU is expensive or impossible.
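A rough way to see why memoryless targets matter is to count the DRAM traffic a render pass generates at its boundaries. The model below is a simplification with hypothetical names (`Attachment`, `dram_traffic`): it charges one full plane of traffic for each load and each store, ignores compression and partial tiles, and treats a memoryless attachment as one that can be neither loaded nor stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    name: str
    bytes_per_pixel: int
    load: bool                # read existing contents from DRAM into the tile
    store: bool               # write tile contents back to DRAM at pass end
    memoryless: bool = False  # lives only in tile memory, no DRAM backing


def dram_traffic(attachments, width, height):
    """Bytes moved between DRAM and tile memory at render-pass boundaries."""
    total = 0
    for a in attachments:
        if a.memoryless and (a.load or a.store):
            raise ValueError(
                f"{a.name}: a memoryless attachment has nothing to load or store")
        plane = width * height * a.bytes_per_pixel
        total += plane * (int(a.load) + int(a.store))
    return total


if __name__ == "__main__":
    W, H = 1920, 1080
    color = Attachment("color", 4, load=False, store=True)
    depth_stored = Attachment("depth", 4, load=False, store=True)
    depth_tile_only = Attachment("depth", 4, load=False, store=False,
                                 memoryless=True)
    print("depth stored    :", dram_traffic([color, depth_stored], W, H))
    print("depth memoryless:", dram_traffic([color, depth_tile_only], W, H))
```

For a 1920×1080 pass with a 4-byte color target and a 4-byte depth target, keeping depth in tile memory halves the boundary traffic in this model, because the depth plane is never written out.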

Why this matters for kernel-side resource management

Because GPU resources live in main DRAM:

Memory pressure includes GPU allocations. A heavy GPU workload contributes to overall pressure; the kernel's pageout daemon can target GPU-resident IOSurface pages.
footprint(1) shows GPU memory as part of a process's footprint. Activity Monitor's "Memory" column for an app includes Metal allocations.
No discrete-GPU-style VRAM eviction. On macOS with a discrete GPU (e.g., older Intel Macs with AMD), the driver could evict resources from VRAM and reload them later; on Apple Silicon, there's nothing to evict to — everything's already in main DRAM.

What surprises newcomers

There is no VRAM. Every "GPU memory" allocation is just main DRAM.
No upload step. A CPU-written buffer is immediately GPU-readable.
The GPU driver is thin. Most of the work moved into GPU firmware running on the GPU itself.
TBDR changes shader best practices. Things that are cheap on discrete GPUs may be expensive on Apple GPUs, and vice versa.

What to read next

apple-oss-distributions/xnu, iokit/Kernel/IOUserClient.cpp: IOUserClient, the mechanism Metal uses to call into the GPU driver from userspace.

And the Metal documentation at developer.apple.com — Apple's own best-practices guides cover TBDR-specific patterns.

For the broader Apple Silicon picture, see the Apple Silicon overview.