mmap on XNU: what really happens when you map a file

mmap(2) is the syscall that turns a file into memory. The semantics are simple: pass an address hint, a length, protection and mapping flags, a file descriptor, and an offset; get back a pointer. Read or write through the pointer and the kernel deals with paging from disk.

The implementation is one of the most layered things in XNU. It crosses the BSD/Mach seam, builds a chain of VM objects, defers all the actual work until the first page fault, then bottoms out in pmap to install the hardware translation. This article walks that chain.
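Before walking the kernel side, here is a minimal userspace sketch of the call this article follows. The helper names read_prefix_via_mmap and make_temp_file are made up for illustration; only the POSIX calls (open, fstat, mmap, munmap, mkstemp) are real.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps a whole file read-only and copies its first
 * n bytes into out. Returns 0 on success, -1 on any failure. */
int read_prefix_via_mmap(const char *path, char *out, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < n) {
        close(fd);
        return -1;
    }

    /* addr = NULL lets the kernel choose the address. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps its own reference to the file */
    if (p == MAP_FAILED)
        return -1;

    memcpy(out, p, n); /* first touch: pages are faulted in here */
    munmap(p, (size_t)st.st_size);
    return 0;
}

/* Hypothetical helper for the demo: creates a temp file from a mkstemp
 * template and writes text into it. Returns 0 on success. */
int make_temp_file(char *path_template, const char *text)
{
    int fd = mkstemp(path_template);
    if (fd < 0)
        return -1;
    size_t n = strlen(text);
    int ok = write(fd, text, n) == (ssize_t)n;
    close(fd);
    return ok ? 0 : -1;
}
```

Note that the file descriptor is closed right after mmap returns. The mapping stays valid, which is the first hint that the kernel tracks the file through its own objects rather than through the descriptor.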

Step 1: BSD takes the syscall

mmap arrives in the kernel through the normal BSD syscall path. It has a positive syscall number, is looked up in sysent[], and dispatches to mmap() in bsd/kern/kern_mman.c:

Source: apple-oss-distributions/xnu, bsd/kern/kern_mman.c (mmap / munmap / mprotect, the BSD-side entry points for the VM syscalls).

The BSD code validates the arguments (flags compatibility, alignment, length non-zero), resolves the file descriptor to a vnode if MAP_ANON wasn't set, checks credentials and signing requirements, then hands off to Mach.

A few interesting validations happen here:

MAP_JIT requires the calling process to hold com.apple.security.cs.allow-jit when it runs under the hardened runtime.
PROT_EXEC on a non-MAP_JIT file mapping is subject to code-signing enforcement: where enforcement is active for the process, the file must be code-signed.
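MAP_FIXED is not a hint: the mapping is placed at exactly the requested page-aligned address, replacing any existing mappings in that range, or the call fails. Without MAP_FIXED the address is only a hint, and XNU is free to place the mapping elsewhere.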
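The simplest of these argument checks can be observed from userspace. The sketch below uses a made-up helper, mmap_errno, to issue deliberately invalid calls and report the resulting errno. The expected values (EINVAL for a zero length, EBADF for a bad descriptor on a non-anonymous mapping) follow the POSIX and macOS man-page descriptions; the entitlement and code-signing checks are not exercised because they depend on platform policy.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns the errno produced by an mmap call that is
 * expected to be rejected, or 0 if the call unexpectedly succeeded (in
 * which case the mapping is released again). */
int mmap_errno(size_t len, int flags, int fd)
{
    errno = 0;
    void *p = mmap(NULL, len, PROT_READ, flags, fd, 0);
    if (p != MAP_FAILED) {
        munmap(p, len);
        return 0;
    }
    return errno;
}
```

Calling mmap_errno(0, MAP_PRIVATE | MAP_ANON, -1) exercises the zero-length check, and mmap_errno(4096, MAP_PRIVATE, -1) exercises the descriptor lookup that only happens when MAP_ANON is absent.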

Step 2: Mach builds the vm_object

Below BSD, Mach's vm_map_enter is the workhorse. For a file-backed mmap, the kernel needs a vm_object whose pages come from the file. It looks up (or creates) the file's vnode-pager:

Source: apple-oss-distributions/xnu, osfmk/vm/vnode_pager.c (the vnode pager, which pulls file pages on demand for file-backed mappings).

The vnode-pager is a memory object attached to the file's vm_object; it knows how to materialize a page by issuing a vnode read at a given offset. Every mapping of a given file shares that one vm_object and its pager. A second mmap of the same file doesn't create a second cache: both mappings point at the same vm_object, which is also what the unified buffer cache uses for that file.

This unification is why reading a file via read(2) and then mapping it via mmap doesn't double the memory cost: both paths consult the same vm_object, and pages already brought in for read are immediately available to mmap.
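One way to see the unification from userspace is to store through a MAP_SHARED mapping and read the same bytes back with pread(2) without calling msync in between. The helper name store_then_pread is made up for this sketch; it assumes a writable /tmp and a message of at most 64 bytes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: creates a one-page temp file, stores msg through a
 * MAP_SHARED mapping, then reads the bytes back with pread(2) without
 * calling msync. Returns 1 if pread saw the stored bytes, 0 if it did
 * not, and -1 on a setup failure. */
int store_then_pread(const char *msg)
{
    char path[] = "/tmp/ubc_demo_XXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = strlen(msg);
    char buf[64];
    int seen = -1;

    if (n == 0 || n > sizeof buf)
        return -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives until fd and mapping are gone */

    if (ftruncate(fd, (off_t)page) == 0) {
        char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, msg, n); /* store through the mapping */
            if (pread(fd, buf, n, 0) == (ssize_t)n)
                seen = memcmp(buf, msg, n) == 0;
            munmap(p, page);
        }
    }
    close(fd);
    return seen;
}
```

If the two paths kept separate caches, pread would return the zero bytes that ftruncate left in the file. With a unified cache it returns what the mapping just stored.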

Step 3: vm_map_entry — the per-task record

Once Mach has the vm_object, it adds a vm_map_entry to the calling task's vm_map. The entry records:

The virtual address range covered.
A pointer to the vm_object (with reference count taken).
An offset into the vm_object.
Permissions (max-allowed + current).
Flags (private vs shared, copy-on-write, no-copy, jit, etc.).

Source: apple-oss-distributions/xnu, osfmk/vm/vm_map.c (vm_map_enter, the function every mmap, malloc, and stack-grow ends up calling).
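As a rough mental model of the record described above, here is a toy version in C. Every name here (toy_map_entry, toy_lookup, toy_object_offset) is invented for illustration and the layout is not XNU's; the real structure in vm_map.h carries many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: field names are illustrative, not XNU's definitions. */
struct toy_object; /* stands in for vm_object */

struct toy_map_entry {
    uint64_t start, end;        /* virtual range [start, end) */
    struct toy_object *object;  /* backing object (a reference is held) */
    uint64_t offset;            /* offset of start within the object */
    unsigned prot, max_prot;    /* current and maximum permissions */
    unsigned is_shared : 1;     /* shared vs private mapping */
    unsigned needs_copy : 1;    /* copy-on-write still pending */
};

/* Finds the entry covering addr in an array sorted by address with
 * non-overlapping ranges. Returns NULL if no entry covers addr. */
const struct toy_map_entry *
toy_lookup(const struct toy_map_entry *entries, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < entries[mid].start)
            hi = mid;
        else if (addr >= entries[mid].end)
            lo = mid + 1;
        else
            return &entries[mid];
    }
    return NULL;
}

/* Translates a virtual address into an offset within the backing object. */
uint64_t toy_object_offset(const struct toy_map_entry *e, uint64_t addr)
{
    return e->offset + (addr - e->start);
}
```

The address-to-offset translation in toy_object_offset is the piece the fault handler needs later: it turns "the task touched this address" into "fetch this page of the object."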

Notably absent: any actual page mappings. The vm_map_entry exists; the pmap has nothing yet. mmap returns the chosen virtual address to userspace, having allocated effectively zero physical memory.

This is the lazy-allocation contract. A program that mmaps 50 GB and never touches most of it pays for only what it reads.
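The lazy-allocation contract can be observed with mincore(2), which reports which pages of a range are resident. The sketch below is an experiment under assumptions, not a guarantee: resident_pages and lazy_demo are made-up helper names, and the exact counts depend on the platform (some kernels map more than one page per fault).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: counts the pages of [p, p+len) that mincore(2)
 * reports as resident. Returns -1 on error. */
long resident_pages(void *p, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    long count = 0;

    if (vec == NULL)
        return -1;
    /* The vector parameter is char* on macOS and unsigned char* on Linux;
     * a void* converts implicitly to either. */
    if (mincore(p, len, (void *)vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1; /* bit 0 means "in core" */
    free(vec);
    return count;
}

/* Hypothetical helper: maps len bytes of anonymous memory, touches the
 * first `touch` pages, and reports resident counts before and after. */
int lazy_demo(size_t len, size_t touch, long *before, long *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, len);
    for (size_t i = 0; i < touch && i * page < len; i++)
        p[i * page] = 1; /* each store faults one page in */
    *after = resident_pages(p, len);

    munmap(p, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical system, mapping 64 MB and touching four pages leaves the resident count a tiny fraction of the mapped size, which is the "pays for only what it reads" behavior in miniature.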

Step 4: the first-touch fault

The first time userspace dereferences the returned pointer, the CPU's MMU has no translation and raises a fault. The XNU trap handler routes it to vm_fault:

Source: apple-oss-distributions/xnu, osfmk/vm/vm_fault.c (vm_fault, the soft-page-fault entry point; every demand-paged read lands here).

vm_fault consults the task's vm_map, finds the entry covering the faulting address, and asks the entry's vm_object for the page at the right offset. For a vnode-pager-backed object, a page that is not already resident triggers a vnode read: the actual disk I/O happens now, not at mmap time.

When the page lands in memory, vm_fault:

Inserts the page into the vm_object's list of resident pages.
Calls pmap_enter to install a hardware PTE mapping the user virtual address to the physical page, with the right permissions.
Returns. The CPU retries the faulting instruction. It succeeds.

The whole thing takes anywhere from microseconds (if the page was already in the buffer cache) to milliseconds (if it had to come from SSD). The user sees only "my pointer dereference worked."
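The fault path can be modeled in a few lines. What follows is a toy simulation, not kernel code: the types and functions (toy_object, toy_task, toy_fault, toy_load) are invented, the "page size" is 16 bytes, and a memcpy stands in for the vnode read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE   16u /* tiny page size to keep the model readable */
#define TOY_NPAGES 4u

/* Toy model only: names are illustrative, not XNU's. */
struct toy_object {
    const uint8_t *backing;                /* stands in for the file */
    uint8_t frames[TOY_NPAGES][TOY_PAGE];  /* "physical" pages */
    uint8_t resident[TOY_NPAGES];          /* is the page in the object? */
    unsigned pageins;                      /* simulated disk reads */
};

struct toy_task {
    struct toy_object *object;   /* one map entry covering the object */
    uint8_t *pmap[TOY_NPAGES];   /* "hardware" translations, NULL = none */
};

/* Simulated fault: find the page in the object, paging it in if it is
 * not resident, then install the translation (the pmap_enter step). */
static void toy_fault(struct toy_task *t, unsigned vpage)
{
    struct toy_object *o = t->object;
    if (!o->resident[vpage]) {
        memcpy(o->frames[vpage], o->backing + vpage * TOY_PAGE, TOY_PAGE);
        o->resident[vpage] = 1;
        o->pageins++;
    }
    t->pmap[vpage] = o->frames[vpage];
}

/* Simulated load: fault on a missing translation, then retry the access.
 * addr must be below TOY_PAGE * TOY_NPAGES. */
uint8_t toy_load(struct toy_task *t, unsigned addr)
{
    unsigned vpage = addr / TOY_PAGE;
    if (t->pmap[vpage] == NULL)
        toy_fault(t, vpage);
    return t->pmap[vpage][addr % TOY_PAGE];
}
```

Two tasks sharing one toy_object show the microseconds-versus-milliseconds split: the first load of a page counts a page-in, while a second task faulting on the same page only installs a translation.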

Step 5: writes — when shared becomes private

For a MAP_SHARED mapping, writes go to the same vm_object the file backs. Modifications are eventually written back to disk via the pageout daemon — msync(2) forces this immediately.

For a MAP_PRIVATE mapping, the first write to a page triggers copy-on-write. The vm_object's page is read-only in the pmap; the write faults; the fault handler:

Allocates a fresh physical page.
Copies the original page's contents into it.
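Inserts the new page into a shadow object that the task's vm_map_entry now points to; pages that have not been copied are still found by following the shadow chain down to the original object.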
Installs the new pmap mapping as read-write.

Subsequent reads/writes use the private page. The original file-backed page stays untouched, available to other tasks mapping the same file.

This shadow-object chain is how fork-and-modify works without immediately doubling memory: parent and child start out sharing the same pages, and a new page is consumed only when one side writes.
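The difference between the two mapping types is easy to check from userspace. This sketch maps the same one-page temp file twice, once MAP_PRIVATE and once MAP_SHARED, and uses pread(2) to see which stores reach the file. The struct cow_result and the function cow_demo are names made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Results of the experiment, one byte each. */
struct cow_result {
    char file_after_private_write; /* file byte after a MAP_PRIVATE store */
    char file_after_shared_write;  /* file byte after a MAP_SHARED store */
    char private_view_at_end;      /* what the private mapping still sees */
};

/* Hypothetical helper: returns 0 and fills *r on success, -1 on failure. */
int cow_demo(struct cow_result *r)
{
    char path[] = "/tmp/cow_demo_XXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int rc = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) != 0 || pwrite(fd, "A", 1, 0) != 1) {
        close(fd);
        return -1;
    }

    char *priv = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    char *shrd = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (priv != MAP_FAILED && shrd != MAP_FAILED) {
        priv[0] = 'B'; /* write fault: the page is copied for this mapping */
        if (pread(fd, &r->file_after_private_write, 1, 0) == 1) {
            shrd[0] = 'C'; /* store lands on the file's own page */
            if (pread(fd, &r->file_after_shared_write, 1, 0) == 1) {
                r->private_view_at_end = priv[0];
                rc = 0;
            }
        }
    }
    if (priv != MAP_FAILED)
        munmap(priv, page);
    if (shrd != MAP_FAILED)
        munmap(shrd, page);
    close(fd);
    return rc;
}
```

The private store never reaches the file, the shared store does, and the private mapping keeps its own copy even after the file's page changes underneath it.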

Step 6: munmap — tearing it down

munmap(addr, length) removes the mapping. The kernel:

Finds the vm_map_entry (or entries) covering the range, splitting an entry where the range covers only part of it.
Removes the pmap entries for every mapped page in the range.
Drops the vm_object reference.
If the vm_object's refcount drops to zero, it's destroyed and its pages freed; dirty file-backed pages are written back to the file first.

For mappings shared with other tasks, the page release goes through ref-count drops, not physical free, until the last task unmaps.
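Entry splitting is visible from userspace: munmap accepts any page-aligned sub-range of a mapping. The sketch below, using a made-up helper punch_hole_demo, maps three anonymous pages, unmaps only the middle one, and checks that the outer pages survive.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the outer pages kept their contents
 * after the middle page was unmapped, 0 if data was lost, and -1 on a
 * syscall failure. */
int punch_hole_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'x';        /* first page */
    p[2 * page] = 'z'; /* third page */

    /* Unmap only the middle page: one mapping becomes two. */
    if (munmap(p + page, page) != 0) {
        munmap(p, 3 * page);
        return -1;
    }

    int ok = (p[0] == 'x' && p[2 * page] == 'z');

    /* The two remaining pieces are unmapped separately. */
    munmap(p, page);
    munmap(p + 2 * page, page);
    return ok;
}
```

Touching the middle page after the munmap would fault with no entry to satisfy it, so the demo deliberately leaves that address alone.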

Common surprises

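MAP_ANON | MAP_PRIVATE gives you the same kind of memory malloc uses for large allocations (on macOS the allocator typically requests it through the Mach VM calls rather than mmap itself). The vm_object is anonymous, zero-fill on first touch. malloc of 100 MB allocates almost no physical memory until you touch it.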
You can mmap a file larger than RAM. The kernel only pages in what you touch; LRU eviction handles the rest. Common pattern for large datasets.
mmap doesn't bypass the page cache. Pages are shared with read(2). Both syscalls hit the same vm_object.
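madvise(MADV_DONTNEED) doesn't unmap. It tells the kernel the pages can be reclaimed; the mapping stays in place and the next touch faults the data back in (re-read from the file for file-backed mappings). Don't assume the Linux behavior of zero-filling anonymous memory: the macOS man page describes MADV_DONTNEED only as a hint that the range won't be accessed soon, and lists MADV_FREE as the call that lets the pages be reused right away.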
The shared region (where dylibs live) is mapped into every process when it execs, with its code pages read-execute. It's the single biggest mapping on the system, and it explains why a fresh process has ~1 GB of virtual size but tiny resident memory: almost all of those pages are shared with every other process.
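The madvise point is easy to confirm for a file-backed mapping. In this sketch, dontneed_demo is a made-up helper that stores into a MAP_SHARED mapping, calls madvise with MADV_DONTNEED, and then reads the bytes again. It only covers the file-backed case; behavior for anonymous memory differs between systems and is not tested here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the mapping is still usable and still
 * shows the stored bytes after MADV_DONTNEED, 0 if the bytes changed, and
 * -1 on a setup or syscall failure. */
int dontneed_demo(void)
{
    char path[] = "/tmp/madv_demo_XXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (ftruncate(fd, (off_t)page) == 0) {
        char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, "keep", 4);
            /* Advice only: the mapping itself stays in place. */
            if (madvise(p, page, MADV_DONTNEED) == 0)
                result = memcmp(p, "keep", 4) == 0;
            munmap(p, page);
        }
    }
    close(fd);
    return result;
}
```

The read after madvise succeeds without a new mmap call, which is the whole point: the advice affects residency, not the mapping.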

What to read next

Source: apple-oss-distributions/xnu, osfmk/vm/vm_object.c (vm_object lifecycle: alloc, ref, deactivate, terminate).
Source: apple-oss-distributions/xnu, osfmk/vm/vm_pageout.c (the pageout daemon, which pages dirty mmapped data back to disk).

And re-read the virtual memory overview — once you've seen one full mmap, the pmap/vm_map/vm_object split makes immediate sense.