Inside APFS: copy-on-write, snapshots, and the sealed system volume
Apple File System, the format under every modern Mac: how it lays out blocks, how it gets snapshots almost for free, and why your /System is read-only at the cryptographic level.
Apple File System landed in 2017 and is now the only filesystem Apple actively ships for boot. If you're on any Mac newer than ~2018 you're running APFS: on the SSD, on the recovery partition, and (for backup disks set up under macOS 11 or later) on Time Machine backups.
It's interesting for three reasons: copy-on-write everywhere, snapshots that cost almost nothing, and a cryptographically sealed system volume that's the foundation of macOS integrity. We'll walk all three.
XNU's job is small
Open XNU and grep for apfs — you'll mostly find VFS plumbing. Apple's APFS driver is closed source. What lives in XNU is the VFS interface: vnode operations, mount handling, the bridge that turns POSIX syscalls (open, read, stat) into VOP calls on whatever filesystem owns that vnode.
- bsd/vfs/vfs_syscalls.c (apple-oss-distributions/xnu): POSIX file syscalls dispatched into VFS.
- bsd/sys/vnode_if.h (apple-oss-distributions/xnu): the vnop_ table, every operation a filesystem must implement.
The format itself — superblocks, btrees, encryption keys, snapshot metadata — is documented publicly in Apple's APFS Reference, and that's the source of truth for everything in this article.
Containers and volumes
APFS has two layers of "filesystem":
- A container spans a partition and owns the free space.
- One or more volumes live inside a container, sharing that free space dynamically.
This is why running df on a Mac shows volumes that all report the same free space — the container's free space, not the volume's. The volumes don't have a fixed allocation; they grow and shrink as files come and go.
Disk
└── Container 1 (free space pool)
    ├── Volume: Macintosh HD          ← sealed, read-only system
    ├── Volume: Macintosh HD - Data   ← writable user data
    ├── Volume: Preboot               ← boot policy / signed system
    ├── Volume: Recovery
    └── Volume: VM                    ← swap files
That's a stock APFS layout. The fact that those five things share one free-space pool is the entire reason Macs don't run out of space on /System while there's room on /Users.
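To make the shared pool concrete, here is a toy model in Python. It is only a sketch: the Container and Volume classes and their block counts are invented for illustration and have nothing to do with the on-disk structures. The point is that a volume has no allocation of its own, so every volume reports the same free number.

```python
class Container:
    """Toy model: one pool of blocks shared by every volume inside it."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.volumes = {}

    def add_volume(self, name):
        vol = Volume(name, self)
        self.volumes[name] = vol
        return vol

    @property
    def free_blocks(self):
        used = sum(v.used_blocks for v in self.volumes.values())
        return self.total_blocks - used


class Volume:
    """Hypothetical volume: tracks what it uses, owns no space itself."""

    def __init__(self, name, container):
        self.name = name
        self.container = container
        self.used_blocks = 0

    def write(self, blocks):
        if blocks > self.container.free_blocks:
            raise OSError("container is out of space")
        self.used_blocks += blocks

    @property
    def free_blocks(self):
        # No per-volume allocation: a volume simply reports the pool.
        return self.container.free_blocks


container = Container(total_blocks=1000)
system = container.add_volume("Macintosh HD")
data = container.add_volume("Macintosh HD - Data")
system.write(150)
data.write(600)
print(system.free_blocks, data.free_blocks)  # prints "250 250"
```

Both volumes print the same figure, which is the same thing df is showing you on a real Mac.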
Copy-on-write is the default operation
Every write in APFS is to a fresh block. When you modify a file:
- APFS allocates a new physical block.
- Writes the new data there.
- Updates the file's extent record to point at the new block.
- Updates the volume's b-tree.
- Writes a new checkpoint (the superblock-equivalent) atomically.
The old block stays valid until nothing references it — and a snapshot is, conceptually, a reference. This is why snapshots are nearly free: no copy happens. You just stop garbage-collecting the old blocks because the snapshot still points at them.
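The steps above can be sketched as a toy block store in Python. Everything here (CowStore, the reference counts, the block map) is a simplified assumption for illustration, not how APFS lays things out. It shows the one idea that matters: a write goes to a fresh block, and a snapshot is just an extra reference that keeps the old block alive.

```python
class CowStore:
    """Toy copy-on-write block store with reference-counted blocks."""

    def __init__(self):
        self.blocks = {}      # physical block number -> bytes
        self.refcount = {}    # physical block number -> number of references
        self.next_block = 0
        self.live = {}        # logical block index -> physical block number
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def _release(self, phys):
        # Drop one reference; free the block once nothing points at it.
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:
            del self.blocks[phys]
            del self.refcount[phys]

    def write(self, logical, data):
        # Never overwrite in place: allocate a fresh block, then repoint.
        phys = self.next_block
        self.next_block += 1
        self.blocks[phys] = data
        self.refcount[phys] = 1
        old = self.live.get(logical)
        self.live[logical] = phys
        if old is not None:
            self._release(old)

    def snapshot(self, name):
        # Copies only the map (metadata), never the data blocks.
        frozen = dict(self.live)
        for phys in frozen.values():
            self.refcount[phys] += 1
        self.snapshots[name] = frozen

    def read(self, logical, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[logical]]


store = CowStore()
store.write(0, b"v1")
store.snapshot("before-update")
store.write(0, b"v2")
print(store.read(0), store.read(0, snapshot="before-update"))  # b'v2' b'v1'
print(len(store.blocks))  # 2: the old block survives because the snapshot holds it
```

Without the snapshot, the second write would have dropped the old block's count to zero and freed it. Taking the snapshot copied a dictionary, not data.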
Tools you've already used build on this:
- Time Machine on APFS uses local snapshots between hourly external backups.
- tmutil localsnapshot lets you make one yourself.
- mount_apfs -s <name> mounts a snapshot read-only at a path you choose.
Forensics tooling, system updates, and Time Machine itself all use this.
Cloning: zero-copy file copies
Hold ⌥ in Finder while dragging a file — you'll see "Copy" appear. On APFS, that copy doesn't actually copy any data. APFS clones the file's extent records, marks the blocks as shared, and only allocates new blocks when one side writes to them. From the command line:
cp -c source.iso dest.iso # -c = clone on APFS
dest.iso exists instantly even if the source is 50 GB. Modifying either file later breaks the sharing for the blocks you touch — classic copy-on-write semantics.
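If you want the same thing from a script, one option is to call clonefile(2) through ctypes and fall back to a normal copy when it isn't available. This is a sketch under assumptions: clone_or_copy is a made-up helper name, the flags argument is passed as 0, and the fallback path is my choice rather than anything macOS does for you. Python's standard library has no direct clonefile wrapper that I'm aware of, hence ctypes.

```python
import ctypes
import os
import shutil
import sys
import tempfile


def clone_or_copy(src, dst):
    """Try clonefile(2) first; fall back to a byte-for-byte copy.

    Returns "clone" or "copy" so the caller can see which path ran.
    """
    if sys.platform == "darwin":
        # Assumption: clonefile is reachable through the process's default
        # symbol namespace (it lives in libSystem).
        libc = ctypes.CDLL(None, use_errno=True)
        clonefile = getattr(libc, "clonefile", None)
        if clonefile is not None:
            clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint32]
            clonefile.restype = ctypes.c_int
            if clonefile(os.fsencode(src), os.fsencode(dst), 0) == 0:
                return "clone"
            # Non-zero means the clone failed (for example, the destination
            # is on a different volume). Fall through to a plain copy.
    shutil.copy2(src, dst)
    return "copy"


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.iso")
dst = os.path.join(workdir, "dest.iso")
with open(src, "wb") as f:
    f.write(b"pretend this is 50 GB")

how = clone_or_copy(src, dst)
with open(dst, "rb") as f:
    copied = f.read()
print(how, copied == b"pretend this is 50 GB")
```

On an APFS volume this should report "clone"; anywhere else it quietly degrades to "copy", and the caller gets the same bytes either way.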
This is why xcode-select, git-clone-of-mirror, and Docker-on-Mac storage drivers all reach for clonefile(2) first. It's the closest thing macOS has to a free lunch.
The Sealed System Volume
On macOS 11 and later, the system volume isn't just read-only. It's cryptographically sealed:
- Every file's content is hashed.
- Those hashes roll up through a Merkle tree.
- The root of that tree is signed by Apple.
- At boot, the bootloader verifies the seal before mounting.
If a single byte under /System changes, the seal breaks and the system won't boot the modified volume. This is the foundation of System Integrity Protection's modern form — SIP started as a runtime restriction; the SSV makes it a load-time cryptographic guarantee.
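The hash roll-up is easy to sketch. The Python below is a generic Merkle tree over file contents, not the real SSV format: the leaf encoding, the pairing rule for odd levels, and the file names are all assumptions made for illustration, and the signature check is reduced to comparing against a stored root.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Fold a list of leaf hashes pairwise up to a single root hash."""
    if not leaves:
        return sha256(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def seal(files):
    """Hash each (path, content) pair in sorted path order, then roll up."""
    leaves = [sha256(path.encode() + b"\0" + content)
              for path, content in sorted(files.items())]
    return merkle_root(leaves)


# Hypothetical file names and contents, purely for the demo.
system = {
    "/System/Library/a": b"kernel bits",
    "/System/Library/b": b"framework bits",
    "/System/Library/c": b"more bits",
}
trusted_root = seal(system)  # stands in for the Apple-signed root

tampered = dict(system)
tampered["/System/Library/b"] = b"framework bitz"  # one byte changed
print(seal(system) == trusted_root, seal(tampered) == trusted_root)  # True False
```

One changed byte in one file changes its leaf hash, which changes every hash on the path up to the root, so the comparison at the top fails.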
The trick that makes this usable is the firmlink: an APFS-specific, symlink-like mapping that joins the sealed read-only volume and the writable Data volume into one logical filesystem at boot. So you see one /, but it's actually two volumes:
/ → Macintosh HD (sealed, read-only)
/Users/.../Documents → Macintosh HD - Data (writable, via firmlink)
This is also why "make my Mac writable" via remounting / doesn't work the way it used to — the seal is independent of the mount option, and breaking it disables the boot.
Encryption
APFS containers can encrypt every block on disk. On macOS this is FileVault; on iOS it's the default. Key management uses the Secure Enclave on modern hardware, so the kernel never sees the raw key — it asks the SEP to decrypt I/O at the controller level.
This means block-level performance of FileVault-on-Apple-Silicon is roughly identical to unencrypted, because the AES happens in the SSD controller, not the CPU. Older Intel Macs without a T2 chip did the AES on the CPU and paid a measurable cost. Apple Silicon does not.
Things that surprise people
- Hard links work for files, but not for directories. HFS+ was the one filesystem in common use with directory hard links, and Time Machine's legacy backups depended on them; APFS dropped them, and Time Machine on APFS uses snapshots instead.
- The volume layout is dynamic. diskutil apfs addVolume creates a new volume in the same container in seconds, sharing free space, with optional quotas.
- There is no fsck for APFS in the traditional sense. Recovery uses fsck_apfs, but the format's metadata is self-checksummed and most "corruption" you'd hit on HFS+ is structurally impossible.
- APFS keeps two metadata copies. Each write updates a fresh copy; on failure mid-write, the previous valid one is still on disk. This is why APFS is power-loss-safe in a way HFS+ never quite was.
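The last point above is worth a sketch. The Python below models two checkpoint slots, each stamped with a transaction id and a checksum, and a recovery step that picks the newest slot that still verifies. The slot layout, the field names, and the use of CRC32 are assumptions for illustration; the real format has its own structures and checksum.

```python
import zlib


def make_checkpoint(xid, payload):
    """A checkpoint is (transaction id, payload, checksum over both)."""
    body = xid.to_bytes(8, "little") + payload
    return {"xid": xid, "payload": payload, "checksum": zlib.crc32(body)}


def is_valid(cp):
    """A slot is usable only if it exists and its checksum still matches."""
    if cp is None:
        return False
    body = cp["xid"].to_bytes(8, "little") + cp["payload"]
    return zlib.crc32(body) == cp["checksum"]


def recover(slots):
    """Pick the valid checkpoint with the highest transaction id."""
    valid = [cp for cp in slots if is_valid(cp)]
    if not valid:
        raise RuntimeError("no valid checkpoint found")
    return max(valid, key=lambda cp: cp["xid"])


slots = [None, None]
slots[0] = make_checkpoint(1, b"tree root at block 10")
slots[1] = make_checkpoint(2, b"tree root at block 42")

# Transaction 3 reuses the older slot. Power fails midway, so the payload
# on disk is cut short and no longer matches the recorded checksum.
torn = make_checkpoint(3, b"tree root at block 77")
torn["payload"] = torn["payload"][:9]
slots[0] = torn

survivor = recover(slots)
print(survivor["xid"], survivor["payload"])
```

The torn write to slot 0 fails verification, so recovery lands on transaction 2, which was never touched while transaction 3 was being written.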
What to read next
For the format itself, the APFS Reference PDF is exhaustive and surprisingly readable.
For the kernel side, watch how an open() call lands on a VFS operation:
- bsd/vfs/vfs_subr.c (apple-oss-distributions/xnu): VFS plumbing, meaning vnode reclaim, mountpoint management, the bits that aren't filesystem-specific.
- bsd/vfs/vfs_cache.c (apple-oss-distributions/xnu): the name cache (namei), and why most path lookups don't hit disk.
The closest you can get to "reading APFS" in open source is reading the protocol APFS uses to talk to the kernel — vnop callbacks, mount options, fcntl(F_FULLFSYNC) and friends. Apple's APFS module implements the receiving side.
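You can poke at one piece of that protocol from userspace. Here is a small Python sketch of a durable write that uses fcntl(F_FULLFSYNC) where the platform exposes it and falls back to fsync otherwise. The durable_write helper and the fallback policy are my own assumptions, not a recommended pattern from Apple.

```python
import fcntl
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it all the way to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # drain Python's userspace buffer into the kernel
        if hasattr(fcntl, "F_FULLFSYNC"):
            try:
                # macOS: also ask the drive to flush its own write cache.
                fcntl.fcntl(f.fileno(), fcntl.F_FULLFSYNC)
                return "F_FULLFSYNC"
            except OSError:
                pass  # filesystem doesn't support it; use fsync below
        os.fsync(f.fileno())
        return "fsync"


path = os.path.join(tempfile.mkdtemp(), "journal.bin")
method = durable_write(path, b"committed")
with open(path, "rb") as f:
    contents = f.read()
print(method, contents)
```

On macOS this should take the F_FULLFSYNC branch; on other systems the constant doesn't exist and the function uses plain fsync.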