
Inside the APFS object map: how the filesystem b-tree works

Every object on an APFS volume is located through one b-tree: the object map. Here's how it's laid out, how it survives a write, and why APFS metadata is self-checksumming by design.

Published 6 min read
[Figure: APFS object map b-tree structure. The container superblock (latest checkpoint root) points at the container object map, which locates per-volume object maps. Each volume object map is a copy-on-write b-tree keyed by (OID, XID) → physical address. The example shows a root node at XID 142, four internal nodes covering OIDs 0–999, 1000–1999, 2000–2999 and 3000–3999, and these leaf records:]

  • OID 1042 / XID 142 → blk 0x4c2
  • OID 1042 / XID 95 → blk 0x3a1 (snap)
  • OID 1043 / XID 142 → blk 0x4f8
  • OID 1044 / XID 142 → blk 0x510
  • OID 1045 / XID 80 → blk 0x1e2 (snap)

Reading this diagram:

  • Lookup an object: walk root → internal (by OID range) → leaf (exact key). Older XIDs are retained because a snapshot pins them.
  • Every write allocates fresh nodes from root to leaf (copy-on-write). Old root stays alive until the next checkpoint commits and frees it.
  • Same shape for the free-space tree, the extent-reference tree, and the snapshot tree: APFS is b-trees all the way down.

APFS doesn't keep inodes in fixed disk regions the way traditional Unix filesystems do, and it doesn't update metadata in place the way HFS+ did. It has one b-tree per volume, the object map, that's the single source of truth for "where does object N live on disk?" Every file, every directory, every snapshot's view of the volume goes through this b-tree.

This article is about how it's structured and what the kernel does on every write to keep it consistent.

What's an "object" in APFS

In APFS terminology, an object is anything addressable by an object ID: file extents, inodes, directory records, attribute records, snapshot metadata. Each object has:

  • An OID (Object Identifier) — 64-bit, unique within the container.
  • An XID (Transaction ID) — monotonically increasing, identifies which checkpoint last touched the object.
  • A type — what kind of object (file, dir, etc.).
  • The actual data.

The object map's job: given an (OID, XID) pair, return the physical disk address where that version of the object lives.

The b-tree structure

The object map is a copy-on-write b-tree. Each node is one block (4 KB or 16 KB depending on container settings) and contains key-value pairs:

  • Key — (OID, XID).
  • Value — physical block address + flags.

Leaf nodes contain real (OID, XID) → physical-address records. Internal nodes contain (OID, XID) → child-node-address records that route lookups.

A lookup starts at the root, walks down — at each level, find the largest key ≤ target and follow its child pointer — until reaching a leaf. Standard b-tree operation; nothing exotic.
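The "largest key ≤ target" rule is easy to get subtly wrong, so here is a minimal sketch of it over a single sorted run of records. It uses the records from the diagram above; `RECORDS`, `KEYS` and `omap_lookup` are hypothetical names for illustration, not anything from the APFS sources, and a real lookup would repeat the same comparison at every level of the tree.

```python
from bisect import bisect_right

# Hypothetical in-memory stand-in for the leaf records in the diagram:
# (oid, xid) -> physical block address, kept sorted by key.
RECORDS = sorted({
    (1042, 95): 0x3A1,
    (1042, 142): 0x4C2,
    (1043, 142): 0x4F8,
    (1044, 142): 0x510,
    (1045, 80): 0x1E2,
}.items())
KEYS = [key for key, _ in RECORDS]


def omap_lookup(oid, xid):
    """Return the address of the newest version of `oid` with XID <= `xid`."""
    i = bisect_right(KEYS, (oid, xid)) - 1   # largest key <= (oid, xid)
    if i < 0 or KEYS[i][0] != oid:
        return None                          # no version of this object that old
    return RECORDS[i][1]


live = omap_lookup(1042, 142)       # the live volume's view
snapshot = omap_lookup(1042, 100)   # a view pinned at XID 100
```

Note the OID check after the search: "largest key ≤ target" can land on a neighboring object when the requested object has no version that old, and that must count as a miss.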

What makes it APFS-flavored:

  • Copy-on-write: any update walks from root to leaf, allocating fresh blocks for every node on the path. The old tree stays intact until garbage collected.
  • Multi-version: the (OID, XID) keying means the b-tree can hold several versions of the same object side by side. Snapshots are just retained XIDs.
  • Self-checksumming: every node carries a checksum (Fletcher-64) over its own contents. Mismatch = read failure; APFS surfaces this rather than silently returning corrupt metadata. The checksum covers metadata objects, not the contents of your files.
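To make the self-checksumming point concrete, here is a rough sketch of verify-on-read. APFS's object checksum is a Fletcher-64 variant; the code below is a plain textbook Fletcher-64 with an assumed layout (8-byte checksum followed by payload), so treat it as an illustration of the idea rather than the on-disk algorithm. `seal` and `read_node` are hypothetical helpers.

```python
import errno
import struct

MOD = 0xFFFFFFFF  # 2**32 - 1, the modulus used by Fletcher-64


def fletcher64(data: bytes) -> int:
    """Textbook Fletcher-64 over little-endian 32-bit words.

    `data` must be a multiple of 4 bytes long.
    """
    sum1 = sum2 = 0
    for (word,) in struct.iter_unpack("<I", data):
        sum1 = (sum1 + word) % MOD
        sum2 = (sum2 + sum1) % MOD
    return (sum2 << 32) | sum1


def seal(payload: bytes) -> bytes:
    """Prefix a payload with its checksum (assumed layout, for illustration)."""
    return struct.pack("<Q", fletcher64(payload)) + payload


def read_node(block: bytes) -> bytes:
    """Return the payload, or fail loudly if the stored checksum does not match."""
    (stored,) = struct.unpack_from("<Q", block)
    payload = block[8:]
    if fletcher64(payload) != stored:
        raise OSError(errno.EIO, "node checksum mismatch")
    return payload
```

The design point is that the reader never trusts a block it has not verified: a flipped bit turns into an I/O error at the point of the read instead of a wrong child pointer followed deeper into the tree.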

A write, walked

Suppose userspace writes to file F, changing one block of its contents. The kernel does the following:

  1. Allocate a fresh data block. Ask the container's space manager (a separate structure tracking free space) for a free block.
  2. Write the new data to the new block.
  3. Create a new extent record for the file, pointing at the new block. The old extent records (still pointing at the old block) stay alive for now.
  4. Update F's inode to reference the new extent record. This means allocating a fresh block for the updated inode, since inodes themselves are COW.
  5. Update the object map to reflect the new inode's location. This means walking the b-tree from root to leaf, allocating fresh nodes along the path.
  6. At checkpoint time, write a new checkpoint descriptor pointing at the new b-tree root. Now the new version is reachable; the old root is no longer the live one.
apple-oss-distributions/xnu, bsd/vfs/vfs_subr.c: VFS plumbing every filesystem update goes through.

The whole sequence is atomic: until the checkpoint commits, the new version isn't reachable. If power is lost mid-write, the old (consistent) version is still there. APFS never has a partial-write window where the filesystem is in an inconsistent state.

This is why APFS is crash-consistent without a journal. Traditional filesystems use a journal to record intended changes, then replay on recovery; APFS uses COW to keep the old version intact until the new one is fully written.
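The write sequence above boils down to path copying plus one pointer flip. Here is a toy model of that, assuming a `disk` dict that stands in for blocks and a counter that stands in for the allocator; every name here is hypothetical, and nested dicts stand in for real b-tree nodes.

```python
import itertools

disk = {}                    # block address -> node contents (never overwritten)
fresh = itertools.count(1)   # hypothetical allocator: hands out unused addresses


def write_block(node):
    """Store a node in a fresh block and return its address."""
    addr = next(fresh)
    disk[addr] = node
    return addr


def cow_update(root_addr, path, value):
    """Return the address of a new root in which `path` maps to `value`.

    Every node along `path` is copied to a fresh block; subtrees that
    are not on the path are shared with the old tree by address.
    """
    node = dict(disk[root_addr]) if root_addr is not None else {}
    head, rest = path[0], path[1:]
    node[head] = cow_update(node.get(head), rest, value) if rest else value
    return write_block(node)


def read(root_addr, path):
    """Walk from a root down to the value stored at `path`."""
    node = disk[root_addr]
    for part in path[:-1]:
        node = disk[node[part]]
    return node[path[-1]]


live_root = cow_update(None, ("dir", "F"), "old data")
new_root = cow_update(live_root, ("dir", "F"), "new data")

# A crash at this point loses nothing: the live root still sees the old tree.
before_commit = read(live_root, ("dir", "F"))

live_root = new_root         # the commit is this single pointer update
after_commit = read(live_root, ("dir", "F"))
```

Nothing in `disk` is ever modified in place, which is the whole trick: the only mutable thing is which root counts as live.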

Checkpoints — the global commit point

A checkpoint is APFS's transaction commit. Every few seconds (or under sync pressure), the kernel:

  1. Collects every dirty b-tree node, every dirty extent.
  2. Writes them all to fresh blocks.
  3. Writes a new checkpoint descriptor block listing the new b-tree roots, the highest XID assigned, free-space accounting.
  4. Writes a final superblock-equivalent (the container superblock) pointing at the new checkpoint descriptor.

The superblock write is the atomic commit point — until it lands, the previous checkpoint is still the live one.

APFS keeps the previous checkpoint's blocks around for a short window — enough to recover from a single bad write. The space is freed back into the pool once the new checkpoint is durably committed.
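Keeping the previous checkpoint around only helps if mount-time code knows how to fall back to it. A minimal sketch of that choice, assuming each superblock copy carries its XID and that a torn or corrupt copy fails validation; the `Checkpoint` fields and `pick_live_checkpoint` are made up for illustration:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    xid: int          # transaction ID recorded in this superblock copy
    root_addr: int    # hypothetical: where this checkpoint's tree roots live
    valid: bool       # stand-in for "checksum verified on read"


def pick_live_checkpoint(copies: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Choose the newest superblock copy that passes validation."""
    good = [c for c in copies if c.valid]
    return max(good, key=lambda c: c.xid, default=None)


# The write of checkpoint 142 was interrupted, so 141 is still the live one.
after_crash = pick_live_checkpoint([
    Checkpoint(xid=141, root_addr=0x3F0, valid=True),
    Checkpoint(xid=142, root_addr=0x4C0, valid=False),
])
```

This is why the final superblock write can serve as the commit point: a half-written superblock simply fails validation and the previous one wins.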

Multiple b-trees per container

The object map isn't the only b-tree. An APFS container has:

  • The container object map — locates per-volume root structures.
  • The per-volume object map — locates inodes, file metadata, extents.
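  • The space manager's structures — allocation bitmaps that track which blocks are allocated vs free, plus b-tree free queues for blocks waiting to be released. These are container-wide rather than per-volume.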
  • The per-volume extent-reference tree — tracks how many references point at each block (for clone refcounting).
  • The per-volume snapshot tree — lists every snapshot and its retained XID.

All COW, all checkpoint-committed atomically. The kernel never has a window where one tree is consistent and another isn't.

Snapshots are just XIDs

When you run tmutil localsnapshot, APFS:

  1. Records the current XID as a snapshot.
  2. Treats every block referenced by the current trees as pinned by snapshot N. Nothing is marked block by block; retaining the XID is what does the pinning.
  3. Future writes go to fresh blocks (COW); the old blocks stay alive because snapshot N still references them through the retained XID.

Lookup at a snapshot's XID returns the version of the b-tree as of that XID. The snapshot is a virtual view; no actual data is duplicated.

When the snapshot is deleted, the pin is removed, and any blocks that were reachable only through the snapshot become eligible for garbage collection.

This is why APFS snapshots are O(1) at creation and consume disk space only as the live volume diverges from the snapshot.
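The pinning rule can be stated as a small function: for one object, which versions have to stay on disk given the set of snapshot XIDs? This is a sketch of the logic as described here, not APFS's actual reclamation code, and `retained_versions` is a hypothetical name.

```python
from bisect import bisect_right


def retained_versions(version_xids, snapshot_xids):
    """Return the XIDs of one object's versions that must stay on disk.

    A version is kept if it is the newest one (the live volume sees it)
    or if it is the version some snapshot would see: the largest
    version XID <= that snapshot's XID.
    """
    versions = sorted(version_xids)
    keep = {versions[-1]} if versions else set()
    for snap in snapshot_xids:
        i = bisect_right(versions, snap) - 1
        if i >= 0:
            keep.add(versions[i])
    return sorted(keep)


# One object rewritten at XIDs 80, 95, 120 and 142, with a snapshot at XID 100.
with_snapshot = retained_versions([80, 95, 120, 142], [100])
without_snapshot = retained_versions([80, 95, 120, 142], [])
```

With the snapshot in place, versions 95 and 142 survive; delete the snapshot and only 142 does. Creating the snapshot cost one recorded XID, and the space it holds is exactly the set of old versions it still sees.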

Why the b-tree is everywhere

The "one big b-tree per type of metadata" design has consequences:

  • Random reads scale logarithmically with volume size. Looking up an inode in an APFS volume with millions of files is ~3-4 b-tree node reads, often all cached.
  • Sequential metadata operations (listing a directory with 100K files) walk a contiguous key range, so the scan reads neighboring leaf nodes in order instead of doing a fresh root-to-leaf lookup per entry.
  • Free space is the partial exception. The space manager tracks allocation with per-chunk bitmaps and uses b-trees for its queues of blocks waiting to be freed.
  • Self-checksumming b-trees detect storage bit-rot. APFS surfaces an EIO-style read error rather than silently returning corrupt data.
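The "~3-4 node reads" figure is back-of-envelope arithmetic, and it's worth seeing where it comes from. The sketch below assumes 32-byte entries (a 16-byte (OID, XID) key plus a 16-byte value) and a 96-byte node header; both sizes are assumptions chosen for illustration, and real nodes are rarely packed full.

```python
def tree_depth(records: int, node_bytes: int = 4096,
               entry_bytes: int = 32, header_bytes: int = 96) -> int:
    """Estimate node reads per lookup: levels needed to index `records` keys."""
    fanout = (node_bytes - header_bytes) // entry_bytes
    depth = 1
    capacity = fanout
    while capacity < records:
        capacity *= fanout
        depth += 1
    return depth


one_million = tree_depth(1_000_000)    # 4 KB nodes, fanout 125
ten_million = tree_depth(10_000_000)
big_nodes = tree_depth(10_000_000, node_bytes=16384)
```

Under these assumptions a 4 KB node holds about 125 entries, so three levels cover roughly two million records and four levels cover a couple of hundred million. Larger nodes flatten the tree further.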

The on-disk format itself is documented exhaustively in Apple's APFS Reference PDF — 100+ pages of struct definitions and lookup algorithms.

For the kernel side that's open source:

apple-oss-distributions/xnu, bsd/vfs/vfs_cache.c: The namei name cache. Most path lookups don't even hit the b-tree because they're cached at the VFS layer.

apple-oss-distributions/xnu, bsd/sys/vnode_if.h: The vnode operations table APFS implements.

And then there's the APFS overview: once you've seen the b-tree, the container/volume layout and the COW semantics fall out as consequences of the b-tree's design.

Related

clonefile, fclonefileat, fs_snapshot — three syscalls that let you copy 50 GB in 50 milliseconds. Here's what happens under each one, and what doesn't get copied.
Apple File System, the format under every modern Mac: how it lays out blocks, how it gets snapshots almost for free, and why your /System is read-only at the cryptographic level.
How Time Machine uses APFS snapshots for local backups, the per-hour/per-day/per-week retention policy, and what rollback actually does to your filesystem.