
APFS clones and snapshots: the kernel calls that make them work

clonefile, fclonefileat, fs_snapshot — three syscalls that let you copy 50 GB in 50 milliseconds. Here's what happens under each one, and what doesn't get copied.

Published 6 min read
[Figure: APFS clonefile(2), zero-copy file copy. clonefile(2) creates a new inode whose extent map points at the same physical blocks as the source. No data is copied; reference counts are bumped on each shared block. Before: a single source inode with extents [12, 13, 14], each block at refs: 1. After clonefile(src, dst): the source inode and the clone inode both list extents [12, 13, 14], and each block is at refs: 2. On write: a write to the clone's block at index 1 makes the kernel allocate fresh block 99, the clone's extents become [12, 99, 14], and block 13's ref count drops to 1. Untouched blocks remain shared; the clone diverges from the source one block at a time, only when actually modified.]

The headline feature people remember about APFS is that copying a 50 GB file is instant. The reason is cloning: APFS doesn't copy the file's blocks, it copies its extent map. The blocks become co-owned by both files; the kernel allocates fresh blocks only when one side writes (copy-on-write).
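The sharing described above can be sketched as a toy model in portable C. This is not APFS code: the names (toy_file, toy_clone, toy_write), the flat refs[] array, and the block numbers 12, 13, 14 and 99 are all made up for illustration, and real APFS keeps this bookkeeping in on-disk B-trees.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS  128   /* size of the toy "disk" */
#define MAX_EXTENTS 8     /* extents per toy file */

/* Toy model only: a flat reference count per block. */
static int refs[MAX_BLOCKS];
static int next_free = 99;          /* next block the toy allocator hands out */

typedef struct {
    int extents[MAX_EXTENTS];       /* block numbers, in file order */
    size_t count;
} toy_file;

/* Create a file that owns the given blocks outright (refs = 1 each). */
static toy_file toy_create(const int *blocks, size_t n) {
    toy_file f = { {0}, 0 };
    assert(n <= MAX_EXTENTS);
    for (size_t i = 0; i < n; i++) {
        assert(blocks[i] >= 0 && blocks[i] < MAX_BLOCKS);
        f.extents[i] = blocks[i];
        refs[blocks[i]] = 1;
    }
    f.count = n;
    return f;
}

/* Clone: copy the extent list and bump refcounts. No block data is touched. */
static toy_file toy_clone(const toy_file *src) {
    toy_file dst = *src;
    for (size_t i = 0; i < dst.count; i++)
        refs[dst.extents[i]]++;
    return dst;
}

/* Write to extent i: if the block is shared, move this file to a fresh
 * block first (copy-on-write), so the other owner keeps the old data. */
static void toy_write(toy_file *f, size_t i) {
    assert(i < f->count);
    int old = f->extents[i];
    if (refs[old] > 1) {
        assert(next_free < MAX_BLOCKS);
        refs[old]--;
        f->extents[i] = next_free++;
        refs[f->extents[i]] = 1;
    }
    /* The actual data write would now go to block f->extents[i]. */
}
```

Cloning a three-extent file and then calling toy_write on index 1 of the clone leaves the source's extent list alone and gives the clone a private block for that one extent; the other two stay shared.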

This article is about the three syscalls that expose this — clonefile(2), fclonefileat(2), and fs_snapshot_create(2) — what each one actually does, and where in the kernel the work happens.

clonefile(2): one file, two extent maps, zero data copied

Userspace API:

#include <sys/clonefile.h>
int clonefile(const char *src, const char *dst, uint32_t flags);

Pass a source path and a destination path. APFS does this:

  1. Resolve the source vnode.
  2. Create a new vnode for the destination (backed by a fresh inode record in the APFS volume's file-system tree).
  3. Walk the source's extent list. For each extent record, write the same record into the destination, marking the underlying blocks as shared.
  4. Increment the per-block reference count on each shared block in APFS's block-allocation b-tree.
  5. Copy the source's xattrs, ACLs, and HFS+-compat flags into the destination.

Source: apple-oss-distributions/xnu, bsd/sys/clonefile.h. clonefile(2), the syscall the userspace API turns into.
Source: apple-oss-distributions/xnu, bsd/vfs/vfs_syscalls.c. Where the syscall lands; from here it dispatches into VFS, then APFS.

What does NOT happen: zero bytes of file content are read or written. The destination occupies essentially the same space on disk as the source until one of them is modified. du counts the shared blocks once for each file, so its totals overstate what is actually used; df tells the truth (the volume's free space barely changes).

Caveat: cloning works within an APFS volume, not across volumes (even within the same container). clonefile itself fails with EXDEV in that case; tools such as cp -c then fall back to a normal copy. The reason is the per-volume object map: block references can't cross volume boundaries.
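A caller that wants "clone if possible, copy otherwise" has to handle the failure itself. The sketch below is one way to do that. The helper names clone_or_copy and copy_bytes are hypothetical, and the #ifdef __APPLE__ guard is an addition so the same file also builds on systems without clonefile(2), where it always takes the copy path. Treating EXDEV and ENOTSUP as the "fall back" cases is an assumption based on the clonefile(2) man page.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Plain read/write copy, used when cloning is unavailable.
 * Fails if dst already exists, matching clonefile's behaviour. */
static int copy_bytes(const char *src, const char *dst) {
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (out < 0) { int e = errno; close(in); errno = e; return -1; }

    char buf[65536];
    ssize_t n = 0;
    int rc = 0;
    while (rc == 0 && (n = read(in, buf, sizeof buf)) > 0) {
        for (ssize_t off = 0; off < n; ) {      /* handle short writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) { rc = -1; break; }
            off += w;
        }
    }
    if (n < 0) rc = -1;                          /* read error */

    int saved = errno;
    close(in);
    close(out);
    errno = saved;
    return rc;
}

/* Returns 1 if dst was cloned, 0 if it was copied byte for byte,
 * -1 on error (errno set). */
static int clone_or_copy(const char *src, const char *dst) {
    assert(src != NULL && dst != NULL);
#ifdef __APPLE__
    if (clonefile(src, dst, 0) == 0) return 1;
    /* EXDEV: src and dst are on different volumes.
     * ENOTSUP: the filesystem cannot clone. Anything else is a real error. */
    if (errno != EXDEV && errno != ENOTSUP) return -1;
#endif
    return copy_bytes(src, dst);
}
```

The return value tells the caller which path was taken, which is handy when logging why a supposedly instant copy took minutes.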

fclonefileat(2): the same thing, with an fd-relative path

int fclonefileat(int src_fd, int dst_dir_fd, const char *dst, uint32_t flags);

Same semantics, but the source is a file descriptor instead of a path, and the destination is resolved relative to a directory fd. This is the recommended form for anything that needs to be race-free against rename — same reason openat is preferred over open for security-sensitive code.
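A minimal sketch of the fd-based form, assuming macOS for the real call. The helper name clone_into_dir is hypothetical, and the non-Apple branch is a stub added only so the file compiles elsewhere; it reports ENOTSUP instead of cloning.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Clone src_path into the directory dst_dir under the name dst_name.
 * Both the source and the destination directory are pinned by open file
 * descriptors before the clone, so a concurrent rename of either path
 * cannot redirect the operation.
 * Returns 0 on success, -1 with errno set on failure. */
static int clone_into_dir(const char *src_path, const char *dst_dir,
                          const char *dst_name) {
    assert(src_path != NULL && dst_dir != NULL && dst_name != NULL);

    int src_fd = open(src_path, O_RDONLY);
    if (src_fd < 0) return -1;

    int dir_fd = open(dst_dir, O_RDONLY);
    if (dir_fd < 0) {
        int e = errno;
        close(src_fd);
        errno = e;
        return -1;
    }

    int rc;
#ifdef __APPLE__
    rc = fclonefileat(src_fd, dir_fd, dst_name, 0);
#else
    /* Stub for platforms without fclonefileat(2). */
    (void)dst_name;
    errno = ENOTSUP;
    rc = -1;
#endif

    int saved = errno;
    close(dir_fd);
    close(src_fd);
    if (rc < 0) errno = saved;
    return rc;
}
```

A call such as clone_into_dir("big.img", "/Volumes/Data/copies", "big-copy.img") (paths made up) either produces the clone or leaves nothing behind, and the caller inspects errno to decide whether to fall back to a byte copy.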

Tools that clone files this way: cp -c, ditto --clone (ditto's own -c flag creates an archive instead), the Finder when copying within a volume, xcopy from Xcode's build system, Docker Desktop's APFS storage driver.

fs_snapshot_create: pin every file at once

A snapshot is a read-only view of an entire APFS volume at a specific moment. Conceptually it's the same trick as a clone, applied to the whole volume:

  1. Allocate a new snapshot object in the volume's snapshot tree.
  2. Record the current root of the volume's filesystem tree (a b-tree node pointer).
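  3. From that point on, treat every block reachable from that root as pinned: when the live volume later overwrites or deletes one, the old block is kept for snapshot N instead of being garbage-collected. Nothing is marked block by block at creation time.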
  4. Done.

Source: apple-oss-distributions/xnu, bsd/sys/snapshot.h. fs_snapshot_* syscalls: create / delete / mount / rename / revert / list.
Source: apple-oss-distributions/xnu, bsd/vfs/vfs_syscalls.c. The syscall plumbing for snapshots; routes into the filesystem's vfs_snapshotop hook.

Same as clones, no data is copied. A snapshot is just a name + a tree root + reference counts on the blocks the root reaches. Creating one takes a few milliseconds regardless of the volume's size.
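The "name + tree root" idea can be sketched as a toy model in portable C. Everything here is hypothetical (toy_root, toy_volume, toy_snapshot and their functions), and the tree is collapsed to a single node so that copying the root stands in for copying the path from the root down to a modified leaf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 4

/* Toy tree root: one node holding the "contents" of SLOTS files. */
typedef struct {
    int refs;            /* owners: the live volume plus any snapshots */
    int data[SLOTS];
} toy_root;

typedef struct { toy_root *root; } toy_volume;
typedef struct { const char *name; toy_root *root; } toy_snapshot;

static toy_volume toy_volume_new(void) {
    toy_root *r = calloc(1, sizeof *r);
    assert(r != NULL);
    r->refs = 1;
    return (toy_volume){ r };
}

/* Snapshot: remember the current root and take a reference.
 * Constant time, no matter how much data hangs off the root. */
static toy_snapshot toy_snapshot_create(toy_volume *v, const char *name) {
    v->root->refs++;
    return (toy_snapshot){ name, v->root };
}

/* Write: if a snapshot still references the root, copy it first so the
 * snapshot keeps seeing the old contents. */
static void toy_volume_write(toy_volume *v, int slot, int value) {
    assert(slot >= 0 && slot < SLOTS);
    if (v->root->refs > 1) {
        toy_root *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        memcpy(copy, v->root, sizeof *copy);
        copy->refs = 1;
        v->root->refs--;
        v->root = copy;
    }
    v->root->data[slot] = value;
}

/* Deleting a snapshot drops its reference; the old root is freed once
 * nothing owns it. */
static void toy_snapshot_delete(toy_snapshot *s) {
    if (--s->root->refs == 0) free(s->root);
    s->root = NULL;
}
```

Taking a snapshot and then writing to the volume leaves two roots: the snapshot's, still holding the old value, and the live volume's, holding the new one. Deleting the snapshot is what finally releases the old root.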

To use a snapshot from userspace:

  • tmutil localsnapshot creates one named after the current time. Time Machine schedules these hourly.
  • tmutil listlocalsnapshots / lists the snapshots on the volume mounted at /.
  • mount_apfs -s <snapshot-name> <volume> /Volumes/snap mounts a snapshot read-only at a path you choose, where <volume> is the device or mount point of the volume that holds the snapshot. Forensic tooling and rsync-style differential backups use this.
  • tmutil deletelocalsnapshots <date> removes one. The blocks it pinned become eligible for garbage collection.

The kernel doesn't reclaim freed blocks eagerly: APFS marks them as candidates and the volume's block allocator picks them up on the next write that needs space. So a deleted snapshot might not show up as freed space immediately; when the volume is low on space, the reclaim runs sooner.

What gets shared, what doesn't

When you clonefile a file, the data extents are shared. What's not shared:

  • Metadata — inode, mode, owner, timestamps. Each clone has its own.
  • xattrs — Apple copies the source's xattrs at clone time. Each side can change them independently.
  • resource forks — copied separately, may or may not also be cloned depending on size.

When you take a snapshot, the entire volume tree root is shared. Modifications after the snapshot go to new blocks; the old blocks stay alive as long as the snapshot pins them.

HFS+ also had a "magic" copy via Time Machine — directory hardlinks (the only filesystem in common use that allowed them). But:

  • HFS+ directory hardlinks were a per-file metadata trick, with no actual copy-on-write at the block level. Editing a file via either link mutated the same data.
  • APFS clones DO copy on write — the moment you write() to one of the two files, the touched blocks become non-shared. Both files see the version they think they have.

That's the difference between aliasing and cloning. HFS+ aliased; APFS clones. The cloning model is what makes cp -c safe to use anywhere cp would have been.
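The aliasing half of that contrast can be observed with an ordinary file hard link, which any POSIX system supports. This is only a stand-in: HFS+ directory hard links cannot normally be created from userland, and the helper names (write_text, read_text, hardlink_aliases) are hypothetical. The sketch writes through one name and reads through the other.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Overwrite path with text. O_TRUNC rewrites the existing inode in place
 * rather than creating a new file. Returns 0 on success. */
static int write_text(const char *path, const char *text) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Read up to cap-1 bytes of path into buf and NUL-terminate it. */
static int read_text(const char *path, char *buf, size_t cap) {
    assert(cap > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n < 0) return -1;
    buf[n] = '\0';
    return 0;
}

/* Returns 1 if a write made through the hard link is visible through the
 * original name, 0 if it is not, -1 on error. */
static int hardlink_aliases(const char *orig, const char *alias) {
    char buf[32];
    unlink(orig);
    unlink(alias);
    if (write_text(orig, "v1") != 0) return -1;
    if (link(orig, alias) != 0) return -1;        /* second name, same inode */
    if (write_text(alias, "v2") != 0) return -1;  /* edit via the link */
    if (read_text(orig, buf, sizeof buf) != 0) return -1;
    return strcmp(buf, "v2") == 0;
}
```

With a hard link the original name sees the edit, because both names refer to one inode. Repeating the experiment with a clone in place of link() would leave the original at "v1", since the write lands on blocks that only the clone owns.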

Source: apple-oss-distributions/xnu, bsd/vfs/vfs_subr.c. VFS plumbing the snapshot/clone calls share with every other vnode operation.

A complete clonefile call, end to end

For the curious, here's what happens during a cp -c source dest invocation:

  1. cp parses -c, calls copyfile(3) with the COPYFILE_CLONE flag.
  2. copyfile in libSystem calls clonefile(2).
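  3. The syscall enters the kernel through sysent[] and lands in bsd/vfs/vfs_syscalls.c. (clonefile(2) is a thin libsyscall wrapper over clonefileat with AT_FDCWD for both directory fds, so the kernel-side handler is clonefileat().)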
  4. The kernel resolves both paths to vnodes via namei.
  5. It calls VNOP_CLONEFILE, which dispatches through the source vnode's operations vector.
  6. APFS's clonefile implementation (closed source, in apfs.kext) does the extent-record copy + ref-count update.
  7. The new vnode is returned. From here it's a normal vnode for both reads and writes.

The userspace round-trip is one syscall. The on-disk work is a few b-tree updates and a transaction commit. The 50 GB never moves.

What surprises newcomers

  • cp -c is not the same as cp. Without -c, cp does a byte-for-byte read/write copy even on APFS. cp is a standalone binary (/bin/cp), not a shell builtin, and it doesn't try to be clever.
  • mv within a volume is faster than cp -c because it doesn't even touch the data extents — it's a rename, which is one b-tree update.
  • Snapshots survive reboot. They're persisted on disk in the volume's snapshot metadata, not held in memory. Time Machine relies on this: after a fresh boot, the snapshots from before the restart are still there.
  • Snapshots don't free space until deleted. A volume "out of space" with terabytes of supposedly free room often turns out to have weeks of Time Machine local snapshots pinning the blocks.

For the VFS plumbing every file syscall goes through:

Source: apple-oss-distributions/xnu, bsd/sys/vnode.h. vnode_t, the in-core handle every clone/snapshot operation acts on.
Source: apple-oss-distributions/xnu, bsd/sys/vnode_if.h. VNOP_*, the table of operations APFS implements.

And re-read the APFS overview article — once you've seen how COW works under writes, snapshots are just COW applied to the whole tree root.
