APFS clones and snapshots: the kernel calls that make them work

The headline feature people remember about APFS is that copying a 50 GB file is instant. The reason is cloning: APFS doesn't copy the file's blocks, it copies its extent map. The blocks become co-owned by both files; the kernel allocates fresh blocks only when one side writes (copy-on-write).

This article is about the three syscalls that expose this — clonefile(2), fclonefileat(2), and fs_snapshot_create(2) — what each one actually does, and where in the kernel the work happens.

clonefile(2): one file, two extent maps, zero data copied

Userspace API:

#include <sys/clonefile.h>
int clonefile(const char *src, const char *dst, uint32_t flags);

Pass a source path and a destination path. APFS does this:

Resolve the source vnode.
Create a new vnode for the destination (backed by a fresh inode record in the APFS volume's file-system tree).
Walk the source's extent list. For each extent record, write the same record into the destination, marking the underlying blocks as shared.
Increment the reference count on each shared extent in APFS's extent-reference b-tree.
Copy the source's xattrs, ACLs, and HFS+-compat flags into the destination.
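The steps above can be modelled in a few lines of C. This is a toy sketch, not APFS code: the names (toy_file, toy_clone, toy_write, refcount) are invented for illustration, each extent is a single block, and the "disk" is a small in-memory array of reference counts.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS  64   /* size of the toy "disk" (assumption) */
#define MAX_EXTENTS 8    /* extents per toy file (assumption) */

/* Toy allocator state: one reference count per block (0 = free). */
static int refcount[MAX_BLOCKS];

/* A toy file is just a list of block numbers, one block per extent. */
struct toy_file {
    size_t n_extents;
    int    block[MAX_EXTENTS];
};

/* Find a free block, claim it, and return its number (-1 if full). */
int alloc_block(void)
{
    for (int b = 0; b < MAX_BLOCKS; b++) {
        if (refcount[b] == 0) {
            refcount[b] = 1;
            return b;
        }
    }
    return -1;
}

/* Create a file with n freshly allocated extents. */
int toy_create(struct toy_file *f, size_t n)
{
    assert(n <= MAX_EXTENTS);
    f->n_extents = 0;
    for (size_t i = 0; i < n; i++) {
        int b = alloc_block();
        if (b < 0)
            return -1;
        f->block[f->n_extents++] = b;
    }
    return 0;
}

/* Clone: copy the extent list and bump each block's count. No data moves. */
void toy_clone(const struct toy_file *src, struct toy_file *dst)
{
    dst->n_extents = src->n_extents;
    for (size_t i = 0; i < src->n_extents; i++) {
        dst->block[i] = src->block[i];
        refcount[src->block[i]]++;
    }
}

/* Write to extent i: if the block is shared, move this file to a new block. */
int toy_write(struct toy_file *f, size_t i)
{
    assert(i < f->n_extents);
    int old = f->block[i];
    if (refcount[old] > 1) {
        int fresh = alloc_block();
        if (fresh < 0)
            return -1;
        refcount[old]--;
        f->block[i] = fresh;
    }
    return 0;
}

/* Count blocks in use, the way a volume-wide tool would see them. */
int toy_blocks_in_use(void)
{
    int used = 0;
    for (int b = 0; b < MAX_BLOCKS; b++)
        if (refcount[b] > 0)
            used++;
    return used;
}
```

In this model, creating a 4-extent file uses 4 blocks, toy_clone leaves the count at 4, and a single toy_write to the clone raises it to 5 while the other three extents stay shared.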

apple-oss-distributions/xnu, bsd/sys/clonefile.h: clonefile(2), the syscall the userspace API turns into.
apple-oss-distributions/xnu, bsd/vfs/vfs_syscalls.c: where the syscall lands; from here it dispatches into VFS, then APFS.

What does NOT happen: zero bytes of file content are read or written. The destination shares the source's blocks on disk until one of them is modified. du charges the shared blocks to each file separately, so its totals overstate real usage; df tells the truth (no space was consumed).
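The du/df difference comes from where each tool gets its numbers. As a rough sketch using only POSIX calls (the function names du_style_bytes and df_style_free_bytes are made up for this example), a per-file figure comes from stat(2) and a volume-wide figure from statvfs(3):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

/* du-style: space attributed to one file, from its own block count.
 * st_blocks is in 512-byte units. Returns -1 on error. */
long long du_style_bytes(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}

/* df-style: free space on the filesystem containing path, as available
 * to unprivileged users. Returns -1 on error. */
long long df_style_free_bytes(const char *path)
{
    struct statvfs vfs;
    assert(path != NULL);
    if (statvfs(path, &vfs) != 0)
        return -1;
    return (long long)vfs.f_bavail * (long long)vfs.f_frsize;
}
```

Summing du_style_bytes over a file and its clone counts the shared blocks twice, while df_style_free_bytes should barely move when the clone is created.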

Caveat: cloning works within an APFS volume, not across volumes (even within the same container). clonefile itself fails with EXDEV in that case, and higher-level copy tools fall back to a normal copy. The reason is the per-volume object map: block references can't cross volume boundaries.
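A minimal sketch of calling clonefile and reacting to the cross-volume case might look like this. The wrapper try_clone and the policy helper clone_next_step_for are hypothetical names, the fallback policy is this example's own choice, and on non-Apple platforms the wrapper simply reports ENOTSUP so the file still compiles.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Clone src to dst. Returns 0 on success, -1 with errno set on failure.
 * Outside macOS this sketch always fails with ENOTSUP. */
int try_clone(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef __APPLE__
    return clonefile(src, dst, 0);   /* flags = 0: default behaviour */
#else
    errno = ENOTSUP;
    return -1;
#endif
}

/* What a caller could do after try_clone() fails.
 * Hypothetical policy helper, not part of any Apple API. */
enum clone_next_step { CLONE_GIVE_UP, CLONE_FALL_BACK_TO_COPY };

enum clone_next_step clone_next_step_for(int err)
{
    switch (err) {
    case EXDEV:    /* source and destination are on different volumes */
    case ENOTSUP:  /* the filesystem (or platform) cannot clone */
        return CLONE_FALL_BACK_TO_COPY;
    default:       /* e.g. ENOENT, EACCES: treated as fatal in this sketch */
        return CLONE_GIVE_UP;
    }
}
```

A caller would check `try_clone(src, dst) != 0`, pass errno to clone_next_step_for, and start an ordinary read/write copy only when the answer is CLONE_FALL_BACK_TO_COPY.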

fclonefileat(2): the same thing, with an fd-relative path

int fclonefileat(int src_fd, int dst_dir_fd, const char *dst, uint32_t flags);

Same semantics, but the source is a file descriptor instead of a path, and the destination is resolved relative to a directory fd. This is the recommended form for anything that needs to be race-free against rename — same reason openat is preferred over open for security-sensitive code.
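As a sketch of the fd-based form, the helper below opens the source once and then clones by descriptor. The name clone_into_dir is invented for this example, flags are left at 0, and on non-Apple platforms the clone step is replaced by an ENOTSUP failure so the code still builds.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Clone the file at src_path into the directory open as dst_dirfd,
 * under the name dst_name. Returns 0 on success, -1 with errno set. */
int clone_into_dir(const char *src_path, int dst_dirfd, const char *dst_name)
{
    assert(src_path != NULL && dst_name != NULL);

    /* Open once; from here on the source is identified by the fd,
     * so a later rename of src_path cannot redirect the clone. */
    int src_fd = open(src_path, O_RDONLY);
    if (src_fd < 0)
        return -1;

    int rc;
#ifdef __APPLE__
    rc = fclonefileat(src_fd, dst_dirfd, dst_name, 0);
#else
    (void)dst_dirfd;
    errno = ENOTSUP;   /* no fclonefileat outside macOS in this sketch */
    rc = -1;
#endif

    int saved = errno;  /* close() may overwrite errno */
    close(src_fd);
    errno = saved;
    return rc;
}
```

Passing AT_FDCWD as dst_dirfd resolves dst_name against the current working directory, the same convention openat uses.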

Tools that use it: cp -c, ditto --clone, the Finder when copying within a volume, xcopy from Xcode's build system, Docker Desktop's APFS storage driver.

fs_snapshot_create(2): pin every file at once

A snapshot is a read-only view of an entire APFS volume at a specific moment. Conceptually it's the same trick as a clone, applied to the whole volume:

Allocate a new snapshot object in the volume's snapshot tree.
Record the current root of the volume's filesystem tree (a b-tree node pointer).
Treat every block reachable from that root as pinned: "do not garbage-collect, snapshot N owns it." Nothing is walked block by block; the pin follows from the saved root.
Done.

apple-oss-distributions/xnu, bsd/sys/snapshot.h: fs_snapshot_* syscalls (create / delete / mount / rename / revert / list).
apple-oss-distributions/xnu, bsd/vfs/vfs_syscalls.c: the syscall plumbing for snapshots; routes into the filesystem's vfs_snapshotop hook.

Same as clones, no data is copied. A snapshot is just a name + a tree root + reference counts on the blocks the root reaches. Creating one takes a few milliseconds regardless of the volume's size.
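The "name + tree root + reference counts" idea can be illustrated with a two-level toy tree. Everything here is an assumption made for the sketch: the struct and function names are invented, the tree has a fixed fanout of 4, and snapshot deletion and memory cleanup are left out. It is not how APFS lays out its b-trees.

```c
#include <assert.h>
#include <stdlib.h>

#define FANOUT 4   /* children per root in this toy tree (assumption) */

/* A leaf holds data; refs counts how many roots point at it. */
struct leaf { int refs; int value; };

/* A root holds FANOUT leaf pointers; refs counts the live volume
 * plus every snapshot that pinned this root. */
struct root { int refs; struct leaf *child[FANOUT]; };

struct volume { struct root *live; };

/* Build a volume whose leaves all hold 0. */
struct volume volume_new(void)
{
    struct volume v = { calloc(1, sizeof(struct root)) };
    assert(v.live != NULL);
    v.live->refs = 1;
    for (int i = 0; i < FANOUT; i++) {
        v.live->child[i] = calloc(1, sizeof(struct leaf));
        assert(v.live->child[i] != NULL);
        v.live->child[i]->refs = 1;
    }
    return v;
}

/* Snapshot: remember the current root and pin it. Constant time. */
struct root *volume_snapshot(struct volume *v)
{
    v->live->refs++;
    return v->live;
}

/* Write: copy any shared node on the path before changing it. */
void volume_write(struct volume *v, int i, int value)
{
    assert(i >= 0 && i < FANOUT);
    if (v->live->refs > 1) {              /* root pinned by a snapshot */
        struct root *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        *copy = *v->live;
        copy->refs = 1;
        for (int k = 0; k < FANOUT; k++)
            copy->child[k]->refs++;       /* leaves now shared by both roots */
        v->live->refs--;
        v->live = copy;
    }
    struct leaf *l = v->live->child[i];
    if (l->refs > 1) {                    /* leaf shared with a snapshot */
        struct leaf *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        copy->refs = 1;
        copy->value = l->value;
        l->refs--;
        l = v->live->child[i] = copy;
    }
    l->value = value;
}

/* Read slot i through any root: the live one or a snapshot. */
int root_read(const struct root *r, int i)
{
    assert(i >= 0 && i < FANOUT);
    return r->child[i]->value;
}
```

Taking a snapshot touches one counter no matter how many leaves exist. After a later volume_write, the snapshot's root still reads the old value, the live root reads the new one, and the untouched leaves remain shared between them.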

To use a snapshot from userspace:

tmutil localsnapshot creates one named after the current time. Time Machine schedules these hourly.
tmutil listlocalsnapshots / lists them for the volume mounted at /.
mount_apfs -s <snapshot-name> <volume> /Volumes/snap mounts a snapshot read-only at a path you choose, where <volume> is the volume that holds the snapshot. Forensic tooling and rsync-style differential backups use this.
tmutil deletelocalsnapshots <date> removes one. The blocks it pinned become eligible for garbage collection.

APFS doesn't reclaim freed blocks eagerly: it marks them as candidates, and the volume's block allocator picks them up on a later write that needs space. So deleting a snapshot might not show up as freed space immediately; under low-space conditions the reclaim runs sooner.

What gets shared, what doesn't

When you clonefile a file, the data extents are shared. What's not shared:

Metadata — inode, mode, owner, timestamps. Each clone has its own.
xattrs — Apple copies the source's xattrs at clone time. Each side can change them independently.
resource forks — copied separately, may or may not also be cloned depending on size.

When you take a snapshot, the entire volume tree root is shared. Modifications after the snapshot go to new blocks; the old blocks stay alive as long as the snapshot pins them.

Why this is faster than HFS+'s hardlinks

HFS+ also had a "magic" copy, used by Time Machine: directory hardlinks (HFS+ was the only filesystem in common use that allowed them). But:

HFS+ directory hardlinks were a per-file metadata trick, with no actual copy-on-write at the block level. Editing a file via either link mutated the same data.
APFS clones DO copy on write — the moment you write() to one of the two files, the touched blocks become non-shared. Both files see the version they think they have.

That's the difference between aliasing and cloning. HFS+ aliased; APFS clones. The cloning model is what makes cp -c safe to use anywhere cp would have been.
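Aliasing is easy to observe with an ordinary file hard link, which stands in here for the HFS+ directory hardlinks described above (those cannot be created through link(2)). The sketch below uses only POSIX calls; the function names and the /tmp path template are choices made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite the start of the file at path with text. Returns 0 or -1. */
int overwrite(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Read up to cap-1 bytes from path into buf and NUL-terminate it. */
int read_start(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Create a temp file holding "old", hard-link it, write "new" through the
 * link, and read back through the original name into out.
 * Returns 1 if both names share an inode, 0 if not, -1 on error. */
int alias_demo(char *out, size_t cap)
{
    char a[] = "/tmp/alias-demo-XXXXXX";
    char b[64];
    struct stat sa, sb;
    int result = -1;

    int fd = mkstemp(a);
    if (fd < 0)
        return -1;
    if (write(fd, "old", 3) != 3) { close(fd); unlink(a); return -1; }
    close(fd);

    snprintf(b, sizeof b, "%s.link", a);
    if (link(a, b) == 0) {
        if (overwrite(b, "new") == 0 && read_start(a, out, cap) == 0 &&
            stat(a, &sa) == 0 && stat(b, &sb) == 0)
            result = (sa.st_ino == sb.st_ino && sa.st_dev == sb.st_dev);
        unlink(b);
    }
    unlink(a);
    return result;
}
```

The write made through the second name is visible through the first, because both names point at one inode. A clone would instead get its own inode, and the same write would leave the original file's contents untouched.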

apple-oss-distributions/xnu, bsd/vfs/vfs_subr.c: VFS plumbing the snapshot/clone calls share with every other vnode operation.

A complete clonefile call, end to end

For the curious, here's what happens during a cp -c source dest invocation:

cp parses -c, calls copyfile(3) with the COPYFILE_CLONE flag.
copyfile in libSystem calls clonefile(2).
The syscall enters the kernel through sysent[], lands in clonefile() in bsd/vfs/vfs_syscalls.c.
The kernel resolves the source path to a vnode, and the destination path to its parent directory, via namei.
It calls VNOP_CLONEFILE through the source vnode's operations vector.
APFS's clonefile implementation (closed source, in apfs.kext) does the extent-record copy + ref-count update.
The new vnode is returned. From here it's a normal vnode for both reads and writes.

The userspace round-trip is one syscall. The on-disk work is a few b-tree updates and a transaction commit. The 50 GB never moves.
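From userspace, the first two steps of that path come down to a single library call. The sketch below assumes the copyfile(3) route described in the walkthrough; the wrapper name clone_like_cp is invented, and the non-Apple branch exists only so the file compiles elsewhere. COPYFILE_CLONE is a best-effort flag: copyfile falls back to a regular copy when cloning isn't possible, and it implies COPYFILE_EXCL, so the call fails if the destination already exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifdef __APPLE__
#include <copyfile.h>
#endif

/* Ask copyfile(3) to clone src to dst, falling back to a copy if it must.
 * Returns 0 on success, -1 with errno set on failure.
 * Outside macOS this sketch always fails with ENOTSUP. */
int clone_like_cp(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef __APPLE__
    /* No state object is needed for a one-shot call. */
    return copyfile(src, dst, NULL, COPYFILE_CLONE);
#else
    errno = ENOTSUP;
    return -1;
#endif
}
```

Code that must never fall back to a byte copy can use COPYFILE_CLONE_FORCE instead, which makes the call fail when a clone cannot be made.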

What surprises newcomers

cp -c is not the same as cp. Without -c, cp does a byte-for-byte read/write copy even on APFS. Plain cp doesn't try to be clever.
mv within a volume is faster than cp -c because it doesn't even touch the data extents — it's a rename, which is one b-tree update.
Snapshots survive reboot. They're persisted on disk as part of the volume's metadata. Time Machine relies on this: after a fresh boot, your recent snapshots are still there.
Snapshots don't free space until deleted. A volume "out of space" with terabytes of supposedly free room often turns out to have weeks of Time Machine local snapshots pinning the blocks.

What to read next

For the VFS plumbing every file syscall takes:

apple-oss-distributions/xnu, bsd/sys/vnode.h: vnode_t, the in-core handle every clone/snapshot operation acts on.
apple-oss-distributions/xnu, bsd/sys/vnode_if.h: VNOP_*, the table of operations APFS implements.

And re-read the APFS overview article — once you've seen how COW works under writes, snapshots are just COW applied to the whole tree root.