
TCP internals on XNU: the state machine, congestion control, and the receive path

The 4.4 BSD TCP stack as it actually runs on macOS — state transitions, send/receive locking, Apple's congestion-control variants, and the kqueue path back to your app.


The network stack article covered XNU's networking architecture — mbufs, the protocol switch, Skywalk as the modern replacement. This article zooms into the most-used part: TCP. The state machine, the receive path's lock dance, Apple's congestion-control choices, and how all of it ends in a kevent waking your app.

The TCP state machine

Every TCP connection lives in one of 11 states, straight out of the Stevens textbook. The two most common paths through them, passive open and active open, are:

CLOSED → LISTEN → SYN_RCVD → ESTABLISHED → ... → CLOSE_WAIT → LAST_ACK → CLOSED
CLOSED → SYN_SENT → ESTABLISHED → ... → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT → CLOSED
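
The two paths above can be written down as a lookup table. The following is a minimal sketch of the textbook transitions, not XNU's code: the event names (recv_syn, close, timeout_2msl and so on) are illustrative labels invented for this example, and only the common transitions are included.

```python
# Simplified TCP state table: (current state, event) -> next state.
# Event names are illustrative labels, not XNU identifiers.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",
    ("LISTEN", "recv_syn"): "SYN_RCVD",
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",
    ("SYN_RCVD", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",
    ("CLOSE_WAIT", "close"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}


def walk(events, state="CLOSED"):
    """Apply events in order and return every state visited."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path


# Server side: passive open, then the peer closes first.
passive = walk(["passive_open", "recv_syn", "recv_ack",
                "recv_fin", "close", "recv_ack"])
# Client side: active open, then we close first.
active = walk(["active_open", "recv_syn_ack", "close",
               "recv_ack", "recv_fin", "timeout_2msl"])
print(" -> ".join(passive))
print(" -> ".join(active))
```

An unknown (state, event) pair raises KeyError here. A real stack instead drops the segment or answers with a RST.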

XNU's struct tcpcb (TCP control block) holds the state plus everything else — receive/send buffers, sequence numbers, RTT estimates, congestion window, retransmit timers.

Source (apple-oss-distributions/xnu on GitHub):

  • bsd/netinet/tcp_var.h: struct tcpcb, every TCP connection's complete kernel-side state.
  • bsd/netinet/tcp_subr.c: state management, helper functions, connection table.

State transitions happen in:

  • tcp_input — receive path, processing incoming segments.
  • tcp_output — send path, retransmits, ACK generation.
  • tcp_timer — timeouts (retransmit, keepalive, persist, 2MSL).
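
The retransmit timeout that the timer code enforces comes from the RTT estimates kept in the tcpcb. As a rough illustration, here is the standard RFC 6298 calculation. It is a sketch of the published algorithm rather than XNU's exact code: the class name and the min_rto parameter are made up for this example, and clock granularity is ignored.

```python
class RttEstimator:
    """Smoothed RTT and retransmit timeout in the style of RFC 6298."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, min_rto=1.0):
        self.min_rto = min_rto  # RFC 6298 suggests 1 s; stacks often go lower
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0          # initial RTO before any sample

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variance is updated first, using the old smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.min_rto, self.srtt + self.K * self.rttvar)
        return self.rto


est = RttEstimator(min_rto=0.0)
first = est.sample(0.100)   # srtt = 0.1, rttvar = 0.05
second = est.sample(0.100)  # a steady RTT shrinks the variance term
```

A steady RTT pulls the timeout down toward the smoothed RTT, while a jittery path keeps the variance term, and therefore the timeout, large.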

The classic Stevens transitions are unchanged on XNU. The implementation has been heavily optimized (lock granularity, batching), but the protocol behavior follows its 4.4 BSD and FreeBSD lineage.

The receive path, step by step

A TCP segment arrives:

  1. NIC driver enqueues the mbuf and signals.
  2. ip_input strips the IP header, dispatches by protocol → tcp_input.
  3. tcp_input looks up the matching tcpcb in the connection hash table.
  4. Takes the tcpcb's lock.
  5. Validates the segment (checksums, sequence numbers, window).
  6. Updates RTT estimates if this is an ACK.
  7. Advances state if a state-transition segment (SYN, FIN, RST).
  8. If data, copies the payload into the socket's receive buffer.
  9. If the receive buffer was previously empty, wakes any thread blocked in read(2) AND signals any kqueue watching this socket.
  10. Drops the tcpcb lock.
  11. Optionally generates an ACK via tcp_output.

Source (apple-oss-distributions/xnu on GitHub):

  • bsd/netinet/tcp_input.c: the TCP receive function, the most-traveled code path in the network stack.
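
Steps 3 to 11 above can be modeled in a few lines. This is a toy model under loud assumptions: Conn is a hypothetical stand-in for a tcpcb plus its socket buffer, the Event stands in for both the read(2) wakeup and the kqueue signal, and only in-order data is handled (no checksums, no reassembly, no state changes).

```python
import threading
from collections import namedtuple

FourTuple = namedtuple("FourTuple", "src_ip src_port dst_ip dst_port")


class Conn:
    """Hypothetical stand-in for a tcpcb plus its socket receive buffer."""

    def __init__(self, rcv_nxt):
        self.lock = threading.Lock()
        self.rcv_nxt = rcv_nxt             # next sequence number expected
        self.rcv_buf = bytearray()         # socket receive buffer
        self.readable = threading.Event()  # models the read(2)/kqueue wakeup


connections = {}  # connection table keyed by 4-tuple


def tcp_input(key, seq, payload):
    conn = connections.get(key)            # step 3: look up the connection
    if conn is None:
        return "no-connection"             # a real stack would answer with RST
    with conn.lock:                        # steps 4 and 10: take and drop the lock
        if seq != conn.rcv_nxt:            # step 5: accept in-order data only
            return "out-of-order"
        was_empty = not conn.rcv_buf
        conn.rcv_buf += payload            # step 8: copy into the receive buffer
        conn.rcv_nxt = (conn.rcv_nxt + len(payload)) & 0xFFFFFFFF
        if was_empty and payload:
            conn.readable.set()            # step 9: wake readers and watchers
    return "ack"                           # step 11: caller would emit an ACK


flow = FourTuple("10.0.0.2", 50000, "10.0.0.1", 443)
connections[flow] = Conn(rcv_nxt=1000)
print(tcp_input(flow, 1000, b"hello"))
print(tcp_input(flow, 2000, b"early"))
```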

The lock-around-tcpcb pattern is XNU's chosen concurrency model: each connection has its own lock, so connections can be processed in parallel as long as packets for different connections land on different CPUs. Hashing-based NIC steering helps here on multi-queue NICs.
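
Flow steering is easy to picture with a toy hash. In the sketch below, CRC32 is only a stand-in chosen because it ships with Python (real NICs typically use a keyed hash over the same fields), and queue_for is a name invented for this example. What matters is that every packet of one connection lands on the same queue, so its lock is rarely contended across CPUs.

```python
import zlib


def queue_for(src_ip, src_port, dst_ip, dst_port, n_queues):
    """Map a connection 4-tuple to a receive queue index."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_queues


# Packets of one flow always hash to the same queue.
a = queue_for("10.0.0.2", 50000, "10.0.0.1", 443, 4)
b = queue_for("10.0.0.2", 50000, "10.0.0.1", 443, 4)

# Many flows spread across the available queues.
used = {queue_for("10.0.0.2", port, "10.0.0.1", 443, 4)
        for port in range(50000, 51000)}
```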

Congestion control — variants and selection

Modern XNU ships multiple congestion-control algorithms:

  • Cubic (default for most connections) — the cubic window-growth algorithm specified in RFC 9438.
  • NewReno — fallback for compatibility.
  • RACK + Apple-specific heuristics — for loss-recovery-sensitive workloads.
  • LedBat — for low-priority background flows (Time Machine, software updates).
  • BBR-derivatives in some paths (research/experimental).

The choice per connection is influenced by:

  • The application's hints (sockopt for low-priority transmission).
  • The interface (cellular vs Wi-Fi vs Ethernet — different RTT and loss profiles).
  • The QoS class of the originating process.

Source (apple-oss-distributions/xnu on GitHub):

  • bsd/netinet/tcp_cubic.c: Cubic congestion control, the default on most macOS connections.
  • bsd/netinet/tcp_ledbat.c: LedBat, low-extra-delay background transport.
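
To give a feel for what Cubic computes, here is the window curve as RFC 9438 defines it. The constants are the RFC's defaults and have not been checked against tcp_cubic.c, so treat this as a sketch of the published algorithm rather than Apple's implementation. Windows are in segments and time is in seconds since the last loss.

```python
C = 0.4      # RFC 9438 scaling constant
BETA = 0.7   # multiplicative decrease factor applied on loss


def cubic_k(w_max):
    """Seconds needed to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t, w_max):
    """Congestion window t seconds after a loss at window w_max."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100.0
for t in (0.0, 2.0, cubic_k(w_max), 6.0):
    print(f"t={t:5.2f}s  cwnd={cubic_window(t, w_max):7.2f} segments")
```

The curve starts at BETA times w_max, flattens as it approaches the old maximum, then accelerates again to probe for more bandwidth.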

A background process doing a software-update download uses LedBat, which yields bandwidth to foreground traffic — your video call doesn't stutter because Software Update is downloading 5 GB in the background.
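
The yielding behavior comes from a delay-based controller. The sketch below follows the window update in RFC 6817; the constants are the RFC's, the function name is invented here, and XNU's tcp_ledbat.c may well differ in its details.

```python
TARGET = 0.100   # maximum queuing delay LEDBAT will add, in seconds
GAIN = 1.0
MSS = 1448       # assumed segment size in bytes
MIN_CWND = 2     # floor, in segments


def ledbat_on_ack(cwnd, bytes_acked, current_delay, base_delay):
    """Return the new congestion window (bytes) after one ACK."""
    queuing_delay = current_delay - base_delay
    # Positive while the queue is below target, negative once it is above.
    off_target = (TARGET - queuing_delay) / TARGET
    cwnd += GAIN * off_target * bytes_acked * MSS / cwnd
    return max(cwnd, MIN_CWND * MSS)


start = 10 * MSS
idle_link = ledbat_on_ack(start, MSS, current_delay=0.020, base_delay=0.020)
busy_link = ledbat_on_ack(start, MSS, current_delay=0.220, base_delay=0.020)
```

With no queuing delay the window grows. Once other traffic pushes the measured delay past the target, the window shrinks, which is how a background download gets out of the way.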

The send path

write(2) on a TCP socket:

  1. Copies user data into the socket's send buffer (in an mbuf chain).
  2. Calls tcp_output to decide what to send right now.
  3. tcp_output checks the send window, the congestion window, Nagle's algorithm.
  4. If a segment can go out, allocates an mbuf header, adds TCP + IP headers, hands to ip_output.
  5. ip_output routes (looks up the next hop in the routing table) and hands the packet to the interface layer, which adds the Ethernet header and passes it to the driver.
  6. Driver DMAs the packet to the NIC.
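
The decision in step 3 boils down to a few comparisons. This is a deliberately simplified sketch with invented function names: it ignores silly-window avoidance, timers, and everything else the real tcp_output weighs.

```python
def usable_window(snd_wnd, cwnd, in_flight):
    """Bytes we may still put on the wire right now."""
    return max(0, min(snd_wnd, cwnd) - in_flight)


def can_send_now(queued, snd_wnd, cwnd, in_flight, mss, nodelay=False):
    """Decide whether a segment should go out immediately."""
    room = usable_window(snd_wnd, cwnd, in_flight)
    if queued == 0 or room == 0:
        return False
    if min(queued, room) >= mss:
        return True  # a full-sized segment is always worth sending
    # Nagle: hold a small segment while earlier data is unacknowledged.
    return nodelay or in_flight == 0


MSS = 1448
print(can_send_now(3000, 65535, 10 * MSS, 0, MSS))      # full segment ready
print(can_send_now(100, 65535, 10 * MSS, MSS, MSS))     # small write, data in flight
print(can_send_now(100, 65535, 10 * MSS, MSS, MSS, nodelay=True))
```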

The send buffer fills if the network can't keep up; write blocks (or returns EAGAIN for non-blocking sockets) once the buffer is full. The send window plus the congestion window determine "how much data can we send before getting an ACK."
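
The buffer-full behavior is easy to observe from userspace. The sketch below connects a non-blocking client to a loopback peer that never reads, then writes until the kernel refuses. Python surfaces EAGAIN as BlockingIOError. The helper name and the 1 GiB safety limit are arbitrary choices for this example.

```python
import socket


def fill_send_buffer(limit=1 << 30):
    """Write to a peer that never reads; return (bytes accepted, saw EAGAIN)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()          # accepted, but never read from
    client.setblocking(False)
    chunk = b"x" * 65536
    sent = 0
    hit_eagain = False
    try:
        while sent < limit:
            sent += client.send(chunk)     # may accept only part of the chunk
    except BlockingIOError:                # EAGAIN: the send buffer is full
        hit_eagain = True
    finally:
        for s in (client, server, listener):
            s.close()
    return sent, hit_eagain


sent, hit_eagain = fill_send_buffer()
print(f"kernel accepted {sent} bytes before EAGAIN={hit_eagain}")
```

The byte count varies by system because it reflects both the sender's send buffer and the receiver's receive buffer.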

Skywalk takes over for some connections

For apps using Apple's modern Network.framework, the data path goes through Skywalk instead of the classic stack. Same TCP state machine, but:

  • The receive buffer is a shared-memory ring instead of an mbuf chain in the kernel.
  • The userspace ↔ kernel handoff is zero-copy for typical payloads.
  • Per-channel queueing instead of per-socket buffering.

The classic stack remains for legacy POSIX-socket apps. Both paths coexist.

How a kevent wakes your app

When the receive path signals that data is available:

  1. Inside the socket's kqueue filter, the kernel marks the filter as ready.
  2. Any thread currently in kevent(2) on that kqueue is woken.
  3. kevent returns to userspace with the ready event.
  4. The app reads from the socket.
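
From userspace the four steps look like this. The sketch uses Python's selectors module, whose DefaultSelector is backed by kqueue on macOS (and by other mechanisms elsewhere), so it shows the shape of the wakeup rather than raw kevent(2) calls. The helper name is invented for this example.

```python
import selectors
import socket


def wait_for_data():
    """Return (events seen while idle, bytes read after the wakeup)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()

    sel = selectors.DefaultSelector()      # KqueueSelector on macOS
    sel.register(server, selectors.EVENT_READ)

    idle = sel.select(timeout=0)           # nothing has arrived yet
    client.sendall(b"ping")
    ready = sel.select(timeout=5)          # sleeps until the socket is readable
    data = ready[0][0].fileobj.recv(16) if ready else b""

    sel.close()
    for s in (client, server, listener):
        s.close()
    return idle, data


idle, data = wait_for_data()
print(idle, data)
```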

If the app is using libdispatch's DISPATCH_SOURCE_TYPE_READ source (as most modern Mac apps do, directly or through higher-level frameworks), a dispatch worker thread servicing the source's queue wakes and runs the source's handler block, and the handler calls read.

This is the path NSURLSession completion handlers ultimately use — every network response wakes via a kqueue dispatch source.

TCP fast open and TLS 1.3 0-RTT

Modern Apple devices use both:

  • TCP Fast Open (RFC 7413) — under the right conditions, payload can be carried in the SYN. Saves an RTT on repeat connections to the same server.
  • TLS 1.3 0-RTT — sends application data with the first TLS handshake message. Saves another RTT for resumed sessions.

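Together they let a repeat HTTPS connection send its first request with zero round trips of connection setup, instead of the two that a fresh TCP plus TLS 1.3 handshake needs (three with TLS 1.2). Apple's CDN traffic and many third-party services exploit this.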
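
The round-trip arithmetic is simple enough to tabulate. The function below counts round trips spent on setup before the first request byte can leave the client. Its name and the "1.3-0rtt" label are invented for this sketch, and it assumes TFO succeeds (a valid cookie from an earlier connection).

```python
def setup_round_trips(tfo, tls):
    """Round trips before the first request byte can be sent."""
    tcp = 0 if tfo else 1   # TFO lets data ride in the SYN
    tls_handshake = {
        "1.2": 2,           # full TLS 1.2 handshake
        "1.3": 1,           # full TLS 1.3 handshake
        "1.3-0rtt": 0,      # resumed TLS 1.3 session with early data
    }[tls]
    return tcp + tls_handshake


for tfo in (False, True):
    for tls in ("1.2", "1.3", "1.3-0rtt"):
        print(f"TFO={tfo!s:5}  TLS {tls:8}  {setup_round_trips(tfo, tls)} RTT")
```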

The kernel side participates in TFO (the SYN-with-data semantics); TLS 0-RTT is purely userland.

What surprises newcomers

  • TCP state is per-connection. In a listen()-then-accept() server, the listening socket stays in LISTEN, and each accepted connection gets its own socket with its own tcpcb in ESTABLISHED.
  • Per-tcpcb locking means high-connection-count servers can scale. The lock granularity matters at scale.
  • Background processes use a congestion-control variant (LedBat) that deliberately yields to other traffic. This is what keeps software updates from stomping on user traffic.
  • Apple Silicon's offload engines (network DMA, checksum offload, TSO/LRO) move work off the CPU. The kernel arms them and processes the resulting batched packets.
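
The first point in the list above can be checked from userspace. In this sketch a listener accepts three loopback connections: the listener keeps reporting that it is a listening socket, while each accepted socket is a separate object sharing the same local port and distinguished only by its peer address.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)

clients = [socket.create_connection(listener.getsockname()) for _ in range(3)]
accepted = [listener.accept()[0] for _ in range(3)]

# SO_ACCEPTCONN is non-zero only for a socket that is listening.
listener_listening = listener.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)
accepted_listening = [s.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)
                      for s in accepted]

local_ports = {s.getsockname()[1] for s in accepted}   # all share one port
peers = sorted(s.getpeername() for s in accepted)      # one per client
client_addrs = sorted(c.getsockname() for c in clients)

for s in clients + accepted + [listener]:
    s.close()
```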

Source (apple-oss-distributions/xnu on GitHub):

  • bsd/netinet/tcp_output.c: the TCP send function, the opposite end of the receive path.
  • bsd/netinet/tcp_timer.c: retransmit, keepalive, persist, 2MSL. Every TCP timer lives here.

See also the network stack article for the surrounding context.
