Escaping the Linux Networking Stack at Cloudflare

Courtesy of the complex routing and network configurations that Cloudflare uses, their engineers like to push the Linux network stack to its limits and ideally beyond. In a blog article [Chris Branch] details how they ran into limitations while expanding their use of soft-unicast, a feature that complements their extensive use of anycast to push as much redundancy onto the external network as possible.

The particular issue that they ran into had to do with how the Netfilter connection tracking (conntrack) module and the Linux socket subsystem interact when packet rewriting is used. For soft-unicast it is important that multiple processes are aware of the same connection, yet due to how Linux works this made it impossible to use packet rewriting. Instead they initially had to use a local proxy, which adds overhead.

To work around this, the solution appeared to be to abuse the TCP_REPAIR socket option in Linux, which normally exists for tasks such as checkpointing and migrating live network connections. This option enables one to describe the entire state of a socket connection, thus ‘repairing’ it. It was combined with TCP Fast Open, which uses a TFO ‘cookie’ to skip the whole handshake bit. This still left a few more issues to fix, with early demux providing a potential solution.

Ironically, it was ultimately decided to not break the Linux networking stack that much and to stick with the much less complicated local proxy, which terminates TCP connections and redirects traffic to a local socket. Unfortunately, escaping the Linux networking stack isn’t that straightforward.
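For readers curious what ‘describing’ a connection with TCP_REPAIR looks like in practice, here is a minimal Python sketch. It is not Cloudflare's code: the function name `build_repaired_socket` and its parameters are hypothetical, the option numbers are copied from `<linux/tcp.h>` because Python's `socket` module may not export them, and the sequence numbers are assumed to come from whatever already knows the connection state. Setting TCP_REPAIR requires CAP_NET_ADMIN, so without that capability the function simply returns `None`.

```python
import socket
import struct
import sys

# Option numbers from <linux/tcp.h> (assumed stable; not exported by Python).
TCP_REPAIR = 19
TCP_REPAIR_QUEUE = 20
TCP_QUEUE_SEQ = 21
TCP_RECV_QUEUE = 1
TCP_SEND_QUEUE = 2


def pack_u32(value):
    """Pack a 32-bit sequence number in native byte order for setsockopt."""
    return struct.pack("I", value & 0xFFFFFFFF)


def build_repaired_socket(local, remote, snd_seq, rcv_seq):
    """Try to create an established TCP socket without any handshake.

    Returns the socket on success, or None if repair mode is unavailable
    (non-Linux platform, missing CAP_NET_ADMIN, or any other OS error).
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    except OSError:
        return None
    try:
        # Enter repair mode: from here on the kernel sends no packets for us.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        # Sequence numbers can only be set while the socket is still closed.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_SEND_QUEUE)
        sock.setsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ, pack_u32(snd_seq))
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_RECV_QUEUE)
        sock.setsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ, pack_u32(rcv_seq))
        sock.bind(local)
        # In repair mode connect() sends no SYN; it just marks the socket
        # as established towards the given peer.
        sock.connect(remote)
        # Leave repair mode; the socket now behaves like a normal connection.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
        return sock
    except OSError:
        sock.close()
        return None
```

A real user of this technique would also have to restore TCP options such as window scaling and MSS, which this sketch leaves out. That gives some idea of why the simpler local proxy won out in the end.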