Linux Performance Hacks Used by Netflix, Twitter, and Google

Explore the advanced Linux kernel optimization techniques used by tech giants to reduce latency and increase throughput, including kernel bypass, CPU pinning, and NUMA-aware allocation.

June 2026 · 6 min read

Netflix, Twitter, and Google All Use These Linux Performance Hacks — Here’s How

When you’re serving billions of requests a day, a 5-millisecond lag isn’t a minor annoyance — it’s a revenue disaster. High-traffic internet companies like Netflix, Twitter, and Google don’t just throw hardware at performance problems. They optimize the Linux kernel itself, often in ways most developers never touch.

Let’s look under the hood at the actual techniques that keep these systems running at insane scale.

The Kernel Bypass Revolution

Standard Linux networking works like this: a packet arrives at the network card, an interrupt hands it to the kernel’s network stack, and the payload is copied into userspace when your application calls read or recv. That is interrupt handling, at least one system call, and at least one memory copy on every receive path, and the cost adds up fast at high packet rates.

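At extreme packet rates, this overhead becomes a hard ceiling. That’s why packet-heavy systems turn to kernel bypass techniques.

The tool: DPDK (Data Plane Development Kit). Instead of letting the kernel handle networking, DPDK lets user-space applications drive the network card directly through poll-mode drivers. The kernel’s network stack is out of the data path, at the price of dedicating cores to busy-polling and giving up the kernel’s TCP/IP stack and its tooling.

Real-world impact: kernel bypass targets workloads that need tens of gigabits per second per server with latency in the microsecond range. One caveat on the Netflix example: its Open Connect CDN appliances are publicly documented as running FreeBSD with NGINX and an optimized in-kernel sendfile path rather than DPDK on Linux, so they are better read as proof that a tuned kernel path can also reach this scale. The often-quoted comparison (40 Gbps at under 10 microseconds with bypass, versus roughly 10 Gbps with 100x the jitter on the standard stack) should be treated as illustrative, not measured.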

CPU Pinning and Isolated Cores

Twitter’s infrastructure team discovered that standard Linux CPU scheduling was killing their caching layer. When Redis or Memcached suddenly got preempted by a kernel task or a cron job, a queue of waiting requests piled up instantly.

The fix? CPU isolation combined with pinning.

# Isolate cores 1-3 from general scheduling
isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3

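Add this to your kernel boot parameters (for example, GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the GRUB config and reboot). Core 0 stays available for the kernel’s housekeeping work. Then manually pin your high-priority processes: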

# Pin Redis to isolated core 1
taskset -c 1 redis-server

Result: The CPU cache stays hot because almost nothing else runs on that core, and context switches drop to near zero. Twitter reportedly saw a 30% throughput increase on critical services after isolating just two cores per process.
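If you would rather pin from inside the process than wrap it in taskset, the same effect can be sketched with sched_setaffinity. This is a minimal illustration, not how any of the companies above do it; the helper names first_allowed_cpu and pin_self_to_cpu are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread is allowed to run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}

/* Pin the calling thread to exactly one CPU (hypothetical helper).
 * Returns 0 on success, -1 on failure (errno is set by sched_setaffinity). */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread */
}
```

In a real service you would pass the isolated core number (1 in the example above) rather than the first allowed CPU. Inside a container or cpuset that excludes the requested core, the call fails with EINVAL, so check the return value.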

The NUMA Trap That Costs You 20% Performance

Here’s a classic mistake: treating all memory as equal. On multi-socket servers (which many production machines are), a core on socket 0 reading memory attached to socket 1 pays a remote-access penalty, commonly cited as around 1.5x to 2x the local latency depending on the hardware.

High-traffic companies don’t leave this to chance. They use NUMA-aware allocation explicitly.

# Run a process on socket 0's cores, using socket 0's memory
numactl --cpunodebind=0 --membind=0 ./myapp

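Pro tip attributed to Google’s production engineering: disable automatic NUMA balancing (numa_balancing=disable on the kernel command line, or sysctl kernel.numa_balancing=0 at runtime), because the page scanning and migration it performs can cause latency spikes. Once processes are pinned by hand, the automatic balancer has little left to fix.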
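For the CPU half of that numactl command, a process can also bind itself by reading the node’s CPU list from sysfs. The sketch below assumes the standard /sys/devices/system/node/nodeN/cpulist layout; parse_cpulist and bind_self_to_node_cpus are hypothetical helper names. It does not set a memory policy: it relies on Linux’s default local-allocation behavior, which prefers the node the thread runs on but, unlike --membind, can fall back to other nodes when local memory runs out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a sysfs CPU list such as "0-3,8-11" into a cpu_set_t.
 * Returns the number of CPUs in the set, or -1 on malformed input. */
int parse_cpulist(const char *list, cpu_set_t *set)
{
    const char *p = list;

    assert(list != NULL && set != NULL);
    CPU_ZERO(set);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;
        long hi = lo;
        if (*end == '-') {                 /* range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        if (hi >= CPU_SETSIZE)
            return -1;
        for (long cpu = lo; cpu <= hi; cpu++)
            CPU_SET((int)cpu, set);
        if (*end == ',')
            end++;
        else if (*end != '\0' && *end != '\n')
            return -1;
        p = end;
    }
    return CPU_COUNT(set);
}

/* Restrict the calling thread to the CPUs of one NUMA node (hypothetical helper).
 * Returns 0 on success, -1 if the node is missing, has no CPUs, or the call fails. */
int bind_self_to_node_cpus(int node)
{
    char path[64], buf[4096];
    cpu_set_t set;

    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(buf, sizeof(buf), f);
    fclose(f);
    if (line == NULL || parse_cpulist(buf, &set) <= 0)
        return -1;
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Call bind_self_to_node_cpus(0) early, before the process allocates and touches its large buffers, so first-touch allocation lands on the same node.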

The Socket and Epoll Myth

Most developers think epoll is the endgame for high-concurrency networking. But at extreme scale, even epoll has problems.

The hidden issue: lock contention inside epoll. When you have 100+ threads all calling epoll_wait on the same epoll descriptor, the locks protecting that instance’s ready list become a bottleneck, and a single shared listen socket adds contention on accept as well.

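The solution (attributed here to Netflix): use SO_REUSEPORT, available since Linux 3.9, to create multiple listen sockets on the same port, each registered with its own epoll instance. The kernel spreads incoming connections across the sockets, so each CPU core gets its own socket and its own epoll fd, with no shared accept queue to fight over.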

// Enable multiple processes listening on same port
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));

This simple change reportedly let them saturate 40 Gbps NICs without accept-path contention in the kernel becoming the bottleneck.

Page Cache is Not Your Friend for Real-Time

High-traffic systems don’t rely on the kernel’s page cache for performance-critical paths. Here’s why: page cache eviction causes unpredictable stalls.

When your database query misses the page cache, the read blocks until the data comes back from storage. On a busy disk, that single miss can spike your p99 latency from around 1 ms to 100 ms.

The workaround: Direct I/O with userspace caching.

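Search and database engines have long taken this route. Inktomi’s search cache (Inktomi was an independent search company, later acquired by Yahoo, not part of Google) is cited as an early example of bypassing the kernel’s buffer cache entirely and managing its own in-memory storage. On Linux today the same pattern means opening files with O_DIRECT and backing your own cache with huge pages through mmap and MAP_HUGETLB, which takes kernel cache management off the hot path.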

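// Force direct I/O, bypassing the kernel page cache.
// Needs #define _GNU_SOURCE before <fcntl.h>; buffers, offsets, and lengths must be block-aligned.
int fd = open("/data/file", O_RDONLY | O_DIRECT);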
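The alignment requirement is where most first attempts at O_DIRECT fail, so here is a minimal sketch of an aligned read. The helper names round_up_pow2 and read_direct are hypothetical, and the right alignment depends on the device and filesystem; 4096 bytes is a common assumption, not a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Round n up to the next multiple of align (align must be a power of two). */
size_t round_up_pow2(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Read up to `want` bytes from the start of `path`, bypassing the page cache.
 * On success returns the byte count and stores an aligned buffer in *out
 * (caller frees it). Returns -1 on error, including filesystems that
 * reject O_DIRECT. */
ssize_t read_direct(const char *path, size_t want, size_t align, void **out)
{
    void *buf = NULL;
    size_t len = round_up_pow2(want, align); /* length must be aligned too */

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, align, len) != 0) { /* aligned buffer address */
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, len); /* offset 0 is trivially aligned */
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}
```

A call such as read_direct("/data/file", 1000, 4096, &buf) issues a 4096-byte read and returns however many bytes the file actually holds. Some filesystems, tmpfs among them, reject O_DIRECT at open time, so a fallback to buffered I/O is worth having.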

The Good and the Bad

Not every technique applies to every project. If you run a small Rails app on a single server, kernel bypass is overkill. But if you’re pushing millions of requests daily, these three tricks will actually move the needle:

Isolate one CPU core for your most latency-sensitive process
Bind memory and CPU together using numactl
Use SO_REUSEPORT if you have a multi-threaded network server

The best engineers at these companies didn’t invent new algorithms. They just understood how Linux really worked — and removed everything that got in the way.
