
Lessons from Running Millions of Containers in Kubernetes

Explore the architectural bottlenecks of scaling Kubernetes to millions of containers, focusing on etcd optimization, scheduler sharding, and networking performance.

June 2026 · 6 min read


The Story of Kubernetes at Scale: Lessons from Running Millions of Containers

You think you know Kubernetes. You’ve spun up a cluster, deployed a microservice, maybe scaled to 50 pods. Then you hit the wall where the control plane starts sweating, etcd starts lagging, and your scheduled jobs get jammed like a Tokyo train at rush hour. Running Kubernetes at millions of containers is a different beast entirely.

Let me take you inside that world.

The Wall You Didn’t See Coming

Scaling Kubernetes isn’t a linear problem. At small sizes, everything is snappy. API calls return in milliseconds. Pods schedule instantly. Then, somewhere around 5,000 nodes or 150,000 pods, the world subtly shifts.

The first sign is usually etcd. It’s the brain of your cluster, but that brain wasn’t designed to absorb heartbeats and status updates from every node at once. Google’s Borg, Kubernetes’ spiritual ancestor, had similar issues: too many heartbeats, too much churn. The fix? Reduce noise.

Lesson 1: Etcd Is Not a Database — It’s a Warden

Treat etcd like a delicate instrument. Every object update lands in it, and so does every list or watch request that the API server cannot answer from its own cache. At scale, you see things like:

Watches on all pods in a 200,000-pod cluster consuming 40% of etcd’s CPU.
Single List calls timing out because the response size exceeds gRPC limits.
Leader elections stalling because heartbeats collide with write-heavy operations.

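Solutions that work in production:

Limit watchers. Use filtered watches (for example fieldSelector=status.phase!=Running) instead of blanket watches.
Compaction. Keep etcd compaction frequent. kube-apiserver already requests a compaction every 5 minutes by default (--etcd-compaction-interval), so check that this interval has not been lengthened or disabled.
Defrag. Compaction frees space inside etcd’s storage file but does not shrink it, so the file fragments like a hard drive. Defragmenting every few hours, one member at a time because a member blocks while it defragments, prevents latency spikes.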
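A minimal sketch of what a filtered, paginated request looks like on the wire, in Python. The query parameters fieldSelector, limit, continue, and watch are part of the Kubernetes API; the server address and the fetch_page callable are hypothetical placeholders so the example runs without a cluster.

```python
from urllib.parse import urlencode


def pod_list_url(api_server, field_selector=None, limit=None,
                 continue_token=None, watch=False):
    """Build a URL that lists or watches pods across all namespaces."""
    params = {}
    if field_selector:
        params["fieldSelector"] = field_selector
    if limit is not None:
        params["limit"] = str(limit)
    if continue_token:
        params["continue"] = continue_token
    if watch:
        params["watch"] = "true"
    url = f"{api_server}/api/v1/pods"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def list_all(fetch_page, page_size=500):
    """Drain a paginated List by following the continue token.

    fetch_page(limit, continue_token) is a hypothetical callable that
    returns (items, next_token); next_token is empty on the last page.
    """
    items, token = [], None
    while True:
        page, token = fetch_page(page_size, token)
        items.extend(page)
        if not token:
            return items
```

Asking for pages of a few hundred objects keeps each response small, which is one way to stay under the response-size limits mentioned above.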

The Scheduler Bottleneck

The Kubernetes scheduler is a beautiful piece of software. It also places pods one at a time, and with leader election only one instance is actively scheduling at any moment. When you ask it to place 10,000 pods in 30 seconds, it chokes.

Real-world story: A large video streaming platform ran into this during a regional outage recovery. Their scheduler was taking 8 seconds per pod during peak load. That’s 22 hours to schedule 10,000 pods — unacceptable when every second of downtime costs revenue.
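The arithmetic behind that figure is worth spelling out, because it also shows how far the scheduler is from the 30-second target. The helper names below are illustrative only.

```python
def serial_scheduling_hours(pods, seconds_per_pod):
    """Hours needed when pods are placed strictly one after another."""
    return pods * seconds_per_pod / 3600


def required_seconds_per_pod(pods, deadline_seconds, schedulers=1):
    """Per-pod time budget needed to meet a deadline.

    Assumes the work splits evenly across independent schedulers.
    """
    return deadline_seconds * schedulers / pods


hours = serial_scheduling_hours(10_000, 8)      # about 22.2 hours
budget = required_seconds_per_pod(10_000, 30)   # 0.003 s, i.e. 3 ms per pod
```

At 8 seconds per pod the scheduler is thousands of times slower than the 3 ms budget, so no single tweak closes the gap; it takes a combination of cheaper decisions and more parallelism.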

Lesson 2: Break the Scheduler Bottleneck

Three proven strategies:

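Scheduler profiling. By default, the scheduler runs many filter plugins (formerly called predicates) against every candidate node. Measure which ones dominate, then disable those you don’t need and lower percentageOfNodesToScore so each cycle stops once it has found enough feasible nodes. Be careful with NodeResourcesFit: even when all nodes are identical, it is the check that keeps a node from being overcommitted, so skip it only if something else enforces capacity.
Multiple scheduler instances. Run 3-5 schedulers under different names and have each pod pick one through spec.schedulerName. Confining each scheduler to a subset of nodes needs custom logic, since kube-scheduler has no built-in node-selector flag. This is similar in spirit to the sharding used by Uber’s Peloton.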
Pre-scheduling. For batch workloads, pre-allocate node slots with a custom controller. The scheduler only does final placement — reduces decision time from seconds to milliseconds.
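As a rough sketch of the sharding idea, the Python below assigns each pod to one of several schedulers by hashing its namespace. The field spec.schedulerName is the real mechanism a pod uses to select a scheduler; the shard names and the choice of hashing by namespace are assumptions made for illustration.

```python
import hashlib


def pick_scheduler(namespace, scheduler_names):
    """Map a namespace to one scheduler name with a stable hash."""
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(scheduler_names)
    return scheduler_names[index]


def with_scheduler(pod_manifest, scheduler_names):
    """Return a copy of a pod manifest with spec.schedulerName set."""
    namespace = pod_manifest.get("metadata", {}).get("namespace", "default")
    spec = dict(pod_manifest.get("spec", {}))
    spec["schedulerName"] = pick_scheduler(namespace, scheduler_names)
    return {**pod_manifest, "spec": spec}


# Hypothetical shard names; each must match a running scheduler's name.
SHARDS = ["scheduler-shard-0", "scheduler-shard-1", "scheduler-shard-2"]
pod = {"metadata": {"name": "worker", "namespace": "payments"},
       "spec": {"containers": []}}
sharded_pod = with_scheduler(pod, SHARDS)
```

Note that this only partitions pods. If the shards can still see the same nodes, two schedulers may both try to fill the same capacity, so a production setup would also need to partition nodes or tolerate retries.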

The Network That Fought Back

Kubernetes networking is deceptive. It works beautifully at 100 pods. At 100,000 pods, every DNS query, every Service lookup, every kube-proxy iptables rule becomes a tax on the system.

One company found that their DNS resolver was handling 500,000 queries per second per node during pod churn. The culprit? Every new pod triggered a DNS lookup for its service discovery, which in turn triggered a wave of kube-dns updates.
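Much of a query storm like this is repeat traffic for the same few names. The toy TTL cache below, a simplified model rather than how any real resolver is implemented, shows how a cache in front of an upstream resolver collapses repeated lookups. The class name, the 30-second TTL, and the upstream function are all assumptions.

```python
class TtlDnsCache:
    """Tiny TTL cache in front of an upstream resolver function."""

    def __init__(self, resolve_upstream, ttl_seconds=30.0):
        self._resolve = resolve_upstream
        self._ttl = ttl_seconds
        self._entries = {}  # name -> (address, expiry time)
        self.upstream_queries = 0

    def lookup(self, name, now):
        """Return the cached address, or ask upstream if missing or expired."""
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]
        self.upstream_queries += 1
        address = self._resolve(name)
        self._entries[name] = (address, now + self._ttl)
        return address


def fake_upstream(name):
    """Stand-in for the central DNS service."""
    return "10.0.0.1"


cache = TtlDnsCache(fake_upstream, ttl_seconds=30.0)
for i in range(1000):
    cache.lookup("api.default.svc.cluster.local", now=i * 0.01)
# 1000 lookups within the TTL reach the upstream resolver only once.
```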

Lesson 3: Tame DNS and Service Mesh

Local DNS cache. Run NodeLocal DNSCache (node-local-dns) on every node. It answers repeat queries from a per-node cache, which in this account offloaded the central DNS service by 90%.
Service meshes at scale. Istio or Linkerd add overhead at scale — more proxies, more mTLS handshakes. Evaluate if you truly need mutual TLS or if simpler network policies suffice.
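Avoid iptables for large clusters. At 50,000+ Services, iptables rules become a performance nightmare: rules are evaluated sequentially, so per-packet cost grows linearly with the number of Services, and each update rewrites large parts of the rule set. Switch to eBPF-based CNI plugins like Cilium, which use hash-table lookups that stay close to constant time as Services are added.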
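The difference between walking a rule chain and looking up a hash table can be sketched in a few lines of Python. This is a deliberately simplified model: real iptables chains and eBPF maps are more involved, and the function names and address scheme here are made up for the example.

```python
def build_rules(count):
    """Create (service_ip, backend) pairs with unique fake addresses."""
    return [
        (f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}", f"backend-{i}")
        for i in range(count)
    ]


def match_linear(rules, service_ip):
    """Walk an ordered rule list; return (backend, comparisons made)."""
    comparisons = 0
    for rule_ip, backend in rules:
        comparisons += 1
        if rule_ip == service_ip:
            return backend, comparisons
    return None, comparisons


def match_hashed(table, service_ip):
    """Look the address up in a hash table in a single step."""
    return table.get(service_ip)


rules = build_rules(50_000)
table = dict(rules)
last_ip = rules[-1][0]

backend, comparisons = match_linear(rules, last_ip)  # 50,000 comparisons
same_backend = match_hashed(table, last_ip)          # one hash lookup
```

The worst case for the sequential walk grows with every Service added, while the hashed lookup does not, which is the intuition behind moving large clusters off long rule chains.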

The Human Factor

The biggest lesson from running millions of containers isn’t technical. It’s organizational.

At that scale, no one can “just check the logs.” Your monitoring must be instrumented from day one. Your SRE team must automate everything — upgrades, scaling, rollbacks. One mistake (like deleting a namespace that contains a critical operator) can cascade into a cluster-wide meltdown in minutes.

Lesson 4: Design for Chaos

Chaos engineering isn’t a luxury at scale — it’s survival. Run experiments regularly:

Kill a random node every hour.
Degrade etcd to simulate a slow backend.
Introduce network latency to 10% of pods.

If your cluster rides out an hour of these tests without user-visible impact, you are in good shape. If not, you have found your next improvement.
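One way to make such experiments repeatable is to generate each round as a plan before anything is touched. The sketch below only selects targets and performs no action against a cluster; the function name, the plan format, and the node and pod names are hypothetical.

```python
import random


def plan_chaos_round(nodes, pods, pod_fraction=0.10, seed=None):
    """Pick one node to kill and a fraction of pods to slow down.

    Dry run only: returns a plan and does not touch any cluster.
    Assumes nodes is non-empty.
    """
    rng = random.Random(seed)
    node_to_kill = rng.choice(nodes)
    wanted = max(1, round(len(pods) * pod_fraction))
    sample_size = min(len(pods), wanted)
    latency_targets = rng.sample(pods, sample_size)
    return {"kill_node": node_to_kill, "add_latency_to": latency_targets}


nodes = [f"node-{i}" for i in range(20)]
pods = [f"pod-{i}" for i in range(200)]
plan = plan_chaos_round(nodes, pods, seed=42)
# plan["kill_node"] is one of the 20 nodes;
# plan["add_latency_to"] holds 20 distinct pods (10% of 200).
```

Passing a seed makes a round reproducible, which helps when a failure needs to be replayed after a fix.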

The Real Cost: Money and Cognitive Load

Running Kubernetes at scale is expensive. Not just cloud costs — operational complexity grows super-linearly. Each new node adds management overhead. Each new namespace adds RBAC complexity. Each new operator adds a potential point of failure.

Smart teams don't run everything on one cluster. They break into clusters by team, region, or workload type (batch vs. long-running). The control plane becomes a multi-cluster federation job via tools like Karmada or Google’s Anthos.

Your Takeaway

If you’re planning to scale Kubernetes — or already feeling the pain — remember:

Optimize etcd first. It’s the bottleneck you can’t ignore.
Break the scheduler. Shard it, profile it, or offload work.
Audit networking. DNS and iptables are silent killers.
Automate chaos. Your cluster should be bored by disaster.

Millions of containers running simultaneously isn’t a pipe dream. Companies such as Netflix, Uber, and Spotify operate container fleets at very large scale every day. But it requires treating Kubernetes not as a magic box, but as an engine you must tune, tweak, and sometimes fight with.

Start small, scale smart, and never stop profiling.
