Maintenance

Site is under maintenance — quizzes are still available.

Go to quizzes
Sponsored Reserved space — layout preview until AdSense is connected
Python

The Forgotten Art of Memory Layout and Why Cache Misses Still Decide Who Wins Benchmarks

Your Python code's performance depends more on how data is arranged in memory than on algorithms. Cache misses from pointer chasing in lists and objects can cost 10x–100x throughput, but simple layout choices like arrays and struct-of-arrays keep the CPU fed.

June 2026 6 min read 2 views 0 hearts

The Forgotten Art of Memory Layout and Why Cache Misses Still Decide Who Wins Benchmarks

You can write the most elegant Python code in the world, but if your data is strewn across memory like confetti after a parade, your benchmark will humiliate you. And here’s the kicker: You don’t even realize it’s happening.

Modern CPUs are absurdly fast—but only if you feed them data in the order they expect. One cache miss can cost 100+ cycles. Ten consecutive cache misses? That’s like asking a sports car to win a race on a road full of potholes. And Python, despite its high-level magic, is not immune.


The Real Bottleneck Isn’t Your Algorithm

When people profile Python code, they blame loops, function calls, or even the GIL. Often, the real culprit is memory layout. The CPU has a hierarchy of caches (L1, L2, L3) that are tiny but blazingly fast. When your code accesses data sequentially, the hardware prefetcher pulls the next chunks in advance. When you jump around randomly, it sits idle, waiting for main memory.

Consider this: A typical L1 cache hit takes ~1 nanosecond. A main memory access? 100 nanoseconds. The difference between a well-ordered list and a scattered linked list can be 10x–100x in real-world throughput. Python’s list is already a contiguous array, but its elements are pointers to objects. Each object lives somewhere else in memory. So iterating a list of integers involves chasing pointers. That’s the cache miss tax.


Python’s Hidden Pointer Chase

Let’s make it concrete.

# Example A: List of integers (actually a list of PyObject pointers)
int_list = [i for i in range(1_000_000)]

# Example B: NumPy array (contiguous C array)
import numpy as np
np_array = np.arange(1_000_000, dtype=np.int32)

Iterate int_list in a tight loop? Python’s interpreter overhead aside, every element access dereferences a pointer—jumping to the object’s memory location, then extracting the int value. The CPU’s prefetcher can’t predict where those objects are scattered. With np_array, all 4 million bytes sit shoulder-to-shoulder. The prefetcher sees a stream and pulls in line after line. That’s why NumPy can be 50x faster than plain Python loops for bulk operations—not just because of C, but because of cache locality.


Struct of Arrays vs. Array of Structs

One of the oldest tricks in the systems programmer’s playbook is choosing the right data layout. You’ve probably heard of SoA (Struct of Arrays) vs. AoS (Array of Structs). Python’s default is almost always AoS: a list of dictionaries or objects. Each dict has its own allocation. Accessing person["age"] for 10,000 people means bouncing between memory blocks.

SoA instead stores all ages in one contiguous array, all names in another. When processing only ages, the entire scan is sequential. No wasted cache lines carrying the names you don’t need.

# AoS: scattered
persons = [{"age": 30, "name": "Alice"}, ...]

# SoA with lists: contiguous
ages = [30, 25, ...]
names = ["Alice", ...]

The SoA version can be 2–5x faster for simple aggregations—and Python does not magically optimize this away.


The List-of-Ints Trap

Even with Python’s built-in list, careful design matters. A list of small Python integers is already a list of pointers. Each integer is a separate PyObject allocated on the heap. If the integers are small and reused (like 0 to 256), CPython caches them—but that doesn’t fix the pointer indirection.

Compare:

# Still pointer-based, but at least objects may be cached
small_ints = [0, 1, 2, ... 255] * 10_000

vs.

# Every integer is a new object, scattered
big_ints = [i for i in range(1_000_000)]

The first may have slightly better cache behavior because the small ints live in a shared location, but it’s still a list of pointers. If you need pure performance, array.array or bytearray are your friends—they store C primitives directly.


Data-Oriented Design in Python

You don’t need to drop into C or Cython to benefit from better memory layout. Consider:

  • Use array('i') or bytearray for large lists of primitive numbers. They store values contiguously.
  • Group data by access pattern. If you always iterate over one field, store that field in a separate list.
  • Avoid deep nesting of dicts or objects if you’ll iterate across them. Flatten into tuples of arrays.
  • Prefer __slots__ for classes with many instances. Saves memory and reduces indirection on attribute access.

A real-world example: processing a million “points” where each has x, y, z coordinates. A list of Point objects? Pain. Three array('f') lists for x, y, z? Streaming joy.


Why This Still Matters in 2025

Python’s ecosystem has evolved: PyPy, Numba, and C-extensions like numpy handle some of this automatically. But if you’re writing pure Python and hitting performance walls—especially in data-heavy scripts, real-time processing, or embedded contexts—memory layout is your next lever. Tools like perf (on Linux) can show cache misses. Even Python’s memory_profiler can hint at allocations.

The next time a benchmark surprises you, don’t blame the interpreter. Look at how your data sits in memory. That scattered confetti might be costing you the race.

Comments

Questions, corrections, and tips stay visible for everyone reading this page.

0 in thread

Join the discussion

Shown next to your comment.

Up to 4,000 characters

No comments yet

Be the first to leave a note — it helps the next reader.