The Forgotten Art of Memory Layout and Why Cache Misses Still Decide Who Wins Benchmarks
Your Python code's performance depends more on how data is arranged in memory than on algorithms. Cache misses from pointer chasing in lists and objects can cost 10x–100x throughput, but simple layout choices like arrays and struct-of-arrays keep the CPU fed.
Advertisement
The Forgotten Art of Memory Layout and Why Cache Misses Still Decide Who Wins Benchmarks
You can write the most elegant Python code in the world, but if your data is strewn across memory like confetti after a parade, your benchmark will humiliate you. And here’s the kicker: You don’t even realize it’s happening.
Modern CPUs are absurdly fast—but only if you feed them data in the order they expect. One cache miss can cost 100+ cycles. Ten consecutive cache misses? That’s like asking a sports car to win a race on a road full of potholes. And Python, despite its high-level magic, is not immune.
The Real Bottleneck Isn’t Your Algorithm
When people profile Python code, they blame loops, function calls, or even the GIL. Often, the real culprit is memory layout. The CPU has a hierarchy of caches (L1, L2, L3) that are tiny but blazingly fast. When your code accesses data sequentially, the hardware prefetcher pulls the next chunks in advance. When you jump around randomly, it sits idle, waiting for main memory.
Consider this: A typical L1 cache hit takes ~1 nanosecond. A main memory access? 100 nanoseconds. The difference between a well-ordered list and a scattered linked list can be 10x–100x in real-world throughput. Python’s list is already a contiguous array, but its elements are pointers to objects. Each object lives somewhere else in memory. So iterating a list of integers involves chasing pointers. That’s the cache miss tax.
Python’s Hidden Pointer Chase
Let’s make it concrete.
# Example A: List of integers (actually a list of PyObject pointers)
int_list = [i for i in range(1_000_000)]
# Example B: NumPy array (contiguous C array)
import numpy as np
np_array = np.arange(1_000_000, dtype=np.int32)
Iterate int_list in a tight loop? Python’s interpreter overhead aside, every element access dereferences a pointer—jumping to the object’s memory location, then extracting the int value. The CPU’s prefetcher can’t predict where those objects are scattered. With np_array, all 4 million bytes sit shoulder-to-shoulder. The prefetcher sees a stream and pulls in line after line. That’s why NumPy can be 50x faster than plain Python loops for bulk operations—not just because of C, but because of cache locality.
Struct of Arrays vs. Array of Structs
One of the oldest tricks in the systems programmer’s playbook is choosing the right data layout. You’ve probably heard of SoA (Struct of Arrays) vs. AoS (Array of Structs). Python’s default is almost always AoS: a list of dictionaries or objects. Each dict has its own allocation. Accessing person["age"] for 10,000 people means bouncing between memory blocks.
SoA instead stores all ages in one contiguous array, all names in another. When processing only ages, the entire scan is sequential. No wasted cache lines carrying the names you don’t need.
# AoS: scattered
persons = [{"age": 30, "name": "Alice"}, ...]
# SoA with lists: contiguous
ages = [30, 25, ...]
names = ["Alice", ...]
The SoA version can be 2–5x faster for simple aggregations—and Python does not magically optimize this away.
The List-of-Ints Trap
Even with Python’s built-in list, careful design matters. A list of small Python integers is already a list of pointers. Each integer is a separate PyObject allocated on the heap. If the integers are small and reused (like 0 to 256), CPython caches them—but that doesn’t fix the pointer indirection.
Compare:
# Still pointer-based, but at least objects may be cached
small_ints = [0, 1, 2, ... 255] * 10_000
vs.
# Every integer is a new object, scattered
big_ints = [i for i in range(1_000_000)]
The first may have slightly better cache behavior because the small ints live in a shared location, but it’s still a list of pointers. If you need pure performance, array.array or bytearray are your friends—they store C primitives directly.
Data-Oriented Design in Python
You don’t need to drop into C or Cython to benefit from better memory layout. Consider:
- Use
array('i')orbytearrayfor large lists of primitive numbers. They store values contiguously. - Group data by access pattern. If you always iterate over one field, store that field in a separate list.
- Avoid deep nesting of dicts or objects if you’ll iterate across them. Flatten into tuples of arrays.
- Prefer
__slots__for classes with many instances. Saves memory and reduces indirection on attribute access.
A real-world example: processing a million “points” where each has x, y, z coordinates. A list of Point objects? Pain. Three array('f') lists for x, y, z? Streaming joy.
Why This Still Matters in 2025
Python’s ecosystem has evolved: PyPy, Numba, and C-extensions like numpy handle some of this automatically. But if you’re writing pure Python and hitting performance walls—especially in data-heavy scripts, real-time processing, or embedded contexts—memory layout is your next lever. Tools like perf (on Linux) can show cache misses. Even Python’s memory_profiler can hint at allocations.
The next time a benchmark surprises you, don’t blame the interpreter. Look at how your data sits in memory. That scattered confetti might be costing you the race.
Advertisement
Comments
Questions, corrections, and tips stay visible for everyone reading this page.
Join the discussion
No comments yet
Be the first to leave a note — it helps the next reader.