How Prometheus Collects and Stores Infrastructure Metrics
Explore Prometheus's pull-based architecture, service discovery, time series database, and PromQL query engine. Understand how it efficiently collects and stores infrastructure metrics for real-time monitoring.
June 2026 · 8 min read
You've probably heard the name Prometheus thrown around in DevOps circles like it's some kind of magic metric fairy. And honestly? It kind of is. But instead of granting wishes, it grants you the ability to know exactly how many CPUs your Kubernetes cluster is melting at 3 AM. Let's peel back the layers and see how this system really works under the hood — no cargo culting required.
The Core Idea: Pull, Don't Push
Most monitoring systems in the old days worked like this: your app screams "I'M AT 99% MEMORY!" into the void, and somewhere a server hopefully catches it. Prometheus flips this on its head. Every few seconds (configurable as your "scrape interval"), Prometheus goes out and pulls metrics from your targets.
This is a genius move. If a target goes silent, Prometheus knows something's wrong because it's expecting a response but gets nothing. In a push model, you'd just assume everything's fine until the pager goes off. The pull model also means your services don't need to know anything about Prometheus — they just expose a /metrics endpoint and mind their own business.
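To make the "just expose a /metrics endpoint" part concrete, here is a minimal sketch using only the Python standard library. The REQUESTS dictionary, the render_metrics and serve names, and port 8000 are all made up for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real service would update it per request.
REQUESTS = {("GET", "/home"): 1027}


def render_metrics(requests):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, endpoint), count in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{method="{method}",endpoint="{endpoint}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Notice the service never mentions Prometheus by address: it only answers whoever asks.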
Target Discovery: How Prometheus Finds Things to Scrape
These days, nobody's manually typing IP addresses into a config file. Prometheus uses service discovery to find targets dynamically. It can hook into:
- Kubernetes API (pods, services, endpoints)
- EC2, GCE, Azure (cloud provider metadata)
- Consul, DNS SRV records
- Static lists for development (we all have that one local test server)
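For Kubernetes specifically, Prometheus watches the API server for pods, services, and endpoints, and relabeling rules in the scrape config decide which of them to scrape based on their labels and annotations. If your pod carries the label app: my-service and your scrape config selects on it, Prometheus keeps finding it, even if it gets rescheduled to a different node.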
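As a toy model of that selection step (not how Prometheus implements it; the real thing is driven by kubernetes_sd_configs and relabel_configs), the sketch below filters a list of pod records by label. The pod dictionaries, the select_targets name, and port 8080 are assumptions for illustration.

```python
def select_targets(pods, label, value, port):
    """Return sorted 'ip:port' scrape addresses for running pods with a matching label."""
    return sorted(
        f'{pod["ip"]}:{port}'
        for pod in pods
        if pod["phase"] == "Running" and pod["labels"].get(label) == value
    )


# Hypothetical snapshot of what the API server reports.
pods = [
    {"ip": "10.0.1.7", "phase": "Running", "labels": {"app": "my-service"}},
    {"ip": "10.0.2.3", "phase": "Running", "labels": {"app": "other"}},
    {"ip": "10.0.3.9", "phase": "Pending", "labels": {"app": "my-service"}},
]
targets = select_targets(pods, "app", "my-service", 8080)
```

If the pod is rescheduled and comes back with a new IP, rerunning the selection against the fresh snapshot picks up the new address without anyone editing a config file.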
The Scrape Process: What Actually Happens
When Prometheus decides to scrape a target, here's the dance:
- HTTP GET to http://target:port/metrics
- The target returns plaintext with key-value pairs like:
    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{method="GET", endpoint="/home"} 1027
- Prometheus parses this text, labels and all
- Each metric gets timestamped with the scrape time (targets normally don't send timestamps, so Prometheus controls the clock)
- The data lands in memory, ready for queries
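The dance above can be sketched in a few lines. This is a simplified parser, assuming samples without the optional trailing timestamp and skipping lines it cannot read; parse_scrape, scrape_once, and the injected fetch callable are hypothetical names. A failed fetch is recorded as up = 0, which mirrors the up series Prometheus keeps for every target.

```python
import re
import time

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value", ...}
    r'\s+(?P<value>\S+)$'                    # sample value
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_scrape(text, scrape_time):
    """Turn exposition text into (name, labels, value, timestamp) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:  # simplified: ignore anything we cannot parse
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append(
            (match.group("name"), labels, float(match.group("value")), scrape_time)
        )
    return samples


def scrape_once(target, fetch, now=None):
    """Fetch one target; the scraper, not the target, assigns the timestamp."""
    now = time.time() if now is None else now
    try:
        text = fetch(target)
    except OSError:  # connection refused, timeout, DNS failure, ...
        return {"up": 0, "samples": []}
    return {"up": 1, "samples": parse_scrape(text, now)}
```

For a real fetch you could pass something like lambda t: urllib.request.urlopen(f"http://{t}/metrics", timeout=10).read().decode(); its URLError is a subclass of OSError, so the failure path still applies.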
This is why Prometheus works so well with time series data. Each metric + label set = one time series. That http_requests_total with method="GET" and endpoint="/home" is a unique series. Add method="POST" and you get another one.
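One way to picture "metric + label set = one time series" is as a dictionary key. The series_key and count_series helpers below are illustrative names, not part of any Prometheus API.

```python
def series_key(name, labels):
    """Identity of a series: metric name plus its labels, independent of label order."""
    return (name, tuple(sorted(labels.items())))


def count_series(samples):
    """Count distinct series among (name, labels) pairs."""
    return len({series_key(name, labels) for name, labels in samples})


samples = [
    ("http_requests_total", {"method": "GET", "endpoint": "/home"}),
    ("http_requests_total", {"endpoint": "/home", "method": "GET"}),   # same series
    ("http_requests_total", {"method": "POST", "endpoint": "/home"}),  # new series
]
distinct = count_series(samples)
```

The first two entries collapse into one series because only the set of label pairs matters; changing a single label value creates a second one.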
Storage: The Time Series Database
Here's where Prometheus does something clever instead of just dumping everything into a SQL database. It uses a time series database (TSDB) that's optimized for:
- Appending data (rarely updating, always writing new points)
- Reading recent data fast (alerts usually care about "the last 5 minutes")
- Compression (metric names and labels repeat a lot — why store them over and over?)
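Those three goals can be seen in miniature in the sketch below: labels are stored once per series, samples are only ever appended, and a "last N seconds" read is a binary search over sorted timestamps. TinySeries is a hypothetical teaching structure and says nothing about Prometheus's actual on-disk or in-memory layout.

```python
from bisect import bisect_left


class TinySeries:
    """One series: name and labels stored once, samples appended in time order."""

    def __init__(self, name, labels):
        self.name = name
        self.labels = dict(labels)
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        # Append-only: no in-place updates, no out-of-order writes.
        if self.timestamps and ts <= self.timestamps[-1]:
            raise ValueError("samples must arrive in increasing time order")
        self.timestamps.append(ts)
        self.values.append(value)

    def since(self, start_ts):
        """Return (timestamp, value) pairs with timestamp >= start_ts."""
        i = bisect_left(self.timestamps, start_ts)
        return list(zip(self.timestamps[i:], self.values[i:]))


series = TinySeries("http_requests_total", {"method": "GET"})
for ts, value in [(0, 1000.0), (15, 1010.0), (30, 1027.0), (45, 1040.0)]:
    series.append(ts, value)
recent = series.since(30)
```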
Prometheus organizes data in blocks. Each block covers a time range (like 2 hours) and contains all the samples for that period. Blocks are:
- Written to disk periodically
- Compactable — smaller blocks get merged into bigger ones over time
- Immutable once written (until retention kicks them out)
This design means Prometheus can handle millions of time series on modest hardware. A single well-provisioned server can ingest on the order of hundreds of thousands of samples per second, though the real ceiling depends on hardware, scrape interval, and label cardinality.
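Here is a rough sketch of the block idea: samples are bucketed into fixed 2-hour windows, and compaction merges neighbouring windows into a wider one. The function names and the merge factor of 3 are assumptions chosen for the example, not Prometheus's actual compaction plan.

```python
BLOCK_SECONDS = 2 * 60 * 60  # the 2-hour block range mentioned above


def block_start(ts, width=BLOCK_SECONDS):
    """Start of the fixed-width window that contains ts."""
    return ts - (ts % width)


def group_into_blocks(samples, width=BLOCK_SECONDS):
    """Bucket (ts, value) samples into blocks; tuples stand in for immutability."""
    blocks = {}
    for ts, value in samples:
        blocks.setdefault(block_start(ts, width), []).append((ts, value))
    return {start: tuple(sorted(points)) for start, points in blocks.items()}


def compact(blocks, factor=3, width=BLOCK_SECONDS):
    """Merge small blocks into blocks that are `factor` times wider."""
    merged = {}
    for start, points in blocks.items():
        merged.setdefault(block_start(start, width * factor), []).extend(points)
    return {start: tuple(sorted(points)) for start, points in merged.items()}


samples = [(0, 1.0), (3600, 2.0), (7200, 3.0), (7201, 4.0), (21600, 5.0)]
blocks = group_into_blocks(samples)   # three 2-hour blocks
merged = compact(blocks, factor=3)    # two 6-hour blocks
```

Compaction never edits a block in place here; it builds new, larger blocks from the old ones, which is the property that keeps written data immutable.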
Labels: The Secret Sauce
If metrics are the ingredients, labels are the recipe. Without labels, cpu_usage is just a number. With cpu_usage{cpu="0", mode="user", host="web-01"}, you can answer questions like "What's the user CPU load on web-01's first core?"
But be careful — labels that change too often kill performance. If you include a request ID as a label, every new request creates a brand new time series. Your database will bloat, queries will slow, and Prometheus will give you that sad look. Stick to labels with low cardinality: service names, methods, endpoints, data centers.
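The arithmetic behind that warning is just multiplication: in the worst case, a metric has one series per combination of label values. The label names and counts below are invented for illustration.

```python
from math import prod


def series_upper_bound(label_values):
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


sane = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "endpoint": [f"/route{i}" for i in range(20)],
    "datacenter": ["us-east", "us-west", "eu"],
}
with_request_id = dict(sane, request_id=range(1_000_000))

sane_bound = series_upper_bound(sane)                # 4 * 20 * 3
bloated_bound = series_upper_bound(with_request_id)  # multiplied by 1,000,000
```

Three well-behaved labels top out at 240 series; adding a request ID with a million distinct values pushes the bound to 240 million.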
Retention and Downsampling: Don't Store Everything Forever
Prometheus doesn't keep everything for eternity — that's what long-term storage like Thanos or Cortex is for. By default, Prometheus keeps 15 days of data (configurable). When data ages out, the blocks for that time range get deleted.
But here's the thing: Prometheus doesn't downsample natively. That's intentional. Downsampling means averaging data points over time, which loses fidelity. If you need year-old metrics at full resolution, you're better off shipping data elsewhere. Prometheus excels at "what happened in the last hour" — and that's where most problems live anyway.
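The retention rule is simple enough to sketch: a block goes away once even its newest sample is older than the retention window. This is an illustration with made-up block ranges; on a real server the window is set with the --storage.tsdb.retention.time flag.

```python
RETENTION_SECONDS = 15 * 24 * 60 * 60  # the 15-day default mentioned above


def expire_blocks(blocks, now, retention=RETENTION_SECONDS):
    """Keep (min_ts, max_ts) blocks whose newest sample is still inside the window."""
    cutoff = now - retention
    return [block for block in blocks if block[1] >= cutoff]


day = 24 * 60 * 60
now = 30 * day
blocks = [
    (0, 2 * 60 * 60),          # about 30 days old: dropped
    (14 * day, 16 * day),      # straddles the cutoff: kept whole
    (29 * day, 30 * day),      # recent: kept
]
kept = expire_blocks(blocks, now)
```

Whole blocks are kept or dropped; nothing is averaged or thinned out along the way, which matches the "no native downsampling" point above.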
The Query Engine: PromQL
You can't just stare at raw data — you need to ask questions. PromQL (Prometheus Query Language) is how you do that. Want to know the average CPU across your Kubernetes cluster for the last 5 minutes?
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
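That's it. PromQL works on vectors: you rarely ask for one series; you ask for groups of them. A selector with a range such as [5m] yields a range vector (a window of samples per series), functions like rate() turn that into an instant vector (one value per series at the evaluation time), and aggregators like avg() collapse that instant vector across series.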
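To see roughly what rate() does with a counter, here is a simplified sketch: sum the increases inside the window, treat any drop as a counter reset, and divide by the elapsed time. The real function also extrapolates toward the window edges, which this version deliberately leaves out; simple_rate is a hypothetical name.

```python
def simple_rate(samples, window_start, window_end):
    """Per-second increase of a counter over a window of (ts, value) samples."""
    points = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(points) < 2:
        return None  # not enough data to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(points, points[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur  # counter reset: it restarted from zero
    elapsed = points[-1][0] - points[0][0]
    return increase / elapsed


# The counter resets between t=30 and t=45 (for example, a process restart).
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
per_second = simple_rate(samples, 0, 60)
```

Running this per series and then averaging the results is, in spirit, what avg(rate(...[5m])) asks the query engine to do.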
Putting It All Together
Here's the lifecycle of a metric in Prometheus:
- Your app exposes my_app_errors_total with a handler label
- Prometheus discovers the pod via Kubernetes, scrapes it every 15 seconds
- The data lands in the TSDB as time series, one per handler
- A recording rule computes rate(my_app_errors_total[5m]) every minute for faster queries
- An alerting rule fires the pager when the rate exceeds 10 per second
- After 15 days, the data gets dropped unless you've shipped it somewhere
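The last hop of that lifecycle, the alerting check, boils down to comparing each series against a threshold. The sketch below assumes the per-handler rates have already been computed (as the recording rule would); the handler names, values, and the firing_alerts helper are invented for illustration.

```python
def firing_alerts(rates_by_handler, threshold=10.0):
    """Return the handlers whose error rate exceeds the threshold, sorted by name."""
    return sorted(
        handler
        for handler, per_second in rates_by_handler.items()
        if per_second > threshold
    )


# Hypothetical output of rate(my_app_errors_total[5m]), one value per handler.
rates = {"/checkout": 12.5, "/home": 0.2, "/login": 10.0}
alerts = firing_alerts(rates)
```

Only /checkout fires: /login sits exactly at 10 per second, and the rule says "exceeds".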
Why People Love It
Prometheus isn't perfect (high cardinality problems, scaling that is mostly vertical, only basic built-in auth), but its simplicity is its superpower. You don't need a PhD in monitoring to set it up. Install the server, configure some scrapes, write a few queries, and you're running circles around legacy systems.
Just don't ask it to store your application logs. Prometheus handles metrics, not logs. Leave that to Loki or ELK. Everyone's got a job — Prometheus does metrics, and it does them well enough that you'll wonder how you ever lived without knowing your exact 99th percentile response time at 2 AM on a Tuesday.