High-Cardinality Problem
There is a concept that every Prometheus user needs to understand: cardinality. Misunderstanding it can lead to serious performance issues and even system failures.
What is Cardinality in Prometheus?
A time series in Prometheus is uniquely identified by:
- The metric name (e.g., http_requests_total)
- A set of key-value pairs called labels (e.g., {method="GET", endpoint="/api/users", status="200"})
Each unique combination of metric name and label values creates a separate time series that Prometheus must store, process, and query. For example, a single metric name (http_requests_total) combined with different label values can create four distinct time series; the cardinality in that case is 4.
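A minimal sketch with the Prometheus Go client makes this concrete; the label values below are invented purely for illustration:
// Each distinct combination of label values becomes its own time series
httpRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)
httpRequests.WithLabelValues("GET", "200").Inc()  // series 1
httpRequests.WithLabelValues("GET", "500").Inc()  // series 2
httpRequests.WithLabelValues("POST", "200").Inc() // series 3
httpRequests.WithLabelValues("POST", "500").Inc() // series 4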
How Prometheus Handles Time Series Data
Prometheus stores each time series as a separate stream of data points. For each time series, Prometheus maintains:
- Metadata: The metric name and labels that uniquely identify the series
- Sample data: The actual measurements (timestamp and value pairs)
When you query Prometheus, it needs to:
- Identify which time series match your query
- Retrieve the relevant data points
- Perform any requested computations (aggregations, transformations, etc.)
- Return the results
The more time series you have, the more work Prometheus needs to do for storage, retrieval, and computation.
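As a concrete illustration, a query such as sum by (status) (rate(http_requests_total[5m])) (using the http_requests_total metric from earlier) must first find every series matching the selector, read samples from each of them, and only then aggregate, so its cost scales with the number of matching series rather than with the handful of result rows you ultimately see.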
What Makes a High-Cardinality Metric?
High-cardinality metrics have a large number of possible label value combinations. Common sources include:
- Unique IDs: User IDs, request IDs, session IDs
- High-resolution timestamps: Using timestamps as label values
- Dynamic label values: IP addresses, email addresses, URLs with parameters
- Many label combinations: Adding multiple labels that can have many values
Let’s consider an example with Go code:
// Bad practice: Using high-cardinality labels
requestDuration := prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request duration in seconds",
    },
    []string{"method", "endpoint", "user_id", "request_id"}, // High cardinality!
)

// Track request duration for each request
requestDuration.WithLabelValues(
    "GET",
    "/api/products",
    "user_12345", // Unique user ID - high cardinality
    "req_abc123", // Unique request ID - very high cardinality
).Observe(duration)
In this example, user_id and request_id are high-cardinality labels that would create a new time series for every user and every request - potentially millions of series.
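To put rough, hypothetical numbers on it: with 5 HTTP methods, 50 endpoints, and 100,000 active users, the user_id label alone yields 5 × 50 × 100,000 = 25,000,000 possible series, and a per-request request_id label on top of that makes the series count effectively unbounded.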
Why High Cardinality is Problematic
High cardinality creates several serious problems:
Increased Resource Usage
Each time series consumes:
- Memory for metadata and recent samples
- CPU for processing
- Disk space for storage
The relationship between cardinality and resource usage is not strictly linear: index size, query fan-out, and compaction costs can all grow faster than the raw series count, so resource pressure tends to build quickly as cardinality climbs.
Performance Degradation
High cardinality can lead to:
- Slower query performance
- Longer scrape durations
- Higher latency for alerting
- Increased heap usage and garbage collection pauses
Cardinality Explosions
A cardinality explosion occurs when there’s a sudden, dramatic increase in the number of time series. This can happen when:
- A new high-cardinality label is added
- A deployment introduces a bug that generates many unique label values
- Traffic patterns change significantly
A cardinality explosion can quickly exhaust available resources and bring down your monitoring system.
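Catching this early is much cheaper than recovering from it. Prometheus exposes its own series count as the prometheus_tsdb_head_series metric, and a query like topk(10, count by (__name__) ({__name__=~".+"})) shows which metric names contribute the most series; watching for sudden jumps in either gives you warning before an explosion takes the server down.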
Best Practices to Manage Cardinality
Avoid High-Cardinality Labels
Avoid using these as labels:
- User IDs, session IDs, or request IDs
- Precise timestamps or continuously incrementing values
- Full URLs, email addresses, or raw IP addresses
- Unbounded attributes like free-form text fields
Here’s a better version of our earlier example:
// Good practice: Using low-cardinality labels
requestDuration := prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request duration in seconds",
    },
    []string{"method", "endpoint", "status_code"}, // Low cardinality
)

// Separate counter for user-specific metrics with bucketing
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_requests_total",
        Help: "Total requests by user type",
    },
    []string{"user_type"}, // Bucketed instead of individual IDs
)

// Track request duration
requestDuration.WithLabelValues("GET", "/api/products", "200").Observe(duration)

// Increment user counter with a bucketed label
userType := getUserTypeBucket(userId) // e.g., "premium", "basic", "trial"
userRequests.WithLabelValues(userType).Inc()
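The getUserTypeBucket helper used above is not part of the snippet; one plausible sketch, assuming a hypothetical lookupUser function and a Plan field on the user record, might look like this:
// Hypothetical helper: collapse individual user IDs into a small, fixed set of buckets
func getUserTypeBucket(userID string) string {
    user, err := lookupUser(userID) // assumed application-specific lookup
    if err != nil {
        return "unknown"
    }
    switch user.Plan {
    case "premium", "basic", "trial":
        return user.Plan
    default:
        return "other"
    }
}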
Use Label Buckets
Instead of using raw high-cardinality values, bucket them into meaningful groups:
- HTTP status codes: Instead of exact codes, use categories like “2xx”, “4xx”, “5xx”
- Response times: Use buckets like “fast”, “medium”, “slow”
- User IDs: Group by user type, region, or cohort
Here’s a simple bucketing function:
func bucketHTTPStatus(status int) string {
    if status >= 200 && status < 300 {
        return "2xx"
    } else if status >= 400 && status < 500 {
        return "4xx"
    } else if status >= 500 {
        return "5xx"
    }
    return "other"
}
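The same idea works for response times; here is a sketch with arbitrary thresholds (tune them to your own latency targets):
// Bucket a raw duration into a small, fixed set of label values
func bucketDuration(seconds float64) string {
    switch {
    case seconds < 0.1:
        return "fast"
    case seconds < 1.0:
        return "medium"
    default:
        return "slow"
    }
}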
Implement Client-Side Aggregation
Aggregate metrics before sending them to Prometheus:
type RequestStats struct {
    mutex           sync.Mutex
    countByEndpoint map[string]int
}

func NewRequestStats() *RequestStats {
    return &RequestStats{countByEndpoint: make(map[string]int)}
}

func (s *RequestStats) TrackRequest(endpoint string, duration float64) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    s.countByEndpoint[endpoint]++
    // Only aggregated counts are exposed to Prometheus, not one series per request
}

// Periodically expose the aggregated counts to Prometheus (e.g., every minute)
func (s *RequestStats) exposeMetrics() {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    for endpoint, count := range s.countByEndpoint {
        // requestCounter is a prometheus.CounterVec registered elsewhere
        requestCounter.WithLabelValues(endpoint).Add(float64(count))
        s.countByEndpoint[endpoint] = 0
    }
}
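How exposeMetrics gets called is left open above; one simple option, sketched here on the assumption that a one-minute flush interval suits your scrape interval, is a background ticker:
// Flush the aggregated counts once per minute in a background goroutine
stats := NewRequestStats()
go func() {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        stats.exposeMetrics()
    }
}()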