High-Cardinality Problem
There is a concept that every Prometheus user needs to understand: cardinality. Misunderstanding it can lead to serious performance issues and even system failures.
What is Cardinality in Prometheus?
A time series in Prometheus is uniquely identified by:
- The metric name (e.g., http_requests_total)
- A set of key-value pairs called labels (e.g., {method="GET", endpoint="/api/users", status="200"})
Each unique combination of metric name and label values creates a separate time series that Prometheus must store, process, and query. For example, a single metric name (http_requests_total) combined with different label values can create four distinct time series; the cardinality in that case is 4.
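A minimal sketch with the Prometheus Go client makes this concrete; the label values below are invented purely for illustration:
// Each distinct combination of label values becomes its own time series
httpRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)
httpRequests.WithLabelValues("GET", "200").Inc()  // series 1
httpRequests.WithLabelValues("GET", "500").Inc()  // series 2
httpRequests.WithLabelValues("POST", "200").Inc() // series 3
httpRequests.WithLabelValues("POST", "500").Inc() // series 4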
How Prometheus Handles Time Series Data
Prometheus stores each time series as a separate stream of data points. For each time series, Prometheus maintains:
- Metadata: The metric name and labels that uniquely identify the series
- Sample data: The actual measurements (timestamp and value pairs)
When you query Prometheus, it needs to:
- Identify which time series match your query
- Retrieve the relevant data points
- Perform any requested computations (aggregations, transformations, etc.)
- Return the results
The more time series you have, the more work Prometheus needs to do for storage, retrieval, and computation.
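As a concrete illustration, a query such as sum by (status) (rate(http_requests_total[5m])) (using the http_requests_total metric from earlier) must first find every series matching the selector, read samples from each of them, and only then aggregate, so its cost scales with the number of matching series rather than with the handful of result rows you ultimately see.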
What Makes a High-Cardinality Metric?
High-cardinality metrics have a large number of possible label value combinations. Common sources include:
- Unique IDs: User IDs, request IDs, session IDs
- High-resolution timestamps: Using timestamps as label values
- Dynamic label values: IP addresses, email addresses, URLs with parameters
- Many label combinations: Adding multiple labels that can have many values
Let’s consider an example with Go code:
// Bad practice: Using high-cardinality labels
requestDuration := prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request duration in seconds",
    },
    []string{"method", "endpoint", "user_id", "request_id"}, // High cardinality!
)

// Track request duration for each request
requestDuration.WithLabelValues(
    "GET",
    "/api/products",
    "user_12345", // Unique user ID - high cardinality
    "req_abc123", // Unique request ID - very high cardinality
).Observe(duration)
In this example, user_id and request_id are high-cardinality labels that would create a new time series for every user and every request - potentially millions of series.
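To put rough, hypothetical numbers on it: with 5 HTTP methods, 50 endpoints, and 100,000 active users, the user_id label alone yields 5 × 50 × 100,000 = 25,000,000 possible series, and a per-request request_id label on top of that makes the series count effectively unbounded.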
Why High Cardinality is Problematic
High cardinality creates several serious problems:
Increased Resource Usage
Each time series consumes:
- Memory for metadata and recent samples
- CPU for processing
- Disk space for storage
The relationship between cardinality and resource usage is not strictly linear: index size, query fan-out, and compaction costs can all grow faster than the raw series count, so resource pressure tends to build quickly as cardinality climbs.
Performance Degradation
High cardinality can lead to:
- Slower query performance
- Longer scrape durations
- Higher latency for alerting
- Increased heap usage and garbage collection pauses
Cardinality Explosions
A cardinality explosion occurs when there’s a sudden, dramatic increase in the number of time series. This can happen when:
- A new high-cardinality label is added
- A deployment introduces a bug that generates many unique label values
- Traffic patterns change significantly
A cardinality explosion can quickly exhaust available resources and bring down your monitoring system.
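Catching this early is much cheaper than recovering from it. Prometheus exposes its own series count as the prometheus_tsdb_head_series metric, and a query like topk(10, count by (__name__) ({__name__=~".+"})) shows which metric names contribute the most series; watching for sudden jumps in either gives you warning before an explosion takes the server down.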
Best Practices to Manage Cardinality
Avoid High-Cardinality Labels
Avoid using these as labels:
- User IDs, session IDs, or request IDs
- Precise timestamps or continuously incrementing values
- Full URLs, email addresses, or raw IP addresses
- Unbounded attributes like free-form text fields
Here’s a better version of our earlier example:
// Good practice: Using low-cardinality labels
requestDuration := prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request duration in seconds",
    },
    []string{"method", "endpoint", "status_code"}, // Low cardinality
)

// Separate counter for user-specific metrics with bucketing
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_requests_total",
        Help: "Total requests by user type",
    },
    []string{"user_type"}, // Bucketed instead of individual IDs
)

// Track request duration
requestDuration.WithLabelValues("GET", "/api/products", "200").Observe(duration)

// Increment user counter with a bucketed label
userType := getUserTypeBucket(userId) // e.g., "premium", "basic", "trial"
userRequests.WithLabelValues(userType).Inc()
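The getUserTypeBucket helper used above is not part of the snippet; one plausible sketch, assuming a hypothetical lookupUser function and a Plan field on the user record, might look like this:
// Hypothetical helper: collapse individual user IDs into a small, fixed set of buckets
func getUserTypeBucket(userID string) string {
    user, err := lookupUser(userID) // assumed application-specific lookup
    if err != nil {
        return "unknown"
    }
    switch user.Plan {
    case "premium", "basic", "trial":
        return user.Plan
    default:
        return "other"
    }
}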
Use Label Buckets
Instead of using raw high-cardinality values, bucket them into meaningful groups:
- HTTP status codes: Instead of exact codes, use categories like “2xx”, “4xx”, “5xx”
- Response times: Use buckets like “fast”, “medium”, “slow”
- User IDs: Group by user type, region, or cohort
Here’s a simple bucketing function:
func bucketHTTPStatus(status int) string {
    if status >= 200 && status < 300 {
        return "2xx"
    } else if status >= 400 && status < 500 {
        return "4xx"
    } else if status >= 500 {
        return "5xx"
    }
    return "other"
}
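The same idea works for response times; here is a sketch with arbitrary thresholds (tune them to your own latency targets):
// Bucket a raw duration into a small, fixed set of label values
func bucketDuration(seconds float64) string {
    switch {
    case seconds < 0.1:
        return "fast"
    case seconds < 1.0:
        return "medium"
    default:
        return "slow"
    }
}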
Implement Client-Side Aggregation
Aggregate metrics before sending them to Prometheus:
type RequestStats struct {
    mutex           sync.Mutex
    countByEndpoint map[string]int
}

func NewRequestStats() *RequestStats {
    return &RequestStats{countByEndpoint: make(map[string]int)}
}

func (s *RequestStats) TrackRequest(endpoint string, duration float64) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    s.countByEndpoint[endpoint]++
    // Only aggregated counts are exposed to Prometheus, not one series per request
}

// Periodically expose the aggregated counts to Prometheus (e.g., every minute)
func (s *RequestStats) exposeMetrics() {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    for endpoint, count := range s.countByEndpoint {
        // requestCounter is a prometheus.CounterVec registered elsewhere
        requestCounter.WithLabelValues(endpoint).Add(float64(count))
        s.countByEndpoint[endpoint] = 0
    }
}
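How exposeMetrics gets called is left open above; one simple option, sketched here on the assumption that a one-minute flush interval suits your scrape interval, is a background ticker:
// Flush the aggregated counts once per minute in a background goroutine
stats := NewRequestStats()
go func() {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        stats.exposeMetrics()
    }
}()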