Prometheus

Definition

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It has become widely adopted, particularly in environments built on cloud-based and microservices architectures.

Prometheus works by scraping metrics from configured targets at specified intervals, evaluating rule expressions, displaying the results, and triggering alerts if certain conditions are met. It stores time series data in a multi-dimensional data model with data identified by metric name and key/value pairs, making it highly efficient for storing and querying time series data.

Prometheus’s architecture is designed to be reliable and simple: each server runs standalone, without depending on network storage or other remote services, so it can handle large volumes of metrics data while providing insights into the operational state of software systems. Its flexible query language, PromQL, enables users to select and aggregate time series data in real time, facilitating detailed analysis of system performance.

graph LR
    A[Targets] -->|Scrape metrics| B(Prometheus Server)
    B -->|Store data| C[Time Series Database]
    B -->|Evaluate rules| D[Alertmanager]
    D -->|Send alerts| E[Notification Channels]
    C -->|Query data| F[User Interface]
    F -->|Display results| G[Dashboard]

Characteristics

  • Multi-dimensional Data Model: Time series data is identified by metric name and key/value pairs.
  • PromQL: A powerful query language for selecting and aggregating time series data.
  • Pull-based Metrics: Metrics are scraped from configured targets at specified intervals.
  • Alerting: Supports alerting based on the results of evaluating rule expressions.
  • Highly Efficient: Designed to handle large volumes of metrics data.
  • Reliable and Simple: Each server runs standalone, with no dependency on distributed storage.

Components

Prometheus is composed of several components, each serving a specific purpose in the monitoring and alerting process.

graph LR
    subgraph Prometheus
    B("HTTP Server that accepts PromQL Queries")
    C["Storage (store the metrics data into a series DB)"]
    E("Retrieval of metrics (pulling metrics data with the data retrieval worker)")
    end

    subgraph External
    A("UI (Prometheus UI or Grafana)")
    D["Services and Applications (where they get the metrics from)"]
    end

    A -->|Send PromQL queries| B
    B -->|Fetch data| C
    D -->|Expose metrics| E
    E -->|Pull metrics| C
    C -->|Query results| B
    B -->|Display data| A

    classDef prometheus fill:#f9f,stroke:#333,stroke-width:4px;
    classDef external fill:#bbf,stroke:#333,stroke-width:4px;

Prometheus Server

The core component of Prometheus, responsible for scraping and storing time series data, evaluating rule expressions, and triggering alerts. It is designed to be highly efficient and reliable, capable of handling large volumes of metrics data.

Exporters

Prometheus exporters are agents that collect metrics from third-party systems and expose them in a format that Prometheus can scrape. They are commonly used to monitor services and applications that do not natively expose Prometheus metrics.
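
As a rough illustration of the translation step an exporter performs, the sketch below converts a status report from a third-party service into lines that Prometheus could scrape. The status text, the `thirdparty` prefix, and the `status_to_metrics` helper are all hypothetical; real exporters are considerably more thorough.

```python
import re

# Hypothetical status text from a third-party service that has no
# Prometheus endpoint of its own.
RAW_STATUS = """\
Active connections: 12
Requests handled: 3405
Uptime seconds: 86400
"""


def status_to_metrics(raw, prefix="thirdparty"):
    """Translate 'Name: number' lines into text exposition format samples."""
    lines = []
    for line in raw.splitlines():
        match = re.fullmatch(r"\s*([A-Za-z ]+):\s*(-?\d+(?:\.\d+)?)\s*", line)
        if not match:
            continue  # Skip anything that is not a simple numeric field.
        # "Active connections" becomes "active_connections".
        name = re.sub(r"\s+", "_", match.group(1).strip().lower())
        lines.append(f"{prefix}_{name} {match.group(2)}\n")
    return "".join(lines)


print(status_to_metrics(RAW_STATUS))
```

An exporter would typically serve this text over HTTP so that the Prometheus server can scrape it like any other target.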

Alertmanager

The component responsible for handling alerts sent by the Prometheus server. It manages the routing, grouping, and silencing of alerts, ensuring that the right people are notified at the right time.

Client Libraries

Prometheus provides client libraries for various programming languages, allowing developers to instrument their code and expose custom metrics to Prometheus.
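
To give a feel for what instrumentation produces, the sketch below hand-rolls a labeled counter and renders it in the Prometheus text exposition format. It deliberately does not use an official client library: the `LabeledCounter` class and the `http_requests_total` metric are assumptions made for illustration, and label values are not escaped.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical stand-in for a client-library counter (illustration only)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running total.
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render the counter in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_pairs, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in label_pairs)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = LabeledCounter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Each distinct combination of label values becomes its own time series, which is why the two GET requests accumulate in one line while the POST request appears in another.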

Pushgateway

A component that allows short-lived jobs to push their metrics to a gateway, which then exposes them to Prometheus. It is useful for batch jobs, service-level checks, and other scenarios where metrics are not exposed directly by the job.
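
The push step can be sketched as follows. The example only builds the HTTP request a batch job might send and does not send it; the gateway address `http://localhost:9091`, the job name, and the metric name are assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metrics_text):
    """Build (but do not send) a request that replaces the metrics of one job."""
    url = f"{gateway}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric recorded by a nightly backup job before it exits.
body = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1700000000\n"
)
req = build_push_request("http://localhost:9091", "nightly_backup", body)
print(req.get_method(), req.full_url)
# To actually push, the job would call urllib.request.urlopen(req).
```

Prometheus then scrapes the gateway like any other target, so the metrics remain available after the job has finished.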

Data Model

  • Targets: The endpoints that Prometheus scrapes metrics from, such as applications, services, and infrastructure components.
  • Units: The individual things that are monitored on a target, for example:
    • CPU Status
    • Memory/Disk Space Usage
    • Exception Count
    • Request Count
    • Request Duration
  • Metrics: A unit monitored over time, recorded as time series data that captures the state of a system at successive points in time. Each metric is identified by a unique name and a set of key/value pairs, and is exposed in a human-readable text format. A metric has one of the following types:
    • Counter: A cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
    • Gauge: A metric that represents a single numerical value that can arbitrarily go up and down.
    • Histogram: Samples observations (e.g., how long a request took to complete or how large a request was) and counts them in configurable buckets, along with a total count and a sum of all observed values.
    • Summary: Similar to a histogram, a summary samples observations (usually things like request duration and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
  • Labels: Key/value pairs that are used to identify time series data in the multi-dimensional data model.
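
To make the histogram type more concrete, here is a minimal sketch of cumulative bucket counting, in which each bucket counts every observation less than or equal to its upper bound. The `HistogramSketch` class and the bucket boundaries are illustrative assumptions, not the client library's implementation.

```python
import bisect


class HistogramSketch:
    """Illustrative histogram with cumulative buckets (not the client library)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per finite bucket plus a final slot for +Inf.
        self._counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self._counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        result, running = [], 0
        bounds = [str(b) for b in self.upper_bounds] + ["+Inf"]
        for bound, n in zip(bounds, self._counts):
            running += n
            result.append((bound, running))
        return result


request_seconds = HistogramSketch([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.4, 2.0):
    request_seconds.observe(duration)
print(request_seconds.cumulative_buckets())
```

Because the buckets are cumulative, the final `+Inf` bucket always equals the total number of observations.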

Query Language (PromQL)

Prometheus Query Language (PromQL) is a powerful and flexible language for querying and aggregating time series data. It allows users to select and aggregate time series data in real time, facilitating detailed analysis of system performance. PromQL supports a wide range of operations, including filtering, aggregation, and mathematical functions, enabling users to derive meaningful insights from their metrics data.

Basic Operations

  • Instant Vectors: A set of time series, each with a single sample, all sharing the same timestamp.
  • Range Vectors: A set of time series, each with a range of samples over a period of time.
  • Selectors: A set of time series data that matches a specific label configuration.
  • Aggregation: Functions for aggregating time series data, such as sum, avg, min, and max.
  • Filtering: Selecting time series data by label matchers (=, !=, =~, !~) or by comparing sample values (==, !=, >, <).
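
The operations above can be modeled on a small in-memory data set. This is a simplified sketch, not how the Prometheus query engine is implemented: the sample data, the function names, and the handling of window boundaries and stale samples are assumptions.

```python
import re

# Each series: a label set (including __name__) plus (timestamp, value) samples.
SERIES = [
    ({"__name__": "http_requests_total", "method": "GET", "status": "200"},
     [(0, 10.0), (15, 14.0), (30, 20.0)]),
    ({"__name__": "http_requests_total", "method": "POST", "status": "500"},
     [(0, 1.0), (15, 1.0), (30, 3.0)]),
]

MATCH_OPS = {
    "=": lambda actual, want: actual == want,
    "!=": lambda actual, want: actual != want,
    "=~": lambda actual, want: re.fullmatch(want, actual) is not None,
    "!~": lambda actual, want: re.fullmatch(want, actual) is None,
}


def select(series, matchers):
    """Keep series whose labels satisfy every (label, op, value) matcher."""
    return [
        (labels, samples) for labels, samples in series
        if all(MATCH_OPS[op](labels.get(name, ""), want) for name, op, want in matchers)
    ]


def instant_vector(series, at):
    """One sample per series: the most recent one at or before time `at`."""
    out = []
    for labels, samples in series:
        older = [v for t, v in samples if t <= at]
        if older:
            out.append((labels, older[-1]))
    return out


def range_vector(series, start, end):
    """All samples per series with start < t <= end."""
    return [(labels, [(t, v) for t, v in samples if start < t <= end])
            for labels, samples in series]


errors = select(SERIES, [("status", "!=", "200")])
total = sum(value for _, value in instant_vector(SERIES, at=20))
print(len(errors), total)
```

Here `select` plays the role of a selector with label matchers, and summing over an instant vector corresponds to a `sum` aggregation.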

Examples

  • Instant Vector: http_requests_total
  • Range Vector: http_requests_total[5m] (usually passed to a function, as in rate(http_requests_total[5m]), which returns an instant vector)
  • Selectors: http_requests_total{status="200", method="GET"}
  • Aggregation: sum(http_requests_total)
  • Filtering: http_requests_total{status!="200"}
  • Arithmetic: rate(http_requests_total[5m]) * 100
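
To show roughly what a per-second rate over a window means, the sketch below computes the average increase per second across the samples of one series. It is a simplification under stated assumptions: counter resets are handled, but the extrapolation to the window boundaries that PromQL's rate() performs is left out, and `simple_rate` is a hypothetical helper.

```python
def simple_rate(samples):
    """Per-second increase over a window of (timestamp_seconds, value) samples.

    Simplification: counter resets are handled, but no extrapolation to the
    window boundaries is attempted.
    """
    if len(samples) < 2:
        return None  # Not enough data points to compute a rate.
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value is treated as a counter reset: count from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Samples of a counter taken over a five-minute (300 s) window.
window = [(0, 0.0), (100, 50.0), (200, 100.0), (300, 150.0)]
per_second = simple_rate(window)
print(per_second, per_second * 100)
```

Multiplying the result by 100 mirrors the arithmetic in the last example of the list above.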

Data Collection

Pull-Based Metrics

Prometheus uses a pull-based model to collect metrics from configured targets. The Prometheus server periodically scrapes metrics from the targets and stores the time series data for analysis and alerting. This approach allows Prometheus to efficiently collect metrics from a wide range of systems and applications, making it suitable for monitoring complex and dynamic environments.

This differs from tools such as New Relic and Amazon CloudWatch, which use a push-based model to collect metrics. With Prometheus, each target only needs to expose a scraping endpoint. This prevents the monitoring tool from being flooded with requests and also allows multiple Prometheus instances to pull the same metrics data.
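
A scraping endpoint can be sketched with nothing more than the Python standard library. The metric names, the `STATE` dictionary, and port 8000 are assumptions for illustration; production services would normally use a client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state that the application updates as it runs.
STATE = {"app_requests_total": 42, "app_queue_length": 3}


def render_metrics(state):
    """Render name/value pairs as untyped samples in the text format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(state.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(STATE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on http://localhost:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(STATE))
```

The target does no work between scrapes: it only answers when a Prometheus server asks, which is the essence of the pull model. Calling `serve()` would start the endpoint.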

Which targets to scrape and how often to scrape them are defined in the Prometheus configuration file. The configuration file specifies the targets, the interval at which to scrape metrics, and other relevant settings.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

Rules for aggregating metric values or creating alerts when conditions are met:

rule_files:
  - "first.rules"
  - "second.rules"

What resources Prometheus monitors:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
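
As a rough sketch of how such a configuration translates into scrape requests, the snippet below expands static targets into the URLs that would be fetched. The configuration is written as a Python dictionary mirroring the YAML above, the `scrape_urls` helper is hypothetical, and the `http` scheme and `/metrics` path are the defaults assumed when a job does not override them.

```python
def scrape_urls(scrape_configs):
    """Expand static targets into (job_name, url) pairs."""
    urls = []
    for job in scrape_configs:
        scheme = job.get("scheme", "http")
        path = job.get("metrics_path", "/metrics")
        for static in job.get("static_configs", []):
            for target in static["targets"]:
                urls.append((job["job_name"], f"{scheme}://{target}{path}"))
    return urls


CONFIG = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "node", "static_configs": [{"targets": ["localhost:9100"]}]},
]

for job_name, url in scrape_urls(CONFIG):
    print(job_name, url)
```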

For targets that run only for a very short time, Prometheus offers the Pushgateway component described under Components.

Alerting

Prometheus provides a powerful alerting system that allows users to define custom alerting rules and receive notifications when specific conditions are met. The alerting system is tightly integrated with the Prometheus server, enabling users to define alerts using PromQL expressions and route them to the Alertmanager for further processing.

The Alertmanager handles alerts sent by the Prometheus server. It manages the routing, grouping, and silencing of alerts, ensuring that the right people are notified at the right time, and it can send notifications via channels such as email, Slack, and PagerDuty.
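
Grouping and silencing can be illustrated with a small model. This is a sketch under simplifying assumptions, not the Alertmanager's implementation: alerts are plain dictionaries, silences match on exact label equality only, and the alert names and labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def drop_silenced(alerts, silences):
    """Remove alerts whose labels match every label of any silence."""
    def silenced(alert):
        return any(
            all(alert["labels"].get(k) == v for k, v in silence.items())
            for silence in silences
        )
    return [a for a in alerts if not silenced(a)]


ALERTS = [
    {"labels": {"alertname": "HighLatency", "service": "api", "severity": "page"}},
    {"labels": {"alertname": "HighLatency", "service": "web", "severity": "page"}},
    {"labels": {"alertname": "DiskFull", "service": "db", "severity": "warn"}},
]
SILENCES = [{"service": "db"}]  # Hypothetical silence during db maintenance.

active = drop_silenced(ALERTS, SILENCES)
grouped = group_alerts(active, ["alertname"])
for key, members in grouped.items():
    print(key, len(members))
```

Grouping lets two related alerts arrive as a single notification, while the silence keeps the expected database alert from notifying anyone.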

Additional Resources