Prometheus Metrics: Mastering Them

  • 19 min read

Ever felt lost in a sea of metrics, trying to make sense of your Prometheus data? You’re not alone. Many DevOps engineers and SREs find themselves grappling with the various types of metrics that Prometheus offers. The sheer volume of information can be overwhelming, but understanding these types is key to effective monitoring and alerting. This article breaks down Prometheus metrics in a clear, straightforward way so you can take control of your monitoring setup.

Understanding Prometheus Metrics Types

Prometheus, at its core, is a time-series database. It stores data points over time, allowing you to track changes and identify patterns in your systems. The data Prometheus collects are known as metrics. These metrics come in four primary types: Counters, Gauges, Histograms, and Summaries. Each of these serves a different purpose and is best suited for specific kinds of data. Let’s get into each of them:

Counters

Counters are metrics that represent a single, ever-increasing value. They are used to track things like requests served, errors encountered, or tasks completed. The key thing about a counter is that it always goes up; it is reset only when the application restarts or, in rare cases, by manual intervention. Here’s a closer look:

  • Monotonicity: The most important thing to note is that counters can only increase; they can’t go down unless reset.
  • Use Cases: Perfect for tracking the total number of events. Think of it as a tally counter. Every time something happens, you increment the counter.
  • Examples:
    • Number of HTTP requests to an API.
    • Total number of errors in an application.
    • Number of database queries executed.
  • Rate Calculations: Because counters are always increasing, the raw number is rarely useful on its own. Instead, you’ll calculate the rate per second or per minute. This tells you the speed at which the counter is growing, which is far more informative.
  • Resetting: When an application restarts, a counter typically gets reset. This is important for your alerts. If the counter starts from zero again, you may get a false positive if not handled well.
  • Example in Code:
    ```python
    from prometheus_client import Counter

    REQUESTS = Counter('http_requests_total', 'Total HTTP requests')

    # When a new request comes in:
    REQUESTS.inc()
    ```
    * When to Use: When you need to track cumulative totals over time and want to compute rates, such as queries per second.

Gauges

Gauges are metrics that can go up, down or stay the same. They represent a value that can fluctuate or change at any given time. Unlike counters, gauges don’t track cumulative sums. They track current states.

  • Fluctuation: The core concept of gauges is their ability to go up, down or remain stable. They reflect the current status at a given time.
  • Use Cases: Best when you need to track real-time changes or fluctuations.
  • Examples:
    • Current CPU usage
    • Memory consumption
    • Number of connected users
    • Temperature in your server room
  • No Cumulative Sum: Because Gauges represent current states, they do not track sums over time.
  • Setting Values: You can directly set the value of a Gauge to what it currently is. This gives a snapshot of the application or system at that exact time.
  • Example in Code:
    ```python
    from prometheus_client import Gauge

    CPU_USAGE = Gauge('cpu_usage_percent', 'Current CPU Usage')

    # Set the value:
    CPU_USAGE.set(65.5)
    ```
    * When to Use: When you need a measure of what is happening right now, such as memory usage or number of active sessions.

Histograms

Histograms are used to track the distribution of values within a set of observations. They’re perfect when you need to understand things like response times, request sizes, and other variable data. Histograms let you visualize data within buckets, which are essentially value ranges.

  • Bucket Ranges: Histograms organize data into configurable “buckets,” or ranges of values. Each bucket counts how many observations fell within that range.
  • Distribution Analysis: By using buckets, you gain insight into the spread of the data instead of just an average or max value.
  • Use Cases: Perfect when analyzing the distribution of things like:
    • Request response times
    • Request sizes
    • Execution times
  • Quantiles: Histograms allow you to calculate quantiles. Examples of this are the 50th percentile, the 90th percentile, etc. This gives you a sense of where a value is likely to fall.
  • Calculating the Rate: With histograms, you often calculate rate of observations and quantiles within buckets to better analyze distribution shifts.
  • Example in Code:
    ```python
    from prometheus_client import Histogram

    REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP request latency', buckets=(0.1, 0.5, 1.0, 5.0, 10.0))

    # When a request finishes:
    REQUEST_LATENCY.observe(0.75)
    ```
    * When to Use: When you need to measure response time distributions for your application to see how many requests fall within the expected latency.

Summaries

Summaries are similar to histograms in that they also track distributions, but they differ in how they calculate and report this information. Summaries calculate quantiles directly on the client side, whereas histogram quantiles are computed on the server side (by Prometheus).

  • Client-Side Quantiles: Summaries calculate quantiles such as the median (50th percentile) or the 90th percentile directly in the application before sending data to Prometheus.
  • Use Cases: Best when you need:
    • Quantiles readily available.
    • Client-side calculation of a value.
    • To reduce the compute load on the Prometheus server.
  • Examples:
    • Request latencies.
    • Processing times.
    • Payload sizes.
  • Configuration: Summaries have configurable quantiles. For example, the 50th, 90th, and 99th percentiles can be precalculated.
  • Reporting Data: Summaries report:
    • The sum of observations.
    • The count of observations.
    • The calculated quantiles.
  • Example in Code:
    ```python
    from prometheus_client import Summary

    REQUEST_LATENCY_SUM = Summary('http_request_latency_seconds', 'HTTP request latency')

    # When a request finishes:
    REQUEST_LATENCY_SUM.observe(0.80)
    ```

    Note that the official Python client does not implement quantiles for summaries; it exposes only the _count and _sum series. Clients in other languages, such as Java, accept a quantile configuration.
    * When to Use: When you need precalculated quantiles without requiring Prometheus to compute them. You’re looking for a balance between data richness and reduced server load.

Choosing the Right Metric Type

Knowing what metrics you can use is half the battle. The other half is knowing when to use each one. Selecting the right metric type is critical for effective monitoring. This selection influences how you can query and analyze your data. A wrong choice can lead to misleading results or inaccurate alerts. So let’s go through each metric type.

When to Use Counters

Counters are best for tracking cumulative events that only go up. This makes them the perfect fit for:

  • Tracking API requests: If you want to know how many requests are hitting your API endpoint you would use a counter.
  • Total number of errors: Want to track application errors? Each time an error happens, increment the counter. This allows you to see how many errors you’re getting in the time frame of your choosing.
  • Monitoring database queries: Keep a count of database queries to see if there’s an unusual pattern.
  • Number of processed messages: If you have a message queue, it’s often useful to track how many messages your service has consumed.
  • Completed jobs: Use a counter to track the number of jobs completed in a batch process.

When to Use Gauges

Gauges are used when tracking values that can go up or down at any time. Some examples include:

  • Resource utilization: This includes CPU usage, memory consumption, and disk space available.
  • Number of concurrent connections: If you want to see how many users are currently connected to your web server, use a gauge.
  • Queue lengths: If you have a message queue, you may want to track queue lengths; a gauge is perfect for this job.
  • Temperature: A gauge tracks fluctuating temperatures in systems, like in a server room.
  • Cache usage: A gauge can expose the current hit ratio or occupancy of your cache server. (Raw hit and miss counts, which only ever increase, are better modeled as counters.)
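For values like connection counts, you rarely set the gauge directly; the Python client’s Gauge also supports inc() and dec(). Here is a minimal sketch, assuming hypothetical names (active_connections and the two callbacks are illustrative, not part of any real API):

```python
from prometheus_client import CollectorRegistry, Gauge

# A dedicated registry keeps the example self-contained.
registry = CollectorRegistry()
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Open client connections',
                           registry=registry)

def on_connect():
    ACTIVE_CONNECTIONS.inc()    # a client connected

def on_disconnect():
    ACTIVE_CONNECTIONS.dec()    # a client disconnected

on_connect()
on_connect()
on_disconnect()

# get_sample_value reads the current value back from the registry.
current = registry.get_sample_value('active_connections')
print(current)  # 1.0
```

Because the gauge tracks a current state, two connects followed by one disconnect leave it at 1, not 3.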

When to Use Histograms

Histograms excel when you need to see the distribution of events. So, consider them when:

  • Measuring request latencies: This is one of the best use cases for histograms. You want to know how the response time varies for your API. Use buckets to understand how many responses are fast versus slow.
  • Tracking process times: For batch jobs or long-running processes, track how long it takes to execute. This is useful for identifying performance issues.
  • Analyzing packet sizes: If your application deals with network packets you can measure packet sizes and see if a large number of them are too big.
  • Looking at data sizes: Histograms are great to analyze how data sizes vary. For example, you can track the size of data being written to a database or the size of responses being sent out to a client.
  • Analyzing API traffic: If you want to measure the sizes of requests going into your application, histograms can be the way to go.

When to Use Summaries

Choose summaries when you need:

  • Client-side calculations: You want quantiles computed on the client. This reduces the load on the Prometheus server itself.
  • Precalculated quantiles: When you need the 50th, 90th, or 99th percentile readily available from your application.
  • Consistent reporting: If you need to send aggregate data from your application to Prometheus.
  • Latency measurements: Similar to histograms. Use summaries when you have specific percentile needs.
  • Data analysis: When you need the sum of observations in addition to the quantiles, a summary will do.

Implementing Prometheus Metrics

Now that you know the difference between metric types, it is time to implement them. This section will guide you through how to expose and use Prometheus metrics.

Exposing Metrics

The first step is to expose your metrics: your application must provide an endpoint that Prometheus can scrape. Usually, this is an HTTP endpoint made for this purpose. Here are the basics:

  • HTTP Endpoint:
    • Set up an HTTP endpoint in your application that will be used to expose metrics.
    • The convention is to use /metrics as the endpoint.
  • Text Format:
    • Prometheus scrapes metrics in a plain text format. You’ll need to format your output to match this specification.
    • Each metric is represented by a line containing its name, optional labels, and its value.
  • Libraries:
    • Use client libraries for your programming language. Prometheus has libraries for Python, Go, Java, and more.
    • These libraries handle the formatting for you and offer abstractions for each of the metric types mentioned above.
  • Example with Python:
    ```python
    from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
    import random
    import time

    # Define the metrics
    REQUESTS = Counter('http_requests_total', 'Total HTTP requests')
    CPU_USAGE = Gauge('cpu_usage_percent', 'Current CPU Usage')
    REQUEST_LATENCY = Histogram('http_request_latency_seconds', 'HTTP request latency', buckets=(0.1, 0.5, 1.0, 5.0, 10.0))
    REQUEST_LATENCY_SUM = Summary('http_request_latency_seconds_summary', 'HTTP request latency')

    # Start the HTTP server
    start_http_server(8000)

    # Simulate some work and update metrics
    while True:
        REQUESTS.inc()
        CPU_USAGE.set(random.uniform(30, 70))
        latency = random.uniform(0.1, 2.0)
        REQUEST_LATENCY.observe(latency)
        REQUEST_LATENCY_SUM.observe(latency)
        time.sleep(1)
    ```
    In this Python code we create all four metric types, spin up a web server on port 8000, and then update the metrics continuously in a loop. This is just an example; in real-world use the metrics should be attached to the processes you’re trying to monitor.
  • Accessing Metrics:
    • After setting up your endpoint, you should be able to visit http://your-application:port/metrics.
    • You should see the metrics printed on screen as plain text.
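If you want to inspect that exposition format without running a server, the Python client’s generate_latest() renders the same plain text the /metrics endpoint serves. A small sketch:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A dedicated registry keeps the example self-contained.
registry = CollectorRegistry()
requests_total = Counter('http_requests_total', 'Total HTTP requests',
                         registry=registry)
requests_total.inc()

# generate_latest returns the plain-text exposition format as bytes.
output = generate_latest(registry).decode()
print(output)
```

The output contains # HELP and # TYPE comment lines followed by one sample line per series, e.g. `http_requests_total 1.0`.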

Using Client Libraries

Prometheus client libraries handle the nitty-gritty details. Here are the typical steps for using client libraries:

  • Installation:
    • Install the appropriate library for your programming language using pip, npm or whatever package manager is used for your environment.
  • Metric Definition:
    • Define your metric using the appropriate constructor.
      • Counter("metric_name", "help text")
      • Gauge("metric_name", "help text")
      • Histogram("metric_name", "help text", buckets=(...))
      • Summary("metric_name", "help text") — some clients (e.g. Java) also accept a quantile configuration, but the official Python client does not
    • You give the metric a name and a help text describing it.
    • For histograms you set buckets; for summaries, configure quantiles where the client library supports them.
  • Metric Usage:
    • Use the library methods to update the metrics as the program runs.
      • counter.inc()
      • gauge.set(value)
      • histogram.observe(value)
      • summary.observe(value)
  • Exposing the metrics:
    • Start an HTTP server that exposes the metrics in plain text.
    • The libraries usually give you a function that starts the server and outputs the metrics.
  • Integration with Application:
    • Update your code to interact with the metrics as your application is running.
    • Increase counters, update gauges, and observe histograms.
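The update step above often doesn’t require touching call sites at all: the Python client ships convenience helpers such as time() on histograms and summaries, which observe the duration of the wrapped call automatically. A sketch (handle_task is a hypothetical function):

```python
import time
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
LATENCY = Histogram('task_latency_seconds', 'Task latency',
                    buckets=(0.005, 0.05, 0.5, 5.0), registry=registry)

# time() can be used as a decorator (or a context manager); it calls
# observe() with the elapsed seconds when the function returns.
@LATENCY.time()
def handle_task():
    time.sleep(0.01)  # simulated work

handle_task()

# One observation has been recorded in the histogram's _count series.
count = registry.get_sample_value('task_latency_seconds_count')
print(count)
```

Similar helpers exist for the other types, e.g. Gauge.track_inprogress() and Counter.count_exceptions().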

Prometheus Configuration

With the metrics exposed, you will now have to configure your Prometheus server.

  • Prometheus Configuration File:
    • Edit the prometheus.yml file. This tells Prometheus what to scrape and where.
  • Target Definitions:
    • Add a scrape_configs section. This section specifies target endpoints that Prometheus will scrape.
      ```yaml
      scrape_configs:
        - job_name: 'my_application'
          scrape_interval: 15s # How often to scrape
          static_configs:
            - targets: ['your-application:port'] # The endpoint you exposed earlier
      ```
  • Restart Prometheus:
    • Restart the Prometheus server so the configuration changes are loaded.
  • Metric Discovery:
    • Prometheus should now automatically discover and scrape your new metrics.
    • You can use Prometheus expression browser to check the metrics.

Querying Prometheus Metrics

Prometheus exposes a powerful query language, PromQL. The query language is used to explore your metrics and build dashboards, alerts, etc.

  • Basic Queries:
    • Start with basic metrics. Use the name of the metric you want to visualize, such as http_requests_total.
  • Time Range:
    • Specify time ranges using selectors, such as [5m], which means the last five minutes.
  • Rate Functions:
    • Use rate() for counters to view the rate of increase, e.g. rate(http_requests_total[5m]).
  • Aggregation:
    • Combine metrics with aggregation, such as sum(rate(http_requests_total[5m])).
  • Filtering with Labels:
    • Filter by labels (if you have them), e.g., http_requests_total{method="GET"}.
  • Histograms Queries:
    • Use histogram_quantile() to obtain quantiles from histograms, e.g., histogram_quantile(0.95, sum(rate(http_request_latency_seconds_bucket[5m])) by (le)). The by (le) clause preserves the bucket label the function needs.
  • Summaries Queries:
    • Summaries provide the quantiles that were precalculated by the client, ready to be used; the quantile appears as a label on the metric, e.g., http_request_latency_seconds_summary{quantile="0.95"}.
  • Visualizations:
    • Use graphs in the Prometheus expression browser or Grafana for rich visualization of your data.

Labeling Metrics

Labels are key-value pairs used to classify and refine your metrics. Proper labels make it easier to query, filter, and analyze data. They are very useful and are key to making the most out of the monitoring system.

  • Adding Labels:

    • In your application, add labels when you create a metric.
      ```python
      from prometheus_client import Counter

      REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'status'])

      # When a new request comes in:
      REQUESTS.labels(method="GET", status="200").inc()
      ```
  • Use Cases:
    • Request methods: Differentiate between GET, POST, and PUT requests.
    • Status codes: Track responses such as 200, 404, 500.
    • Application instances: If you have multiple instances of the application running, each instance has a label, so you can distinguish which instance is being monitored.
    • Data centers: Use labels for identifying data centers or regions to see where the problem is coming from.
    • Environments: You can add labels such as staging or production to separate the metrics between different environments.
  • Querying with Labels:
    • Use PromQL to filter metrics using labels.
    • For example, to get requests with the POST method: http_requests_total{method="POST"}
  • Benefits:
    • Granular data: Labels allow you to drill down to specific parts of your application.
    • Better analysis: Query the data more accurately.
    • Flexible dashboards: You can create dashboards that show different views of your system, such as an API request dashboard or a database query dashboard.

Best Practices for Using Prometheus Metrics

To ensure you get the best results, it’s good to follow some best practices. These guidelines can help you set up a robust monitoring system that is easy to maintain and interpret.

Naming Conventions

  • Consistency: Use a consistent naming pattern. This makes it easier to find and query your metrics.
  • Prefixes: Use prefixes to show what type of metrics you’re dealing with, such as http_, db_, or mem_.
  • Suffixes: Add a suffix that describes the metric’s unit, such as _total, _seconds, _bytes, _percent, or _count.
  • Examples:
    • http_requests_total
    • db_queries_seconds
    • mem_usage_bytes
    • cpu_usage_percent
  • Clarity: Make sure the name clearly represents the metric’s purpose.

Use Help Text

  • Explanation: Always provide help text for each metric. This description should tell you exactly what the metric is tracking.
  • Clarity: Help text makes it easier for others to understand what a metric is for.
  • Self-documenting: With good help text, the monitoring system becomes easier to understand for new users, making it self-documenting to some extent.
  • Examples:
    • Total number of HTTP requests received.
    • Total time spent executing database queries, in seconds.

Choosing Buckets and Quantiles

  • Buckets: When defining buckets for histograms, choose ranges that reflect real-world performance.
  • Quantiles: For summaries, choose quantiles that help you detect common issues. For example, 50th, 90th, and 99th percentile.
  • Performance: Be mindful not to have too many buckets or quantiles because this will use resources in your application and on the Prometheus server.
  • Iteration: You can start with a generic setup, then refine the buckets and quantiles as you gather more data.
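As a sketch of that iteration, here is a histogram whose buckets are (hypothetically) aligned with a 250 ms latency target: dense resolution around the boundary you care about, sparse at the extremes. The metric name and bucket values are illustrative, not a recommendation.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()

# Buckets chosen around a hypothetical 250 ms latency target.
# +Inf is always appended automatically as the final bucket.
REQUEST_LATENCY = Histogram(
    'http_request_latency_seconds', 'HTTP request latency',
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
    registry=registry,
)

for observed in (0.08, 0.2, 0.3, 4.0):
    REQUEST_LATENCY.observe(observed)

# Bucket counts are cumulative: le="0.25" includes everything at or below it.
within_slo = registry.get_sample_value(
    'http_request_latency_seconds_bucket', {'le': '0.25'})
print(within_slo)  # 2.0 of the 4 requests met the 250 ms target
```

If most observations pile up in one or two buckets, that is the signal to redistribute the boundaries.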

Monitoring and Alerting

  • Alerting:
    • Set up alerts based on your metrics so you get notified when something goes wrong.
    • Use Prometheus alerting rules to send notifications through tools like email or Slack.
    • Example alert for a high rate of HTTP errors:
      ```yaml
      - alert: HighHTTPErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate detected"
          description: "The HTTP error rate has exceeded 10% in the last 5 minutes."
      ```
    • Example alert for low disk space:
      ```yaml
      - alert: DiskSpaceLow
        expr: disk_space_available_bytes / disk_space_total_bytes * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 20% of disk space is available."
      ```
  • Dashboards:
    • Create dashboards using Grafana to visualize your metrics.
    • Dashboards allow you to monitor important aspects of your applications at a glance.

Proper Labeling

  • Granularity: Make sure your labels add useful granularity. Adding too many labels could make metrics too fine-grained.
  • Cardinality: Don’t label with unbounded sets. Avoid using labels with too many values as this can make the Prometheus server consume too many resources.
  • Consistency: Keep label keys consistent across similar metrics. This makes for a more consistent monitoring environment.
  • Examples:
    • Use instance to label each running application instance.
    • Use region to label the data center.
    • Use environment for staging and production metrics.

Data Storage and Retention

  • Retention: Understand Prometheus retention settings. If you have limited resources, consider reducing the retention of less important metrics.
  • Compression: Prometheus compresses data to reduce disk space consumption.
  • Data backups: Regularly back up Prometheus data. This ensures you don’t lose your valuable monitoring data.

Advanced Prometheus Topics

With a solid grasp of the basics, it’s time to delve into more advanced techniques.

Custom Collectors

Sometimes, you’ll need more than basic metrics. For example, you may want to pull metrics from a non-standard source. That’s when you need to build custom collectors:

  • Use Cases:
    • Gather data from custom APIs or data sources.
    • Fetch data from legacy systems not directly compatible with Prometheus.
    • Transform data before exposing it to Prometheus.
  • How They Work:
    • Create a class or function that fetches your data.
    • Implement methods that interact with the Prometheus client library to create metrics.
    • Register this class or function with the registry.
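The steps above can be sketched with the Python client as follows. QueueDepthCollector, job_queue_depth, and fetch_queue_depth are hypothetical names standing in for your own data source:

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import GaugeMetricFamily

def fetch_queue_depth():
    """Stand-in for a call to a custom API or legacy system."""
    return 42

class QueueDepthCollector:
    """Custom collector: collect() is invoked on every scrape, so the
    value is fetched fresh each time rather than cached in a metric."""

    def collect(self):
        metric = GaugeMetricFamily('job_queue_depth',
                                   'Jobs waiting in the queue')
        metric.add_metric([], fetch_queue_depth())  # no labels
        yield metric

registry = CollectorRegistry()
registry.register(QueueDepthCollector())

# Reading a sample triggers collect() on the registered collector.
print(registry.get_sample_value('job_queue_depth'))  # 42.0
```

This is the pattern to reach for when the data lives outside your process and a plain Gauge updated on a timer would be stale or wasteful.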

Recording Rules

Recording rules help you pre-compute complex expressions. This is perfect when you want to calculate things that you will use often, or that are hard to calculate on demand.

  • Purpose:
    • Reduce query times by precomputing common expressions.
    • Simplify complex queries by creating easy-to-use metrics.
    • Reduce the load on the Prometheus server itself.
  • How They Work:
    • Create a rules file.
    • Define rules that calculate new metrics from existing ones.
      ```yaml
      groups:
        - name: my_rules
          rules:
            - record: http_error_rate:5m
              expr: rate(http_requests_total{status=~"5.."}[5m])
      ```
    • Then you can use the new http_error_rate:5m metric in your dashboards and alerts.
  • Benefits:
    • Improved query performance.
    • Clean up and reduce complexity in PromQL queries.

Exporters

Exporters are applications that expose metrics for services that don’t natively support Prometheus. This can include hardware devices, databases, etc.

  • Use Cases:
    • Monitor data from databases, cloud services, or third-party APIs.
    • Monitor hardware such as routers and servers.
  • Types:
    • Official Exporters: Prometheus has official exporters for common services. For example, the node_exporter for system metrics, the mysql_exporter for database metrics, and the blackbox_exporter for probing endpoints.
    • Community Exporters: The community has built a vast array of exporters for many more use cases.
    • Custom Exporters: You can also create custom exporters for use cases where an existing exporter does not exist.
  • How They Work:
    • Exporters usually connect to the service that you’re monitoring. Then, they fetch the data and format it as Prometheus metrics.
    • Prometheus then scrapes these exporters like any other metrics endpoint.

Federation

Prometheus federation allows you to combine data from multiple Prometheus servers into one. This is useful for a global view of your system.

  • Use Cases:
    • Aggregating metrics across multiple locations.
    • Combining metrics from different environments.
  • How it works:
    • Configure a Prometheus server to pull data from other Prometheus servers.
    • The federating server scrapes a special /federate endpoint on the other servers, usually pulling only a selected or pre-aggregated subset of series.
  • Benefits:
    • Global and central view of metrics.
    • Simplified monitoring of large environments.
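A minimal federation scrape job might look like the following; the downstream server names (prometheus-eu, prometheus-us) are hypothetical, and the match[] selectors show the typical pattern of federating only selected or pre-aggregated series:

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="my_application"}'   # pull series from a specific job
        - 'http_error_rate:5m'       # or pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-eu:9090'     # hypothetical downstream servers
          - 'prometheus-us:9090'
```

honor_labels: true keeps the original labels from the downstream servers instead of overwriting them with the federating server's own target labels.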

Common Mistakes and How to Avoid Them

As with any monitoring system, there are a few pitfalls to watch out for:

Using the Wrong Metric Type

  • Problem: Misusing metrics such as using counters for values that can go down, or gauges for cumulative events. This will lead to incorrect interpretations of your data.
  • Solution: Take the time to understand each metric type. Follow the guidelines and examples provided in this article.

Overusing Labels

  • Problem: Adding too many labels, especially labels with high cardinality (many unique values), can cause performance issues on the Prometheus server.
  • Solution: Use only necessary labels, and avoid adding unbounded values. Try not to use user IDs, timestamps, or other non-stable values.

Not Using Help Text

  • Problem: Metrics without good help text can confuse new team members and lead to misinterpretations of what is being tracked.
  • Solution: Always add clear and detailed help text for every metric you create.

Inconsistent Naming

  • Problem: Using inconsistent naming across different metrics will cause confusion when querying data.
  • Solution: Stick to consistent naming patterns, prefixes, and suffixes. This will make it easier to find and combine metrics.

Ignoring Monitoring and Alerting

  • Problem: Having metrics but not using them to set up alerts is useless.
  • Solution: Actively set up alerts and dashboards for critical metrics, and adjust them as needed.

Neglecting Data Retention

  • Problem: Not thinking about data retention can cause you to run out of space on the Prometheus server, or to lose important data.
  • Solution: Consider your requirements for how long you need to store the data. Make sure you’re backing it up if required.

Mastering Prometheus Metrics: Your Path to Observability

Prometheus metrics are the core of a robust monitoring system. They are essential for tracking the health and performance of your applications. By mastering the types of metrics—counters, gauges, histograms, and summaries—and understanding how and when to use them, you will gain deeper insights into your system behavior. The implementation details, including the client libraries, the Prometheus configuration, and PromQL, are essential steps in the journey. It’s important that you use the metrics properly: apply best practices like consistent naming conventions, proper labeling, and sensible data retention so that your monitoring is not only efficient but also easy to use and maintain. With Prometheus, you’re not just collecting data; you’re gaining a deeper understanding that will ultimately improve your applications and infrastructure.