DevOps 7: Advanced Monitoring and Alerting with Prometheus

Having established a basic observability framework with Prometheus and Grafana in our Amazon EKS environment, we can now use Prometheus for more advanced monitoring and alerting. This lets us manage the health and performance of our microservices proactively instead of reacting after users notice a problem. In this seventh installment, we create custom metrics and define alerting rules so that Prometheus can detect anomalies and raise alerts shortly after they appear.

Why Advanced Monitoring and Alerting?

Advanced monitoring goes beyond basic system metrics, diving into the specific metrics that matter most to your application’s performance and user experience. Coupled with targeted alerting, it ensures that teams can quickly identify and address issues, often before they impact users.

Configuring Custom Metrics in Prometheus

Prometheus’s flexible instrumentation allows you to define custom metrics tailored to your application’s operational characteristics. Here’s how to get started:

Step 1: Instrument Your Application

First, instrument your application to expose custom metrics. Prometheus has client libraries for many programming languages. Here’s an example using prom-client, the Prometheus client for Node.js, in an Express application:

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Histogram of request durations in milliseconds, labelled by method, route, and status code
const httpRequestDurationMs = new promClient.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 5, 15, 50, 100, 500]
});

// Time every request and record the duration once the response has been sent
app.use((req, res, next) => {
  const startTimeMs = Date.now();
  res.on('finish', () => {
    const durationInMs = Date.now() - startTimeMs;
    // req.route is undefined when no route matched (for example, a 404),
    // so fall back to a fixed label value instead of throwing
    const route = req.route ? req.route.path : 'unmatched';
    httpRequestDurationMs
      .labels(req.method, route, res.statusCode)
      .observe(durationInMs);
  });
  next();
});

// Expose the Prometheus scraping endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  // register.metrics() returns a Promise in recent prom-client releases
  res.end(await promClient.register.metrics());
});

app.listen(3000, () => console.log('Server started'));
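
To make the bucket configuration concrete, the sketch below mimics how a Prometheus histogram counts observations. It is plain JavaScript with no dependency on prom-client. The helper name cumulativeBuckets and the sample durations are assumptions made for illustration, not part of any library. The point it shows is that buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts everything.

```javascript
// Upper bounds copied from the histogram definition above.
// Prometheus adds the +Inf bucket implicitly.
const BUCKETS = [0.1, 5, 15, 50, 100, 500];

// Hypothetical helper: builds the cumulative counts a histogram would
// expose for a list of observed durations (in ms).
function cumulativeBuckets(durationsMs, upperBounds = BUCKETS) {
  const bounds = [...upperBounds, Infinity];
  const counts = bounds.map((le) => ({
    le: le === Infinity ? '+Inf' : String(le),
    // Cumulative: every observation at or below the bound is counted.
    count: durationsMs.filter((d) => d <= le).length,
  }));
  const sum = durationsMs.reduce((total, d) => total + d, 0);
  return { counts, sum, count: durationsMs.length };
}

// Made-up sample durations, not real measurements.
const sample = [3, 12, 40, 120, 700, 900];
const histogram = cumulativeBuckets(sample);

// Requests slower than 500 ms = total count minus the le="500" bucket.
const le500 = histogram.counts.find((b) => b.le === '500').count;
const slowerThan500 = histogram.count - le500;

console.log(histogram.counts);
console.log(`requests slower than 500 ms: ${slowerThan500}`);
```

Because the counts are cumulative, the number of slow requests is never stored directly. It has to be derived by subtracting a bucket from the total, which matters when writing alert expressions against the _bucket series.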

Step 2: Configure Prometheus to Scrape Custom Metrics

Ensure Prometheus is configured to scrape metrics from your application. Modify your Prometheus configuration (prometheus.yml) to include the target exposing the /metrics endpoint:

scrape_configs:
  - job_name: 'my-application'
    static_configs:
      - targets: ['<your-application-service>:3000']

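Replace <your-application-service> with the hostname or IP of your service. In a Kubernetes cluster such as EKS, a static target that names a Service is scraped through that Service, so each scrape may reach a different pod. To scrape every pod individually, Prometheus's Kubernetes service discovery (kubernetes_sd_configs) is the usual choice.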

Setting Up Alerts in Prometheus

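With custom metrics in place, define alerting rules to notify you of potential issues. Prometheus evaluates the rules and decides when an alert is firing; delivering notifications (email, chat, paging) is handled by Alertmanager, which must be configured as the alert destination.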

Step 1: Define Alerting Rules

Create a file alerting_rules.yml and define rules based on your custom metrics:

groups:
- name: example
  rules:
  - alert: HighRequestLatency
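    expr: sum(increase(http_request_duration_ms_count[1m])) - sum(increase(http_request_duration_ms_bucket{le="500"}[1m])) > 5
    for: 1m
    labels:
      severity: page
    annotations:
      summary: High request latency

Histogram buckets are cumulative: the le="500" series counts requests that took 500 ms or less. Subtracting it from the total request count gives the number of requests slower than 500 ms. This rule fires when more than five such requests occurred in the preceding minute and that condition has held continuously for one minute. Comparing the raw le="500" bucket against a threshold would not work, because that counter only ever grows and counts the fast requests, not the slow ones.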
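
The effect of the for: 1m clause can be sketched with a small simulation. The code below is not Prometheus code. It is a simplified model in plain JavaScript, and the helper name simulateAlert, the 15-second evaluation interval, and the sample results are all assumptions made for illustration. It shows that an alert first becomes pending, fires only once the condition has stayed true for the full duration, and resets as soon as one evaluation comes back false.

```javascript
// Hypothetical helper that models how a `for` duration gates an alert.
// evaluations: [{ t: seconds, conditionTrue: boolean }, ...] in time order.
function simulateAlert(evaluations, forSeconds) {
  let activeSince = null; // time at which the condition first became true
  return evaluations.map(({ t, conditionTrue }) => {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return { t, state: 'inactive' };
    }
    if (activeSince === null) activeSince = t;
    const state = t - activeSince >= forSeconds ? 'firing' : 'pending';
    return { t, state };
  });
}

// Made-up evaluation results, assuming rules are evaluated every 15 s.
const evaluations = [
  { t: 0, conditionTrue: true },
  { t: 15, conditionTrue: true },
  { t: 30, conditionTrue: false }, // brief recovery resets the timer
  { t: 45, conditionTrue: true },
  { t: 60, conditionTrue: true },
  { t: 75, conditionTrue: true },
  { t: 90, conditionTrue: true },
  { t: 105, conditionTrue: true },
];

const timeline = simulateAlert(evaluations, 60);
for (const { t, state } of timeline) {
  console.log(`t=${t}s ${state}`);
}
```

In this run the alert only fires at t=105s, a full 60 seconds after the condition became true again at t=45s. A longer for duration filters out short spikes at the cost of slower notification.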

Step 2: Load Alerting Rules into Prometheus

Include your alerting rules in the Prometheus configuration:

rule_files:
  - "alerting_rules.yml"

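Restart Prometheus to apply the new configuration and rules. It is worth validating the rules file first with promtool check rules alerting_rules.yml, since a syntax error in a rule file can prevent the configuration from loading.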

Conclusion

By customizing metrics and setting up targeted alerting rules, we significantly enhance our ability to monitor the health and performance of our EKS-based microservices. This proactive approach to monitoring and alerting ensures that we can maintain high system reliability and quickly respond to any issues, minimizing the impact on end-users.

Gotchas and Tips

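Metric Labels: Every unique combination of label values creates a separate time series, so use labels judiciously. Avoid unbounded values such as user IDs or raw URLs; too many series increase Prometheus's memory use and slow down queries.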
Alerting for High Availability: Configure redundant alerting pipelines, for example by running more than one Alertmanager instance, so that alerts are still delivered if one pipeline fails.
Testing Alerts: Regularly test alerting pathways and rules to ensure they work as expected and that alerts reach the intended recipients.
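
To put a rough number on the metric labels tip, the sketch below estimates how many time series a single histogram can produce. The helper name estimateHistogramSeries and the label value counts are assumptions made for illustration. The arithmetic relies on each label combination exposing one series per configured bucket, plus the +Inf bucket, _sum, and _count. The result is an upper bound, since not every combination of label values occurs in practice.

```javascript
// Hypothetical helper: upper bound on the series one histogram metric
// can create, given the number of distinct values per label.
function estimateHistogramSeries(bucketCount, labelValueCounts) {
  // Per label combination: one series per bucket, plus +Inf, _sum, _count.
  const seriesPerCombination = bucketCount + 3;
  const combinations = Object.values(labelValueCounts)
    .reduce((product, n) => product * n, 1);
  return combinations * seriesPerCombination;
}

// Assumed counts: route templates such as /users/:id keep `route` bounded.
const bounded = estimateHistogramSeries(6, {
  method: 4,
  route: 20,
  status_code: 8,
});

// Assumed counts: raw URLs as the route label make it effectively unbounded.
const unbounded = estimateHistogramSeries(6, {
  method: 4,
  route: 5000,
  status_code: 8,
});

console.log(`bounded route label:   up to ${bounded} series`);
console.log(`unbounded route label: up to ${unbounded} series`);
```

With these assumed numbers the bound grows from 5,760 to 1,440,000 series, which is why route templates are a safer label value than raw request paths.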

Advanced monitoring and alerting form the backbone of a resilient and performant microservices architecture. By leveraging Prometheus’s capabilities to their fullest, teams can ensure that their services are not just observable but also reliable and efficient.