AWS Cloud Operations & Migrations Blog

Introducing vended metrics for Amazon Managed Service for Prometheus

Today, I’m happy to announce that Amazon Managed Service for Prometheus now vends usage metrics to Amazon CloudWatch. These metrics give you better visibility into your Amazon Managed Service for Prometheus workspace. Let’s dive in and see how you can use these new Prometheus usage metrics in CloudWatch.

I’ve set up a new workload consisting of two Amazon EC2 instances, each running Prometheus and remote writing metrics to an Amazon Managed Service for Prometheus workspace. Furthermore, within my workspace, I’ve set up some rules to alert on high or low CPU utilization. The alerting rules I’m using look like this:

groups:
- name: example
  rules:
  - alert: HostHighCpuLoad
    expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 60
    for: 5m
    labels:
      severity: warning
      event_type: scale_up
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 60%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostLowCpuLoad
    expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) < 30
    for: 5m
    labels:
      severity: warning
      event_type: scale_down
    annotations:
      summary: Host low CPU load (instance {{ $labels.instance }})
      description: "CPU load is < 30%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
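For reference, the remote write side of this setup on each EC2 instance looks something like the following sketch. The workspace ID is a placeholder, and I’m assuming a Prometheus version (2.26 or later) with built-in SigV4 support:

# Excerpt from prometheus.yml on each instance
remote_write:
  - url: https://aps-workspaces.us-east-2.amazonaws.com/workspaces/<workspace id goes here>/api/v1/remote_write
    sigv4:
      # Sign remote write requests with the instance's IAM credentials
      region: us-east-2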

I’ve also configured alert manager to send the alerts to an Amazon Simple Notification Service (Amazon SNS) topic. The alert manager configuration looks like this:

alertmanager_config: |
  route: 
    receiver: default_receiver
    repeat_interval: 5m
        
  receivers:
    - name: default_receiver
      sns_configs:
        - topic_arn: <arn of SNS topic goes here>
          send_resolved: false
          sigv4:
            region: us-east-2
          message: |
            alert_type: {{ .CommonLabels.alertname }}
            event_type: {{ .CommonLabels.event_type }}
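Note that the alertmanager_config key wrapping the standard Alertmanager configuration is part of the alert manager definition format that Amazon Managed Service for Prometheus expects when you upload the definition to your workspace.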

Looking at the AWS/Usage namespace in CloudWatch, I select IngestionRate and ActiveSeries to validate and monitor usage against service quotas, as shown in the following figure. If either of these metrics approaches my account’s quota, I can request a quota increase via the AWS Support console.

Figure 1: CloudWatch metrics for IngestionRate and ActiveSeries for an Amazon Managed Service for Prometheus workspace. Both metrics climb from zero, showing that data is being successfully ingested into the workspace.
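Rather than watching a dashboard, I could let CloudWatch tell me when usage approaches the quota. Here’s a minimal sketch as a CloudFormation snippet that alarms when active series usage exceeds 80% of the applied quota. I’m assuming the usual AWS/Usage layout for vended usage metrics (a ResourceCount metric with Service, Type, Resource, and Class dimensions); verify the exact names your account emits in the CloudWatch console:

Resources:
  ActiveSeriesNearQuotaAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Active series usage is above 80% of the service quota
      ComparisonOperator: GreaterThanThreshold
      Threshold: 80
      EvaluationPeriods: 1
      TreatMissingData: notBreaching
      Metrics:
        - Id: usage
          MetricStat:
            Metric:
              Namespace: AWS/Usage
              MetricName: ResourceCount
              Dimensions:
                - { Name: Service, Value: Prometheus }
                - { Name: Type, Value: Resource }
                - { Name: Resource, Value: ActiveSeries }
                - { Name: Class, Value: None }
            Period: 300
            Stat: Maximum
          ReturnData: false
        - Id: pct
          # SERVICE_QUOTA returns the applied quota for a usage metric
          Expression: (usage / SERVICE_QUOTA(usage)) * 100
          ReturnData: true

The same pattern works for IngestionRate.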

I could also review the DiscardedSamples metric in the AWS/Prometheus namespace. Non-zero values there may indicate that the workload is being throttled by an Amazon Managed Service for Prometheus service quota.
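An alarm makes that check continuous. Here’s a sketch, assuming the metric carries a WorkspaceId dimension (the workspace ID is a placeholder):

Resources:
  DiscardedSamplesAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Samples are being discarded; check service quotas
      Namespace: AWS/Prometheus
      MetricName: DiscardedSamples
      Dimensions:
        - { Name: WorkspaceId, Value: <workspace id goes here> }
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching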

For the next step, I’ll review metrics to make sure that Amazon Managed Service for Prometheus rules and alerts are working properly. You can review RuleEvaluationFailures and RuleGroupIterationsMissed in the AWS/Prometheus namespace to see if there are any problems with the rules that you have created. After reviewing those metrics, I look at the AlertManagerAlertsReceived and AlertManagerNotificationsFailed metrics in the AWS/Prometheus namespace.

I notice that my workspace doesn’t seem to be sending alerts. Sure enough, the AlertManagerAlertsReceived and AlertManagerNotificationsFailed metrics show that alert manager has received alerts (the blue line) but has had problems processing them (the red line), as shown in the following figure.

Figure 2: CloudWatch metrics for AlertManagerAlertsReceived and AlertManagerNotificationsFailed for an Amazon Managed Service for Prometheus workspace. Alerts are being received for the workload, but the non-zero AlertManagerNotificationsFailed metric indicates that the alerts are failing to send.
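This is exactly the kind of condition worth alarming on, too: the same CloudFormation pattern sketched earlier, pointed at AlertManagerNotificationsFailed with a threshold of zero, would notify me the next time notifications fail instead of waiting for a manual review.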

In reviewing the alert manager setup for the workspace, I discover that the SNS topic’s access policy doesn’t allow the workspace to publish messages. After fixing the permission issue by granting the Amazon Managed Service for Prometheus service principal the sns:Publish and sns:GetTopicAttributes permissions on the SNS topic, the AlertManagerNotificationsFailed metric drops to zero, indicating that alerts are now being processed successfully.
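The fix boils down to one statement in the topic’s access policy. Here’s a minimal sketch as a CloudFormation snippet; the topic ARN and account ID are placeholders, and the aws:SourceAccount condition keeps workspaces outside my account from publishing to the topic:

Resources:
  AmpAlertTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - <arn of SNS topic goes here>
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowAmpAlertManager
            Effect: Allow
            Principal:
              Service: aps.amazonaws.com
            Action:
              - sns:Publish
              - sns:GetTopicAttributes
            Resource: <arn of SNS topic goes here>
            Condition:
              StringEquals:
                "aws:SourceAccount": <your account id goes here>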

In this blog post, I demonstrated the use of vended metrics for Amazon Managed Service for Prometheus: monitoring workspace usage against service quotas, and using the metrics to identify an issue in an alert manager configuration. Vended metrics are provided free of charge.

You can use these metrics to validate and monitor your usage against quotas, and to confirm that your rules and alerts are operating as you expect. As a next step, review the metrics in the CloudWatch console to make sure that your monitoring stack is working correctly.

About the author

Mike George

Mike George is a Principal Solutions Architect based out of Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, artificial intelligence (AI), and machine learning (ML).