Using Amazon Managed Service for Prometheus Alert Manager to receive alerts with PagerDuty

Many customers using Amazon Managed Service for Prometheus are transitioning from self-managed Prometheus systems to the fully managed service. As part of this transition, they need ways to migrate their existing Prometheus and Alert Manager configurations. PagerDuty is a receiver that many customers use to route alerts to their internal teams. However, Amazon Managed Service for Prometheus Alert Manager only supports an Amazon Simple Notification Service (Amazon SNS) receiver and cannot send directly to PagerDuty. This guide walks through how to connect Amazon Managed Service for Prometheus Alert Manager to Amazon SNS in order to route messages to PagerDuty, mimicking the controls and flexibility that exist today with the Alert Manager PagerDuty receiver.

Component Overview

The Amazon Managed Service for Prometheus Alert Manager handles alerts sent by client applications, including the Amazon Managed Service for Prometheus server. The Alert Manager configuration determines how each alert is routed. I want to connect the Amazon Managed Service for Prometheus Alert Manager with PagerDuty so that it mimics the native PagerDuty receiver in OSS Alert Manager. To do this, I need to configure the Amazon Managed Service for Prometheus Alert Manager definition, route any output messages via Amazon SNS, and re-route the messages from SNS to PagerDuty via an AWS Lambda function that posts to the PagerDuty API endpoint.

The following example uses a Prometheus server to monitor a node exporter job and remote writes the collected metrics to my Amazon Managed Service for Prometheus workspace. See Figure 1.

Architecture showing a Prometheus server remote writing to an Amazon Managed Service for Prometheus workspace. This sends notifications to SNS and then to AWS Lambda, which sends the final notification to PagerDuty.

Figure 1: Architecture to configure Amazon Managed Service for Prometheus Alert Manager with PagerDuty

Rules management

Once an Amazon Managed Service for Prometheus workspace has been created, you must configure the workspace with one or more rules. Amazon Managed Service for Prometheus supports both alerting and recording rules. As a simple example, I have created an alerting rule that fires when a Prometheus node exporter job is down.

groups:
  - name: example
    rules:
      - alert: DemoAlert
        expr: up{job="node"} == 0
        for: 1m
        annotations:
          summary: "Prometheus job missing (instance {{ $labels.instance }})"
          description: "A Prometheus job has disappeared\n VALUE : {{ $value }}\n LABELS : {{ $labels }}"
        labels:
          severity: warning

To add a rule to the Amazon Managed Service for Prometheus server, first encode the rules file in a base64 format. I used OpenSSL to base64-encode the YAML rules file as follows:

openssl base64 -in <input file> -out <output file>
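For readers who prefer to script this step, the same encoding could be done with Python's standard library. This is a minimal sketch, not part of the original walkthrough: the function name and the file paths passed to it are placeholders.

```python
import base64
from pathlib import Path


def base64_encode_file(input_path: str, output_path: str) -> str:
    """Base64-encode a file, similar in effect to `openssl base64 -in ... -out ...`."""
    raw = Path(input_path).read_bytes()
    # encodebytes inserts a newline every 76 characters, comparable to the
    # line-wrapped output that openssl produces.
    encoded = base64.encodebytes(raw).decode("ascii")
    Path(output_path).write_text(encoded)
    return encoded
```

Calling `base64_encode_file("rules.yaml", "rules.b64")` (hypothetical file names) would write the encoded rules next to the original file.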

Once the file has been base64-encoded, it can be added to the Amazon Managed Service for Prometheus server via this CLI syntax:

aws amp create-rule-groups-namespace --data file://<path to base64-encoded file> --name <Namespace> --workspace-id <workspace_id> --region <region>

You can also upload the rule file via the Amazon Managed Service for Prometheus console.

After I upload the rule file, my rule group namespace is created and moves from a Creating to an Active status. Clicking on the namespace link in the Amazon Managed Service for Prometheus console lets me see that the rule has been successfully imported. See Figure 2.

Figure 2: The alerting rule has been successfully imported into the Amazon Managed Service for Prometheus workspace

SNS and Lambda configuration

Before configuring Alert Manager, set up an SNS topic that Amazon Managed Service for Prometheus will use to send alerts. Once the topic has been created, grant Amazon Managed Service for Prometheus permission to publish to it. To do this, go to the Access Policy section of the SNS topic in the SNS console and add the following statement, replacing <region-code>, <account_id>, and <topic_name> with the actual values:

{ 
  "Effect": "Allow", 
  "Principal": { 
    "Service": "aps.amazonaws.com" 
  }, 
  "Action": [ 
    "sns:Publish", 
    "sns:GetTopicAttributes" 
  ], 
  "Resource": "arn:aws:sns:<region-code>:<account_id>:<topic_name>"
}

This access policy grants Amazon Managed Service for Prometheus the sns:Publish and sns:GetTopicAttributes permissions for the SNS topic identified in the Resource section.
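If the policy is managed from a script rather than the console, the statement above could be generated and merged into an existing policy document along these lines. This is only a sketch: the helper names are hypothetical, and the region, account ID, and topic name used when calling it are placeholders for your own values.

```python
import json


def amp_publish_statement(region: str, account_id: str, topic_name: str) -> dict:
    """Build the statement that lets Amazon Managed Service for Prometheus publish to the topic."""
    return {
        "Effect": "Allow",
        "Principal": {"Service": "aps.amazonaws.com"},
        "Action": ["sns:Publish", "sns:GetTopicAttributes"],
        "Resource": f"arn:aws:sns:{region}:{account_id}:{topic_name}",
    }


def add_statement(policy_json: str, statement: dict) -> str:
    """Append a statement to an existing access policy and return the new JSON text."""
    policy = json.loads(policy_json)
    existing = policy.get("Statement", [])
    # A policy may hold a single statement object instead of a list.
    if isinstance(existing, dict):
        existing = [existing]
    policy["Statement"] = existing + [statement]
    return json.dumps(policy, indent=2)
```

The existing statements in the topic's access policy are preserved; only the new statement is appended.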

Then, create a Lambda function that is triggered by any message sent to the SNS topic created above. Because the Alert Manager message is written in YAML, this Lambda function parses the message body as YAML, converts it to JSON, and sends the resulting JSON to the PagerDuty API. The function uses the PyYAML library, so to make the library available within the Lambda function, I must create a deployment package that includes the dependency. Then, I set up the Lambda function as a subscriber to the SNS topic just created.

import json

import urllib3
import yaml

http = urllib3.PoolManager()

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def lambda_handler(event, context):
    # In this implementation, payload.summary is set to description (to mimic pagerduty_config.description)
    # In this implementation, payload.source is set to client_url
    sns = event['Records'][0]['Sns']
    msg = yaml.safe_load(sns['Message'])

    # Pull out the fields that are re-added below in the structure PagerDuty expects
    summary = msg.pop('description', None)
    client_url = msg.pop('client_url', None)
    severity = msg.pop('severity', None)
    details = msg.pop('details', None)
    links = msg.pop('links', None)

    # Remove the integration key so that it is not logged with the payload
    routing_key = msg.pop('routing_key', None)

    # Add event_action back in, based on the SNS subject (a missing subject is treated as a trigger)
    subject = sns.get('Subject') or ''
    msg['event_action'] = 'resolve' if '[RESOLVED]' in subject else 'trigger'

    # Add payload fields back in
    msg['payload'] = {
        "client_url": client_url,
        "severity": severity,
        "summary": summary,
        "source": client_url,
    }

    # Add details fields
    if details:
        msg['payload']['custom_details'] = details

    # Add links fields
    if links:
        msg['links'] = links

    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', PAGERDUTY_EVENTS_URL, body=encoded_msg,
                        headers={'x-routing-key': routing_key})
    print({
        "message": msg,
        "status_code": resp.status,
        "response": resp.data
    })
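To exercise the handler before wiring it to the topic, a minimal SNS-style event can be built by hand. The sketch below includes only the two fields the handler reads (Subject and Message). The message body, key values, and subject strings are hypothetical examples, not output captured from a real workspace.

```python
def make_sns_event(message: str, subject: str) -> dict:
    """Build a minimal SNS-style Lambda event with only the fields the handler reads."""
    return {"Records": [{"Sns": {"Subject": subject, "Message": message}}]}


# Hypothetical message body in the YAML shape that the handler expects.
SAMPLE_MESSAGE = "\n".join([
    "routing_key: EXAMPLE_KEY",
    "dedup_key: example-dedup-key",
    "client_url: https://example.com/prometheus",
    "severity: warning",
    "description: Prometheus job missing",
])

# The handler looks for "[RESOLVED]" in the subject to choose the event_action.
firing_event = make_sns_event(SAMPLE_MESSAGE, "[FIRING:1] DemoAlert")
resolved_event = make_sns_event(SAMPLE_MESSAGE, "[RESOLVED] DemoAlert")
```

Passing either event to `lambda_handler(firing_event, None)` sends a real request to the PagerDuty endpoint, so use a test integration key when trying this out.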

Alert Manager configuration

Now that the SNS topic and Lambda function are in place, let’s configure the Amazon Managed Service for Prometheus Alert Manager.

The OSS Alert Manager has the following structure for a PagerDuty receiver:

pagerduty_configs:
  - send_resolved: true
    routing_key: <tmpl_secret>
    service_key: <tmpl_secret> # only used when using integration type prometheus
    client_url: <tmpl_string> # link to be included in the alert in PD
    severity: <tmpl_string> # error, info
    description: <tmpl_string> # description of the alert
    details: { <string>: <tmpl_string>, ... } # arbitrary dictionary
    links: [....<link_config>...]

Mimic this interface in the Amazon Managed Service for Prometheus Alert Manager definition, so that it’s easy to lift and shift the existing OSS Alert Manager configurations into Amazon Managed Service for Prometheus Alert Manager. To do this, utilize the SNS receiver block in the Amazon Managed Service for Prometheus Alert Manager definition. Under the message block, I have created keys to mimic the PagerDuty configuration.

sns_configs:
  - send_resolved: true
    topic_arn: <topic_arn>
    sigv4:
      region: <region>
    message: |
      routing_key: <tmpl_secret>
      dedup_key: <tmpl_string> # necessary to resolve the alert in PD
      client_url: <tmpl_string> # link to be included in the alert in PD
      severity: <tmpl_string> # error, info
      description: <tmpl_string> # description of the alert
      details: { <string>: <tmpl_string>, ... } # arbitrary dictionary
      links: [....<link_config>...]

The properties under the message block will be transformed via the Lambda function into JSON and into a structure that the PagerDuty API understands. The Amazon Managed Service for Prometheus Alert Manager configuration must be wrapped in an alertmanager_config block at the YAML file root.
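The wrapping step can be sketched in Python as follows. It assumes the Alert Manager configuration is nested as a YAML block scalar under alertmanager_config. The helper name, the receiver name pagerduty_via_sns, and the route block are illustrative only, and the <topic_arn> and <region> placeholders are left unfilled.

```python
import base64
import textwrap


def wrap_alertmanager_config(alertmanager_yaml: str) -> str:
    """Nest an Alert Manager configuration under a root-level alertmanager_config block."""
    body = textwrap.indent(alertmanager_yaml.strip("\n"), "  ")
    return "alertmanager_config: |\n" + body + "\n"


# Illustrative configuration; the receiver name is a placeholder.
EXAMPLE_CONFIG = """\
route:
  receiver: pagerduty_via_sns
receivers:
  - name: pagerduty_via_sns
    sns_configs:
      - send_resolved: true
        topic_arn: <topic_arn>
        sigv4:
          region: <region>
"""

definition = wrap_alertmanager_config(EXAMPLE_CONFIG)
# Base64-encode the wrapped definition, mirroring the encoding step used for the rules file.
encoded = base64.b64encode(definition.encode("utf-8")).decode("ascii")
```

The message block with the PagerDuty-style keys would sit alongside topic_arn and sigv4 inside the sns_configs entry.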

Just like the process for importing Amazon Managed Service for Prometheus rules, the YAML file for Alert Manager must be base64-encoded. Once it has been base64-encoded, it is added to the Amazon Managed Service for Prometheus server via the following CLI syntax:

aws amp create-alert-manager-definition --data file://<path to base64-encoded file> --workspace-id <workspace_id> --region <region>

Likewise, Alert Manager configuration can also be uploaded via the Amazon Managed Service for Prometheus console.

After a few moments, Amazon Managed Service for Prometheus Alert Manager transitions to an Active status, and I can see the alert I created. See Figure 3.

The Amazon Managed Service for Prometheus console showing the Alert manager screen configured to send alerts to an SNS topic.

Figure 3: The alert has been successfully created in Amazon Managed Service for Prometheus Alert Manager

Testing out the solution

The Amazon Managed Service for Prometheus rule that was set up is designed to fire when a node exporter job goes down. To test out the full solution, simply stop the node exporter service being monitored. Once the Amazon Managed Service for Prometheus rule fires, Amazon Managed Service for Prometheus Alert Manager sends the alert to the SNS topic that the Lambda function is subscribed to. Lambda then parses the SNS message and sends it off to PagerDuty. In a moment, the PagerDuty dashboard is updated with a new incident. See Figure 4.

PagerDuty screen showing that an alert called “Prometheus job missing (instance localhost:9100)” was triggered by the Amazon Managed Service for Prometheus workspace and received by PagerDuty.

Figure 4: An alert successfully pushed to PagerDuty via Amazon Managed Service for Prometheus Alert Manager

After restarting the node exporter job, Amazon Managed Service for Prometheus Alert Manager detects that the issue has been resolved, and then sends a final alert. Once PagerDuty receives the message, it marks the incident as resolved within the PagerDuty dashboard. See Figure 5.

PagerDuty screen showing that the previous alert was automatically resolved via a notification from the Amazon Managed Service for Prometheus workspace.

Figure 5: An alert automatically resolved in PagerDuty

Conclusion

This post demonstrated a simple pattern for migrating alerting mechanisms from OSS Alert Manager to Amazon Managed Service for Prometheus Alert Manager. The combination of the Amazon Managed Service for Prometheus Alert Manager, an SNS topic, and a subscribed Lambda function showed how to send alerts from Amazon Managed Service for Prometheus to PagerDuty. This pattern is especially helpful for organizations migrating their existing OSS Alert Manager configurations to Amazon Managed Service for Prometheus Alert Manager.

For more details on Amazon Managed Service for Prometheus Alert Manager, check out our documentation.

AWS Cloud Operations & Migrations Blog