AWS Open Source Blog

Metrics collection from Amazon ECS using Amazon Managed Service for Prometheus

Prometheus is an open source monitoring solution that has emerged as a very popular tool for collecting metrics from microservices running in a variety of environments including Kubernetes. In tandem with Grafana, a widely deployed data visualization tool, Prometheus enables customers to query and visualize operational metrics collected from their workloads. Customers deploying their Prometheus server in their container environments face challenges in managing a highly available, scalable, and secure Prometheus server environment and infrastructure for long-term storage.

The recent release of Amazon Managed Service for Prometheus (AMP) addresses these problems by providing a fully managed, highly available, secure, and scalable service that customers can use to monitor the performance of containerized workloads on AWS or on-premises, without having to manage the underlying infrastructure.

Although Prometheus supports dynamic discovery of resources, such as nodes, services, and pods in a Kubernetes cluster, it is not tailor-made for deployment to an Amazon Elastic Container Service (Amazon ECS) cluster. In this blog post, we present a solution that will enable customers to deploy Prometheus server on an Amazon ECS cluster, dynamically discover the set of services to collect metrics from, and send the metrics to AMP for subsequent query and visualization as well as long-term storage. This solution will also enable Amazon ECS customers to leverage libraries and servers that help in exporting existing metrics from third-party systems as Prometheus metrics.

Solution overview

At a high level, we will be following the steps outlined below for this solution:

  • Set up AWS Cloud Map for service discovery.
  • Deploy application services to an Amazon ECS cluster and register them with AWS Cloud Map.
  • Deploy Prometheus server to Amazon ECS, configure service discovery, and send metrics data to Amazon Managed Service for Prometheus.
  • Visualize metrics data using Amazon Managed Grafana.

Source code

The source code for the solution outlined in this blog post, as well as the artifacts needed for deploying the resources to an Amazon ECS cluster, can be downloaded from the GitHub repository.

Amazon ECS service discovery with AWS Cloud Map

To use Prometheus as a viable monitoring tool in large-scale deployments on Amazon ECS, having a dynamic service discovery mechanism is imperative. Customers using Amazon ECS can rely on AWS Cloud Map for their service discovery needs. AWS Cloud Map is a fully managed service that you can use to register cloud resources, such as Amazon RDS database instances and Amazon EC2 instances, with logical names. Client applications that depend on these resources can then discover them using DNS queries or API calls, referencing the resources by their logical names. AWS Cloud Map is tightly integrated with Amazon ECS: as services and tasks spin up or down, they are automatically registered with, and deregistered from, AWS Cloud Map.

Service discovery with AWS Cloud Map consists of three key components: service discovery namespace, service discovery service, and service discovery instance.

Diagram illustrating service discovery with AWS Cloud Map's three key components

A service discovery namespace is created as follows and associated with an Amazon Virtual Private Cloud (Amazon VPC).

VPC_ID=vpc-0bef82d36d4527eb6
SERVICE_DISCOVERY_NAMESPACE=ecs-services

OPERATION_ID=$(aws servicediscovery create-private-dns-namespace \
--vpc $VPC_ID \
--name $SERVICE_DISCOVERY_NAMESPACE \
--query "OperationId" --output text)

CLOUDMAP_NAMESPACE_ID=$(aws servicediscovery get-operation \
--operation-id $OPERATION_ID \
--query "Operation.Targets.NAMESPACE" --output text)

Next, a service discovery service, which encapsulates a service registry, is created within this namespace. Typically, there should be a separate service registry in Cloud Map corresponding to each ECS service that we want to collect metrics from. Note that we are using service tags to specify the URL path and port for the endpoint where the service exposes its metrics.

AWS_REGION=us-east-1
METRICS_PATH=/metrics
METRICS_PORT=3000
SERVICE_REGISTRY_NAME="webapp-svc"
SERVICE_REGISTRY_DESCRIPTION="Service registry for Webapp ECS service"

CLOUDMAP_WEBAPP_SERVICE_ID=$(aws servicediscovery create-service \
--name $SERVICE_REGISTRY_NAME \
--description "$SERVICE_REGISTRY_DESCRIPTION" \
--namespace-id $CLOUDMAP_NAMESPACE_ID \
--dns-config "NamespaceId=$CLOUDMAP_NAMESPACE_ID,RoutingPolicy=WEIGHTED,DnsRecords=[{Type=A,TTL=10}]" \
--region $AWS_REGION \
--tags Key=METRICS_PATH,Value=$METRICS_PATH Key=METRICS_PORT,Value=$METRICS_PORT \
--query "Service.Id" --output text)

One or more service discovery instances exist within this service registry and represent the resources that clients within the Amazon VPC can discover using a private DNS name constructed with the format {service-discovery-service}.{service-discovery-namespace}.
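The DNS name a client uses is simply the service registry name joined to the namespace name. As a trivial illustration (the function name below is ours, not part of any AWS SDK):

```python
def discovery_dns_name(service: str, namespace: str) -> str:
    """Build the private DNS name clients inside the VPC use to discover a service.

    Format: {service-discovery-service}.{service-discovery-namespace}
    """
    return f"{service}.{namespace}"

print(discovery_dns_name("webapp-svc", "ecs-services"))  # webapp-svc.ecs-services
```

For the registry created above, tasks of the web application would therefore be discoverable at webapp-svc.ecs-services within the VPC.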

Service discovery registration can be performed only at the time of creating an Amazon ECS service, using the Amazon Resource Name (ARN) of a Cloud Map service registry. Each task is registered as a service discovery instance, which is associated with a set of attributes, such as ECS_CLUSTER_NAME, ECS_SERVICE_NAME, and ECS_TASK_DEFINITION_FAMILY.

When an Amazon ECS service is launched using the AWS Command Line Interface (AWS CLI), the --service-registries argument is used, as shown below, to enable the service to register itself with a service registry in AWS Cloud Map.

CLUSTER_NAME=ecs-prometheus-cluster
SERVICE_NAME=KafkaPublisherService
TASK_DEFINITION=KafkaPublisherTask:1

aws ecs create-service --service-name $SERVICE_NAME \
--cluster $CLUSTER_NAME \
--task-definition $TASK_DEFINITION \
--service-registries "registryArn=$CLOUDMAP_SERVICE_ARN" \
--desired-count 2 \
--network-configuration "awsvpcConfiguration={subnets=$PRIVATE_SUBNET_IDS,securityGroups=[$SECURITY_GROUP_ID],assignPublicIp=DISABLED}" \
--scheduling-strategy REPLICA \
--launch-type EC2

Deploying Prometheus server to Amazon ECS

The complete JSON task definition used for deploying Prometheus server to ECS can be downloaded from the Git repository. Here are key considerations for deploying Prometheus server to an ECS cluster.

  1. The deployment does not make use of any data volumes for persistent storage of metrics scraped by Prometheus server. The data is ingested into an AMP workspace using the remote write mechanism, which enables sending metrics to a remote storage destination using HTTP.
  2. For secure ingestion of metrics into AMP, the HTTP requests must be signed using the AWS Signature Version 4 signing process. To facilitate this, an instance of AWS SigV4 Proxy is deployed as a sidecar container and configured to send data to an AMP workspace. The sidecar container definition is shown below:
          {
             "name":"aws-iamproxy",
             "image":"public.ecr.aws/aws-observability/aws-sigv4-proxy:1.0",
             "cpu": 256,
             "memory": 256,         
             "portMappings":[
                {
                   "containerPort":8080,
                   "protocol":"tcp"
                }
             ],
             "command":[
                "--name",
                "aps",
                "--region",
                "${AWS_REGION}",
                "--host",
                "aps-workspaces.${AWS_REGION}.amazonaws.com"
             ],         
             "logConfiguration":{
                "logDriver":"awslogs",
                "options":{
                   "awslogs-group":"/ecs/Prometheus",
                   "awslogs-create-group":"true",
                   "awslogs-region":"${AWS_REGION}",
                   "awslogs-stream-prefix":"iamproxy"
                }
             },
             "essential":true
          },  
  3. Prometheus server is configured to use this AWS SigV4 Proxy as the destination for its remote write, as shown in the configuration below. The placeholder variable WORKSPACE_ID is replaced with the ID of an AMP workspace. Refer to the documentation about how to set one up.
    global:
      evaluation_interval: 1m
      scrape_interval: 30s
      scrape_timeout: 10s
    remote_write:
      - url: http://localhost:8080/workspaces/${WORKSPACE_ID}/api/v1/remote_write
  4. An AWS Identity and Access Management (IAM) policy with the following set of permissions should be attached to the ECS task role used in this deployment so that the AWS SigV4 Proxy sidecar container can send metrics data to AMP.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "aps:RemoteWrite",
                    "aps:GetSeries",
                    "aps:GetLabels",
                    "aps:GetMetricMetadata"
                ],
                "Resource": "*"
            }
        ]
    }
  5. To guard against crashes, Prometheus server writes a write-ahead log (WAL) that can be replayed when the server restarts. To facilitate this, a directory on the Amazon Elastic Compute Cloud (Amazon EC2) instance is bind-mounted into the container. Bind-mounted host volumes are also supported when running tasks on AWS Fargate. In this implementation, the host parameter on the volume is empty, as shown in the JSON fragment below. Therefore, the Docker daemon assigns an arbitrary host path for this volume, and the data is not guaranteed to persist across container restarts. If it is imperative to preserve the WAL to avoid any data loss, then an Amazon Elastic File System (Amazon EFS) access point could be mounted into the container as explained in the documentation.
    {
       "containerDefinitions":[
          {
             "mountPoints":[
                {
                   "sourceVolume":"walVolume",
                   "containerPath":"/data"
                }            
             ]
          }
       ],
       "volumes":[   
          {
             "name":"walVolume",
             "host":{}
          }      
       ]
    }
  6. Prometheus provides several dynamic service-discovery mechanisms. However, there is no built-in service discovery for Amazon ECS. Hence, we leverage Prometheus’ file-based service discovery, which provides a more generic way to configure scraping targets and serves as an interface for plugging in a custom service discovery mechanism. The scraping configuration used by Prometheus server is shown below:
    scrape_configs:
      - job_name: ecs_services
        file_sd_configs:
          - files:
              - /etc/config/ecs-services.json
            refresh_interval: 30s
  7. The custom service discovery mechanism, explained in the next section, is implemented as a separate application, which is deployed as another sidecar container within the same ECS task. The JSON definition of this sidecar container is shown below. The complete source code for this Go language-based application is available for download from the Git repository.
          {
             "name":"config-reloader",
             "image":"937351930975.dkr.ecr.us-east-1.amazonaws.com/prometheus-sdconfig-reloader:latest",
             "user":"root",
             "cpu": 128,
             "memory": 128,         
             "environment":[
                {
                   "name":"CONFIG_FILE_DIR",
                   "value":"/etc/config"
                },
                {
                   "name":"CONFIG_RELOAD_FREQUENCY",
                   "value":"30"
                }
             ],         
             "mountPoints":[
                {
                   "sourceVolume":"configVolume",
                   "containerPath":"/etc/config",
                   "readOnly":false
                }
             ],           
             "logConfiguration":{
                "logDriver":"awslogs",
                "options":{
                   "awslogs-group":"/ecs/Prometheus",
                   "awslogs-create-group":"true",
                   "awslogs-region":"us-east-1",
                   "awslogs-stream-prefix":"reloader"
                }
             },
             "essential":true
          }
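The reloader’s behavior can be sketched in a few lines. The actual implementation in the repository is written in Go; the Python sketch below, with illustrative names such as write_targets_atomically, only shows the shape of the loop. Because Prometheus re-reads the target file on its refresh_interval, the file is replaced via a temporary file and an atomic rename so that a partially written list is never observed:

```python
import json
import os
import tempfile
import time


def write_targets_atomically(targets: list, path: str) -> None:
    """Write the file_sd target list so Prometheus never sees a partial file."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(targets, f, indent=2)
    os.replace(tmp_path, path)  # atomic rename on POSIX file systems


def reload_loop(discover, path: str, frequency_seconds: int = 30) -> None:
    """Periodically refresh scrape targets; `discover` queries Cloud Map."""
    while True:
        write_targets_atomically(discover(), path)
        time.sleep(frequency_seconds)
```

In the actual deployment, the target file path corresponds to /etc/config/ecs-services.json on the shared config volume, and the refresh frequency comes from the CONFIG_RELOAD_FREQUENCY environment variable.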

Custom service discovery for Prometheus

The application that performs custom service discovery is configured to read a list of Cloud Map namespaces from AWS Systems Manager Parameter Store. It periodically collects metadata about the ECS tasks registered under each service registry within these namespaces and assembles a JSON configuration file, ecs-services.json, which provides the list of scraping targets.

This configuration is used to add metadata, such as the ECS cluster, service, and task definition names, as labels to the Prometheus metrics scraped from each target, in addition to the labels added by a service using a Prometheus client library. The file is plugged into Prometheus’ file-based service discovery mechanism by saving it to the shared data volume that is mounted into the Prometheus server container, and Prometheus is configured to refresh its scraping targets from this file every 30 seconds. This ensures that the metrics samples it collects reflect the current state of the microservices deployed to the Amazon ECS cluster. A representative example of this configuration file is shown below.

[
   {
      "targets":[
         "10.10.101.56:9100"
      ],
      "labels":{
         "__metrics_path__":"/metrics",
         "cluster":"ecs-prometheus-cluster",
         "service":"NodeExporterService",
         "taskdefinition":"NodeExporterTask"
      }
   },
   {
      "targets":[
         "10.10.100.60:9100"
      ],
      "labels":{
         "__metrics_path__":"/metrics",
         "cluster":"ecs-prometheus-cluster",
         "service":"NodeExporterService",
         "taskdefinition":"NodeExporterTask"
      }
   },
   {
      "targets":[
         "10.10.100.151:3000"
      ],
      "labels":{
         "__metrics_path__":"/metrics",
         "cluster":"ecs-prometheus-cluster",
         "service":"WebAppService",
         "taskdefinition":"WebAppTask"
      }
   },
   {
      "targets":[
         "10.10.101.65:3000"
      ],
      "labels":{
         "__metrics_path__":"/metrics",
         "cluster":"ecs-prometheus-cluster",
         "service":"WebAppService",
         "taskdefinition":"WebAppTask"
      }
   }
]
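Each entry in this file is derived from one Cloud Map service discovery instance. The Python sketch below (illustrative only; the real application is the Go-based reloader described above) shows how the attributes Amazon ECS attaches to an instance, together with the METRICS_PATH and METRICS_PORT tags placed on the service registry, could be mapped to one file_sd entry:

```python
def file_sd_entry(instance_attributes: dict, service_tags: dict) -> dict:
    """Map one Cloud Map service discovery instance to a file_sd target entry.

    Attribute names mirror those Amazon ECS attaches when it registers a task
    (AWS_INSTANCE_IPV4, ECS_CLUSTER_NAME, ECS_SERVICE_NAME,
    ECS_TASK_DEFINITION_FAMILY); METRICS_PATH and METRICS_PORT come from the
    tags on the Cloud Map service registry.
    """
    ip = instance_attributes["AWS_INSTANCE_IPV4"]
    port = service_tags.get("METRICS_PORT", "80")
    return {
        "targets": [f"{ip}:{port}"],
        "labels": {
            "__metrics_path__": service_tags.get("METRICS_PATH", "/metrics"),
            "cluster": instance_attributes["ECS_CLUSTER_NAME"],
            "service": instance_attributes["ECS_SERVICE_NAME"],
            "taskdefinition": instance_attributes["ECS_TASK_DEFINITION_FAMILY"],
        },
    }
```

Applying this mapping to each instance under each registry, and concatenating the results, yields the JSON array shown above.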

As the service discovery application uses both AWS Systems Manager and AWS Cloud Map APIs to perform its task, the IAM policy attached to the ECS task role should contain the following additional set of permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "servicediscovery:*"
            ],
            "Resource": "*"
        }
    ]
}

Prometheus metrics collection in action

The following figure shows a deployment to an Amazon ECS cluster on EC2 that we will use to demonstrate the metrics collection approach outlined above. This deployment comprises the following components:

  • An ECS task comprising the Prometheus server, AWS SigV4 Proxy, and service discovery application containers.
  • A stateless web application that is instrumented with the Prometheus Go client library. The service exposes a Counter named http_requests_total and a Histogram named request_duration_milliseconds.
  • Prometheus Node Exporter to monitor system metrics from every container instance in the cluster. This service is deployed using the host networking mode and the daemon scheduling strategy. Note that we can’t deploy Node Exporter on AWS Fargate because Fargate does not support the host networking mode or the daemon scheduling strategy.

Diagram illustrating Prometheus metrics collection.

The set of services launched is shown below in a view of the ECS Console. The services that are to be scraped for Prometheus metrics are WebAppService and NodeExporterService.

Screenshot showing the services launched in the ECS Console.

The set of service registries where the above ECS services register their respective tasks is shown here in the Cloud Map Console.

Screenshot of the set of service registries where the ECS services register tasks in the Cloud Map Console.

The following figure shows the metadata associated with a service discovery instance in Cloud Map that corresponds to one of the tasks in the NodeExporterService.

Screenshot showing metadata associated with a service discovery instance in Cloud Map.

The metrics ingested into AMP are visualized using Amazon Managed Grafana. Amazon Managed Grafana is a fully managed service that enables you to query, correlate, and visualize operational metrics, logs, and traces from multiple sources. Refer to the documentation on how to add an AMP workspace as a data source to Amazon Managed Grafana. The figure below shows a visualization of the Counter and Histogram metrics collected from the web application.

  1. HTTP Request Rate chart shows the rate of requests processed by the service, computed as: sum(rate(http_requests_total[5m]))
  2. Average Response Latency chart shows average request processing latency, computed as: sum(rate(request_duration_milliseconds_sum[5m])) / sum(rate(request_duration_milliseconds_count[5m]))
  3. Request Latency Histogram chart shows the percentage of requests served within a specified threshold, computed as: sum(rate(request_duration_milliseconds_bucket{le="BUCKET_VALUE"}[5m])) / sum(rate(request_duration_milliseconds_count[5m])). The histogram is configured to count observations falling into buckets of 50, 100, 250, and 500 milliseconds.
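The arithmetic behind the average-latency query can be made concrete. Both request_duration_milliseconds_sum and request_duration_milliseconds_count are cumulative counters, so given two consecutive scrapes the average latency over the window is the change in the sum divided by the change in the count. The function name below is ours, for illustration only:

```python
def average_latency_ms(sum_t0: float, count_t0: float,
                       sum_t1: float, count_t1: float) -> float:
    """Average request latency over a window, from two samples of the
    cumulative request_duration_milliseconds_sum/_count counters."""
    delta_count = count_t1 - count_t0
    if delta_count == 0:
        return 0.0  # no requests completed in the window
    return (sum_t1 - sum_t0) / delta_count

# 500 requests took a combined 25,000 ms during the window -> 50 ms average
print(average_latency_ms(100_000, 2_000, 125_000, 2_500))  # 50.0
```

This is the same quotient the Grafana panel computes, with rate() supplying the per-second deltas of the two counters.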

Screenshot of metrics ingested into AMP, visualized in Grafana.

The Node Exporter service exposes a wide variety of EC2 instance-specific system metrics. The figure below shows a visualization of the average network traffic received, per second, and average CPU usage over the last minute, computed using the Counters node_network_receive_bytes_total and node_cpu_seconds_total.

Visualization of the average network traffic received, per second, and average CPU usage over the last minute.

Conclusion

Using the solution outlined in this blog post, customers can now collect metrics with Prometheus server on Amazon ECS, using dynamic service discovery based on AWS Cloud Map. They also have the option of deploying the third-party exporters available for several popular workloads on their Amazon ECS cluster, exporting metrics from those systems as Prometheus metrics. In conjunction with recently released AWS services, such as Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana, this approach gives customers open source-based options for their observability needs on workloads hosted on Amazon ECS.

Viji Sarathy

Viji Sarathy is a Principal Specialist Solutions Architect at AWS. He provides expert guidance to customers on modernizing their applications using AWS services that leverage serverless and container technologies. He has been at AWS for about 3 years and has 20+ years of experience in building large-scale, distributed software systems. His professional journey began as a research engineer in high-performance computing, specializing in Computational Fluid Dynamics. From CFD to cloud computing, his career has spanned several business verticals, always with an emphasis on the design and development of applications using scalable architectures. He holds a Ph.D. in Aerospace Engineering from The University of Texas at Austin. He is an avid runner, hiker, and cyclist.

Imaya Kumar Jagannathan

Imaya Kumar Jagannathan is a Principal Solution Architect focused on AWS observability services, including Amazon CloudWatch, AWS X-Ray, Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS Distro for OpenTelemetry. He is passionate about monitoring and observability and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. He loves programming in C# and working with containers and serverless technologies. LinkedIn: /imaya.