AWS Cloud Operations & Migrations Blog

Amazon Managed Service for Prometheus adds support for 200M active metrics

Today, Amazon Web Services (AWS) is pleased to announce support for 200M active series per workspace in Amazon Managed Service for Prometheus. Amazon Managed Service for Prometheus is a fully managed, Prometheus-compatible monitoring service that lets you monitor and alert on operational metrics at scale, without having to manage the underlying infrastructure required to scale and secure the ingestion, storage, alerting, and querying of metrics.

Because containers are short lived and auto-scale, the volume of metrics a customer needs to ingest and analyze to monitor their container environments can quickly grow to tens of millions of active series. With self-managed Prometheus, customers would have to pre-provision their Prometheus servers for peak load and often shard their metrics across multiple individual servers, each holding only a partial subset of the metrics data, to prevent any single instance from being overwhelmed (a pattern like the sharding sketch below).
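For illustration, here is a minimal sketch of that sharding pattern using Prometheus hashmod relabeling, where each self-managed server keeps only the scrape targets assigned to its shard. The shard count, shard number, and job name are placeholders for this sketch, not part of the announcement.

```yaml
# prometheus-shard-0.yml: config for one of four self-managed shards (illustrative only)
global:
  external_labels:
    shard: "0"                      # identifies this server's partial view of the data
scrape_configs:
  - job_name: kubernetes-pods       # example job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash every discovered target and keep only those that map to this shard,
      # so each Prometheus server scrapes a disjoint subset of targets.
      - source_labels: [__address__]
        modulus: 4                  # total number of shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"                  # this server's shard number
        action: keep
```

Any query that spans shards then has to fan out across these servers and merge partial results, which is the operational burden a single large workspace removes.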

To work around these scale limitations, customers have historically prioritized which metrics to keep and which to discard. At the same time, customers want a centralized, unified view of their container workloads across all of their Amazon EKS clusters, which requires putting cross-cluster container metrics in one system. With 200M active series supported in a single Amazon Managed Service for Prometheus workspace, customers no longer have to choose which metrics to keep.

Amazon Managed Service for Prometheus is built on top of Cortex, a horizontally scalable, highly available, multi-tenant, Prometheus-compatible datastore designed for long-term storage. To raise the number of active metrics per workspace, AWS raised the scalability limits of Cortex without compromising reliability.

AWS worked with the open source Cortex community to propose:

  • Scalability improvements
  • Deployment mechanisms to safely support 200M active metrics per workspace

One of the key bottlenecks in Cortex was compaction speed. Earlier this year, AWS completed the horizontally scalable compactor, a key component for scaling beyond the initial Cortex limit of 20-30M active metrics.

The horizontally scalable compactor raised the number of active metrics a single workspace could hold by increasing the compactor's throughput. However, that throughput became a bottleneck once again as workspaces approached 200M active metrics. To enable Cortex to scale to 200M, the horizontally scalable compactor was updated to parallelize compaction by metric definition.

With the ability to ingest 200M active series into a single Amazon Managed Service for Prometheus workspace, customers can now ingest higher-cardinality data than before and house more of their Prometheus data under one roof. Supporting 200M active metrics is just the beginning of the scalability journey for Amazon Managed Service for Prometheus, and AWS is excited to continue its partnership with the open source Cortex and Prometheus communities.

To get started with Amazon Managed Service for Prometheus, visit the user guide. To use 200M active metrics in a single workspace, request a limit increase through Service Quotas or the AWS Support Center, then point your Prometheus servers at the workspace with a remote write configuration like the example sketched below.
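As a minimal sketch (not part of the announcement), the snippet below shows a Prometheus remote_write block that sends metrics to a workspace using SigV4 authentication. The workspace ID and Region are placeholders; the exact remote write URL for your workspace is listed in the user guide and on the workspace details page.

```yaml
# Illustrative remote_write block for prometheus.yml (workspace ID and Region are placeholders)
remote_write:
  - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE123/api/v1/remote_write
    sigv4:
      region: us-east-1             # Region that hosts the workspace
    queue_config:
      max_samples_per_send: 1000    # batch size per request; tune for high-volume workloads
      max_shards: 200               # upper bound on parallel senders
      capacity: 2500                # buffered samples per shard
```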


About the author:

Abhi Khanna

Abhi Khanna is a Senior Product Manager at AWS specializing in Amazon Managed Service for Prometheus. He has been involved with observability products for the last 3 years, helping customers build towards more perfect visibility. He enjoys helping customers simplify their monitoring experience. His interests include software engineering, product management, and building things.