AWS & OpenTelemetry strikes back: Tail-based Sampling

Marcin Sodkiewicz
Sep 14, 2023

Find out what tail sampling is and how to set up OpenTelemetry collectors to work with it in gateway mode at scale.

This article won’t cover OpenTelemetry basics; it is a continuation of the previous article about setting up OpenTelemetry on AWS, which you can find here: https://medium.com/@sodkiewiczm/aws-opentelemetry-e553de8cadf

Why would I need tail-based sampling?

Before we look at what tail-based sampling is, let’s consider what problem it solves and what other sampling strategies we have at our disposal.

Head (stateless) sampling

The most popular head sampling strategy is probabilistic sampling, which is dreadfully simple. It decides whether or not to sample your traces by “rolling the dice”, so you define your sampling as: “I want to sample x% of all traces”. We love it for its simplicity, and it is exactly the sampling strategy we used in the previous article.
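
For comparison, here is a minimal sketch of what probabilistic head sampling can look like on the collector side, using the probabilistic_sampler processor. The 10% value and the pipeline layout are placeholders for illustration, not the exact setup from the previous article:

processors:
  probabilistic_sampler:
    # Stateless head sampling: a hash of the traceId decides; no trace state is kept
    sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]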

At some point you realise that a system overview works best with metrics, because metrics are not sampled and therefore give you a “real view” of the system. In that light, keeping a plain sample of all your traces is useful, but only up to a point, especially when you have high-volume traffic and not that many “interesting” (buggy) traces. What is more, observability systems are priced by the volume of telemetry you send. So questions start popping into your head:

  • “What if I would like to see more error traces?”
  • “What if I would like to sample some app / group of apps at much higher rate?”
  • “What if I would like to sample only slow traces?”

Yes! You guessed it. The answer to all of those questions is tail sampling.

Tail (stateful) sampling

As you can imagine, tail sampling can make its decision after the trace has been collected, because it is stateful. You can build more complex sampling decisions using the Tail Sampling Processor, with rules based on e.g.:

  • latency
  • attributes (strings, numbers, booleans)
  • probability (like in head sampling)
  • trace status (errors)
  • rate limit

and, what is more, you can combine all of those together. Sounds great! How do you start using it? Can you just swap the sampling strategy and be done with it?
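
To give a taste before the full configuration later on, here is a hedged sketch of a latency rule plus a rate limit. The policy names and numbers are made up for illustration; top-level policies are evaluated independently and a trace is kept if any of them matches, while the and policy used in the full example lets you AND conditions together:

processors:
  tail_sampling:
    policies:
      # Keep traces slower than 2 seconds
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000
      # Independently cap the overall volume of sampled spans
      - name: rate-limit-policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 100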

Architecture: Starting point

Let’s assume we start with the infrastructure from the previous article, as shown below:

Why do we have to change it?

I have mentioned before that tail sampling is stateful: it collects a trace with all of its spans and only then makes a decision.

What happens if we just update our definition to the one that uses tail sampling? Requests will flow into our ELB and be sent to one of the collector instances behind it. And here is the whole problem: all spans for the same traceId need to be collected by the same collector instance, otherwise we won’t be able to make the right decision.

Is there any way to do that with an ELB? Not really. We need to change our architecture.

Collecting spans across collector instances using an ELB

Architectural changes

Since v0.32.0 of ADOT, a new type of exporter is available: the loadbalancing exporter. It routes incoming spans to downstream collector instances based on a routing key (traceId or service). So we have to add a new layer of load-balancing collectors in front of our current set of collectors. They receive requests and, based on the routing_key, forward spans directly to the proper collector node.

Collecting spans with a load balancing collector layer

Collector nodes are discovered based on:

  • static IPs list
  • DNS name
  • k8s

Now, this is important: the load-balancing collectors must route traffic directly to the collector nodes. If we put an ELB in between, the exporter will discover the IPs of the load balancer nodes instead of our collector nodes.

How can we achieve this easily? In our case we are running ECS with Fargate, so we can use the great ECS features, ECS Service Discovery or ECS Service Connect, and get DNS routing based on a service name as sophisticated as “otel-collector”. It works in a very similar way on k8s. So our architecture is slightly updated to the one in the diagram below:

Architecture with an additional load balancing collector layer

Example configurations

Simple load balancing configuration:

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  loadbalancing:
    protocol:
      otlp:
        compression: gzip
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector.domain.com
        port: 4317
        interval: 15s
extensions:
  health_check:
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
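
One detail worth calling out: which backend collector receives a span is decided by routing_key. The default is traceID, which is exactly what tail sampling needs; service is the other option (useful e.g. for span metrics). A hedged sketch of how the setting fits into the config above:

exporters:
  loadbalancing:
    # traceID (the default) guarantees all spans of a trace reach the same collector
    routing_key: traceID
    protocol:
      otlp:
        compression: gzip
    resolver:
      dns:
        hostname: otel-collector.domain.com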

Simple tail sampling policy with different sampling for errors and regular traces:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  memory_limiter:
    check_interval: 15s
    limit_percentage: 60
    spike_limit_percentage: 20
  batch:
    send_batch_size: 2000
    timeout: 10s
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 20000
    policies:
      - name: high-volume-apps-errors-sampling-policy
        type: and
        and:
          and_sub_policy:
            - name: error-policy
              type: status_code
              status_code:
                status_codes:
                  - "ERROR"
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
            - name: high-volume-apps-prefix-policy
              type: string_attribute
              string_attribute:
                key: service.name
                values:
                  - high-volume-apps*
                enabled_regex_matching: true
      - name: high-volume-apps-sampling
        type: and
        and:
          and_sub_policy:
            - name: high-volume-apps-prefix-policy
              type: string_attribute
              string_attribute:
                key: service.name
                values:
                  - high-volume-apps*
                enabled_regex_matching: true
            - name: high-volume-apps-probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 1
      - name: low-volume-apps-errors-sampling-policy
        type: and
        and:
          and_sub_policy:
            - name: error-policy
              type: status_code
              status_code:
                status_codes:
                  - "ERROR"
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 100
            - name: low-volume-apps-prefix-policy
              type: string_attribute
              string_attribute:
                key: service.name
                values:
                  - low-volume-apps*
                enabled_regex_matching: true
      - name: low-volume-apps-sampling
        type: and
        and:
          and_sub_policy:
            - name: low-volume-apps-prefix-policy
              type: string_attribute
              string_attribute:
                key: service.name
                values:
                  - low-volume-apps*
                enabled_regex_matching: true
            - name: low-volume-apps-probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
exporters:
  otlp:
    endpoint: https://otlp.nr-data.net:4317
    headers:
      api-key: ${YOUR_NR_KEY}
    compression: gzip
extensions:
  health_check:
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Things to watch out for

Exporting metrics

Originally, the load balancing exporter did not support metrics (which totally makes sense), so metrics had to be exported to your telemetry provider from the LB collector layer. That is no longer the case: exporting metrics is finally supported! :)

Scaling collector: DNS

During collector scaling, the list of discovered IPs may become outdated, so it’s important not to use long TTLs on DNS records.
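On the exporter side, how quickly that list is refreshed is controlled by the dns resolver settings; a hedged sketch (the values are illustrative, not a recommendation):

exporters:
  loadbalancing:
    resolver:
      dns:
        hostname: otel-collector.domain.com
        # Re-resolve often so IPs of scaled-out (or scaled-in) collectors are picked up quickly
        interval: 15s
        timeout: 3s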

Scaling collector: Dynamics

During collector scaling we will face typical rebalancing issues. It’s good practice to limit the number of scaling operations.

False Telemetry

You have to remember that if you come up with tail sampling rules that only collect traces with errors or only slow traces, then all your telemetry will be wrong. In most APM systems you will have service maps that show a completely distorted reality.

LoadBalancing Exporter Settings

It’s important to configure retries and queues in your loadbalancing exporter setup according to: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper
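
As a hedged sketch, assuming the standard exporterhelper settings are nested under protocol.otlp (the numbers are placeholders to tune for your own traffic):

exporters:
  loadbalancing:
    protocol:
      otlp:
        compression: gzip
        # Retry failed batches with exponential backoff instead of dropping them
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
        # Buffer batches in memory while a backend collector is temporarily unavailable
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 5000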

Tail Sampling Processor Settings

It’s important to set the tail sampling processor parameters according to your system requirements and characteristics. Take a look at the Tail Sampling Processor docs, which describe the setup of:

decision_wait (default = 30s): Wait time since the first span of a trace before making a sampling decision

num_traces (default = 50000): Number of traces kept in memory

expected_new_traces_per_sec (default = 0): Expected number of new traces (helps in allocating data structures)
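
Putting those three together, a hedged tuning sketch (the numbers below are illustrative placeholders, to be derived from your own span volume and memory budget):

processors:
  tail_sampling:
    # Buffer spans for this long before evaluating policies
    decision_wait: 10s
    # Upper bound on traces held in memory at once; the main memory driver
    num_traces: 100000
    # Pre-sizes internal data structures for the expected load
    expected_new_traces_per_sec: 20000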

Summary

We are really happy with our current setup. It gives us flexibility that was not possible with any other solution. We are now able to improve the way we collect traces and pay for the data that really matters.

Cons:

  • More complex infrastructure
  • More expensive collector instances: stateful machines consume more resources, and we need an additional layer of machines

Pros:

  • Cheaper: storing less data in your observability provider will most likely outweigh the cost of the extra infrastructure
  • Paying for more relevant data
  • Flexibility in sampling
