AWS Open Source Blog

Adding StatefulSet support in the OpenTelemetry Operator

In this post, AWS software development engineer interns Huy Vo and Iris Song share their experience adding StatefulSet support to the OpenTelemetry Operator and their design approach to building a scrape target update service into the OpenTelemetry Collector’s Prometheus receiver.

OpenTelemetry (OTEL) is a popular open source framework for observability. It provides a set of APIs and libraries that standardize how to collect and transfer telemetry data, including metrics, traces, and logs. OpenTelemetry also offers a secure, vendor-agnostic framework for instrumentation, so the data can be sent to the service endpoint of the user’s choice.

In this article, we’ll explain how we added StatefulSet support to the OpenTelemetry Operator, and describe our design choices and lessons learned.

Auto scaling the OpenTelemetry Collector

Large Kubernetes clusters usually contain many instrumented service instances to be scraped by Prometheus. To efficiently scrape metrics from a large cluster, target allocation is required to distribute the workload evenly across Collectors. Instrumented applications are often scaled automatically by Kubernetes controllers, which makes it impractical for the user to manually specify the sharding configuration.
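
To make the sharding problem concrete, the following is a hypothetical Prometheus receiver configuration (the job name and target addresses are made up) in which one Collector’s share of the targets is hard-coded. Every time the application scales up or down, each Collector’s static list would have to be edited by hand:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-app'            # hypothetical job name
          static_configs:
            # Targets manually assigned to this Collector instance;
            # scaling the application means editing this list on every Collector.
            - targets: ['10.0.1.10:8080', '10.0.1.11:8080']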

There is also an ongoing effort to enable the OpenTelemetry Operator (a Kubernetes operator managing the Collector) to discover instrumented services automatically, manage the Collector instances accordingly, and notify Prometheus receivers to scrape the updated targets.

The current approach can be split into the following five parts:

  1. Enable the OpenTelemetry Operator to manage StatefulSet resources.
  2. Enable the OpenTelemetry Operator to perform Prometheus scrape config extraction and replacement in OpenTelemetry Collector config.
  3. Build scrape target update receiver into the OpenTelemetry Collector’s Prometheus receiver.
  4. Build scrape target observer that can distribute discovered targets among a set of OpenTelemetry Collector instances using the scrape target update receiver.
  5. Notify the scrape target observer during StatefulSet reconciliation when state has changed, so that it can react to the new replica count.

Our internship project goals were to complete two of the five tasks: to enable the OpenTelemetry Operator to manage StatefulSet resources, and to build the scrape target update receiver into the OpenTelemetry Collector’s Prometheus receiver.

Introduction to OpenTelemetry Operator

Kubernetes Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. The Operator runs in a pod on the cluster, interacts with the Kubernetes API server, watches for the predefined custom resource types, and is notified about their presence or modification.

The OpenTelemetry Operator is an implementation of a Kubernetes Operator; although it can manage multiple components, it currently only manages the OpenTelemetry Collector. There are three workload constructs that culminate in the creation of pods: Deployment, DaemonSet, and StatefulSet. The OpenTelemetry Operator already supported management of Deployment and DaemonSet, but support for StatefulSet had not yet been implemented.

StatefulSet is the Kubernetes workload object used to manage stateful applications. It manages the deployment and scaling of a set of pods and provides guarantees about the ordering and uniqueness of these pods. StatefulSet also maintains a sticky identity for each of the pods. These pods are created from the same spec but are not interchangeable; each pod has a persistent identifier that it maintains across any rescheduling.

When starting a Collector as a DaemonSet, we can select resources that are on the same node, which allows us to monitor any resource that can be isolated to a single node. It does not work as well for resources that are not associated with a single node, such as Service, Endpoint, and Ingress objects: every node-local Collector watches them, so the scraping work grows with the number of nodes multiplied by the number of such resources.

This problem can be solved by running the Collector as a StatefulSet, which better supports non-node-local resources. This is almost always the case when we want to collect users’ application metrics instead of host metrics.

Figure 1: System flow from user configuration to the OpenTelemetry Collector managed by the Operator.

In the system flow shown in Figure 1, a user configures a custom resource (an instance of the OpenTelemetryCollector CRD) in Kubernetes, and the deployed Operator watches this resource for updates. If a change is required to reach the desired state, the Operator reconciles the configuration change and sends updates to the StatefulSet through the Kubernetes API.

Design for Operator enhancement

To allow the OpenTelemetry Operator to manage StatefulSet resources, we added a new mode. This mode lets users deploy the Operator in their clusters to create OpenTelemetry Collectors running as a StatefulSet, in addition to the existing options: DaemonSet, Deployment, and Sidecar.

If the Collector resource has not already been created, the Operator will send the configuration for the StatefulSet using the Kubernetes API for it to be provisioned as desired. If the resource already exists and the user decides to make a change (for example, they want five replicas as opposed to three), then the Operator will observe this change within the CustomResource and then send the configuration difference using the Kubernetes API for the update to be implemented.

After confirming the requirements for adding StatefulSet support to the OpenTelemetry Operator, we moved on to the design of the enhancement. We started with a diagram of the data flow for the StatefulSet implementation, as shown in Figure 2.

Figure 2: Flow chart describing the StatefulSet implementation.

As shown in this data flow, the Operator requires an implementation of the reconcile package for StatefulSet in order to create, update, and delete the StatefulSet object using the Kubernetes API.

A user runs a kubectl apply command to create an OpenTelemetryCollector instance through the OpenTelemetry Operator; the full command is shown in the code example later in this section.

To support this configuration, we added a new value, statefulset, for the spec.Mode field, and a new field, spec.VolumeClaimTemplates, to the Custom Resource Definition (CRD).

The spec.Mode allows users to specify how they want the OpenTelemetry Collector to be deployed (Deployment, DaemonSet, Sidecar, etc.), and the spec.VolumeClaimTemplates field allows users to specify their own Persistent Volume for their StatefulSet. Because only StatefulSet requires Persistent Volumes, we also implemented error handling within the webhook configuration to only allow VolumeClaimTemplates to be specified if the mode is statefulset.
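
To illustrate that validation rule, here is a hypothetical spec (the name and values are made up) that the webhook is intended to reject, because it specifies volumeClaimTemplates while the mode is not statefulset:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: invalid-example
spec:
  mode: deployment          # not statefulset, so the volumeClaimTemplates below should be rejected
  volumeClaimTemplates:
    - metadata:
        name: test-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi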

The preceding changes to the CRD and the addition of StatefulSet support to the Operator Controller/Reconciler are now available. With these changes in place, users can create and manage a StatefulSet resource as shown in the following code example:

kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: stateful
spec:
  mode: statefulset
  volumeMounts:
    - mountPath: "/usr/share/test-volume"
      name: test-volume
  volumeClaimTemplates:
    - metadata:
        name: "test-volume"
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "standard"
        resources:
          requests:
            storage: 1Gi
  replicas: 1
  config: |
    receivers:
      jaeger:
        protocols:
          grpc:
    processors:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: []
          exporters: [logging]
EOF

opentelemetrycollector.opentelemetry.io/stateful created 

After a stateful object is created, it can be verified to be running with the following command:

kubectl get statefulset 
NAME                 READY   AGE
stateful-collector   1/1     2m1s

And now the Operator supports StatefulSet resources.

Testing strategy

The various components of the implementation for StatefulSets were tested by multiple unit tests to verify basic functionality and edge cases. We also used the Kubernetes test tool (kuttl) to create an end-to-end test to help us ensure that the entire process of creating StatefulSet Collectors from user input is correct.

Unit test

We added a test to verify that the StatefulSets function in the Collector package correctly parses a Collector configuration in the CustomResource and converts it to a StatefulSet instance. This test covers all of the new fields added to the CustomResource, especially in the OpenTelemetryCollectorSpec. We also included a test to verify that the StatefulSets function in the reconcile package correctly reads the Collector configuration from Params and makes the corresponding changes back to Params.

We also added a unit test for the expectedStatefulSets method. The test passes in multiple StatefulSet specifications, some of which overlap with those already present in Params.Client. After calling expectedStatefulSets, the test asserts that new StatefulSets have been created in Params.Client and that existing ones are correctly updated.

End-to-end test

We use kuttl to verify the Operator’s ability to create StatefulSets. With this tool, we can provision Collector instances in a test environment and assert that they were created with the desired specifications (for example, the correct namespace and an attached persistent volume). We also verify that when a change is made to the CustomResource, the Operator correctly applies the update to reach the expected state specified by the developer. For error handling, the StatefulSet configurations are tested with different kinds of inputs to make sure that valid inputs are accepted and that undesired inputs result in an error.

For the test environment setup, kuttl starts a kind test cluster (kind is a tool for running local Kubernetes clusters using Docker container nodes) and runs several commands to install the OpenTelemetry Operator and start the Operator controller manager. This setup prepares the cluster to accept the OpenTelemetryCollector CRD and start Collectors. kuttl then runs all test cases under ./tests/e2e in the repository.
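
For reference, a kuttl test suite like this is typically driven by a kuttl-test.yaml file; the sketch below is a simplified, assumed version of such a file, and the exact install commands in the repository may differ:

apiVersion: kuttl.dev/v1beta1
kind: TestSuite
startKIND: true          # provision a throwaway kind cluster for the test run
commands:
  # Assumed placeholder commands that install the Operator and wait for its controller manager.
  - command: kubectl apply -f dist/opentelemetry-operator.yaml
  - command: kubectl rollout status -w deployment/opentelemetry-operator-controller-manager -n opentelemetry-operator-system
testDirs:
  - ./tests/e2e
timeout: 150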

Originally, there were two test cases, as follows:

  1. smoke-simplest, which tests the most basic functionality of the operator
  2. smoke-sidecar, which tests that the Operator can start Collectors in sidecar mode

For each test case, the installation YAML files are run first and then the test framework gets the Kubernetes objects from the test cluster and verifies that they match the expectation specified in the assertion YAML file.

For StatefulSet testing, a new test case is added to the test directory. The following example shows the test setup file 00-install.yaml, where the mode of OpenTelemetryCollector will be set to statefulset:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: stateful
spec:
  mode: statefulset
  config: |
    receivers:
      jaeger:
        protocols:
          grpc:
    processors:
    exporters:
      logging:
    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: []
          exporters: [logging]
  volumeMounts:
    - name: testVolume
      mountPath: /usr/share/testVolume

The assertion is specified in 00-assert.yaml to assert that there will be a StatefulSet resource and a PersistentVolume (we can assert other resources as well) with the correct configuration corresponding to the test setup.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful
status:
  readyReplicas: 1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: testVolume
spec:
  storageClassName: "test-storage-class"
  ...

Building scrape target update mechanism in OpenTelemetry Collector’s Prometheus receiver

Another part of this project involved building a scrape target update mechanism for the Prometheus receiver. A main use case for this feature is OpenTelemetry Collector load balancing: automatically adjusting OpenTelemetry Collectors based on the number of Prometheus targets or the load on the Collectors in Kubernetes. When the instrumented services scale up or down, the OpenTelemetry Operator needs to discover the new Prometheus scrape targets and then use the scrape target update endpoint to inform the Collectors.

With the help of the scrape target update service, the Prometheus receiver in the Collector no longer has to stick to a fixed set of targets. Instead of providing the targets at startup, the Operator passes HTTP service discovery configs, which specify how the receiver should retrieve target information from HTTP servers.

Later, when the Operator needs to update the scrape targets, it publishes the updated target information on the HTTP server it started. The discovery manager built into the Prometheus receiver retrieves this information through periodic GET requests to the server, with the retrieval interval configurable at startup. The Operator can therefore coordinate among different OTEL Collectors and assign work dynamically, as shown in Figure 3.

Figure 3: Operator can coordinate among different OTEL Collectors and assign the work dynamically.
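
As a minimal sketch of what this could look like on the Collector side, the Prometheus receiver can be given Prometheus’s standard http_sd_configs instead of static targets. The endpoint URL below is a hypothetical placeholder for wherever the Operator publishes target information; the endpoint is expected to return the standard HTTP service discovery JSON body (a list of objects with "targets" and "labels" fields).

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dynamic-targets'
          http_sd_configs:
            # Hypothetical endpoint where the Operator publishes the current target list.
            - url: http://scrape-target-service:8080/targets
              refresh_interval: 60s   # how often the discovery manager pulls updated targets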

Prometheus receiver enhancement

For the Prometheus receiver, we had to take many design considerations into account. Our first design built a server that listens for target update requests directly into the Prometheus receiver. After bringing this design to the Prometheus Working Group meeting, however, we received constructive feedback about the stability of this model: building the service directly into the receiver raised concerns for future development. This feedback led us to a second design, in which we create a custom service discovery mechanism that integrates easily with the Prometheus receiver and can still update the list of scrape targets for a specific job.

Push and pull

For the second version of the design, we explored two different models:

  • A push model, in which we start up a server that serves an endpoint and push target requests to it
  • A pull model, in which we periodically make GET requests to an endpoint for a list of updated scrape targets

Both models have pros and cons, and we ultimately decided that a pull model would better fit our scope.

The push model, although easy to regulate, requires a handshake with the service endpoint. The pull model is less complicated, but it requires a way to identify each Collector when pulling targets. To solve this problem in the pull model, we investigated whether we could use the name of the pod that a Collector instance runs in as the identifier and pass it into the Collector configuration through environment variables.
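
A minimal sketch of that identifier plumbing, using the standard Kubernetes Downward API in the Collector pod template; the variable name POD_NAME and the query parameter are assumptions for illustration:

# In the StatefulSet pod template: expose the pod's own name to the Collector container.
env:
  - name: POD_NAME                    # assumed variable name
    valueFrom:
      fieldRef:
        fieldPath: metadata.name      # e.g., stateful-collector-0, stateful-collector-1, ...

# The Collector configuration could then reference it, for example in the HTTP SD URL:
#   url: http://scrape-target-service:8080/targets?collector=${POD_NAME}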

This design works for our use case. We also evaluated http_sd_config, Prometheus’s HTTP-based service discovery mechanism, and found a way to remotely update the list of scrape targets for a specific Collector instance. This enhancement has been completed and is now available in the OpenTelemetry Collector’s Prometheus receiver.

Conclusion

Our efforts in this project helped us get closer to our goal of OpenTelemetry Operator Prometheus Target load balancing. We were able to successfully add StatefulSet support to the OpenTelemetry Operator and design the approach for adding a scrape target update service to the Prometheus receiver.

In this project, we learned how to develop software to industry standards. Throughout the process, we communicated and collaborated with other engineers, which helped us better understand how the system works and why the new features were needed. Gathering requirements and reviewing them with various stakeholders enabled us to better understand the use cases and identify a viable design. With our detailed design mapped out, the implementation was straightforward. We also gained experience contributing to open source projects, adding high-quality code and getting continuous feedback by interacting with the maintainers.

We look forward to contributing more to the OpenTelemetry community and other cloud technologies in the future.

Huy Vo

Huy Vo is a Junior at Drexel University studying Computer Science. He is currently an SDE Intern at AWS working on the OpenTelemetry project.

Iris Song

Iris Song is a Master’s student studying Computer Science at Northeastern University. She is an SDE intern at AWS and is interested in observability.

Alolita Sharma

Alolita is a senior manager at AWS, where she leads open source observability engineering and collaboration for OpenTelemetry, Prometheus, Cortex, and Grafana. Alolita is co-chair of the CNCF Technical Advisory Group for Observability, a member of the OpenTelemetry Governance Committee, and a board director of the Unicode Consortium. She contributes to open standards at OpenTelemetry, Unicode, and W3C. She has served on the boards of the OSI and SFLC.in. Alolita has led engineering teams at Wikipedia, Twitter, PayPal, and IBM. Two decades of doing open source continue to inspire her. You can find her on Twitter @alolita.