Adding AWS X-Ray support to the OpenTelemetry PHP library

In this blog post, Oliver Hamuy, an engineering intern on the AWS observability team, shares his internship experience on a project to enhance the OpenTelemetry PHP SDK by adding support for AWS X-Ray. Please note that the OpenTelemetry PHP SDK is still in development and currently in an alpha state. We’ve successfully tested the X-Ray pipeline for simple tracing using a sample app. However, ongoing development tasks could benefit from more PHP engineers contributing to the OpenTelemetry project, and there is a backlog of tasks you could help with to bring the SDK to beta. Note that this functionality will not be available in the downstream AWS Distro for OpenTelemetry (ADOT) until a later release.

Many of today’s production applications were created with or have adopted a distributed application model. As applications become increasingly complex, tracking processes and locating performance issues and errors becomes more difficult. Observability gives us the ability to measure the current state of an application by examining its outputs through metrics, traces, and logs.

Introduction

OpenTelemetry (OTEL) is an open source Cloud Native Computing Foundation (CNCF) project for collecting telemetry data that measures application performance and behavior. OTEL provides a set of APIs and SDKs that extract signals such as metrics and traces from applications and send them to a backend of choice for processing. It standardizes how telemetry data is transmitted to backend platforms. Applications therefore share a common instrumentation layer, and engineers do not need to re-instrument their application for each backend platform.

AWS X-Ray is one such backend platform. It helps developers analyze and debug distributed applications, including those built with a microservices architecture. AWS X-Ray shows how requests travel through an application, collecting data as traces and aggregating them into a map of the application’s underlying components.

By default, OpenTelemetry exports data using the OpenTelemetry Protocol (OTLP), which specifies the encoding, transport, and delivery of telemetry data. AWS X-Ray, however, has its own data format that exported traces must follow. AWS X-Ray support already exists in the OpenTelemetry SDKs for Go, JavaScript, Java, .NET, and Python.

Many production applications, however, continue to use PHP. Many OTEL users have also been requesting a GA release of the PHP library, and AWS customers have been pushing for support of AWS services within it. Adding AWS X-Ray support to the PHP SDK was meant to fulfill exactly these requests. It gives AWS users a direct way to instrument and test their applications using the PHP library.

My project enhanced the OpenTelemetry PHP SDK by adding support for AWS X-Ray. It also added resource detection for Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda. Adding this functionality required substantial work in the core code of the OTEL PHP library to fix and improve the parts needed to support AWS X-Ray. I also built two sample applications that demonstrate how to use these components and visualize traces in AWS X-Ray.

Project overview

When a developer’s application is instrumented using the AWS Distro for OpenTelemetry (ADOT), the OpenTelemetry library will send the trace data collected from the various microservices to AWS X-Ray, where it will be displayed to users to be analyzed. Figure 1 shows a high-level view of how an application is instrumented and traces are sent down the pipeline to AWS X-Ray.

Figure 1: A high-level view of pipeline.

Distributed tracing forms the foundation of this project. Distributed tracing follows a single request as it propagates through multi-service architectures, such as microservices and serverless applications. The recorded path of each request is a trace. Each trace is made up of spans, and a span tracks the work done by an individual service. A span records request, error, and duration data that can be used to debug availability and performance issues. Every span except the first has a parent span, so a trace is simply a tree of spans that delineates the flow of a request.
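
To make the tree structure concrete, here is a minimal sketch in Python. The post's components are written in PHP, and Python is used here only for illustration. The Span class, its fields, the shortened span IDs, and the sample timings are all hypothetical. Each span points to its parent, every span carries the same trace ID, and walking from the root reproduces the flow of the request.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_tree(spans: List[Span]) -> Tuple[Optional[Span], Dict[str, List[Span]]]:
    """Find the root span and index every other span under its parent's ID."""
    root: Optional[Span] = None
    children: Dict[str, List[Span]] = {}
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(span: Span, children: Dict[str, List[Span]], depth: int = 0) -> List[str]:
    """Return one indented line per span, walking the tree depth-first."""
    marker = " ERROR" if span.error else ""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms){marker}"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


TRACE_ID = "5759e988bd862e3fe1be46a994272793"  # shared by every span in the trace
spans = [
    Span("user request", TRACE_ID, "a1", None, 0, 120),
    Span("service 1", TRACE_ID, "b2", "a1", 5, 60),
    Span("service 2", TRACE_ID, "c3", "b2", 10, 55),
    Span("service 3", TRACE_ID, "d4", "a1", 62, 115, error=True),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

The rendered output shows each span's duration at its depth in the tree, and the failing span is flagged. This is the kind of view that makes availability and performance issues easier to locate.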

Figure 2 shows the steps that occur when you make a request. The request sets off a chain of microservice calls that together fulfill the original request. For example, a shopping website might call a payment-validation microservice.

Figure 2: What happens when you make a request. The user’s request goes to Service 1 and Service 3. Service 1 calls Service 2, which accesses Database 1, and Service 3 accesses Database 2 and calls Service 4.

Two main components are needed to deliver traces from OTEL to AWS X-Ray: the AWS X-Ray propagator and the AWS X-Ray ID generator. To populate resource information, I also created four detectors: the ECS, EC2, EKS, and Lambda detectors.

AWS X-Ray ID generator

The first component I created was the AWS X-Ray ID generator. In the context of tracing, an ID generator produces the random value used as the ID of a trace (that is, the trace ID). The trace ID is assigned to the root span (the first span created in a request) and is unique to that entire trace tree. As a result, every span descended from the root span carries the same trace ID.

OpenTelemetry’s default ID generator follows the W3C Trace Context format and produces a random 32-character lowercase hexadecimal string. AWS X-Ray uses a different trace ID format, so the default implementation had to be overridden to meet X-Ray’s requirements.

Like the default, an AWS X-Ray trace ID is a 32-digit hexadecimal number. However, its first eight digits encode the time the trace was created, taken as Unix epoch time in seconds and converted to hexadecimal. The remaining 24 digits are randomly generated. Figure 3 illustrates how an AWS X-Ray trace ID is created.

Figure 3: A visualization of AWS X-Ray trace ID generation. When a trace is created, the current timestamp is converted to hexadecimal and combined with a randomly generated 24-digit hexadecimal string. The final trace ID is 32 hexadecimal digits long.
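
The generation steps in Figure 3 can be sketched in a few lines of Python. This is an illustration, not the PHP SDK’s implementation. The function name generate_xray_trace_id is hypothetical, and so is its now parameter, which exists only so the timestamp can be pinned for a check. As a sanity check, the timestamp digits 5759e988 in the sample X-Ray header shown later in this post decode to the epoch second 1465510280.

```python
import secrets
import time
from typing import Optional


def generate_xray_trace_id(now: Optional[float] = None) -> str:
    """Return a 32-hex-digit trace ID laid out the way AWS X-Ray expects.

    The first 8 hex digits are the creation time in Unix epoch seconds;
    the remaining 24 hex digits (96 bits) are random.
    """
    epoch_seconds = int(time.time() if now is None else now)
    timestamp_hex = format(epoch_seconds, "08x")  # zero-padded, lowercase
    random_hex = secrets.token_hex(12)            # 12 random bytes -> 24 hex digits
    return timestamp_hex + random_hex


trace_id = generate_xray_trace_id(now=1465510280)
print(trace_id[:8])  # 5759e988
```

Because the result is still 32 lowercase hexadecimal characters, it also satisfies the W3C trace ID shape. That property is what lets X-Ray-style IDs travel through the rest of the OpenTelemetry pipeline unchanged.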

With OpenTelemetry, you can specify which ID generator the TracerProvider uses. The same instrumentation can then produce trace IDs for AWS X-Ray or for other backend services.

AWS X-Ray propagator

The AWS X-Ray propagator formats trace context into HTTP headers that follow the AWS X-Ray trace header format. The propagator is configured inside a Tracer object and carries context across process boundaries. It bundles the context and transfers it between services, usually through HTTP headers. The context it carries is the SpanContext, which identifies the current span (its trace ID and span ID) along with other metadata such as the sampling decision.

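By default, OpenTelemetry uses the W3C Trace Context header format. The following snippet shows an example W3C traceparent header, which is made up of a version, the trace ID, the parent span ID, and trace flags. An optional tracestate header can carry vendor-specific data. Compare this with the AWS X-Ray format that follows.

traceparent: 00-5759e988bd862e3fe1be46a994272793-53995c3f42cd8ad8-01
tracestate: <optional vendor-specific entries>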

The following snippet shows an example AWS X-Ray trace header. Its Root field carries the same trace ID as traceparent, written as a version number (1), the 8-digit timestamp, and the 24-digit random part, separated by hyphens. Parent is the parent span ID, and Sampled is the sampling decision.

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

Most importantly, the propagator injects the trace context into outgoing requests and extracts it from incoming ones as a request moves from one microservice to the next. The propagator is the engine behind distributed tracing, because it lets a trace be followed across every microservice used within an application.
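
Here is a Python sketch of injection and extraction with the X-Ray header. It is not the PHP propagator’s code. The function names inject_xray and extract_xray, and the plain dict used as a header carrier, are hypothetical. The sketch converts between the 32-digit OpenTelemetry trace ID and the Root=1-<timestamp>-<random> form. It performs only basic length checks, not the full validation a production propagator would need.

```python
from typing import Dict, Optional

HEADER = "X-Amzn-Trace-Id"


def inject_xray(trace_id: str, span_id: str, sampled: bool, headers: Dict[str, str]) -> None:
    """Write the X-Ray trace header for an outgoing request.

    trace_id is the 32-hex-digit OpenTelemetry form; X-Ray expects
    "1-<8-digit timestamp>-<24-digit random part>".
    """
    root = f"1-{trace_id[:8]}-{trace_id[8:]}"
    headers[HEADER] = f"Root={root};Parent={span_id};Sampled={'1' if sampled else '0'}"


def extract_xray(headers: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Parse an incoming X-Ray trace header; return None if missing or malformed."""
    # HTTP header names are case-insensitive, so compare them that way.
    value = next((v for k, v in headers.items() if k.lower() == HEADER.lower()), None)
    if value is None:
        return None
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    pieces = fields.get("Root", "").split("-")
    if len(pieces) != 3 or pieces[0] != "1" or len(pieces[1]) != 8 or len(pieces[2]) != 24:
        return None
    parent = fields.get("Parent", "")
    if len(parent) != 16:
        return None
    return {
        "trace_id": pieces[1] + pieces[2],  # rejoin into the 32-digit form
        "parent_span_id": parent,
        "sampled": fields.get("Sampled") == "1",
    }


headers: Dict[str, str] = {}
inject_xray("5759e988bd862e3fe1be46a994272793", "53995c3f42cd8ad8", True, headers)
print(headers[HEADER])
```

The printed value matches the example X-Ray header above. Extracting it on the receiving side restores the original 32-digit trace ID, so the next service can continue the same trace.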

Figure 4 shows which components of the OpenTelemetry SDK work alongside the propagator to support distributed tracing. A trace begins at the TracerProvider, which creates a Tracer that generates spans. The span context is propagated to each microservice, and the resulting spans are processed and sent to the Collector for export to AWS X-Ray.

Figure 4: Trace propagation.

We begin with the TracerProvider, a class that creates Tracers and specifies the IdGenerator to use, along with other details about the system. The Propagator is configured inside a Tracer object created through the TracerProvider. The propagator then carries context (the metadata about the trace) across each boundary through injection and extraction.
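
The division of labor described here can be sketched in Python: the provider owns the ID generator, and every span in a trace reuses the root span’s trace ID. This is a simplified, hypothetical model, not the PHP SDK’s API. MiniTracerProvider, RandomIdGenerator, and XRayIdGenerator are illustrative names, and the Tracer layer is folded into the provider for brevity.

```python
import secrets
import time
from typing import Dict, Optional, Protocol


class IdGenerator(Protocol):
    def generate_trace_id(self) -> str: ...
    def generate_span_id(self) -> str: ...


class RandomIdGenerator:
    """W3C-style default: fully random 32-hex-digit trace IDs."""

    def generate_trace_id(self) -> str:
        return secrets.token_hex(16)

    def generate_span_id(self) -> str:
        return secrets.token_hex(8)  # 16 hex digits


class XRayIdGenerator(RandomIdGenerator):
    """X-Ray-compatible: epoch-seconds prefix plus 24 random hex digits."""

    def generate_trace_id(self) -> str:
        return format(int(time.time()), "08x") + secrets.token_hex(12)


class MiniTracerProvider:
    """Hypothetical provider that owns the ID generator and starts spans."""

    def __init__(self, id_generator: Optional[IdGenerator] = None) -> None:
        self.id_generator = id_generator or RandomIdGenerator()

    def start_span(self, name: str, parent: Optional[Dict[str, Optional[str]]] = None) -> Dict[str, Optional[str]]:
        # Only a root span gets a new trace ID; children inherit the parent's.
        trace_id = parent["trace_id"] if parent else self.id_generator.generate_trace_id()
        return {
            "name": name,
            "trace_id": trace_id,
            "span_id": self.id_generator.generate_span_id(),
            "parent_span_id": parent["span_id"] if parent else None,
        }


# Targeting X-Ray means swapping the generator, not re-instrumenting the code.
provider = MiniTracerProvider(id_generator=XRayIdGenerator())
root = provider.start_span("checkout")
child = provider.start_span("validate-payment", parent=root)
```

Keeping ID generation behind a small interface means the rest of the span-creation code does not change when the backend does.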

In this process, a service receives the incoming context and adds new spans for its own operations. It then injects the context into its outgoing request, and the next service extracts it and continues the trace. Finished spans are processed and exported to the Collector. Altogether, the Propagator is the key component that enables distributed tracing.

Detectors

To understand the purpose of detectors, we must first cover the concept of a Resource in OpenTelemetry. A Resource describes the entity that produces telemetry data, such as a host, container, or function. We need something to extract this information from the application’s environment. A detector fills that role: it discovers whether your application is running on a particular type of service and, if so, collects metadata about it.

For example, metrics sent by a Kubernetes container can be linked to a resource that specifies the cluster, namespace, pod, and container name of that container. In Figure 5, the application is running on Amazon EKS. Because the instrumentation includes the EKS detector, the resource of any trace can be populated with metadata about the application. In this case, we can pull the containerId and clusterName to populate the resource.

Figure 5: An application with the EKS detector is able to pull the containerId and clusterName from the environment.

In this project, I implemented the Amazon ECS, EKS, EC2, and Lambda detectors. They all serve the same function, but each returns different metadata depending on the service, and their implementations differ considerably. The Amazon ECS and Lambda detectors read most of their data from environment variables that the platform sets on the application, and they read a few values from other files on the machine.
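
Here is a Python sketch of the environment-variable approach for a Lambda-style detector. It is not the PHP detector’s code. The function name detect_lambda is hypothetical, and the attribute keys follow our reading of the OpenTelemetry semantic conventions. AWS_LAMBDA_FUNCTION_NAME, AWS_REGION, and AWS_LAMBDA_FUNCTION_VERSION are environment variables that the Lambda runtime sets. The function accepts the environment as a parameter so that it can be tested without running on Lambda.

```python
import os
from typing import Dict, Mapping, Optional


def detect_lambda(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return resource attributes when running on AWS Lambda, else an empty dict."""
    env = os.environ if env is None else env
    # The Lambda runtime sets AWS_LAMBDA_FUNCTION_NAME for every function,
    # so its absence is treated as "not running on Lambda".
    function_name = env.get("AWS_LAMBDA_FUNCTION_NAME")
    if not function_name:
        return {}
    attributes = {
        "cloud.provider": "aws",
        "cloud.platform": "aws_lambda",
        "faas.name": function_name,
    }
    # Optional attributes: copy only the variables that are actually present.
    for env_key, attr_key in (("AWS_REGION", "cloud.region"),
                              ("AWS_LAMBDA_FUNCTION_VERSION", "faas.version")):
        value = env.get(env_key)
        if value:
            attributes[attr_key] = value
    return attributes
```

Returning an empty result rather than raising an error lets several detectors run in sequence, with only the one matching the current platform contributing attributes.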

The Amazon EKS and EC2 detectors both had to make multiple requests to service endpoints to verify that the application was indeed running on those services. Each first obtains a token, then verifies it, and only then extracts data through these API calls. These detectors provide vital information for users running the application and help them debug and confirm that their environments are functioning correctly.
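
The token-then-query pattern can be sketched in Python for EC2 using the instance metadata service (IMDSv2). A PUT to /latest/api/token returns a session token, which is then sent with a GET for the instance identity document. This is an illustration, not the PHP detector’s code. The function detect_ec2 and its injectable fetch parameter are hypothetical, the attribute keys follow our reading of the semantic conventions, and the EKS detector is not shown.

```python
import json
import urllib.request
from typing import Callable, Dict

IMDS_BASE = "http://169.254.169.254"
Fetch = Callable[[str, str, Dict[str, str]], str]


def urllib_fetch(method: str, url: str, headers: Dict[str, str]) -> str:
    """Default transport with a short timeout, so detection fails fast off EC2."""
    request = urllib.request.Request(url, method=method, headers=headers)
    with urllib.request.urlopen(request, timeout=1) as response:
        return response.read().decode("utf-8")


def detect_ec2(fetch: Fetch = urllib_fetch) -> Dict[str, str]:
    """Return EC2 resource attributes, or {} when not running on EC2."""
    try:
        # Step 1: obtain a short-lived session token (IMDSv2).
        token = fetch("PUT", f"{IMDS_BASE}/latest/api/token",
                      {"X-aws-ec2-metadata-token-ttl-seconds": "60"})
        # Step 2: present the token to read the instance identity document.
        raw = fetch("GET", f"{IMDS_BASE}/latest/dynamic/instance-identity/document",
                    {"X-aws-ec2-metadata-token": token})
        doc = json.loads(raw)
    except (OSError, ValueError):
        # Unreachable endpoint, timeout, HTTP error, or a non-JSON body.
        return {}
    if not isinstance(doc, dict):
        return {}
    attributes = {
        "cloud.provider": "aws",
        "cloud.platform": "aws_ec2",
        "cloud.account.id": doc.get("accountId"),
        "cloud.region": doc.get("region"),
        "cloud.availability_zone": doc.get("availabilityZone"),
        "host.id": doc.get("instanceId"),
        "host.type": doc.get("instanceType"),
        "host.image.id": doc.get("imageId"),
    }
    # Drop any field the identity document did not include.
    return {key: value for key, value in attributes.items() if value is not None}
```

Passing the transport in as a parameter keeps the detection logic testable without network access. The short timeout matters because off EC2 the metadata address simply does not answer.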

Core repository improvements

While implementing these components, I dove deeper and started contributing improvements to the core OpenTelemetry PHP library. This began out of necessity: bugs and errors in the core repository kept me from testing my components end to end.

The OpenTelemetry PHP repository is still at an early, alpha stage. I found sparse documentation, vestigial functions, and hardcoded values in parts of the core code, which gave me the opportunity to contribute. This became a significant part of my work. I wrote fixes, filed issues, and created pull requests to bring the code in line with the specification so that my components could be integrated.

One of the first issues I encountered came up when I tried to send traces to AWS X-Ray from a local sample application. The cause turned out to be hard-coded values in two classes, TracerProvider and SpanContext. I worked with the maintainers to bring my changes up to specification and then merged them, after which traces showed up in AWS X-Ray, as shown in Figure 6. This snapshot shows how the first sample app creates a request. The trace map visualizes how the request traveled, and the trace ID appears near the top under the ID section.

Figure 6: AWS X-Ray backend.

Apart from getting the core repository working, I also made performance improvements. The PHP function preg_match, which validates a value against a regular expression, is comparatively slow and is best avoided on frequently called paths such as ID validation. Using PHP benchmarking tools, I wrote a custom validation function that runs in less than a third of the time a single preg_match call takes, and I replaced all occurrences with it. Figure 7 shows the benchmark result. By the mode column, benchFunctionComboValidator (4.381 μs) ran roughly 3.5 times faster than benchPregmatchValidator (15.363 μs).

+------------------+-----------------------------+-----+------+-----+-----------+----------+--------+
| benchmark        | subject                     | set | revs | its | mem_peak  | mode     | rstdev |
+------------------+-----------------------------+-----+------+-----+-----------+----------+--------+
| IdValidatorBench | benchPregmatchValidator     | 0   | 1000 | 5   | 639.952kb | 15.363μs | ±2.56% |
| IdValidatorBench | benchFunctionComboValidator | 0   | 1000 | 5   | 639.968kb | 4.381μs  | ±0.67% |
+------------------+-----------------------------+-----+------+-----+-----------+----------+--------+
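
For readers curious what a regex-free validator looks like, here is a Python analogue. It is not the PHP function from the contribution, whose body the post does not show. It makes no performance claim, because Python’s re module is not PHP’s PCRE and its timings would say nothing about the result above. The names is_valid_trace_id and is_valid_trace_id_regex are hypothetical. Both versions accept exactly 32 lowercase hexadecimal characters and reject the all-zero ID, which the W3C Trace Context specification treats as invalid.

```python
import re

HEX_DIGITS = frozenset("0123456789abcdef")
TRACE_ID_PATTERN = re.compile(r"[0-9a-f]{32}")
INVALID_TRACE_ID = "0" * 32


def is_valid_trace_id_regex(value: str) -> bool:
    """Regex-based check, analogous to validating with preg_match."""
    # fullmatch (not match with ^...$) so a trailing newline is rejected.
    return TRACE_ID_PATTERN.fullmatch(value) is not None and value != INVALID_TRACE_ID


def is_valid_trace_id(value: str) -> bool:
    """Regex-free check: a length test plus a character-set test."""
    return (
        len(value) == 32
        and HEX_DIGITS.issuperset(value)
        and value != INVALID_TRACE_ID
    )


samples = [
    "5759e988bd862e3fe1be46a994272793",  # valid
    INVALID_TRACE_ID,                     # all zeros: invalid
    "5759E988BD862E3FE1BE46A994272793",  # uppercase: rejected by both
    "5759e988bd862e3f",                  # too short
    "5759e988bd862e3fe1be46a99427279g",  # non-hex character
]
for sample in samples:
    # The two validators should always agree.
    assert is_valid_trace_id(sample) == is_valid_trace_id_regex(sample)
```

The regex-free version replaces pattern matching with a length comparison and a set-membership test. That general idea is why such a function can beat a regex call for simple, fixed-shape inputs.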

Working on the core repository, I handled specification documents and classes that I would not otherwise have touched, and I worked closely with the small PHP team and maintainers on these issues. Some problems remain in the core repository, but its state has improved considerably, and further improvements will come as the repository continues to develop.

Sample applications

To test the components thoroughly, I created sample applications that mimic the propagation of traces through microservices. The main repository already contained a few PHP applications for simple testing. However, none of them used a propagator, which limited how deeply they could test distributed tracing.

In creating the sample applications, I had to work within the current state of the core repository. Automatic instrumentation is not yet available, so the apps had to be instrumented manually. This meant creating a tracer, generating spans, propagating contexts, and ending the spans once a request completed.

Another consideration was how easy the applications would be to test. I wanted anyone trying the alpha version of the PHP library with AWS X-Ray to be able to set them up easily. So I built both as console applications to avoid the extra work of setting up web applications.

The first sample app creates a span and then a child span, and it places the span context in an HTTP header on a request to aws.amazon.com. Figure 8 shows how the initial span and its child are created:

Figure 8: The trace map generated by sample application 1.

The second application is a more realistic example of how an application calls different services. A main application calls two microservices, Service1 and Service2. Before calling either service, the main application injects the span context into a carrier that is passed to the service. Each service then extracts the context from the carrier and creates a new child span from it.

After each service finishes, its child span is ended, and the root span is then ended in the main application. Figure 9 shows how these requests are mapped out:

Figure 9: The trace map generated by sample application 2.
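
The flow of the second sample app can be sketched in Python: inject into a carrier, extract in each service, start a child span, and end the children before the root. The sample app itself is written in PHP and uses the SDK’s tracer and propagator. Here, start_span, inject, extract, and service are hypothetical helpers, a plain dict stands in for the carrier, and "ending" a span simply means appending it to a finished list.

```python
import secrets


def start_span(name, parent=None):
    """Hypothetical helper: a root span gets a new trace ID, a child inherits it."""
    return {
        "name": name,
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["span_id"] if parent else None,
    }


def inject(span, carrier):
    """Copy the span context into the carrier (a dict standing in for headers)."""
    carrier["trace_id"] = span["trace_id"]
    carrier["span_id"] = span["span_id"]


def extract(carrier):
    """Rebuild the caller's span context from the carrier."""
    return {"trace_id": carrier["trace_id"], "span_id": carrier["span_id"]}


def service(name, carrier, finished):
    # Each service extracts the context and starts its own child span.
    span = start_span(name, parent=extract(carrier))
    # ... the service's work would happen here ...
    finished.append(span)  # end the child span


finished = []
root = start_span("main")
for name in ("Service1", "Service2"):
    carrier = {}
    inject(root, carrier)
    service(name, carrier, finished)
finished.append(root)  # the root span ends last, in the main application
```

Every finished span shares the root’s trace ID, and both service spans name the root as their parent. That shared ID and parent linkage are what allow X-Ray to draw the whole request as one trace map.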

These sample apps serve as integration tests, demonstrating that traces created in an application can be sent to the OTEL Collector and then seen on the AWS X-Ray backend.

Conclusion

During this project, I came to understand much more about the open source community and the process of creating and delivering high-quality, maintainable code. Through OpenTelemetry, I had the opportunity to work with many contributors from across the industry who are building well-documented and efficient services. I learned to communicate with my team members, respond effectively to their input, and apply their suggestions in my code.

I also became much more familiar with PHP as a language and its many nuances. Beyond the language, I came to understand that being an engineer is more than coding. It combines coding, documenting, and presenting ideas with reaching out to other engineers for their expertise and responding productively to evaluations by team members and reviewers.

Working within OpenTelemetry and with the AWS X-Ray team was an incredible experience that enriched my knowledge of the open source community and the industry as a whole. I appreciate the mentorship of Bob Strecansky, OpenTelemetry PHP maintainer; Anthony Mirabella, OpenTelemetry Go maintainer; and Bolu Peng from the AWS X-Ray team for their guidance and reviews. I hope to continue working within wonderful teams with talented people and to keep contributing to influential open source projects in the future.

Oliver Hamuy

Oliver Hamuy is a senior at Washington University in St. Louis studying Computer Science and Electrical Engineering. He is currently a software engineering intern at AWS and is interested in observability and artificial intelligence.