System tracing with AWS X-Ray

Knowing your application is running fast or slow is one thing, understanding what makes up those characteristics is another. This is not a new problem, APM (Application Performance Monitoring) tools like New Relic and AppDynamics will happily take your money to make this problem easier and have done for a while. However, this problem gets even more complicated in a serverless environment where integration points are not always totally under your control. This is a frustration of mine as AWS Lambda can often be an opaque beast.

AWS X-Ray is a lightweight entrant into this field, with the extent of the configuration being a tick box in the ‘advanced settings’ section of the ‘config’ tab. The feature switch will even add the required permissions for you.

With this enabled, each request will now produce a trace.

Now for the interesting bits, we can see that this Lambda was cold and took a while to start executing. We can also see where Amazon have called out some of their ‘initialization’ time.

Slow Lambda execution

My Lambda includes a call to a RESTful API but the default X-Ray setup won’t show this. I want to call this out especially in my monitoring because this external dependency has the possibility of affecting my service. To do this I need to create a sub-segment, Amazon as ever have thought of everything and have created a decorated version of HttpClient with their own X-Ray integration. It’s literally a drop in replacement with no further configuration. With this addition, I now get a much clearer view of the profile of my service. The remote API call I’m calling makes up a large percentage of the total time taken.

Lambda execution with HttpClient augmented

Amazon does support creating your own custom sub-segments which is something I want to try out soon and maybe write about later.

The other interesting feature I’ve not yet tried but is also on the backlog for investigation is attaching metadata to the segments and sub-segments. I can imagine this being really useful by adding pertinent data to traces, this could then be used in investigating themes of slowness for example.

Finally, another freebie from Amazon: out of the box, a service map is created linking together any associated resources it can find from your traces. The green around each service is a ring chart of % success, these turn yellow if there are problems.

I could imagine a slightly more sophisticated version of this being a fantastic DevOps dashboard.

linked services