Distributed Tracing

TL;DR

How do you debug distributed applications in production? Distributed Tracing

Longer Version

Distributed tracing is a technique used in modern, complex software systems to help us understand how requests flow through multiple services, allowing engineers to diagnose and debug issues that arise in the system.

It provides a way to follow the path of a single request as it travels through multiple microservices and across different processes and machines.
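To make this concrete, here is a minimal, self-contained sketch (pure Python, not a real tracing library) of the core idea: each service propagates a shared trace ID in a header such as the W3C `traceparent`, while starting its own span for its part of the work:

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C-style traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16-byte trace id, shared by all hops
    span_id = secrets.token_hex(8)                # 8-byte span id, unique per hop
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def handle_request(headers, service_name, log):
    """Each 'service' extracts the incoming trace id and starts its own span."""
    incoming = headers.get("traceparent")
    trace_id = incoming.split("-")[1] if incoming else None
    header, trace_id, span_id = make_traceparent(trace_id)
    log.append((service_name, trace_id, span_id))
    return {"traceparent": header}  # headers forwarded to the next service

# Simulate one request travelling through three services.
log = []
headers = {}
for service in ("gateway", "orders", "payments"):
    headers = handle_request(headers, service, log)

# All spans share a single trace id, so the full request path can be reconstructed.
assert len({trace_id for _, trace_id, _ in log}) == 1
```

Real tracing systems add timing, parent-span links, and an exporter on top of this, but the propagation mechanism is essentially the same.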

Distributed tracing allows developers to answer questions such as:

  • What services were involved in processing a request?

  • Which service was responsible for a particular latency or error?

  • How much time was spent in each service for a given request?

  • Which requests were affected by a particular issue?

  • How do performance metrics change as the system scales?

By providing a detailed view of request flow and service interactions, distributed tracing helps to diagnose issues more quickly and effectively, leading to faster resolution times and better overall system reliability.

Distributed tracing is also used to monitor system health and performance over time, allowing teams to identify trends and potential issues before they become critical problems.

It also helps with capacity planning and resource allocation by providing insight into how different services use resources and where bottlenecks may occur in the system.

Typical Application Layout

Applications have to be instrumented (normally not a very complex process) to collect traces and stream them to storage, usually through a proxy.
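As an illustration, manual instrumentation usually amounts to wrapping units of work in spans. The toy context manager below is a sketch, not the OpenTelemetry API; `SPANS` stands in for a real exporter that would stream spans to a collector:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an exporter that would ship spans to a collector

@contextmanager
def span(name):
    """Toy span: records a name and duration, appended when the work finishes."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({"name": name, "duration_s": time.monotonic() - start})

def checkout():
    with span("checkout"):          # outer span covers the whole operation
        with span("charge_card"):   # nested span covers one sub-step
            time.sleep(0.01)        # simulate work

checkout()
# Inner spans finish first, so they are exported first.
assert [s["name"] for s in SPANS] == ["charge_card", "checkout"]
```

With a real SDK the shape is similar: you wrap interesting operations in spans, and the library handles IDs, context propagation, and export for you.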

Proxying

Currently the most popular solution is the OpenTelemetry Collector (AWS has its own adjusted distribution called ADOT).

The component is deployed to a Kubernetes cluster and can proxy not only traces but also collect metrics and logs.
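For illustration, a minimal collector pipeline that receives OTLP traces, batches them, and forwards them to a backend could look like this (the endpoint and service names are placeholders, not a configuration to copy as-is):

```yaml
# Illustrative otel-collector config; endpoint names are assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}   # batch spans before export to reduce network chatter

exporters:
  otlp:
    endpoint: tempo.monitoring.svc:4317   # e.g. a Grafana Tempo service in-cluster
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping the backend then only means changing the exporter, not re-instrumenting every service.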

Storages

Traces, or rather the spans they are made of, are data: they need to be stored somewhere before they can be analysed.

There are many storage options out there, but a few are worth mentioning.

First of all, storage options fall into two categories:

  • managed

  • and self-managed :).

Managed storage is costly but reliable and requires less human attention.

Self-managed storage does require human attention and can be costly too if run for small volumes or just for occasional debugging (which amounts to the same thing).

Grafana Tempo (stack)

Grafana offers a complete stack of storage for monitoring, logs, traces, and alerting, and has open-source and commercial versions.

It comes with Loki for logs, Tempo for traces, and Mimir for metrics (which is relatively new and has limited features).

Grafana is a bit complicated to configure but has robust enough functionality to cover most needs.

It is cloud agnostic and can be used in on-premise setups too.

CloudWatch

An AWS solution that is costly for large volumes of data but very easy to configure, and it integrates well with native services like SNS, RDS, ALB, and so on.

CloudWatch has its own log storage and metrics, with traces handled by X-Ray.

It shows a service map out of the box; the same feature needs to be configured in Grafana, which is not easy.

CloudWatch X-Ray is very easy to start with but can be costly.

If you only need to debug an application once in a while, this might be a good solution. If you run a lot of services and need to monitor and report SLAs, it can be an expensive solution compared to Grafana.

X-Ray also integrates easily with CloudWatch Alarms.

Google Cloud Monitoring

A similar situation to AWS CloudWatch: easy to start with, but expensive for large volumes.

How to cut costs

Sampling

It is possible to sample traces and spans. In some cases this can cut costs by a factor of up to 1,000 while still providing accurate enough information to debug and monitor application performance.
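A common head-based approach is to make the sampling decision a deterministic function of the trace ID, so every service keeps or drops the same traces. A sketch of the idea (not any specific SDK's sampler):

```python
import secrets

def sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace iff the top 64 bits of its id fall below ratio * 2**64.
    Deterministic per trace id, so all services agree on the decision."""
    threshold = int(ratio * (1 << 64))
    return int(trace_id_hex[:16], 16) < threshold

# Keep roughly 0.1% of traces -> on the order of 1,000x less data stored.
kept = sum(sample(secrets.token_hex(16), 0.001) for _ in range(100_000))
assert 0 < kept < 1_000   # far fewer than the 100,000 generated traces
```

Because the decision depends only on the trace ID, a sampled trace is kept in full across every service it touches, so individual request paths remain debuggable.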

Filtering spans

You can filter out spans for non-critical cases, but this requires closer and continuous monitoring, which is a cost too.
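A span filter can be as simple as a predicate over span attributes. The routes below are made-up examples of noisy, non-critical endpoints, and the dict-based span shape is an assumption for illustration:

```python
# Hypothetical list of routes whose spans are rarely worth storing.
NOISY_ROUTES = ("/healthz", "/metrics", "/favicon.ico")

def keep_span(span: dict) -> bool:
    """Drop spans whose route matches a known-noisy prefix."""
    route = span.get("http.route", "")
    return not route.startswith(NOISY_ROUTES)

spans = [
    {"name": "GET /healthz", "http.route": "/healthz"},
    {"name": "GET /orders", "http.route": "/orders"},
]
kept = [s for s in spans if keep_span(s)]
assert [s["name"] for s in kept] == ["GET /orders"]
```

In practice this kind of rule would live in the collector pipeline, so the data is dropped before it ever reaches (and is billed by) storage.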

Manage your own storage

There are plenty of open-source storage options that you can run on Kubernetes; they are relatively easy to manage and can be cheap if configured and run on spot/preemptible nodes and instances. Setup is a bit difficult, but done correctly it can save you a lot of money.

All in all, this is monitoring data: it has to be reliable enough to drive alerts, but it does not require long retention unless regulations demand it.

Performance Optimisation

Quite possibly, after spotting and fixing performance issues, you will save enough on computing resources to cover your tracing costs. And how about the improved UX?

Conclusion

Use traces; they will speed up your debugging a hundredfold. You can catch issues long before users do and before they bring your stack down.

How? Consider the volume.

  • Start with a managed version (it is the easiest way to get going).

  • Use the otel-collector (it gives you an abstraction layer, so switching providers will not require changing every service).

  • If you have a lot of traces, consider a switch (enable tracing from time to time, for example after a release).

  • Use sampling (more aggressive sampling in dev environments).

  • Use filters.

  • For large volumes, deploy and manage your own storage.

  • For small volumes, stay with cloud tools; this will save you a lot of time!
