Observability for Distributed Systems

6 min read · Feb 27, 2023

Distributed systems are the new normal in software engineering. In the microservices world, you trade an atomic execution for a distributed one that relies on message exchanges among multiple services and systems. There are many moving components, and failure is inevitable. With so many interconnections among components, it is incredibly challenging to understand the system’s behavior and find the root cause of a problem.

Observability helps us understand the behavior and state of the system. For example, in a distributed system, you may find it difficult to debug an issue across multiple hosts using logs or other signals alone. You could drill down into the anomalous behavior by combining several signals, but you would have to remember and correlate them yourself. Instead, you can use distributed tracing to break down and visualize each step of a request. In that case, you only need to understand the applications that appear in the execution path of that specific request.

Observability is still an afterthought in many places and falls under the umbrella of Operations, meaning developers do not experience the power of observability. When developers build features spanning multiple services, it can be very frustrating to debug them and hard to build a high level of confidence before committing the code. We must shift observability left in the development lifecycle to empower both development and operations teams.

Observability vs Monitoring

The traditional monitoring approach works by checking known system conditions against known thresholds. This is a fundamentally reactive approach because an alert fires only after the incident has already occurred.
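
As a rough illustration of threshold-based monitoring, here is a minimal sketch in Python. The metric names, limits, and the `check_thresholds` helper are all hypothetical and not taken from any particular tool. Note how a metric nobody wrote a threshold for (`queue_depth` here) goes unnoticed.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # name of a metric we already know about
    limit: float  # value above which we consider it a problem


def check_thresholds(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per known metric that exceeds its known limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts


# Hypothetical thresholds and a hypothetical sample of current values.
thresholds = [Threshold("cpu_percent", 90.0), Threshold("error_rate", 0.05)]
sample = {"cpu_percent": 97.0, "error_rate": 0.01, "queue_depth": 12000}

alerts = check_thresholds(sample, thresholds)
print(alerts)
```

Only the CPU alert fires. The very large queue depth is invisible to this check because no one anticipated it.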

Observability proactively collects the required data and correlational information to enable exploratory investigations. It helps to identify known and even unknown issues. Being observable means opening yourself up so that the world understands you. Remember, observability gives you a paved path to monitoring.

If you are observable, then I can monitor you.

Where do we start?

Structured events are the basic building blocks of observability. Machines can parse them without human interpretation, which makes scaling less of a challenge.
Structured events present a record of everything happening in your system, and the goal is to use these events as a complete source of information. Getting to a reliable observability solution may take considerable iterative work and multiple tools. At a high level, you could split the work into three categories.
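
To make "structured event" concrete, here is a minimal sketch that builds an event as key-value pairs and serializes it to one JSON line. The field names (`service`, `request_id`, `duration_ms`, and so on) and the `make_event` helper are illustrative assumptions, not a standard schema.

```python
import json
import time
import uuid


def make_event(service: str, name: str, **fields) -> dict:
    """Build a structured event: a flat record of named, typed fields."""
    return {
        "timestamp": time.time(),
        "event_id": uuid.uuid4().hex,
        "service": service,
        "name": name,
        **fields,
    }


# Hypothetical event emitted by a hypothetical "checkout" service.
event = make_event(
    "checkout",
    "payment.authorized",
    request_id="req-123",
    duration_ms=42.5,
    status="ok",
)

line = json.dumps(event, sort_keys=True)  # one event per line, easy to ship and store
parsed = json.loads(line)                 # a machine can read every field back
print(parsed["name"], parsed["duration_ms"])
```

Because every field is named, a back-end can filter or group on any of them without parsing free-form text.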

Instrumentation: Instrumentation refers to the process of generating structured events that contain the information. These could be metrics, logs or traces.
Data Collection: The events generated by instrumentation must be collected and stored for later analysis.
Visualization: Visualization covers querying, analyzing and presenting the collected data.

Instrumentation (Generating Events)

With the wide use of multiple languages and frameworks for software development in an organization, teams have a lot of autonomy and flexibility to use the libraries or frameworks of their choice to generate metrics. Automatic instrumentation can generate most of the standard metrics. Exposing these metrics on an HTTP endpoint makes it easier to view and validate what is generated. Developers who want to test their metrics as part of functional testing need no third-party tool to do so.
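
As one possible way to expose metrics on an HTTP endpoint, the sketch below uses only the Python standard library. The `/metrics` path, the `http_requests_total` metric name, and the line-oriented text layout (modeled loosely on the Prometheus exposition format) are assumptions for illustration; a real service would more likely use a metrics library.

```python
import threading
import urllib.request
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = Counter()  # hypothetical in-process request counters, keyed by route


def render_metrics(counters: Counter) -> str:
    """Render counters as plain text, one metric sample per line."""
    lines = ["# TYPE http_requests_total counter"]
    for route, count in sorted(counters.items()):
        lines.append(f'http_requests_total{{route="{route}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this demo


def scrape_once() -> str:
    """Start the endpoint on a free local port, fetch it once, and shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


# Pretend the service has handled a few requests.
REQUESTS["/checkout"] += 3
REQUESTS["/cart"] += 1

if __name__ == "__main__":
    print(scrape_once(), end="")
```

A functional test can call `render_metrics` directly, or fetch the endpoint as `scrape_once` does, and assert on the values it expects.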

OpenTelemetry (OTel) makes automatic instrumentation simpler. It can automatically generate traces for inbound and outbound HTTP, gRPC, and database calls by hooking into your services before and after each request is handled.

Data Collection (Collecting and Storing Events)

You need to collect the events generated by your services at regular intervals and either store them locally in a database or export them to a back-end system. There are two ways to do that.

Run a collector service that can reach out to your endpoints that expose the events and scrape them.
Have your services instrumented to send the data directly to the collector.
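
The two collection models above can be sketched side by side. In this simplified example, plain Python callables stand in for the HTTP endpoints a real collector would scrape, and the `Collector` class with its method names is hypothetical.

```python
from typing import Callable


class Collector:
    """Toy collector that supports both pull (scrape) and push (receive)."""

    def __init__(self):
        self.store: list[dict] = []
        self.targets: dict[str, Callable[[], list[dict]]] = {}

    # Pull model: the collector knows the targets and reaches out to them.
    def register_target(self, name: str, read_events: Callable[[], list[dict]]) -> None:
        self.targets[name] = read_events

    def scrape(self) -> None:
        for name, read_events in self.targets.items():
            for event in read_events():
                self.store.append({**event, "source": name, "mode": "pull"})

    # Push model: the service knows the collector and sends events to it.
    def receive(self, source: str, event: dict) -> None:
        self.store.append({**event, "source": source, "mode": "push"})


collector = Collector()

# Pull: a hypothetical "inventory" service exposes its events for scraping.
collector.register_target("inventory", lambda: [{"name": "stock.checked", "count": 7}])
collector.scrape()

# Push: a hypothetical "checkout" service sends its event directly.
collector.receive("checkout", {"name": "payment.authorized", "duration_ms": 42.5})

for stored in collector.store:
    print(stored["mode"], stored["source"], stored["name"])
```

With pull, services stay unaware of the collector, while push suits short-lived jobs that may exit before anyone scrapes them.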

The collector acts as a middleware agent that handles the generated events and can process them before storing them or forwarding them to another system. OTel provides an extensible and powerful Collector agent.
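
The "process before storing or forwarding" step can be pictured as a chain of small functions. This sketch does not use the OTel Collector or its configuration; the processor names (`drop_debug`, `redact_email`, `add_environment`) and the `run_pipeline` helper are made up for illustration.

```python
from typing import Callable, Iterable, Optional

Event = dict
Processor = Callable[[Event], Optional[Event]]  # returning None drops the event


def drop_debug(event: Event) -> Optional[Event]:
    """Filter out noisy debug-level events."""
    return None if event.get("level") == "debug" else event


def redact_email(event: Event) -> Optional[Event]:
    """Mask a sensitive field before it leaves the collector."""
    if "email" in event:
        return {**event, "email": "<redacted>"}
    return event


def add_environment(env: str) -> Processor:
    """Build a processor that enriches every event with an environment tag."""
    def processor(event: Event) -> Optional[Event]:
        return {**event, "env": env}
    return processor


def run_pipeline(events: Iterable[Event], processors: list[Processor],
                 export: Callable[[Event], None]) -> int:
    """Run each event through the processors in order, then export survivors."""
    exported = 0
    for event in events:
        for process in processors:
            event = process(event)
            if event is None:
                break  # dropped by a processor; skip export
        else:
            export(event)
            exported += 1
    return exported


backend = []  # stand-in for a storage or forwarding destination
incoming = [
    {"name": "cache.miss", "level": "debug"},
    {"name": "user.signup", "level": "info", "email": "a@example.com"},
]
count = run_pipeline(incoming, [drop_debug, redact_email, add_environment("prod")], backend.append)
print(count, backend)
```

Each processor returns a new dictionary rather than mutating its input, so the original events remain untouched.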

Visualization (Analyzing Events)

Visualization is the most valuable component for developers and operations folks. I will call it the observability back-end. There are two types of back-ends: those that store the data themselves and those that pull data stored somewhere else. Any observability back-end lets you create charts, dashboards and alerts to monitor the metrics that matter to you. Some of the most popular ones are Grafana, Wavefront and Datadog.

A good observable system should allow you to perform the analysis with minimal prior knowledge of the system. You should be able to follow a systematic and scientific approach to uncover issues that you have never seen before.
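
One way to picture this kind of exploratory analysis is slicing the same events by different fields until a pattern appears. The events, field names, and `group_stats` helper below are invented for illustration. A real back-end would run equivalent queries over stored data.

```python
from collections import defaultdict
from statistics import median


def group_stats(events, key, value="duration_ms"):
    """Group events by an arbitrary field and summarize a numeric field."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(key, "<missing>")].append(event[value])
    return {
        group: {"count": len(values), "max": max(values), "median": median(values)}
        for group, values in groups.items()
    }


# Hypothetical request events carrying several dimensions.
events = [
    {"region": "eu", "version": "2.0", "duration_ms": 40},
    {"region": "us", "version": "2.0", "duration_ms": 44},
    {"region": "eu", "version": "2.1", "duration_ms": 900},
    {"region": "us", "version": "2.1", "duration_ms": 950},
    {"region": "us", "version": "2.0", "duration_ms": 42},
]

by_region = group_stats(events, "region")    # first hypothesis: a regional problem?
by_version = group_stats(events, "version")  # second hypothesis: a bad release?
print(by_region)
print(by_version)
```

Grouping by region shows slow requests in both regions, which rules that hypothesis out. Grouping by version isolates the slowness to one release. No prior knowledge of the system was needed beyond the fields the events already carried.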

How do we start?

Code instrumentation

Rolling out instrumentation properly across the whole organization takes time. Start with what you can quickly implement and develop a strategy to grow your instrumentation with time.

Automatic instrumentation using open-source projects such as OpenTelemetry is a good place to start. Start by instrumenting the services that are the most significant pain points and use them as a reference for the other teams.

If your service teams do not prioritize the instrumentation work, use a production issue as an opportunity to apply this new tooling to the problem areas. That will prove its immediate value to your development and production teams.

Design your system to be loosely coupled but highly aligned

The sunk cost fallacy often keeps organizations from adopting a new technology or framework. It pushes you to make decisions that justify the money already spent, on the grounds that switching costs are too high.

Avoid falling into that trap by designing your system to be loosely coupled, meaning that each individual component is easily replaceable, and highly aligned, meaning that the role of each component is well defined and contributes to the end goal. An analogy is the bike chain, where links are loosely coupled with each other but highly aligned to keep the drivetrain intact.

Bike chain links are loosely coupled but highly aligned to keep the drivetrain intact

Choose an observability back-end (the component that provides data ingestion and visualization) that integrates with multiple data sources. For example, Grafana can work with data from most of the popular data sources out there.
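
In code, "loosely coupled but highly aligned" often comes down to depending on a small interface rather than on a specific vendor. The sketch below is one hypothetical way to do that: the `Exporter` protocol, both exporter classes, and the `Telemetry` wrapper are illustrative names, not part of any real library.

```python
import io
import json
from typing import Protocol


class Exporter(Protocol):
    """The one contract every back-end adapter must satisfy."""
    def export(self, event: dict) -> None: ...


class InMemoryExporter:
    """Keeps events in a list; handy for tests."""
    def __init__(self):
        self.events = []

    def export(self, event: dict) -> None:
        self.events.append(event)


class JsonLinesExporter:
    """Writes one JSON document per line to any text stream."""
    def __init__(self, stream):
        self.stream = stream

    def export(self, event: dict) -> None:
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")


class Telemetry:
    """Application code talks to this class and never to a specific back-end."""
    def __init__(self, service: str, exporter: Exporter):
        self.service = service
        self.exporter = exporter

    def emit(self, name: str, **fields) -> None:
        self.exporter.export({"service": self.service, "name": name, **fields})


# Swapping the back-end is a one-line change at construction time.
memory = InMemoryExporter()
Telemetry("checkout", memory).emit("order.placed", total=30)

buffer = io.StringIO()
Telemetry("checkout", JsonLinesExporter(buffer)).emit("order.placed", total=30)

print(memory.events)
print(buffer.getvalue(), end="")
```

The application emits the same event either way. Only the adapter changes, which keeps the cost of switching back-ends low.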

Start with the most significant pain points

It might sound counterintuitive, but starting with a small and relatively unimportant service will not help you prove the value of observability. Is there a flaky, painful service that has been troubling people for weeks, yet nobody knows the root cause? Start right there.

Pick a complex problem which is opaque to you and your users. Instrument the code, collect and export the data, explore the cause with great curiosity, and share your results with the team to find the answer. The quickest way to drive adoption across your teams is to help them solve the most significant pain points to run services in production successfully.

The Future Of Observability

Observability is an essential part of highly productive development teams. A successful Ops team relies on a proactive and methodical approach to deal with issues rather than depending on institutional knowledge passed along informally from senior engineers to junior engineers. Observability empowers DevOps and SRE teams by providing visibility and insights into system behavior that helps managers and leaders make informed decisions. It’s a rapidly evolving area, and its importance will increase due to the complexity of distributed systems and cloud infrastructure.

Written by Amit Pal
