Observability-first development

2023-03-27

A while back I worked for a company that hopped on the microservices train while I was there. As we adapted to this new way of building software, we also adopted a new set of technologies to do it effectively.

One area we invested in was observability. When you think about it, there is no way you could effectively operate multiple microservices in production if you had no visibility into how they were functioning. This includes:

  • Being automatically alerted when a service has problems. Not just for critical issues like “your service is down” but also for things like request handling errors, high latency, etc.
  • Being able to visualize the problems in some way. Once the issue pops up, you need a way to inspect the service’s metrics to get some context about what’s going on.
  • Tied to the point above, being able to visualize metrics and spot suspicious trends. Sometimes there are corner cases you forget to cover with alerts but that are easy to spot with the naked eye.

For every service we wrote, we used an observability-first approach in which no service would be actively used in production unless there were metrics, alerts, and dashboards to explore its metrics. This gave us confidence that our services were functioning as expected, and at the same time made us more effective at delivering software.

Running blind

After talking to other people, I found out that this is not the norm. There are places where there’s software running in production with absolutely no visibility beyond a basic health check to ensure the services are up.

The main problem with this approach is that eventually, someone finds out things are broken. Those who do will likely find out because they use your product. Guess who uses your product the most? That’s right, your customers! In a way, maybe getting calls from customers complaining that something isn’t working counts as an alert, doesn’t it?

Besides that, when nobody is alerted about an ongoing issue, it can snowball and cause much worse problems down the line. For example, the longer an event processing service stays stuck, the bigger its backlog grows and the longer it takes to consume that backlog once the service recovers. If the problem persists for long enough it can cause irreversible side effects, like the queued events expiring without ever being processed, leading to data loss.
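To put rough numbers on the snowball effect, here is a back-of-the-envelope sketch. The function names, the rates, and the retention window are all made up for illustration, and the model is deliberately simple: it assumes constant arrival and processing rates and ignores events that expire while the backlog is being drained.

```python
def backlog_after_stall(arrival_rate, stall_seconds):
    """Events queued up while the consumer processed nothing."""
    return arrival_rate * stall_seconds


def seconds_to_drain(backlog, arrival_rate, processing_rate):
    """Time to clear the backlog once the consumer resumes.

    New events keep arriving during recovery, so only the spare
    capacity (processing_rate - arrival_rate) eats into the backlog.
    """
    spare_capacity = processing_rate - arrival_rate
    if spare_capacity <= 0:
        return float("inf")  # the consumer can never catch up
    return backlog / spare_capacity


def events_expired(arrival_rate, stall_seconds, retention_seconds):
    """Events that aged past the queue's retention window during the stall."""
    return arrival_rate * max(0, stall_seconds - retention_seconds)


# Hypothetical service: 100 events/s arriving, 150 events/s processed when healthy.
backlog = backlog_after_stall(arrival_rate=100, stall_seconds=3600)
recovery = seconds_to_drain(backlog, arrival_rate=100, processing_rate=150)
lost = events_expired(arrival_rate=100, stall_seconds=3600, retention_seconds=1800)
```

With these made-up numbers, a one-hour stall leaves 360,000 queued events and takes two more hours to drain, and a 30-minute retention window would have dropped 180,000 of them. An alert that fires a few minutes into the stall keeps all three numbers small.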

Metrics

It’s often fairly simple to come up with metrics as you’re writing code. There are usually certain types of measurable events that pop up whenever you’re writing a new service like:

  • Request handling count/latency for server-type applications.
  • Request count/latency for API calls to external services (databases, HTTP APIs, etc). This helps diagnose slowness caused by other services taking too long to reply, or just errors in general.
  • Event processing count/latency for event-processing systems. This is helpful to estimate a service’s latency and to spot bottlenecks between different producers/consumers.
  • Any other application-specific metric that should help you diagnose issues. This includes things like figuring out how long some operation normally takes, or how many times that thing is happening (e.g. cache hits/misses).
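As a minimal sketch of what this looks like in code, the snippet below records counts and latencies for an incoming request, a call to an external dependency, and cache hits/misses. The `Metrics` class is a hypothetical in-memory stand-in written only for this example; in a real service you would use whatever metrics client your monitoring stack provides. All metric and label names are assumptions.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory stand-in for a real metrics client (hypothetical)."""

    def __init__(self):
        self.counters = defaultdict(int)    # (name, labels) -> count
        self.latencies = defaultdict(list)  # (name, labels) -> observed seconds

    @staticmethod
    def _key(name, labels):
        # Sort labels so the same label set always maps to the same key.
        return (name, tuple(sorted(labels.items())))

    def increment(self, name, **labels):
        self.counters[self._key(name, labels)] += 1

    @contextmanager
    def timed(self, name, **labels):
        """Count one operation and record how long it took, even on failure."""
        start = time.perf_counter()
        outcome = "ok"
        try:
            yield
        except Exception:
            outcome = "error"
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.increment(f"{name}_total", outcome=outcome, **labels)
            self.latencies[self._key(f"{name}_seconds", labels)].append(elapsed)


metrics = Metrics()


def handle_request(user_id, cache):
    # Request handling count/latency.
    with metrics.timed("http_request", route="/users"):
        if user_id in cache:
            metrics.increment("cache_lookup_total", result="hit")
            return cache[user_id]
        metrics.increment("cache_lookup_total", result="miss")
        # Count/latency for the call to an external dependency.
        with metrics.timed("db_query", table="users"):
            user = {"id": user_id}  # stand-in for a real database call
        cache[user_id] = user
        return user
```

Wrapping the work in a context manager means the count and the latency get recorded on both the success and the error path, which is usually what you want when an alert later depends on the error rate.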

Besides this, new metrics should be driven by new issues found in each application. Every time a new issue is found, it’s a good idea to think about whether some particular metric would have helped detect or diagnose the issue and introduce it. Next time a similar issue pops up, that metric should hopefully help you detect and/or figure out what the issue is.

Alerts

Alerts are meant to… alert you when something goes wrong. A good rule of thumb is that alerts should be actionable. This means you should avoid defining alert rules for situations that are not indicative of a real problem or that you can’t do anything about. For example, a real case of this was an alert that fired when the 99th percentile latency for a service rose more than X% above its baseline. While this sounds like it could indicate a problem, the service was still functioning well and was far from hitting capacity or its SLA, so the alert wasn’t actionable.
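The difference between those two kinds of rule can be sketched in a few lines. This is not the rule we actually had: every number below (baseline, threshold, SLA, headroom) is invented for illustration, and in practice the condition would live in your alerting system rather than in application code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline_alert(p99, baseline_p99, max_increase_pct):
    """Fires whenever p99 drifts too far above its usual value."""
    return p99 > baseline_p99 * (1 + max_increase_pct / 100)


def sla_alert(p99, sla_seconds, headroom=0.8):
    """Fires only when p99 gets close to what was actually promised."""
    return p99 >= sla_seconds * headroom


# Hypothetical numbers: p99 doubled from 40 ms to 80 ms, but the SLA is 500 ms.
fires_on_baseline = baseline_alert(p99=0.080, baseline_p99=0.040, max_increase_pct=50)
fires_on_sla = sla_alert(p99=0.080, sla_seconds=0.500)
```

Here the baseline rule fires and the SLA rule does not. The first one wakes someone up for a service that is still comfortably within its limits; the second one stays quiet until there is something to act on.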

Sometimes you want to define an alert as a type of warning of something that could become an issue. While in theory this is a good idea, in practice this tends to be a non-actionable alert. What ends up happening is people just ignore it, which can lead to a culture of ignoring issues even if they’re real.

When you realize an alert is not actionable, you probably want to:

  • Adjust its thresholds so it only triggers when the issue warrants attention.
  • Remove the alert altogether if there isn’t anything to be done.

Visualizing metrics

Just as important as being alerted about an issue is being able to dig into what’s going on. The last thing you want is to get paged at 3 am, only to realize you have no way to inspect the metrics that led to the alert being fired. It’s also possible to go too far in the other direction and end up with dashboards that have hundreds of widgets, where there’s so much information that you can’t get anything out of it.

A few things I like when defining dashboards to visualize metrics are:

  • Create an overview dashboard. This is a great entry point into the health of the system. You may want an overview of the system that visualizes very high-level metrics, but also another one per service.
  • Keep them simple. Don’t create a dashboard with a hundred widgets/panels. If you put too much information together, it becomes overwhelming and harder to interpret.
  • Split your dashboards. This helps fix the point above and involves defining dashboards that focus on different aspects of the system. Depending on how much a single service does, an overview dashboard may be enough. But if you have too much information you can create different ones for major parts of the service.
  • Keep dashboards consistent. I’ve mostly used Grafana, and I like things like filter variables to be consistent across all dashboards. This reduces the effort you need to orient yourself when you jump into one. I’ve recently used grafanalib, plus a bit of code on top of it, to generate dashboards and panels from code. As a result, all dashboards have the same filters, all widgets have the same look, etc. As a bonus, it also takes 3 lines of code to define a new dashboard. This is a huge improvement over defining them all by hand and making an extra effort to keep things consistent across dashboards whenever something changes.
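To illustrate the idea behind that last point, here is a sketch of a small helper layer that stamps the same filters and panel style onto every dashboard. It is not grafanalib's API and not the code I actually used: it builds plain dictionaries loosely modelled on Grafana's dashboard JSON (simplified, not a complete schema), and the filter names and query strings are hypothetical placeholders.

```python
# Filters every dashboard gets, in the same order (hypothetical names).
SHARED_FILTERS = ["environment", "service", "instance"]

# Look and feel shared by every panel.
PANEL_DEFAULTS = {"type": "timeseries", "gridPos": {"h": 8, "w": 12}}


def panel(title, query):
    """Build a panel with the shared style; only title and query vary."""
    return {**PANEL_DEFAULTS, "title": title, "targets": [{"expr": query}]}


def dashboard(title, panels):
    """Build a dashboard that always carries the shared filter variables."""
    return {
        "title": title,
        "templating": {"list": [{"name": name, "type": "query"} for name in SHARED_FILTERS]},
        "panels": panels,
    }


# Defining a new dashboard is then a handful of lines.
api_overview = dashboard("API overview", [
    panel("Request rate", "placeholder_request_rate_query"),
    panel("p99 latency", "placeholder_p99_latency_query"),
])
```

Because the filters and panel defaults live in one place, changing them updates every dashboard the next time they are generated, instead of requiring a manual pass over each one.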

Conclusion

Running services in production without any visibility into them is a recipe for upset customers. While setting up the infrastructure needed to monitor your services can be daunting, it’s a one-time investment that will pay off in the long run. Having visibility will not only let you react to issues faster but also help you understand and fix them more quickly.