Roots of Observability

Emergence of SRE and DevOps

The roots of Observability lie in the emergence of SRE (Site Reliability Engineering) and DevOps.

Software engineering has taught us that software needs to be Scalable, Available, Resilient, Manageable and Secure. Scalability and reliability form the cornerstones of these principles.

Scalability is the ability of a system to handle increased load. Applications can scale vertically (scaling up), i.e. by increasing the capacity of a resource, or horizontally (scaling out), i.e. by adding new instances of a resource. Availability is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. Reliability is the probability of continuous correct operation. Having a system that is available but not reliable makes no sense. Similarly, having a system that is reliable but not available makes no sense either. Hence, the two terms are usually used together.
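To make the notion of an "agreed level of uptime" concrete, here is a minimal sketch of the arithmetic behind an availability target. The function names and the 30-day window are illustrative assumptions, not part of any standard or tool.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime within `period` that still meets the target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


def measured_availability(uptime: timedelta, period: timedelta) -> float:
    """Return availability as a percentage: uptime divided by total time."""
    return 100.0 * uptime.total_seconds() / period.total_seconds()


if __name__ == "__main__":
    month = timedelta(days=30)  # hypothetical reporting window
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime(target, month)
        print(f"{target}% over 30 days allows {budget} of downtime")
```

Each additional "nine" in the target shrinks the permitted downtime by a factor of ten, which is why high availability targets are hard to meet without a reliable system underneath.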

The early 2000s saw the emergence of Site Reliability Engineering (SRE), with the intention of having dedicated teams whose sole target was to make software applications ultra-scalable and highly reliable in an enterprise setting. After the introduction of SRE at Google, almost every tech giant saw the potential value of such teams and ultimately incorporated them into its engineering organisation.

By the late 2000s, the industry saw the emergence of yet another vertical in the form of DevOps. By this time most of the tech world was moving, or had already moved, towards Agile development. The need to 'reduce the time between committing a change to a system and the change being placed into normal production environment, while at the same time ensuring high quality' was of paramount importance.

The common thread between SRE and DevOps was effective monitoring. A monitoring system would address two questions: what’s broken, and why? The “what’s broken” indicated the symptom; the “why” indicated a (possibly intermediate) cause. The answers to these 'Whats and Whys' would give the development, deployment and management teams a bird's eye view of the system, along with valuable information to use in their decision making. Dashboarding of different metrics became a common practice in engineering teams. As systems grew more complex (and distributed), monitoring changed from good-to-have to mandatory.

Observability, Context and Monitoring

By the early 2010s the software industry had started adopting distributed architectures (microservices in particular) in its products. Major software products had seen the benefits of breaking their monoliths into distributed services. Google, Twitter and Amazon were a few of the forerunners in this journey. At the same time, software infrastructure was moving towards the cloud. AWS, Azure and the like were providing cutting-edge solutions for software product and services companies to move their infrastructure into the cloud. The cloud made ideas like Containers, Orchestrators, Microservices and Serverless easy to explore and implement, and soon software underwent another transformation.

Distributed architecture meant more moving parts, which meant more communication between those parts. Application-level communication (via gRPC, HTTPS, REST, GraphQL etc.) increased in a distributed environment. With the introduction of containers and serverless, intra-cluster communication spiked as well. More communication meant more messages, and more messages meant more events. Monitoring the health and performance of complex distributed architectures became important for quick root-cause analysis and debugging. Distributed teams felt the need for a system that could collect and monitor the various events happening all over the system. Such events could be health logs of every node in the cluster, performance metrics of the hundreds of upstream/downstream services across datacenters, logs describing the topology of the cluster network, business logic logs, request failure logs etc.

Building resilient and more fault-tolerant systems required understanding the context of the events being monitored. Observability was introduced to provide such context-aware monitoring of the distributed infrastructure at hand. It is rather difficult to reduce Observability to a single concept. Some definitions consider Observability in terms of system failure, while others talk about Observability with reference to the testing pyramid. Distributed systems are inherently not designed to be 100% available all the time. Hence, it makes sense to treat Observability as the practice of collecting every possible snapshot of the system. These snapshots can then be used to build intelligent analytics on the collected data, which can in turn drive alerts and possibly self-healing triggers in the system. The final outcome of an Observability stack is a set of visualisations of these snapshots, analytics and alerts that are useful to the engineering team.

We like to define it as: An engineering philosophy wherein you observe the data flowing through the whole system via a set of tools and practices, and turn the collected data points and contexts into useful insights.
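As a rough illustration of turning "data points and contexts into useful insights", the sketch below attaches context to each event and then correlates events across services. The `Event` type, the `request_id` and `level` context keys, and the service names are all hypothetical, chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    service: str
    message: str
    # Free-form context attached at emit time (keys here are assumptions).
    context: dict = field(default_factory=dict)


def group_by_request(events):
    """Group events by the hypothetical `request_id` context key."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.context.get("request_id", "unknown")].append(event)
    return dict(grouped)


def failing_requests(events):
    """Return the request ids that have at least one event with level 'error'."""
    return sorted(
        request_id
        for request_id, related in group_by_request(events).items()
        if any(e.context.get("level") == "error" for e in related)
    )


if __name__ == "__main__":
    events = [
        Event("gateway", "request received", {"request_id": "r1", "level": "info"}),
        Event("payments", "upstream timeout", {"request_id": "r1", "level": "error"}),
        Event("gateway", "request received", {"request_id": "r2", "level": "info"}),
    ]
    for request_id in failing_requests(events):
        trail = [f"{e.service}: {e.message}" for e in group_by_request(events)[request_id]]
        print(request_id, trail)
```

Without the shared context key, the two `r1` events would be unrelated log lines from different services; with it, they form a single trail that shows where a request went wrong.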

Monitoring vs Observability

Monitoring is inherently the tracking of alerts and numbers that describe what’s going on in the system. Monitoring helps in answering “how is your system doing” by collecting and dashboarding the data.

Observability, like visibility and availability, is a quality of the service. It attempts to explore the system with the intention of deducing “what is your system doing” by exposing the data and exploiting the context between data points to better understand the system. If something is observable, then you can monitor it (as well as do other things). 'Observable' monitoring systems inherently:
1. Collect and analyse high quality data in terms of correctness and completeness
2. Periodically update the metrics so as to avoid getting 'gamed'
3. Avoid wrong incentives/insights by analysing a combination of metrics instead of each metric in isolation
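
The third point can be sketched with a small example: the same raw error count leads to opposite conclusions once it is read together with request volume and latency. The `WindowMetrics` type and the threshold values are assumptions made for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    """Metrics aggregated over one hypothetical time window."""
    requests: int
    errors: int
    p99_latency_ms: float


def error_rate(m: WindowMetrics) -> float:
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return m.errors / m.requests if m.requests else 0.0


def is_unhealthy(m: WindowMetrics,
                 max_error_rate: float = 0.01,   # assumed threshold: 1%
                 max_p99_ms: float = 500.0) -> bool:  # assumed threshold
    """Combine error rate and tail latency rather than a raw error count."""
    return error_rate(m) > max_error_rate or m.p99_latency_ms > max_p99_ms


if __name__ == "__main__":
    quiet = WindowMetrics(requests=1_000, errors=50, p99_latency_ms=120.0)
    busy = WindowMetrics(requests=1_000_000, errors=50, p99_latency_ms=120.0)
    # Both windows report 50 errors, yet only the quiet one is in trouble.
    print("quiet window unhealthy:", is_unhealthy(quiet))
    print("busy window unhealthy:", is_unhealthy(busy))
```

An alert on "more than 40 errors" would fire for both windows; relating the count to traffic and latency avoids that misleading signal.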

©The Remote Lab UK