Pillars of Observability
The concept of observability is built on collecting every possible snapshot of the application. These snapshots feed analytics over the collected data, which in turn can drive alerts and possibly self-healing triggers in the system. The final output of an observability stack is a set of visualizations of these snapshots, analytics, and alerts that are useful to the engineering team. Observability is a process that is more than just tooling: it is a culture that, when adopted, makes the system context-aware.
Making a system observable rests on collecting factual measurements and drawing insightful inferences from them.
Being aware of all the events happening in the system, and collecting them, yields a rich dataset for achieving observability. Such measurements include:
- Logging: Logging consists of recording discrete events in the system. These events can be structured (JSON-based application/system logs) or unstructured (text strings)
- Metrics: Metrics are aggregatable measurements such as counters (e.g. HTTP requests), gauges (e.g. HTTP queue depth), and histograms, which help identify trends
- Tracing: Tracing records events with causal ordering across services and distributed systems, which makes it possible to identify causes across service boundaries
Inference consists of extracting information from the collected data and correlating multiple data sources to give a better understanding of the system.
Logs provide strategic insight by capturing a snapshot of the system along with the context shared between the subsystems of an application. Logs are generally instrumented according to how they will be used. Depending on the storage rules, they are processed, aggregated, and eventually stored in a centralized data store, where they can be indexed, queried, and analysed or processed further.
Logs can originate from a number of sources, such as:
- Application logic code (Business logic)
- Middleware and Network communication (Request Brokers, JDBC, Switches etc.)
- Communication over the network (HTTP requests/responses)
- Communication with underlying Database(s)
- Communication channels (Message brokers)
- Messages with task-queues, caches (Celery, Redis, etc.)
- Interaction with load balancers
- Communication with security and authentication modules (Firewalls)
Types of Logs
- Plaintext Logs: Generally timestamped free-form text.
- Structured Logs: Logs with a well-defined structure, for instance logs conforming to the JSON format.
- Binary Logs: Logs in the Protobuf format, MySQL binlogs used for replication and point-in-time recovery, systemd journal logs, the pflog format used by the BSD firewall etc.
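To make the difference between the first two types concrete, here is a minimal sketch in Python. The helper names (`plaintext_line`, `structured_line`) and the field names are hypothetical, chosen only for illustration. The same login event is emitted once as free-form text and once as a JSON object.

```python
import json
from datetime import datetime, timezone


def plaintext_line(message: str, when: datetime) -> str:
    """Timestamped free-form text: easy to read, harder to query."""
    return f"{when.isoformat()} {message}"


def structured_line(event: str, when: datetime, **fields) -> str:
    """One JSON object per line: every field can be indexed and filtered."""
    record = {"ts": when.isoformat(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)


when = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
plain = plaintext_line("user 42 logged in from 10.0.0.7", when)
structured = structured_line("login", when, user_id=42, source_ip="10.0.0.7")

# The structured form can be parsed back without regular expressions.
parsed = json.loads(structured)
print(plain)
print(parsed["user_id"], parsed["source_ip"])
```

Recovering the user ID from the plaintext line needs a pattern written for that one message format, whereas the structured line exposes it as a named field.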
Benefits of Logs
- Logs can record details of every single request. Hence, they can be queried with tools ranging from the simplest to the most complex to gather insights into the working of the system.
- With intelligent logging one can play Dr. Strange and reconstruct the state of the system at a past point in time
- The analysis performed using logs can further be logged and stored for future comprehensive analysis
- Structured logging helps capture just about anything that one might perceive to be of interest. Such log data can support high dimensionality, which is useful for things like:
  - Exploratory analysis of outliers
  - Analytics like measuring revenue, billing, user engagement
  - Real-time fraud analysis
  - DDoS detection
Limitations of Logs
- Log generation and storage overhead tends to grow rapidly with the number of application components, nodes, and communication channels
- The cost of logs increases in lockstep with user traffic or any other system activity that could result in a sharp uptick in Observability data
Note: Stream processing (for data in motion) and compression (for data at rest) can dramatically reduce the amount of storage needed.
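The note above can be illustrated with a small sketch using only the Python standard library. The synthetic log lines and the `summarize` helper are assumptions made for this example. It shows two independent savings: compressing stored lines with gzip, and reducing a stream to per-status counts instead of keeping every line.

```python
import gzip
import json
from collections import Counter

# Synthetic, highly repetitive log lines (an assumption for this sketch).
lines = [
    json.dumps({
        "level": "INFO",
        "event": "http_request",
        "path": "/health",
        "status": 500 if i % 100 == 0 else 200,
        "request_id": i,
    })
    for i in range(1000)
]

# Data at rest: compress the stored bytes.
raw = "\n".join(lines).encode("utf-8")
compressed = gzip.compress(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes")


def summarize(stream):
    """Data in motion: keep only a count per HTTP status, not every line."""
    counts = Counter()
    for line in stream:
        counts[json.loads(line)["status"]] += 1
    return counts


status_counts = summarize(lines)
print(dict(status_counts))
```

The actual compression ratio depends heavily on how repetitive the real logs are, so the sizes printed here should not be read as typical.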
Metrics are a set of numbers that give information about a particular process or activity. They are measures of properties in pieces of software or hardware. To make a metric useful we keep track of its state, generally recording data points or observations over time. An observation is a value, a timestamp, and sometimes a series of properties that describe the observation, such as a source or tags. The combination of these data point observations is called a time series.
Metrics are commonly classified into types such as gauges, counters, and timers.
Types of Metrics
Gauges: Gauges are numbers that are expected to change over time. A gauge is essentially a snapshot of a specific measurement. The classic metrics of CPU, memory, and disk usage are usually articulated as gauges. For business metrics, a gauge might be the number of customers present on a site.
Counters: Counters are numbers that increase over time and never decrease. Good examples of application and infrastructure counters are system uptime, the number of bytes sent and received by a device, or the number of logins. Examples of business counters might be the number of sales in a month or the cost of sales for a time period. A useful thing about counters is that they let you calculate rates of change, and the rate of change between two values often reveals more than the values themselves. For example, the number of logins is marginally interesting, but create a rate from it and you can see the number of logins per second, which should help identify periods of site popularity.
Timers: Timers track how long something took. They are commonly used for application monitoring. For example, you might start a timer at the beginning of a specific method and stop it at the end of the method. Each invocation of the method would then produce a measurement of the time the method took to execute.
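As a rough illustration of the counter and timer ideas above, the following sketch computes a rate of change from two counter readings and times a block of code. The function names `rate_per_second` and `timer` are hypothetical, and the sample values are made up. Counter resets (for example after a process restart) are deliberately not handled.

```python
import time
from contextlib import contextmanager


def rate_per_second(earlier, later):
    """Rate of change between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("later sample must have a larger timestamp")
    return (v1 - v0) / (t1 - t0)


@contextmanager
def timer(durations: list):
    """Append the elapsed wall-clock seconds of the wrapped block to durations."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations.append(time.perf_counter() - start)


# Login counter read twice, 60 seconds apart: 300 new logins in 60 s.
logins_rate = rate_per_second((0.0, 1200), (60.0, 1500))
print(logins_rate)

# Each pass through the with-block records one duration measurement.
durations = []
with timer(durations):
    sum(range(10_000))
print(len(durations))
```

A gauge needs no such helper: it is simply the latest value read, such as the current memory usage.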
Benefits of Metrics
- Since metrics are just numbers measured over intervals of time, they can be compressed, stored, processed and retrieved very efficiently
- Metrics are optimized for storage and enable longer retention of data, which can in turn be used to build dashboards to reflect historical trends
- Metrics allow for effective and valid aggregations (daily or weekly frequency)
- Metrics transfer and storage have a constant overhead
- The cost of metrics doesn’t increase in lockstep with user traffic or any other system activity
- Metrics, once collected, are more malleable to mathematical and statistical transformations such as sampling, aggregation, summarization and correlation, which makes them better suited for monitoring and profiling purposes
- Metrics are also better suited to trigger alerts, since running queries against an in-memory time series database is far more efficient than running a query against a distributed system storage, and then aggregating the results before deciding if an alert needs to be triggered
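To show why numeric samples held in memory make alerting cheap, here is a minimal sketch of a threshold rule evaluated over a sliding window. The `ThresholdAlert` class and its parameters are hypothetical and do not correspond to any particular monitoring system's API.

```python
from collections import deque
from statistics import mean


class ThresholdAlert:
    """Fire when the mean of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold


# CPU utilisation samples (made-up values between 0 and 1).
alert = ThresholdAlert(threshold=0.9, window=3)
fired = [alert.observe(v) for v in [0.5, 0.95, 0.97, 0.99]]
print(fired)
```

Averaging over a window rather than alerting on a single sample is one common way to avoid firing on a momentary spike. The window size and threshold here are arbitrary.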
Limitations of Metrics
- One of the biggest drawbacks of historical time series databases has been a representation of metrics that did not lend itself well to exploratory analysis or filtering. The hierarchical metric model and the lack of tags or labels in systems like Graphite especially hurt in this regard. Modern monitoring systems like Prometheus represent every time series using a metric name as well as additional key-value pairs called labels. This allows for a high degree of dimensionality in the data model. A metric is identified using both the metric name and the labels
- Metrics in Prometheus are immutable; changing the name of the metric or adding or removing a label will result in a new time series. The actual data stored in the time series is called a sample, and it consists of two components: a float64 value and a millisecond-precision timestamp.
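The labeled data model described above can be sketched as a toy in-memory store. This is an illustration of the idea only, not Prometheus's actual storage engine or client API; `TimeSeriesStore`, `series_key`, and the metric and label names are assumptions made for the example.

```python
from collections import defaultdict


def series_key(name: str, labels: dict):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class TimeSeriesStore:
    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp_ms: int, value: float):
        # A sample is a (millisecond timestamp, float value) pair.
        self._series[series_key(name, labels)].append((timestamp_ms, float(value)))

    def select(self, name, **matchers):
        """Return every series of `name` whose labels match all matchers."""
        result = {}
        for (series_name, label_pairs), samples in self._series.items():
            labels = dict(label_pairs)
            if series_name == name and all(
                labels.get(k) == v for k, v in matchers.items()
            ):
                result[(series_name, label_pairs)] = samples
        return result

    def __len__(self):
        return len(self._series)


store = TimeSeriesStore()
store.append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 10)
store.append("http_requests_total", {"method": "GET", "status": "200"}, 2000, 17)
# A different label value creates a new, separate series.
store.append("http_requests_total", {"method": "GET", "status": "500"}, 2000, 1)

errors = store.select("http_requests_total", status="500")
print(len(store), len(errors))
```

Because every distinct label combination is its own series, labels with many possible values multiply the number of series the store has to hold.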
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system. Traces are a representation of logs; the data structure of traces looks almost like that of an event log. A single trace can provide visibility into both the path traversed by a request as well as the structure of a request. The path of a request allows software engineers and SREs to understand the different services involved in the path of a request, and the structure of a request helps one understand the junctures and effects of asynchrony in the execution of a request.
Components of a Trace
- Span: A span is a set of annotations that correspond to a particular RPC. A span represents a logical unit of work that has an operation name, the start time of the operation, and the duration. Spans may be nested and ordered to model causal relationships.
- Trace: A trace is a data/execution path through the system, and can be thought of as a directed acyclic graph of spans. At the highest level, a trace tells the story of a transaction or workflow as it propagates through a (potentially distributed) system. Traces are built by collecting all spans that share a traceId. The spans are then arranged in a tree based on span ID and parent ID, thus providing an overview of the path a request takes through the system.
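The assembly step described above (collect spans that share a trace ID, then arrange them by span ID and parent ID) can be sketched as follows. The `Span` fields, the helper names, and the sample spans are hypothetical and are not taken from any specific tracing library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    operation: str
    start_ms: int
    duration_ms: int


def build_tree(spans, trace_id):
    """Return (root, children) for one trace; children maps span_id -> child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.trace_id != trace_id:
            continue  # span belongs to another trace
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)  # order siblings by start time
    return root, children


def render(span, children, depth=0):
    """Flatten the tree into indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.operation} ({span.duration_ms} ms)"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


spans = [
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "c", "a", "charge-card", 40, 70),
    Span("t1", "b", "a", "load-cart", 5, 30),
    Span("t1", "d", "c", "db.insert", 60, 20),
    Span("t2", "x", None, "GET /health", 0, 2),
]
root, children = build_tree(spans, "t1")
tree_lines = render(root, children)
print("\n".join(tree_lines))
```

Spans may arrive out of order and from different services, which is why the tree is rebuilt from the IDs rather than from arrival order.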
Although discussions about tracing tend to pivot around its utility in a microservices environment, it is fair to suggest that any sufficiently complex application that interacts with, or rather contends for, resources such as the network, disk, or a mutex in a non-trivial manner can benefit from the advantages tracing provides.
The basic idea behind tracing is straightforward: identify specific points (function calls, RPC boundaries, or segments of concurrency such as threads, continuations, or queues) in an application, proxy, framework, library, runtime, middleware, or anything else in the path of a request that represent the following:
- Forks in execution flow (OS thread or a green thread)
- A hop or a fan out across network or process boundaries
Traces are used to identify the amount of work done at each layer while preserving causality by using happens-before semantics.
Benefits of Tracing
- Identify the amount of work done at each component/layer/service while preserving causality
- Ability to track a request as it travels through each of the services
- Ability to collect metrics of any interest for a specific span
Limitations of Tracing
- Most distributed tracing tools still lack support for a number of programming languages
- Library instrumentation still lacks support for a few big frameworks
- Tracing tools themselves are distributed in nature. Hence, compatibility needs to be checked at the time of orchestration