Today, a few technology companies, such as honeycomb.io, humio.io, and lightstep.com, provide out-of-the-box paid frameworks and products in the Observability area.
However, there are also quite a few mature open-source (OSS) tools and frameworks available for realising Observability via Instrumentation (logging, metrics, and tracing collection), Stack (data storage), and Visualisation (analysis). These include the following:
Elastic Stack (Logging, Storage and Dashboarding)
Formerly known as the “ELK” Stack, “ELK” stood for three open source projects: Elasticsearch, Logstash, and Kibana.
- Logstash: Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. In the context of Observability, it serves as a logging instrumentation component.
- Elasticsearch: Elasticsearch is a search and analytics engine. In the context of Observability, it serves as a centralised data storage component into which varied data can be ingested, and from which data can be exported to any UI tooling framework for analytics purposes.
- Kibana: Kibana lets users visualize Elasticsearch data with charts and graphs. In the context of Observability, Kibana serves as a dashboarding/analytics component.
- Beats: In 2015, Elastic introduced a family of lightweight, single-purpose data shippers into the ELK Stack equation called Beats. The Elastic community and framework continue to grow stronger as the need for Observability takes firm root in modern engineering practice.
- Elastic provides enterprise-level proactive cluster alerting which automatically notifies on changes in cluster state, application state, and a host of other metrics
- Elastic provides multi-stack support & analysis to record, track, and compare the health and performance from a single place.
- Elastic provides the ability to go beyond rule-based alerting by combining alerting with unsupervised machine learning
- Elastic provides the ability to generate, schedule, and email reports of any Kibana visualization or dashboard based on specified conditions, with reporting architected to scale
- Elastic provides the ability to export raw documents, saved searches, and metrics for seamless integration into existing monitoring frameworks
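As a sketch of how Logstash fits into this stack, the following minimal pipeline configuration receives logs from a Beats shipper, parses them, and stashes them in Elasticsearch. The port, grok pattern, host, and index name are illustrative placeholders, not a definitive setup:

```conf
# logstash.conf -- minimal illustrative pipeline
input {
  beats {
    port => 5044                    # Filebeat/other Beats ship logs here
  }
}
filter {
  grok {
    # Parse standard Apache access-log lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"   # daily index; name is illustrative
  }
}
```

Kibana would then visualize the `app-logs-*` indices that this pipeline produces.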
Prometheus (Metrics Collection and Dashboarding)
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open source project and maintained independently of any company. Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.
The Prometheus ecosystem consists of multiple components, many of which are optional:
- The main Prometheus server which scrapes and stores time series data
- Client libraries for instrumenting application code
- A push gateway for supporting short-lived jobs
- Special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
- An alertmanager to handle alerts
- Various support tools (e.g. node_exporter)
Most Prometheus components are written in Go, making them easy to build and deploy as static binaries.
- Prometheus provides a multi-dimensional data model with time series data identified by metric name and key/value pairs which fits both machine-centric monitoring as well as monitoring of highly dynamic microservices architectures
- Prometheus has a flexible query language (PromQL) to leverage this dimensionality
- Prometheus is designed for reliability, to be the system you go to during an outage to allow you to quickly diagnose problems
- In Prometheus time series collection happens via a pull model over HTTP
- Pushing time series is supported via an intermediary gateway
- Targets are discovered via service discovery or static configuration
- Multiple modes of graphing and dashboarding support (e.g. Grafana and other API consumers)
- No reliance on distributed storage; single server nodes are autonomous
- Each Prometheus server is standalone, not depending on network storage or other remote services, thus making it reliable even when parts of the infrastructure are broken
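The pull model described above can be sketched with nothing but the standard library: the application exposes a `/metrics` endpoint in Prometheus's text exposition format, and the Prometheus server scrapes it over HTTP. The metric name, labels, and port here are illustrative; a real application would normally use an official client library (e.g. `prometheus_client` for Python) instead of hand-rolling the format:

```python
# Minimal sketch of the Prometheus pull model (stdlib only).
from http.server import BaseHTTPRequestHandler, HTTPServer

# A counter broken down by key/value labels, matching the
# multi-dimensional data model described above.
REQUESTS = {("GET", "/api"): 0, ("POST", "/api"): 0}

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, path), value in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To expose the endpoint for Prometheus to scrape:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus would be pointed at this target via static configuration or service discovery, as noted above.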
Zipkin (Trace Collection and Dashboarding)
Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures, and manages both the collection and lookup of this data. Applications are instrumented to report timing data to Zipkin.
- Zipkin Collector: Once the trace data arrives at the Zipkin collector daemon, it is validated, stored, and indexed for lookups.
- Storage: Zipkin was initially built to store data in Cassandra, since Cassandra is scalable and has a flexible schema. In addition to Cassandra, there is support for Elasticsearch and MySQL. A few other back-ends are also available as third-party extensions.
- Zipkin Query Service: Once the data is stored and indexed, we need a way to extract it. The query daemon provides a simple JSON API for finding and retrieving traces. The primary consumer of this API is the Web UI.
- Web UI: A graphical interface for viewing traces. The web UI provides a method for viewing traces based on service, time, and annotations. Note: there is no built-in authentication in the UI.
For example, when an operation is being traced and it needs to make an outgoing http request, a few headers are added to propagate IDs. Headers are not used to send details such as the operation name.
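The propagation described above can be sketched as follows. The header names are Zipkin's real B3 propagation headers, but the ID generation and dict-based "request" here are simplified for illustration:

```python
# Sketch of Zipkin B3 header propagation: the caller injects trace/span
# IDs into outgoing HTTP headers; only IDs travel, never operation names.
import secrets

def new_b3_headers(parent=None):
    """Build B3 headers for an outgoing HTTP request.

    If `parent` already carries B3 headers, reuse its trace ID and
    record its span ID as the parent; otherwise start a new trace.
    """
    span_id = secrets.token_hex(8)        # 64-bit span ID
    if parent and "X-B3-TraceId" in parent:
        trace_id = parent["X-B3-TraceId"]
        parent_id = parent["X-B3-SpanId"]
    else:
        trace_id = secrets.token_hex(16)  # 128-bit trace ID
        parent_id = None

    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1",              # mark the trace for reporting
    }
    if parent_id:
        headers["X-B3-ParentSpanId"] = parent_id
    return headers
```

Each hop creates a fresh span ID while the trace ID stays constant, which is what lets Zipkin stitch the spans of one request back together.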
The component in an instrumented app that sends data to Zipkin is called a Reporter. Reporters send trace data via one of several transports to Zipkin collectors, which persist trace data to storage. Later, storage is queried by the API to provide data to the UI.
- Web UI: Zipkin presents an easy-to-understand web UI which provides a dependency diagram showing how many traced requests went through each application.
- Trace Filtering: Zipkin lets you filter or sort all traces based on application, trace length, annotation, or timestamp when troubleshooting latency problems or errors
- Production Safe: Instrumentation is written to be safe in production and to have little overhead
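A minimal Reporter, as described above, can be sketched by building a span in Zipkin's JSON v2 format and POSTing it to the collector. The field names and the `/api/v2/spans` endpoint follow Zipkin's documented v2 model (timestamps and durations in microseconds); the service and operation names are placeholders:

```python
# Sketch of a minimal Zipkin reporter: build a JSON v2 span payload
# ready to POST to the collector's HTTP endpoint.
import json
import secrets
import time

def make_span(name, service, duration_us):
    return {
        "traceId": secrets.token_hex(16),
        "id": secrets.token_hex(8),
        "name": name,
        "timestamp": int(time.time() * 1_000_000),  # microseconds
        "duration": duration_us,                    # microseconds
        "localEndpoint": {"serviceName": service},
    }

def encode_spans(spans):
    # Zipkin's v2 API accepts a JSON array of spans.
    return json.dumps(spans).encode()

# A real reporter would POST this payload (Content-Type:
# application/json) to http://zipkin-host:9411/api/v2/spans.
payload = encode_spans([make_span("get /orders", "checkout", 4200)])
```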
Jaeger (Trace Collection and Dashboarding)
Jaeger is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including:
- Distributed context propagation
- Distributed transaction monitoring
- Root cause analysis
- Service dependency analysis
- Performance/latency optimization
- Jaeger Agent: The Jaeger agent is a network daemon that listens for spans sent over UDP, which it batches and sends to the collector. It is designed to be deployed to all hosts as an infrastructure component. The agent abstracts the routing and discovery of the collectors away from the client.
- Jaeger Collector: The Jaeger collector receives traces from Jaeger agents and runs them through a processing pipeline. Currently, the pipeline validates traces, indexes them, performs any transformations, and finally stores them.
- Jaeger Query: Jaeger Query is a service that retrieves traces from storage and hosts a UI to display them
- High Scalability: Jaeger backend is designed to have no single points of failure and to scale with the business needs
- Native support for OpenTracing: Jaeger backend, Web UI, and instrumentation libraries have been designed from ground up to support the OpenTracing standard
- Multiple storage backends: Jaeger supports two popular open source NoSQL databases as trace storage backends: Cassandra 3.4+ and Elasticsearch 5.x/6.x. There are ongoing community experiments using other databases, such as ScyllaDB, InfluxDB, Amazon DynamoDB. Jaeger also ships with a simple in-memory storage for testing setups.
- Cloud Native Deployment: Jaeger backend is distributed as a collection of Docker images. The binaries support various configuration methods, including command line options, environment variables, and configuration files in multiple formats (yaml, toml, etc.) Deployment to Kubernetes clusters is assisted by Kubernetes templates and a Helm chart.
- Metrics Support: All Jaeger backend components expose Prometheus metrics by default (other metrics backends are also supported). Logs are written to standard out using the structured logging library zap.
- Backwards compatibility with Zipkin: Jaeger provides backwards compatibility with Zipkin by accepting spans in Zipkin formats (Thrift or JSON v1/v2) over HTTP. Switching from Zipkin backend is just a matter of routing the traffic from Zipkin libraries to the Jaeger backend.
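The cloud-native deployment and Zipkin compatibility described above can be sketched with a local all-in-one setup. The ports below are Jaeger's documented defaults; the Zipkin environment variable shown is the one used by recent all-in-one images (older releases used `COLLECTOR_ZIPKIN_HTTP_PORT`), so treat this as an illustrative assumption rather than a definitive deployment:

```yaml
# docker-compose.yaml -- illustrative local Jaeger all-in-one setup
version: "3"
services:
  jaeger:
    image: jaegertracing/all-in-one
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411   # accept Zipkin-format spans
    ports:
      - "6831:6831/udp"   # agent: spans over UDP
      - "16686:16686"     # query service / web UI
      - "14268:14268"     # collector: spans over HTTP
      - "9411:9411"       # Zipkin-compatible endpoint
```

With this running, Zipkin instrumentation libraries can simply be pointed at port 9411 to switch backends, as noted above.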
OpenCensus (Metrics and Trace Collection)
OpenCensus is a metric and trace collection tool. It is available as a vendor-agnostic set of libraries which can be used to collect traces and metrics from any application. Instrumenting application code with OpenCensus helps you understand exactly how a request travels between services and gather useful metrics about the entire architecture. OpenCensus is used to visualize the request lifecycle, perform root-cause analysis, and optimize service latency by gaining key insights into the latency and performance of every (micro)service and data store being managed.
- Context: Some of the features for distributed tracing and tagging need a way to propagate a specific context (trace, tags) in-process (possibly between threads) and between function calls. The Context component makes sure that such contexts, and their corresponding sub-contexts, are propagated.
- Trace API: The Trace component is designed to support distributed tracing, including the collection and export of trace data.
- Tags API: The Tag API allows for creating, modifying and querying objects representing a tag (key-value pair) which propagate through the context subsystem via RPC, HTTP, etc.
- Stats API: The Stats API component is designed to record measurements, dynamically break them down by application-defined tags, and aggregate those measurements in user-defined ways. It is designed to offer multiple types of aggregation (e.g. distributions) and be efficient (all measurement processing is done as a background activity); aggregating data enables reducing the overhead of uploading data, while also allowing applications direct access to stats.
- Low latency: OpenCensus is simple to integrate and use; it adds very low latency to your applications, and it is already integrated into both gRPC and HTTP transports.
- Vendor Agnosticity: OpenCensus is vendor-agnostic and can upload data to any backend via various exporter implementations. Although OpenCensus provides support for many backends, users can also implement their own exporters for proprietary or otherwise unsupported backends.
- Simplified tracing: Distributed traces track the progression of a single user request as it is handled by internal services, until a response is returned to the user
- Context Propagation: Context propagation is the mechanism by which information (of your choosing) is sent between your services. It is usually performed by sending data in headers and trailers on HTTP and gRPC transports.
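The in-process half of context propagation described above can be sketched with Python's `contextvars`. This is the idea behind OpenCensus's Context component, not its actual API; the names and functions here are hypothetical:

```python
# Illustrative in-process context propagation via contextvars:
# a trace context set at the request entry point is visible to
# nested calls without being passed as an explicit argument.
import contextvars

# The current trace context flows implicitly through function calls
# (and across threads/tasks that copy the context).
trace_context = contextvars.ContextVar("trace_context", default=None)

def handle_request(trace_id):
    # Entry point: bind the incoming trace ID to the context...
    token = trace_context.set({"trace_id": trace_id})
    try:
        return query_database()   # ...no need to pass it explicitly
    finally:
        trace_context.reset(token)

def query_database():
    # A nested call reads the propagated context, e.g. to tag a span
    # or an outgoing RPC with the current trace ID.
    ctx = trace_context.get()
    return f"queried under trace {ctx['trace_id']}"
```

The cross-process half is then just serializing this context into headers or trailers on the outgoing HTTP/gRPC call, as the bullet above notes.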
Grafana (Analytics and Dashboarding)
Grafana is an open source, feature-rich metrics dashboard and graph editor that allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. It gives engineering teams the ability to create, explore, and share dashboards and fosters a data-driven culture.
- Visualize: Fast and flexible client-side graphs with a multitude of options. Panel plugins offer many different ways to visualize metrics and logs.
- Alerting: Visually define alert rules for your most important metrics. Grafana will continuously evaluate them and can send notifications.
- Notifications: When an alert changes state it sends out notifications. Receive email notifications or get them from Slack, PagerDuty, VictorOps, OpsGenie, or via webhook.
- Mixed Data Sources: Mix different data sources in the same graph. You can specify a data source on a per-query basis; this works even for custom data sources.
- Annotations: Annotate graphs with rich events from different data sources. Hovering over an event shows the full event metadata and tags.
- Ad-hoc Filters: Ad-hoc filters allow you to create new key/value filters on the fly, which are automatically applied to all queries that use that data source.
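Connecting Grafana to a stack like the ones above is typically done through data source provisioning. The following is an illustrative provisioning file; the datasource name and Prometheus URL are placeholders for your own environment:

```yaml
# grafana/provisioning/datasources/prometheus.yaml -- illustrative
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana backend proxies the queries
    url: http://localhost:9090     # placeholder Prometheus address
    isDefault: true
```

Once provisioned, dashboards and alert rules can query this data source alongside any others, per the mixed-data-sources feature above.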