The WHAT, The WHERE, And The WHY


An observability framework comprises the following building blocks:
  1. Instrumentation (Edge Collection): Simply put, this block of the framework is responsible for collecting the logs, metrics, and traces
  2. Stack (Data Storage): The stack is where all the collected data (logs, metrics, and traces) is indexed and stored
  3. Visualisation (Analysis): This block presents the collected data in a form that is useful for analysis. This is where the collected data is correlated and turned into dashboards that provide analytical insights into the system and the (business) application
The visualisation layer queries its data from the stack layer, which in turn stores the data provided by the instrumentation layer. The consumers of the visualisation layer range from top management (business) and development teams to SRE (Site Reliability Engineering) teams. Hence, it becomes imperative to have a clear understanding of WHAT data needs to be collected by the instrumentation layer, WHERE such data should be collected, and WHY such data is important.
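To make the flow between the three blocks concrete, here is a minimal, illustrative sketch in Python. The Stack, Instrumentation, and dashboard_summary names are invented for this example; in practice each layer is a dedicated tool (a collection agent, a time-series/log store, and a dashboarding product).

```python
import time
from collections import defaultdict

class Stack:
    """Data storage: indexes collected events by name so they can be queried later."""
    def __init__(self):
        self._events = defaultdict(list)

    def write(self, name, value, tags=None):
        self._events[name].append({"ts": time.time(), "value": value, "tags": tags or {}})

    def query(self, name):
        return self._events[name]

class Instrumentation:
    """Edge collection: the application calls this to emit logs, metrics, and traces."""
    def __init__(self, stack):
        self._stack = stack

    def record(self, name, value, **tags):
        self._stack.write(name, value, tags)

def dashboard_summary(stack, name):
    """Visualisation/analysis: turns the raw stored events into an insight."""
    values = [e["value"] for e in stack.query(name)]
    return {"count": len(values), "avg": sum(values) / len(values) if values else 0}

stack = Stack()
instr = Instrumentation(stack)
instr.record("checkout.duration_ms", 120, endpoint="/pay")
instr.record("checkout.duration_ms", 340, endpoint="/pay")
print(dashboard_summary(stack, "checkout.duration_ms"))  # e.g. {'count': 2, 'avg': 230.0}
```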

As it turns out, these questions are not new. SRE teams have been dealing with them for years, and to our advantage, several SRE workbooks have addressed them in depth.

The WHAT


Building on the experience of SRE teams, we have come to the conclusion that the USE and RED methods give a fair understanding of what types of events need to be collected for an effective observable system. Let us understand both of these methods in brief.

USE stands for: Utilization | Saturation | Errors. The USE method applies to infrastructure resources (network interfaces, storage disks, CPUs, memory, etc.); a minimal collection sketch follows the list below.
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can’t service, often queued
Errors: the count of error events
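As a rough illustration, the following sketch collects USE-style metrics for a single host. It assumes a Unix-like machine and the third-party psutil package; a real deployment would rely on a collection agent rather than ad-hoc code like this.

```python
import os
import psutil  # third-party; assumed installed

def collect_use_metrics():
    cpu = psutil.cpu_times_percent(interval=1)   # Utilization: % of time the CPUs were busy
    load1, _, _ = os.getloadavg()                # Saturation: runnable tasks queued (Unix load average)
    net = psutil.net_io_counters()               # Errors: error events on the network interfaces
    return {
        "cpu.utilization_pct": 100.0 - cpu.idle,
        "cpu.saturation_ratio": load1 / psutil.cpu_count(),
        "net.error_count": net.errin + net.errout,
    }

print(collect_use_metrics())
```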

RED stands for: Rate | Errors | Duration. Since the USE method doesn't really apply to services, the RED method addresses the monitoring of services; a minimal instrumentation sketch follows the list below.
Rate: The number of requests per second
Errors: The number of those requests that are failing
Duration: The amount of time those requests take
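The sketch below shows one possible way to capture RED metrics around a service handler using a plain-Python decorator. The handler, the in-memory METRICS dictionary, and the red_instrumented name are illustrative only; in a real service these counts and durations would be exported to a metrics backend.

```python
import time
from functools import wraps

METRICS = {"requests": 0, "errors": 0, "durations": []}  # illustrative in-memory store

def red_instrumented(handler):
    @wraps(handler)
    def wrapper(*args, **kwargs):
        METRICS["requests"] += 1                  # Rate: total requests (divide by window for req/s)
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        except Exception:
            METRICS["errors"] += 1                # Errors: requests that failed
            raise
        finally:
            METRICS["durations"].append(time.monotonic() - start)  # Duration per request
    return wrapper

@red_instrumented
def handle_request(payload):
    return {"status": "ok", "echo": payload}

handle_request({"user": "alice"})
print(METRICS)
```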

The Google SRE team has also defined ‘4 Golden Signals’, which prove to be insightful when collected (a short aggregation sketch follows the list below). Those 4 Golden Signals are: Latency | Traffic | Errors | Saturation
Latency: The time it takes to service a request
Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”)
Saturation: How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O)
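For illustration, the following sketch derives the four golden signals from a window of recorded requests. The record format, the one-second latency policy, and the capacity and in-flight figures used for saturation are assumptions made for the example.

```python
WINDOW_SECONDS = 60      # assumed measurement window
CAPACITY = 100           # assumed maximum concurrent requests the service can handle
IN_FLIGHT = 37           # assumed number of requests currently being processed

requests = [             # (duration_seconds, http_status) records from the last window
    (0.12, 200), (0.95, 200), (1.40, 200), (0.30, 500),
]

durations = sorted(d for d, _ in requests)
golden_signals = {
    "latency_p50_s": durations[len(durations) // 2],            # Latency
    "traffic_rps": len(requests) / WINDOW_SECONDS,               # Traffic
    "error_rate": sum(1 for d, s in requests                     # Errors: explicit (5xx)
                      if s >= 500 or d > 1.0) / len(requests),   # or by policy (> 1 s)
    "saturation": IN_FLIGHT / CAPACITY,                          # Saturation
}
print(golden_signals)
```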

Sample events
  1. Error count (e.g. RPC errors)
  2. Request bytes
  3. Number of request messages
  4. Response bytes
  5. Number of response messages
  6. Round-trip latency
  7. Server elapsed time
  8. Uncompressed request bytes
  9. Uncompressed response bytes
  10. Page load time (PLT)
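One possible way to emit such events is as structured (JSON) log lines, as in the hypothetical sketch below; the field names simply mirror the list above and the values are made up.

```python
import json
import time

def emit_rpc_event(write, *, method, error_count, request_bytes, response_bytes,
                   request_messages, response_messages, round_trip_ms, server_elapsed_ms):
    # Encode one RPC observation as a single structured log line.
    write(json.dumps({
        "ts": time.time(),
        "rpc.method": method,
        "rpc.error_count": error_count,
        "rpc.request_bytes": request_bytes,
        "rpc.response_bytes": response_bytes,
        "rpc.request_messages": request_messages,
        "rpc.response_messages": response_messages,
        "rpc.round_trip_ms": round_trip_ms,
        "rpc.server_elapsed_ms": server_elapsed_ms,
    }))

emit_rpc_event(print, method="GetUser", error_count=0, request_bytes=412,
               response_bytes=2048, request_messages=1, response_messages=1,
               round_trip_ms=38.5, server_elapsed_ms=22.1)
```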

The WHERE


Good places to add instrumentation (collect data) are the system's points of ingress and egress, and inside the system itself (application logic). For instance:
  1. Log requests and responses for specific webpage hits or API endpoints. (If you're instrumenting an existing application, make a priority-driven list of specific pages or endpoints and instrument them in order of importance.)
  2. Measure and log all calls to external services and APIs, e.g. calls (or queries) to a database, cache, search service, etc. (see the sketch after this list)
  3. Measure and log job scheduling and the corresponding execution, e.g. cron jobs
  4. Measure significant business and functional events, such as users being created or transactions like payments and sales
  5. Measure methods and functions that read and write from databases and caches
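As referenced in item 2 above, the sketch below shows one way to instrument an egress point by wrapping every call to an external dependency with measurement and logging. The measured_call helper and the stand-in query function are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("egress")

def measured_call(dependency, operation, fn, *args, **kwargs):
    # Wrap any outbound call so that its outcome and duration are always recorded.
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        log.info("%s.%s ok duration_ms=%.1f", dependency, operation,
                 (time.monotonic() - start) * 1000)
        return result
    except Exception:
        log.exception("%s.%s failed duration_ms=%.1f", dependency, operation,
                      (time.monotonic() - start) * 1000)
        raise

def fake_query(sql):
    # Stand-in for a real database client call.
    return [("row", 1)]

rows = measured_call("orders_db", "select_recent_orders", fake_query,
                     "SELECT * FROM orders LIMIT 10")
```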

The WHY



‘Observing the software system’ is a good idea as it enables the engineering teams to:
  1. Identify and diagnose faults, failures, and crashes
  2. Measure and analyse the operational performance of the system
  3. Measure and analyse the business performance and success of the system or its component(s)
  4. Discover different insights by combining the results of different metrics
  5. Identify gaps in business logic
  6. Avoid serious business and operational risks as a result of all of the above
