Site Reliability Engineering – Measuring Service Behaviors

To manage a service well, you must understand which of its behaviors are critical to the business and how to measure and assess those behaviors. Choosing appropriate metrics for the service is therefore important.

Service Level Indicators (SLIs)

A metric that measures the level of service provided
A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the service. Measurements are often collected over a measurement window and aggregated into a rate, average, or percentile.

Defining an SLI
1. Choose the applications and services for which SLOs need to be defined.
2. Identify the features, activities, and processes critical to the business.
3. Identify and classify the target users of those critical components.
4. Define and measure the aspects of the application that matter to those users.

Some examples of SLIs:
1. Success Rate: Number of successful requests / total number of requests
2. Error Rate: Number of failed requests / total number of requests
3. Request Latency: Number of requests completed successfully in < X ms / total number of requests
4. Correctness: Number of requests with correct data / total number of requests
5. Availability: Time for which service is usable / total time window of monitoring
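The ratios above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `RequestCounts` fields and the `compute_slis` function are hypothetical names, and the sketch assumes that every request that is not successful counts as failed and that an empty window counts as fully good.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters collected over one measurement window."""
    total: int
    successful: int
    fast_successful: int  # successful AND completed in under X ms
    correct: int          # responses that carried correct data


def ratio(good: int, total: int) -> float:
    """Return good / total; an empty window is treated as fully good (assumption)."""
    return good / total if total else 1.0


def compute_slis(counts: RequestCounts, usable_seconds: float, window_seconds: float) -> dict:
    """Compute the five example SLIs as fractions between 0 and 1."""
    success_rate = ratio(counts.successful, counts.total)
    return {
        "success_rate": success_rate,
        # Assumption: any request that did not succeed is a failure.
        "error_rate": 1.0 - success_rate,
        "latency": ratio(counts.fast_successful, counts.total),
        "correctness": ratio(counts.correct, counts.total),
        "availability": usable_seconds / window_seconds,
    }
```

For example, 990 successful requests out of 1,000 give a success rate of 0.99 and an error rate of roughly 0.01.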

Implementing SLI

Identify and document the right SLIs
Not every metric in the monitoring system can be an SLI. It is important to understand the metrics from the business point of view. Choosing too many SLIs creates noise and diverts attention from the indicators that matter.
Collect Indicators
Collect the metrics defined by the SLI from all sources, such as the monitoring system on the server or client-side instrumentation, to get thorough insight.
Aggregate Indicators
Aggregate the metrics over a defined time period (the measurement window) so that they are easier to interpret.
Standardize Indicators
Follow a standard definition for each SLI so that it is not left to individual interpretation.
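As a rough sketch of the aggregation step, the snippet below reduces raw samples from one measurement window to a rate, an average, and a percentile. The sample layout `(timestamp_seconds, latency_ms, ok)` and the function names are assumptions made for illustration, and the percentile uses the simple nearest-rank method; a real monitoring system may define these differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples to aggregate")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate_window(samples, window_start, window_end, p=99):
    """Aggregate (timestamp_seconds, latency_ms, ok) samples within [window_start, window_end)."""
    in_window = [s for s in samples if window_start <= s[0] < window_end]
    if not in_window:
        raise ValueError("no samples in the measurement window")
    latencies = [latency for _, latency, _ in in_window]
    successes = sum(1 for _, _, ok in in_window if ok)
    return {
        "count": len(in_window),
        "success_rate": successes / len(in_window),   # rate
        "mean_latency_ms": mean(latencies),           # average
        "percentile_latency_ms": percentile(latencies, p),  # percentile
    }
```

Writing the window boundaries and the percentile method down in code like this is one way to keep the SLI definition from drifting between teams.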

Service Level Objectives (SLOs)

The target value for a service level, and the performance of the SLI against it over a period

A service level objective (SLO) is a target value or range of values for a service level that is measured by an SLI. SLOs help teams make data-driven decisions and decide which work to prioritize: new features or improved reliability.

SLI ≤ target, or lower bound ≤ SLI ≤ upper bound
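The two forms above translate directly into a small check. This is only a sketch; `meets_slo` and its parameters are hypothetical names, and either bound may be omitted to express a one-sided objective.

```python
from typing import Optional


def meets_slo(sli: float, lower: Optional[float] = None, upper: Optional[float] = None) -> bool:
    """Return True when lower <= sli <= upper, ignoring any bound that is None."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True
```

A latency objective would typically use only `upper` (for example, a 99th-percentile latency of at most 200 ms), while an availability objective would typically use only `lower`.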

It is important to understand that 100% is the wrong target because it is not possible to achieve. It is wiser to target a near-100% value that is practically achievable. Take availability: targeting 100% is not practical, because the system could be down for maintenance, because of a failure, or because of external factors beyond your control. So how should we define an SLO? What are the characteristics of a good SLO?

Good SLO
A good SLO is one that:
• Meets the service reliability goals.
• Is derived from user needs.
• Takes the current performance of the service into account.
• Is approved by all the stakeholders in the organization.
• Is ambitious but achievable under normal circumstances.
• Can be achieved consistently over a period of time.
• Has a process in place for review and redefinition.

The SLIs and SLOs defined initially may not be correct. Therefore, it is important to set up a review process to improve them.

Implementing SLO

Define and Document SLO
SLOs should specify how they’re measured and the conditions under which they’re valid. It may be appropriate to define separate objectives for each class of workload:
- 90% of Type A requests will be completed in 1 ms.
- 99% of Type A requests will be completed in 10 ms.
- 99% of Type B requests will be completed in 20 ms.
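The per-class objectives above could be documented as data and evaluated against observed latencies, roughly as follows. The `LATENCY_SLOS` table and the function names are illustrative, and the sketch assumes "completed in N ms" means a latency less than or equal to N ms.

```python
# (workload class, target fraction of requests, latency threshold in ms)
LATENCY_SLOS = [
    ("A", 0.90, 1.0),
    ("A", 0.99, 10.0),
    ("B", 0.99, 20.0),
]


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed within the threshold (inclusive, by assumption)."""
    if not latencies_ms:
        raise ValueError("no latency samples for this workload class")
    return sum(1 for latency in latencies_ms if latency <= threshold_ms) / len(latencies_ms)


def evaluate_latency_slos(latencies_by_class):
    """Return one (class, threshold_ms, target, observed, met) tuple per objective."""
    results = []
    for workload, target, threshold_ms in LATENCY_SLOS:
        observed = fraction_within(latencies_by_class[workload], threshold_ms)
        results.append((workload, threshold_ms, target, observed, observed >= target))
    return results
```

Keeping the objectives in one table makes the measurement conditions explicit and lets the same evaluation run for every workload class.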
Critical Dependencies
A service cannot be more available than the intersection of all its critical dependencies. If the aim for the service is to offer 99.99 percent availability, then all the critical dependencies must be significantly more than 99.99 percent available.
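To see why the dependencies must do better than the service's own target, consider a simplified model in which the service is up only when every critical dependency is up and dependency failures are independent. Both conditions are assumptions made for this sketch, and `availability_upper_bound` is a hypothetical helper. Under that model the best achievable availability is the product of the dependencies' availabilities.

```python
import math


def availability_upper_bound(dependency_availabilities):
    """Upper bound on service availability, assuming independent, hard dependencies."""
    return math.prod(dependency_availabilities)


# Three dependencies at exactly 99.99% cap the service below 99.99% ...
at_target = availability_upper_bound([0.9999, 0.9999, 0.9999])
# ... while three dependencies at 99.999% leave room for a 99.99% objective.
above_target = availability_upper_bound([0.99999, 0.99999, 0.99999])
```

With three dependencies at 99.99 percent each, the bound is roughly 99.97 percent, which already misses a 99.99 percent objective before the service's own failures are counted.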
Frequency, detection time, and recovery time
A service cannot be more available than its incident frequency multiplied by its detection and recovery time. For example, three complete outages per year that last 20 minutes each result in a total of 60 minutes of outages. Even if the service worked perfectly the rest of the year, 99.99 percent availability (no more than 53 minutes of downtime per year) would not be feasible.
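The arithmetic in this example can be checked with a short calculation. The function names are illustrative, and the sketch assumes a 365-day year and that each outage is a complete outage for its full duration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(target_availability):
    """Yearly downtime budget implied by an availability target."""
    return (1 - target_availability) * MINUTES_PER_YEAR


def achieved_availability(outages_per_year, minutes_per_outage):
    """Availability when the only downtime is the given number of complete outages."""
    downtime = outages_per_year * minutes_per_outage
    return 1 - downtime / MINUTES_PER_YEAR


budget = allowed_downtime_minutes(0.9999)   # about 52.56 minutes per year
actual = achieved_availability(3, 20)       # 60 minutes of downtime in the year
```

Sixty minutes of downtime exceeds the budget of about 52.56 minutes, so the achieved availability (roughly 99.989 percent) falls short of 99.99 percent.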
Measure SLO
Monitor and measure the SLIs and compare them with the SLOs to keep track of performance.
Respond
If an intermediate threshold or the SLO is breached, take the actions necessary to bring the service back to normal.
Validate and Improve SLO
Validate the SLOs against business feedback, postmortems, and SRE feedback, and redefine them if required.

Service Level Agreements (SLAs)
Service level agreements are contracts with the customer that specify the service level targets and the consequences of missing those targets. SLAs are tied to business decisions, and SRE doesn’t typically take part in constructing them. However, SRE does help to avoid the consequences of missed targets.

References

Site Reliability Engineering [book]
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy

Service Level Indicators (SLIs)

Service Level Objectives (SLOs)
