Error budget is the mechanism to quantify allowed unreliability
An error budget is the complement of the reliability target: it defines how unreliable the service is allowed to be. This unavailability can be the result of planned or unplanned maintenance, hardware or infrastructure failure, bad release roll-outs, etc. If the error budget is spent in full, changes to the service are frozen (except for emergency releases).
For example: if the SLO is defined as 99.9% request success per quarter, then the error budget states that at most 0.1% of requests can fail in the given quarter.
Expressed as time, a 99.9% availability SLO allows roughly 43 minutes of downtime per month, which is just enough time for a monitoring system to surface an issue and for the support team to investigate and resolve it, and even then for no more than a single incident per month.
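The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not part of any SRE tooling: the function names, the 30-day month and the request count are assumptions chosen for the example.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of events (or time) allowed to fail, e.g. 99.9 -> 0.001."""
    return (100.0 - slo_percent) / 100.0


def allowed_failed_requests(slo_percent: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return total_requests * error_budget_fraction(slo_percent)


def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of downtime the budget allows for a time-based availability SLO."""
    return window_days * 24 * 60 * error_budget_fraction(slo_percent)


if __name__ == "__main__":
    # Hypothetical traffic: 10 million requests in the quarter.
    print(round(allowed_failed_requests(99.9, 10_000_000)))  # about 10,000 requests
    # Assumed 30-day month for a time-based SLO.
    print(round(allowed_downtime_minutes(99.9, 30), 1))      # about 43.2 minutes
```

A request-based SLO and a time-based SLO use the same budget fraction; only the quantity it is multiplied by changes.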
It’s important to keep track of the error budget spent, both to prevent overspending and to plan releases according to the budget that remains.
The graph shows that 90% of the error budget for availability in quarter Q1 has been exhausted. That suggests that the system was unavailable for ~118 minutes in the quarter, which is consistent with a 99.9% availability SLO (a budget of roughly 131 minutes per quarter).
The error budget is meant to be consumed. It doesn’t matter what percentage of the error budget we consume as long as it does not exceed 100%. However, tracking consumption does help to plan releases so that we don’t degrade availability further.
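Tracking consumption comes down to comparing observed downtime against the budget for the window. The sketch below assumes a time-based availability SLO of 99.9% and a 91-day quarter; both values and the function names are illustrative assumptions, not taken from a specific monitoring tool.

```python
def budget_minutes(slo_percent: float, window_days: float) -> float:
    """Total downtime budget for the window, in minutes."""
    return window_days * 24 * 60 * (100.0 - slo_percent) / 100.0


def budget_consumed_percent(downtime_minutes: float, slo_percent: float,
                            window_days: float) -> float:
    """Share of the error budget already spent, as a percentage."""
    total = budget_minutes(slo_percent, window_days)
    if total <= 0:
        # A 100% SLO leaves no budget to spend, so the ratio is undefined.
        raise ValueError("SLO must be below 100% to have an error budget")
    return 100.0 * downtime_minutes / total


def budget_remaining_minutes(downtime_minutes: float, slo_percent: float,
                             window_days: float) -> float:
    """Minutes of downtime still available before the budget is exhausted."""
    return budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # Assumed inputs: 118 minutes of downtime, 99.9% SLO, 91-day quarter.
    print(round(budget_consumed_percent(118, 99.9, 91)))      # roughly 90 (%)
    print(round(budget_remaining_minutes(118, 99.9, 91), 1))  # roughly 13 minutes left
```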
Error Budget expenditure
- Define and document measuring window
A measuring window needs to be defined to measure and track the error budget spent, so that decisions can be made between reliability work and feature development. This window can be weekly, monthly, quarterly, yearly, etc. The choice depends on the system size and criticality, the team structure, the maturity of the SRE practice, etc.
- Define and document error budget policy
It is very important to define an error budget policy to track the consumption of the error budget and to take action at various levels of consumption, in order to mitigate the risk of exhaustion and/or plan releases accordingly. A typical error budget policy contains:
  - The trigger for the error budget policy, i.e. when it takes effect, for example on spending X% of the error budget in just a week or on a single outage.
  - The various thresholds of budget spent.
  - The actions to be taken on reaching the thresholds, such as re-prioritization of work.
  - The consequences of breaching a threshold in multiple consecutive measuring windows, such as giving control back to the dev team to improve reliability.
  - The decision makers for the scenarios defined in the policy.
  - Agreement that the policy will be followed by all the teams: Dev, SRE, managers and leadership.
- Reduce critical dependencies
Critical dependencies should be reduced to save error budget. It’s not practically feasible to get rid of all critical dependencies in large systems. However, some best practices in system design can be followed to optimize reliability. The most common strategies for reducing critical dependencies are eliminating single points of failure, adding redundancy, automatic fail-over and fallback, and fast and reliable rollback.
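To make the error budget policy described above more concrete, here is a minimal sketch of how thresholds and their actions could be encoded and evaluated. The 50% and 75% thresholds and their actions are hypothetical placeholders; only the change freeze at 100% comes from this article. A real policy would be agreed between the Dev, SRE and leadership teams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyStep:
    threshold_percent: float  # budget consumption at which the step applies
    action: str               # what the teams agreed to do at that point


# Hypothetical policy; thresholds and wording are illustrative only.
POLICY = [
    PolicyStep(50.0, "review upcoming releases with the SRE team"),
    PolicyStep(75.0, "re-prioritize work towards reliability"),
    PolicyStep(100.0, "freeze changes except emergency releases"),
]


def actions_for(consumed_percent: float, policy=POLICY) -> list:
    """Return every action whose threshold has been reached, lowest first."""
    steps = sorted(policy, key=lambda step: step.threshold_percent)
    return [s.action for s in steps if consumed_percent >= s.threshold_percent]


if __name__ == "__main__":
    for consumed in (40.0, 90.0, 100.0):
        print(consumed, actions_for(consumed))
```

Keeping the policy as data rather than as branching logic makes it easy to review and to change when the thresholds are renegotiated.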
References
- Site Reliability Engineering [book], edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
- Business Monitoring: If You Can’t Measure It, You Can’t Improve It [blog]
- Fundamentals of an error budget policy