SRE Calculator Revisit
Indicator
Indicators are the metrics that show whether a service is working, performant, or fruitful. In other words, they are measures that signal reliability. Common indicators include latency, error rate, throughput, availability, yield, and durability.
"Successful" vs "unsuccessful" and "valid" vs "all" are important distinctions, and care should be taken to define them. For this tool, unsuccessful means any event that is valid but not successful (i.e., an error).
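To make the valid vs all distinction concrete, here is a minimal sketch of a success-ratio indicator. The function name and the event counts are made up for illustration; the only idea taken from the text above is that invalid events are excluded before the ratio is computed.

```python
def success_ratio(successful: int, valid: int) -> float:
    """Fraction of valid events that were successful (the indicator)."""
    if valid <= 0:
        raise ValueError("need at least one valid event")
    if not 0 <= successful <= valid:
        raise ValueError("successful must be between 0 and valid")
    return successful / valid


# Hypothetical counts: invalid events are dropped before computing the ratio.
all_events = 100_500
invalid = 500
valid = all_events - invalid        # 100000 valid events
unsuccessful = 100                  # valid but not successful (errors)

sli = success_ratio(valid - unsuccessful, valid)
print(f"{sli:.3%}")  # 99.900%
```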
Objective
The objective describes the boundaries for the indicator: it is the goal we set for it. We also need to decide on the window for the objective. Rolling windows (e.g., 30 days or 2 weeks) align with user experience. Calendar windows (e.g., by month or quarter) align more with business planning. Shorter windows force faster decisions, while longer windows allow for more strategic decisions.
For our indicator, we want to target a minimum percentage of successful requests over a rolling window. For fine adjustment, click on the slider and use the arrow keys.
Error Budget: 0.100%
Total events: 100000, split into Not Successful and Successful.
This describes the percentage of failures that the system can tolerate. The error budget is the complement of the SLO: 100% - SLO. For example, an SLO of 99.9% leaves an error budget of 0.1%. It is a buffer for how much change or how many issues the system can absorb before consumers are no longer happy enough.
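The arithmetic is small enough to sketch. The function names here are mine, not the tool's, and the 99.9% objective is an assumption inferred from the 0.100% budget shown above.

```python
def error_budget(slo_percent: float) -> float:
    """Error budget as a percentage: the complement of the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("SLO must be a percentage between 0 and 100")
    return 100.0 - slo_percent


def allowed_failures(slo_percent: float, total_events: int) -> float:
    """How many of total_events may fail before the budget is spent."""
    return total_events * error_budget(slo_percent) / 100.0


print(f"{error_budget(99.9):.3f}%")              # 0.100%
print(f"{allowed_failures(99.9, 100_000):.0f}")  # 100
```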
Burn Rate
Burn rate is how quickly the error budget is consumed. 1x means that the budget is consumed evenly and fully across the window. For a 30 day window, we consume 10% in 3 days, 50% in 15 days, and 100% in 30 days.
10x means that we consume the budget 10 times faster. For a 30 day window, we consume 10% in 7.2 hours, 50% in 1.5 days, and 100% in 3 days.
In practice the burn rate fluctuates, but this graph shows what a constant burn rate looks like over the window. Google SRE recommends examining 1x, 6x, and 14.4x rates.
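The 1x and 10x figures above fall out of one division. This is a rough sketch that assumes a constant burn rate; `time_to_consume` is a name I made up for it.

```python
from datetime import timedelta


def time_to_consume(window: timedelta, burn_rate: float, fraction: float) -> timedelta:
    """Time needed to consume `fraction` of the budget at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn rate must be positive")
    return window * fraction / burn_rate


window = timedelta(days=30)
for rate in (1, 10):
    for fraction in (0.10, 0.50, 1.00):
        # e.g. 10x burn, 10% of budget -> 7.2 hours on a 30-day window
        print(f"{rate}x, {fraction:.0%}: {time_to_consume(window, rate, fraction)}")
```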
Alert Windows
Long Window: 3600 seconds (1 hour)
Short Window: 300 seconds (5 minutes)
Time to Respond: 176400 seconds (2 days 1 hour)
Alerting on error budgets allows us to respond to issues and course correct before we miss our objectives. We need to balance how long it takes to detect consumption of the error budget, how long we have to respond to an issue consuming the budget, and avoiding false positives and fatigue.
Google SRE recommends pairing a long detection window with a short detection window that is a fixed fraction of the long window. We also have to decide how much budget is consumed at the given burn rate before alerting. Google SRE recommends 2% consumed for a 14.4x rate, 5% consumed for a 6x rate, and 10% consumed for a 1x rate.
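Here is a sketch of how those numbers relate, under a few assumptions of mine: a 30-day objective window, a short window that is 1/12 of the long one (which matches the 1 hour and 5 minute pair shown in this section), and "time to respond" meaning the time left until the budget is exhausted once the alert threshold is reached. The function name is hypothetical.

```python
def alert_windows(window_s: float, burn_rate: float, consumed: float,
                  short_ratio: float = 1 / 12) -> tuple[float, float, float]:
    """Return (long window, short window, time to respond), all in seconds."""
    # Time to burn `consumed` (a fraction of the budget) at this burn rate.
    long_s = window_s * consumed / burn_rate
    # Assumption: the short window is a fixed fraction of the long window.
    short_s = long_s * short_ratio
    # Time until the whole budget is gone, minus the time spent detecting.
    respond_s = window_s / burn_rate - long_s
    return long_s, short_s, respond_s


THIRTY_DAYS = 30 * 24 * 3600
for rate, consumed in ((14.4, 0.02), (6, 0.05), (1, 0.10)):
    long_s, short_s, respond_s = alert_windows(THIRTY_DAYS, rate, consumed)
    print(f"{rate}x: long={long_s:.0f}s short={short_s:.0f}s respond={respond_s:.0f}s")
```

For the 14.4x case this works out to a 3600 second long window, a 300 second short window, and 176400 seconds to respond.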
Alert Metric
This is the alerting metric, derived from the SLO and error budget, the burn rate, the consumption threshold, and the short / long window ratio. Google SRE recommends alerting on 14.4x, 6x, and 1x burn rates and deciding whether each alert results in a page or a ticket.
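As a sketch of what such a metric can look like: the alert condition below fires only when the error rate over both the long and the short window exceeds the burn rate multiplied by the error budget. That is my reading of the multiwindow approach in the SRE Workbook; the function names and sample error rates are made up.

```python
def burn_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio that corresponds to burning the budget at `burn_rate`."""
    return burn_rate * (1.0 - slo)


def should_alert(long_error_rate: float, short_error_rate: float,
                 slo: float, burn_rate: float) -> bool:
    """Fire only if both windows are burning faster than the threshold."""
    threshold = burn_threshold(slo, burn_rate)
    return long_error_rate > threshold and short_error_rate > threshold


# With a 99.9% SLO, a 14.4x burn means an error ratio above roughly 1.44%.
print(should_alert(0.02, 0.02, slo=0.999, burn_rate=14.4))   # True
# The long window is still elevated but the short window has recovered.
print(should_alert(0.02, 0.001, slo=0.999, burn_rate=14.4))  # False
```

Requiring the short window as well is what lets the alert clear quickly once the errors stop.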
Alert Demo
Detection Time: 8 minutes
Reset Time: 4 minutes
Simulate a spike in error rate against the alert by intensity and duration. The pink area represents the time period that the alert is active. Detection time is how long it takes for the alert to trigger after the start of the error spike. Reset time is how long it takes for the alert to stop triggering after the spike ends.
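A rough way to reproduce this kind of demo offline is a minute-by-minute simulation. Everything below is an assumption on my part: a zero baseline error rate, one evaluation per minute, a 60 minute long window and a 5 minute short window, and a square-shaped spike. It does not try to match the exact figures the tool reports, since those depend on the chosen intensity and duration and on how the tool counts minutes.

```python
def simulate_alert(spike_rate, spike_minutes, threshold,
                   long_minutes=60, short_minutes=5,
                   spike_start=120, total_minutes=480):
    """Return (detection, reset) in minutes, or None if the alert never fires."""
    spike_end = spike_start + spike_minutes
    # errors[i] is the error ratio during minute i; the baseline is assumed 0.
    errors = [spike_rate if spike_start <= i < spike_end else 0.0
              for i in range(total_minutes)]

    def window_mean(t, width):
        # Mean error ratio over the `width` minutes ending at time t.
        return sum(errors[t - width:t]) / width

    # Evaluation times at which both windows exceed the threshold.
    active = [t for t in range(long_minutes, total_minutes + 1)
              if window_mean(t, long_minutes) > threshold
              and window_mean(t, short_minutes) > threshold]
    if not active:
        return None
    detection = active[0] - spike_start    # spike start -> first firing
    reset = active[-1] + 1 - spike_end     # spike end -> first quiet check
    return detection, reset


threshold = 14.4 * (1 - 0.999)             # 14.4x burn against a 99.9% SLO
print(simulate_alert(spike_rate=0.10, spike_minutes=30, threshold=threshold))
```

With a 10% error spike lasting 30 minutes, the long window is what delays detection, and the short window is what lets the alert reset soon after the spike ends.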
Earlier this week, Alex Ewerlöf released the Service Level Calculator via his newsletter and Substack. I've enjoyed Alex's content on reliability engineering, career growth and leadership, and organizational change. His calculator inspired me to subscribe (finally, sorry!) and re-roll my own. A long time ago, I made a far-too-basic and assumption-filled downtime-to-SLO calculator that missed the nuance and most of the point of indicators and objectives.
There's a lot I like about Alex's calculator and a few things I dislike. Splitting the calculator into SLI, SLO, and Alerting categories is great. Including costs is great and almost always overlooked. The help texts are actually helpful, and the presets are useful. At the cost of added complexity, he includes ways to change the events unit and amount, supports time-based indicators, and hides short window alerts. I'm not a huge fan of the budget consumption graph.
Above you'll find my go at a calculator. I tried to simplify this to what I've seen used and work, and I kept some of the same graphs that you can find in the SRE Workbook. Hopefully, you can connect what that workbook recommends with the interactive graphs here. I'll probably expand this tool out in the coming weeks.