Service Level Calculator

SLI

Service Level Indicators, in simple words, are the metrics that represent how the consumers of the service perceive its reliability. They are normalized to a number between 0 and 100 using this formula:

SLI = (Good / Valid) × 100
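Concretely, the formula can be sketched in a few lines of Python (the request counts below are made-up example numbers):

```python
def sli(good: float, valid: float) -> float:
    """Percentage of good events (or time slots) out of all valid ones."""
    if valid <= 0:
        raise ValueError("valid must be positive")
    return good / valid * 100

# e.g. 999,850 good requests out of 1,000,000 valid requests
print(round(sli(999_850, 1_000_000), 3))  # → 99.985
```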

Common SLIs include latency, availability, yield, durability, correctness, etc.

You can load a predefined example to the UI. You can tweak it and play around with different parameters as you wish.

SLIs can be either:

  • Time-Based: Concerned with the duration of good time in a given period. The duration is divided into time windows, and the data in each window is aggregated into a good/bad result. In a sense, a Time-Based SLI is also an Event-Based SLI where each event is an aggregation window.
  • Event-Based: Concerned with the count of good events per valid events in a given time period.

The time slot is the time window that the metric data is aggregated to calculate a good/bad time slot.

For example, probing an endpoint every 60 seconds to see if it is available assumes that the endpoint is available (or unavailable) for the entire 60 seconds.

Another example is percentiles. When calculating the 99th percentile of the latency every 5 minutes, the aggregation window is 5 x 60 = 300 seconds.
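As a sketch of that aggregation (the threshold and sample data are made up, and `statistics.quantiles` is used to approximate the percentile):

```python
import statistics

def p99(samples):
    # the 99th cut point of 100-quantiles approximates the 99th percentile
    return statistics.quantiles(samples, n=100)[98]

def good_time_slots(latencies_ms, slot_len=300, threshold_ms=200.0):
    """Split per-second latency samples into slots; a slot is good if its p99 is under the threshold."""
    slots = [latencies_ms[i:i + slot_len] for i in range(0, len(latencies_ms), slot_len)]
    return sum(1 for slot in slots if p99(slot) <= threshold_ms)

# First 5-minute slot is fast; the second has a burst of slow requests
samples = [100.0] * 300 + [100.0] * 290 + [500.0] * 10
print(good_time_slots(samples))  # → 1 (one good slot out of two)
```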

Here you can set this parameter to common values:

The unit of the event that the SLI is measuring. This is mainly used in the UI to make it easier to understand.

Here you can set this parameter to common values:

What is a good {{ sliUnit }}? What do good {{ sliUnit }} look like? What is the metric that you can measure to identify the good {{ sliUnit }} among all the valid {{ sliUnit }}?

For simplicity, "total" is sometimes used instead of "valid", but there is a difference.

While the service level indicator guides the optimization, the definition of "valid" scopes that optimization, for two reasons:

  • Focus the optimization effort
  • Clarify responsibility and control

Formula

The formula for calculating SLI for the given SLO window is the percentage of good per valid.

Depending on whether the SLI is time-based or event-based, the formula calculates the percentage of good time slots or good events:

Time-based: SLI = (Good Time Slots / All Time Slots) × 100

Event-based: SLI = (Good {{ sliUnit }} / Valid {{ sliUnit }}) × 100

Service Level Objective (SLO) is the target percentage of good {{ sliUnit }}.

Using the two sliders below you can fine tune the SLO to your needs. The first slider is for the integer part of the percentage ({{ sloInt }}). The second slider is for the fractional part of the percentage ({{ sloFrac }}).

Just be mindful of the price tag for this high service level objective! Why bother with SLI and SLO? Everyone wants the highest possible number but not everyone is willing to pay the price.

Note: this is an unusually low service level objective. Typically the service level objective is above 90%, with some rare exceptions. Please check the Error budget for the implications of your chosen SLO.

It looks like you're expecting the service level to be violated more often than met! It may be tempting to start from a low objective, but perhaps the definition of your service level indicator could be improved to narrow the focus of service level optimization.

😂 Are you joking?

 

The SLO window (also known as the compliance period) is the time period for which the SLO is calculated.

It is usually 30 days or 4 weeks.

You can play with different ranges to see how a given SLO translates to different good {{ sliUnit }} and how it impacts the error budget.

Typical compliance periods
Window    Length    Advantages
4 weeks   28 days   Restarts at the beginning of a week
A month   30 days   Maps to typical subscription services

{{ sloWindow.humanTimeSlots }}

Error budget: {{ errorBudgetPerc }}%

Error budget is one of the core ideas behind using SLI/SLOs to improve reliability. Instead of denying or forbidding errors, error budget allows the system to fail within a pre-defined limit.

The number one enemy of reliability is change, but we need change to be able to improve the system. Error budgets resolve this tension: they provide a budget of errors that the team can spend on improving the system while keeping the consumers happy enough.

Error budget is the complement of the SLO. It is the percentage of bad {{ sliUnit }} that you can have before you violate the SLO. It is calculated as 100% − SLO.

In this case:

error_budget = 100% - {{ slo.perc }}% = {{ errorBudgetPerc }}%

{{ errorBudgetWindow }}

Here you can enter the numbers for your expected load and see how many {{ sliUnit }} are allowed to fail during the SLO window while still being within the error budget.

Expected in {{ sloWindow.humanTime }}:

{{ errorBudgetBadExample }} {{ sli.unit }} are allowed to violate the {{ sli.good }} condition in {{ sloWindow.humanTime }}.
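As a back-of-the-envelope sketch of that calculation (the 99.9% SLO and 1,000,000 expected requests are made-up example values):

```python
def allowed_bad_events(slo_perc: float, expected_events: int) -> int:
    """Events that may violate the good condition while staying within the error budget."""
    error_budget_ratio = (100 - slo_perc) / 100
    return round(expected_events * error_budget_ratio)

# A 99.9% SLO leaves a 0.1% error budget
print(allowed_bad_events(99.9, 1_000_000))  # → 1000
```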

Alerting

What is the point of setting SLI/SLO if we are not going to take action when the SLO is violated?

Alerting on error budgets enables us to stay on top of the reliability of our system. When using service levels, the alert triggers on the rate of consuming the error budget.

When setting an alert, the burn rate decides how quickly the alert reacts to errors.

  • Too fast, and it will lead to false positives (alerting unnecessarily) and alert fatigue (too many alerts).
  • Too slow, and the error budget will be burned before you know it.

Burn rate is the rate at which the error budget is consumed, relative to the rate that would consume it exactly by the end of the SLO window.

A burn rate of 1x means that the error budget will be consumed during the SLO window (accepted).

A burn rate of 2x means that the error budget will be consumed in half the SLO window. This is not acceptable because at this rate, the SLO will be violated before the end of the SLO window.
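That relationship between burn rate and time-to-exhaustion can be sketched as:

```python
def budget_exhaustion(slo_window_days: float, burn_rate: float) -> float:
    """Days until the error budget is fully consumed at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn rate must be positive")
    return slo_window_days / burn_rate

print(budget_exhaustion(30, 1))  # → 30.0 (the whole window, as budgeted)
print(budget_exhaustion(30, 2))  # → 15.0 (half the window)
```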

The Google SRE Workbook goes through 6 alerting strategies and recommends:

Burn Rate   Error Budget Consumed   Long-Window   Short-Window   Action
14.4x       2%                      1 hour        5 minutes      Page
6x          5%                      6 hours       30 minutes     Page
1x          10%                     3 days        6 hours        Ticket

Note: The above values for Long-Window and Short-Window are based on a 1-month SLO window. You can see your actual values in the comments below Long-Window and Short-Window.
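The table's numbers are self-consistent: the share of the budget consumed equals burn rate × alert window ÷ SLO window. A quick check, assuming a 30-day SLO window:

```python
def budget_consumed_perc(burn_rate: float, window_hours: float,
                         slo_window_hours: float = 30 * 24) -> float:
    """Percentage of the error budget consumed during an alert window at a given burn rate."""
    return burn_rate * window_hours / slo_window_hours * 100

print(round(budget_consumed_perc(14.4, 1), 1))  # → 2.0
print(round(budget_consumed_perc(6, 6), 1))     # → 5.0
print(round(budget_consumed_perc(1, 72), 1))    # → 10.0
```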

The time it takes to exhaust the entire error budget ({{ errorBudgetWindow.humanTimeSlots }}) at this rate:

Long-Window

The long-window alert is the "normal" alert. It is called "long" to distinguish it from the "short-window" alert, which is primarily used to reduce false positives and improve the alert reset time.

We don't want to wait for the entire error budget to be consumed before alerting! It will be too late to take action.

Therefore the alert should trigger before a significant portion of the error budget is consumed.

Based on your setup, the alert will trigger after {{ alert.longWindowPerc }}% of the entire time allotted for the error budget (or SLO compliance window) has been consumed.

{{ alert.longWindowPerc }} % × {{ sloWindow.humanSec }} = {{ alertLongWindow.humanSec }}

Alert when at least {{ alert.longWindowPerc }}% of the error budget is consumed in the last:

Time to resolve before the entire error budget is exhausted:

Note: The time to resolve (TTR) is too short for a human to react. It is strongly recommended to automate the response to the alert.

Time Slots consumed:
{{ alertLongWindowConsumedTimeSlots }}

Warning: Alert Window is not usable at this burn rate ({{ alert.burnRate }}x) and time slot length ({{ sli.timeSlot }} sec) because it will lead to division by zero error.

Alert Policy

This is pseudo-code that helps you understand what triggers the alert. You need to translate it to your alerting tool.

( , {{ alertLongWindow.humanSec }} )
{{ alertLongWindowConsumedTimeSlots }} ( , {{ alertLongWindow.humanSec }} )
{{ percentToRatio(slo.perc) }}
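The tool-specific parts above are left blank, but the condition boils down to comparing the error rate over the long window against the burn-rate threshold. A sketch in Python (function and parameter names are illustrative, not taken from any particular alerting tool):

```python
def long_window_alert(bad_events: int, valid_events: int,
                      slo_ratio: float, burn_rate: float = 14.4) -> bool:
    """Trigger when the error rate over the long window exceeds burn_rate × (1 − SLO)."""
    if valid_events == 0:
        return False
    return bad_events / valid_events > burn_rate * (1 - slo_ratio)

# 99.9% SLO: the budgeted error rate is 0.1%; at 14.4x the threshold is 1.44%
print(long_window_alert(200, 10_000, 0.999))  # 2% error rate → True
print(long_window_alert(100, 10_000, 0.999))  # 1% error rate → False
```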

The Short-Window alert reduces false positives but makes the alerting setup more complex.

Short-Window

The purpose of the Short-Window alert is to make sure that the alert only triggers if the burn rate is still high.

Therefore it is always shorter than the "Long-Window" alert, hence the name.

The Short-Window alert ensures that the alert only triggers if we are still actively burning through the error budget at the same burn rate. It also improves the alert reset time (the time it takes for the alert to stop firing after the issue is resolved).

The Short-Window is usually 1/12th of the Long-Window (per Google SRE Workbook recommendation). But you can play with different dividers to see how they impact the detection time of the alert.

Your chosen Short-Window is 1/{{ alert.shortWindowDivider }} of the Long-Window. Long-Window alert triggers after consuming {{ alert.longWindowPerc }}% of the total error budget.

{{ alert.longWindowPerc }}% ÷ {{ alert.shortWindowDivider }} = {{ alertShortWindowPerc }}%
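In code, using the workbook's example of a 1-hour long window consuming 2% of the budget, and the 1/12 rule of thumb:

```python
def short_window(long_window_min: float, long_window_perc: float, divider: int = 12):
    """Short-window length and budget share derived from the long window."""
    return long_window_min / divider, long_window_perc / divider

length_min, perc = short_window(60, 2.0)
print(length_min, perc)  # → 5.0 minutes and ~0.167% of the budget
```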

Alert if at least {{ alertShortWindowPerc }}% of the error budget is consumed in the last:

Time Slots consumed:
{{ alertShortWindowConsumedTimeSlots }}

Warning: Alert Window is not usable at this burn rate ({{ alert.burnRate }}x) and time slot length ({{ sli.timeSlot }} sec) because it will lead to division by zero error.

Alert Policy

Note: The query for both long-window and short-window alerts must be combined with an AND condition.

( , {{ alertShortWindow.humanSec }} )
{{ alertShortWindowConsumedTimeSlots }} ( , {{ alertShortWindow.humanSec }} )
{{ percentToRatio(slo.perc) }}
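A sketch of the combined condition (the per-window error rates are assumed to be precomputed; names are illustrative):

```python
def multiwindow_alert(long_rate: float, short_rate: float,
                      slo_ratio: float, burn_rate: float) -> bool:
    """Page only if BOTH windows show an error rate above the burn-rate threshold."""
    threshold = burn_rate * (1 - slo_ratio)
    return long_rate > threshold and short_rate > threshold

# Ongoing incident: both windows are hot → alert fires
print(multiwindow_alert(0.02, 0.03, 0.999, 14.4))    # → True
# Issue already resolved: the short window has cooled down → alert resets quickly
print(multiwindow_alert(0.02, 0.0005, 0.999, 14.4))  # → False
```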
