Site Reliability Engineering – Reducing Toil

What is Toil?

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

SRE proposes keeping operational work below 50% of an engineer’s time, with the remaining time spent on engineering project work such as adding new service features. If toil is not reduced, this goal becomes difficult to achieve. Moreover, repetitive manual work increases TTD and TTR and raises the rate of manual errors.
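As a rough illustration of the 50% guideline, the sketch below computes each engineer's share of logged hours spent on toil and flags anyone at or above the cap. The log format, the category names, and the function names `toil_fraction` and `over_cap` are assumptions made for this example; they do not come from any particular SRE tool.

```python
from collections import defaultdict

# Hypothetical time log: (engineer, category, hours). All values are made up.
TIME_LOG = [
    ("alice", "toil", 14.0),
    ("alice", "project", 26.0),
    ("bob", "toil", 24.0),
    ("bob", "project", 16.0),
]

TOIL_CAP = 0.5  # operational work should stay below 50% of time


def toil_fraction(log):
    """Return {engineer: fraction of logged hours spent on toil}."""
    toil = defaultdict(float)
    total = defaultdict(float)
    for engineer, category, hours in log:
        total[engineer] += hours
        if category == "toil":
            toil[engineer] += hours
    return {name: toil[name] / total[name] for name in total if total[name] > 0}


def over_cap(log, cap=TOIL_CAP):
    """List engineers whose toil share is at or above the cap."""
    return sorted(name for name, frac in toil_fraction(log).items() if frac >= cap)


FLAGGED = over_cap(TIME_LOG)
```

With the sample log, only the second engineer (24 of 40 hours on toil) ends up in `FLAGGED`.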

Some examples of toil could be:

  • Manually running a script (even if the script executes some automated tasks).
  • Manual scaling based on parameters such as traffic, volume or user count.
  • Work performed repeatedly.
  • Tasks that could be completed by a machine with no human intervention required.
  • Actions that do not permanently improve the service.

The steps to measure and reduce toil are available in the implementation section of the document.

Steps to reduce toil

  1. Identify and Measure Toil
    Follow a data-driven approach to identify and evaluate sources of toil, make objective remedial decisions, and quantify the time saved by toil reduction projects.
  2. Cost-Benefit Analysis
    It’s important to analyze return on investment, i.e. cost versus benefit, to confirm that the time saved by reducing toil will exceed the time invested in developing and maintaining the automated solution.
  3. Assess Risk
    Automation can save a lot of human effort but may cause side effects under the wrong circumstances, such as an automation script failure or a system restart. It’s important to carefully assess those risks before implementing automation.
  4. Reject the Toil
    If the cost of responding to toil, or the effort required to reduce it, is more than the business outcome from the task, then the toil-intensive task should be rejected.
  5. SLO-Driven Toil Reduction
    If ignoring toil doesn’t consume or exceed the service’s error budget, then it would be better to spend the engineering effort on other productive activities.
  6. Reduce Toil
    The toil can be reduced or eliminated at two levels:

    • At Source: Identify the source where the toil is generated and determine whether it’s feasible to change the system to reduce or eliminate the toil.
    • During Operations: Automating the tasks that are performed manually by the operations teams.
  7. Use Feedback to Improve
    Feedback is an important part of improving any process. Seek constructive feedback from the people who interact with the automation tools, scripts, documents, etc., and optimize based on it.
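Steps 2 and 5 above both come down to simple arithmetic, which the following minimal sketch tries to make concrete. The function names, parameters, and sample numbers are hypothetical, and the error-budget calculation assumes a request-based SLO (a target fraction of successful requests), which is only one of several ways an SLO can be defined.

```python
def break_even_weeks(build_hours, upkeep_hours_per_week, toil_hours_per_week):
    """Weeks until automation has repaid its build cost, or None if it never does."""
    net_saving = toil_hours_per_week - upkeep_hours_per_week
    if net_saving <= 0:
        return None  # upkeep costs at least as much as the toil it removes
    return build_hours / net_saving


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# Example: 40 hours to build, 1 hour/week of upkeep, replacing 5 hours/week of toil.
weeks = break_even_weeks(build_hours=40, upkeep_hours_per_week=1, toil_hours_per_week=5)

# Example: 99.9% SLO, one million requests, 250 failures so far.
budget = error_budget_remaining(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
```

In the first example the automation pays for itself after 10 weeks. In the second, roughly three quarters of the error budget is still unspent, which under step 5 would suggest the toil can be tolerated for now.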

Approaches to reduce toil

  • Baby Steps
    For a complex system, implement partial automation and move incrementally toward full automation. With this approach, engineers may still handle some of the resulting operations until the process is completely automated.
  • Self-Service
    Wherever possible, give users an option to resolve the most common issues without needing to contact the operations team.
  • Automate Toil Response
    Once the process is thoroughly documented, try to break down the manual work into components that can be implemented separately and used to create a software library that other automation projects can reuse later.
  • Use Open Source and Third-Party Tools
    Do not re-invent the wheel. It’s not necessary to develop everything from scratch. Look for opportunities to use or extend third-party or open source libraries to reduce development costs.
  • Feature Development
    Feature development focused on improving reliability, performance, or utilization often reduces toil as a second-order effect.
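The incremental and component-based approaches above can be sketched as a runbook made of small, separately implemented steps, where any step that is not yet automated is handed back to an engineer. This is only one possible shape for such a library; the `Step` class, the `run_runbook` function, and the step names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    # None means the step has not been automated yet and stays manual.
    action: Optional[Callable[[], None]] = None


def run_runbook(steps):
    """Run automated steps in order and stop at the first manual step.

    Returns (completed, pending): the names of the steps that ran, and the
    names of the steps left for an engineer, starting with the manual one.
    """
    completed = []
    for index, step in enumerate(steps):
        if step.action is None:
            return completed, [s.name for s in steps[index:]]
        step.action()
        completed.append(step.name)
    return completed, []


# Hypothetical runbook: three steps are automated, one is still manual.
events = []
runbook = [
    Step("drain traffic", lambda: events.append("drained")),
    Step("restart service", lambda: events.append("restarted")),
    Step("verify dashboards"),  # still done by hand
    Step("restore traffic", lambda: events.append("restored")),
]

completed, pending = run_runbook(runbook)
```

Here the first two steps run automatically and the remaining two are returned as pending. Automating "verify dashboards" later only means supplying an `action` for that step, and each automated step can be reused by other runbooks.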
