Site Reliability Engineering

Programming is neither maths nor language but a process. A core part of that process is using code to solve problems. How do we exploit this to turn the mundane into the challenging?

First, remove the mundane...

Toil

Toil is artificial difficulty: it's caused by external factors, like a lack of reliability or automation, or mistakes made earlier in the development cycle. Technical debt was incurred but never repaid, and now that cut corner is back, with interest, and it will get its money back.

Tech debt itself is not toil: debt-heavy trade-offs can be worthwhile in context, whereas low levels of toil are unavoidable even in well-designed systems. Unchecked, toil scales quietly with growth, and the damage only becomes truly visible when it's probably too late for a cost-effective solution.

This isn't limited to development. Any time a system requires manual intervention for what should be automatic, that's toil, and it's team agnostic: from IT resetting forgotten passwords to deploying distributed systems.

Toil in situ

Despite initial preconceptions, complicated, arduous or challenging tasks are commonly toil neutral.

Toil, in the development context, usually results from arbitrary roadblocks that are primarily environmental - replacing problem solving and creativity with manual tedium and the grind.

Fixing a race condition can be a nice challenge.

Fixing a race condition when you have...

No reproducible environment;
No instrumentation or logs;
No observability.

...is pure toil.
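Even a little instrumentation changes that picture. As a minimal sketch, assuming a hypothetical EventLog helper (the name and interface are made up for illustration, not taken from any logging library), a thread-safe log that stamps each entry with a sequence number and the recording thread's id might look like this:

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One recorded step: what happened, in which order, on which thread.
struct Event {
    std::size_t sequence;
    std::thread::id thread;
    std::string message;
};

// Hypothetical helper for illustration; not part of any real logging library.
class EventLog {
public:
    // Appends an event. Holding the mutex gives the sequence numbers a single
    // total order across threads, so an interleaving can be inspected later.
    void Record(std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back({events_.size(), std::this_thread::get_id(), std::move(message)});
    }

    // Returns a copy so callers can read the log without holding the lock.
    std::vector<Event> Snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return events_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<Event> events_;
};
```

Calling Record around each access to shared state in the suspect code yields an ordered trace to reason about, rather than a guess at what the threads did.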

You can roll the code boulder up toil hill but tomorrow it's the same fundamental issues rolling it back down again. This is the essence of dev toil - arbitrary hoops to jump through before you can generate impact.

toil.S

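section .text
    global _start
_start:
    jmp _same_hoops_as_yesterday

_same_hoops_as_yesterday:
    push boulder        ; address of the message...
    pop ecx             ; ...ends up in ecx (buffer argument)
    mov edx,len         ; number of bytes to write
    mov ebx,1           ; file descriptor 1 (stdout)
    mov eax,4           ; sys_write on 32-bit Linux
    int 0x80
    mov ebx,0           ; exit status 0
    mov eax,1           ; sys_exit on 32-bit Linux
    int 0x80

section .data
boulder db  'toil', 0xa
len equ $ - boulder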

Naive implementation of a toil generator in x86 Assembly

Toil Driven Development

Despite its many benefits during the development process, Test Driven Development is not a good means of reducing toil and, if implemented poorly, can even mutate into Toil Driven Development.

By default a test case isn't considered artificial difficulty; a pass or fail is simply insight:

constexpr int AddTwo(int x, int y) { return x + y; }

int main() {
    static_assert(AddTwo(1, 1) == 2); // doesn't generate toil
}

Tests can help catch some edge cases, or logic errors, but high code coverage does little for overall toil reduction. For a process to truly qualify as 'toil' it needs to be redundant: like forcing a dev to deploy and manage their own test environment instead of using a pre-existing CI/CD pipeline.

Removing Toil

Sounds inevitable? It isn't. Embrace site reliability engineering.

In the modern cloud of buzzwords "SRE" is one of the more salient and enduring rebranding efforts of ancient principles. It may go by a different name but you'll likely be familiar with the philosophy.

Many approaches exist, but a solid first step on this road is to instill an infrastructure-as-code (IaC) DevOps culture:

Part of the momentum behind infrastructure as code is generating reproducible environments through imperative or declarative scripting. This is not just about automation: stable, consistent environments form the foundation of toil-free development.
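To make the declarative idea concrete, here is a minimal sketch under some loud assumptions: an environment is modelled as nothing more than a map of package names to versions, and Environment, Action, Plan and Apply are hypothetical names invented for this example. Real IaC tools are far richer, but the core loop is similar: describe the desired state, diff it against what actually exists, and apply only the difference.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model: an environment is just package name -> version.
using Environment = std::map<std::string, std::string>;

enum class Kind { Install, Upgrade, Remove };

struct Action {
    Kind kind;
    std::string name;
    std::string version;  // left empty for Remove
};

// Diffs the desired state against the actual one and returns the steps
// needed to converge. Nothing is changed here; this is only the plan.
std::vector<Action> Plan(const Environment& desired, const Environment& actual) {
    std::vector<Action> actions;
    for (const auto& [name, version] : desired) {
        const auto found = actual.find(name);
        if (found == actual.end()) {
            actions.push_back({Kind::Install, name, version});
        } else if (found->second != version) {
            actions.push_back({Kind::Upgrade, name, version});
        }
    }
    for (const auto& entry : actual) {
        if (desired.count(entry.first) == 0) {
            actions.push_back({Kind::Remove, entry.first, ""});
        }
    }
    return actions;
}

// Applies a plan to a copy of the actual state and returns the result.
Environment Apply(Environment actual, const std::vector<Action>& actions) {
    for (const auto& action : actions) {
        if (action.kind == Kind::Remove) {
            actual.erase(action.name);
        } else {
            actual[action.name] = action.version;
        }
    }
    return actual;
}
```

The useful property is idempotence: once an environment has converged, planning against it again produces no actions, so the same description can be run on every machine, every day, without anyone having to remember what was done by hand.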

Every unnecessary manual operation removed is one piece of toil rolled back and more time spent solving real issues that motivate and matter to the project: enabling your developers to develop.

Excellent further reading on the SRE mindset at scale can be found at https://sre.google/