IT October 18, 2016 3 min read

On-call rotation is a management problem, not an IT problem

Unstructured on-call duties burn out engineers and leave incidents without clear ownership. The fix is not a tool - it is a set of decisions that only management can make.

Most conversations about on-call start with the alert tool: PagerDuty, OpsGenie, or whatever is in use. The team signs up, rotations get configured, and the assumption is that the on-call problem is now handled. In practice the tool is the least important part of the equation.

When I look at on-call culture in a company, three gaps come up most often: an unclear scope of what one person is expected to handle alone at 2am, no agreement on what counts as an alert-worthy event versus noise, and no management decision about compensation, whether monetary, time off, or otherwise.

Those are not tool settings. They are management decisions that have been left unmade.

What on-call actually requires from the organisation

Being on-call means being available to respond to production problems outside working hours. For that to be sustainable, several things need to be true simultaneously:

The person on call must have the access, permissions, and runbooks to actually resolve common issues. If they have to wake someone else up to get a deploy key or find the relevant documentation, the rotation is fiction.

The scope must be defined. "Everything" is not a scope. If the on-call engineer is expected to handle only the services their team owns, that is manageable. If they are also expected to diagnose third-party integration failures at midnight, that is a different job.

The escalation path must be agreed in advance and must involve a manager. When an engineer cannot resolve something alone, who gets called? That chain cannot be improvised during an incident.
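
As a rough illustration, an agreed escalation path can be written down as data and checked before anyone needs it. This is a minimal sketch, not a recommendation of any particular tool: the `Contact` type, the `is_manager` flag, and both function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Contact:
    # Hypothetical record: a named person with a phone number, not a job title.
    name: str
    phone: str
    is_manager: bool = False


def chain_problems(chain: List[Contact]) -> List[str]:
    """Return a list of reasons why the chain is not usable as written."""
    problems = []
    if not chain:
        problems.append("chain is empty")
    for contact in chain:
        if not contact.name.strip() or not contact.phone.strip():
            problems.append(f"incomplete entry: {contact!r}")
    # Assumption taken from the text: the path must involve a manager.
    if not any(contact.is_manager for contact in chain):
        problems.append("no manager in the chain")
    return problems


def next_contact(chain: List[Contact], attempts_made: int) -> Optional[Contact]:
    """Return who to call after `attempts_made` unanswered calls (0 or more)."""
    if 0 <= attempts_made < len(chain):
        return chain[attempts_made]
    return None  # chain exhausted: nobody left to call
```

The point of the check is that gaps in the chain are found during working hours, by reading a list, rather than at night by calling it.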

The compensation question

On-call is extra work. In some countries it is regulated. Everywhere, it is real effort that takes a toll on people's rest and personal lives.

I have seen teams where engineers silently accept nights and weekends as part of the job, without any acknowledgement or compensation. This works until it doesn't. When it stops working, the result is resignations that surprise a management team that "didn't see it coming."

The decision about how to compensate on-call hours - extra pay, time off in lieu, a larger rotation pool so each person is on call less often - is a management decision. The tool cannot make it.

What a minimal working on-call system looks like

From what I have seen work in practice:

A written definition of what the on-call engineer is responsible for and what is explicitly out of scope.
A runbook for each alert that fires frequently - not documentation of how the system works, but a step-by-step of what to check and do at 2am.
An escalation list with names and phone numbers, not job titles.
A post-incident practice: after any alert that woke someone up, a brief write-up of what happened and whether the alert was actually necessary.
A regular review of alert volume. If the on-call engineer is being paged more than two or three times per night shift, the system is not healthy.
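
The alert volume review in the last item can start as a very small script. The sketch below assumes page records can be exported as `(shift_date, engineer)` pairs, one pair per page; that export format, the function name `noisy_shifts`, and the default threshold of 3 are assumptions made for illustration, with the threshold taken from the "two or three times" rule of thumb above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Shift = Tuple[str, str]  # (shift_date, engineer), a hypothetical export format


def noisy_shifts(pages: Iterable[Shift], max_pages_per_shift: int = 3) -> Dict[Shift, int]:
    """Return the shifts where one engineer was paged more than the threshold."""
    counts = Counter(pages)  # one entry per page, so the count is pages per shift
    return {shift: n for shift, n in counts.items() if n > max_pages_per_shift}


# Made-up example data: four pages on one night, two on another.
example_pages = [("2016-10-17", "alice")] * 4 + [("2016-10-18", "bob")] * 2
flagged = noisy_shifts(example_pages)
```

Here `flagged` contains only the shift with four pages. The output is a starting point for the review conversation, not a verdict; which threshold is acceptable remains a decision for the team and its manager.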

Where this starts

On-call culture starts with a conversation between an engineering manager and the team - not a tool rollout. The question is: what are we committing to in terms of availability, what are we giving in return, and what are the explicit limits?

If that conversation has never happened, no alert routing configuration will substitute for it.

