On-call is a skill nobody teaches
Debugging under pressure, running a postmortem that actually compounds, and what years of incidents taught me about the craft of being on-call.
It’s 2:47 AM on a Saturday. My phone lights up on the nightstand: CRITICAL – checkout_service P0 – payment success rate < 40%. I’m already sitting up before I know I’m awake. I tap the Slack link. The chart looks like a cliff edge at 2:31 AM. Sixteen minutes of customers failing to complete purchases.
I open the laptop. The terminal is still open from Friday. I type kubectl logs -n payments deployment/checkout-service --since=20m and the output floods the screen. One line jumps out immediately, repeating every few seconds: connection pool exhausted: max_connections=50 active=50 waiting=214.
I know what this is. I think I know what this is.
I call my colleague Dian.
Why nobody actually teaches this
Computer science programs teach you to write code that works. They give you algorithms, data structures, system design. If you’re lucky, you get a course on distributed systems where you read about the CAP theorem and Lamport clocks.
Nobody teaches you to diagnose code that stopped working at 2 AM with a CTO on a group call and a payments chart still falling.
On-call is a craft. And like most crafts, the only way to learn it is to do it badly first, then less badly, then with some kind of grace. The Google SRE book — particularly the chapters on “Being On-Call” and “Postmortem Culture” — is the closest thing the industry has to a curriculum. But even that doesn’t tell you about the specific panic of staring at a terminal when someone’s revenue is on the line.
The panic is the first thing you have to learn to work inside. Everything else is downstream of that.
What actually compounds
After enough incidents you start to notice that some skills pay off every single time, and some skills are just speed — local maxima that don’t transfer.
Reading a system under load. The first thing you learn is where to look. Not what to fix — where to look first. For me it’s always the same sequence: metrics dashboard to find the shape of the problem, then logs to find the first error, then resource utilization to understand the constraint. Not configuration files. Not code. Metrics, logs, resources — in that order — before I form any hypothesis at all. This ritual matters because panic wants to make you skip steps. It wants you to reach immediately for the change that seems obvious. The ritual slows you down enough to see the actual problem.
That Saturday it was the connection pool. But I went through two wrong hypotheses first. I restarted the deployment: no change. I checked the database slow query log: clean. I wasted nine minutes on those. The pool exhaustion was in the logs from the beginning, and I had seen it. I just didn’t follow it first.
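The log line quoted earlier already carries the numbers that matter. Here is a minimal sketch of pulling them out, assuming the logs arrive as plain text lines in exactly that format. The names `PoolStats` and `find_pool_exhaustion` are hypothetical helpers for illustration, not part of kubectl or any tool mentioned here.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches the line quoted in the opening, e.g.
# "connection pool exhausted: max_connections=50 active=50 waiting=214"
POOL_LINE = re.compile(
    r"connection pool exhausted: "
    r"max_connections=(?P<max>\d+) "
    r"active=(?P<active>\d+) "
    r"waiting=(?P<waiting>\d+)"
)


class PoolStats(NamedTuple):
    max_connections: int
    active: int
    waiting: int

    @property
    def saturated(self) -> bool:
        # Every connection is in use, so new requests have to queue.
        return self.active >= self.max_connections


def find_pool_exhaustion(lines: Iterable[str]) -> Optional[PoolStats]:
    """Return stats from the first pool-exhaustion line, or None if there is none."""
    for line in lines:
        match = POOL_LINE.search(line)
        if match:
            return PoolStats(
                int(match["max"]), int(match["active"]), int(match["waiting"])
            )
    return None
```

Run over that night's log output, a check like this would have reported a full pool with 214 requests waiting, which is a stronger lead than either hypothesis I tested first.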
Narrating out loud. Dian taught me this. When she’s on an incident bridge, she talks through every step. “I’m looking at the ingress metrics. Latency is normal. Moving to the service layer.” It sounds strange the first time you hear it. It feels like it slows things down.
It doesn’t slow things down. It lowers the temperature in the room. When someone is narrating, the other people on the call can follow the reasoning, catch mistakes, and rule out hypotheses in parallel. More importantly: it tells everyone that someone has a handle on the situation. Silence on a crisis call is the most dangerous thing. The silence is where the assumptions pile up.
I’ve used this move ever since. I am not a naturally calm person under pressure. But I’ve found that if I keep talking — slowly, methodically, out loud — the calm follows the narration rather than preceding it.
Postmortems as a genuine learning tool. I’ve been in plenty of postmortems that were performance theater. Everyone is careful. Nobody names names. The “contributing factors” section is so sanitized it could describe any outage at any company. You file the document, send it to the distribution list, and the same class of incident happens four months later.
The version that actually compounds looks different. It starts from John Allspaw’s foundational framing on blameless postmortems — the principle that people took actions that made sense to them given what they knew at the time, and the investigation should be about understanding that context, not assigning fault. But it goes further than blameless. The postmortems that changed how I think were honest about confusion. Not just “operator error” but “here’s the point where I genuinely didn’t know what was happening, here’s why the system made that hard to see, and here’s what would have helped.”
The checkout incident postmortem had a section titled “Things that looked like the problem but weren’t.” I wrote two paragraphs about the nine minutes I spent chasing the wrong hypotheses. That section got referenced more times in subsequent incidents than the contributing factors section did.
Runbooks that admit what the author didn’t know. A runbook written by someone who fully understood the system at incident time is often useless to someone who doesn’t. The most useful runbooks I’ve found — and the ones I try to write — include the things the author was uncertain about. “I’m not sure if step 4 is always necessary — try skipping it first and check if error count drops.” “The log pattern you’re looking for is pool exhausted but it may say connection refused depending on which service is calling.” Uncertainty, made explicit, is a form of documentation.
The thing that actually separates seniors from everyone else
I used to think it was speed. Senior engineers resolve incidents faster. That’s true, but it’s a consequence, not the cause.
It’s not memorized commands either. You can Google commands. You can’t Google judgment.
The thing I’ve watched in the engineers I’ve learned the most from, and the thing Allspaw describes better than anyone in his writing on senior engineering, is judgment about which hypothesis to test first. Incidents are hypothesis testing under time pressure. Every check you run costs time. The senior move isn’t knowing all the answers. It’s knowing which question to ask next, and why, and being willing to say “I’m ruling this out, here’s why” before all the data is in.
This is learnable. It comes from postmortems that you actually read, from incidents you stayed in when you wanted to leave, from watching how someone else built a hypothesis tree when the alerts made no sense. It is not a personality trait. It is a skill you build by being in the room.
Charity Majors has been writing about the observability side of this for years at charity.wtf — the argument that you can’t build this judgment without instrumentation that tells you what the system was doing, not just whether it was up. I’d add: the instrumentation is necessary but not sufficient. You also need the postmortem habit that turns each incident into a data point.
What I’d tell a junior engineer starting their first rotation
Stop trying to look calm. You won’t be calm. That’s fine.
The goal is not to feel like you know what you’re doing. The goal is to have a process that works even when you don’t know what you’re doing. Write it down before you’re paged. “When I get an alert I don’t recognize, I will: check the runbook, look at the relevant dashboard, check recent deploys, then ask for help.” Make the escalation step explicit and early. The engineers who improve fastest at on-call are not the ones who solved incidents alone. They’re the ones who escalated clearly and then stayed on the call to watch what happened next.
Read the Google SRE book before you start, or at least the on-call and postmortem chapters. It’s free online. It will not prevent your first bad incident. But afterward it will give you a vocabulary for what happened, and vocabulary is how you turn experience into skill.
After an incident, write down what you tried. Not what worked. Everything you tried, in order, including the wrong hypotheses. That document is worth more than any tutorial.
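One way to make that habit concrete is to record each attempt as structured data while the incident is still fresh. The sketch below is an assumption about what such a log could look like, not a description of any particular tool. `Attempt`, `IncidentLog`, and `dead_ends` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Attempt:
    at: datetime        # when the check was run
    hypothesis: str     # what you believed was wrong
    action: str         # what you did to test it
    result: str         # what you observed
    ruled_out: bool     # True if the hypothesis turned out to be wrong


@dataclass
class IncidentLog:
    started: datetime
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, at: datetime, hypothesis: str, action: str,
               result: str, ruled_out: bool) -> None:
        """Append one attempt; call order is the order things were tried."""
        self.attempts.append(Attempt(at, hypothesis, action, result, ruled_out))

    def dead_ends(self) -> List[str]:
        """List the ruled-out hypotheses, in order, with minutes since the start."""
        lines = []
        for attempt in self.attempts:
            if attempt.ruled_out:
                minutes = int((attempt.at - self.started).total_seconds() // 60)
                lines.append(
                    f"+{minutes}m {attempt.hypothesis}: "
                    f"{attempt.action} -> {attempt.result}"
                )
        return lines
```

The output of `dead_ends()` is the raw material for a postmortem section like “Things that looked like the problem but weren’t”: the wrong turns, in order, with what each one cost in time.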
The quiet after
It’s 4:12 AM. The connection pool limit is raised to 200. A database proxy is queued for next sprint. The checkout success rate is back at 99.3%. I’ve sent the incident summary to the on-call channel and drafted the postmortem skeleton. Dian is still on Slack.
“Good job staying calm,” she types.
I think about telling her I wasn’t calm. That I spent nine minutes on the wrong hypotheses and felt the particular cold sweat of realizing I was wasting time. That I kept talking out loud mostly because she’d taught me to and I needed the structure.
Instead I type: “You too. I’ll write up the nine-minute detour in the postmortem — it’s useful.”
The chart is climbing back. The cursor blinks.
Next time I’ll look at the connection pool first.