I’m not exaggerating here. If you are a software developer, the worst day of your life – BY FAR – is when you have a production outage for your application.
It dawned on me today that some people may have never experienced this before. I’ve experienced this a couple times as a developer, and have been involved many, many times on the troubleshooting side.
Let me tell you if you are not familiar – this will be the worst day(s) of your professional life!
I was involved in the recovery of a production outage yesterday and today – so I thought I’d share my thoughts and experience.
How it starts:
The worst kind of problem is one that you didn’t catch earlier. You didn’t catch it:
- While coding.
- In dev.
- In QA.
- During testing.
- After prod checkout.
That means that this is going to have the biggest impact.
The earlier you catch a bug, the easier, cheaper, and faster it is to fix.
This is really bad though. Even non-developers know that if a bug made it all the way to production, there is something systemically wrong with your testing or your processes! Worse, people can and will now openly talk about your failure of process! Not to be mean, but because it failed, and they are now involved.
So – you get the call that your component is spitting out an error in production that people didn’t notice before. That means your manager will be directly involved and constantly asking you questions.
If you work in-office, expect your manager over your shoulder every few minutes. If you work at home, expect constant IM’s, phone calls, and texts.
Now, if you are lucky, you know what the problem is. If you do, that is still horrible news. That means you need to develop a fix IMMEDIATELY, swallow your pride and admit you messed up, and make arrangements to push the change out so the business can test it. At this point, you are fighting time. People’s productivity is down and they are getting more agitated by the minute.
Writing a fix is time-consuming – you pretty much need to stop/cancel everything else you were doing. It’s embarrassing, but manageable – if you are super-lucky.
I define super-lucky as: you immediately know what the problem is, and you can produce a fix, test it, and get it into production by the next half-day. Meaning, by lunch (if it’s morning) or by end of day (if it’s afternoon).
If this is you, you have totally dodged a bullet. On your way home, go buy a lottery ticket! Oh, and also start writing better code!! 🙂
How it snowballs:
In the other 98% of incidents, it’s not that easy:
- You don’t know what the core problem is yet.
- It’s some other, bigger environmental problem.
- You can fix it, but it’s going to take time.
After many minutes of you telling your manager “I’m looking into it right now” – there will be a breaking point; a point of no return. At some point, the business or your manager is going to snap – they will DEMAND a fix.
At my company, they open up a bridged phone line with “mission control” and immediately establish an Incident Response Team (IRT). This team is often multi-disciplinary, with people from infrastructure, security, networking, DBA, MQ, change management, your senior management, the business, etc. I’ve been at a few companies that all had a similar procedure – even if they just call it something else.
At this point, you have officially entered a World of Hurt™
Now, you have a conference call of people who know nothing about your application, asking you an endless barrage of questions. Sprinkled in there will be daggers of “Who put this in production? Wasn’t this tested?” and “Sheesh, how did this even make it to production?”. You’ll keep getting repeat questions of “What version is the server running on again?” and “What service pack was it?” – and no one writes down those answers apparently, as you will get them over and over!
So you, the developer, will be absolutely and fully consumed by this process. You will have to answer these endless questions as more people come on the call, asking when it’s going to be fixed – all while YOU are supposed to be writing the fix. You of course can’t do that because you are busy answering questions. This can easily last for HOURS, it’s exhausting, and non-stop!
By the way, don’t even hint that you are getting annoyed at the questions because everyone is on this call because of you – don’t forget!!
No light at the end of the tunnel:
In many cases, and today was an example of that – the developer was on the phone from 7am until at least 5pm, when I got off. In that time, there was never a lunch break and from what I can tell – the developer never even took a bathroom break! There is no dabbling on this day – literally every second you are on this call, you are either talking, thinking, or writing – for the ENTIRE time.
If this is a business-critical outage, I’ve been on calls where this will go for 12-14 hours, then someone will call a 30-minute break, then you are back on for another 8-10 hours with no breaks.
This isn’t just some fluffy conference call though. YOU, the developer, are the star of the show and need to be “on” this ENTIRE time.
Being the developer in this situation is nothing short of exhausting and extremely stressful. Not only are you getting agitated, frustrated, and flustered – the other people on the call start getting annoyed too and you can’t lose your cool. You can’t really say anything because the whole reason everyone is here is because YOU messed up!
The (Possible) Reprieve:
In many cases, you usually end up involving a vendor – like IBM or Microsoft. If that is the case – then you get a small reprieve. Often the vendor needs to do research and while you are waiting, people drop off the call and you can take a break. Another benefit is that you and everyone else gets to blame the vendor. That’s one of the things you get when you buy a support contract, the vendor gets to be your punching bag!
However, if this outage is because you wrote bad code, you will get no reprieve and the misery will continue.
The Final Stretch:
Just to recap, it will likely have been 8+ hours of you writing code, copying files, responding to IM’s, answering questions on the conference call, sending Vicki that file, checking your e-mail for that .zip from Steven, answering more questions, whoops! back to writing code, etc, etc. I mean, full-blast for the whole day!
Depending on the problem though, if you get a bunch of smart people together, you’ll eventually figure out the problem and a solution within a day.
At some point, you’ll have a solution and you just need to implement it and test it. You will be so exasperated at this point, you will be in auto-pilot mode. You will have tunnel-vision and will think of nothing else except getting this fix in. When the fix is in, this nightmare is over. Until it’s in, you have to stay on this call answering questions and working on fixing the mistake in front of a bunch of people who can’t go home to their families because of that coding mistake!!
Finally – you put in the fix, the business verifies it – and the call disbands. It’s over.
Immediately, you will likely need to just sit down and stare at the wall for a few minutes because your brain will be buzzing. It’s very rare in life that we need to be “switched on” for that long with no breaks. So, you will soak it all in and start to process what happened.
If you are smart, you’ll then send some e-mails and recap what happened, why it happened, and what you did to make sure it doesn’t happen again.
If this outage was because of a process or testing failure, expect many meetings on that. Also, understand that this puts a mark on your reputation. Granted, everyone has production bugs, but it never looks great for you when you have one.
Overall – this outage was bad for everyone: the business, your management, the reputation of your dev group, and your own reputation – everyone lost.
What can be learned from this?
If you’ve never experienced this before, hopefully this scares you. What I outline above is not some extreme case – this is how MOST production outages go. Some are better, more are worse.
Many people poke fun at me for being too much of a perfectionist with coding. “Why care so much?” and “Who cares, why do we need that ‘else’ statement?” or “I don’t need a ‘catch’ block here” they say.
I care about these things because I’ve had production outages because of face-palm, stupid mistakes. “If I had just checked for that null…” type of regrets.
You really only need to live through this event ONCE as the primary developer to become SUPER paranoid about your code.
Many years ago I was in a car accident where my vehicle was “t-boned” at an intersection. My light turned green, I drove and BAM! A car hit us on the passenger side from the cross-street. Ever since that accident, I can’t go through an intersection after the light turns green without looking both ways. This is very similar. After going through a 12-18 hour “ordeal”, is what I’ll call it – it will (and should) change you.
Instead of writing working code and then adding code for error conditions, you will write code for the error conditions – and the LAST thing you will do is write the happy path!
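To make that concrete, here is a minimal sketch of the “error conditions first” style in Python. The function, its names, and the discount table are all hypothetical – the point is only the ordering: every guard clause is written before the happy path, and the happy path ends up being one trivially safe line.

```python
# Hypothetical example of "error conditions first" coding style.
# All names and values here are made up for illustration.

KNOWN_DISCOUNTS = {"SAVE10": 0.10, "SAVE25": 0.25}  # assumed discount table

def apply_discount(order, discount_code):
    """Return the discounted total for an order, validating everything first."""
    # The error conditions are written FIRST, as guard clauses:
    if order is None:
        raise ValueError("order must not be None")
    if not isinstance(order.get("total"), (int, float)):
        raise ValueError("order['total'] must be a number")
    if order["total"] < 0:
        raise ValueError("order total cannot be negative")
    if discount_code is None or discount_code.strip() == "":
        # No code supplied: not an error, but handled explicitly.
        return order["total"]
    if discount_code not in KNOWN_DISCOUNTS:
        raise KeyError(f"unknown discount code: {discount_code!r}")

    # The happy path is written LAST – and by now it is trivially safe:
    return order["total"] * (1 - KNOWN_DISCOUNTS[discount_code])
```

Notice that by the time you reach the final line, there is nothing left that can surprise you – which is exactly the mindset a bad outage leaves you with.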
How can you avoid this / never have to go through this?
Well first, going through it at least once will do WONDERS for you, professionally. You will either quit being a developer, or you will become a remarkably better developer. With that said though, what can you do? In no particular order:
- Become a defensive programmer.
- Have intention-revealing names.
- Verify every single argument.
- Catch every exception (appropriately).
- Add an else for every if.
- Regularly do code reviews / regularly have others review your code.
- Pair-program with another developer.
- Have LOTS of unit tests and >70% code coverage for ANY code that goes to production.
- Add detailed logging so you can quickly find a root-cause.
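Several of the items above can be shown in one small sketch. This is a hypothetical Python function (the service name, the in-memory “database,” and the record shapes are all invented for illustration) that demonstrates intention-revealing names, verifying every argument, catching a specific exception appropriately, an else for every if, and logging enough context to find a root cause quickly:

```python
# Hypothetical sketch combining several defensive-programming habits.
import logging

logger = logging.getLogger("order_service")  # assumed service name

def load_customer_record(customer_id, database):
    """Fetch a customer record defensively, logging every failure path."""
    # Verify every single argument before doing anything with it.
    if customer_id is None:
        logger.error("load_customer_record called with customer_id=None")
        raise ValueError("customer_id is required")
    if database is None:
        logger.error("load_customer_record called with no database")
        raise ValueError("database is required")

    try:
        record = database[customer_id]
    except KeyError:
        # Catch the SPECIFIC exception, and log enough context to debug later.
        logger.warning("customer %r not found in database", customer_id)
        return None

    # An else for every if: handle the "impossible" empty record too.
    if record:
        logger.debug("loaded customer %r", customer_id)
        return record
    else:
        logger.warning("customer %r exists but record is empty", customer_id)
        return None
```

When something goes wrong at 2am, it is those `logger.warning` lines with the actual `customer_id` in them that turn a 12-hour bridge call into a 20-minute fix.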
In short, take a production release very seriously. What can you do to this code, while it is still in your care, to make sure that there is no chance of a production bug showing up?
OK, but what if I find myself in this situation?
Great question. Nobody is perfect, and there is a chance you will end up in the hot seat. If you are, here is what I recommend: Don’t let the above happen to you. MANAGE the problem!
- Have a knowledgeable developer on the call to answer everyone’s questions while you work. A “wingman” of sorts.
- Over-communicate, and constantly summarize. See this post for more ideas on this topic.
- Control the flow of information. Your manager and the business will angrily come to you for status – give them constant updates, written at a “manager” reading level!
- Over-communicate as much as you can!
- Did I mention over-communicate?
If you have a cohort who can help you manage the outage, that would be best. Have one person doing the actual work – and the other who fields questions and who is constantly feeding updates and summarizing the situation.
In all of the incidents I’ve been a part of, that’s about the best you can do. That day is still going to suck bad – but that will be a bit more manageable. If you are the SINGLE person doing the fix AND communicating, it’s going to be a brutal day.
Consider even coming up with an “emergency response” plan for your dev team now, for how you would respond to an outage. Maybe assign different people to different roles so that when this happens, you are ready to manage the problem.
In the case at work today, the root-cause was an unforeseen environmental problem – but testing should have caught it. So even though this wasn’t even a coding problem, today was a miserable day for those developers. So, I would be equally paranoid about testing – because lack of testing or lack of a good back-out plan can equally land you in this same spot.
Back to my car accident analogy – while in the car, if someone sees me look both ways while going through an intersection and jokes “Paranoid, much?” – I laugh and say “I see you’ve never had a car smash into you while going through an intersection before!”. Likewise, when someone jokes about my code being overly-paranoid and overly-cautious – I laugh and say “I see you’ve never lived through a major production outage before!”, because it will change how you code.