IT crisis: When the fire truck rolls
Who wants an IT crisis on their hands? No one! But once one happens, how should you respond? Here are some lessons reiterated by a recent crisis response.
I was recently asked to help fix a high-visibility Web site that was performing poorly. I'd like to share some of the lessons--not learned, but reinforced--by the experience.
Fix the problem, not the blame. "We're in trouble! Help!" calls are fraught with embarrassment. Who, after all, wants to admit they have a problem? Many feel they "should" be able to "handle it ourselves," without sending up an emergency flare or asking for assistance. This latest was a straightforward "we can't get decent performance, and it's getting critical!" situation. That shouldn't be too embarrassing. It doesn't begin to compare with the doozies, like a large software company that called me concerned that it didn't know how to keep its flagship software service/SaaS app from becoming the next poster child for downtime. No problem, right? We have two whole weeks before the service goes live. Or when the IT staff of a large computer company asked us for help understanding and optimizing the performance of its internal ERP deployment, while at the same time the company's sales people eagerly touted its ability to help customers avoid ERP hurdles, and its professional services arm offered several packages to this end.
When you're a troubleshooter, irony will find you. But we don't linger to be amused. We focus on the problem. So get over your own ego and call in whoever can help you--internal staff, external resources, equipment providers, service providers, or a combination--to work on resolving the issue.
Tick tock! Tick tock! Get on the job ASAP. Even if no one much cares about an app when it's running fine, people will notice, be unhappy, and complain bitterly when it's down. The old model of "down" was "completely kaput," but that's too simple. Poor performance, variable performance, and transient errors also count. If the app or site that's down is truly critical, the stakes and the aggravation are multiplied by a number that seems to approach infinity.
No matter how good the resolution team is, it needs time to work. It takes time just to figure out what the problems are and what variables can be safely adjusted; then it takes time to do the adjusting. Usually, when you get a "help!" call, you're already behind. In this case, by the time I was asked to have a look, the performance problem was quite visible, and threatening to become an emergency. When the call comes, jump in fast.
Communicate. When you're working at full speed, it seems wrong to pause to document what you've found, or to communicate what you're doing about it. That's true in any project, but when alarm bells are going off, everyone's under pressure, and everything's double-plus urgent, it seems impossible. Do it anyway. Even if you don't yet have a diagnosis or a fix, organized, occasional updates reassure folks that someone's on the job, making them feel less frustrated and helpless. It helps turn a problem from an "us/them" or "My God! Haven't you fixed it yet?!" situation into a cooperative "we" situation. It also helps the resolution team to use a common whiteboard or trouble-ticket system to keep many separate, concurrent activities straight.
You can't manage what you can't measure. This statement--variously attributed to Bill Hewlett, W. Edwards Deming, Peter Drucker, and Lord Kelvin, among others--is manifestly untrue. It's pretty common, actually, to start managing what you haven't measured. "Change this!" "Fix that!" "OK, we need to do it this way!" It's just not such a good idea, operating in the blind. Far better to know exactly what your problem is, and how severe it is. Otherwise, you can't triage. You'll spend your time fixing things that aren't that important or high-leverage--and at the same time, missing those that are. So the real advice is "you can't do a good job of managing what you can't measure."
When I arrived, there were no good metrics. Not even for basic things like how many users we were getting per hour, or what pages and resources were getting hit the hardest. The site had analytics, to be sure, but marketing owned that dashboard--not development or operations. Job one: Get good data. I analyzed the site from an external, user perspective; this immediately suggested high-impact improvements. Getting access to the internal analytics immediately told me which areas and resources were most critical to optimize. Even in a large site or app, tuning the top few views can massively improve overall perceived performance. It can also reduce the workload, so everything runs faster. But without good metrics, it's hard to know what to focus on, and what not. You can guess, but you're just guessing. Without metrics, it's also hard to see how much you've improved the situation with each fix.
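If no one will hand you a dashboard, the Web server's own access log can usually supply the basics. Here is a minimal sketch of that idea, not the tooling used on this job. It assumes the log is in the widely used Common Log Format; the `summarize` function and the sample lines in the usage note are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Common Log Format lines such as
# 203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
)

def summarize(lines):
    """Return (hits per hour, hits per path, distinct client IPs per hour)."""
    hits_per_hour = Counter()
    hits_per_path = Counter()
    ips_per_hour = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip anything that is not a recognizable request line
        ip, stamp, _method, path, _status, _size = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hour = when.strftime("%Y-%m-%d %H:00")
        hits_per_hour[hour] += 1
        hits_per_path[path] += 1
        ips_per_hour.setdefault(hour, set()).add(ip)
    # Distinct IPs are only a rough stand-in for "users".
    visitors_per_hour = {hour: len(ips) for hour, ips in ips_per_hour.items()}
    return hits_per_hour, hits_per_path, visitors_per_hour
```

Feed it the lines of a log file and `hits_per_path.most_common(10)` gives a first cut at "what's getting hit the hardest." It's crude, but it beats guessing.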
The other key piece of information analytics gave me was how crucial performance improvements would be. User load was increasing rapidly--hour-by-hour, day-by-day--in the run-up to a major event. Tick tock! Tick tock! Having this data instantly showed me that if we didn't get things fixed, now, we were going to be overrun by demand. This indicated how much optimization we needed to do, as well as how much improvement each optimization had to yield for spending time on it to make sense. Metrics are invaluable because they let you prioritize with confidence.
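To put a number on "tick tock," a back-of-the-envelope projection is often enough. The sketch below assumes load keeps compounding at its recent average rate, which is a simplification, and that you have some estimate of the capacity at which the site falls over. The function name and the figures in the usage note are hypothetical.

```python
import math

def hours_until_overrun(hourly_load, capacity):
    """Estimate hours until load exceeds capacity.

    Assumes steady compound growth at the average rate seen in hourly_load,
    a list of per-hour load figures, oldest first.
    """
    periods = len(hourly_load) - 1
    if periods < 1 or hourly_load[0] <= 0 or hourly_load[-1] <= 0:
        raise ValueError("need at least two positive samples")
    first, last = hourly_load[0], hourly_load[-1]
    if last >= capacity:
        return 0.0  # already overrun
    growth = (last / first) ** (1 / periods)  # average per-hour multiplier
    if growth <= 1:
        return math.inf  # flat or shrinking load never hits the ceiling
    return math.log(capacity / last) / math.log(growth)
```

If load has doubled every hour from 1,000 to 4,000 users and the server tops out around 16,000, `hours_until_overrun([1000, 2000, 4000], 16000)` comes out at about two hours. That is how long you have, and it tells you which fixes are worth starting.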
Be incremental. We did some "big things," such as rehosting on a much beefier server. But big things generally take big time, which we did not have. So while waiting for the rehosting to be finished, we simultaneously did several rounds of incremental improvements. The details are common sense--minimize the size of each request, arrange for static content to be served from the cloud (a.k.a. a CDN), improve cacheability, et cetera. The specifics vary based on the kind of application or site you're dealing with, but the core idea remains: make improvements that take effect rapidly rather than waiting for bigger improvements that arrive slowly. Being incremental allowed us to improve the quality of service and serve tens of thousands of additional customers, even before bigger hardware came online. Bigger hardware certainly was required, but no matter how big the hardware, optimizing away 75 percent of the workload goes a long way!
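The arithmetic behind that last point is worth spelling out. As a rough sketch, assume the work you optimize away was the bottleneck and that every user costs about the same. Then cutting the per-user workload multiplies the number of users the same hardware can carry. The function name is made up for illustration.

```python
def capacity_multiplier(workload_reduction):
    """How many times more users the same hardware can serve.

    workload_reduction is the fraction of per-user work removed (0.75 = 75%).
    Assumes the removed work was the bottleneck and scales evenly per user.
    """
    if not 0 <= workload_reduction < 1:
        raise ValueError("reduction must be at least 0 and below 1")
    return 1 / (1 - workload_reduction)
```

Under those assumptions, `capacity_multiplier(0.75)` is 4.0: each user now costs a quarter of what they did, so the old server can handle four times the crowd.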
There's a second reason to embrace incrementalism: You have to. By the time you've responded to a crisis, you can't change any of the really big variables. You can't change the app, the database, or the platform. You may be able to fire up a new database node, or build a new index, but you're not going to change the database structure or schema. You may be able to upgrade or reconfigure a server, storage array, or network switch, but the basic design and deployment choices have already been made, long ago. You have to focus on the variables you can affect, and quickly.
Be careful! When you're asked to make big improvements, rapidly, under pressure, it's easy to make mistakes. Mistakes that are entirely benign in a testing or development environment can easily be fatal when working on a live, production service. "Oops! Well, I'd better put that back the way it was!" becomes "Oh my God! I've just disabled the site!" It's not only possible, but easy, to break things that weren't broken before. So even though you're working fast, don't neglect to take back-ups, perform small-scale proof-of-concept tests, and otherwise prepare for quick "put it back the way it was!" moments before you make changes. Think of your fixes as surgery on a live patient. "First, do no harm."
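One cheap way to prepare for a "put it back the way it was!" moment is to make the backup part of the change itself. The following is a minimal sketch for a single configuration file, under the assumption that copying the file back is a sufficient rollback. The `guarded_edit` helper and the file name in the usage note are hypothetical.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def guarded_edit(path):
    """Back up a file, hand it over for editing, and restore it on failure."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # take the backup before touching anything
    try:
        yield path
    except Exception:
        shutil.copy2(backup, path)  # put it back the way it was
        raise
```

You would wrap a risky change as `with guarded_edit("site.conf") as conf:` and, inside the block, edit the file and then run whatever quick check you trust. If the check raises an error, the original file is restored before the error reaches you. A real service needs more than this (restarts, database changes, and so on), but the habit is the same.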
An ounce of preparation is worth a pound of cure. In an ideal world, fire trucks would never roll. They wouldn't need to. But even in our imperfect world, the story of fire fighting has changed entirely over the years. Whole blocks and cities used to burn. Now they seldom do. Why not? Preparation and prevention. Ubiquitous fire hydrants. Building codes. Fire-retardant materials. Smoke alarms. Built-in fire suppression systems like sprinklers. Well-trained, well-equipped, professional fire-fighting corps. Actual fire fighters do their real work upfront. So should IT shops.
Inevitably during a crisis, there will be a lot of "I wish we had done...earlier!" or "We should have..." moments. On your current and future projects, think about that beforehand, and try to "get in front of the problem." Make sure that you do at least basic performance analysis and capacity planning as part of the development process. Tuning ahead of production can be much more systematic, and therefore more effective. It's certainly lower-stress than doing that work while underway. Beyond performance, do other forms of operational engineering--at a minimum, in the areas of security, availability, and data protection. If your app is one that's going to be put under a lot of stress and load quickly, make sure you have a designated operational team to manage it while it's ramping up. Ideally, take a devops approach that focuses on the transition from development to production operations, and on updating your app/site after it's gone live. Finally, make sure you've pre-positioned your tools: monitors, analytics, and other management/control points so that if something does need live remediation, the focus is on the fixes, and not on "Uhh...What's the problem?!" or "What do we have to work with?" Actual fire fighters do the lion's share of their work upfront, in preparation. So should we.
If you're already in a crisis for which you haven't prepared--sigh--let's get to work fixing this problem. But next time? Be better prepared, OK?