Tuesday, June 29, 2010

The worst fail I've ever seen

If you spend any time working with computers in an office setting, you've undoubtedly experienced "an outage". This is the dreaded, panic-stricken period of time when something goes wrong. It could be a complete loss of power, despite on-site generators that "were supposed to kick on". It could be a car hitting a telephone pole that knocks out your T1 line. Or, it could be a hard drive failing, taking all your data with it. Some of these are avoidable, and others are not.

But IT is expected to be somewhat proactive. They are expected to have basic monitoring in place to ensure they notice a machine dying before paying customers do. And when the machine dies, they should probably have some plan for bringing it back online. I've seen this done well, and I've seen this go awry (and when I was a sysadmin, I was even responsible for a few less-than-perfect recoveries).
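
For the record, "basic monitoring" doesn't require a team of engineers or an expensive suite. The sketch below is the kind of thing I mean - a few lines of Python dropped into cron that yell when a disk gets close to full. The threshold, mount points, and email addresses are made up for the example (nothing like this actually existed on our side), but something this simple could have given days of warning before the database server filled up.

    #!/usr/bin/env python3
    # Hypothetical disk-space check - meant to be run from cron every few minutes.
    # The 90% threshold, mount points, and addresses are made-up examples.
    import shutil
    import smtplib
    from email.message import EmailMessage

    THRESHOLD = 0.90                    # complain when a filesystem is 90% full
    MOUNTS = ["/", "/var/lib/mysql"]    # wherever the database files actually live

    def fullness(mount):
        """Return how full a filesystem is, as a fraction between 0 and 1."""
        usage = shutil.disk_usage(mount)
        return usage.used / usage.total

    def alert(mount, pct):
        """Email the on-call address before the disk fills up and corrupts something."""
        msg = EmailMessage()
        msg["Subject"] = f"DISK WARNING: {mount} is {pct:.0%} full"
        msg["From"] = "monitor@example.com"
        msg["To"] = "oncall@example.com"
        msg.set_content(f"{mount} is {pct:.0%} full. Add space or clean up now.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        for mount in MOUNTS:
            pct = fullness(mount)
            if pct >= THRESHOLD:
                alert(mount, pct)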

But what happened last week was an epic fail of monumental proportions. Our company continues to support a legacy e-commerce system. It continues to generate revenue for us, but more importantly, it is the platform on which a number of small businesses run. In other words, without this service up and running, our customers are not able to do business. I'd like to think I work at a company that cares for its customers and wants to see them succeed.

And I understand - shit happens. So, when the system went down, it was not a surprise. After all, one reason the system is being replaced is because it . . . well, frankly it sucks!

But, this didn't just lead to an outage - the system was down for days. Here's what happened, according to our "IT Director":
[T]he server that holds the database files ran out of disk space. When this happened, it corrupted both the database file, and the transaction log file that's used to recreate activity in the event of the database file becoming corrupted. This left us with no way to recreate the activity on the platform and has left us with no easy way to get that information back.

No big deal. That's what backups are for, right? Well, here's the explanation from our marketing folks a few days later:
During a reconfiguration of our backup utility (on April 22), another server took over mirroring the backups during the reconfiguration. After the reconfiguration was complete, the backup process was transitioned back to its normal process. As a precaution with the updated system, the backups were left on the temporary server and that is where the recovery came from.

In other words, the backup strategy in place was fucked up and no one noticed. The most recent backup was from 2 months ago. 2 fucking months! Are you kidding me? I've seen a lot of IT disasters that have cost people and companies a lot of money. While this may not have been the most expensive, it certainly was the most careless and avoidable - by an extremely wide margin!
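
And before anyone tells me that verifying backups is hard: it isn't. Here's a rough sketch of a backup-freshness check - the backup directory and the 24-hour limit are my own assumptions for the example, not anything that exists on our systems - that, run daily from cron, would presumably have started complaining within a day of that April 22nd reconfiguration instead of two months later.

    #!/usr/bin/env python3
    # Hypothetical backup-freshness check: fail loudly if the newest file in the
    # backup directory is missing or more than a day old. The directory path and
    # the 24-hour limit are assumptions for the sake of the example.
    import sys
    import time
    from pathlib import Path

    BACKUP_DIR = Path("/backups/ecommerce")   # wherever the dumps are supposed to land
    MAX_AGE_HOURS = 24

    def newest_backup_age_hours(directory):
        """Return the age in hours of the newest file, or None if there are no files."""
        files = [f for f in directory.iterdir() if f.is_file()]
        if not files:
            return None
        newest = max(f.stat().st_mtime for f in files)
        return (time.time() - newest) / 3600

    if __name__ == "__main__":
        age = newest_backup_age_hours(BACKUP_DIR)
        if age is None:
            print(f"FAIL: no backups at all in {BACKUP_DIR}")
            sys.exit(2)
        elif age > MAX_AGE_HOURS:
            print(f"FAIL: newest backup is {age:.0f} hours old")
            sys.exit(1)
        print(f"OK: newest backup is {age:.1f} hours old")

Hook that exit code up to anything that sends an email and you've got more backup verification than we apparently had for two months.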

In large enterprises, file systems running out of space can result in firings; at the very least, there would be a lot of post-mortem meetings. Downtime of hours (much less days) causes more unemployment. Lack of backups for 2 months would get most IT departments "swept clean". In smaller companies, those responses aren't possible, since the "IT department" may be a single person. But it certainly is a sign that something is seriously fucked up and mismanaged.

Since we only have 2 people in our IT department (the IT Director and a sysadmin), no one has been fired. As far as I know, there have been no reprimands or repercussions. The IT guys are dropping hints about not having enough staff and already working 9 - 12 hours a day. But this isn't about staff or hours, it's about planning and management. The current "IT Director" (said sarcastically) has proven unable to manage an IT department and maintain its systems.

Even if the company isn't going to attempt to fix this, the IT Director should man up and step down. Or, at the very least, take responsibility. The sysadmin has admitted publicly to making mistakes and dropping the ball, but the "IT Director" is just waiting with his head down, hoping the storm will pass.

And, if you think I'm being harsh, let me assure you, I am not. After all, 2 1/2 years ago, the same system had a similar multi-day outage, because . . . you guessed it . . . no fucking backups! You can bitch about not having enough people or already working too many hours, but there's no excuse for that!

I've seen a lot of IT fuck ups over the years, and most I can look back on and laugh about. I'm not laughing now!
