Tuesday, June 29, 2010

The worst fail I've ever seen

If you spend any time working with computers in an office setting, you've undoubtedly experienced "an outage". This is the dreaded, panic-stricken period when something goes wrong. It could be a complete loss of power, despite on-site generators that "were supposed to kick on". It could be a car hitting a telephone pole that knocks out your T1 line. Or, it could be a hard drive failing, taking all your data with it. Some of these are avoidable, and others are not.

But IT is expected to be somewhat proactive. They are expected to have basic monitoring in place to ensure they notice a machine dying before paying customers do. And when the machine dies, they should probably have some plan for bringing it back online. I've seen this done well, and I've seen this go awry (and when I was a sysadmin, I was even responsible for a few less-than-perfect recoveries).

But what happened last week was an epic fail of monumental proportions. Our company continues to support a legacy e-commerce system. It still generates revenue for us, but more importantly, it is the platform on which a number of small businesses run. In other words, without this service up and running, our customers are not able to do business. I'd like to think I work at a company that cares for its customers and wants to see them succeed.

And I understand - shit happens. So, when the system went down, it was not a surprise. After all, one reason the system is being replaced is that it . . . well, frankly, it sucks!

But this didn't just lead to an outage - the system was down for days. Here's what happened, according to our "IT Director":
[T]he server that holds the database files ran out of disk space. When this happened, it corrupted both the database file and the transaction log file that's used to recreate activity in the event of the database file becoming corrupted. This left us with no way to recreate the activity on the platform and has left us with no easy way to get that information back.

No big deal. That's what backups are for, right? Well, here's the explanation from our marketing folks a few days later:
During a reconfiguration of our backup utility (on April 22), another server took over mirroring the backups during the reconfiguration. After the reconfiguration was complete, the backup process was transitioned back to its normal process. As a precaution with the updated system, the backups were left on the temporary server and that is where the recovery came from.

In other words, the backup strategy in place was fucked up and no one noticed. The most recent backup was from 2 months ago. 2 fucking months! Are you kidding me? I've seen a lot of IT disasters that have cost people and companies a lot of money. While this may not have been the most expensive, it certainly was the most careless and avoidable - by an extremely wide margin!

In large enterprises, a file system running out of space can result in firings; at the very least, there would be a lot of post-mortem meetings. Downtime of hours (let alone days) causes more unemployment. Going 2 months without backups would get most IT departments "swept clean". In smaller companies, those responses aren't possible since the "IT department" may be a single person. But it certainly is a sign that something is seriously fucked up and mismanaged.

Since we only have 2 people in our IT department (the IT Director and a sysadmin), no one has been fired. As far as I know, there have been no reprimands or repercussions. The IT guys are dropping hints about not having enough staff and already working 9 - 12 hours a day. But this isn't about staff or hours; it's about planning and management. The current "IT Director" (said sarcastically) has proven himself unable to manage an IT department and maintain its systems.

Even if the company isn't going to attempt to fix this, the IT Director should man up and step down. Or at least take responsibility. The sysadmin has admitted publicly to making mistakes and dropping the ball, but the "IT Director" is just keeping his head down, hoping the storm will pass.

And, if you think I'm being harsh, let me assure you, I am not. After all, 2 1/2 years ago, the same system had a similar multiday outage, because . . . you guessed it . . . no fucking backups! You can bitch about not having enough people or working too many hours already, but there's no excuse for that!

I've seen a lot of IT fuck ups over the years, and most I can look back on and laugh about. I'm not laughing now!

Thursday, June 3, 2010

MapReduce - just another design pattern!

A co-worker mentioned that he was playing around with MapReduce recently. He told me how "fast" it was and how he could process a huge batch of log files "in real time". And the whole time, there was something in the back of my head telling me that something wasn't right about all of this. And it's not just because he's a sysadmin, and not a developer.

Because he's testing on a laptop (with a single CPU no less), he's not getting any benefit from using a MapReduce approach. Sure, he's using a shiny new technology that helps him sound cool and cutting edge. But in reality, the processing his application is performing could be done, probably more efficiently, using a different approach.

MapReduce on a single machine is nothing more than a design pattern! On a single processor, it's not even a good design pattern. You're taking a large data set and breaking it up to be "mapped" by a bunch of workers that are going to end up waiting on each other, only to "reduce" the data into a single collection. There's no advantage over processing the data serially. In fact, the cost of breaking up the input and merging the results is additional overhead.
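
To make that concrete, here's a throwaway sketch (in Python, with made-up data - it has nothing to do with his actual code) of the same word tally done "MapReduce style" and done with a plain loop on one machine:

    from collections import Counter
    from functools import reduce

    lines = ["error timeout", "error disk full", "ok", "error timeout"]  # made-up stand-in data

    # The "MapReduce" version on one box: split the input, "map" each chunk, merge ("reduce").
    def map_chunk(chunk):
        counts = Counter()
        for line in chunk:
            counts.update(line.split())
        return counts

    chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]   # pretend "splits"
    mapped = [map_chunk(c) for c in chunks]                      # the "map" phase
    mr_result = reduce(lambda a, b: a + b, mapped, Counter())    # the "reduce" phase

    # The boring serial version: one pass, no chunking, no merging.
    serial_result = Counter()
    for line in lines:
        serial_result.update(line.split())

    assert mr_result == serial_result  # same answer; the split/merge steps bought nothing

Same answer either way; on a single box, all the chunking and merging buys you is extra work.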

I suppose the "style" of typical map and reduce code may seem novel. Since these often take a more functional approach, programmers who have only worked with object-oriented code see the simplicity of passing functions-as-arguments as a wondrous and magical thing. But this is not a MapReduce-specific feature. Again, the same result can be achieved in many other ways.
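
For what it's worth, the "magic" is just higher-order functions, which Python, Perl, and most other languages have had forever. A trivial example with made-up numbers:

    from functools import reduce

    sizes = [512, 2048, 128, 4096]                   # made-up response sizes in bytes
    kb = map(lambda n: n / 1024.0, sizes)            # "map": transform each item
    total_kb = reduce(lambda a, b: a + b, kb, 0)     # "reduce": fold everything into one value
    print(total_kb)                                  # 6.625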

I'm currently finding many business people (non-programmers) around the office with huge misconceptions about MapReduce. There seems to be a consensus that MapReduce is synonymous with NoSQL, and that you can take billions of rows of data, run an ad hoc query, and get results in sub-second times. After all, that's the experience one gets when searching Google. Right?

At least I think I've figured out where these misconceptions are coming from! And I'm pretty sure, as he continues to tinker with his super-speedy log splicer, he's going to report his magnificent performance results. I trust he'll never benchmark his ingenious invention against a simple Perl script (since there's no buzz around such an old-school approach to parsing log files).
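
For the record, the kind of "old school" baseline I have in mind is maybe a dozen lines. I don't know what his logs actually look like, so this made-up example just assumes something common-log-ish and tallies status codes (the Perl version would be just as short):

    import sys
    from collections import Counter

    # Assumes a common-log-format file where the HTTP status code is the 9th field;
    # adjust the index (or the whole split) for whatever the real log format is.
    status_counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            fields = line.split()
            if len(fields) > 8:
                status_counts[fields[8]] += 1

    for status, count in status_counts.most_common():
        print("%s %d" % (status, count))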

Too bad he doesn't realize his "big data" is laughably small compared to the amounts of data being processed when MapReduce was conceived. The bright spot in all of this is that, as long as he continues to use MapReduce for purposes it was never intended for, he won't have to ask management to fund more servers. Maybe it'll also keep him busy and out of the developers' hair.