A co-worker mentioned that he was playing around with MapReduce recently. He told me how "fast" it was and how he could process a huge batch of log files "in real time". And the whole time, something in the back of my head was telling me that something wasn't right about all of this. And it's not just because he's a sysadmin and not a developer.
Because he's testing on a laptop (with a single CPU no less), he's not getting any benefit from using a MapReduce approach. Sure, he's using a shiny new technology that helps him sound cool and cutting edge. But in reality, the processing his application is performing could be done, probably more efficiently, using a different approach.
MapReduce on a single machine is nothing more than a design pattern! On a single processor, it's not even a good design pattern. You're taking a large data set and breaking it up to be "mapped" by a bunch of workers that are going to end up waiting on each other, only to "reduce" the data into a single collection. There's no advantage over processing the data serially. In fact, the cost of breaking up the input and merging the results is additional overhead.
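To make that concrete, here is a rough Python sketch of the same job done both ways on one machine. The log lines, the "status code is the last field" format, and every function name are made up for illustration; this is not my co-worker's actual code.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines; the format (status code as the last field) is an assumption.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /admin 403",
    "GET /missing 404",
]

def count_serial(lines):
    """Single pass over the input, one counter, no splitting or merging."""
    counts = Counter()
    for line in lines:
        counts[line.rsplit(" ", 1)[-1]] += 1
    return counts

def split_into_chunks(lines, n_chunks):
    """Extra step 1: partition the input for the 'workers'."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_chunk(chunk):
    """The 'map' step: count status codes within one chunk."""
    return Counter(line.rsplit(" ", 1)[-1] for line in chunk)

def reduce_counts(left, right):
    """The 'reduce' step (extra step 2): merge two partial counts."""
    return left + right

def count_mapreduce_style(lines, n_chunks=3):
    chunks = split_into_chunks(lines, n_chunks)
    # On a single CPU the chunks are still processed one after another.
    partials = map(map_chunk, chunks)
    return reduce(reduce_counts, partials, Counter())

serial_result = count_serial(LOG_LINES)
mapreduce_result = count_mapreduce_style(LOG_LINES)
```

Both functions produce the same counts. The second one just does extra work to split the input and merge the partial results, which only pays off when the chunks can actually run on separate processors or machines.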
I suppose the "style" of typical map and reduce code may seem novel. Since these often take a more functional approach, programmers who have only worked with object-oriented code see the simplicity of passing functions as arguments as a wondrous and magical thing. But this is not a MapReduce-specific feature. Again, the same result can be achieved in many other ways.
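As a small illustration (a sketch with invented numbers and names, nothing specific to any MapReduce framework), passing a function as an argument is an ordinary language feature, and a plain loop gets to the same answer:

```python
from functools import reduce
from operator import add

# Hypothetical response sizes in bytes, chosen only for illustration.
RESPONSE_SIZES = [512, 2048, 128, 4096]

def to_kilobytes(size_in_bytes):
    return size_in_bytes / 1024

def apply_to_each(func, items):
    """A hand-rolled 'map': the function is just another argument."""
    return [func(item) for item in items]

# Functional style using the built-in map and functools.reduce.
functional_total = reduce(add, map(to_kilobytes, RESPONSE_SIZES), 0.0)

# The same idea with the hand-rolled helper and the built-in sum.
handrolled_total = sum(apply_to_each(to_kilobytes, RESPONSE_SIZES))

# A plain loop, with no functions passed around at all.
loop_total = 0.0
for size in RESPONSE_SIZES:
    loop_total += to_kilobytes(size)
```

All three totals come out the same. The functional version reads nicely, but nothing about it requires MapReduce.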
I'm currently finding many business people (non-programmers) around the office with huge misconceptions about MapReduce. There seems to be a consensus that MapReduce is synonymous with NoSQL, and that you can take billions of rows of data, run an ad hoc query, and get results in sub-second times. After all, that's the experience one gets when searching Google. Right?
At least I think I've figured out where these misconceptions are coming from! And I'm pretty sure that, as he continues to tinker with his super-speedy log splicer, he's going to report his magnificent performance results. I trust he'll never benchmark his ingenious invention against a simple Perl script (since there's no buzz around such an old-school approach to parsing log files).
Too bad he doesn't realize his "big data" is laughably small compared to the amounts of data being processed when MapReduce was conceived. The bright spot in all of this is, as long as he continues to use MapReduce for purposes it is not intended for, he won't have to ask management to fund more servers. Maybe it'll also keep him busy and out of the developers' hair.
Thursday, June 3, 2010