Sunday, April 4, 2010

Big Data and Big Problems

In recent talks, I have mentioned an ACM paper that shows just how bad random access to various storage devices can be. About half the time I mention this paper, somebody asks for the reference, which I never know off the top of my head. So here it is:

The Pathologies of Big Data

Of course, the big deal is not that disks are faster when read sequentially. We all know that, and most people I talk to know that the problem is worse than you would think. The news is that random access to memory is much, much worse than most people would imagine. In the slightly contrived example in this paper, for instance, sequential reads from a hard disk deliver data faster than randomized reads from RAM. That is truly horrifying.

Even though you can pick holes in the example, it shows by its mere existence that the problem of random access is much worse than almost anybody has actually internalized.
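
If you want to poke at the memory side of this yourself, here is a rough sketch. It is not the benchmark from the paper, just my own toy comparison: it sums the same in-memory array twice, once visiting the indices in order and once in shuffled order. The names `timed_sum` and `compare`, the array size, and the seed are all arbitrary choices for illustration. Python's interpreter overhead hides much of the gap, so treat the timings as a hint at the effect and not a measurement of it; a version in C would show the difference more starkly.

```python
import random
import time
from array import array


def timed_sum(data, order):
    """Sum data[i] for each index i in order; return (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start


def compare(n=1_000_000, seed=0):
    """Time a sequential pass and a shuffled pass over the same n values."""
    data = array("q", range(n))        # contiguous block of 64-bit integers
    sequential = list(range(n))        # indices 0, 1, 2, ...
    shuffled = sequential[:]           # same indices, visited in random order
    random.Random(seed).shuffle(shuffled)
    seq_total, seq_time = timed_sum(data, sequential)
    rnd_total, rnd_time = timed_sum(data, shuffled)
    return seq_total, seq_time, rnd_total, rnd_time


if __name__ == "__main__":
    seq_total, seq_time, rnd_total, rnd_time = compare()
    # Both passes touch exactly the same elements, so the sums must agree.
    assert seq_total == rnd_total
    print(f"sequential: {seq_time:.3f}s  random: {rnd_time:.3f}s")
```

Both passes do identical arithmetic on identical data; only the order of the memory accesses differs, so any gap in the two timings comes from the access pattern.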

1 comment:

Doug Cutting said...

That link didn't work for me on Linux. But I found the article at:
http://queue.acm.org/detail.cfm?id=1563874

Good article!