Friday, March 28, 2008

How Not to Recommend

A friend of mine pointed out The Netflix Prize: 300 Days Later this morning. The article describes how difficult it has been to make significant improvements on the Netflix rating prediction problem.

This article reinforces a lot of my own feelings. I continue to believe that recommendations based on ratings are inherently flawed. Among other things:
  • predicting ratings from ratings is hard
  • predicting real user preferences from ratings is even harder
  • most people don’t rate things very much, if at all (which makes #1 even harder)
All of these factors are eased considerably if you just get rid of the idea that you have to predict ratings. Observing real behavior and then predicting that same behavior is vastly easier.

Wednesday, March 26, 2008

The Power of Meta

Evolutionary algorithms are oohhh so trendy, but they often provide very poor convergence rates. This happens particularly when the population is in the neighborhood of a good solution, but the mutation rate is so high that the mutations being produced land far outside that neighborhood.

Decreasing the mutation rate often doesn't help because then it takes forever to find the right neighborhood.

Changing the mutation rate externally can work, but you will usually get the annealing schedule wrong and have too much or too little mutation at some point.

One of the most effective ways to fix these problems is the technique known as meta-mutation. With meta-mutation, each member of the population carries data about how dramatically it should be mutated. When the population is far from a good solution, members that take big steps can be big winners, and their descendants will also take big steps. When the population gets close to a good solution, members with large mutation rates will step far from the best solution and will mostly be unable to improve it, but some of their descendants will have smaller mutation rates, and these will be able to find improved solutions. These calmer descendants will quickly dominate the population because they can produce significant improvements.

Unfortunately, meta-mutation introduces the new problem of how you should mutate the mutation rate. (danger ... recursion alert)

I describe some methods that avoid all of these problems in my 1998 paper Recorded Step Mutation for Faster Convergence. The techniques are very simple and very effective. The paper is available from arXiv.

The real trick is that you can have the mutation rate encode the meta-mutation rate as well. One approach uses a fixed distribution that is scaled by the current mutation rate. Another is to let the last step determine a directional component of the new mutation rate. Together, these approaches can improve convergence rate and final accuracy by many orders of magnitude relative to other approaches. Self-similar meta-mutation alone gives results on symmetric Gaussian bowl problems that are nearly identical to analytically determined optimal annealing schedules. When combined with directional mutation, results are near optimal for Gaussian bowls with low rank cross correlations.
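To make the self-similar idea concrete, here is a minimal Python sketch of a bare-bones (1+1)-style evolution strategy where each candidate carries its own mutation rate, and the rate is itself mutated by a fixed log-normal distribution scaled by the current rate. This is an illustration of the scaling trick only, not the recorded-step or directional methods from the paper; the function names and the tau parameter are my own.

```python
import math
import random

def self_similar_es(f, x0, sigma0=1.0, tau=0.3, iters=2000, seed=42):
    """Minimize f with a (1+1)-style strategy using self-similar meta-mutation.

    Each candidate carries its own step size sigma. The step size is mutated
    by scaling it with a sample from a fixed log-normal distribution, so the
    meta-mutation is automatically scaled by the current mutation rate.
    """
    rng = random.Random(seed)
    x, sigma = list(x0), sigma0
    best = f(x)
    for _ in range(iters):
        # mutate the mutation rate first: fixed distribution, scaled by sigma
        new_sigma = sigma * math.exp(tau * rng.gauss(0, 1))
        # then mutate the candidate using the new step size
        y = [xi + new_sigma * rng.gauss(0, 1) for xi in x]
        fy = f(y)
        if fy < best:  # keep improvements only (elitist selection)
            x, sigma, best = y, new_sigma, fy
    return x, best

# symmetric Gaussian bowl (sphere function)
sphere = lambda v: sum(vi * vi for vi in v)
x, best = self_similar_es(sphere, [10.0] * 5)
```

Far from the optimum, accepted steps tend to have large sigma; near the optimum, only small-sigma offspring improve, so the step size anneals itself without any external schedule.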

Tuesday, March 25, 2008

Hello world for map-reduce

There has been a surge of interest lately in map-reduce based parallel programming from many quarters. The public, non-Google side of that interest is largely focused around an open source software system known as Hadoop.

Lots of people are doing some very cool things with Hadoop. Yahoo is now running key parts of their web indexing infrastructure using Hadoop. Academics have described very impressive results in large scale machine translation developments. Facebook uses Hadoop for key parts of their metrics pipeline. IBM has developed a functional programming system on Hadoop. Imitators have been spawned.

But one key ingredient of any new computing technology is still missing.

What Hadoop has been missing to date is a good way to write a hello-world application in just a few lines of code. It has often been said that the ability to do simple things simply is a critical capability for a new technology. Hadoop, and map-reduce programming in general, does not lack a good hello-world example; the word-counting example provides that. What is missing is the ability to write this program in just a few lines of code. The word-count example that comes with the Hadoop distribution is over a hundred lines of code, most of which are impossible to motivate for a new user.

The basic idea is simple: the map phase tokenizes input lines and emits each word as a key with an (initial) count of 1. The reduce phase takes all of the counts for a single word and sums them. As a refinement, we can use a combiner to pre-reduce counts within the output of a single mapper.
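To show the shape of that computation without any of the cluster machinery, here is a single-process Python sketch of the map/combine/reduce pipeline. The function names are mine and no Hadoop is involved; the point is only how little logic the word count itself requires.

```python
from collections import Counter
from itertools import chain

def wc_map(line):
    # map phase: tokenize the line and emit (word, 1) pairs
    return [(w, 1) for w in line.split()]

def wc_combine(pairs):
    # combiner: pre-reduce counts within a single mapper's output
    c = Counter()
    for w, n in pairs:
        c[w] += n
    return c.items()

def wc_reduce(word, counts):
    # reduce phase: sum all of the counts for a single word
    return word, sum(counts)

def word_count(lines):
    # "shuffle": group the combined pairs by word, then reduce each group
    grouped = {}
    for w, n in chain.from_iterable(wc_combine(wc_map(line)) for line in lines):
        grouped.setdefault(w, []).append(n)
    return dict(wc_reduce(w, ns) for w, ns in grouped.items())

counts = word_count(["this is input text", "for the word-counter"])
```

Everything beyond these few lines in the real Hadoop example is cluster plumbing, not word counting.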

So why is this more than 5 lines of code? One answer is that running a real-world map-reduce cluster has some real complexity if you want the best possible results. Another is that hiding complex implementations in Java requires lots of noise tokens. It isn't that Java is bad; it just isn't good at hiding complexity from naive users.

My answer to this problem is to move to Groovy, a language whose raison d'être is specifically to hide the cruft in Java programs. Groovy isn't Java or an extension of Java, but Groovy programs inter-call with Java programs completely transparently and compile to the same byte codes.

So what should word-count look like? I think that it should look like functional programming. Our mapper should look like a function as should our reducer. They should be combined to produce a map-reduce program which should itself be a function that takes inputs and produces outputs. Those inputs should be any kind of input that we might like to use including lists of strings (for testing), local files and files stored in a distributed data store.

So our program really ought to look like this:
  wc = mapReduce(
    {key, line, out, report -> line.split().each {w -> out.collect(w, 1)}},
    {w, counts, out, report -> out.collect(w, counts.sum())})
and we should be able to use this program in a lot of different ways:
  wc(["this is input text", "for the word-counter"])
  wc(new File("/local/input/file"))
  wc(findInterestingTextUsingMapReduce(new HdfsFile("/existing/distributed/file")))
and we should be able to access the output very simply:
  wc(["this is input text"]).eachLine {
    println it
  }
Subject to some limitations in Groovy 1.5.4, this is exactly what my new system does. I call this new system grool because it is pretty thin porridge right now.

Send me email for a copy. I will get it onto a standard kind of open-source repository soon(ish), but you can try it sooner.

Friday, March 21, 2008

Surprise and Coincidence

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need:

                     Event A                   Everything but A
Event B              A and B together (k_11)   B, but not A (k_12)
Everything but B     A without B (k_21)        Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G^2) is very simple:

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}

The trick of adding k==0 handles the case where some element of k is zero. Since such a term is multiplied by that same zero, it contributes nothing to the sum, but the indicator keeps us from trying to compute log(0). The Java or C versions of this (ask me for a copy) are almost as simple, but not quite as pretty.
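For readers who would rather not read R, here is the same computation as a small Python sketch. This is my own translation of the two-liner above, with zero cells simply skipped rather than patched with an indicator.

```python
import math

def H(k):
    # the "entropy" used above: sum of (k_ij / N) * log(k_ij / N),
    # with zero cells dropped (they contribute nothing to the sum)
    N = sum(k)
    return sum(x / N * math.log(x / N) for x in k if x > 0)

def llr(k11, k12, k21, k22):
    # log-likelihood ratio (G^2) score for a 2x2 contingency table
    k = [k11, k12, k21, k22]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2 * sum(k) * (H(k) - H(rows) - H(cols))

print(llr(5, 5, 5, 5))    # independent events: score near zero
print(llr(10, 0, 0, 10))  # perfectly associated events: large score
```

Independent events give a score near zero, while strongly associated events score high, which is exactly what makes this useful for finding interesting features in counts.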

So there you have it. Two lines of code and the world is your oyster.

Why the title?

What kind of title, you may ask, is "Parallel and Uncertain"?

If you were to ask, I would answer, "descriptive".

My main interests lately have to do with parallel processing and (automated) reasoning in the face of uncertainty. Thus, these blog posts will be largely about those topics.

What more can I say?