Sunday, October 31, 2010

Mahout 0.4 released!

Go to the Apache Mahout site for more info.  Here is the official announcement:

We are pleased to announce release 0.4 of Mahout. Virtually every corner of the project has changed significantly since 0.3. Developers are invited to use and depend on version 0.4, even though more change is expected before the next release. Highlights include:
- Model refactoring and CLI changes to improve integration and consistency
- New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
- New Spectral Clustering and MinHash Clustering (still experimental)
- New VectorModelClassifier allows any set of clusters to be used for classification
- Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
- Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
- RecommenderJob has been evolved to a fully distributed item-based recommender
- Distributed Lanczos SVD implementation
- More support for distributed operations on very large matrices
- Easier access to Mahout operations via the command line
- New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
- Sequential logistic regression training framework
- New SGD classifier
- Experimental new type of NB classifier, and feature reduction options for existing one
- New vector encoding framework for high speed vectorization without a pre-built dictionary
- Additional elements of supervised model evaluation framework
- Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
- Can now save random forests and use them to classify new data
- Many, many small fixes, improvements, refactorings and cleanup
Details on what's included can be found in the release notes.
Downloads are available from the Apache Mirrors.

Monday, October 25, 2010

New Mahout release coming

The vote has started for the 0.4 Mahout release.  There is lots of new stuff, but the part that I am most excited about is a fairly comprehensive implementation of logistic regression suitable for large-scale training and high-speed classification.  And there is a whole lot more besides.

With the 0.4 release, Mahout is moving along strongly towards the fabled 1.0 release.  At that point, we will start paying lots of attention to backwards compatibility.  That will be good, but the current wild and woolly policy is pretty handy if you have something in mind that Mahout really, really needs, because we can get new things in pretty readily right now.

Watch the Apache Mahout site for the release when it arrives and watch my twitter feed for an announcement.

Wednesday, October 13, 2010

Why is the sum of two uniform randoms not uniform?

Lance Norkog asks on the Mahout mailing list why adding two uniformly distributed random variables gives a value with a triangular (pyramid-shaped) distribution. I would normally answer on the mailing list, but here I can use lovely math notation. As I mentioned on-list, this is a very basic result that is related to the central limit theorem.

If we were to draw a picture of the joint distribution of these variables \(x\) and \(y\), we would get something that is 1 inside the \([0,1] \times [0,1]\) square and 0 outside that region.

For a given value \(z\) of the sum \(x + y\), the set of points where \(x+y=z\) intersects the square in a diagonal line segment. Where \(z \le 0\) or \(z \ge 2\) that intersection vanishes, and for \(0 < z < 2\) its length varies. The probability density of the sum at a particular value \(z\) is proportional to the length of that intersection. As you can imagine, the length varies linearly and reaches a maximum at \(z = 1\).
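Before doing any integrals, it is easy to check this with a quick simulation. Here is a little R sketch (the sample size and binning are my own arbitrary choices) that draws the histogram of the sum of two uniform deviates and overlays the exact triangular density:

    # simulate the sum of two uniform random values
    z <- runif(100000) + runif(100000)
    hist(z, breaks=50, prob=T)
    # the exact density is z on [0,1] and 2-z on [1,2]
    x <- seq(0, 2, by=0.01)
    lines(x, ifelse(x < 1, x, 2 - x))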

For the sum of three random variables, the situation is more complex to reason about geometrically because we need to worry about the intersection of a plane and a cube.  For more variables, the geometry is not worth the trouble.

If we tackle the problem a bit more rigorously, the easiest way to approach it is to compute the cumulative distribution of the sum. That leads to a convolution integral over the density functions involved. Since the densities are all 1, the integration limits are the key to the value, and those limits have to be broken down into cases. Actually doing these integrals is a pretty rare activity since the normal limit is approximated so well after summing just a few values.
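For the sum of just two uniforms, the convolution is short enough to write out. With \(f_x(x) = f_y(y) = 1\) on \([0,1]\) and zero elsewhere, the density of \(z = x + y\) is

\[
f_z(z) = \int_{-\infty}^{\infty} f_x(t)\, f_y(z - t)\, dt
       = \int_{\max(0,\, z-1)}^{\min(1,\, z)} dt
       = \begin{cases}
           z & 0 \le z \le 1 \\
           2 - z & 1 \le z \le 2 \\
           0 & \text{otherwise}
         \end{cases}
\]

which is exactly the triangle from the geometric argument: the integration limits do all of the work.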

Just how quickly that convergence happens can be seen by looking at the empirical distribution of the sum of three uniform deviates.  I used something very like this R code to produce the graph below:

   # histogram of the sum of three uniform deviates
   hist(runif(100000) + runif(100000) + runif(100000),
        breaks=50, prob=T)
   # normal density with matching mean 3/2 and sd sqrt(3/12) = sqrt(1/4)
   lines(seq(-1,4,by=0.01), dnorm(seq(-1,4,by=0.01), 
        1.5, sqrt(1/4)))

In this graph, the red curve is the normal distribution with the same mean and standard deviation.  As you can see, the peak is a tiny bit too high and the tails are just a skosh too long for the normal to be a good description of the samples of the sum.

This is, however, just the sum of three random values.  If you sum more values, the convergence to the normal distribution is very strong, and by the time you are adding six uniform random values together, the difference between the distributions is no longer visible in a graph like this and can only be detected numerically using lots of data and clever things like a Kolmogorov-Smirnov test.
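For instance, a test along these lines (again, the sample size is my own choice) compares the sum of six uniform deviates against the normal with matching moments; only with lots of data does the test begin to notice the difference:

    # sum of six uniform deviates: mean 6*1/2 = 3, variance 6*1/12 = 1/2
    z <- rowSums(matrix(runif(6 * 100000), ncol=6))
    ks.test(z, "pnorm", mean=3, sd=sqrt(1/2))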

The moral here is that there isn't much of a way to avoid this regression to the normal distribution, and distorting the data to avoid it is probably pointless.

But if you are like me, being just a little more normal always made it easier to get along.