Wednesday, April 16, 2008

Reading email

And you thought that would be easy.

If you have ever written a program to receive email (not just pick it up from a POP server), you know just how painful that can be. If you wrote the program in Java, you know this even better.

The world just got better:

http://subethasmtp.tigris.org/

This package is just as simple as reading email should be. Your program decides whether to receive the email and then it gets it. The interface is tiny and it looks like they handle enough of the standards to really work.
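For the record, here is roughly what a minimal receiver looks like. This is only a sketch based on a quick look at the SimpleMessageListener interface, so treat the exact class and method names as approximate and the example.com address as a stand-in:

import java.io.IOException;
import java.io.InputStream;

import org.subethamail.smtp.helper.SimpleMessageListener;
import org.subethamail.smtp.helper.SimpleMessageListenerAdapter;
import org.subethamail.smtp.server.SMTPServer;

public class MailReceiver implements SimpleMessageListener {
    // decide whether we want this message at all
    public boolean accept(String from, String recipient) {
        return recipient.endsWith("@example.com");
    }

    // called with the raw message data once the client has finished sending it
    public void deliver(String from, String recipient, InputStream data) throws IOException {
        // parse or store the message here
        System.out.println("got mail from " + from + " for " + recipient);
    }

    public static void main(String[] args) {
        SMTPServer server = new SMTPServer(new SimpleMessageListenerAdapter(new MailReceiver()));
        server.setPort(2525);   // an unprivileged port for testing
        server.start();
    }
}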

Whew.

Tuesday, April 15, 2008

A random walk from eigenvectors to parallel page rank

Power iteration algorithms are ideal for computing things like page rank. That isn't obvious, however.

The basic idea is that hub-and-authority style algorithms are intimately related to eigenvector or singular value decompositions (depending on whether the links are symmetrical). This also means that there is a close relationship to the asymptotic behavior of random walks on the graph. That probably still isn't obvious, so let's dig in.

Represent the linkage in the web by a matrix whose columns correspond to source pages and whose rows correspond to target pages, with a 1 wherever the source page has a link pointing to the target page. If you start with a vector that has a single non-zero element equal to 1, representing a single page, then multiplying by the linkage matrix gives you a vector with 1's in the positions corresponding to the pages that the original page links to. Multiply again and you get all the pages that you can reach in two steps from the original page.

Mathematically, if we call the original vector x and the linkage matrix A, the pages that x links to are just Ax. The pages that are two steps from x are A(Ax) = A^2 x.
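To make that concrete, here is a toy example in Java (the three-page graph is made up purely for illustration) that follows links by multiplying the link matrix into an indicator vector:

public class LinkWalk {
    // columns are source pages, rows are target pages;
    // A[target][source] = 1 when source links to target
    static final double[][] A = {
        {0, 0, 1},   // page 0 is linked from page 2
        {1, 0, 0},   // page 1 is linked from page 0
        {1, 1, 0},   // page 2 is linked from pages 0 and 1
    };

    static double[] multiply(double[][] m, double[] x) {
        double[] y = new double[m.length];
        for (int row = 0; row < m.length; row++) {
            for (int col = 0; col < x.length; col++) {
                y[row] += m[row][col] * x[col];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 0};                    // start at page 0
        double[] oneStep = multiply(A, x);         // pages one link away: 1 and 2
        double[] twoSteps = multiply(A, oneStep);  // pages two links away: 0 and 2
        System.out.println(java.util.Arrays.toString(oneStep));
        System.out.println(java.util.Arrays.toString(twoSteps));
    }
}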

The eigenvector decomposition of A is just a way of writing A as a product of three matrices:

A = U S U'

U' is the transpose of U, and U has the special property that U'U = I (it is called orthonormal because of this).

S is a diagonal matrix.

There is lots of deep mathematical machinery and beautiful symmetry available here, but for now we can just take this as given.

The set of pages n steps from x are

x_n = A^n x = (U S U')^n x
    = (U S U')^(n-2) (U S U') (U S U') x
    = (U S U')^(n-2) (U S (U'U) S U') x
    = (U S U')^(n-2) (U S^2 U') x
    = ...
    = U S^n U' x

This is really cool because S^n can be computed by just taking each diagonal element and raising it to a power.

Eigenvector decompositions have other, really deep connections. For instance, if you take the diagonal elements of S (call the i-th one s_i), then

\sum_i s_i^n

is the trace of A^n, which counts the closed paths in the graph that are n steps long (the paths that end where they started).
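As a tiny worked example (a made-up graph of three pages in which each page links to both of the others), the linkage matrix is symmetric with eigenvalues 2, -1 and -1, so

\sum_i s_i^3 = 2^3 + (-1)^3 + (-1)^3 = 6

and indeed there are exactly six closed three-step paths: start at any of the three pages and go around the triangle in either direction.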

Connected (or nearly connected) clusters of pages can also be derived from the eigenvector decomposition. This is the basis of so-called spectral clustering. For some very impressive examples of spectral clustering see this paper.

So eigenvectors are cool. But how can we compute them? And how can we compute them on BIG graphs in parallel?

First, note that if A^n = U S^n U' and some of the s_i are bigger than others, the big ones will quickly come to dominate. Pretty quickly, in fact, A^n ≈ u_1 s_1^n u_1', where s_1 is the largest element of S and u_1 is the corresponding column of U. This means that we can compute an approximation of u_1 by just computing A^n x where x is some random vector. Moreover, we can compute u_2 by starting with a different random vector and iterating the same way, but with an additional step where we forbid the result from drifting toward u_1 (by keeping it orthogonal to u_1). With just a few additional wrinkles, this gives us what is called the Lanczos algorithm. Golub and van Loan's excellent book Matrix Computations gives a lot of information on these algorithms.
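Here is a rough sketch of that iteration in Java: plain power iteration plus one deflation step rather than the full Lanczos algorithm, and a dense symmetric matrix rather than sparse storage, so it is only meant to show the shape of the computation:

import java.util.Random;

public class PowerIteration {
    static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < x.length; j++) {
                y[i] += a[i][j] * x[j];
            }
        }
        return y;
    }

    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            s += x[i] * y[i];
        }
        return s;
    }

    static void normalize(double[] x) {
        double norm = Math.sqrt(dot(x, x));
        for (int i = 0; i < x.length; i++) {
            x[i] /= norm;
        }
    }

    // Repeatedly multiply a random vector by a; the result converges to the
    // dominant eigenvector u_1.  If avoid is non-null, its component is removed
    // at every step so the iteration converges to the next eigenvector instead.
    static double[] dominantEigenvector(double[][] a, double[] avoid, int iterations, long seed) {
        Random rand = new Random(seed);
        double[] x = new double[a.length];
        for (int i = 0; i < x.length; i++) {
            x[i] = rand.nextGaussian();
        }
        for (int iter = 0; iter < iterations; iter++) {
            if (avoid != null) {
                double c = dot(x, avoid);
                for (int i = 0; i < x.length; i++) {
                    x[i] -= c * avoid[i];
                }
            }
            x = multiply(a, x);
            normalize(x);
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{0, 1, 1}, {1, 0, 1}, {1, 1, 0}};   // a tiny symmetric link matrix
        double[] u1 = dominantEigenvector(a, null, 100, 1);
        double[] u2 = dominantEigenvector(a, u1, 100, 2);
        System.out.println(java.util.Arrays.toString(u1));
        System.out.println(java.util.Arrays.toString(u2));
    }
}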

The cool thing here is that our random vector can represent a single page and we can approximate the final result by following links. Following links is just a (human-readable) way of saying sparse matrix multiplication. If we do this multiplication against lots of different random starting points, we can quickly build parallel algorithms to compute things like page rank.
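Here is a sketch of one such step: a sparse link-following multiplication where each thread owns a disjoint block of target pages, so no locking is needed. The data layout (links[target] listing the sources that point at it) is just one convenient choice, and a real page rank job would of course run across machines rather than threads:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLinkMultiply {
    static double[] multiply(final int[][] links, final double[] x, int threads) throws Exception {
        final double[] y = new double[links.length];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> tasks = new ArrayList<Future<?>>();
        final int blockSize = (links.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * blockSize;
            final int end = Math.min(links.length, start + blockSize);
            tasks.add(pool.submit(new Runnable() {
                public void run() {
                    // each thread fills in its own block of target pages
                    for (int target = start; target < end; target++) {
                        for (int source : links[target]) {
                            y[target] += x[source];
                        }
                    }
                }
            }));
        }
        for (Future<?> f : tasks) {
            f.get();   // wait for every block to finish
        }
        pool.shutdown();
        return y;
    }

    public static void main(String[] args) throws Exception {
        // the same toy graph as before: links[target] lists the sources pointing at it
        int[][] links = {{2}, {0}, {0, 1}};
        double[] x = {1, 0, 0};
        System.out.println(java.util.Arrays.toString(multiply(links, x, 2)));
    }
}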

Words at random, carefully chosen

On comp.ai, Dmitry Kazakov reiterated the lonely cry of a frequentist against statistical natural language processing. This cry has been repeated many times over the years by many people who cannot abide the treatment of documents and language as if they were random.

Let's examine the situation more carefully.

On Apr 14, 5:28 am, "Dmitry A. Kazakov" wrote:
> ... It cannot be probability because the document is obviously not random ...

The statement "It cannot be probability ..." is essentially a tautology. It should read, "We cannot use the word probability to describe our state of knowledge because we have implicitly accepted the assumption that probability cannot be used to describe our state of knowledge".

The fact that an object has been constructed in its present state by non-random processes outside our ken is no different as far as we can tell than if the object were constructed at random (note that random does not equal uniform). What if the document were, in fact, written using the I Ching (as Philip K Dick is reputed to have written "The Man in the High Castle")? Is it reasonable to describe the text as having been randomly generated now that we know that?

Take the canonical and over-worked example of the coin being flipped. Before the coin is flipped, a reasonable observer who knows the physics of the situation and who trusts the flipper would declare the probability of heads to be 50%. After the coin is flipped, but before it is revealed, the situation is actually no different. Yes, the coin now has a state whereas before it was only going to have a state, but the only real difference is that the physics has become somewhat simpler; the most important factor in answering the question of the probability has not changed. We still do not know the outcome.

Moreover, if the person flipping the coin looks at the coin, that does not and cannot change our answer.

When WE look at the coin, however, we suddenly, miraculously declare that the probability is now 100% that the coin has come up heads. Nothing has changed physically, but our estimate has changed dramatically.

Moreover, if we now examine the coin and find that it has two heads, our previous answer of 50% is still valid in the original context. If we were to repeat the experiment, our correct interpretation is to give 100% as the probability before the flip. The only difference is our state of knowledge.
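To put a little arithmetic behind that, here is a toy version of the update in Java. The 1% prior for a two-headed coin is a number I made up purely for illustration; the point is that the probability we quote for the next flip depends only on what we know so far:

public class CoinKnowledge {
    public static void main(String[] args) {
        // two hypotheses about the coin in hand
        double pFair = 0.99;        // prior belief that the coin is fair
        double pTwoHeaded = 0.01;   // prior belief that it has two heads

        // watch ten flips come up heads, updating our beliefs after each one
        for (int heads = 0; heads <= 10; heads++) {
            double pHeadsNext = 0.5 * pFair + 1.0 * pTwoHeaded;
            System.out.printf("after %2d heads in a row, P(heads next) = %.4f%n", heads, pHeadsNext);

            // Bayes rule: another head is twice as likely under the two-headed hypothesis
            double wFair = 0.5 * pFair;
            double wTwoHeaded = 1.0 * pTwoHeaded;
            pFair = wFair / (wFair + wTwoHeaded);
            pTwoHeaded = wTwoHeaded / (wFair + wTwoHeaded);
        }
    }
}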

So philosophically speaking, probability is a statement of knowledge.

Moreover, by de Finetti's famous theorem, even if this philosophical argument is bogus, the mathematics all works out AS IF there were an underlying distribution on the parameters of the system. That means that we can profitably use this philosophical argument AS IF it were true.

The upshot is that even if you are a frequentist in your heart of hearts, it will still pay to behave as if you were a Bayesian. And I, as a Bayesian, will be able to behave as if you were rational because I will not know your secret.