## Saturday, June 17, 2017

### Generalized Log-Likelihood Test for Poisson Distribution

In another blog, I described how a generalized log-likelihood ratio can be used to find interesting differences in counts. In the simplest cases, we have counts for some kind of observation (A and not A) under two conditions (B and not B). The question of interest is whether the rate of A varies with whether or not condition B applies. In classical statistics, we would talk about having a test against a null hypothesis, but in practice we want to look at lots of A's under lots of different conditions B. Testing so many situations makes it very hard to use the classical machinery well, so we have to look at the problem a bit differently.

As I mentioned in other blogs, we can still use a classically derived test known as the generalized log-likelihood ratio as a way of simply ranking different A-B combinations against each other according to how interesting they are. Even without being able to interpret the statistical score as a statistical test, we get useful results in practice.

The generalized log-likelihood ratio most commonly used in these situations is derived assuming we have two binomial observations. This test can be extended to compare two multinomial conditions for independence, but this is rarely done, if only because comparing two binomials is so darned useful.

With the binomial test, we look at the number of positive observations out of some total number of observations for each condition. In some situations, it is much more natural to talk about the number of positive observations not as a fraction of all observations, but as a fraction of the duration of the condition. For instance, we might talk about the number of times we noticed a particular kind of network error under different conditions. In such a case, we probably can say how long we looked for the errors under each condition, but it can be very hard to say how many observations there were without an error.

For a Poisson distribution under two conditions $A$ and $\neg A$, we can observe a count for each of the conditions as well as the total time over which the count is taken. We can arrange our results this way:

|          | Count | $\Delta t$ |
|----------|-------|------------|
| $A$      | $k_1$ | $t_1$      |
| $\neg A$ | $k_2$ | $t_2$      |

This is suggestive of the way that counts from two binomial observations can be arranged to look for a difference under different conditions \cite{dunning93}.

We can investigate whether the Poisson distribution is the same under both conditions using the generalized log-likelihood ratio test. Such a test uses $\lambda$, the generalized log-likelihood ratio,
$\lambda = \frac{ \max_{\theta_1 = \theta_2 = \theta_0} p(k_1 | \theta_0, t_1)\,p(k_2 | \theta_0, t_2) }{ \max_{\theta_1, \theta_2} p(k_1 | \theta_1, t_1)\,p(k_2 | \theta_2, t_2) }$
According to Wilks\cite{wilks1938} and also later Chernoff\cite{Chernoff1954}, the quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed with one degree of freedom.

For the Poisson distribution,
$p(k | \theta, t) = \frac{(\theta t)^k e^{-\theta t}}{k!} \\ \log p(k|\theta, t) = k \log \theta t - \theta t - \log k!$
The maximum likelihood estimator $\hat\theta$ can be computed by maximizing the log probability
$\max_\theta \frac{(\theta t)^k e^{-\theta t}}{k!} = \max_\theta \log \frac{(\theta t)^k e^{-\theta t}}{k!} \\ \log \frac{(\theta t)^k e^{-\theta t}}{k!} = k \log \theta t - \theta t - \log k! \\ \frac{\partial \log L(k | \theta, t)}{\partial \theta} = \frac{k}{ \theta} - t = 0 \\ \hat \theta = \frac{k}{t}$
Returning to the log-likelihood ratio test, after some cancellation we get
$-\log \lambda = k_1 \log k_1 + k_2 \log k_2 - k_1 \log \frac{k_1+k_2}{t_1+t_2} t_1 - k_2 \log \frac{k_1+k_2}{t_1+t_2} t_2$
Some small rearrangement gives the following preferred form that is very reminiscent of the form most commonly used to compute the log-likelihood ratio test for binomials and multinomials
$-2 \log \lambda = 2 \left( k_1 \log \frac{k_1}{t_1} + k_2 \log \frac{k_2}{t_2} - (k_1+k_2) \log \frac{k_1+k_2}{t_1+t_2} \right)$
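As a quick sketch, this final form can be computed directly. The class and method names below are my own for illustration, not from any published library, and the $k \log k$ style terms are taken to be zero when $k = 0$.

```java
// Sketch of the Poisson log-likelihood ratio score derived above.
// Class and method names are illustrative, not from any library.
public class PoissonLlr {
    // computes -2 log lambda for counts k1, k2 observed over durations t1, t2
    public static double score(long k1, double t1, long k2, double t2) {
        return 2 * (kLogRate(k1, t1) + kLogRate(k2, t2)
                - kLogRate(k1 + k2, t1 + t2));
    }

    // k log(k / t), with the conventional limit of 0 when k = 0
    private static double kLogRate(long k, double t) {
        return k == 0 ? 0 : k * Math.log(k / t);
    }

    public static void main(String[] args) {
        // identical rates under both conditions => score near zero
        System.out.println(score(10, 100, 20, 200));
        // rate doubles under condition A => clearly positive score
        System.out.println(score(40, 100, 20, 100));
    }
}
```

Identical rates give a score near zero, while a genuinely different rate pushes the score past the usual one-degree-of-freedom $\chi^2$ threshold of 3.84 for 95% confidence.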

## Wednesday, September 18, 2013

### How Accurate is Mahout for Summing Numbers?

A question was recently posted on the Mahout mailing list suggesting that the Mahout math library was "unwashed" because it didn't use Kahan summation.  My feeling is that this complaint is unfounded and that Mahout is considerably more washed than the original poster suggests.  Here is why I think this.

As background, if you add up lots of numbers using a straightforward loop, you can lose precision. In the worst case the loss is $O(n \epsilon)$, but in virtually all real examples the lossage is $O(\epsilon \sqrt n)$. If we are summing a billion numbers, the square root is $\approx 10^5$ so we can potentially lose 5 sig figs (out of 17 available with double precision).

Kahan summation increases the number of floating point operations by $4 \times$, but uses a clever trick to retain most of the bits that would otherwise be lost. Shewchuk summation uses divide and conquer to limit the lossage with $O(\log n)$ storage and no increase in the number of flops.
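For concreteness, here is a minimal sketch of the Kahan loop itself (the class and method names are mine); the compensation variable carries the low-order bits that naive addition would drop.

```java
import java.util.Arrays;

public class KahanDemo {
    // Kahan's compensated summation: c carries the low-order bits
    // that the running sum would otherwise drop on each addition
    public static double kahanSum(double[] x) {
        double sum = 0;
        double c = 0;
        for (double v : x) {
            double y = v - c;    // reintroduce previously lost bits
            double t = sum + y;  // big + small: low bits of y are lost...
            c = (t - sum) - y;   // ...but can be recovered exactly here
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = new double[10_000_000];
        Arrays.fill(x, 0.1);
        // ten million copies of 0.1: the result is very close to 1.0e6
        System.out.println(kahanSum(x));
    }
}
```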

There are several cases to consider:

1) online algorithms such as OnlineSummarizer.

2) dot product and friends.

3) general matrix decompositions

In the first case, we can often have millions or even billions of numbers to analyze. That said, however, the input data is typically quite noisy and signal-to-noise ratios $> 100$ are actually kind of rare in Mahout applications. Modified Shewchuk estimation (see below for details) could decrease summation error from a few parts in $10^{12}$ to less than 1 part in $10^{12}$ at minimal cost. These errors are $10^{10}$ times smaller than the noise in our data so this seems not useful.

In the second case, we almost always are summing products of sparse vectors. Having thousands of non-zero elements is common but millions of non-zeros are quite rare. Billions of non-zeros are unheard of. This means that the errors are going to be trivial.

In the third case, we often have dense matrices, but the sizes are typically on the order of $100 \times 100$ or less. This makes the errors even smaller than our common dot products.

To me, this seems to say that this isn't worth doing. I am happy to be corrected if you have counter evidence.

Note that BLAS does naive summation and none of the Mahout operations are implemented using anything except double precision floating point.

Here is an experiment that tests to see how big the problem really is:

```java
@Test
public void runKahanSum() {
    Random gen = RandomUtils.getRandom();

    double ksum = 0;                 // Kahan sum
    double c = 0;                    // low order bits for Kahan
    double sum = 0;                  // naive sum
    double[] vsum = new double[16];  // 16 way decomposed sum
    for (int i = 0; i < 1e9; i++) {
        float x = (float) (2 * gen.nextDouble() - 1);
        double y = x - c;
        double t = ksum + y;
        c = (t - ksum) - y;
        ksum = t;
        sum += x;
        vsum[i % 16] += x;
    }
    // now add up the decomposed pieces
    double zsum = 0;
    for (int i = 0; i < vsum.length; i++) {
        zsum += vsum[i];
    }
    System.out.printf("%.4f %.4f %.4f\n",
        ksum, 1e6 * (sum - ksum) / ksum,
        1e6 * (zsum - ksum) / ksum);
}
```

A typical result here is that naive summation gives results that are accurate to within 1 part in $10^{12}$, 8 way summation manages $< 0.05$ parts in $10^{12}$, and 16 way summation is only slightly better than 8 way summation.

If the random numbers being summed are changed to have a mean of zero, then the relative error increases to 1.7 parts in $10^{12}$ and 0.3 parts in $10^{12}$, but the absolute error is much smaller.

Generally, it doesn't make sense to do the accumulation in floats because these operations are almost always memory channel bound rather than CPU bound.  Changing to single precision accumulation in spite of this decreases the accuracy to about 500 parts per million and 200 parts per million, respectively, for naive summation and 8 way summation.

## Wednesday, April 24, 2013

### Learning to Rank, in a Very Bayesian Way

The problem of ranking comments by a crowd-sourced version of "quality" is a common one on the internet.

James Neufeld suggests that Bayesian Bandit algorithms can be applied to this problem. The basic idea is that you would define a stochastic quality metric whose distribution for each comment depends on the up and down votes that comment has received.

Normal ranking algorithms try to estimate the single best value for this quality metric. Neufeld suggests that this value should be sampled from a beta distribution which models the probability that a user would mark the comment positively given that they have marked the comment at all. To present comments to a user, the metric would be sampled independently for each comment and the comments would be sorted according to the resulting scores. Different presentations would necessarily result in different orders, but as users mark comments positively or negatively, the order should converge to one where the comments presented near the top of the list have the highest probability of being marked positively.
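As a sketch of that sampling step, here is one dependency-free way to do it. The names are mine, and the order-statistic trick (the $k$-th smallest of $n$ uniform deviates is distributed as $\mathrm{Beta}(k, n-k+1)$) is just a convenient way to sample a beta distribution without a stats library.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class SampledRanking {
    // Sample Beta(ups + 1, downs + 1) using the fact that the k-th
    // smallest of n uniform deviates is Beta(k, n - k + 1) distributed
    static double sampleQuality(Random gen, int ups, int downs) {
        double[] u = new double[ups + downs + 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = gen.nextDouble();
        }
        Arrays.sort(u);
        return u[ups];  // (ups + 1)-th smallest, zero-based index ups
    }

    public static void main(String[] args) {
        Random gen = new Random();
        // comments as (up, down) vote pairs
        int[][] votes = {{50, 5}, {2, 10}, {9, 1}};
        // each presentation sorts by freshly sampled quality scores
        double[] score = new double[votes.length];
        Integer[] order = {0, 1, 2};
        for (int i = 0; i < votes.length; i++) {
            score[i] = sampleQuality(gen, votes[i][0], votes[i][1]);
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> -score[i]));
        System.out.println(Arrays.toString(order));
    }
}
```

Because the scores are re-sampled on every presentation, a comment with few votes still occasionally sorts to the top, which is exactly the exploration behavior described above.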

One very nice thing about this approach is that it doesn't waste any cycles on determining the ranking of low quality comments. Once the quality of these comments has been determined to be lower than that of the best comments, no more learning need be done with those comments. This accelerates learning of the ranking of the best options dramatically.

This idea is interesting enough that I built a quick implementation which you can find on github.  The main sample code there invents several hundred "comments" each with a uniformly sampled probability of getting a positive rating.  The ideal behavior for ordering the comments would be to put the comment with the highest probability of getting a positive rating first and the one with the lowest probability last.  The way that the program proceeds is that it picks a pageful of twenty comments to show and then proceeds to generate ratings for each of the comments on that page according to the underlying probability associated with the items displayed.  The process of generating pages of comments to show and applying feedback is repeated and performance is measured.

Here are some results of running the program.  Here we have 200 total comments, of which 20 are shown on the page that defines which comments are rated.  Precision is measured here to determine how many of the best 10 comments are shown on the page.  As can be seen, the system shows immediate improvement as ratings are collected.  The performance rises from the initially random 10% precision and passes 50% after 30 pages of ratings.

As James demonstrated in his article and as others have demonstrated elsewhere, this class of algorithm is very effective for this sort of bandit problem.  What is much less well known is how easily you can build a system like this.

### Try it yourself

To run this code, you will need git, maven and java 1.7.  To download the source code and compile the system, do this

```
$ git clone git://github.com/tdunning/bandit-ranking.git
$ cd bandit-ranking
$ mvn package
```

This will download all dependencies of the code, compile the code and run some tests.  To run the test program, do this

```
$ java -jar target/bandit-ranking-*-with-dependencies.jar
```

The output is a thousand lines of numbers that you can drop into R, OmniGraphSketcher or even Excel to produce a plot like the one above.

### Quick code dissection

In the main program for this demo, a BetaBayesFactory is used to construct several BayesianBandit objects (so that results can be averaged later).  This pattern can be used with other kinds of bandit factories.

The BayesianBandit objects allow you to do a variety of things including sampling (BayesianBandit.sample) for the current best alternative, ranking (BayesianBandit.rank) a number of alternatives and providing training data (BayesianBandit.train).  Sampling is used in a traditional multi-armed bandit setting such as with A/B testing.  Ranking is used as it is here for getting a list of best alternatives and training is used ubiquitously for feeding back training data to the bandit.

Evaluation can be done by computing precision as is done here (how many good items are in the top 20?) or by computing regret.  Regret is defined as the difference between the mean payoff of the best possible choice and the mean payoff of the choice made by the bandit.  For the ranking problem here, I assume that payoff of a page is the sum of the probabilities of positively rating each item on a page.

The BetaBayesFactory internally uses a beta-binomial distribution to model the likelihood of a positive rating for each rank. A more general alternative would be to use a gamma-normal distribution. This can be done by using the GammaNormalBayesFactory instead. This extra generality comes at a cost, however, as the graph to the left shows. Here, the beta-binomial distribution results in considerably faster convergence to perfect precision than the gamma-normal. This is to be expected since the beta-binomial starts off with the assumption that we are modeling a binary random variable that can only take on values of 0 and 1. The gamma-normal distribution has to learn about this constraint itself. That extra learning costs about 50 pages of ratings. Put another way, the cumulative regret is nearly doubled by the choice of the gamma-normal distribution.

In order to understand what the algorithm is really doing at a high level, the graph on the right is helpful.  What it shows is the number of times comments at different ranks are shown.  What is striking here is that comments below the fourth page get very few trials and even on the second page, the number of impressions falls precipitously relative to the first page of comments.  This is what you would expect because in this experiment, it takes only a few ratings on the worst comments to know that they stand essentially no chance of being one of the best.  It is this pattern of not sampling comments that don't need precise ranking that makes Bayesian Bandits so powerful.

## Friday, October 26, 2012

### References for On-line Algorithms

I have been speaking lately about how various on-line algorithms have substantial potential for various real-time learning applications.  The most notable of these algorithms are Thompson sampling for real-time handling of multi-armed bandit and contextual bandit problems and an algorithm due to Shindler, Wong and Meyerson for fast $k$-means clustering of data.  Since I have had a number of requests for references back to the original sources for these works, I figured a few blog posts would be a good thing.  This post will describe the multi-armed bandit work and the next will describe the clustering work.

#### The Basic Problem - Multi-armed Bandits

For the bandit problems, there are two basic problems to be dealt with.  The first and most basic is that of the multi-armed bandit.  In this problem, you can sample from any of a finite number of distributions and your goal is to maximize the average of the values that you get.  This can be cast into a number of practical settings in which you select which slot machine to put a quarter into, which on-line ad to present to a user, or which landing page a user's browser should see when they visit a particular URL.  It is common to simplify this case further by assuming a stationary distribution.  Obviously, at least one of the distributions you are picking from has a mean at least as large as the mean of any alternative.  Any time you take a sample from a distribution that has a smaller mean, you fall behind the theoretical best, on average, that you could have achieved by picking from (one of) the best distributions.  The degree by which you fall behind is known as the regret that you incur.

The key to the multi-armed bandit problem is that you cannot know which distribution might have the largest mean.  This means that you have to sample all of the distributions in order to estimate their means, but this implies that you have to sample from the lesser distributions in order to determine that they are, in fact, inferior.

There are well known bounds on how well you can actually solve this problem.  There are also a number of algorithms that have regret on par with these bounds or come reasonably close to these bounds.  Mostly, however, these known solutions have limitations either on the number of distributions they can consider or on the complexity of the solution.

Kuleshov and Precup provide some good examples of how to compare different bandit algorithms in their paper.  This tutorial on bandits provides a wider view of different forms of multi-armed bandit problems with a number of references.

Conspicuously missing from most lists of references, however, is all the recent work using Thompson sampling.  These algorithms, which I have referred to as Bayesian Bandits, have particularly nice properties of simplicity and optimality.  Chapelle and Li provide an empirical look at performance with these algorithms compared to upper confidence bound (UCB) algorithms.  The last paragraph of that paper laments the lack of a theoretical analysis of these algorithms, but that lack was cured shortly in this paper by Agrawal and Goyal.  Scott provided a more comprehensive view of these algorithms under the name of randomized probability matching.

The idea behind Bayesian Bandits is quite simple.  For each bandit, we use the observations so far to build a posterior distribution for the mean of the associated payoff distribution.  For binary payoffs, it is common to use a $\beta$-binomial distribution and for other cases a $\gamma$-normal distribution works well.  To pick a bandit, we sample a mean for each bandit from these posterior distributions and then pick the bandit with the largest sampled mean.  The new sample from that bandit gives us more data which refines the posterior distribution for that bandit.  We can repeat this process as long as desired.
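A sketch of that loop for binary payoffs might look like the following. All names are my own invention, and the beta samples are drawn via the order-statistic fact that the $k$-th smallest of $n$ uniform deviates is $\mathrm{Beta}(k, n-k+1)$ distributed, which avoids any statistics library.

```java
import java.util.Arrays;
import java.util.Random;

public class ThompsonSampling {
    final Random gen;
    final int[] wins;    // successes per bandit (beta parameter a - 1)
    final int[] losses;  // failures per bandit (beta parameter b - 1)

    ThompsonSampling(int bandits, Random gen) {
        this.gen = gen;
        wins = new int[bandits];
        losses = new int[bandits];
    }

    // k-th smallest of n uniform deviates is Beta(k, n - k + 1) distributed
    double sampleBeta(int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = gen.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];
    }

    // sample a mean for each bandit, play the one with the largest sample
    int choose() {
        int best = 0;
        double bestScore = -1;
        for (int i = 0; i < wins.length; i++) {
            double score = sampleBeta(wins[i] + 1, losses[i] + 1);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    void update(int bandit, boolean payoff) {
        if (payoff) wins[bandit]++; else losses[bandit]++;
    }

    public static void main(String[] args) {
        double[] p = {0.7, 0.3, 0.2};  // hidden payoff probabilities
        ThompsonSampling ts = new ThompsonSampling(p.length, new Random());
        int[] pulls = new int[p.length];
        Random world = new Random();
        for (int round = 0; round < 2000; round++) {
            int k = ts.choose();
            pulls[k]++;
            ts.update(k, world.nextDouble() < p[k]);
        }
        // the best arm ends up with the overwhelming majority of the pulls
        System.out.println(Arrays.toString(pulls));
    }
}
```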

#### Extensions to Contextual Bandits

One of the most important characteristics of the Thompson sampling approaches (aka randomized probability matching aka Bayesian Bandits) is that they can be extended to more complex situations.  One setting that I have found particularly useful involves optimizing return not just from a few bandits, but from a parameterized set of bandits that could conceivably even be infinite.  The transformation from the parameters to the bandit distribution is unknown, but if we could know that, we would be able to search the parameter space to find the bandit with the highest mean payoff.

This formulation is a generalization of the previous case because we can take the parameter to be an integer from $1 \ldots k$ where there are $k$ bandits and the transformation consists of the mean payoffs for each of the $k$ bandits.

The algorithm in the contextual case simply consists of sampling the transformation from some posterior distribution and then solving for the parameters of the bandit that we would like to use.  Some of the parameters might be fixed by the context we are working in which is where the name contextual bandits comes in.

The paper by Scott alludes to this formulation, but the most approachable work on this that I know of is the paper by Graepel, Candela, Borchert, and Herbrich.  In this paper, they describe the operation of AdPredictor, a system used by the Bing search engine to target ads using context.

## Wednesday, February 8, 2012

### Bayesian Bandits

The basic idea with Bayesian Bandits is to solve the problem of the explore/exploit trade-off in a multi-armed bandit by keeping a distributional estimate of the probability of payoff for each bandit, but avoiding the cost of manipulating full distributions by sampling from those distributions and then proceeding on each iteration as if the sampled value were a point estimate.

The advantage of this is that bandwidth assignments will be made according to the best estimate of the probability that each bandit could possibly be the best.  This automatically causes the system to do exploration as long as it is plausible and causes the system to smoothly transition to exploitation when it becomes clear which bandit is the best.  Essentially what this gives us is a Bayesian implementation of active learning.

To illustrate this, here is a graph that shows the posterior distribution of conversion probability for two bandits where we have lots of history for one and only a little history for the other.

You can see that the red bandit with 100 conversions out of 1000 impressions most likely has a probability of conversion of 0.1, more or less a bit.  The blue bandit with no conversions out of 10 impressions is very likely worse than the red bandit, but there is a small possibility that it is better.  If we just picked the average or mode of these distributions, we would conclude that the blue bandit is worse and wouldn't give it any bandwidth without substantial mechanisms to over-ride this decision.

On the other hand, if we estimate a conversion probability by sampling and use that sampled estimate for targeting, then we will give the blue bandit a little bandwidth and thus a chance to redeem itself.
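That intuition is easy to quantify with a small simulation. Assuming beta posteriors for the counts described above, red's posterior is $\mathrm{Beta}(101, 901)$ and blue's is $\mathrm{Beta}(1, 11)$; the sketch below (names are mine, with beta samples drawn via order statistics of uniforms) estimates how often blue's sampled estimate beats red's.

```java
import java.util.Arrays;
import java.util.Random;

public class BlueVersusRed {
    // k-th smallest of n uniform deviates is Beta(k, n - k + 1) distributed
    static double sampleBeta(Random gen, int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = gen.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];
    }

    // fraction of presentations in which blue out-samples red
    static double blueShare(long seed, int trials) {
        Random gen = new Random(seed);
        int blueWins = 0;
        for (int i = 0; i < trials; i++) {
            double red = sampleBeta(gen, 101, 901);  // 100 of 1000 converted
            double blue = sampleBeta(gen, 1, 11);    // 0 of 10 converted
            if (blue > red) {
                blueWins++;
            }
        }
        return (double) blueWins / trials;
    }

    public static void main(String[] args) {
        System.out.println(blueShare(42, 2000));
    }
}
```

The share comes out at roughly 30%, so the seemingly worse blue bandit still gets a substantial amount of traffic and a real chance to redeem itself.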

There are several aspects of the Bayesian Bandit algorithm that are exciting.

- exploration and exploitation are handled uniformly in the same framework
- our internal representation encodes all of the information that we have so we don't confuse evidence of failure with lack of evidence for success
- the algorithm only requires a few lines of code
- the updates for the algorithm can be expressed in terms of back-propagation or stochastic gradient descent
- the performance is really good
As a bit of a teaser, here are a few graphs that describe how the Bayesian Bandit converges in simulated runs. The first graph shows how the average total regret for the Bayesian Bandit algorithm approaches the ideal as the number of trials increases. This experiment uses normally distributed rewards with $\sigma = 0.1$.  Any algorithm that meets the optimal $O(\log n)$ convergence lower bound is said to "solve" the bandit problem.

The convergence here is very close to the ideal convergence rate.  Note that this graph includes the convergence to optimal payoff, not the convergence knowing which is the better bandit.  This is actually an interesting aspect of the problem since the algorithm will converge almost instantly for cases where the conversion probabilities are highly disparate which will make the payoff converge quickly.  For cases where the conversion probabilities are nearly the same, it will take a long time for the algorithm to determine which is the better bandit, but exploration is not expensive in such a case so the convergence to near-optimal payoff will be even faster than the case where the conversion rates are very different.

For example, here is a graph of the probability of picking the better bandit where the conversion rates are nearly the same.  As you can see, it takes quite a while for the algorithm to split these two options.   The average payoff, however, only changes from 0.11 to 0.12 during this entire convergence and it has already reached 0.118 by the time it is 20% into the process so the cost of a long experiment is not that high.

Sample code for the Bayesian bandit is available at https://github.com/tdunning/storm-counts.

## Wednesday, June 22, 2011

### Buzzwords Keynote - Conclusion

In the end, it is up to us to make things better.  We need a way for non-Apache entities to interact with Apache productively.  If we can't do that, then it is quite possible that all of the momentum and excitement that Hadoop now has will be lost.

Conclusion

The key is that we now have an eco-system, not just a community.  We can make it work.  Or we can elect to let it not work.  Not working is the default state.  We have to take positive action to avoid the default.

Apache can stay a strong voice for business friendly open source by remaining Apache.  Trying to make Apache broad enough to include all of the players in Hadoop and Hadoop derivatives will simply debase the voice of Apache into the average of opposing viewpoints, i.e. into nothing.

There are, however, many other players who are not part of Apache and who should probably not be part of Apache.  There needs to be a way for these others to engage with the Apache viewpoint.  It can't just be on the level of individuals from Apache trying to informally spread the Apache way even though that is critical to have.  It is likely to require a venue in which corporate entities can deal with something comparable to themselves.  A good analogy is how Mozilla participation in W3C has made the web a better place.

But we can make our eco-system work.   It isn’t what it was and it never will be again.

But it can be astonishing.

Let's make it so.

### Buzzwords Keynote - Part 3

In the third part of my talk, I talked a bit about where Hadoop has come from and where it is going.  Importantly, this involves a choice about where Hadoop and the related companies, products and individuals might be able to take things.

Where we are and how we got here

My second section described the rough state of the Hadoop eco-system in a slightly provocative way.  In particular, I described a time when I was on a British train and, in partial compensation for delays, the operators announced that "free beer would be on sale in the galley car".  Free beer for sale is a wonderful analogy for the recent state of Hadoop and related software.

That said, there are serious problems brewing.  The current world of Hadoop is largely based on the assumption that the current community is all that there is.  This is a problem, however, because the current (Apache-based) community presumes interaction by individuals with a relatively common agenda.  More and more, however, the presence of a fundable business opportunity means that this happy world of individuals building software for the greater good has been invaded by non-human, non-individual corporations.  Corporations can't share the same agenda as the individuals involved in Apache and Apache is constitutively unable to allow corporate entities as members.

This means that the current community can no longer be the current world.  What we now have is not just a community with shared values but is now an eco-system with different kinds of entities, multiple agendas, direct competition and conflicting goals.  The Apache community is one piece of this eco-system.