Wednesday, August 11, 2010

Sadder things

I spent this last weekend holding a hand that grew cold in the early hours of Sunday morning.

That hand helped me through much of my life.  No longer.  At least not in the flesh.

Nobody who reads this blog is likely to have known my father, and given how little he talked about things he had done, few who knew him would know much of the many things he did.  He lived a long life and a full one.  Along the way he saw things few will ever see.

In his prime, he was simply extraordinary.  He could see and he could hear better than anyone I have ever known.  That could be torture, as it was the time when a cat walking in the next room woke him from a deep sleep, but it was what let him fly the way he did.  And fly he did, in planes large and small.  He checked out Gary Powers in the U-2, flew P-47s and P-38s in combat, and flew with me in small aircraft.  We fished and walked and camped across the western US and we lived in many places.

He didn't show off his mental abilities, but there too he could do things few others could match.  He passed a graduate reading exam in French without ever studying any Romance language.  I saw him on several occasions understand spoken German, also without having studied the language.  He spoke of the shape and rate of physical processes in ways that only somebody with innate ability in math could possibly do.

These faculties declined in age as they must with all of us, but even thus dimmed his candle burned bright.

But it did not last.  I saw it sputter and fade.  Then, between one instant and the next, it was gone.

Wednesday, August 4, 2010

Word Count using Plume

Plume is working for toy programs!

You can get the source code at http://github.com/tdunning/Plume

Here is a quick description of how to code up the perennial map-reduce demo program for counting words.  The idea is that we have lines of text that we have to tokenize, and then we count the words.  This example can be found in the class WordCountTest in Plume.

So we start with PCollection&lt;String&gt; lines for input.  For each line, we split the line into words and emit them as separate records:

  PCollection<String> words = lines
    .map(new DoFn<String, String>() {
      @Override
      public void process(String x, EmitFn<String> emitter) {
        for (String word : onNonWordChar.split(x)) {
          emitter.emit(word);
        }
      }
    }, collectionOf(strings()));


Then we emit each word as a key for a PTable with a count of 1.  This is just the same as most word-count implementations, except that we have separated the tokenization from the emission of the initial counts.  We could have put them together into a single map operation, but the optimizer will do that for us (when it exists), so keeping the functions modular is probably better.


  PTable<String, Integer> wc = words
    .map(new DoFn<String, Pair<String, Integer>>() {
      @Override
      public void process(String x, 
                         EmitFn<Pair<String, Integer>> emitter) {
         emitter.emit(Pair.create(x, 1));
      }
    }, tableOf(strings(), integers()))


Then we group by word:


  .groupByKey()


And do the counting.  Note that we don't have to worry about the details of using a combiner versus a reducer.


  .combine(new CombinerFn<Integer>() {
     @Override
     public Integer combine(Iterable<Integer> counts) {
       int sum = 0;
       for (Integer k : counts) {
         sum += k;
       }
       return sum;
     }
   });

In all, it takes 27 lines to implement a slightly more general word-count than the one in the Hadoop tutorial.  If we were to compare apples to apples, this code would probably be a few lines shorter.  The original word-count demo was 210 lines to do the same thing.
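For reference, the same tokenize / emit-pairs / group / sum logic can be sketched in plain Java without Plume, with a HashMap standing in for groupByKey plus the combiner.  This is just an illustration of what the pipeline computes, not Plume code; the PlainWordCount class name is made up here, and onNonWordChar is assumed to be a Pattern that splits on runs of non-word characters.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PlainWordCount {
  // Assumed definition of the tokenizer used in the Plume example:
  // split on runs of non-word characters.
  private static final Pattern onNonWordChar = Pattern.compile("\\W+");

  // Equivalent of map + groupByKey + combine: tokenize each line,
  // then accumulate a per-word count in a map.
  public static Map<String, Integer> wordCount(List<String> lines) {
    Map<String, Integer> wc = new HashMap<String, Integer>();
    for (String line : lines) {
      for (String word : onNonWordChar.split(line)) {
        if (word.isEmpty()) {
          continue;  // split can yield an empty leading token
        }
        Integer old = wc.get(word);
        wc.put(word, old == null ? 1 : old + 1);
      }
    }
    return wc;
  }

  public static void main(String[] args) {
    Map<String, Integer> wc = wordCount(List.of("the cat sat", "the cat"));
    System.out.println(wc.get("the") + " " + wc.get("cat") + " " + wc.get("sat"));
  }
}
```

The point of Plume, of course, is that the same three logical steps can be distributed and optimized, while the HashMap version is confined to one machine.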