
Lazy lists, impatience and hubris

There was a recent mailing list post by Andrew Coppin (which received some discussion on programming.reddit) in which he bemoans the way that GHC, when presented with reasonable-looking code, can suddenly and mysteriously decide that it needs to allocate gigabytes of memory, slowing your system to a crawl for ages (I've had similar problems myself). Andrew's example was the following:

I offer up the following example:
mean xs = sum xs / length xs
Now try, say, mean [1.. 1e9], and watch GHC eat several GB of RAM. (!!)
This is not so mysterious when you remember that Haskell uses linked lists as its native list structure, so the only way to calculate length xs is to traverse the list a second time: because length xs still needs it, the runtime system "helpfully" keeps hold of the whole of xs that it generated while calculating sum xs, and you have a space leak. But what to do about it?

Andrew suggests the rather eye-watering
mean = (\(s,n) -> s / n) . foldr (\x (s,n) -> let s' = s+x; n' = n+1 
    in s' `seq` n' `seq` (s', n')) (0,0)
Stare at that for a while, and let your eyes de-focus: after a while, it becomes apparent that he's fused the two traversals of the list into one. All that seq business is, er, something to do with strictness, I dunno. Tilton's Law of Programming applies here. Anyway, Don Stewart's written a good article showing how to get to the fast version of the code from the slow version, including some handy tips on reading intermediate compiler output to see what's really going on. It's a bit disingenuous, though - he shows that allocating the list strictly will consume over 8GB, and concludes that "Strictness is not the solution here", whereas from what I can see, this is the only part of the program where strictness doesn't help! It appears that recent versions of GHC do a lot of strictness analysis when given the -O flag, and here that's enough. But there's a lot of good stuff in there, and I shall refer back to it next time my Haskell code performs surprisingly (although I must say, the last time I heard of anyone inspecting intermediate compiler output to get adequate performance was in the late 80s on early Ada compilers...).

Anyway, that's not what I wanted to talk about. As part of a very silly discussion prompted by Andrew's post, someone challenged me to "provide, in the strict language of [my] choice -- any language at all, as long as it is strict -- a function that calculates the mean of a list of doubles that correctly produces the mean of [1 .. 10e9] in constant space".

Now, one could argue that 8GB is a constant amount of space (albeit a large one), and write a version that simply allocated the list directly. But that would miss the point, because the challenge is actually very easy. Sure, if you want to consume a sensible amount of memory and yet retain genericity, you're going to need to generate the list lazily. But what my challenger had forgotten is that one can easily hand-roll lazy data-structures in any language with lexical closures or, come to that, objects. Which is to say, in any reasonably modern language. I chose Perl, because it's the strict language I know best:
    sub mean {
            my $getnext=shift; my ($next, $done, $sum, $count);
            while (!$done) {
                    ($next, $done) = $getnext->();
                    $sum += $next; $count++
            }
            return $sum/$count;
    }
    my $i = 1; print mean( sub { ($i++, $i>1000000000); } );
Now, that code's by no means fast: in fact, it took over half-an-hour to complete on my girlfriend's aging laptop. But it fulfilled the "constant space" part of the spec admirably, consuming about 1.5MB consistently (almost all of that for the Perl interpreter).

What's going on there? Well, instead of accepting a list, my mean function accepts an iterator: a closure that, when called, returns a pair containing the next element of the list and a boolean saying whether we've reached the end yet (the keyword sub, as well as declaring named functions, creates anonymous coderefs which can access variables in their surrounding lexical scope: in effect, it's Perl's lambda). This iterator may not look very listlike at first glance, but it does the same thing: conceptually, therefore, it's a list. Someone, probably Duncan Coutts, once suggested to me that the best way to think about Haskell's lazy lists is as forward-only iterators over collections: implementing a toList function for your new data structure is conceptually just like building an iterator, as the next part of the list is only built when it's asked for (which is equivalent to calling next() on an iterator in Java or C++). I've found this approach very helpful.

This (next_elt, done?) approach isn't the only way of making lazy lists, but it's a perfectly sensible one. One could generalise the way we've created the list (1..1e9):
sub make_stream {
    my $i = shift; my $upper = shift; my $step = shift || 1;
    $i -= $step;   # back up one step, so the first call returns $i itself
    return sub { $i += $step; ($i, $i >= $upper) };
}
or wrap an honest-to-God Perl list:
sub stream { my $i = 0; my @xs = @_; sub { ($xs[$i++], $i > $#xs) } }
mean(stream(1, 2, 3, 7, -32.6, 10**3.7));
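Assuming the make_stream above, the original challenge then reads something like:
print mean(make_stream(1, 1_000_000_000)), "\n";   # roughly 5e8, give or take some floating-point rounding
and it still runs in constant space.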
The general form is
# set up lexicals here
my $stream = sub {
    return (
        # calculation to get next element; probably mutates state
        ,
        # are we done yet?
    );
};
As funktio suggested, you'd probably want to wrap the lexicals in either a do-block or a function like make_stream above.
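For instance (a toy example of my own, not funktio's), here's a stream of the first fifty Fibonacci numbers with its state tucked away inside a do-block:
my $fibs = do {
    my ($x, $y) = (0, 1);            # stream state, invisible outside the do-block
    my $count = 0;
    sub {
        my $next = $x;
        ($x, $y) = ($y, $x + $y);
        ($next, ++$count >= 50);     # next element, plus "done" once we've produced fifty
    };
};
print mean($fibs), "\n";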

What else could we have done? Another option would have been to create an object with next and end methods, and store any necessary state as a member variable of the object. The most standard solution would have been to create a coderef that returned (car, cdr) pairs, where car is the next element of the list, and cdr is another coderef which, when evaluated, returns the next (car, cdr) pair. This, incidentally, is how Haskell's laziness works under the hood: unevaluated values are stored as closures called "thunks", which, when called, return the value you want. If you never need the value, the thunk is never called, and so the calculation's never done. In Perl, this looks like:
sub range {
    my $lower = shift; my $upper = shift; my $step = shift || 1;
    # pick the continuation test according to the sign of $step:
    # counting down, step of zero (never stop), or counting up
    my $test = ($lower >= $upper, 1, $lower <= $upper)[($step <=> 0) + 1];
    # the cdr is a thunk: a closure that builds the rest of the list only when called
    my $cdr = $test ? sub { range($lower + $step, $upper, $step) } : undef;
    return ($lower, $cdr);
}

sub range_to_list {
    my ($car, $cdr) = @_;
    my @list;
    while (defined($cdr)) {
        push @list, $car;
        ($car, $cdr) = $cdr->();
    }
    return @list;
}
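For completeness, the object-based option mentioned above might look something like this - just a sketch, with a made-up CountStream class rather than anything off CPAN:
package CountStream;
# A toy iterator object: next() hands out the next element and advances,
# done() is true once the last element has been handed out.
sub new {
    my ($class, $lower, $upper) = @_;
    return bless { i => $lower, upper => $upper }, $class;
}
sub next { my $self = shift; return $self->{i}++ }
sub done { my $self = shift; return $self->{i} > $self->{upper} }

package main;
# Adapting it to the (next_elt, done?) convention lets us reuse mean() unchanged:
my $it = CountStream->new(1, 1_000_000_000);
print mean(sub { ($it->next, $it->done) }), "\n";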
In practice, you should use one of the lazy list modules on CPAN. I've never used any of them, but Tie::LazyList, which lets you use lazy lists as if they were ordinary Perl lists, looks good.

We're almost done, but there's something else I'd like to look at. What would all this look like in Ruby? Well, one could easily do what we've just done: Ruby's anonymous blocks fulfil the same role as Perl's coderefs. But the more natural approach, I think, is the reverse of ours: you'd define a Range class which has an each method, and then define mean by passing a callback to each:
def mean(lst)
    sum = 0
    count = 0
    lst.each do |x|
        sum += x
        count += 1
    end
    sum.to_f / count   # to_f guards against integer division if the list is all Integers
end
Then each could provide an iterator in any way appropriate, which probably wouldn't keep the list in memory. Even if we did two traversals of the list, we wouldn't run out of RAM. As a bonus, our mean function now works on any object which implements each, so it's approximately as general as the Haskell version. By the way, does anyone know how Ruby's ranges work internally?

So, the take-home messages:
  • While laziness is cool, it's not actually that magical, and you don't have to be using Haskell to take advantage of it: you can implement your own lazy data-structures in a few lines of any modern language. What Haskell brings to the table is pervasive laziness, so all the standard functions will work as-is with lazy structures, and dividing code into producer/consumer pairs is easier.
  • Perl has full lexical closures, and all that follows from them.
    [If you haven't had a look at Mark Jason Dominus' book Higher-Order Perl yet, then now would be a good time. It's a free read online, and also available in dead tree format. And be sure to read the lovely story associated with the cover.]
  • A good way to think of lazy lists in Haskell is as forward-only iterators.
  • Callbacks Are Your Friends, and widely-supported APIs (like Ruby's Enumerable API) are also your friends.
  • If you're doing anything numerical with Haskell, you probably want to be giving the -O or -O2 flag to GHC!
Tags: beware the geek, computers, haskell, perl, programming, ruby