30 March 2015

Understanding Folds

Very often, one has to deal with some kind of recursive data-structure, for example to represent entries in a filesystem, or more generally, nodes in a tree:

data FS = Folder String [FS] | File String String
data Tree = Leaf | t (Tree t) (Tree t)

Sometimes this tree-like structure is slightly hidden:

data Prop a = Var a
            | Not (Prop a)
            | And (Prop a) (Prop a)
            | Or (Prop a) (Prop a)

…when modelling logical propositions or parsing any other kind of language.

But even the humble list:

data List a = [] | a:([a])

The first crucial thing in understanding how to effectively work with such data is to realize, that all these nice definitions follow a basic inductive scheme:

data D = BaseCase1 | ... | BaseCaseN
       | ProductionRule1
       | ...
       | ProductionRuleN

Together, these two kinds of information provide all that is neccessary to construct a minimal set, containing everything we consider a D. We start with all valid base-cases and then repeatedly apply production rules until either we get bored and stop (for infinite types like [Int]) or can’t apply any of the rules anymore. So note that production rules take some existing instances of D’s as input.

Now remember that the nice descriptive types, Prop, Tree, Folder and so on, are simply labels we provide for people and compilers to help their (and frankly our) understanding of what we intend to do. But in the end, all that matters are primitives that can be used in computations on the hardware. We never actually compute a data-structure, we only ever need them to hold a couple of things we can work with. - You don’t add lists, you add multiple numbers, etc…

So whenever we want to compute something over the actual data in our structure, we need to reduce it down to the simpler types we can work with. Some examples of this:

sum :: [Int] -> Int
search :: Tree k v -> k -> v
eval :: Prop a -> Bool

By some cosmic coincedence, all these operations follow a basic pattern. They recurse through the structure, extract the information in there, combine it somehow and provide a result of type Result. For that they need to know how to handle each base-case and how to handle an instance that was generated by a production rule. For a type with only one base-case, and a single production rule, we shall call this knowledge

handleBaseCase :: BaseCase -> Result
handleRule :: D -> ... -> D -> Result

But wait, actually at this point, we are not interested in the semantic fluff of our data-structure anymore. Instead we want to work with the values we gave it to store. Let’s say those values are of type Value. So a more correct definition of our handlers would become:

handleBaseCase :: Value -> Result
handleRule :: D -> ... -> D -> Result

Still we have those unwantend D’s lying around in our rule handler. To get rid of them, we would need to reduce instances of D’s down to simpler types, namely Result’s. - Which is what were trying to do all along! So whenever we have an unwanted D, we theoretically have a reciept to get us a Result instead. Finally, we can give a satisfying type for our handlers:

handleBaseCase :: Value -> Result -- nothing new here
handleRule :: Result -> ... -> Result -> Result

Putting it all together

Let’s quickly give our work up until now a name, so we can reuse it later:

reduceD :: (Value -> Result) -- base-case handler
           -> (Result -> ... -> Result -> Result) -- rule handler
           -> D -- the D we are simplifying
           -> Result

reduceD handleBaseCase _ (BaseCase ...) = handleBaseCase ...
reduceD _ handleRule (ProductionRule ...) = handleRule (reduceD ...)

As it turns out, people before us figured this out already. They also gave it a cooler name: fold. Reluctantly, we shall stick to this one.

So from now on, whenever we are asked to compute something over a recursive data-structure, we think of base-case- and rule- handlers and then use a fold. Why not try it out on some examples.

Summing a list:

What we knew abstractly as D is now List Int. Which means our handlers need to deal with Int, and since it’s a sum we want to compute, Result should be Int as well.

List type has a single base-case, the empty list []. Since there is no simpler information in an empty list, our handler won’t even have to take any arguments. But a good value for sum [] is probably 0. Which gives us handleEmptyList = 0. List only has a single production-rule as well, prepending. If we encounter a List instance like ([1, 2, 3] = 1:[2, 3] or x:xs in Haskell) we already know what to do: x is an Int already and xs we can simply fold again to get an Int. Therefore:

handlePrepend x currentResult = x + currentResult
handlePrepend = + -- written a little bit cleaner

Putting the parts together yields:

sum :: List Int -> Int
sum xs = foldList handleEmptyList handlePrepend xs
-- substituting our definitions:
sum xs = foldList 0 (+) xs

Counting the number of nodes in a filesystem:

As a final example, let’s consider the FS type from the beginning. We wan’t a count returned, so probably Result = Int.

count :: FS -> Int
count fs = foldFS (\_ _ -> 1) (\_ children -> 1 + sum children) fs

Don’t get confused by a blind (‘_’) here and there, our handlers simply don’t need to look the actual contents in our filesystem (names and contents). Our base-case is now a humble File, which counts as a single node. Folders need to count themselves, as well as their children.

Closing thoughts

By now I hope you were able to grasp the fold as a simple recursive procedure over complex data-structures, which allows you to focus on the parts of the problem that are truly unique to your data. If so, you won’t get confused when dealing with the (Haskell-)reality, left- and right-folds, and so on. And I will be proud.