Saturday, March 3, 2012

Possible Design for a Machine Learning/Evolutionary Algorithm Library

My newest idea for a Machine Learning/Evolutionary Algorithm library in Haskell is to separate the problem into two levels.

The first is a level of plumbing to connect genetic operators (for example) that allows the user to store state, to supply parameters to the algorithm, to write to a log, and, since these algorithms are stochastic, to generate random numbers. This will be implemented as what I'm going to call the Stochastic Reader Writer State Monad (SRWS), which is just the RWS monad with another piece of State keeping a seed for the random number generator. This provides the functionality one would expect for one of these algorithms, but I don't plan on enforcing any conditions on exactly what context, state, or log one actually keeps. These will have to be filled in for each algorithm, although I plan on writing some default ones for basic Genetic Algorithms and maybe Gene Expression Programming.
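To make this concrete, here is a minimal sketch of what I have in mind, using the mtl and random packages (the names SRWS and getRandom are placeholders of mine, not from any existing library):

    import Control.Monad.RWS
    import System.Random (StdGen, Random, random)

    -- The "Stochastic RWS" monad: RWS with a random generator threaded
    -- through the state, next to the user's own state s.
    type SRWS r w s a = RWS r w (s, StdGen) a

    -- Draw a random value, updating the generator held in the state.
    getRandom :: (Monoid w, Random a) => SRWS r w s a
    getRandom = do
      (s, g) <- get
      let (x, g') = random g
      put (s, g')
      return x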

The burden is on the user to make sure that parameters that don't change are stored in some sort of context, that they have a data structure to hold their state, and that they insert writes to the log where they want. This is the weakness of this approach, and one I tried to avoid in my last attempt at writing a library like this. It seems like each algorithm will have to duplicate a lot of data and have its own, possibly large, data structures for things like generation count or probability of mutation, even if it is only a small extension of Genetic Algorithms. I think the way around this is to provide a separate specialization of the generic plumbing for each algorithm type, along with some type classes like GenerationCount, with a function of type s -> Int (where s is the state), which says the state has some generation count without specifying how to extract it. This is basically like a getter in OO, and I'm not sure that I'm not just regressing to my imperative thinking with that design. Perhaps there is something nicer out there? Lenses, maybe?
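As a sketch of the getter idea (again, these names are hypothetical):

    -- A state type declares that it carries a generation count,
    -- without committing to any particular representation.
    class GenerationCount s where
      generationCount :: s -> Int

    -- An example state for a plain GA.
    data GAState = GAState { generation :: Int, mutationRate :: Double }

    instance GenerationCount GAState where
      generationCount = generation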

The second layer of this library would be some sort of mechanism for constructing these algorithms. I don't think that this layer has to somehow capture the nature of the algorithm, or even be the most direct way of specifying it. Right now, my focus is just to give a single, coherent way to build bigger algorithms out of little pieces. If the ideas are simple and composable then I will be satisfied, even if I expect they will not be the most efficient. My current plan is to use this idea of composing machines to build the necessary operators. I'm not entirely sure what form the machines will take, but I've been playing around with several possibilities, and I hope to post about some of them soon.

To give a simple example of how the concept of little machines could be applied here, I'm going to describe how I envision building the usual GA. Assuming a bit vector representation for individuals, the bits would be (conceptually) little machines that, when provided a number between 0 and 1, would output a 0 or a 1, possibly changing some internal state. This internal state will probably be accessible, but in some of the schemes I've looked at it is not visible (see http://en.wikibooks.org/wiki/Haskell/StephensArrowTutorial). Selection will be a function that, given a population with fitnesses for the individuals, returns a machine that, given a number between 0 and 1, will return an individual. Repeating this process to get pairs of individuals, we would construct a machine that, given a number between 0 and the length of the individuals, would return a pair of crossed individuals. Notice that these machines are really just functions, but they store information about what actions they take along with their data. This is, again, somewhat like OOP. The advantage here is that it provides a single concept with which to express the operators, and it localizes the information about how they work, so that it may be easier to automate.
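As a rough sketch of the kind of machine I mean (a Mealy-style machine; all the names here are mine):

    -- A machine consumes an input and yields an output along with its
    -- successor: the same machine with updated internal state.
    newtype Machine i o = Machine { step :: i -> (o, Machine i o) }

    -- A bit as a little machine: given a draw from [0,1), flip the
    -- stored bit when the draw falls below the mutation rate.
    mutatingBit :: Double -> Bool -> Machine Double Bool
    mutatingBit rate b = Machine $ \p ->
      let b' = if p < rate then not b else b
      in (b', mutatingBit rate b')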

In addition to developing some of these ideas further, I hope to make clear why I think this idea is useful at all (which I imagine isn't clear at first, even with (maybe because of?) this example). Just a simple look ahead- the result may end up being just the SRWS monad again, like the outer layer, but perhaps with a little extra structure or another wrapping.

Friday, March 2, 2012

Finding 1s in a Machine Word

I stumbled the other day on a pretty neat paper called "Using de Bruijn Sequences to Index a 1 in a Computer Word" about a very cool algorithm for finding the index of the least significant 1 in a bit string. I wanted to share it here because it feels elegant to me. Note that finding the index of the first 1 can be repeated to find the indices of all the 1s, and hence the number of 1s.

First, some background- there is a surprisingly difficult problem called Population Count, which is the problem of finding the number of 1s in a bit string (usually a single machine word). This is not difficult in the sense of finding an algorithm for computing it (shifting and masking for each bit would work) but rather in finding a solution better than linear in the size of the machine word. This is sometimes provided in hardware, but not all instruction sets have it, and it may not be exposed to the programmer even if it exists (outside of, say, inline assembly in C). This may seem like a strange thing to want, but it (and the problem of finding the index of 1s) comes up in a variety of situations. One example would be if you were keeping a bit-map of blocks in memory, and you wanted to know how many bits are set in a single word (this came up at work the other day). In this case you may also want to know the first bit set, say to find the first bad block when managing flash storage. Another is in the implementation of an efficient immutable trie, which is a very nice and useful data structure. In this case, the first bit set to 1 would indicate the first symbol that a branch has a child for.
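For reference, the naive linear-time approach looks like this in Haskell (just a sketch with Data.Bits):

    import Data.Bits ((.&.), shiftR)
    import Data.Word (Word32)

    -- Population count by shifting and masking once per bit position.
    naivePopCount :: Word32 -> Int
    naivePopCount x = sum [ fromIntegral ((x `shiftR` i) .&. 1) | i <- [0 .. 31] ]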

The algorithm takes three steps- first isolate the least significant 1, then hash it to a unique key, then finally index into a lookup table for the answer. This lookup table turns out to be very small- one entry per bit in the word. For a 32 bit machine, this only has to take 32 bytes, which is basically nothing and can be (and is assumed to be) pre-calculated very easily. Let's take it one step at a time.

First, isolating the least significant 1. The algorithm assumes there is at least one 1, which is easy to check beforehand. What we want to do is set all 1s to 0s except the least significant one. This is very easy, and is given as "x & -x" for a machine word x, with "&" the bit-wise and operator and "-" the two's complement negation. Notice that the two's complement negation inverts all bits above the least significant 1, leaving that bit (and the zeros below it) unchanged, so anding it with the original value makes that bit the only bit still set. The example they give is 01101000: invert to get 10010111, add one to get the two's complement 10011000, and and them together, 01101000 & 10011000, to get 00001000. Nice.
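In Haskell this step is a one-liner, since negate on a fixed-width word gives the two's complement:

    import Data.Bits ((.&.))
    import Data.Word (Word8)

    -- Keep only the least significant set bit: x & -x.
    isolateLSB :: Word8 -> Word8
    isolateLSB x = x .&. negate x

    -- isolateLSB 0x68 == 0x08, i.e. 01101000 becomes 00001000.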

The next step is to produce a hash of the resulting number. The hash must be perfect- there can be no collisions among values that are powers of 2- so that each possible position of the first 1 produces a unique key to index the lookup table with. This is where de Bruijn sequences come in. These sequences are just bit strings that, for some number n (a power of 2), contain exactly one instance of every log2(n)-bit string, if you allow yourself to wrap around. For example, for n=8, the sequence 00011101 contains all 3-bit (log2(8)=3) sequences, although to get 100, for example, you need to start at the rightmost bit and wrap around. They chose a sequence that starts with 000, for reasons we will get to soon. Now, for this fixed sequence, the hash function is given as: hash(x) = (x*s) >> (n-log2(n)), where s is the chosen de Bruijn sequence.
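Instantiated for the 8-bit example, with s = 00011101 (0x1D) and a shift of n - log2(n) = 8 - 3 = 5, the hash looks like this (the multiplication wraps modulo 2^8 because Word8 is 8 bits wide):

    import Data.Bits (shiftR)
    import Data.Word (Word8)

    -- The de Bruijn sequence 00011101 as an 8-bit word.
    debruijn8 :: Word8
    debruijn8 = 0x1D

    -- hash(x) = (x*s) >> (n - log2(n)), here (x * 0x1D) >> 5.
    hash8 :: Word8 -> Int
    hash8 x = fromIntegral ((x * debruijn8) `shiftR` 5)

Applied to the powers of two 1, 2, 4, ..., 128, this gives 0, 1, 3, 7, 6, 5, 2, 4: every key in [0,7] exactly once.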

Now the cool part of the paper- this function must necessarily produce a unique key! To see this, we need to make a couple of observations. First off, since the input number x is a power of 2, multiplying by x is the same as a left shift. We do this multiplication modulo 2^n, so we ignore any bits after the nth. This means that we have lopped off some of the higher order bits of our de Bruijn sequence. Next, notice that the right shift operator will shift out all but the bits necessary to index into our lookup table. This means that we have lopped off some of the lower order bits too, leaving us with a subsection of the original sequence. At first I didn't see how we would get the bit sequences where we would have wrapped around, as the shifts that we are doing are not wrap-around shifts, but that is why we specified that the sequence should start with 0s! If it starts with zeros, and the multiplication by x adds zeros to the right, then we get those sequences "for free" in the sense that they are still possible even without wrapping.

The next step is pretty obvious- we have a unique index, so just look up the answer in the table, and we are done.
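Putting the three steps together for the 8-bit toy instance (the table inverts the hash from above, mapping each key back to its bit index; it would be precomputed once):

    import Data.Bits ((.&.), shiftR)
    import Data.Word (Word8)

    -- table !! hash8 (2^i) == i, for i in [0,7].
    table :: [Int]
    table = [0, 1, 6, 2, 7, 5, 4, 3]

    -- Index of the least significant 1 in x (assumes x /= 0).
    indexOfLSB :: Word8 -> Int
    indexOfLSB x = table !! fromIntegral ((isolated * 0x1D) `shiftR` 5)
      where isolated = x .&. negate x

For the running example, indexOfLSB 0x68 == 3, the index of the 1 we isolated earlier.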

This algorithm may not be the best in all situations, and they do some performance tests on it and discuss extensions and limitations in the paper, but it is pretty neat, I think. I like how we can derive a hash function with no collisions for this input, and how it cleverly ensures all indices are possible.

Incidentally, I found this paper while looking for a reference on de Bruijn indices, which are also very nice, but are completely different.