Saturday, June 30, 2012

Possible Improvement on the Last Post's Algorithm

This post describes a possible improvement to the efficient implementation of the canonical Genetic Algorithm described in the last post. I'm not sure how well it would work in practice, but for a mostly converged population it could be a huge help. It can also help avoid recalculating fitness for two individuals that happen to be the same. This will involve hash functions, and the downside is that some crossovers, selections, and fitness evaluations may not be performed correctly when distinct sections of individuals hash to the same value.

Recall that in the last post we made use of a complete 2-tree representation of our individuals. What we will add here is one single piece of data to each internal node of the tree.

Assume that we can hash the contents of an individual. This is not much of an assumption given the usual fare for a GA- natural numbers, integers, bits, doubles, floats, etc. We calculate a hash of each location and, moving up the tree, combine the hashes of each node's two children, fingertree style. Each genetic operator from the last post is now slightly modified by this change.
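
To make this concrete, here is a minimal sketch in C of what the augmented nodes might look like. The struct layout, the 64-bit hashes, and the combine function are my own illustrative choices, not anything from a particular library:

    #include <stddef.h>   /* NULL */
    #include <stdint.h>

    /* A tree node augmented with a hash of the subtree below it.
       Leaves hash their gene value; internal nodes combine child hashes. */
    struct node {
        struct node *left, *right;  /* NULL at leaves */
        uint64_t hash;              /* hash of the subtree rooted here */
        int gene;                   /* only meaningful at leaves */
    };

    /* Combine two child hashes into a parent hash, fingertree style.
       Any decent mixing function will do; this one is just illustrative. */
    uint64_t combine(uint64_t l, uint64_t r)
    {
        uint64_t h = l * 0x9e3779b97f4a7c15ULL;
        h ^= r + (h << 6) + (h >> 2);
        return h;
    }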

Mutation is the same, except for the unfortunate fact that we have to propagate changes to a location up the tree to the top. This makes each mutation take logarithmic time.
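
A sketch of mutation with the hash propagation, assuming the node struct and combine function above (hash_gene is a stand-in for whatever leaf hash we use):

    uint64_t hash_gene(int gene);   /* assumed: hash of a single gene value */

    /* Mutate the leaf at index i under t, then recompute hashes on the way
       back up: one root-to-leaf path, hence logarithmic time. `width` is
       the number of leaves under t. */
    void mutate_at(struct node *t, int i, int width, int new_gene)
    {
        if (t->left == NULL) {          /* leaf: apply the mutation */
            t->gene = new_gene;
            t->hash = hash_gene(new_gene);
            return;
        }
        int half = (width + 1) / 2;     /* leaves under the left child */
        if (i < half)
            mutate_at(t->left, i, half, new_gene);
        else
            mutate_at(t->right, i - half, width - half, new_gene);
        t->hash = combine(t->left->hash, t->right->hash);  /* propagate */
    }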

Crossover is where we start to see some advantages. When traversing down two individuals' trees, if we reach a node where the hashes agree, we stop doing crossover. We assume that if the hashes agree then the subtrees are the same and don't require any copying. Note that there can be false positives here if two distinct subtrees just happen to hash to the same value.
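
Here is a sketch of how the check might fold into the pointer-swapping walk from the last post. The recursion, the left-leaning split, and the recombining of hashes on the way back out (which the swap makes necessary) are my own framing:

    /* One-point crossover with the short-circuit: the same pointer-swapping
       walk as in the last post, but stopping wherever the two subtrees
       already hash the same. Assumes the left child covers (width + 1) / 2
       leaves and that 0 < c < width. */
    void crossover_hashed(struct node *a, struct node *b, int c, int width)
    {
        if (a->hash == b->hash || a->left == NULL)
            return;                  /* (probably) identical: nothing to do */
        int half = (width + 1) / 2;
        if (c < half) {
            crossover_hashed(a->left, b->left, c, half);
        } else {
            if (a->left->hash != b->left->hash) {  /* swap only if different */
                struct node *tmp = a->left;
                a->left = b->left;
                b->left = tmp;
            }
            if (c > half)
                crossover_hashed(a->right, b->right, c - half, width - half);
        }
        a->hash = combine(a->left->hash, a->right->hash);  /* repropagate */
        b->hash = combine(b->left->hash, b->right->hash);
    }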

Selection is done similarly- if we are copying over a losing individual with a winning one as described in the last post, and we come to subtrees sharing the same hash, we again assume they are the same, and can stop the copying. Again, false positives can occur.
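
Copying a winner over a loser with the same short-circuit might look like this sketch (same assumed node struct as above):

    /* Copy src's subtree over dst's, skipping wherever the hashes agree. */
    void copy_over(struct node *dst, const struct node *src)
    {
        if (dst->hash == src->hash)
            return;                     /* assumed identical: skip the copy */
        if (dst->left == NULL) {        /* leaf */
            dst->gene = src->gene;
            dst->hash = src->hash;
            return;
        }
        copy_over(dst->left, src->left);
        copy_over(dst->right, src->right);
        dst->hash = src->hash;          /* dst's subtree now equals src's */
    }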

We could try to use this to avoid recalculating fitness for individuals by placing their hashes in a hash table with their fitness, although this is not a good idea in general. This is the one case where I would be very wary of collisions, especially as we may want to save only one result per location in the table to avoid having it grow too much (which is what happens in my experience if you try to hash every fitness evaluation). Again, not really recommended unless we really want to avoid doing evaluations and we can safely report the wrong fitness or gene expression sometimes (as in a certain game).
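
If we did want to try it, a sketch of the one-result-per-slot table I have in mind (the size and the overwrite-on-collision policy are arbitrary choices of mine):

    #define FIT_TABLE_SIZE 4096

    struct fit_entry { uint64_t hash; double fitness; int valid; };
    static struct fit_entry fit_table[FIT_TABLE_SIZE];

    double evaluate(struct node *root);   /* the real fitness function */

    double cached_fitness(struct node *root)
    {
        struct fit_entry *e = &fit_table[root->hash % FIT_TABLE_SIZE];
        if (e->valid && e->hash == root->hash)
            return e->fitness;            /* may be wrong on a collision! */
        e->hash = root->hash;             /* overwrite whatever was here */
        e->fitness = evaluate(root);
        e->valid = 1;
        return e->fitness;
    }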

Unless I actually implement these algorithms, which I may, I think this is it for this idea. The 2-tree representation is something I've been thinking about for a long time, and I think it is really a good idea, despite the extra complexity and logarithmic access to individual locations (just use maps and folds like a regular person and it will be fine).

Efficient Genetic Algorithm Idea

This post describes a pretty efficient way to do Genetic Algorithms in an imperative language. It's pretty much like the pure way I talked about in an earlier post (using Haskell's Sequence and Vector types), except it should be pretty straightforward to implement with destructive updates. It's a little harder to say exactly what the complexity is in practice, as it depends on how close the population is to convergence. However, even in the worst case it should be a good bit faster than a naive implementation. Note that this assumes the usual genetic operators and a population of linear individuals.

The representation of the population is as a collection (an array, for example) of complete 2-trees (trees where every internal node has 2 children, every level except possibly the last is full, and all nodes on the last level are as far left as possible). The nodes of the trees simply contain pointers to their children, and only the leaves contain values (the elements of the individual). The shape never changes, and it can be allocated as an array of pairs of pointers to tree nodes. The genetic material itself can be in a separate array or actually held in the leaves, as long as it is easy to get access to the leaves as an array. We will see why we need each piece of this setup soon.
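
Here is a sketch in C of what I have in mind. All the names are mine, and the details (like keeping the genes in a separate flat array) are just one way to do it:

    #include <stdlib.h>

    struct tnode {
        struct tnode *left, *right;   /* NULL at leaves */
        int *gene;                    /* at a leaf, points into the gene array */
    };

    struct individual {
        struct tnode *root;
        struct tnode *nodes;   /* all 2*length - 1 nodes in one block */
        int *genes;            /* the genetic material as a flat array */
        int length;            /* number of leaves */
    };

    /* Build a tree over genes[0..n); `pool` hands out preallocated nodes.
       The ceil/floor split keeps the tree balanced and left-leaning. */
    struct tnode *build(struct tnode **pool, int *genes, int n)
    {
        struct tnode *t = (*pool)++;
        if (n == 1) {
            t->left = t->right = NULL;
            t->gene = genes;
            return t;
        }
        int half = (n + 1) / 2;
        t->left  = build(pool, genes, half);
        t->right = build(pool, genes + half, n - half);
        t->gene  = NULL;
        return t;
    }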

Mutation is easy, as long as the locations can be accessed as a single array (or 2d array, or anything that can be accessed in constant time). As I've described before, rather than testing each location, simply calculate the number of failures you will have before the next success, skip over that many elements moving left to right through the population, and mutate that location. This calculation requires sampling from the geometric distribution describing failed mutation attempts before the next successful one. The only tiny trick here is to use an efficient sampling algorithm (the Vose Alias Algorithm would be my choice) and to lump the probabilities of all indices beyond the end of the population into one. For example, if there are 10 indices in total, then 0-9 are given by the geometric distribution, and the 10th has the sum of all probabilities past 9 to infinity. If you generate something past the range of locations, you are done with mutation, so this turns the infinite geometric distribution into a finite one.
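
A sketch of the skip-ahead loop. Instead of the alias-table approach I suggest above, this uses the standard inversion formula for sampling a geometric distribution, which is simpler to show; the scanning part is the same either way:

    #include <math.h>
    #include <stdlib.h>

    /* Number of failures before the next success at probability p. */
    int geometric_skip(double p)
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);  /* uniform in (0,1) */
        return (int)(log(u) / log(1.0 - p));
    }

    /* Mutate the whole population viewed as one flat array of `total` genes. */
    void mutate_population(int *genes, int total, double p)
    {
        for (int i = geometric_skip(p); i < total; i += 1 + geometric_skip(p))
            genes[i] ^= 1;   /* flip a bit; any point mutation works here */
    }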

Crossover is where we make use of the tree structure. This works essentially for the same reason the Sequence type in Haskell works for this, although its implementation is simpler. This is slightly tricky to explain, but it isn't really very complex. First off, note that we can usually treat pairs of individuals that are adjacent in the population as our random pairs- selection usually picks individuals out at random, so they would already be shuffled. If not, we have to do the shuffling ourselves, which only takes O(n) time (shuffle a list of 0..n-1 and use locations that are adjacent in this list as pairs, as sketched below).
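
The shuffle is just Fisher-Yates, something like:

    /* Fill idx with 0..n-1 and Fisher-Yates shuffle it; afterwards, cross
       population[idx[0]] with population[idx[1]], idx[2] with idx[3], etc. */
    void shuffle_pairs(int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            idx[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
    }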

Now we want to actually do crossover- we have our pair of individuals and a crossover point chosen. I'm only going to describe 1 point crossover, but 2 or n point would be essentially the same. What we want to do is to swap the child pointers in the two trees representing our individuals such that the ranges of indices from 0 to c (the cross point) and from c to m (the end of the individuals) are swapped. This proceeds as follows- traverse the two trees from their roots by checking whether the crosspoint c is less than or greater than the midpoint of the range of indices in leaf nodes under the current node- if less than, go to the left child, if greater than, go to the right child. This information can be precalculated or stored at the internal nodes. When moving down a level, if you move left, simply follow the left pointer/reference. If you move right, then before recursing on the right child, swap the pointers to the left children at the current node between the two trees. In other words, we swap all pointers on the left as we move down. If you get to a leaf node, swap the pointers for that node in the internal node that points to it.

This process is again hard to describe, but it is truly easy, especially if you look at something like http://www.haskell.org/haskellwiki/Zipper. The cool thing here is that crossover becomes just a handful of pointer updates, and requires only logarithmic time in the size of the individuals.
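
For what it's worth, here is a sketch of the walk in C using the node representation from above. It assumes the left child of a node covering `width` leaves covers (width + 1) / 2 of them, matching a left-leaning complete tree; as mentioned, this count could instead be stored at the nodes:

    /* Swap leaves [0, c) between the same-shaped trees rooted at a and b
       by swapping child pointers along one root-to-crosspoint path, so
       the whole operation is O(log m). Assumes 0 < c < width. */
    void cross(struct tnode *a, struct tnode *b, int c, int width)
    {
        while (a->left != NULL) {
            int half = (width + 1) / 2;   /* leaves under the left child */
            if (c < half) {               /* crosspoint inside the left side */
                a = a->left; b = b->left;
                width = half;
            } else {                      /* left subtrees wholly in [0, c) */
                struct tnode *tmp = a->left;
                a->left = b->left;        /* swap them between the trees */
                b->left = tmp;
                if (c == half)
                    return;               /* exact split: done */
                a = a->right; b = b->right;
                c -= half; width -= half;
            }
        }
    }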

Finally we get to selection. Selection isn't really any different than usual here- we can do tournament selection and roulette wheel selection efficiently as I've discussed in previous posts. There is the problem of indexing into the individual- while in mutation we could treat the population as one large array regardless of which individual a location belongs to, when inspecting a particular individual we need to traverse its tree to get the actual location of an index, as the locations have been mixed up through crossover. Accessing a single location would take logarithmic time, but accessing the locations one after the next takes only linear time in total, so we may want to copy the individual off somewhere for evaluation.
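
Linearizing an individual is just an in-order walk over its leaves, something like:

    /* Append an individual's genes to *out in index order; one visit per
       leaf, so the whole walk is linear. */
    void linearize(const struct tnode *t, int **out)
    {
        if (t->left == NULL) {
            *(*out)++ = *t->gene;
            return;
        }
        linearize(t->left, out);
        linearize(t->right, out);
    }

    /* usage: int buf[m], *p = buf; linearize(ind->root, &p); */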

The only other optimization I've considered is to do the replacement of losers by winners in place, leaving winners where they are in the population. We might do this as follows- assume we have the individuals in an array, and we know for each individual how many times it is to be represented in the next generation. We move two pointers through the population, one keeping track of the current individual with more than one representation, and the other pointing to the next individual with 0 representations in the next generation. Simply copy the winning individual into the losing one and decrement the representation count. This is repeated until the winning individual has a count of one, and we move to the next repeat winner until we get to the end of the population. This doesn't save a huge amount of copying, except perhaps in a mostly converged population using roulette wheel selection, where many individuals will have some representation in the next generation.
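
A sketch of the two-pointer replacement, where counts[i] is the number of times individual i appears in the next generation and copy_individual is assumed to do the actual copying:

    void copy_individual(struct individual *dst,
                         const struct individual *src);   /* assumed */

    /* Copy repeat winners onto losers (count 0) in place until every
       count is 1. Both pointers only move forward through the array. */
    void replace_in_place(struct individual *pop, int *counts, int n)
    {
        int loser = 0;
        for (int winner = 0; winner < n; winner++) {
            while (counts[winner] > 1) {
                while (loser < n && counts[loser] != 0)
                    loser++;                 /* next unrepresented slot */
                if (loser >= n)
                    return;
                copy_individual(&pop[loser], &pop[winner]);
                counts[loser] = 1;
                counts[winner]--;
            }
        }
    }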

Thursday, June 28, 2012

Thoughts on C Callbacks

I've come across several examples recently at work of places where a C function pointer can be registered as a callback along with a single (often optional) argument. The argument is given when the callback is registered, and the function is called with the same argument each time, usually given as an "int". These are just some thoughts about this technique.

First off, the whole "only one argument" thing is not a big deal. These are essentially higher order functions, and as we all know, one can curry and uncurry functions. Of course, in C this is done by hand, but it's still possible and done all the time. It is common to define a struct containing the values for a callback, and to pass a pointer to such a struct to the callback to provide multiple arguments. As structs are product types, this is essentially uncurrying a callback of multiple arguments.
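
A sketch of the idiom (register_cb is a made-up registration function standing in for whatever API is in play):

    #include <stdio.h>

    /* The "arguments" bundled into a struct- a product type, so this is
       uncurrying by hand. */
    struct blink_args {
        int led;
        int period_ms;
    };

    void blink_cb(void *arg)    /* the single-argument callback */
    {
        struct blink_args *a = arg;
        printf("blinking LED %d every %d ms\n", a->led, a->period_ms);
    }

    /* registration might look like:
       static struct blink_args args = { 3, 500 };
       register_cb(blink_cb, &args);  */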

The whole "called with the same argument every time" is not really a big deal either- if you want to have a value that changes, just pass it a double-indirect pointer and change the pointer value. This must be done carefully (like everything in C) as in some situations your callback can be used in a context that can interrupt any context that changes the value of its argument, and synchronization primitives may not work the same in the two contexts (ISRs in VxWorks for example).

This is not the only situation I've seen, however. In VxWorks, tasks can have up to 10 arguments when they are spawned. I believe the reason the arguments can be optional is that if the ABI specifies that the caller cleans up the stack, the callee not using some extra values is not a problem. I imagine they would have to be placed with the leftmost argument on the top of the stack for this to work, but I haven't peeked around in memory to make sure this is true yet.

Monday, June 18, 2012

MicroSwarmigism

Joel and I have been talking about making a game together that has to do with controlling a swarm or swarms of tiny organisms. This is going to involve flocking behavior, Genetic Algorithms for traits, and possibly several other techniques (I may work in RGEP if it has a place), as well as lots of art, animation, and creative awesomeness. Joel has already done some great videos to show off ideas for the game and to give people a feel for it, as well as develop our understanding of it. You can see these videos on his blog, http://sculptorbyday.blogspot.com/.

The current plan is to use Unity, which is completely new to me, so I need to get a copy of Windows and start programming. I will probably use C#, but Javascript and Boo are not out of the question- I just don't know enough about any of them to have very strong opinions (opinions are very welcome- please!). I will have to learn quite a lot to get anything working, which I'm excited about. Games are an expressive, visual, and interactive kind of programming that I have only a little experience with, and hopefully a nice balance to the embedded systems programming I do for a living. I've always loved visualizations of complex things like Genetic Algorithms and Particle Swarms, and the beautiful images (and games?) they give rise to.

MicroSwarmigism! As Fry from Futurama once said- girls like swarms of things, right?

Linear Lambda Calculus and PTIME-Completeness

I've just finished reading (the parts I could understand of) a paper with the same name as the title of this post, and I'm very impressed. It turns out that the linear lambda calculus has all the computing power of boolean logic circuits, and that the type of a linear lambda term uniquely determines the term. Since linear terms are all typeable, this gives an isomorphism between types and terms.

I definitely recommend the paper, especially the part where they do a simple embedding of boolean circuits in the linear lambda calculus and find that typing the resulting term yields the type of the embedding of true or false, which is also the normal form of the resulting lambda term. Good stuff all around!

One last cool thing- as I understood it, they also said that the type system of the simply typed lambda calculus is the linear lambda calculus. This seems to be common with systems of logic, where the upper layers are some restricted form of the lower.

Accuracy Based Learning Classifier Systems

I have always found learning classifier systems to be extremely interesting. An LCS is an agent that is able to classify its environment (say, through sensor input, or given problem instances) and decide on an action based on a set of rules. The crazy thing about its decision making is that it arises from a collection of competing rules evolved by a Genetic Algorithm- a collective but singular mind. Even better than an LCS is an Accuracy Based Learning Classifier System (XCS), which attempts not just to find the actions that give the highest reward from its environment, but to find rules that accurately predict the actual reward. The nice thing here is that while in both systems you end up with a collection of rules that can be used as a classifier, with an XCS you end up with a collection of rules that map out the reward space. Either way, the result is the whole population (there are variations on this, but I'm not going to go into that), so we get a swarm of weak classifiers that collectively can solve problems. The individuals compete with each other, but they also share rewards.

I won't go into huge detail on describing these systems- there is no single part that is terribly complicated, but altogether they are more complex than some of the other systems. If nothing else, they contain a GA component, and so they are more complex than a GA by itself. I will give a brief overview of XCS (for me, the more interesting of the two), as much for myself as for the reader. The description follows the very good explanation in the paper "Tournament Selection in XCS".

An XCS has some input that it must learn to classify using its internal rules (individuals), each of which contains a classifier (some pattern matched against the input to determine whether the individual has something to say about it). Each individual must also have an action to perform if its classifier does match an input, as well as a predicted reward for the action along with an error in that reward prediction. Each also keeps a fitness value, which goes up and down depending on its ability to accurately predict high-reward actions.

A simple case of input is a fixed length bit vector, so that will be the example. The input is an element of {0,1}^L, bit vectors of length L, which are matched against classifiers in {0, 1, #}^L. If you are not familiar with this notation, the # is commonly used as a "don't care" symbol, matching anything it is compared to. So if we imagine that the system gets an input 10101, and the individual has a classifier like 10101, 10###, 1####, ##101, or something like that, then the individual matches the input. We collect all individuals matching the input into the match set [M]. There will be some set of actions advocated by individuals in [M], and we choose the action with the greatest fitness-weighted average of predicted payoff among the individuals advocating it. That gets us the action set [A], the set of individuals that advocate (have as their genetic material) the chosen action. Having now chosen an action, the action is taken and the reward is reaped from the environment (the action may be a classification of the input, and the reward an indication of how close the classification was to the correct one).
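
Matching is simple enough to show concretely; a sketch in C over character strings:

    #include <stdbool.h>

    /* '#' is the don't-care symbol; everything else must match exactly. */
    bool matches(const char *classifier, const char *input, int len)
    {
        for (int i = 0; i < len; i++)
            if (classifier[i] != '#' && classifier[i] != input[i])
                return false;
        return true;
    }

    /* matches("10###", "10101", 5) and matches("##101", "10101", 5) hold;
       matches("11###", "10101", 5) does not. */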

Having received a reward, the classifiers that advocated the chosen action must share it. In addition to merely being rewarded, they are modified by how accurate their predictions of the reward were, and their error is modified if the reward was outside their prediction +- their error and beyond some predetermined threshold for error tolerance. I will not describe these updates- just look at the paper. A GA is then invoked for individuals that were in the action set and have not been part of the GA for a certain amount of time. This is one of the many parameters to this algorithm, and the time since an individual last took part in a GA is one of several extra values that are kept track of in individuals (also an experience and numerosity count- again, just read the paper). The individuals are selected from this subset of the action set by some selection mechanism (tournament being advocated in the paper), and mutated and crossed as usual. It seems like uniform crossover is favored here.

Interestingly, after the GA a "subsumption" operator is applied to delete individuals that are completely subsumed by others in the action set, are experienced enough to be deleted (have had their parameters adjusted a certain number of times), and are less accurate than the subsuming individual.

So there you go! XCS! Very cool algorithm, somewhat complex, lots to keep track of, but conceptually simple enough and a very awesome idea. I have never programmed one, but given that I'm making lots of good progress on my evolutionary algorithm framework, I may give it a try.

Wednesday, June 6, 2012

Condensed Table Method for Sampling Discrete Probability Distributions

My brother once told me that when faced with a problem in computer science, the answer is to use a hash table. It turns out that a mere lookup table, along with some cleverness, can be just as good. In this case, a lookup table gives a pretty nice constant time algorithm for sampling discrete probability distributions. In this post I will describe the Condensed Table method and its implementation in the Haskell library "statistics". I was just about to implement the Vose Alias algorithm, but seeing this already implemented method with constant time sampling may convince me otherwise (depending on its space cost for the distributions I'm interested in, which I think will be easily justifiable). The original paper (called "Fast Generation of Discrete Random Variables" and found at http://ideas.repec.org/a/jss/jstsof/11i03.html) describes this algorithm nicely, and I recommend it more than my explanation.

This algorithm requires the creation of a series of tables, one for each digit of the probabilities we are interested in. This may seem like a lot of tables if we want to resolve down to many digits of each probability, but we are free to choose any base system. The implementation in the statistics library uses base 256 and 32 bit numbers, so it only needs 4 tables (each digit is one byte of the 32 bit number). The implementation first multiplies the probabilities by 2^32 and takes their integer equivalents, but we can imagine that we are inspecting the probabilities expressed in the desired base system one digit at a time.

Start with some set of items paired with weights normalized so that they are between 0 and 1 and sum to 1 (a discrete probability distribution with finite support). Notice that if we multiply the probability of each item by, say, 10000 in base 10 (truncating after the decimal place), then we are left with numbers between 0 and 10000. Now allocate places in some huge array such that an item "i" with truncated weight n gets put in n places. Sampling this array with a uniform distribution will get us an item i with probability equal to the original truncated after a certain number of digits (4 in this case). In applications where speed is more important than exactness, this cost in accuracy is certainly worthwhile, especially as we can set it up so we get as much resolution as we could reasonably expect without resorting to much more complex and correct algorithms. The main problem, then, is that we need (in this case) O(10000*m) space, where m is the number of items in the original distribution. The more digits we use and the more items in the distribution, the more space we need, and the growth in space is fairly quick.

The trick in this algorithm can be seen by reorganizing the items in the large array. For an item that started with probability 0.252352 we multiply by 10000 and truncate, getting 2523. Instead of putting all 2523 copies of that item together, put 2000 of them. Do this for each item, so we are left with an array ending with groups of less than 1000 (all the groups larger than 1000 are grouped together at the front of the array). We repeat this process so that the 523 of our example item is grouped into 500 of that item and 23 at the end of the array. Doing this once more splits out the 20 together, and the last 3 are in the last group of groups (which are all less than 10).

Having grouped these items by order of magnitude, we can reduce the 2000 to 2, the 500 to 5, the 20 to 2 and the 3 to 3. We have to keep these groups separate so we know which groups have been reduced by what factor. We can then sample the distribution much as before, but we are careful to search the group corresponding to where in the original large array we would have fallen, even though we are not storing this array. Obviously we don't have to store the original array at all, but rather a series of arrays, with the number of entries for an item in each array corresponding to the digit found in that place of the item's probability.
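
To make this concrete, here is a sketch in C of the base 10, 4 digit version walked through above. A real implementation (like the one in the statistics library) would use base 256, handle the probability mass lost to truncation, and check for allocation failure:

    #include <stdlib.h>

    #define DIGITS 4
    static const int place[DIGITS] = { 1000, 100, 10, 1 };

    struct ctables {
        int *table[DIGITS];   /* item ids, one entry per unit of that digit */
        int  len[DIGITS];
        int  total;           /* sum of all truncated weights */
    };

    /* weight[i] is item i's probability times 10000, truncated.
       An item with weight 2523 contributes 2 entries to the thousands
       table, 5 to the hundreds, 2 to the tens, and 3 to the ones. */
    void build_tables(struct ctables *t, const int *weight, int nitems)
    {
        t->total = 0;
        for (int d = 0; d < DIGITS; d++) {
            t->len[d] = 0;
            for (int i = 0; i < nitems; i++)
                t->len[d] += (weight[i] / place[d]) % 10;
            t->table[d] = malloc(t->len[d] * sizeof(int));
            int k = 0;
            for (int i = 0; i < nitems; i++)
                for (int c = (weight[i] / place[d]) % 10; c > 0; c--)
                    t->table[d][k++] = i;
            t->total += t->len[d] * place[d];
        }
    }

    int sample(const struct ctables *t)
    {
        int r = rand() % t->total;   /* position in the imaginary big array */
        for (int d = 0; d < DIGITS; d++) {
            int span = t->len[d] * place[d];
            if (r < span)
                return t->table[d][r / place[d]];
            r -= span;               /* fall through to the next table */
        }
        return -1;                   /* unreachable if weights sum to total */
    }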

These arrays will still be somewhat large, but by my calculations, they will be on the order of kilobytes or maybe a couple of megabytes for the applications I have in mind. I have no problem paying in space for an advantage in time, especially since I expect to sample a single distribution many tens of thousands of times.

Again, I recommend the original paper- it walks through an example in much more depth than I bothered to.