Monday, July 25, 2011

Encoding in Robust Gene Expression Programming

I realized that I've never actually completely explained in this blog the decoding mechanism in my version on Gene Expression Programming- all my posts about it mention that I'm leaving out a bit of detail. So- here is goes.

We are given the set of terminal symbols and the set of operator symbols, which are problem specific and must be supplied by the user, as well as the number of genes per individual (the length). The individuals can then be generated as random bit vectors of size length*genesize where genesize is the minimum number of bits we can use to encode the symbols in the terminal and operator sets (plus one). We can find this by 1+log2(max(|terminals|, |operators|)). The plus one is because we need one extra bit per gene to determine if it will encode a terminal or operator.

Imagine we have a random individual. We decode first into the "raw symbol list" by splitting the bit vector into equal length chunks (the length is the result of the previous calculation) and decoding each chunk. Decoding a chunk is as follows: determine the type of the symbol by inspecting the first bit (0 for operator and 1 for terminal), and then turn the rest of the bits into a natural number in the usual binary encoding. This number is then used as the index in the set (or really, list) of symbols given by its type. In the case that the number is larger than the size of the list, we wrap around to the end, making it a circular list. In other words we take syms[n%len{syms)] if syms is the symbol list, n is the decoded value of the bits in the gene, and len gets the length of the provided list.

Having done this we have a list of symbols from the terminal and operator set, where the length is the given size of the individual. We must then turn this into an expression tree. While it is not necessary to actually build the tree, in principal we always can if we want. To do this we must do a reverse polish evaluation of the symbol list as an expression.

This evaluation starts with an empty stack, and evaluates each symbol in turn, updating the stack along the way. If a terminal is encountered then it is pushed onto the stack. If an operator is encountered, then the necessary number of arguments are popped off the stack (the arity of each operator is fixed and known ahead of time) and the result of the operator applied to the arguments is pushed as a result. If we are building a tree then the "result" is just a small tree with the root the operator and leaves the trees popped off the stack (which may be terminals or trees). In the case that the size of the stack is less than the number of required arguments the operator is simply skipped. This corrects for improperly defined trees, as there is no way for a partial tree to be constructed.

After each symbol has been evaluated the top of the stack is taken as the result, and this object (possible an expression tree) is the phenotype of the individual. This complex tree structure is much removed from the bit vector it started out as, but the process is really pretty painless. The resulting tree can be compiled or evaluated or inspected in some way to produce a fitness value for the individual.

Notice that it is possible for the stack to have many values at the end of evaluation which will be thrown out. These garbage expression may be more complex and interesting then the one chosen as the result, but it is not easy to know this in advance. It would be costly to evaluate each, and in common cases is is easy to see that the top of the stack will actually be very interesting.

Well, there it is, the full decoding process for RGEP. It might seem complex, but it is much nicer than some GEP mechanisms (IMHO) and has lots of advantages on several levels. I am not going to go into all the advantages of this encoding in this post, but suffice to say it is well motivated.

Wednesday, July 6, 2011

Lunch Assignment Problem

Last week someone I work with was assigned the task of setting up lunch groups for the interns in my program. This is something we do every week to meet new people, and it is intended that people eat with new and different people every week if possible. It should be clear that there may not be a solution after several weeks, but we can still find a solution with as small as possible a number of "collisions" (people eating together that have been assigned the same lunch group together in a previous week). Perhaps you can already see where I'm going with this?

This problem involves partitioning a set of people into n disjoint subsets with the minimal number of collisions. If each person is taken to be a node in an simple undirected graph then there is an edge between two people iff (if and only if) they have had lunch together before, and the edges can be weighted such that the weight is the number of times they have had lunch together. Alternatively each pair of people have a weight, which is simply 0 if they have never been in the same group. This provides a measure of the correctness of a solution, and therefore can be used as a fitness in an evolutionary algorithm!
I wrote a simple Random Mutation Hill Climber (RMHC) which we have been using the last two weeks to come up with our lunch assignments, and so far it has found perfect solutions. A RMHC simply mutates a randomly generated solution, and replaces the current solution with the mutated solution if the new one is better or equal to the previous one according to the fitness measure. The initial solution is generated by shuffling the list of people randomly and cutting the list into sublists of a fixed length. After that mutations occur by choosing two people and swapping the groups they are in.

I also tried a Genetic Algorithm for this problem, but it takes longer to run and seems to perform worse. Interestingly, next week it looks like the problem has gotten much harder in the sense that I doubt there is going to be a perfect solution. For this new instance of the problem the GA actually seems to work better! This is probably because the solution space has gotten much more complex and the RMHC, being merely a hill climber, has trouble exploring it even when run many times (so that many possible hills are climbed).

So! Evolutionary Algorithms tackling a very simple, real-life-style problem! Despite being a simple problem (simple to describe) the set of possible solutions is huge. It is infeasible to enumerate the solutions, and it doesn't seem to have an efficient deterministic solution, so it is the sort of problem that AI is often used to tackle. If anyone can classify this problem exactly I would be interested to hear about it- it doesn't remind me of anything off hand.

Sunday, July 3, 2011

Compiled My First Program!

More progress! I successfully compiled and ran a program in my little language for the Arduino! The program that I got running is this:

: wait 100 delay ;
: blink 13 on wait 13 off wait ;
: setup 13 output ;
: loop blink ;

This program simply blinks the LED attached to pin 13 every 100ms. Nothing interesting about the program, but the fact that even such a simple program compiles is nice to see.

The next step is to add a couple of control structures, and then on to a type checking system!

Parsing with Parsec

I've made some progress with my little concatenative language. I'm using Parsec, a Haskell parsing library, to parse an input string into the three types of top-level definitions that language currently has- words, variables, and constants. The next step I think will be actually compiling to C, and then trying to write a type checker.