Monday, July 25, 2011

Encoding in Robust Gene Expression Programming

I realized that I've never completely explained in this blog the decoding mechanism in my version of Gene Expression Programming. All my posts about it mention that I'm leaving out a bit of detail. So, here it goes.

We are given a set of terminal symbols and a set of operator symbols, which are problem specific and must be supplied by the user, as well as the number of genes per individual (the length). Individuals can then be generated as random bit vectors of size length*genesize, where genesize is the minimum number of bits needed to index the larger of the two symbol sets, plus one. That is, genesize = 1+ceil(log2(max(|terminals|, |operators|))). The plus one is because we need one extra bit per gene to determine whether it encodes a terminal or an operator.
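
As a rough illustration of the size calculation above, here is a minimal Python sketch. The names gene_size and random_individual are made up for this example, and so is the choice to represent an individual as a plain list of 0/1 integers; none of this is taken from the actual RGEP code.

```python
import random

def gene_size(terminals, operators):
    """Bits per gene: one type bit plus enough bits to index the larger set."""
    largest = max(len(terminals), len(operators))
    # (largest - 1).bit_length() equals ceil(log2(largest)) for largest >= 1,
    # and avoids floating point rounding.
    return 1 + (largest - 1).bit_length()

def random_individual(length, size, rng=random):
    """A random bit vector holding `length` genes of `size` bits each."""
    return [rng.randint(0, 1) for _ in range(length * size)]

terminals = ["x", "y"]
operators = ["+", "-", "*", "/"]
size = gene_size(terminals, operators)       # 1 type bit + 2 index bits
individual = random_individual(10, size)     # 10 genes
```

With two terminals and four operators, two bits are enough to index the larger set, so each gene takes three bits and a ten gene individual is a vector of thirty bits.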

Imagine we have a random individual. We first decode it into the "raw symbol list" by splitting the bit vector into equal length chunks (the length is the result of the previous calculation) and decoding each chunk. Decoding a chunk works as follows: determine the type of the symbol by inspecting the first bit (0 for operator and 1 for terminal), and then turn the rest of the bits into a natural number in the usual binary encoding. This number is then used as the index into the set (or really, list) of symbols given by its type. If the number is too large to be a valid index, we wrap around to the beginning, treating the list as circular. In other words we take syms[n%len(syms)], where syms is the symbol list, n is the decoded value of the bits in the gene, and len gets the length of the provided list.
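
The chunk decoding just described might look something like the following Python sketch. The function names decode_gene and raw_symbols are invented here, and I am assuming the most significant bit comes first and that the individual is a list of 0/1 integers.

```python
def decode_gene(bits, terminals, operators):
    """Decode one gene: first bit picks the list, the rest is an index."""
    # 0 selects the operator list, 1 selects the terminal list.
    syms = terminals if bits[0] == 1 else operators
    # Read the remaining bits as a binary number (most significant bit
    # first, which is an assumption of this sketch).
    n = 0
    for b in bits[1:]:
        n = (n << 1) | b
    # Wrap around so that every bit pattern maps to some symbol.
    return syms[n % len(syms)]

def raw_symbols(individual, size, terminals, operators):
    """Split the bit vector into genes of `size` bits and decode each one."""
    return [decode_gene(individual[i:i + size], terminals, operators)
            for i in range(0, len(individual), size)]

terminals = ["x", "y", "z"]
operators = ["+", "*"]
bits = [1, 1, 1,   0, 0, 1,   0, 1, 0,   1, 0, 1]
symbols = raw_symbols(bits, 3, terminals, operators)
# symbols is ["x", "*", "+", "y"]
```

The first gene (1 1 1) is a terminal with index 3, which wraps around to "x". The third gene (0 1 0) is an operator with index 2, which wraps around to "+".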

Having done this we have a list of symbols from the terminal and operator sets, whose length is the given size of the individual. We must then turn this into an expression tree. While it is not necessary to actually build the tree, in principle we always can if we want. To do this we perform a reverse Polish evaluation of the symbol list as an expression.

This evaluation starts with an empty stack and evaluates each symbol in turn, updating the stack along the way. If a terminal is encountered, it is pushed onto the stack. If an operator is encountered, the necessary number of arguments is popped off the stack (the arity of each operator is fixed and known ahead of time) and the result of applying the operator to the arguments is pushed. If we are building a tree, the "result" is just a small tree whose root is the operator and whose children are the trees popped off the stack (which may be terminals or larger trees). If the stack holds fewer items than the operator requires, the operator is simply skipped. This corrects for improperly defined expressions, since it leaves no way for a partial tree to be constructed.
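
Here is a small Python sketch of the stack evaluation described above, in its tree building form. The name build_tree, the tuple representation of trees, and the arity dictionary are all choices made for this example. I am also assuming that arguments keep the order they had on the stack, and that an empty stack gives None, neither of which is spelled out above.

```python
def build_tree(symbols, arity):
    """Reverse Polish evaluation that builds a tree instead of a value.

    `arity` maps each operator symbol to its argument count; any symbol
    not in `arity` is treated as a terminal. Trees are tuples of the form
    (operator, child, child, ...), and terminals are bare symbols.
    """
    stack = []
    for sym in symbols:
        if sym not in arity:
            stack.append(sym)                  # terminal: push it
        elif len(stack) >= arity[sym]:
            cut = len(stack) - arity[sym]
            args = stack[cut:]                 # arguments in stack order
            del stack[cut:]                    # pop them
            stack.append((sym, *args))         # push the new subtree
        # Otherwise there are too few arguments: skip the operator.
    # The top of the stack is the phenotype. Returning None for an empty
    # stack is an assumption of this sketch.
    return stack[-1] if stack else None

arity = {"+": 2, "*": 2}
tree = build_tree(["*", "x", "y", "+", "z", "*"], arity)
# tree is ("*", ("+", "x", "y"), "z")
```

The leading "*" is skipped because the stack is empty when it is reached, and the rest of the list builds (x + y) * z.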

After each symbol has been evaluated, the top of the stack is taken as the result, and this object (possibly an expression tree) is the phenotype of the individual. This complex tree structure is far removed from the bit vector it started out as, but the process is really pretty painless. The resulting tree can be compiled, evaluated, or inspected in some way to produce a fitness value for the individual.

Notice that it is possible for the stack to hold many values at the end of evaluation, and all but the top one are thrown out. These garbage expressions may be more complex and interesting than the one chosen as the result, but it is not easy to know this in advance. It would be costly to evaluate each one, and in common cases it is easy to see that the top of the stack will actually be very interesting.

Well, there it is, the full decoding process for RGEP. It might seem complex, but it is much nicer than some GEP mechanisms (IMHO) and has lots of advantages on several levels. I am not going to go into all the advantages of this encoding in this post, but suffice it to say it is well motivated.
