Wednesday, April 13, 2011

Learning Genetic Algorithms, Genetic Programming

The symbol lists of Genetic Algorithms are lots of fun. We can encode all sorts of things if we are clever enough. It is generally considered a good idea to use the most natural structure to encode a problem, if possible. If we are optimizing the parameters to a real-valued function then we should try to use a real-valued encoding (a vector of doubles, for example). But what if the problem has some more complex structure?

There are some very interesting problems that are most easily described by trees structures. Just so we are on the same page here I will describe the concept of a tree (a binary tree aka 2-ary tree for simplicity). A tree is either empty, or it is a node with 2 subtrees, called its children.
Here is an awesome ascii-art tree with the nodes as * and the \ and / connecting each node to its children.


*
/ \
* *
/ \
* *
/
*

In general these trees may have anything labeling their nodes, not just * symbols. If the internal nodes (the ones with children) are functions like +, -, *, or / and the leaves (the nodes with no children) we may have something interesting. A function like f(x) = x*2 + 1 would look like:

+
/ \
* 1
/ \
x 2
This can be written in the infix notation as x*2 + 1, in prefix notation as +*x21, and in postfix notation as 12x*+, all of which completely and unambiguously define the tree above (assuming we know the number of arguments each operator symbol (+ and *) requires).

If we want to evolve functions like this, or many other things (programs can be described this way using formal grammars and abstract syntax trees), we may want to evolve them directly. This means creating random trees, making sure each function has enough arguments and each leaf has a terminal (a constant or variable or something with no arguments to fill).

A technique called Genetic Programming does exactly this. This gives it a great deal of expressiveness- trees are complex, interesting, and general things. An example of how cool this is- imagine a robot that knows some things about its current situation (location, facing, direction of something good or bad, speed, etc) and needs some controlling program. We can use an expression tree with many variables, not just "x" as in our example, and functions that act on the variables. It may be tricky to get this function to really do anything interesting, but it is not all that hard. The cool thing would be that we can evolve functions for controlling robots like this.

Another cool possibility is to evolve a function to fit some set of data. We can then use the function to describe the data- it tells us something about the data we might not know and it can be used to extrapolate and interpolate. The fitness for a tree is the sum of the error it makes on each data point when used as a function.

Just like a GA, we generate random structures to start the algorithm out. Then we must do something like mutation and crossover. Unfortunately trees are more complex than vectors, and the biological metaphor stops making much sense. This is one problem with the more complex trees structures, and we will see in a later post how the algorithms I am most interested in deal with this by using a symbol vector to encode a tree. Mutation here means choosing a random node and replacing it with a randomly generated tree, which deletes the node and any of its children. Crossover of two trees means choosing random nodes (one in each tree) and replacing them with each other (swapping the subtrees underneath them as well).

This is all well and good (actually its pretty great- GP is really good at a lot of things), but there are some problems beyond the complexity of the operators (there are normally more restrictions that are added then what I described). One interesting problem is called bloat- the trees will tend to grow uncontrollably with no associated increase in their fitness. The interesting thing is that not only is this possible, but if the tree is able to do this it will be in its best interest to do so, and so it will start to grow uncontrollably.

An advantage of the more complex structure is we can enforce all sorts of extra constraints to encode more interesting problems. We can add operators with different types or use recursions or iteration or some memory space or lots of things.

All I wanted to do in this post is get a little bit into GP and show how trees can be evolved. Mostly this is just buildup to Gene Expression Programming, but it is important to understand trees and how they might be useful.

No comments:

Post a Comment