Monday, April 18, 2011

Learning Genetic Algorithms, Gene Expression Programming

So, this series of posts is pretty poorly named. I should have titled it Learning Evolutionary Algorithms, as I'm about to get into Gene Expression Programming (GEP), which is a couple of steps removed from Genetic Algorithms in terms of subfields. In another sense GEP is closer to a GA then Genetic Programming as it uses a linear encoding (symbol lists again).

The original GEP is very complex, and I will not go over the whole thing. Instead I just want to introduce some ideas and operators, and show how they fit together. I think that Prefix Gene Expression Programming is a much nicer algorithm then the original, so that is the one I will describe in the most detail.

Remember expression trees from last post? Instead of evolving trees directly, in GEP (and Genotype-Phenotype Mapping systems in general) we evolve symbol lists which contain the terminal and the operators. This can be exactly like a GA, but often has extra genetic operators beyond Point Mutation and One Point Crossover. It is a distinguishing factor of GEP that it has many different operators, while some of its cousins have only Point Mutation. When we want to measure the fitness of an individual we have to turn it into a tree and determine the fitness of the tree.

I want to make a quick point about fitness before talking about this translation between the list of symbols (the genotype) and the tree (the phenotype). Fitness can be determined in any way that is appropriate for the problem, but a good way of measuring it for many problems is this: we have a list input/output pairs, and the fitness of a function (assuming the individual encodes a function or can be described by one) called f is the sum of the error the function makes on the output given the input it is paired with. For example if an individual's tree encodes the function f(x) = x*2 + 1 then if we had some data [(1, 2), (4, 2), (10, 3)] (a list of pairs, the first the input, the second the expected output of an ideal function) we would get |f(1)-2| +
|f(4)-2| + |f(10)-3| which is the sum of the differences between each input and output. I'm using |expr| as the absolute value of the expr between the pipes "|". If we get 0 as the fitness then the individual's encoded function, f, made no errors and was a perfect solution. While we may not have such a simple interpretation of the individual as a function from x to y, so we just have (x, y) pairs, it will very often be some pairing of inputs of some form to outputs of some form and a way of determining how well an individual's outputs match the ideal outputs.

So, decoding a symbol list into an expression tree. We can choose any notation we want- infix, prefix, postfix, or anything else. This can be very complex in general, we could be evolving expressions in a programming language for example, but I am mostly interested in this more restricted case where we are interested in expressions combining terminals with operators (its very algebraic, which I might post about sometime). For this purpose the choices of in/pre/postfix notation are the well known ones. The other option is Karva notation, which is not very nice (IMHO).

If we want our individuals to be symbol lists in some notation, we would like that they always can be turned into a tree. This is not a problem for an expression like +1-23 which is 1+(2-3) in infix notation, but what about +1-2*? This expression does not encode a tree as the operator * does not have two arguments (it has 0 arguments). We can do many things in this case, but the two options I want to look at are: we can forbid such messy individuals, or we can allow them and, when we want them to flower into trees (we get very attached to our individuals sometimes and want them to succeed even if they are not valid expressions), we can make changes to the genetic material in order to make a tree without changing the actual individual. Imagine that we made a copy of the individual, edited it to make it valid, and then turned it into a tree to determine its fitness.

The first option is taken by Prefix Gene Expression Programming, as well as the original GEP. In the original algorithm the symbol lists had two parts- operators could only appear in the beginning of the individual and so they would always have enough terminals after them to have a valid tree. In PGEP genetic operators that caused invalid individuals, like a point mutation from +1*32 to +1*/2, are undone when they happen. This means that the operators must know what a valid individual looks like to make sure they don't cause any problems.

The other possibility is what is common in other Genotype-Phenotype Mapping systems, which are any system where the linear list of symbols becomes a more complex object before evaluating its fitness (a tree, graph, whatever). It is called editing, and the original such systems had some complex editing as they wanted to evolve programs in programming language like C.

In the next post I will go over the symbol list idea more, the editing stage that I used in my thesis, and the expression of the linear encoding into an expression tree.

No comments:

Post a Comment