I would like to do a research project named MBPGEP: Messy Binary Prefix Gene Expression Programming and its Applications in Combinatorial Optimization Problems.

Not only does it sound like real research, it could possibly be a very effective method.

Messy GAs have been suggested for combinatorial optimization problems, so I would like to extend them to work naturally for GEPs. Binary GEPs have been suggested as mitigating the scalability problem inherent in having many operators and terminals, and I think we will certainly have a lot of terminals for most of these types of problems if we apply GEP to them. Prefix GEP allows subexpressions to stay together better than Karva notation. Together these may form a really nice method for solving all sorts of problems, though I will probably try SAT or 3-SAT first.

I would also like to incorporate Adaptable GEP to turn off genes that do nothing, perhaps try crossover bias (though that's not high on my list), and naive GEP (because it is related to current genetics research on the importance of neutral mutation), as well as maybe one or two of my own modifications to standard GEP.

Lots of things to do!

## Thursday, February 25, 2010

### Well Formed Expressions

As I said in the last post, it is possible to remove the restriction in GEP of keeping operators in the head. The reason we would do that is to allow a greater variety of trees to form that would be impossible with other notations. In particular, I hope that this may increase the ability of PGEP to find good solutions, as that is a form of GEP that I like a lot and I think it has a lot of potential.

One way to remove this restriction is simply to allow all symbols to appear anywhere, and ignore operators on the end that cause the expression to be invalid. This is a little messy, but it would certainly work.

Another way would be to turn any operator that causes the individual to be invalid into a random terminal before evaluating. Again, messy, but workable.

The last way that I can think of is geared towards binary GEP (BGEP). Again, identify the operators that cause the expression to be invalid, but this time read them as the terminal with the same value (assuming there is one), ignoring the bit that says the symbol was supposed to be an operator. Since BGEP distinguishes between terminals and operators by one bit at the front of each locus, it would be the same genetic material just being interpreted differently. I like this one the best, but it still has some problems: it depends on the terminals and operators being matched nicely in their binary interpretations, and on having a way of resolving the case where no terminal matches the operator. Binary GEP is a little messy like that, and I plan on trying to find reasonable solutions to its problems, which will hopefully make it easy to remove the head and just have a fully homogeneous gene.
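
To make the first two options concrete, here is a quick Python sketch. Everything here (the `ARITY` table, the terminal set, the function names) is my own made-up illustration, not taken from any GEP paper, and it assumes binary operators only:

```python
import random

# Hypothetical symbol tables; in a real run these would come from the problem.
ARITY = {'+': 2, '*': 2}        # operators and their arities
TERMINALS = ['a', 'b', 'c']

def valid_prefix_len(gene):
    """Return how many symbols a prefix read of the gene consumes,
    or None if the read runs out of arguments (invalid expression)."""
    needed = 1                            # symbols still required
    for i, sym in enumerate(gene):
        needed += ARITY.get(sym, 0) - 1   # operator adds args, terminal fills one
        if needed == 0:
            return i + 1
    return None

def repair_truncate(gene):
    """Option 1: drop trailing operators until the prefix read succeeds."""
    g = list(gene)
    while g and valid_prefix_len(g) is None:
        for i in range(len(g) - 1, -1, -1):
            if g[i] in ARITY:
                del g[i]                  # remove the last operator
                break
        else:
            break
    return g

def repair_randomize(gene):
    """Option 2: turn offending operators into random terminals."""
    g = list(gene)
    while g and valid_prefix_len(g) is None:
        for i in range(len(g) - 1, -1, -1):
            if g[i] in ARITY:
                g[i] = random.choice(TERMINALS)
                break
        else:
            break
    return g

print(repair_truncate(['+', 'a', '*', 'b']))   # ['+', 'a', 'b']
```

Both loops terminate because each iteration removes or converts one operator, and an all-terminal gene is always valid.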

### Are GEP Gene Heads Necessary?

In GEP each gene starts with a head of predetermined length. The head can have both operators and terminals, while the rest of the gene can only have terminals. This forces all individuals to be well formed, which is vital to GEP according to its creator. This is all well and good, but I'm not sure that we couldn't get rid of the restriction on the head and still come up with well formed individuals if we were a little creative.

To be clear, the only reason that we might want to get rid of this restriction is that it may influence the types of trees that we can see. Since the original Karva notation fills out each level before the next is started, we are fine and get trees that are filled out nicely even if the whole head is terminals.

The problem occurs if we use prefix GEP (PGEP), where the decoding of the tree is done in prefix notation. This seems to have the advantage of keeping substructures from spreading out around the gene (at least, it helps). Imagine a gene with all operators in the head: the resulting tree would be terribly unbalanced, where in Karva notation it would be nicely filled out. While we can't say that being filled out is better than being lopsided in all cases, it seems like the types of trees that the two notations find may be different.

If there were no restriction on where an operator could appear, then many trees that were not possible before could be formed. Many of these trees would be strange ones, but the point is that they would not be excluded from forming, nor would any tree currently findable be prevented from forming.
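
As a tiny illustration of why the two notations shape trees differently, here is a sketch decoding the same made-up gene both ways (the `ARITY` table and the gene are my own toy example):

```python
# Decode one gene in prefix order and in Karva (breadth-first) order,
# then compare the depths of the resulting trees.

ARITY = {'+': 2, '*': 2}

def decode_prefix(gene, pos=0):
    """Decode in prefix order into a (symbol, children) tree."""
    sym = gene[pos]
    pos += 1
    children = []
    for _ in range(ARITY.get(sym, 0)):
        child, pos = decode_prefix(gene, pos)
        children.append(child)
    return (sym, children), pos

def decode_karva(gene):
    """Decode in Karva order: each level is filled before the next starts."""
    root = [gene[0], []]
    frontier, pos = [root], 1
    while frontier:
        nxt = []
        for node in frontier:
            for _ in range(ARITY.get(node[0], 0)):
                child = [gene[pos], []]
                pos += 1
                node[1].append(child)
                nxt.append(child)
        frontier = nxt
    return root

def depth(node):
    sym, children = node
    return 1 + max((depth(c) for c in children), default=0)

gene = list('++*+abcde')   # operators up front, terminals after
prefix_tree, _ = decode_prefix(gene)
karva_tree = decode_karva(gene)
print(depth(prefix_tree), depth(karva_tree))   # 5 4
```

The prefix read builds a lopsided depth-5 tree from this gene, while the Karva read gives a level-filled depth-4 tree, which is exactly the difference in tree shapes described above.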

Next post I will discuss possible ways to remove the need for the restriction, as well as go into the possible pros and cons (not that I've done the research, I'm just jotting down ideas).

## Tuesday, February 23, 2010

### Messy GA Time?

After reading about messy GAs and how they can be applied to combinatorial optimization problems, I think something like this may help GEPs. Since my capstone project deals specifically with combinatorial optimization, this could be something to investigate.

The idea of messiness in genetic algorithms is (briefly) that small solutions can be combined to make complete solutions. The small solutions will, hopefully, represent building blocks (as per the building block hypothesis), and the full solution found at the end will be composed of all these great little blocks. This is especially cool as the blocks are made to cover any schema: they are represented not as a group of contiguous loci, but as a list of locus positions and their values. This allows the blocks to fall in any schema possible, while the unspecified positions are seen as # (the unspecified character).

This of course requires some way to evaluate partial solutions. This is particularly difficult because of non-linear interactions between parts of a chromosome. The only reasonable method that I am aware of is to apply the partial solutions to some known solution, found through hill climbing for example. If the partial solution "overlayed" on the given solution creates a better solution than the original, then it may be a good schema to keep in the population. This relies on another problem solving method, and I suspect it reduces the ability of the messy GA to search the full problem space, as only schemas that improve the local optimum are considered fit, and not ones that move the solution towards another (possibly better) optimum.
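
The overlay idea above can be sketched in a few lines of Python. This is only a toy: the representation (a list of `(locus, value)` pairs) follows the description above, but the fitness function (OneMax) and all names are stand-ins of my own:

```python
def overlay(template, block):
    """Apply a partial solution (list of (locus, value) pairs) to a full
    template solution; unspecified loci keep the template's values (the #s)."""
    candidate = list(template)
    for locus, value in block:
        candidate[locus] = value
    return candidate

def fitness(solution):
    return sum(solution)   # OneMax: count of 1 bits, a stand-in objective

template = [0, 1, 0, 0, 1, 0]   # e.g. a known solution found by hill climbing
block = [(0, 1), (2, 1)]        # a building block specifying only loci 0 and 2

# Keep the block if overlaying it improves on the template.
improved = fitness(overlay(template, block)) > fitness(template)
print(improved)   # True
```

This also makes the worry concrete: a block is judged only by whether it improves this one template, not by whether it points towards some other basin of attraction.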

Next post I will explain how this can be extended and possibly applied to GEPs, and why we might want to apply it in the first place.

## Monday, February 22, 2010

### Binary GEP Time

So like I said in the previous post, set membership is easy to express with a GEP. Then we run into the problem that the size of the terminal set is dependent on the size of the problem (|A|, where A is the set we are taking a subset of). In other problems, this does not appear. In symbolic regression, for example, we have some small constants, and maybe some method of creating constants (which is a cool area of research in GEP, btw), and the size of the problem can be very large with very few terminals (at least, the number of terminals is dealt with in a nice way).

There is such a thing as binary GEP (BGEP), which encodes the operators and terminals in a low order alphabet (0|1), which it is claimed will help when you have problems with many operators and terminals. I also would like to try prefix GEP, just because I like it a lot. It seems to allow building blocks to be created and to stay together better. I hope that these two techniques, which I've never seen used together, will provide some reasonable performance and solve even relatively hard problems.

Of course, this is a general method, and will almost certainly be outperformed by problem specific solvers. On the other hand, I would guess that it will outperform simple GA solvers simply because GEP seems to generally perform well.

This is especially interesting as a project because what we will end up with is an expression that creates the solution, not the solution itself. This means that we may be able to investigate the knowledge that is encoded in the expression and learn about the problem. I doubt I will actually do this, but it would be cool in a real applied situation where that knowledge may be helpful.

## Sunday, February 21, 2010

### Set Membership Time

As I said in the last post, GEP can be used to solve set membership problems. By set membership I mean problems that can naturally be represented as having some set and trying to find a subset of it with a certain property. Subset sum, partition, SAT (I'm probably going to do 3-SAT), and a lot of other problems fall into this category. This is interesting because SAT has such a rich history and use, both in direct practical applications and because of its importance in the study of NP and NP-complete problems.

The representation that I am planning on using is so simple and obvious that I am surprised no one has done this yet (actually I'm not sure it hasn't been done, but I've not been able to find a paper related to it at all). The setup has set operations as operators, and each terminal is a set containing one element (the id of the variable, in the case of boolean satisfiability), as well as the universe set (I'm not sure I will include this yet) and the empty set. This allows a single individual to encode an expression whose result is a set. This can be the set of boolean variables that are true out of the total set of variables. In general, this will result in one subset, and another can be constructed as U/S, where S is the subset created, U is the universe, and / is set difference.

The only real problem with this method is that it does not extend easily to multi-subset problems like graph coloring (each color could be a set and the vertices would be the items in the subsets). This means that the only way to use this method to solve that type of problem is reduction, which I will be avoiding so I can focus on making this method work as well as possible.
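
A minimal sketch of evaluating such an expression, in prefix order since that is the variant I plan to use. The operator symbols and the example gene are my own choices; a real implementation would of course build the gene from the evolving chromosome:

```python
# Terminals are sets (singletons, and possibly the universe or empty set);
# operators are the binary set operations union, intersection, and difference.

def eval_prefix(gene, pos=0):
    """Evaluate a prefix-order set expression; returns (result_set, next_pos)."""
    sym = gene[pos]
    pos += 1
    if isinstance(sym, set):            # terminals are already sets
        return sym, pos
    left, pos = eval_prefix(gene, pos)  # all operators here are binary
    right, pos = eval_prefix(gene, pos)
    if sym == '|':
        return left | right, pos        # union
    if sym == '&':
        return left & right, pos        # intersection
    return left - right, pos            # '-' : set difference

# Example gene over 3 boolean variable ids: (({1} - {2}) | ({3} | {1}))
gene = ['|', '-', {1}, {2}, '|', {3}, {1}]
subset, _ = eval_prefix(gene)
print(subset)   # the set of variables assigned True: {1, 3}
```

For SAT this subset would be read as the variables set to True, and its complement (U minus the subset) as the variables set to False.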

In the next post I will discuss some of the issues involved in this method, and the ever present problem of scalability and how it can be addressed with GEP.

### Combinatorial Optimization Time

For my capstone project I am investigating the use of gene expression programming in combinatorial optimization. Specifically, I am using the method outlined in the paper "Combinatorial Optimization by Gene Expression Programming: Inversion Revisited."

While I agree with the paper on the usefulness of inversion (and I think the author is amazing; she invented GEP), there are some things that I just don't accept. First, I have been unable to reproduce her results. At best I get about 6 successes out of 100 runs for the traveling salesman problem. The author has not responded to my emails asking for clarification on the setup of the problem, so I'm not sure if it's my program, the setup, or the method that is behind this disparity.

Second, while inversion is all well and good, this method will obviously have problems scaling to harder problems, as only one change may be made to an individual in a generation.

Third, I understand the need to remove crossover as an operator for the TSP, but it is not necessary to throw it away entirely. For the problem that I started out investigating, the sorting network problem, there is a natural representation involving a series of lists that can be crossed over easily and never create anything but a full individual. Unfortunately this has produced very poor results so far, which I suspect is because adding crossover by using each list as a crossover point turns inversion into a point mutation and each list into a locus. This means that the cardinality of the set of states that each locus can be in is huge (2^n), which goes against a lot of GA research. I realize that the low order alphabet idea has come under a lot of criticism, but I think it is probably justified in this example.

Fourth, this method only solves list ordering, or multi list ordering. Even extended to larger numbers of lists, with crossover added as I am doing in my project, it does not cover all problems of this type. In fact, I can only think of very few multi list ordering problems. Therefore, I think we can sidestep this whole scalability issue, where the method seems to have inherent problems, by using regular GEP on set membership problems. I will talk about this in my next post.

### It's Starting a Blog Time!

I am a computer science major, minoring in math, at Christopher Newport University. My research interests are in Genetic Algorithms and Gene Expression Programming (Prefix and Binary GEP in particular). I like vim, linux, and interesting programming languages like Python, Ruby, Haskell, Scheme, Forth, and some other things. These will be the primary topics for this blog.

I am starting this mostly as a way to record thoughts as I go into my graduate studies and start doing research, as well as to give my opinion on technical issues.

The time has come to blog.
