Wednesday, September 29, 2010

Symbolic Regression in Haskell

I am testing my Haskell GEP library with a simple symbolic regression problem from the paper "Investigation of Constant Creation Techniques in the Context of Gene Expression Programming". I like the idea that the individuals evaluate to something that is then mapped to a fitness, even though my first thought (the way I did it in the Java GEP library I wrote last summer using the Functional Java library) was to have the expression tree of the individual be the expression it encodes. The real difference here is: when you encounter a variable like 'x', how do you give it a value? Do you evaluate the expression tree with some extra global state (kidding!), pass down an environment (what would it look like?), or something else? In the Java program I passed info to the evaluator telling it to evaluate the individual with some value for x. This was messy and required each use of the function to go through the whole evaluation of the expression tree.

My solution was to actually create a function of type Double->Double, which is the type of thing we are optimizing in this case (unary functions of real numbers, f(x)=something). This means we must construct a function that, given a value x, computes the value of the function the individual encodes, and we should build it incrementally so it can be constructed by a depth-first evaluation of the expression tree.

I want to note that the other way to do this is to create an instance of State Double, or more generally State Env where Env is some environment necessary for the computation.
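Just to sketch what that alternative might look like (this is not the approach I actually take below, and the names here are made up for illustration), the environment would carry the value of x and each node would become a small stateful computation:

import Control.Monad (liftM2)
import Control.Monad.State (State, evalState, get)

type Env = Double                    -- for this problem the environment is just x

varX :: State Env Double
varX = get                           -- a variable reads x from the environment

constE :: Double -> State Env Double
constE = return                      -- a constant ignores the environment

addE :: State Env Double -> State Env Double -> State Env Double
addE = liftM2 (+)

-- evalState (addE varX (constE 3)) 2.0 evaluates x + 3 at x = 2, giving 5.0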

So, how do we make such a thing? We must be able to take our terminals (constants and the variable x in this case) and combine them with our operators (+, -, *, / in this case).

First the definition of an operator
data OP a = OP { arity::Int, applyOp::[a] -> a, name::String }
Note that operators know their name, and have a function that takes a list of arguments and evaluates to something.
Then of a variable
var = OP { arity=0, applyOp=const id, name="x" }
A function that turns constants into operators
constant d = OP { arity=0, applyOp=const (const d), name=show d }
and a function that turns a function Double->Double into an operator
function fn nam = OP { arity=2, applyOp=(\(f:g:xs) -> (\a -> f a `fn` g a)), name=nam }

The trick here is that a variable x is just the value of x being passed in as a parameter, therefore x=id. Constants take in the value for x, but have no use for it, therefore a constant must take something in but always return the same thing- const d where d is some constant.
Binary functions, the only kind I'm defining right now, must apply themselves to the results of evaluating their arguments, and those arguments need the value of x in order to evaluate. So we take the two argument functions and return a new function that, given an x, evaluates both of them and applies the operator to the results.
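To make that concrete, here is a small usage sketch, assuming the definitions above, that builds f(x) = x*x + 3 by hand and evaluates it (the explicit type signatures are just to keep the example monomorphic):

plus, times :: OP (Double -> Double)
plus  = function (+) "+"
times = function (*) "*"

-- applyOp times [id, id] is \a -> a * a, and so on up the tree
f :: Double -> Double
f = applyOp plus [ applyOp times [applyOp var [], applyOp var []]
                 , applyOp (constant 3) [] ]

-- f 2 evaluates to 7.0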

This is really very simple; I am just relieved that I can actually create such a thing easily. It was really frustrating in Java not having higher order functions (despite using the Functional Java library, which tries hard, but is nothing like a real functional language for ease of use).

Now I have to find the right construct for implementing the evolutionary algorithm. This has presented several problems, the solution to one may be a monad transformer, but I'm not sure.

Incidentally, the other problem I'm having is that the validity checker seems to be rejecting all variation (mutation, rotation, etc), so no new individuals are ever created after the randomly initialized population. This is not cool, and I'm not sure what's going on- the checker seems to work, and the operators work by themselves (when not reversed by the checker), yet I've never seen the system create new genetic material, even through recombination.

Tuesday, September 28, 2010

Treeless Expression Trees

I am fairly certain that in BPGEP there will be no need to ever actually create a tree data structure. We can perform our genetic operations on a bit-vector (or a list of symbols if you really want- I actually think PGEP is fine without binary encoding in some cases) and we can already perform evaluation of individuals without the actual tree because prefix evaluation is so simple.

The neat thing, though, is that I think we can get away without checking trees for consistency. It shouldn't be difficult to repair trees by acting on the linear structure- in fact, I think it can be done in a simple little O(n) way. Even if we find and chop off the non-encoding region first (also an O(n) operation), the whole thing is still linear: n + n = 2n steps. This is actually much *less* work than was being done with the consistency checking, because it only has to happen when actually evaluating an individual, rather than for every genetic operation. It may be more than the original GEP needed, but it requires no concept of the eventual structure of the expression tree anywhere, which saves in complexity and allows for a more general range of genetic operators. A rough sketch of what I have in mind is below.
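Assume each symbol carries its arity (0 for terminals); one left-to-right pass tracks how many arguments are still needed. Cutting where the count reaches zero drops the non-encoding region, and padding with a filler terminal when the gene runs out repairs it. This is just an illustration, not code from hgep:

type Sym = (String, Int)             -- a symbol and its arity (0 for terminals)

-- keep only the encoding region of a prefix gene
chopNonCoding :: [Sym] -> [Sym]
chopNonCoding = go 1 where
  go 0 _    = []
  go _ []   = []
  go need (s@(_, ar) : rest) = s : go (need - 1 + ar) rest

-- pad with a filler terminal if the gene ends before all arguments are supplied
repairGene :: Sym -> [Sym] -> [Sym]
repairGene filler = go 1 where
  go 0 _  = []
  go n [] = replicate n filler
  go need (s@(_, ar) : rest) = s : go (need - 1 + ar) rest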

I also think that it will do a fair job of preserving substructures- which is the most important thing to me about PGEP.

Basically, the only thing that makes it GEP and not a GA, besides some additional operators (rotation doesn't really make sense in GAs), is the evaluation function, which you have to specify anyway. I think it would be possible to reuse GA libraries for BPGEP, with the additional code all located in the evaluation.

Pretty cool.

Sunday, September 26, 2010

Pullbacks

The videos posted by the Catsters on YouTube are really a great resource for learning Category Theory. The instructors have a sense of humor, and seem to be enjoying themselves. Unfortunately, I am often unable to understand the concepts given the formal nature of their discussions. The way I've been proceeding is to determine and then learn the specific concept that each idea in category theory generalizes. In one video the instructor suggests that you can either prove the claim they are talking about, or meditate on it in a categorical trance until you feel like you understand it. I've read of people saying that they do this very often (go into categorical trances), but I've never understood enough about the ideas to do so. Until now! I was able to understand an (admittedly trivial) concept in category theory (pullbacks) in at least some basic sense simply by thinking about the meaning of the associated commutative diagram.

A small victory, but a victory nonetheless.

Complexity Theory

I've been looking into Calgary as a possible next step in my education. One thing that is attractive about it is the contingent of AI researchers who investigate swarm intelligence. In addition to my interest in functional languages and the subjects that surround them, I really like emergent behavior in complex systems, and I've always found swarms of things to be beautiful. The massive random-seeming complexity and high-order structure is both interesting and potentially useful. It is my opinion that this type of behavior has wider reaching (and more beneficial) consequences than anything else I could study (why I believe this is complex- perhaps some other day I will post about it).

In this spirit I read a book called Complexity by Melanie Mitchell. Her work also introduced me to Genetic Algorithms (and her discussion of Messy Genetic Algorithms is my inspiration for Messy Gene Expression Programming), and now to complexity theory, which underlies the type of emergent intelligence that swarms are known to exhibit. The book is very interesting, covers a wide variety of topics, and is very recent- published in 2009. The step by step description of the program Copycat was instructive, and I appreciate the overview of some ideas of the structure, definition, and difficulties of the concepts of information, computation, etc.

I think it is interesting that, given the distribution of gender in the field of computer science, Melanie Mitchell should be an inspiration to me in complexity, Candida Ferreira the inventor of the field I'm doing my research in, and Xin Li the author of PGEP, my favorite version of GEP, which I hope to propose a variation of as my master's thesis. I wonder if complexity and AI have a better gender distribution than the field as a whole, or if this is a coincidence.

Unification and Assignments

Is unification an appropriate tool in assignment generation? I imagine a user specifying some partially filled in template (possibly a type like (a->b), or (((Int->Int)->[Int]->Int)+), where we are saying that we want a problem where + is folded over a list). A problem is generated, and the very general test case for folds is specialized to use some addition operation. The story is one of probably many possible for that type, and the things in the list are things with a property that can be summed. This is discovered by a database of knowledge that backs the whole system. Hopefully there is some way to allow composition to occur between function types, so we can get application of assignments to assignments and composition of assignments, and still instantiate them because some invariants are satisfied and ensure, or at least guide, the composition to semantically valid templates from which assignments and test cases can be created.

Will this ever exist? Does it make any sense? I don't know.

Trees are Complex

Tree structures have nice properties as representations of solutions to many problems. This is rather obvious, but it is also very important- the whole subject of Genetic Programming relies on it. They are used because they can describe things that are built from subexpressions and things that naturally have this structure. But Gene Expression Programming exists essentially as a biologically justified reaction to the complexity of Genetic Programming. In this sense my research is justified in a way I would never have anticipated- it is more like GEP than GEP is. BPGEP is better justified in some ways (though PGEP removed some of the operators that were biologically justified in GEP, but I'm not really concerned about them), and it (hopefully- see my last post) will remove all need to be aware of the tree structure of the solutions.

If we could have a system in which all knowledge of the expression tree lives in the evaluator, including some easy and obvious repair or adjustment operation that ensures every individual encodes some tree or other and that works for all problems, then we would have no more reliance on trees and their complex structure. Not that trees are all that complex- the point is that genetic operators work very well on linear symbol lists, especially bit vectors, while GP has historically needed a lot of research into finding good operators that respect tree structure. In BPGEP we have a system with a trivial encoding (binary), simple genetic operators (point mutation, single and double point crossover, and rotation- plus selection, though that does not depend on the encoding), no tree related complexities, and a solid biological justification without going crazy. Rotation, for example, is just a circular shift of the gene; a sketch is below.
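Something like this (the rotation amount would come from the random number generator in practice):

-- rotate a gene left by n positions, wrapping around
rotate :: Int -> [a] -> [a]
rotate _ [] = []
rotate n xs = drop k xs ++ take k xs
  where k = n `mod` length xs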

I know I posted this idea before, but I want to emphasize that this is a direction consistent with the spirit of the original GEP paper. I feel pretty good about this.

Tuesday, September 21, 2010

The Future of BPGEP

My current plan for the direction of my thesis is to keep modifying BPGEP until it is as simple and easy to use as possible. The idea is that the original GEP is just the inspiration from which a nice and biologically justified method can be made. There should be no knowledge in the genetic operators about the structure of the tree an individual expresses, nor even about the symbols it encodes. It should be easy to specify a terminal and operator set without getting into details about the binary representation. New genetic operators should be easy to come up with, as all they have to do is preserve length. New chromosome structures (which are common for holding extra information in individuals) should be easy to add- just add to the end like in other EAs. All individuals should encode some expression tree- and if a repairing operation must occur it should be simple and quick. The biological justification should be stressed, as well as the binary representation- PGEP is an amazing method and we have to justify the added complexity of the binary representation if we are going to use it at all (which we are).

I would maybe like some interesting problem to solve that this method is suited for, but if I don't end up with one I expect to perform some symbolic regression task and maybe classification. There are some really cool problems used in the first GEP paper which may be a good testing ground.

It is my hope that in the end the number of parameters that one has to specify is smaller than GEP and that the implementation does not suffer too much complexity.

Looking at the original GEP paper, it looks like you need to specify: popsize, Pm, Pc1, Pc2, Pgene recombination, Pris, RIS length, Pis, IS length, terms, ops, generations, head length, number of genes, linking function, and Pgene transposition. In BPGEP I expect to specify: popsize, individual length, generations, Pm, Pc1, Pc2, Pr, terms, and ops. These are mostly unavoidable, and as long as there is a reasonable strategy for dealing with the implementation details, those may be the only parameters to BPGEP.
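Just to picture it, the whole BPGEP parameter set could fit in a record something like this (the field names are hypothetical, not from the hgep code):

data BPGEPParams = BPGEPParams
  { popSize   :: Int        -- popsize
  , indLength :: Int        -- individual length
  , gens      :: Int        -- generations
  , pMutation :: Double     -- Pm
  , pCross1   :: Double     -- Pc1
  , pCross2   :: Double     -- Pc2
  , pRotation :: Double     -- Pr
  , terms     :: [String]   -- terminal symbols
  , ops       :: [String]   -- operator symbols
  }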

Shape of Homework Assignments

My current thinking about the research into automatic generation of homework assignments is about descriptions from which a problem can be constructed. The best I have so far is that type signatures should be linked to elements of stories, where the types can be specific or general (polymorphic). An abstract type (like a type constructor) can be made more specific, or kept abstract, and should be linked to one or more stories. These stories might be composable if the functions that describe them are composable.

For example- given the subject "lists", we have functions related to lists- (a->b) -> [a] -> [b] for example, which is the familiar map. This type can have a, b, neither, or both specified as a concrete type, and may have either or both of its arguments fixed. Any of these combinations should give rise to some story about lists where we want a list of something based on the list we have or are given. A few of these specializations are sketched below.
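For instance (the particular functions here are just made-up examples; the real wording would come from the story behind each type):

mapGeneral :: (a -> b) -> [a] -> [b]                      -- fully abstract
mapGeneral = map

mapPrices :: (Double -> Double) -> [Double] -> [Double]   -- a and b made concrete
mapPrices = map

applyDiscount :: [Double] -> [Double]                     -- the function fixed too
applyDiscount = map (* 0.9)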

An example of composition is a function a->b and a function b->c. We hope that if each function has some set of stories behind it, then there is some set of stories for the composition a->c. For example- we have a price and want to know how much it will cost to buy n of something, and we have a coupon that takes 10% off a price. The composition of those stories is: we want to know the price of buying n of something with a 10% off coupon, and the type signatures compose as (Double->Double) . (Double->Double), making a Double->Double.
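A minimal sketch of that pricing example (the function names are made up, and n is fixed at 5 so that both stories really are Double->Double):

buyFive :: Double -> Double
buyFive price = 5 * price        -- the cost of buying 5 of something

coupon :: Double -> Double
coupon total = 0.9 * total       -- a coupon that takes 10% off

-- the composed story: the cost of buying 5 of something with the coupon applied
withCoupon :: Double -> Double
withCoupon = coupon . buyFive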
Interestingly, the level of specificity seems to be related to the difficulty of the problem. This even works in other domains- in physics it is considered more difficult to give a homework problem where an equation must be solved symbolically rather than numerically, and needing to solve symbolically corresponds to not specifying a value for some input variable.

I wonder about the ability to describe problems with arbitrary depth- lists can hold lists can hold lists, etc. At each level we can "add" the lists, or we can add up all elements of each list (assuming the lists hold some monoid or something). It makes me wonder if shape analysis, something I've just recently become interested in, may help. We could say- lists of length n to m as part of the type if we can be explicit about shapes.

It seems like to work around all sorts of nonsense problems that may appear, we would have to guide our search for specifying an abstract type by some real world knowledge. This is where the whole concept map, knowledge domain thing comes in to this project. I'm not sure exactly how this would happen.

I will have to flesh out exactly how these types are arrived at, and how the stories they specify can be distinguished (as I suspect the underlying stories will direct the search for specificity as well as our ability to compose).

Thursday, September 16, 2010

Messy GEP Again

I'm back at the possibility of a Messy Binary Prefix Gene Expression Programming. The idea is to have small individuals that encode small, partial solutions to a problem. Then we save some of the subtrees of these solutions and add them as terminals (or, more generally, as operators with some number of "holes" in them, if the subtrees are allowed to be partial (partial = not all leaves are terminals)). These new terminals are available to be added by mutation and will enter into the population. Every so often (every n generations, for example) this extraction of substructures would take place, potentially replacing some previously extracted substructures (or maybe even regular terminals!). The individuals would get more and more complex as they combined and recombined substructures into increasingly complex expression trees.
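In terms of the OP type from the symbolic regression post above, a saved subtree could be reintroduced as just another zero-arity terminal- something like this purely illustrative example:

-- a collapsed subtree for x*x, added to the terminal set like any other terminal
squareTerm :: OP (Double -> Double)
squareTerm = OP { arity = 0, applyOp = const (\x -> x * x), name = "(x*x)" }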

While it would be nice to include an "expansion" operator like that in the paper on self-emergence of substructures in PGEP, I'm not sure that I would add it. In real genetics, very complex creatures have a lot of parts of their DNA that do not change very much at all, and small changes often prove fatal when they occur. What I want is the emergence of complexity out of simple trees.

We would almost certainly have to bound the length of these extracted trees, or somehow ensure that they don't just grow to arbitrary length.

I'm not sure I'll ever actually do this research, but it seems like a neat little idea- modeling the increase in complexity of individuals through time to find very large and complex solutions to problems.

Modifications to BPGEP

I talked to a biology professor today about the biological justification of BPGEP. It turns out that some of the things I thought of to simplify BPGEP are biologically justified. The most interesting is the idea that instead of ensuring that no operation ever creates an invalid individual, as PGEP does, or separating each gene into a head and tail, as GEP does, we could just allow any change at any time. Then, when we do the evaluation, we repair the gene so that it has a corresponding expression tree. The way I thought of is to have a neutral element that we tack on to fill out the extra needed terminals. Dr. Hibler, my adviser, suggested that instead any operation that does not have a full complement of arguments should just return, unchanged, the one(s) it does have (a sketch of that idea is below). This would work for 2-ary operations, but I'm not sure it would help with n-ary ones (unless we just returned the first argument or something). Apparently the RNA strand that a series of base pairs creates gets a head added, and a tail called a poly(A) tail, and I hope to add such a tail. In real genetics that tail does not have an effect on the resulting protein (I think), but here it has as little an effect as possible.
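Using the two-field OP type from the prefix evaluation post below, that suggestion might look something like this (illustrative only, not hgep code):

-- a binary operator that passes a lone argument through unchanged
lenient :: (a -> a -> a) -> OP a
lenient fn = OP 2 apply where
  apply (x:y:_) = fn x y
  apply [x]     = x                -- not enough arguments: return the one we have
  apply []      = error "lenient: no arguments"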

The other modification is that apparently sometimes parts of an RNA strand are removed before it creates a protein. This has some effect, but the result is often close to what it would be without the modification. This is apparently not very well understood- but it occurred to me that after expressing parts of the bit vector into symbols, we could remove symbols that encode "bad" things, like the rest of the 2^n symbols that fill out the terminal and operator list in BGEP. This may end up being a bad idea, as we may have individuals that need lots of repairing and symbols that get removed (which allow neutral mutations, so maybe they aren't all bad).

There are a huge number of possible modifications to the GEP system, I'm trying to not get lost in them. Hopefully some of the biologically inspired ones both improve and simplify the system and can be made into some sort of reasonable result. I would rather this not end up being yet another highly complex evolutionary algorithm that no one is interested in because of the high cost of entry (the cost of learning and implementing it).

Wednesday, September 15, 2010

Prefix Evaluation of a Chromosome in Haskell

My HGEP project is focusing on PGEP right now, though there is no reason not to add karva notation later on. One reason I'm doing prefix first is that there is no need to be aware of the structure of a chromosome, except a very simple validity check to ensure we never make any illegal individuals (illegal in the sense that they do not encode any expression tree). I've finished the prefix evaluation code, and I liked it enough to post about it.

The code involves evaluating a list of operations. The operations are defined like so

data OP a = OP { arity::Int, applyOp::[a]->a }

Each one is a function of some arity and takes a list of arguments- hopefully with length >= arity. One nice thing about prefix notation is that (as mentioned in the original PGEP paper) there is no need to actually create an expression tree- the operations can be evaluated as a list. The only complication is that the operations must "eat" their arguments off the list. Constants and variables are removed from the list and their values returned, but an operation like "+" must evaluate its first argument, then its second, before returning their sum. The second evaluation must evaluate the list left over from the first evaluation. Hopefully some of that makes sense.

And without further attempts at explanation- here is some code:

import Control.Monad.State (State, evalState, get, put)

-- evaluate a prefix gene: the first symbol is the root, the rest is the state
prefix :: [OP a] -> a
prefix (op:ops) = evalState (prefixEval op) ops

-- evaluate one operator, pulling its arguments off the remaining list
prefixEval :: OP a -> State [OP a] a
prefixEval (OP 0 fn) = return $ fn []
prefixEval op = do
    args <- evalArgs (arity op)
    return $ applyOp op args

-- consume and evaluate the next n arguments from the list
evalArgs :: Int -> State [OP a] [a]
evalArgs 0 = return []
evalArgs n = do
    (op:ops) <- get
    put ops
    arg <- prefixEval op
    args <- evalArgs (n-1)
    return $ arg : args

Notice that it is a mutual recursion that allows us to take the list apart and evaluate each argument. This reminds me of the metacircular interpreter for Lisp (eval/apply). Also notice that the problem of evaluating only the part of the list not yet inspected is handled by making the list explicit as the state of the computation. I like how we have to be so explicit about the type of computation we are performing.
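A tiny usage example, with a couple of operations defined inline just for illustration:

plus :: OP Int
plus = OP 2 (\(a:b:_) -> a + b)

one :: OP Int
one = OP 0 (const 1)

-- prefix [plus, one, one] evaluates + 1 1 and gives 2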

In another post I hope to talk about symbolic regression and different ways to implement it in Haskell.

Generalized Crossover in Haskell

I've started to see some nice stuff come out of my hgep project. The post "Operads and Their Monads" from http://blog.sigfpe.com/ (which is an amazing blog btw) uses some code that relates to operations in GEP and crossover in evolutionary algorithms. This is not really related to the post, but the way he does things is much better than what I was doing.

Crossover in general involves finding cut points and doing an exchange between two lists to produce two mixed up lists. I've never seen anyone do crossover with more than 2 individuals, or immediately throw away one of the results, so crossover can be described as taking two lists of loci and a list of cut points. The cut points can also be given as the length of each section, which is the approach taken here.

And now some code:


crossover :: [Int] -> [a] -> [a] -> ([a], [a])
crossover ns as bs = (exchange f s, exchange s f) where
    f = unconcat ns as
    s = unconcat ns bs

-- split a list into sections of the given lengths
unconcat :: [Int] -> [a] -> [[a]]
unconcat [] [] = []
unconcat (n:ns) as = take n as : unconcat ns (drop n as)

-- interleave sections, swapping which list we draw from at each cut point
exchange :: [[a]] -> [[a]] -> [a]
exchange [] [] = []
exchange (a:as) (b:bs) = a ++ exchange bs as

I just liked this code so much, I had to post it. There is some setup code, to apply it to random pairs, to make sure it only happens with some probability, etc, but this is the core idea of crossover in 7 lines of actual code.
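As a quick sanity check of the behavior- with section lengths [2,3,3], i.e. cut points after positions 2 and 5:

-- two-point crossover: sections of lengths 2, 3, and 3, swapping the middle piece
-- crossover [2,3,3] "abcdefgh" "ABCDEFGH"  ==>  ("abCDEfgh", "ABcdeFGH")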

Tuesday, September 14, 2010

Started a Real Haskell Project!

I have been working on a Haskell implementation of GEP (really PGEP because I don't like GEP's complexity) and I finally made it an official project, with source control through darcs, and available through cabal! The project is just called hgep, for Haskell Gene Expression Programming.

While there is at least one other Haskell GEP project, it is too deeply meshed with GEP's way of doing things, and I need the ability to change any part of the system at any time to introduce new ideas (for research). Also, I've wanted a large Haskell project to get me familiar with the language for a while now.

At the moment the project can perform a simple GA, where the fitness function interprets each individual as a binary number and applies f(x) = x^2, and a simple symbolic regression PGEP problem. The PGEP section of the code is terribly slow, and needs a lot of work to add BPGEP and more machinery to allow me (and I guess anyone else, if someone were so inclined) to add new experiments and algorithms as easily as possible without locking too much into place. In fact, I don't plan on locking much in place at all- the project will be mostly components that can be fit together to construct an evolutionary algorithm, but will have to be put together from scratch almost every time (hopefully with some amount of help from the framework).

Monday, September 13, 2010

Automatic Generation of Programming Problems

I have just started a graduate assistantship with a professor doing research in the automatic generation of programming problems. The idea here is to define a template for a problem and generate (automatically) enough unique problems for a classroom, along with a testing module for each instance of the problem. It is possible that this will reduce cheating, and it will probably make life easier for professors.

There are several things that I can look into for this project, including extending the existing framework written in Java. The other possibility, which I'm looking into this week, is to continue a nice, mathematical, rigorous approach to studying programming problems. The student who started down this route used Haskell, and seems to have a firm grasp of category theory and functional programming theory (better than mine, but then I am taking up this project mostly to learn).

While I certainly wouldn't give up my evolutionary algorithm work to do research in this area, it at least ties in to some of my other main interests in computer science, namely mathematical foundations and functional languages.

Hopefully I will know more about this subject soon, and I will post about it.

Monday, September 6, 2010

Self Emergence of Substructures

There is a paper by the creators of PGEP called "Direct Evolution of Hierarchical Solutions with Self-Emergent Substructures." This paper describes a system in which expression trees can be extracted from an individual (in the elite group) and collapsed into a single terminal, which is added to a library of substructures. Similar techniques appear to exist in other fields, such as static and dynamic superinstructions in Forth, which are known to improve the efficiency of native code compilers.

This idea has some interesting possibilities, methinks. For one, it allows substructures to combine in new and interesting ways. Also, it may allow greater depth to appear in expression trees, which is a problem in regular GEP in certain cases (a problem where many terminals should appear in the optimal solution would require very long individuals to ensure the expression trees are large enough to include the necessary terminals). I would like to investigate this area of GEP research as part of my master's, presumably using BPGEP as the underlying method.

One possibility I may try is a simpler way to add this technique to GEP than the one proposed in the paper. It is not overly complex, but I think a simpler way would be easier to digest for people using GEP, and may be able to keep the primary advantages. My adviser warned me against overly complex methods, where a great deal of computation time is spent just keeping track of things and updating all the moving parts. One of my favorite things about PGEP is its relative simplicity (compared to GEP), and I don't want to add too much more to it other than the binary encoding.

Another possibility is to allow a great deal of subtree saving and collapsing, and to have relatively small individuals. The idea would be to recreate the conditions of Messy Genetic Algorithms, where small partial solutions are built up to create working solutions. The population would have small individuals, which would become more complex as the number of saved subexpressions grew. I would have to account for uncontrolled growth (as appears in some forms of GP) as the depth would increase with each "level" of saved subexpression. Such a method could use the same biological justification the MGAs do.