Wednesday, April 27, 2011

Genetic Algorithms, Combinatorially

I've been thinking about the combinatorial structure of Genetic Algorithms. I've decided to completely ignore the whole probabilistic thing and just describe it as a series of functions between combinatorial species, hoping that setting it in some suitable category will recover the stochastic part (perhaps a state monad-like thing?).

A normal bit-vector population can be described by (2^n)^m, where n is the size of the individuals and m the size of the population. This is of course isomorphic to 2^(n*m), but I like to think of it in the m -> n -> 2 form.

Point mutation is a function 2 -> 2, lifted twice to (2^n)^m -> (2^n)^m.

Crossover is (2^n)x(2^n) -> (2^n)x(2^n) lifted and composed with a function (2^n)^m -> ((2^n)x(2^n))^(m/2) that pairs up individuals.

Selection is ultimately a function (2^n)^m -> (2^n)^m, but it should be composed of a lifted fitness function that pairs each individual with its fitness in R, giving the structure ((2^n)xR)^m, and a selection function ((2^n)xR)^m -> (2^n)^m.
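To make this concrete, here is a rough sketch of these lifted functions in Python (the names are mine, and the randomness is left out entirely, as promised):

```python
# Individuals as bit lists, a population as a list of individuals: (2^n)^m.

def point_mutation(f, population):
    """Lift a function 2 -> 2 twice, to (2^n)^m -> (2^n)^m."""
    return [[f(bit) for bit in ind] for ind in population]

def pair_up(population):
    """(2^n)^m -> ((2^n)x(2^n))^(m/2): pair adjacent individuals."""
    return list(zip(population[0::2], population[1::2]))

def crossover(f, population):
    """Lift a pairwise exchange over the paired population, then flatten."""
    return [ind for pair in pair_up(population) for ind in f(pair)]

def selection(fitness, select, population):
    """Pair individuals with their fitness in R, then apply a selector."""
    return select([(ind, fitness(ind)) for ind in population])
```

The fitness and selection functions are parameters, since they are the problem-specific (and, in the real algorithm, stochastic) parts.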

In general the 2^n can be any combinatorial object and the vector of individuals replaced with any function. It is possible that we would want a multi-sorted species, as in GEP when using multiple domains.

That's all for now, I think. I would be interested to see a combination of species and monads, although I have no idea how it would work.

Learning Category Theory, examples

To specify a category, we have to give a set of objects, and for each pair of objects a set of morphisms between them. We then have to define a way to compose morphisms that is associative. If we have, for each object, a unique identity morphism, then we have defined our category.

If we take objects as sets and the morphisms as functions, then composition is just the composition of functions. This is clearly associative, and for each set there is definitely an identity function.

Now take objects as groups and group homomorphisms as morphisms. Group homomorphisms are simply functions that satisfy the following axioms: for groups F and G, a group homomorphism is a function f:F -> G such that, with e the identity, f(e) = e, and for x and y in F, with additive notation, f(x+y) = f(x)+f(y) and f(-x) = -f(x). Allow me my abuse of notation where the "e", "-", and "+" have different meanings on the left and right hand sides of the equations. Since function composition is associative, group homomorphism composition is associative, and there is clearly an identity morphism for each group: the identity function on the underlying set.

To see that categories are not just about categories for different mathematical theories, let's look at posets. A poset is a set with a partial order. A partial order is a reflexive, antisymmetric, transitive relation. While there is certainly a category of posets, with posets as objects and monotone functions as morphisms, each poset is itself a category. To see this, take the objects to be the underlying set of the poset. For each pair of objects the set of morphisms is empty if the objects are not related, and it is the set containing the empty set if they are. Clearly for all elements "a" in the underlying set, a <= a (a is related to a), so the morphisms from a to a contain one and only one morphism, the identity. Composition comes directly from transitivity: if a <= b and b <= c then a <= c, so whenever morphisms a -> b and b -> c exist, the (unique) composite a -> c exists as well.

Another example of a structure that both has a category and is a category is monoids. The category of monoids, Mon, is very much like Grp but without negation. To see that each monoid is itself a category, take a monoid, M, and form the category with a single object (perhaps the empty set) whose morphisms are the elements of the monoid. Then the composition of morphisms (elements of the monoid) is the monoid's operation. All morphisms in this category are composable, as they all have the same source and destination. Associativity and identity are given by the associativity and identity of the monoid.
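As a tiny sketch, with (int, +, 0) standing in for an arbitrary monoid M, the category laws really are just the monoid laws:

```python
# A monoid as a one-object category: the morphisms are the monoid's
# elements, composition is the monoid operation, and the identity
# morphism is the monoid's identity element. Here M = (int, +, 0).

def compose(f, g):
    """Composition of morphisms is the monoid operation."""
    return f + g

identity = 0  # the monoid identity doubles as the identity morphism

# The category laws are exactly the monoid laws:
assert compose(compose(1, 2), 3) == compose(1, compose(2, 3))  # associativity
assert compose(identity, 5) == 5 == compose(5, identity)       # identity
```

Every pair of morphisms is composable here, because there is only one object for them to start and end at.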

So, we have seen some examples of categories. Next time perhaps I will go into some of the ways that category theory is used in Haskell by introducing Functors and showing different ways to think about them.

Tuesday, April 26, 2011

Learning Category Theory, cont

A very important aspect of category theory is that it lets us talk very generally about structures and "structure preserving maps" between those structures. The structures are the objects of the category and the morphisms the maps between the structures. In the category Set, objects as sets and morphisms as functions, there is in some sense no extra structure to preserve, as we often refer to many mathematical objects as "sets with additional structure". Morphisms will then often be whatever notion of homomorphism is appropriate in the theory we are designing the category for. Notice, however, that categories are not always about theories- they are just points with lines between them, and they can describe all sorts of interesting things.

An example of objects with additional structure to preserve would be groups. The category called Grp has groups as its objects, and group homomorphisms as its morphisms. The additional structure here is the identity, inverses, and the action of the operator. The cool thing about CT is that it can study very general situations that occur when we talk about structures and how to map between them. For example, it is the case that a lot of structures have some basic, trivial example that has only the minimal content to satisfy the axioms of whatever structure we are talking about. In the theory of groups this would be the one element group, containing in its underlying set only the identity of its operator. CT satisfies our desire to generalize this situation to all sorts of different structures, and shows us some very high level similarities between structures of different types when their categories turn out to have similar properties. The reason groups have a trivial example is that they are "pointed sets", sets with a distinguished element. Other algebraic structures also have a "point" at their identity (assuming one operator they will have one identity), and the category with their structures as objects and their structure preserving maps as morphisms will have a feature similar to Grp's. This object is called the initial object, and it can be used to show that the objects in the category have some sort of underlying structure that is exemplified by this initial object in the simplest way possible.

That explanation may have been confusing, but the point is that CT lets us talk about the similarities between seemingly different objects, and gives a way to talk rigorously about notions like the "smallest" or "simplest" or "best" of something.

One way to think of category theory is as being about morphisms more than about objects. We can talk about sets without looking inside of them, or talk about groups without ever noticing that they are constructed from sets with additional structure. The properties of objects are revealed by arrows. This is like considering an object to be defined by how it can be used, or rather by its relationships with other objects. It may seem circular- how do we define objects by their relation to other objects if we only define those other objects by their relation with still other objects? It turns out not to be a problem, as we can examine arrows and find objects that must have certain properties in the category, and those objects will tell us a lot about the category. In a (bounded) lattice we can find a top and bottom element, and even knowing nothing about the elements themselves, we know the top is the "largest" element in the lattice and the bottom the "smallest", in the sense of those words most appropriate to the lattice. Incidentally, a lattice forms a category, and there is a category of lattices with morphisms as functions that preserve the top and bottom elements and the structure of the meets and joins.

I hope to describe a series of categories in the next post to give us some practice thinking in this manner. Hopefully I can then show some similarities between them and talk about how we characterize these similarities. Things will then get more interesting when we talk about an interesting category called Cat, the category of categories. Can you see from the definition of a category what additional structure must be preserved by the morphisms in Cat?

Learning Category Theory

Continuing my series of introductory posts on awesome things that seem very difficult at first but turn out to be very fun and understandable, I'm going to start posting about category theory. This is a huge subject, and despite about a year of teaching myself about it I don't even understand all of what is sometimes called elementary category theory (I may say CT at times for short). The assumed knowledge will be higher for this series than the others, however, so be warned.

I think we will start off defining a category without going into their use or importance. I found it confusing to think about the sheer size and amount of content of some categories, and the abstraction still gets to me at times. For this reason, we will talk about a category as an object like any other object- we have a series of axioms and we are interested in the objects that satisfy those axioms. It will be a little different from an algebraic structure, as the binary operation is not closed. It's a little like a graph, except that we often don't allow graphs to have multiple edges between the same source and target, where in a category we do.

I'm going to ignore a whole series of issues in the definition for simplicity, so just take this post with a grain (or bucket) of salt.

A category, C, is given by a set of objects Obj(C) and a set of morphisms Mor(C). We can specify this as a 4-tuple (O, M, S, T) where O is a set of objects, M a set of morphisms (also called arrows, but they are just some kind of object) and two functions S, T : M -> O taking a morphism to its source (S) or its target (T). One way to think of this is as a set of objects with arrows between them. The source of an arrow is the object (or point if we draw them as a diagram) that the arrow starts at and the target the object it points to. A little ascii art:


*--->*
|
V
*
where the "*" are objects and the little ascii art arrows are the arrows/morphisms.

A nicer way to specify a category, C, is as a set of objects, which we write either Obj(C) or, in an abuse of notation, as just C, and, for every pair of objects, a set of morphisms Mor(A, B) where A, B are objects in C. We can also write C(A, B), which I prefer. You may see it written hom(A, B), which refers specifically to the set of morphisms when it's important that it is a set.

So, we have our objects and our arrows. Now for the axioms- for A, B, D in C, if we have an arrow from A to B, written A -> B, and an arrow B -> D, then there must be an arrow A -> D that is the same as combining A -> B and B -> D. This combination is called composition, and will be function composition in some cases. If f:A->B, g:B->D, h:D->E for some A, B, D, E in C, then gof:A->D (g composed with f, with "o" as the symbol for the composition operator) must exist, and ho(gof) = (hog)of, so the composition operator is associative.

Also, for every object A in C, there is a morphism id in C(A, A) (the set of arrows that start and end at the same object A) such that for any morphisms f:A->B and g:D->A, f o id = f and id o g = g, so id is the identity in the sense that it has no effect on composition.

Notice the difference between composition and other algebraic operators- normally we can just say that there is some underlying set, S, and the operator is a function (using additive notation) +:SxS->S. If we have an identity element we say e is in S and for all b in S, b+e = e+b = b. Composition cannot operate on arbitrary elements like the operator + can; it can only act on composable elements. Being composable means that the target of the first is the source of the second.
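To make the composability restriction concrete, here is a toy category given as plain data (the objects and morphism names are made up for the example). Composition is a partial operation: it is only defined when the target of the first arrow is the source of the second.

```python
# A tiny concrete category: three objects, their identities, two arrows
# f: A -> B and g: B -> C, and the one composite they force to exist.

objects = {"A", "B", "C"}
# morphism name -> (source, target)
morphisms = {"id_A": ("A", "A"), "id_B": ("B", "B"), "id_C": ("C", "C"),
             "f": ("A", "B"), "g": ("B", "C"), "gof": ("A", "C")}

def compose(g_name, f_name):
    """g o f, defined only when f's target equals g's source."""
    if morphisms[f_name][1] != morphisms[g_name][0]:
        raise ValueError("not composable")
    if f_name.startswith("id_"):
        return g_name  # identities have no effect on composition
    if g_name.startswith("id_"):
        return f_name
    return "gof"  # the one nontrivial composite in this category
```

Trying to compose, say, f with itself raises an error- the pair simply is not composable, just as the axioms say.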

Each of the elements in this definition can be interpreted in many ways- that's part of the fun of category theory. One way to think of morphisms is as transformations that morph one thing into another. Composition then means to do one transformation and then another. In the category of sets the objects are sets, and the morphisms are functions. In this case we can think of a function turning one set into another by acting on each of its elements. In general, however, the objects can be anything, and they may not have elements inside them (if they are numbers, for example).

I'm hoping that this series will record what I've learned about CT and will provide many interpretations of different parts of the theory. If I can get far enough I would like to talk about categorical functional type theory and topos theory, which I'm currently struggling desperately to learn.

Monday, April 25, 2011

Thinking about Category Theory

I've been trying to teach myself category theory for over a year now. It's affectionately referred to as "abstract nonsense" at times, and I've been wading into the abstraction with little direction, guided only by my love of computation and my desire to understand how we can use category theory in computer science.

At first I thought of categories as domains of mathematical discourse. Set, the category of sets, is set theory, Mon, the category of Monoids, is monoid theory, etc. This is a pretty abstract notion, and most of elementary category theory seemed difficult and complex in this interpretation.

Then I realized that categories are really just dots and lines. This helps with understanding the importance of commutative diagrams, as they just specify which dots are connected by which lines.

Then I realized that categories are just another mathematical object, like lattices, algebras, sets, groups, etc. I don't know why it took me so long, but I was thinking of categories as being so deep and complex that I didn't see the obvious. Just like other mathematical objects, it is interesting to see which objects are categories and which are in categories- there is a category of posets, and each poset is a category; there is a category of monoids, and each monoid is a category.

Then I was thinking about the algebraic structure of categories and wondering if it related to any other structure. The lack of closure for composition (not all morphisms are composable) confused me for a little while- all other algebraic structures start with closure as a basic axiom. Well, if no other algebraic structure is like a category, then that is the definition of a category- an algebraic structure that is not closed, but rather whose operator can only act on pairs of composable morphisms (along with a series of other axioms).

All of these interpretations are useful at different times, and it's been fun to see this concept from different angles.

I plan on posting more about category theory pretty soon- I've been making huge breakthroughs in understanding categorical function type theory and category theory is starting to be a useful way of thinking and understanding new material for me.

Wednesday, April 20, 2011

Better Than Wikipedia

Today I discovered a site similar to Wikipedia in the sense that you can get into long, deep investigations of topics based on links in articles that can lead all sorts of places. This site is: http://nlab.mathforge.org/nlab. I found it while looking at the n-category cafe blog.

Some articles- the term evil in category theory: http://nlab.mathforge.org/nlab/show/evil, type theory: http://nlab.mathforge.org/nlab/show/type+theory, and the notion of a theory in the sense of an algebraic theory or a type theory http://nlab.mathforge.org/nlab/show/theory.

I've been trying to understand categorical type theory and the semantics of type theories over the lambda calculus as the internal language of certain categories for a while. The nice thing about this site is that all the articles are either just beyond my reach or much beyond my reach, as if you are not completely confused you are not learning fast enough!

Tuesday, April 19, 2011

Learning Genetic Algorithms, Robust Gene Expression Programming

My thesis is on a new form of Gene Expression Programming that I've named Robust Gene Expression Programming (RGEP). It is designed to be simpler and more robust than its closest relatives, without losing generality or performance. In this post I'm just going to describe each part of RGEP without a complete explanation.

RGEP consists of the following stages: a single initialization step, followed by n generations consisting of gene expression, evaluation, selection, point mutation, rotation, one point crossover, and two point crossover.

The population consists of bit vectors of some fixed length. Initializing them is easy- they are completely random. All individuals are valid, so there is no need to do any checking while creating individuals.

Gene expression consists of splitting the bit vectors (which are symbol lists where the symbols can only be 1 or 0) into fixed size groups called codons. Each codon is expressed into a symbol, forming the raw symbol lists. This can then be edited/expressed into a tree using the postfix notation from the last post. The resulting tree can then have its fitness determined.
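A sketch of the front half of this expression step (the codon size and symbol table here are made-up examples, not fixed parts of RGEP):

```python
# Split a bit vector into codons, then decode each codon into a symbol.

def codons(bits, size):
    """Split a bit vector into fixed-size codons."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]

def decode(codon, table):
    """Read a codon as a binary number, wrapped into the symbol table."""
    index = int("".join(map(str, codon)), 2)
    return table[index % len(table)]

def express(bits, size, table):
    """Bit vector -> raw symbol list, ready for editing/tree building."""
    return [decode(c, size_table := table) for c in codons(bits, size)]
```

The wrap-around in decode is one way to guarantee every codon maps to some symbol, which keeps all bit vectors decodable.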

Selection is a simple tournament selection with 2 individuals per tournament and elitism.

Point mutation is just like in a GA.

Rotation rotates an individual just like a wrap-around shift in a register. The rotation point must be a multiple of the codon size so that the rotation occurs on codons, not bits. This prevents rotation from being terribly disruptive.
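A quick sketch of rotation, with the rotation amount passed in instead of chosen randomly:

```python
def rotate(individual, codon_size, n_codons):
    """Wrap-around left shift by a whole number of codons."""
    k = (codon_size * n_codons) % len(individual)
    return individual[k:] + individual[:k]
```

Because k is always a multiple of the codon size, codon boundaries are preserved and each codon survives the rotation intact.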

One point crossover is just like in a GA, and two point crossover is just like one point crossover except we choose two random points and exchange with respect to both.
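Both crossovers are easy to sketch as pure functions, with the crossover points passed in explicitly instead of chosen at random:

```python
def one_point(a, b, i):
    """Exchange everything after point i."""
    return a[:i] + b[i:], b[:i] + a[i:]

def two_point(a, b, i, j):
    """Exchange only the segment between points i and j."""
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]
```

One point crossover is the special case of two point crossover where the second point is the end of the individual.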

So- that was a very short description missing details on the decoding process. I may go into much more detail on this in later posts, as well as other topics in GAs and Evolutionary Algorithms.

Monday, April 18, 2011

Learning Genetic Algorithms, editing

Editing of a symbol list takes valid individuals (individuals whose genetic material encodes a tree) to themselves, so it makes no changes, and invalid individuals to an individual that encodes a tree, hopefully the "closest" such individual. We can add extra symbols, delete symbols, or change symbols, or any combination of those three. In Binary Genetic Programming, the original Genotype-Phenotype Mapping system, the editing stage was very complex. In the systems that followed that line of research the editing only got more complex, as they were intended to investigate the idea of developing an individual using the genetic material as a template. The editing that I will explain is much easier and is novel in the field (not that it is all that clever, but it works well). The downside is that it only works on the more algebraic expressions that GEP evolves rather than the more general ones in other Genotype-Phenotype Mapping systems. On the other hand, it's very nice compared to other techniques in the GEP literature, and it suggests a really cool addition to GEP systems involving stack operators that I hope to get into at some point.

Okay, so let's get to editing! Starting with the expression +1-2*3, let's see if we can create an expression that is pretty close but valid. The tree it expresses to is:


  +
 / \
1   -
   / \
  2   *
     / \
    3   _
Where the _ symbol is an argument that is missing for *. Notice how we filled out the tree btw, depth first. This is prefix notation. A very close valid individual would be:

  +
 / \
1   -
   / \
  2   3
We could do this by "floating" the 3 up to the -, but it is easier to do this if we don't actually create the tree. So! Postfix notation!

Starting with an empty stack, written [], and the expression +1-2*3, let's evaluate the expression in the usual postfix way. First let's reverse the expression, giving 3*2-1+, and evaluate it symbol by symbol. Any terminal is simply pushed onto the stack, so first 3 is pushed- the expression is *2-1+ and the stack is [3]. The * will try to pop enough arguments off the stack (2) and multiply them, pushing the result on the stack. If the stack had been [3, 2] and the next symbol *, the stack would become [6], or [(2*3)], assuming the top of the stack is the rightmost item in the list. Since there are not enough arguments on the stack when the stack is [3], the * is ignored. Then the expression is 2-1+ and the stack is [3]. Then we get -1+ with [3, 2], then 1+ with [(2-3)], or [-1], then + with [-1, 1], then an empty expression and the stack [0].

Notice that the only thing that differs between this evaluation and the prefix one is that the * is ignored, which was the goal we set out to achieve. This mechanism will always remove operators that do not have enough arguments, it will remove as few operators as possible, and it will make no other changes to the expression. It can also be done while evaluating the individual, so we don't even need to create the expression tree at all. This is very cool.
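Here is a sketch of this underflow-ignoring evaluator (the names and the set of operators are mine; the argument order is chosen to match the worked example above, where [3, 2] followed by "-" gives 2 - 3):

```python
# Postfix evaluation that silently ignores any operator which would
# underflow the stack- exactly the editing-while-evaluating trick.

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def eval_postfix(symbols):
    stack = []
    for sym in symbols:
        if sym in OPS:
            if len(stack) < 2:
                continue          # underflow: ignore the operator
            a = stack.pop()       # most recently pushed: left argument
            b = stack.pop()
            stack.append(OPS[sym](a, b))
        else:
            stack.append(sym)     # terminals are just pushed
    return stack

# The reversed expression +1-2*3, i.e. 3*2-1+:
print(eval_postfix([3, "*", 2, "-", 1, "+"]))  # [0]
```

The * is dropped on the fly, and the final stack holds the value of the edited expression 1+(2-3), with no tree ever built.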

One of the nicest things about this notation and editing is that the subexpressions in the tree (the tree starting from any node and taking all of that node's children, and children's children, etc) are close to each other in the symbol lists. This means that symbols close to each other in the genetic material are close in the expression tree, and symbols that are part of a subexpression are never separated by symbols of a different subtree. This is not the case in the original notation, Karva notation, which fills out its trees breadth first. The postfix notation allows us to gain this locality property as well as remove operators that would be a problem in the expression tree. This could be called postfix evaluation ignoring operators that cause a stack underflow.

We have gotten to material that is introduced in my thesis in this post. Perhaps I will post about the other parts of RGEP, and maybe about the algorithm as a whole. The editing stage is the only really novel thing besides some details of the binary encoding and the meaning of the genetic operators. Everything else in RGEP is from the literature on Genetic Algorithms, Genetic Programming, Genotype-Phenotype Mapping systems, and Gene Expression Programming.

Learning Genetic Algorithms, Gene Expression Programming

So, this series of posts is pretty poorly named. I should have titled it Learning Evolutionary Algorithms, as I'm about to get into Gene Expression Programming (GEP), which is a couple of steps removed from Genetic Algorithms in terms of subfields. In another sense GEP is closer to a GA than Genetic Programming is, as it uses a linear encoding (symbol lists again).

The original GEP is very complex, and I will not go over the whole thing. Instead I just want to introduce some ideas and operators, and show how they fit together. I think that Prefix Gene Expression Programming is a much nicer algorithm than the original, so that is the one I will describe in the most detail.

Remember expression trees from last post? Instead of evolving trees directly, in GEP (and Genotype-Phenotype Mapping systems in general) we evolve symbol lists which contain the terminals and operators. This can be exactly like a GA, but often has extra genetic operators beyond Point Mutation and One Point Crossover. It is a distinguishing factor of GEP that it has many different operators, while some of its cousins have only Point Mutation. When we want to measure the fitness of an individual we have to turn it into a tree and determine the fitness of the tree.

I want to make a quick point about fitness before talking about this translation between the list of symbols (the genotype) and the tree (the phenotype). Fitness can be determined in any way that is appropriate for the problem, but a good way of measuring it for many problems is this: we have a list of input/output pairs, and the fitness of a function (assuming the individual encodes a function or can be described by one) called f is the sum of the errors the function makes on each output given the input it is paired with. For example, if an individual's tree encodes the function f(x) = x*2 + 1 and we had some data [(1, 2), (4, 2), (10, 3)] (a list of pairs, the first the input, the second the expected output of an ideal function), we would get |f(1)-2| + |f(4)-2| + |f(10)-3|, which is the sum of the differences between each actual and expected output. I'm using |expr| as the absolute value of the expr between the pipes "|". If we get 0 as the fitness then the individual's encoded function, f, made no errors and was a perfect solution. We may not always have such a simple interpretation of the individual as a function from x to y, so we just have (x, y) pairs, but it will very often be some pairing of inputs of some form to outputs of some form and a way of determining how well an individual's outputs match the ideal outputs.
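As a quick sketch (the names are mine):

```python
def fitness(f, data):
    """Sum of absolute errors over (input, output) pairs; 0 is perfect."""
    return sum(abs(f(x) - y) for x, y in data)

# The example from the text: f(x) = x*2 + 1 on [(1, 2), (4, 2), (10, 3)].
print(fitness(lambda x: x * 2 + 1, [(1, 2), (4, 2), (10, 3)]))  # 26
```

Note that under this measure lower is better, so selection would favor individuals with smaller values.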

So, decoding a symbol list into an expression tree. We can choose any notation we want- infix, prefix, postfix, or anything else. This can be very complex in general- we could be evolving expressions in a programming language, for example- but I am mostly interested in the more restricted case of expressions combining terminals with operators (it's very algebraic, which I might post about sometime). For this purpose the choices of in/pre/postfix notation are the well known ones. The other option is Karva notation, which is not very nice (IMHO).

If our individuals are symbol lists in some notation, we would like it to always be possible to turn them into a tree. This is not a problem for an expression like +1-23, which is 1+(2-3) in infix notation, but what about +1-2*? This expression does not encode a tree, as the operator * does not have two arguments (it has 0 arguments). We can do many things in this case, but the two options I want to look at are: we can forbid such messy individuals, or we can allow them and, when we want them to flower into trees (we get very attached to our individuals sometimes and want them to succeed even if they are not valid expressions), we can make changes to the genetic material in order to make a tree without changing the actual individual. Imagine that we made a copy of the individual, edited it to make it valid, and then turned it into a tree to determine its fitness.

The first option is taken by Prefix Gene Expression Programming, as well as the original GEP. In the original algorithm the symbol lists had two parts- operators could only appear in the beginning of the individual and so they would always have enough terminals after them to have a valid tree. In PGEP genetic operators that caused invalid individuals, like a point mutation from +1*32 to +1*/2, are undone when they happen. This means that the operators must know what a valid individual looks like to make sure they don't cause any problems.

The other possibility is what is common in other Genotype-Phenotype Mapping systems, which are any systems where the linear list of symbols becomes a more complex object (a tree, graph, whatever) before its fitness is evaluated. It is called editing, and the original such systems had some complex editing, as they wanted to evolve programs in a programming language like C.

In the next post I will go over the symbol list idea more, the editing stage that I used in my thesis, and the expression of the linear encoding into an expression tree.

Wednesday, April 13, 2011

Learning Genetic Algorithms, Genetic Programming

The symbol lists of Genetic Algorithms are lots of fun. We can encode all sorts of things if we are clever enough. It is generally considered a good idea to use the most natural structure to encode a problem, if possible. If we are optimizing the parameters to a real-valued function then we should try to use a real-valued encoding (a vector of doubles, for example). But what if the problem has some more complex structure?

There are some very interesting problems that are most easily described by tree structures. Just so we are on the same page, here I will describe the concept of a tree (a binary tree, aka 2-ary tree, for simplicity). A tree is either empty, or it is a node with 2 subtrees, called its children.
Here is an awesome ascii-art tree with the nodes as * and the \ and / connecting each node to its children.


      *
     / \
    *   *
   / \
  *   *
 /
*

In general these trees may have anything labeling their nodes, not just * symbols. If the internal nodes (the ones with children) are functions like +, -, *, or /, and the leaves (the nodes with no children) are constants or variables, then we may have something interesting. A function like f(x) = x*2 + 1 would look like:

    +
   / \
  *   1
 / \
x   2
This can be written in the infix notation as x*2 + 1, in prefix notation as +*x21, and in postfix notation as 12x*+, all of which completely and unambiguously define the tree above (assuming we know the number of arguments each operator symbol (+ and *) requires).
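A small sketch of such trees and their prefix serialization, with nested tuples standing in for tree nodes:

```python
# A tree as nested tuples: (operator, left, right), with strings or
# numbers as leaves. f(x) = x*2 + 1 from the text:
TREE = ("+", ("*", "x", 2), 1)

def prefix(t):
    """Serialize depth first: node, then children."""
    if isinstance(t, tuple):
        op, left, right = t
        return op + prefix(left) + prefix(right)
    return str(t)

def evaluate(t, x):
    """Evaluate the tree at a value of x."""
    if isinstance(t, tuple):
        op, left, right = t
        a, b = evaluate(left, x), evaluate(right, x)
        return a + b if op == "+" else a * b if op == "*" else a - b
    return x if t == "x" else t
```

Serializing TREE gives "+*x21", matching the prefix form above, and evaluating it at x = 3 gives 7.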

If we want to evolve functions like this, or many other things (programs can be described this way using formal grammars and abstract syntax trees), we may want to evolve them directly. This means creating random trees, making sure each function has enough arguments and each leaf has a terminal (a constant or variable or something with no arguments to fill).

A technique called Genetic Programming does exactly this. This gives it a great deal of expressiveness- trees are complex, interesting, and general things. An example of how cool this is- imagine a robot that knows some things about its current situation (location, facing, direction of something good or bad, speed, etc) and needs some controlling program. We can use an expression tree with many variables, not just "x" as in our example, and functions that act on the variables. It may be tricky to get this function to really do anything interesting, but it is not all that hard. The cool thing would be that we can evolve functions for controlling robots like this.

Another cool possibility is to evolve a function to fit some set of data. We can then use the function to describe the data- it tells us something about the data we might not know and it can be used to extrapolate and interpolate. The fitness for a tree is the sum of the error it makes on each data point when used as a function.

Just like a GA, we generate random structures to start the algorithm out. Then we must do something like mutation and crossover. Unfortunately trees are more complex than vectors, and the biological metaphor stops making much sense. This is one problem with the more complex tree structures, and we will see in a later post how the algorithms I am most interested in deal with this by using a symbol vector to encode a tree. Mutation here means choosing a random node and replacing it with a randomly generated tree, which deletes the node and any of its children. Crossover of two trees means choosing a random node in each tree and swapping the subtrees rooted at those nodes.
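A sketch of the subtree swap, with the node choices passed in as paths from the root (0 = left child, 1 = right child) instead of chosen at random:

```python
# Subtree crossover on tuple trees of the form (op, left, right).

def get_subtree(t, path):
    """Walk down the tree following a path of 0s and 1s."""
    for step in path:
        t = t[1 + step]  # children sit at tuple indices 1 and 2
    return t

def put_subtree(t, path, sub):
    """Rebuild the tree with `sub` in place of the node at `path`."""
    if not path:
        return sub
    op, left, right = t
    if path[0] == 0:
        return (op, put_subtree(left, path[1:], sub), right)
    return (op, left, put_subtree(right, path[1:], sub))

def subtree_crossover(t, u, path_t, path_u):
    """Swap the subtrees rooted at the chosen node of each parent."""
    return (put_subtree(t, path_t, get_subtree(u, path_u)),
            put_subtree(u, path_u, get_subtree(t, path_t)))
```

In a real GP system the paths would be chosen randomly, usually with extra restrictions (such as depth limits) to keep offspring manageable.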

This is all well and good (actually it's pretty great- GP is really good at a lot of things), but there are some problems beyond the complexity of the operators (there are normally more restrictions added than what I described). One interesting problem is called bloat- the trees will tend to grow uncontrollably with no associated increase in their fitness. The interesting thing is that not only is this possible, but if a tree is able to do this it will be in its best interest to do so, and so it will start to grow uncontrollably.

An advantage of the more complex structure is that we can enforce all sorts of extra constraints to encode more interesting problems. We can add operators with different types, or use recursion or iteration or some memory space, or lots of things.

All I wanted to do in this post is get a little bit into GP and show how trees can be evolved. Mostly this is just buildup to Gene Expression Programming, but it is important to understand trees and how they might be useful.

Sunday, April 10, 2011

Things Haskell Has Taught Me Or Helped Me Learn About

Monads, functors, categories, limits, the Yoneda Lemma, domain theory, denotational semantics, abstract algebra, the theory of combinatorial species, type theory, the lambda calculi, intuitionistic logic, referential transparency, natural transformations, comonads, currying/partial application, laziness/eagerness, strictness/non-strictness, and adjunctions.

And much more, but that's all I can think of at the moment. I don't completely understand all of these things (actually I'm not sure I would really say I understand any of them). The important thing is that Haskell has taught me about math, logic, and computer science. Also, knowing Haskell has given me a way of understanding new material in math. What a cool language.

Negation in Computer Science

Negation in Computer Science is a funny thing.

There are several important semirings in CS, which miss out on being rings simply because there is no negation. Some examples include the theory of combinatorial species (which describes data structures), the Kleene algebra describing regular expressions aka regular grammars aka finite state machines, and both Boolean and Heyting lattices. The last two have a lot of applications- obviously Boolean algebra is important in CS. Heyting lattices are a generalization of Boolean lattices with an implication structure that does not necessarily satisfy the usual a -> b = -a\/b. Negation in this context is defined as -a = a -> _|_, so "not a" is the same as "a implies bottom". These are important in denotational semantics and domain theory.
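The shared structure can be sketched as a Haskell type class- the class name and operator symbols are made up for this post, not from any standard library. Bool is the simplest instance, and it shows the "no negation" point nicely: there is no b with True <+> b = False, so addition has no inverses:

```haskell
-- A semiring: two monoids, with <.> distributing over <+>.
class Semiring a where
  zero, one    :: a
  (<+>), (<.>) :: a -> a -> a

-- The Boolean semiring: "or" as addition, "and" as multiplication.
instance Semiring Bool where
  zero  = False
  one   = True
  (<+>) = (||)
  (<.>) = (&&)
```

Species, regular expressions, and Heyting lattices would be further instances of the same class, each without additive inverses.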

Negation is also interesting in constructive mathematics, where there is no double negation elimination. Double negation is something like "not refutable"- from --a we can conclude that a cannot be false, but we do not get a proof of a itself.

The translation of statements from classical logic into intuitionistic logic by double negation corresponds to a translation into CPS (continuation passing style), which is pretty amazing. This only occurs when the _|_ type is replaced with some other type, giving the type of continuations (a -> r) -> r, which is related to the Yoneda Lemma in a way that I am having trouble understanding.
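A small Haskell sketch of this connection, with _|_ replaced by an arbitrary answer type r as described- the names Not, DoubleNeg, intro, and run are my own:

```haskell
-- Negation relative to an answer type r: "a implies r".
type Not r a = a -> r

-- Double negation, i.e. the continuation type (a -> r) -> r.
type DoubleNeg r a = Not r (Not r a)

-- Double-negation introduction: a value becomes a CPS computation
-- that hands itself to whatever continuation it is given.
intro :: a -> DoubleNeg r a
intro x k = k x

-- Run a CPS computation by supplying the identity continuation.
run :: DoubleNeg a a -> a
run m = m id
```

With _|_ as r, `intro` is exactly the proof that a implies not-not-a; with an arbitrary r, it is `return` for the continuation monad.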

Double negation also has something to say about the concept of proof irrelevance. A term of a type is a proof of the proposition that its type corresponds to. On the other hand, while the actual term is important from a computational perspective (as it is an algorithm), it is irrelevant from a proof perspective, because we are interested in the existence of a proof, not its particulars. The double negation actually removes the computational content from a proof in a very interesting way. In (a -> _|_) -> _|_, if "a" is inhabited then (a -> _|_) is not, as there is no function from an inhabited set to an empty one. If "a" is uninhabited then (a -> _|_) is inhabited (there is exactly one function from the empty set to the empty set), and so (a -> _|_) -> _|_ is uninhabited. This means that the type (a -> _|_) -> _|_ is inhabited exactly when "a" is, but it has lost the particular term that proves "a", losing the computational content. I wonder if this is related to the difference between Prop and Set in Coq when doing program extraction to other languages.

There is probably more to say, but for now that's all I can think of.

Saturday, April 9, 2011

Visualization of Diversity in a Cellular Genetic Algorithm

A Cellular Genetic Algorithm (CGA) is a regular GA with additional structure on the population. While the normal population in a GA is a multi-set (a set that can have repeated elements), in a CGA it is arranged in some other data structure. Often this is either a list or a torus (a grid whose edges wrap around). This imposes a restriction on crossover, where individuals can only cross with others near them. It also affects selection, where an individual can only be replaced by individuals around them.

One of the neat things about CGAs is that they can have interesting behavior with diversity. It is possible for some sections of the structure (let's consider a grid, as that will be interesting later on in this post) to converge to different solutions than other areas. This means that in the center of some collection of cells (holding individuals) there will be very little change, as mutations will be weeded out (mostly) and crossover will not affect equal individuals. On the other hand, at the edges between two converged groups, selection will cause one group to dominate the other, or a stable edge may appear. Crossover across this edge will cross the traits that make each group highly fit, and may produce very, very fit individuals. This is the coolest thing about this technique. This may also occur in an Island Genetic Algorithm, incidentally.

The other thing I want to introduce before getting into the point of this post is the solution space. This really is a space, as for any two points (encodings of solutions) a distance measure may be taken. If the encoding is simple bit vectors, then the distance is the Hamming distance (the number of bits that differ between the individuals). This distance does not necessarily correspond to the difference between their fitnesses. This gives another dimension to the solution space, which we may imagine as a landscape where the solutions are x, y coordinates and the z value is the fitness of those points. The landscape may be very complex.

Another measure we can make is the "diversity" of an individual. We can take the pair-wise Hamming distance between the individual and the individuals "near" it. In a multi-set population an individual is "near" every other individual, as there is no topology. In a CGA this is not the case, and diversity is measured only against the individuals in cells a certain distance away in the topology (let's say distance 1, so a cell in a grid has 8 neighbors).
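A Haskell sketch of this diversity measure, assuming the grid is a rectangular list of lists of bit-vector individuals (no wrap-around, so edge cells simply have fewer neighbors)- the representation and names are my own:

```haskell
type Individual = [Bool]
type Grid       = [[Individual]]

-- Hamming distance: the number of positions where two bit vectors differ.
hamming :: Individual -> Individual -> Int
hamming a b = length (filter id (zipWith (/=) a b))

-- The up-to-8 neighbors of a cell, staying inside the (rectangular) grid.
neighbors :: Grid -> (Int, Int) -> [Individual]
neighbors g (r, c) =
  [ g !! r' !! c'
  | dr <- [-1, 0, 1], dc <- [-1, 0, 1]
  , (dr, dc) /= (0, 0)
  , let r' = r + dr, let c' = c + dc
  , r' >= 0, r' < length g, c' >= 0, c' < length (head g) ]

-- Diversity of a cell: total Hamming distance to its neighbors.
diversity :: Grid -> (Int, Int) -> Int
diversity g (r, c) = sum (map (hamming (g !! r !! c)) (neighbors g (r, c)))
```

A cell in the middle of a converged group scores 0; cells on the edge between two groups score high.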

The idea of this post is that we can use the diversity measure of an individual as its fitness. This means that we are almost certain to have many different groups converging to different bit vectors. Then we can imagine the grid of fitness values of the individuals, which will tell us not their genetic material but just how much they differ from their neighbors. This is where the visualization comes in.

Suppose we assign some range of colors to the values from 0 to the maximum diversity measure (neighbors * individual length), probably with blue or white around 0 and red around the highest diversity. Then we could make images out of the population's fitness values throughout the run of the CGA and see the interaction at the edges of the groups. Probably the groups will quickly form and one group will take over the others, but really I don't know what would happen. I do know it would look really neat.

Friday, April 8, 2011

BIC and SKI

I like to learn skills that require dexterity like juggling, pen tricks, and butterfly knife tricks (see http://sculptorbyday.blogspot.com/2010/11/its-computer-science-time.html on my twin's blog).
Today I was flipping my pen, and I saw that it said BIC on it. I've been thinking about combinatory logic recently, and all of B, I, and C are well known combinators. The word BIC reduces to the combinator that swaps its arguments, which is called "flip" in Haskell.
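As a quick sanity check, the reduction can be spelled out in Haskell (lowercase names, since uppercase is reserved for constructors):

```haskell
-- The three combinators as Haskell functions.
b :: (b -> c) -> (a -> b) -> a -> c
b f g x = f (g x)            -- composition, Haskell's (.)

i :: a -> a
i x = x                      -- identity

c :: (a -> b -> c) -> b -> a -> c
c f x y = f y x              -- flip

-- "BIC" read as left-associated application: (b i) c, which reduces to c.
bic :: (a -> b -> c) -> b -> a -> c
bic = b i c
```

So `bic (-) 1 10` is `(-) 10 1`, i.e. 9- BIC really is flip.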

Therefore, BIC pens are the best pens for pen flipping tricks. QED.

Thursday, April 7, 2011

Purely Functional Genetic Algorithms

In this post I want to go over an idea I had about a month ago about the representation of a population in a Genetic Algorithm (or really any evolutionary algorithm) where the population is represented by a function. I got this idea from a library called pan which represents images as functions from a coordinate to a color (although it is really more general than just that). A population of a Genetic Algorithm can be considered a function from a pair of integers (the first integer specifies which individual and the second the location in that individual) to a bit (assuming for simplicity that we are concerned with bit vectors). Alternatively, we could represent the population as a function that, given an integer specifying the individual, returns a function that defines that individual and which, given an integer, returns its value at that location.

Each of the genetic operators has to be reinterpreted using this representation. We can't change the population directly, as it has no mutable structure (real functions have no internal state), but we can wrap it up in other functions that enact changes. Let's go through the three main operators- mutation, crossover, and selection.

Mutation requires that some small number of bits are flipped. These bits can be represented by a set of integer pairs specifying not the bits' values, but rather their locations. Their value is not important in the representation. If we make such a set of random pairs, we can create a function that takes as an argument a function representing a population, and returns a function representing the new population. This new function will use the old function to get the value of a particular location, but if the coordinate requested is also in the set of mutated coordinates then it will flip the result before returning it. This means that we use the old population's values, but wrap the function up in case we need to change specific locations.
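As a Haskell sketch (the type names are my own, and I use a plain list where a real implementation would use a proper set):

```haskell
-- A population is a function: individual index -> bit position -> bit.
type Population = Int -> Int -> Bool

-- A locus is an (individual, position) coordinate.
type Locus = (Int, Int)

-- Wrap a population so that the bits at the given loci read flipped.
-- The old population is untouched; we only layer a check on top of it.
mutate :: [Locus] -> Population -> Population
mutate flips pop i j
  | (i, j) `elem` flips = not (pop i j)
  | otherwise           = pop i j
```

Note that `mutate flips` has type Population -> Population, so generations of mutations just stack up as nested wrappers.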

Crossover can be represented by a list of triples. The first two numbers are the crossed pair, and the third in the triple is the cut point between the two individuals. Of course, some individuals will not be crossed at all, and their cut points will be 0. Again, the actual genetic material of the individuals does not matter, only the place that the cut point occurs. This list of triples can be used to wrap up the function representing the population, such that to get a particular coordinate one has to check the respective triple and cut point, and get the value from the previous population. If the requested value is before the cut, we get it from the first individual, and if it is after the cut point then we get it from the second individual.
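A sketch of this wrapper in Haskell, under the convention from the post that a cut point of 0 means the pair was not crossed (the names are my own):

```haskell
type Population = Int -> Int -> Bool

-- Each triple holds the indices of a crossed pair and their cut point.
crossoverWrap :: [(Int, Int, Int)] -> Population -> Population
crossoverWrap triples pop i j =
  case [ t | t@(a, b, _) <- triples, i == a || i == b ] of
    ((a, b, cut):_)
      | cut > 0, j >= cut ->
          pop (if i == a then b else a) j  -- after the cut: read from the partner
    _ -> pop i j                           -- before the cut, cut 0, or unpaired
```

Again the result has type Population -> Population, so it composes with the mutation wrapper.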

Selection is particularly easy to represent. Each selected individual- regardless of the selection mechanism- gets its index recorded in a list as long as the population. This list is like a switchboard- when an index is requested, the index from the previous population is returned based on the value in this list. So to get the first individual in the new population, we return the individual in the previous population whose index is recorded in the first position of this list. If we don't want to make the choice of the actual individual, as we will see at the end of the post, we can just use tournament selection and record the tournaments, deciding the winners when we have the individuals. Something similar is possible with roulette wheel selection, where we record values from 0 to 1 and use them as the spins for the wheel.
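The switchboard is a one-liner in Haskell (names my own):

```haskell
type Population = Int -> Int -> Bool

-- chosen !! k is the index, in the old population, of the k-th
-- selected individual in the new population.
select :: [Int] -> Population -> Population
select chosen pop i j = pop (chosen !! i) j
```

So `select [0, 0, 1]` builds a new population whose first two individuals are both copies of old individual 0.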

With all the genetic operators represented this way, we simply have to create a population function and wrap it up in higher order functions again and again to perform a run of the algorithm. This turns out not to be terribly fast in my trials, as after a couple of operators there are a large number of bit flips, checks, and indirections necessary to get the value of a location. I've tried several things to make it faster, but at least the naive implementation does not appear to scale well. It is still an interesting idea, however. If the population were somehow very sparse it might even save space, as you only have to record changes, not all the actual bits. In fact, if the population started out completely uniform- for example all bits set to 0- such as before the initialization step, then the population could be represented by a very concise function- the constantly 0 function. This requires a fixed constant amount of memory, which might be a huge win in some cases.

Unfortunately I have not been able to get this idea to be any better than the usual way of doing a GA. The only thing that it has led to is an idea about investigating the effect of the initial population on the resulting population. If all the generations were generated as I've described (and not necessarily stored wrapped up in a function, but rather just as the data structures) then a run could be performed many times. If we generate some population, and then run the same GA many times with the same mutations, crossovers, and selections, then we could flip each bit one at a time and see how it affects the resulting population. This would tell us if some bits were more important than others. The coolest thing would be a "heat map" where each bit in the original population is replaced with a color based on how many bits in the final population it affected. This would be really cool to look at, if nothing else.

Wednesday, April 6, 2011

The Proof Assistant Coq

I've been learning to use the proof assistant Coq for an Artificial Intelligence project. It is lots of fun to learn about, and it gives a nice concrete way to view the Calculus of Constructions (technically it uses the Predicative Calculus of (Co)Inductive Constructions, apparently).

One of the funny things I've seen about it is that it is possible to get this compilation error: "Universe Inconsistency". Is this not the most intimidating error possible? The universe is inconsistent? Oh god. What do we do? Luckily all this means is that the stratification of the type Type, which occurs behind the scenes, failed because of some definition. Stratifying Type allows the calculus to remain predicative without making the user manage the infinite hierarchy of universes indexed by the natural numbers.

There are two languages in Coq- Gallina, the term language, and the administration language called the Vernacular.

I need to do a sample project to evaluate the tool, so I've chosen a very simple one- I'm translating the untyped lambda calculus into the system SKI. It's a simple translation on inductive datatypes. I'm not sure how to show that the translation makes any sense- my first thought was some proof of extensional equality, but I don't think that's really possible. I suppose something might be possible, but I can't think of anything that doesn't require reducing lambda terms into a normal form, which is, in general, impossible. I'm sure I'm missing something here.

Coinductive types are pretty neat. They are often infinite, and yet in Coq programs using them must always be productive. This is handled by rules on their use that ensure they do not cause some potentially unbounded recursion (co-recursion?). This certainly forbids some programs that would have been perfectly well-behaved, but in this sort of thing we are willing to give up some generality for strong normalization.

Speaking of strong normalization, while the actual logic of the system is intuitionistic, it is possible to do classical proofs. Adding the LEM (Law of the Excluded Middle) to a module does not break the whole system, but I don't know how it interacts with the assurance of termination or the types of proofs possible.

So yea, Coq is pretty awesome. So are the other systems built using the CoC like Agda and Epigram; I only chose Coq because it is a proof assistant and so seems more closely connected to AI to me. It really is amazing to see it prove things, and interactive sessions seem like a conversation with a computer that is capable of fairly complex proof techniques.

Tuesday, April 5, 2011

Learning the Lambda Calculus, second order types

In a previous post we looked at the simply typed lambda calculus, which corresponds to a propositional logic where we can only form statements with implications such as "p implies q" written as p -> q with the meaning of "a function from terms of type p to terms of type q". If we allow universal quantification over types we get the second order lambda calculus aka the polymorphic lambda calculus aka System F. The name System F is, of course, awesome, but there is a more powerful system with an even more awesome name- The Calculus of Constructions. Later on in this post we will see something in the PLC (polymorphic lambda calculus) that is common in programming languages, but rarely given its true name (type constructors as types with higher order kinds).

System F does not correspond to what you normally hear called second order logic, which may be called second order predicate logic, but rather to (the implication fragment of) second order propositional logic. Interestingly Haskell's concept of a type class gives us predicates, which makes it more like first order predicate logic, which is awesome.

So, we have two tools- implication in the form of lambda abstractions written as ->, and universal quantification in the form of type abstraction. In the simply typed lambda calculus we could specify the identity function as (\x:a.x), read as "for x of type a, x", which has no effect on its argument. The problem with this is that this function only takes arguments of type "a", so we have to define a different identity function for each type, even though all identity functions look exactly the same. Universal quantification to the rescue! Now we can define the identity function once and for all, introducing some new notation, as (\a:*.\x:a.x), which is read as "given a type called a and a term of type a, return that term". I don't have a good way of showing in text the difference between abstraction of types and abstraction of values, but the first argument's type is *, which is the type of all types. I kind of like using "\" for both term and type abstractions as they follow mostly the same rules. This makes "*" a kind of second order type, and it is called a kind. We pronounce it "star". To use this term, we must first supply some type, then some term of that type, and we will get back our term unchanged like so- (\a:*.\x:a.x) Integer 3 =beta 3. In logic this would look like "forall a, a -> a", meaning that all things imply themselves.
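In Haskell the quantifier can be written out explicitly (it is usually left implicit), which makes the correspondence easy to see- the name polyId is my own:

```haskell
{-# LANGUAGE ExplicitForAll #-}

-- The polymorphic identity, with its universal quantifier spelled out:
-- the Haskell rendering of (\a:*.\x:a.x) : forall a. a -> a.
polyId :: forall a. a -> a
polyId x = x

-- Instantiating the quantifier at Integer and applying, as in
-- (\a:*.\x:a.x) Integer 3 =beta 3.
three :: Integer
three = polyId 3
```

The type application happens silently in Haskell- the compiler picks the type for us, which is the type inference discussed below.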

The nice thing about this situation is that we can define functions that are the same for all types. This gives what is called parametric polymorphism, similar to Java's Generics. In Haskell the type system is a series of extensions on the second order lambda calculus, so we don't have some system tacked on to the language but rather a built-in and well understood form of parametric polymorphism.

I want to mention that Haskell's type system isn't quite System F, but rather the Hindley-Milner system, which restricts the locations that the universal quantification can occur in order to allow type inference. This means that the compiler can determine types for you, which is hugely convenient and useful, while imposing a very small restriction to the set of valid programs.

Let's look at one more interesting thing about this system. For the term (\a:*.\x:a.x) we must supply a type. What if we supply the term's own type? Well, its type is written (in my own ascii art interpretation of the actual symbols) as (\a:*.\x:a.x) : (\a:*.a->a). Therefore we have that (\a:*.\x:a.x)(\a:*.\x:a.x) =beta (\b:(\a:*.a->a).\x:b.b) : (\b:*->*.\x:b.b) : (* -> *) -> (* -> *). This is pretty hard to see just looking at the terms, especially as I don't have all the symbols I would like (like both lower and upper case lambdas, product types as the big pi symbol, and the box called "sort" which is the type of all kinds (which brings us to the fourth layer of the type system)). The important thing to see here is that the kind of the argument is not *, but rather (* -> *), and the kind of the resulting term is (* -> *) -> (* -> *). The kind * -> * is something that needs a type and returns a type. This is what I was talking about in the beginning of this post as something we see in programming. The array type in most languages (such as C, C++, Java, Haskell) is not itself a type. There is no such thing as an array, there are only arrays of some type of object. We can have arrays of ints, written int[], or of strings, written String[] (in Java), or of Objects, written Object[]. This means that arrays need a type, which we can call A, and return a type, an array of As. The term we just saw has a more complex kind, and it happens that supplying the identity function its own type gives a kind of second order identity- the identity on type constructors.
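We can see kinds like * -> * at work in Haskell directly- in the following sketch (the name conId is my own), the type variable f must stand for a type constructor of kind * -> *, such as Maybe or the list constructor []:

```haskell
-- An identity whose type variable f has kind * -> *: it works uniformly
-- for any one-argument type constructor applied to any type.
conId :: f a -> f a
conId x = x
```

GHC infers the kind * -> * for f here; asking `:kind Maybe` in GHCi shows the same thing for the Maybe constructor.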

In Haskell we can have kinds formed with * and ->, which may remind us of something. It is very similar to an untyped lambda calculus, as different occurrences of * may be different types.

I think that's enough for now. Having mentioned the Calculus of Constructions, maybe I will post about it later on.

Learning the Lambda Calculus, lambda -> ski

We have seen both SKI and the lambda calculus, and the fact that they are the same. I want to go deeper into this connection between them because I'm doing a project for my Artificial Intelligence 2 class where I've decided to prove their equivalence with a proof assistant called Coq.

Before going over the proof (which is very easy), notice that there are many combinators besides S, K, and I. We have seen Y, but other cool ones are B = (\f.\g.\x.f(gx)), which composes two functions called f and g here (this is called "." in Haskell to mimic the notation in math of a small unfilled circle), and C = (\f.\x.\y.fyx) which swaps two arguments of a function (called flip in Haskell). These are cool because they give nice little transformations and show how combinators (and the lambda calculus in general) express control flow. Concatenative languages can be considered the composition of combinators in a cool way. Also, combinators appear in the compilers for functional languages. Very cool things, them.

So, the proof. The lambda calculus is a system of universal computation, so it can compute anything that can be computed (by definition). If we can turn any term of the LC into an equivalent term of the SKI combinatory calculus, then SKI is universal as well. There is some difficulty with the notion of equality here- the translation is not a simple bijection, as a lambda term translated into SKI can be translated back, but it will not always be the same term. We may take an intensional notion of equality (an intensional equality considers terms to be equal if they are exactly the same syntactically) but this will not work in general. Remember that not all reductions terminate in the lambda calculus- that is one of the reasons type systems exist. To test if two terms are exactly the same syntactically we would want to reduce them to their normal forms (as normal forms are unique and therefore are a sort of "identity" for a term). This computation of the normal form may not terminate, and therefore we can't use this notion of equality. On the other hand, we can use an extensional notion of equality- terms are equal if they behave the same in all possible uses. While intensional equality is concerned with the "internal" representation of a term, extensional equality is concerned with its "external" behavior. The translation that we will be looking at may not get us back where we started, but it will get us somewhere equal to where we started if we consider terms equal when they behave the same way.

The actual translation is by cases. This is very easy to express in Haskell, so I've just now written a translator from SKI to lambda calculus and back, as well as a function to reduce an SKI and a function to reduce a lambda term, both by only one reduction (again, if we keep reducing until we are done we may never terminate as in the case of the term (\x.xx)(\x.xx)).

In code we have:
type Id = String

data SKI = S | K | I | App SKI SKI | Var Id deriving (Show, Eq)
data Lambda = LVar Id | LApp Lambda Lambda | Abs Id Lambda deriving (Show, Eq)
As data, and for the translation:
fv (Var id) = [id]
fv (App e1 e2) = fv e1 ++ fv e2
fv _ = []

freeIn id term = id `elem` fv term

lambda2ski :: Lambda -> SKI
lambda2ski (LVar id) = Var id
lambda2ski (LApp e1 e2) = App (lambda2ski e1) (lambda2ski e2)
lambda2ski (Abs id term) = lamb id $ lambda2ski term

lamb id term | not (id `freeIn` term) = App K term
lamb id (Var id') | id == id' = I
lamb id (App e1 e2) = App (App S (lamb id e1)) (lamb id e2)

This is the code only to translate one way, the other way is even easier. The function fv gets the list of free variables in a term, and the function freeIn tests if an identifier (a String) is a free variable in a term. Of course, lambda2ski does the translation, and lamb is a helper function to take care of lambda abstractions.
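For completeness, here is a sketch of that easier direction, assuming the same datatypes as above (repeated here so the snippet stands alone)- each combinator simply expands into its defining lambda term, using fixed bound-variable names, which is fine because the expansions are closed terms:

```haskell
type Id = String

data SKI    = S | K | I | App SKI SKI | Var Id          deriving (Show, Eq)
data Lambda = LVar Id | LApp Lambda Lambda | Abs Id Lambda deriving (Show, Eq)

-- Translate SKI back into the lambda calculus by expanding each
-- combinator into the lambda term that defines it.
ski2lambda :: SKI -> Lambda
ski2lambda (Var v)   = LVar v
ski2lambda (App a b) = LApp (ski2lambda a) (ski2lambda b)
ski2lambda I         = Abs "x" (LVar "x")                      -- \x.x
ski2lambda K         = Abs "x" (Abs "y" (LVar "x"))            -- \x.\y.x
ski2lambda S         = Abs "x" (Abs "y" (Abs "z"               -- \x.\y.\z.xz(yz)
  (LApp (LApp (LVar "x") (LVar "z")) (LApp (LVar "y") (LVar "z")))))
```

Round-tripping a term through lambda2ski and ski2lambda generally does not give back the original term, only an extensionally equal one- which is exactly the point about equality made above.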

For those who do not know Haskell, the translation from Wikipedia is this:
1. T[x] => x
2. T[(E₁ E₂)] => (T[E₁] T[E₂])
3. T[λx.E] => (K T[E]) (if x does not occur free in E)
4. T[λx.x] => I
5. T[λx.λy.E] => T[λx.T[λy.E]] (if x occurs free in E)
6. T[λx.(E₁ E₂)] => (S T[λx.E₁] T[λx.E₂])

Where T[term] is the translation function applied to the term "term". Having embedded the lambda calculus in SKI we have proven it as a system of universal computation, assuming the extensional equality of the translated term with the original.

It is interesting that the reverse translation, also given on the Wikipedia page for combinatory logic, will make the resulting lambda term longer than the original. For the terms I have played with, it seems like translating from the LC into SKI and back always makes the term larger, although I'm not making an attempt to do reductions at the moment.

Well, that was fun. Hopefully this works out nicely in Coq as well, and I can make the project a computer verified proof of this translation.

Saturday, April 2, 2011

Learning Genetic Algorithms, the max problem

One of the nicest things about Genetic Algorithms is how general they are- notice that the operators we looked at in the last post did not make any reference to the problem they were involved with solving (except fitness evaluation, but that can easily be replaced with whatever we want). The main thing we have to think about when solving a new problem with a GA is how to encode solutions to the problem.

One problem we might want to use a GA to solve is finding the "best" series of choices. For example, we may have many compiler flags, purchase choices, forks in the road, or any other type of yes/no decision to make. We can certainly do more than just yes and no, but for simplicity let's consider the situation where we must make 6 choices. To make it even simpler, it will turn out in our example that to get the best answer, we should always choose yes. The maximum fitness is 6 and the perfect individual will get this fitness by choosing "yes" for each choice.

So how do we write down random symbols that encode a series of 6 choices? Like so- 001010, 111000, 101010, 001011. This will be our population- each individual will choose "yes" for the indices where it has a 1, and "no" for the indices where it has a 0. Since we want the individuals to always choose yes, the fitness function will just count the number of 1s in the individual.
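As a one-line Haskell sketch, with individuals written as strings of '0's and '1's:

```haskell
-- Fitness for the max problem: count the 1s in an individual.
fitness :: String -> Int
fitness = length . filter (== '1')
```

So fitness "001010" is 2, and the perfect individual "111111" scores 6.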

Remember the first step is determining fitness- f(001010)=2, f(111000)=3, f(101010)=3, f(001011)=3. Now let's let selection choose a new population- 111000, 111000, 111000, 001011. The way this is actually done is by creating something like a pie chart, where each individual gets a slice proportional to how high its fitness is. A little spinner is spun 4 times, in this case, and the individual that it lands on gets copied into our new population. This is called Roulette Wheel selection.
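Roulette wheel selection can be sketched in Haskell like so- I factor the random spins out as arguments (numbers in [0,1)) so the example is deterministic, and the names are my own:

```haskell
-- Select one individual per spin; each individual's slice of the wheel
-- is proportional to its fitness.
roulette :: [(String, Int)] -> [Double] -> [String]
roulette pop spins = map pick spins
  where
    total = fromIntegral (sum (map snd pop))
    -- Scale the spin onto the wheel and walk the slices until it lands.
    pick s = go (s * total) pop
    go _ [(ind, _)] = ind                    -- last slice catches the rest
    go r ((ind, f):rest)
      | r < fromIntegral f = ind
      | otherwise          = go (r - fromIntegral f) rest
```

With population [("a",1),("b",3)], a spin of 0.0 lands in a's quarter of the wheel and a spin of 0.5 lands in b's three quarters.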

I want to mention that there are other selection methods. One, called Tournament Selection, just holds a series of tournaments to determine who gets copied into the new population. Each tournament consists of a random selection of individuals from the population (normally 2) and the competition is just selecting the individual with the higher fitness most of the time, and randomly choosing a less fit individual as the winner some of the time (normally 75% of the time the better individual wins). An individual might participate in many tournaments, or it might not be chosen for any tournaments at all.

Now lets do some mutation- 111000, 111000, 111000, 001011 becomes 111000, 111000, 111100, 001011.

And crossover, pairing individuals up randomly- (111000, 111100), (111000, 001011). Let's say both pairs are chosen for crossover, as the rate of crossover is normally set pretty high (70% or so). The first pair gets the space between the third and fourth index as the cut point, and the second the space between the fifth and sixth. (111|000, 111|100), (11100|0, 00101|1), where the "|" is the cut point, becomes 111100, 111000, 111001, 001010.
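One-point crossover is a pleasantly small function in Haskell- split both parents at the cut and swap the tails:

```haskell
-- One-point crossover on bit strings at the given cut point.
cross :: Int -> String -> String -> (String, String)
cross cut a b = (take cut a ++ drop cut b, take cut b ++ drop cut a)
```

Running the two pairs from the example, cross 3 "111000" "111100" gives ("111100", "111000") and cross 5 "111000" "001011" gives ("111001", "001010"), matching the results above.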

We are going to go for more than one generation in this example, so let's start from the beginning- f(111100)=4, f(111000)=3, f(111001)=4, f(001010)=2. Notice that not all that much progress has been made, and we certainly don't have a perfect solution (which would be 111111). Well, let's keep going- selection gets us 111001, 111001, 111001, 111001. It looks like one individual has taken over the population- we say the population has converged.

This population has no diversity, which is a very bad thing. Diversity is vital in a GA- it is the recombination of diverse partial solutions that creates good results. Now we have to rely on mutation to randomly flip the 4th and 5th digits from 0s to 1s in order to discover the best individual. This is one of the contributions of mutation to a GA- it will introduce new genetic material into a population that no longer has the diversity to find good solutions through crossover.

I won't go through the details of the rest of the run- imagine that mutation takes several more generations to flip those last couple of indices to 1s, and we find that the resulting population is all individuals that look like 111111. Maybe a 011111 is thrown in there too- mutation can decrease fitness just as easily as increase it.

This problem- finding an individual of all ones- is called the MAX problem for GAs. There are many standard problems for different techniques that try to isolate properties of the problem solver (a GA in this case) so that variations of the basic method can show whether they do better than others in the aspects these problems try to showcase.

I sneakily introduced the "yes or no" encoding as 1s and 0s to introduce the most common encoding for individuals in a GA- a binary string aka bit vector aka {0, 1}* aka base two numeral.

I think next time we will move on to Genetic Programming on our long road to RGEP. Good times ahead!

Friday, April 1, 2011

Learning Genetic Algorithms, cont

In this post I will introduce the biological metaphor of Genetic Algorithms, and try to develop a simple example problem. We will see how to encode solutions for this example problem and motivate the parts of a GA.

The sample problem will be trying to find a value, x, maximizing the function f(x) = x^2- the function that takes a single input number, let's say a positive whole number, and returns its square. We want to know what input number between 0 and 1000 (exclusive, so 0-999 inclusive) will get us the highest number as a result. The solution space is the numbers from 0 to 999. It will be convenient to write all numbers with 3 digits, so 0 is written as 000 and 23 as 023, etc.

The answer is obviously 999. If this partially random search of the solution space is to be "intelligent" it should arrive at the same conclusion that we have come to.

Now that we have a problem to solve (the max output of f(x)=x^2), a solution space (the numbers 0-999), and a way of writing down possible solutions as a list of symbols (000-999, which is all possible three symbol words using the numbers 0-9), we are ready for the biological metaphor. The story is that the list of symbols is sort of like the genetic material of an organism. The number that it corresponds to could be considered its phenotype- so 023 is the genetic material that gives rise to the number 23 as the phenotype. The fitness of the individual is the result of putting its phenotype (its "body") into the fitness function (here it's our squaring function). The result of squaring its phenotype tells us how well it solved our problem, because we are looking for the highest output of the function f.
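The genotype-phenotype-fitness pipeline is tiny in Haskell (the names are my own; `read` happily parses leading zeros):

```haskell
-- Genotype to phenotype: a three-digit string denotes a number.
phenotype :: String -> Int
phenotype = read

-- Fitness: the value of f(x) = x^2 at the individual's phenotype.
fitness :: String -> Int
fitness g = phenotype g ^ 2
```

So the genotype "023" has phenotype 23 and fitness 529.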

This is a nice little relation between the components of the problem and the situation in nature, but there are still parts missing. In nature we may expect there to be many individuals- in a GA we have many lists of symbols, not just one. Also we expect that fitness isn't just a number- fitness should affect the survival of the individual. It is not enough for fitness to affect survival, as there must also be some sort of variation between generations (which implies some way of creating children). This situation gives us the inspiration for the rest of the system.

Let's outline the situation using these new facts: there is a population of lists of symbols, which will be random (our example population will be 900, 002, 219, and 432); there is some selection of good solutions that kills off some bad solutions; there is some random variation corresponding to mutation in natural genetics; and there is some mechanism for creating children, a crossover similar to crossover in natural genetics. We hope that good solutions will be copied many times during selection, that mutation will improve solutions rather than make them worse, and that crossover will take two good-ish solutions and create really good ones by combining the parts that made each of them good into an individual with the traits to be great.

Starting with the population of randomly generated individuals 900, 002, 219, and 432, we first turn them into their phenotypes- 900 becomes 900, 002 becomes 2, 219 becomes 219, and 432 becomes 432. Now we can put them into our fitness function: f(900)=810000, f(2)=4, f(219)=47961, f(432)=186624. Clearly 900 is a good (but not perfect) solution, and 432 is pretty good too. Now let's select some of the good ones, allowing them to be selected many times, so the new population may have several individuals that look exactly the same. I won't go into the details of how to do this in this post, but let's say we end up selecting a new population from the old (think of the individuals as reproducing) and we get 900, 900, 219, and 432. The 219 wasn't as good as 900 or 432, but the "best" individual doesn't always live in nature and the "worst" doesn't always fail to reproduce.
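I'm leaving the selection mechanism unspecified above, but for the curious, one common option is fitness-proportionate ("roulette wheel") selection, sketched here- each individual is drawn with probability proportional to its fitness, with replacement, so fit individuals can show up several times:

```python
import random

def roulette_select(population, fitness, k):
    """Fitness-proportionate selection: pick k individuals, with
    replacement, weighted by fitness."""
    weights = [fitness(ind) for ind in population]
    return random.choices(population, weights=weights, k=k)

random.seed(0)
pop = ["900", "002", "219", "432"]
new_pop = roulette_select(pop, lambda g: int(g) ** 2, k=4)
print(new_pop)  # mostly copies of the fitter individuals, like 900 and 432
```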

Now let's do some mutation. This means that we take our new population (900, 900, 219, 432) and just randomly change some digits to random numbers. After mutation the population may look like this: (900, 900, 299, 132), where exactly two digits changed, in different individuals. It looks like the 432 that was mutated into 132 got much worse, but that happens sometimes. The 219 that was mutated into 299 certainly got better, which is nice.
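Point mutation is easy to sketch- each digit is independently replaced by a random digit with some small probability (the rate below is an arbitrary choice of mine):

```python
import random

def mutate(genotype, rate=0.1):
    """Point mutation: each digit is independently replaced by a
    random digit with probability `rate`."""
    digits = "0123456789"
    return "".join(random.choice(digits) if random.random() < rate else d
                   for d in genotype)

random.seed(1)
print([mutate(g) for g in ["900", "900", "219", "432"]])
```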

We are almost to the end of the first generation! Now we just have to do crossover. This means that we pair up the individuals, and sometimes the "happy couple" crosses their genetic material, and sometimes (again, randomly) they don't. Let's say they are paired like this: (900, 299) and (132, 900), and that only the first pair (900, 299) chose, or was chosen, to be crossed. Now we must choose a (random) place in the genetic material to do the cross, so let's choose between the first and second digit. This splits the individuals into the pieces (9, 00) and (2, 99). We take the first part of the first individual (9) and the second part of the second individual (99) and put them together, getting 999, and the first part of the second individual (2) and the second part of the first individual (00), which combine to 200. The new population is 999, 200, 132, and 900.
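The splitting-and-swapping just described is one-point crossover, which we can sketch like this (with the cut point made explicit so we can reproduce the example):

```python
import random

def crossover(parent_a, parent_b, point=None):
    """One-point crossover: cut both parents at the same position and
    swap the tails. With point=1, ("900", "299") -> ("999", "200")."""
    if point is None:
        point = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

print(crossover("900", "299", point=1))  # ('999', '200')
```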

This might all have seemed pretty random and not very intelligent- the only thing that wasn't *completely* random was selection. Mutation and crossover could create terrible mutants that perform nowhere near as well as the originals, just as they could possibly create good solutions. So why did we do this at all? Well, for one thing, in practice we would repeat the selection, mutation, and crossover many times on a much larger population, so there would be a real chance of getting a perfect solution.

The other thing to notice is the first individual in the new population- it's the perfect solution, 999! It looks like we started with a small random population that didn't contain a perfect individual and randomly ended up with the correct solution! It's almost as if I had this in mind from the very beginning!

The most vital thing to notice here- really the point of this whole post- is that the original population didn't contain a perfect individual, and none of selection, mutation, or crossover created the perfect solution by itself. Selection ensured that the individual 900, which is a pretty good solution, appeared in the population more than once. Mutation introduced a pattern that was not in the original population: the 99 in 299, which we could write as #99, with # as a "don't care" symbol- the pattern says we are interested in the last two digits being 9s, and we don't care what the first digit is. Crossover took two patterns that helped the fitness of the individuals they were in (#99 and 9## are the patterns that combined) and produced the individual 999, which has both useful patterns.

This is the basic Genetic Algorithm. These three operations, performed many times, are the canonical example of an Evolutionary Algorithm and are considered the primal form of the Genetic Algorithm. The problems can be much more complicated and useful, the genetic material can use many different symbols, and different digits may be constrained to different values. The genetic material may also encode many numbers- 934202 could be two inputs to a function f(x, y) = x*y, where x=934 and y=202 in the example individual.
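Putting the pieces together, the whole generational loop looks something like this- a toy sketch for our x^2 problem, where all the parameter values (population size, rates, generation count) are arbitrary choices of mine, not prescriptions:

```python
import random

def run_ga(pop_size=20, length=3, generations=50, mut_rate=0.05, cross_rate=0.7):
    """A toy GA maximising f(x) = x^2 over 3-digit strings, wiring together
    fitness-proportionate selection, point mutation, and one-point crossover.
    pop_size should be even so individuals pair up for crossover."""
    digits = "0123456789"
    fitness = lambda g: int(g) ** 2
    # random initial population
    pop = ["".join(random.choice(digits) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        # selection (add 1 to each weight so an all-zero population still works)
        weights = [fitness(g) + 1 for g in pop]
        pop = random.choices(pop, weights=weights, k=pop_size)
        # point mutation
        pop = ["".join(random.choice(digits) if random.random() < mut_rate else d
                       for d in g) for g in pop]
        # pair up individuals; each pair crosses with probability cross_rate
        next_pop = []
        for a, b in zip(pop[::2], pop[1::2]):
            if random.random() < cross_rate:
                point = random.randint(1, length - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            next_pop += [a, b]
        pop = next_pop
    return max(pop, key=fitness)

random.seed(0)
print(run_ga())  # usually at or near "999"
```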

Hopefully Genetic Algorithms aren't so mysterious now, and we can move on to more interesting material. The basic Genetic Algorithm is surprisingly similar to my own work- Robust Gene Expression Programming- but we will have to introduce many new concepts to get there from here.