Tuesday, January 29, 2013

Stochastic Universal Sampling

I'm not a big fan of roulette wheel selection. I've seen it remove the diversity of a population much too quickly for a solution to be found, and it bothers me that my choice of fitness function can have unintended consequences on selection pressure. However, stochastic universal sampling appears to be a reasonable alternative that is still a fitness-proportional selection mechanism. I'm planning on including it in my Haskell GA library (which is coming along reasonably well) along with tournament selection (and any others I come across).

The idea of stochastic universal sampling (SUS) is almost the same as roulette wheel selection. The only difference is that while with roulette wheel selection the "roulette wheel" is spun N times (with N the number of individuals in the population) with a single "head" that can land anywhere on the wheel, with SUS the wheel has N evenly spaced "heads" and it is spun once (conceptually). The spacing between the heads is the total fitness divided by N, which is the average fitness of the population. This helps to prevent the resulting population from consisting of copies of only a few very fit individuals: because the heads are evenly spaced, an individual with fitness f can be selected at most ceil(f / average fitness) times, whereas N independent spins could land on the same individual every time. Since it is fitness proportional, it is still possible to highly favor a small set of individuals, but it should do much better than roulette wheel at this.

Another way to think of SUS is to imagine the individuals on a line. The line represents the total fitness of the population, and each individual has a segment proportional to its share of the total fitness. Now imagine a comb with equally spaced teeth (where the space between them is, again, the average population fitness). With the comb's leftmost tooth all the way to the left of the line, move it a random amount to the right, where the distance moved is between 0 and the average population fitness. Now, select the individual whose segment each tooth of the comb falls within as part of the new population.
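Putting the two descriptions together, here is a minimal sketch of SUS in Haskell. This is my own illustration, not code from the library: it assumes non-negative fitness values and represents the population as (individual, fitness) pairs.

    import System.Random (randomRIO)

    -- Select n individuals (with repetition) by stochastic universal
    -- sampling. Assumes all fitness values are non-negative.
    sus :: Int -> [(a, Double)] -> IO [a]
    sus n pop = do
      let total   = sum (map snd pop)
          spacing = total / fromIntegral n     -- distance between teeth
      start <- randomRIO (0, spacing)          -- one random "spin"
      let teeth = [start + spacing * fromIntegral i | i <- [0 .. n - 1]]
      return (pick teeth pop 0)
      where
        -- Walk the fitness segments and the teeth together, selecting
        -- the individual whose segment each tooth falls in.
        pick [] _ _ = []
        pick _ [] _ = []
        pick ts@(t : rest) ((ind, fit) : more) acc
          | t < acc + fit = ind : pick rest ((ind, fit) : more) acc
          | otherwise     = pick ts more (acc + fit)

Since the teeth are generated in increasing order, a single left-to-right pass over the population suffices, which is part of what makes SUS appealing.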

My GA library aims to allow the user to select different operators and combine them in any way they desire. This means I need to supply some commonly used mechanisms, and stochastic universal sampling fits in with this very nicely.

Saturday, January 26, 2013

Really Good OpenGL Tutorial

I definitely recommend this tutorial: http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Table-of-Contents.html

It's the clearest explanation of OpenGL I've seen, and it gets into some interesting things in the later chapters. It looks like there was more planned that never got written, but what is there is very helpful.

As a side note, when he talks about matrices and using them for effects, he is implicitly using one of the concepts found in this article: http://conal.net/blog/posts/reimagining-matrices

Conal Elliott always has interesting things to say, and in this case he mentions that multiplying matrices is the same as composing the linear maps they represent. This means that to compose linear transformations in OpenGL code, you simply multiply the matrices that define those transformations. Good stuff.
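As a quick illustration of that fact (my own toy example, using 2x2 matrices rather than the 4x4 ones OpenGL uses):

    -- 2x2 matrices as pairs of rows; vectors as pairs.
    type Mat2 = ((Double, Double), (Double, Double))
    type Vec2 = (Double, Double)

    -- Apply the linear map a matrix represents to a vector.
    apply :: Mat2 -> Vec2 -> Vec2
    apply ((a, b), (c, d)) (x, y) = (a * x + b * y, c * x + d * y)

    -- Matrix multiplication.
    mul :: Mat2 -> Mat2 -> Mat2
    mul ((a, b), (c, d)) ((e, f), (g, h)) =
      ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))

    -- The point: apply (mul m1 m2) v == apply m1 (apply m2 v), so
    -- "rotate then scale" is the single matrix (scale `mul` rotate).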

Friday, January 11, 2013

The Next Generation of Neural Networks

There is a Google Tech Talk called "The Next Generation of Neural Networks" by Geoffrey Hinton that talks about some very cool neural network architectures and applications. This guy has been doing neural networks for a long time, and he seems to have some cool ideas on how to use them that he really believes are much better than what's been done. In particular he doesn't have to use back-propagation, and he gets to learn features automatically if they are not known beforehand.

The speaker is funny, interesting, and presents some really cool material. I was very impressed by his demonstration of his handwriting recognition program- you will just have to watch the talk to see why it's so very cool.

I definitely recommend this talk, though I had to watch it more than once to understand a lot of it, and I'm still unclear on some parts (my failing, not the speaker's). One thing that I thought was interesting to note was that in the examples he gives of his new network outperforming other techniques, some of the much simpler algorithms did pretty well. They aren't as good as his, but it's interesting that the simplest algorithms can do pretty well in a lot of cases.

Radial Basis Function Neural Networks

I've been working on my genetic algorithm library lately, and after getting RGEP to work again I've moved on to evolving neural networks. Since RGEP is about evolving tree structures, one easy test for it is to evolve equations to match data (symbolic regression). I wanted to see how a neural network would do on the same problem if we evolved its weights using a genetic algorithm, but the usual neural networks aren't terribly good at this kind of problem since their outputs are usually either in [0..1] or [-1..1]. This led me to Radial Basis Function (RBF) Neural Networks, which are a pretty straightforward modification to the usual feed-forward networks with, say, sigmoid activation functions. They have the same feed-forward architecture, but they are much easier to use for function finding.

The main thing to note about an RBF network is that the activation function is what's called a radial basis function, which I will describe below. Other things to note are that these networks usually don't have an activation function on the output neurons (in other words, the activation function is the identity function) and there are no weights on the input connections (in other words, those weights are all one). This means that the network's output is a simple linear combination of the hidden neurons. I'm not sure there is a disadvantage to structuring them differently, except perhaps that it would make training harder (as there are more values to learn). Also, it seems like having more than two layers (deep neural networks) can be very advantageous, although it makes the networks harder to train (especially with back-propagation, which apparently has trouble propagating error through multiple layers).

So- radial basis functions. These are functions that have a "center", and whose value depends only on the distance from that point. Note that we could use any distance function we wanted, but Euclidean distance seems to be the most popular. The center for a neuron is a vector, which replaces the usual weights. This means that each hidden layer neuron will contain an n-dimensional vector defining a point in n-dimensional space. Given the input vector to the neuron, we first compute the distance between the neuron's point and the point defined by the input. This gives a single scalar number, which is then plugged into some function such as the Gaussian e^(-beta * r^2), where r is the distance and beta is a parameter to the algorithm. Other functions are available.
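To make that concrete, here is a minimal sketch of an RBF network's forward pass in Haskell. This is my own illustration (RBFNeuron, rbfOutput, and the other names are hypothetical), assuming Euclidean distance and the Gaussian basis function above.

    -- Each hidden neuron stores its center vector and its output weight.
    data RBFNeuron = RBFNeuron { center :: [Double], weight :: Double }

    euclidean :: [Double] -> [Double] -> Double
    euclidean xs ys = sqrt (sum [(x - y) ^ 2 | (x, y) <- zip xs ys])

    -- The Gaussian radial basis function e^(-beta * r^2).
    gaussian :: Double -> Double -> Double
    gaussian beta r = exp (negate beta * r * r)

    -- The network output: a weighted sum of the hidden activations,
    -- with no activation function on the output (it is the identity).
    rbfOutput :: Double -> [RBFNeuron] -> [Double] -> Double
    rbfOutput beta neurons input =
      sum [ weight n * gaussian beta (euclidean (center n) input)
          | n <- neurons ]

Evolving such a network with a GA would then mean evolving the centers and output weights (and possibly beta).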

I found this applet useful to get some intuition about these networks: http://lcn.epfl.ch/tutorial/english/rbf/html/index.html. I like how the introduction of a different kind of activation function extends neural networks to a different problem type, and I think these are a nice thing to have in one's toolkit.

Wednesday, January 2, 2013

Generating Initial Population for a Genetic Algorithm with Markov Chains

Sometimes when running a Genetic Algorithm a lot of time is lost because the initial population is random. How much time depends on a lot of factors, such as the encoding of solutions or the distribution of good solutions in the solution space. One case that seems common is when some solutions are not in the feasible region, and generating random solutions results in a population consisting mostly of invalid possibilities.

One way to remedy this is to generate individuals encoding feasible solutions to start with. Sometimes this is as easy as walking a grammar while making random decisions about how to expand non-terminal symbols (as in the sketch below), but sometimes no such structure is available, or the solutions known in practice follow a more complex distribution than can easily be described this way (some expansions may be much more likely than others, for example).
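For instance, a random walk over a tiny expression grammar might look like this. This is a toy sketch with a made-up grammar, and a depth limit to guarantee termination:

    import System.Random (randomRIO)

    -- A symbol is either a terminal string or a non-terminal to expand.
    data Sym = T String | N String

    -- A made-up grammar for simple arithmetic expressions.
    rules :: String -> [[Sym]]
    rules "expr" = [ [T "x"]
                   , [N "expr", T "+", N "expr"]
                   , [N "expr", T "*", N "expr"] ]
    rules _      = []

    -- Expand a symbol by picking random productions, falling back to a
    -- terminal when the depth limit runs out.
    expand :: Int -> Sym -> IO String
    expand _ (T s) = return s
    expand 0 (N _) = return "x"
    expand d (N s) = do
      let rs = rules s
      i  <- randomRIO (0, length rs - 1)
      ws <- mapM (expand (d - 1)) (rs !! i)
      return (unwords ws)

Every string this produces parses under the grammar, so every generated individual is feasible by construction.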

In this situation, one possibility would be to train a Markov Chain on known solutions, perhaps sampled from a physical process, such as in a prepared training set. The resulting Markov chain would generate solutions like those found in practice while still being random.
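As a sketch of what I mean (my own toy version, assuming individuals are lists of symbols and a first-order chain): count the transitions seen in the sample solutions, then walk the chain to build a new individual.

    import qualified Data.Map as M
    import System.Random (randomRIO)

    -- Map each symbol to the successors observed after it (with
    -- repeats, so sampling uniformly from the list reproduces the
    -- observed transition frequencies).
    type Chain a = M.Map a [a]

    train :: Ord a => [[a]] -> Chain a
    train samples = M.fromListWith (++)
      [ (x, [y]) | sol <- samples, (x, y) <- zip sol (tail sol) ]

    -- Generate an individual of at most the given length from a start
    -- symbol, stopping early at a symbol with no known successor.
    generate :: Ord a => Chain a -> a -> Int -> IO [a]
    generate chain start len = go start (len - 1)
      where
        go cur k
          | k <= 0    = return [cur]
          | otherwise = case M.lookup cur chain of
              Nothing    -> return [cur]
              Just nexts -> do
                i    <- randomRIO (0, length nexts - 1)
                rest <- go (nexts !! i) (k - 1)
                return (cur : rest)

Higher-order chains (conditioning on the last few symbols rather than just one) would capture more structure at the cost of needing more sample data.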

This is not without drawbacks- it assumes we know something about the solution space, and that the samples used accurately cover that space. It will improve convergence time, but it will not search as much of the solution space as a randomly generated population would. Of course, these things are always about trade-offs, and this seems like a nice tool to have in one's toolbox.

One last note- it seems like this could also be used to generate new values for a training set given an initial set. This might prevent some over-fitting, assuming we are able to evaluate the new samples accurately.