Wednesday, May 30, 2012

Random Sampling from a Discrete Distribution

I can't recommend the article "Darts, Dice, and Coins: Sampling from a Discrete Distribution" highly enough. It develops Vose's alias method step by step from a series of algorithms of increasing performance, proving the correctness of each algorithm along the way. I feel like I have a much better intuition for this awesome algorithm now, and the pictures and examples went a long way toward building it.
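
Mostly to fix the idea in my own head, here is a rough sketch of the two halves of the alias method in Haskell: building the probability and alias tables, and then sampling in O(1) per draw. This is my own reconstruction from the article's description, not its code, and names like buildAlias and sample are just mine.

```haskell
import qualified Data.Vector as V
import Data.List (partition, sortOn)
import System.Random (randomRIO)

-- Probability table and alias table, both of length n.
data AliasTable = AliasTable (V.Vector Double) (V.Vector Int)

-- Build the tables from a list of probabilities that sum to 1.
buildAlias :: [Double] -> AliasTable
buildAlias ps = AliasTable (V.fromList [p | (_, p, _) <- rows])
                           (V.fromList [a | (_, _, a) <- rows])
  where
    n      = length ps
    scaled = zip [0 ..] (map (* fromIntegral n) ps)
    (small, large) = partition ((< 1) . snd) scaled
    rows   = sortOn (\(i, _, _) -> i) (pairUp small large)

    -- Pair each under-full column with an over-full donor, Vose-style.
    pairUp ((l, pl) : ss) ((g, pg) : ls) =
      (l, pl, g) : if pg' < 1 then pairUp ((g, pg') : ss) ls
                              else pairUp ss ((g, pg') : ls)
      where pg' = pg - (1 - pl)
    pairUp ss ls = [ (i, 1, i) | (i, _) <- ss ++ ls ]  -- leftovers fill their own column

-- Draw one index: pick a column uniformly, then flip its biased coin.
sample :: AliasTable -> IO Int
sample (AliasTable probs aliases) = do
  i <- randomRIO (0, V.length probs - 1)
  u <- randomRIO (0, 1) :: IO Double
  return (if u < probs V.! i then i else aliases V.! i)
```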

Good stuff.

Power vs Safety

I once heard the idea that one should use the least powerful tool that is still powerful enough to solve a problem. I didn't understand it at the time, but I've recently come to an understanding of my own (not necessarily what the person who wrote it meant).

My first interpretation of this idea was that one should use the tool (I'm thinking in particular about programming languages, but this is more general than just that) with the fewest features. I thought I would rather use the *most* powerful tool I could find. I can understand, for example, not embedding a full programming language into a program if all you want is to let users enter arithmetic expressions, but not, for example, using BASIC rather than Haskell if the latter is an option. You can benefit from simplicity, but I don't see an advantage to using languages without useful features just because you don't strictly *need* them.

Recently I came to another interpretation of this idea. It is common in computer science for there to be a series of systems of increasing power that can solve increasingly difficult problems. The catch is that the more powerful the system, the weaker the statements we can make about it. This is true in automata theory (deterministic and nondeterministic finite automata, which turn out to be equal in power, then deterministic pushdown automata, nondeterministic pushdown automata, and Turing machines). It is true in the typed lambda calculi, where the hierarchy is less obvious (at least to me), but much research goes into designing powerful systems (the calculus of constructions, for example) that are still weaker than the untyped lambda calculus so as to retain properties such as strong normalization or confluence. It is true in optimization theory, where problems are classified by the weakest form of optimization that can solve them; in recursion theory (primitive recursion < mu-recursion, which is about all I know of recursion theory); and in formal grammars (regular grammars < context-free grammars < context-sensitive grammars < unrestricted grammars). Each of these situations (and more, I'm sure) has many more levels between the ones I happen to have heard of.

Taking automata as an example, each level in that hierarchy can solve all problems solvable by the levels below it, but not all problems solvable by the levels above it. However, with each step up we lose some assurances about termination or resource use. Finite automata are guaranteed to terminate, but the more powerful Turing machines are not. Expressions typeable in the polymorphic lambda calculus reduce to a normal form, whereas untyped expressions do not always have such a form. Equality of regular grammars is decidable, but not so for context-free grammars in general. And we wouldn't need strange optimization techniques like genetic algorithms if every problem were linear.
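
As one tiny, concrete illustration of this trade (my own example, not from anything I'm citing): the untyped lambda calculus happily admits the classic non-terminating self-application term, while a typed language like Haskell rejects it outright unless we explicitly buy back that power with a recursive type.

```haskell
-- The non-terminating self-application of the untyped lambda calculus,
--   omega = (\x -> x x) (\x -> x x)
-- is rejected by Haskell's type checker (the type of x would have to be
-- infinite). To write it at all, we have to opt back into the extra
-- power explicitly through a recursive type:
newtype Self a = Fold { unfold :: Self a -> a }

-- Evaluating this loops forever; the recursive type is exactly where
-- the guarantee of normalization was given up.
omega :: a
omega = (\x -> unfold x x) (Fold (\x -> unfold x x))
```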

This is really a very common situation: we would like to determine the least powerful system powerful enough to solve our problem, simply so that we can make as many assurances about the solution as possible. This is also one of the ways that things like closure operators and adjunctions come into computer science ("the least thing great enough to..." or "the greatest thing less than..."). I think it may even be a fundamental part of computation in some way that I don't know how to formalize. Regardless, it is certainly true that a great deal of effort goes into doing exactly this: using the least powerful tool powerful enough to solve a problem.

Tuesday, May 29, 2012

The R Language for Machine Learning

The other day I bought a book called "Machine Learning for Hackers," which uses R to show off some machine learning algorithms. There is a lot I don't know about machine learning, as I've only studied a subfield and not really the mainstream statistical methods that are useful for the vast majority of problems. One reason I bought this book (in addition to having lots of gift cards) is that a friend of mine mentioned learning R recently, and I would like to know more about it. I know almost nothing about R, so beware: this post is just my initial impressions and misunderstandings.

My understanding is that R has a hodgepodge of properties that make it a bit strange. The basic data type is a vector (which is pretty cool and probably allows an efficient implementation of a lot of primitive operations on vectors), and all data seems to have metadata attached. It is dynamically typed (properties of data are determined by metadata modifiable at runtime), there is an object system built on that metadata, it has higher-order functions (terribly important, if you ask me), it is lazy (although common operations seem to force evaluation), it is call-by-value in the sense of copying (although it looks like copying is done only when necessary), and it allows mutation of structures (although it limits the propagation of effects on data through copying).

Overall it's probably a good language for its domain: a lot of attention has gone into convenience, it's designed to manipulate lots of complex data, and it supports some functional programming. I expect that it could get very messy with edge cases, odd interactions between mechanisms, possibly clumsy support for more routine programming tasks (I don't know either way about this one), and undefined semantics. That last one is a shame: a language should really have a defined semantics, or its programmers won't know what their programs really mean, and it's hard to come up with variations of the language.

I'm pretty excited to learn these techniques, although this book does not go deep into theory. It's more about getting interested and involved in these algorithms through case studies and practical concerns. I think this is perfect: I'll learn what theory I want to know somewhere else. For now, I'm just glad to get a feel for the broader reach of this field that I like so much.

Wednesday, May 2, 2012

Genetic Algorithm and RGEP Complexity

I've been looking at efficient implementations of genetic operators lately. Obviously the actual complexity depends on the fitness function (sometimes more than anything else), the choice of genetic operators, the implementation of those operators, and the representation of the individuals. However, I'm really interested in the operators used in RGEP: point mutation, one- and two-point crossover, rotation, and tournament selection.

I'm slowly converging on a set of structures and algorithms that seem to give a complexity of O(g*p*log(n)), with g the number of generations, p the population size, and n the size of the individuals. This also applies to RGEP, as it is pretty much just a genetic algorithm with a rotation operator. A genetic algorithm would normally be O(g*p*n) instead, assuming the usual genetic operators, the usual array-of-arrays representation, and a reasonable implementation of those operators.

To get this time complexity, I plan on using a Vector (the Haskell library Data.Vector) for the population (assuming the usual multiset population) and a Sequence (the Haskell library Data.Sequence, which implements 2-3 finger trees) for the individuals. These are chosen because Vectors have efficient indexing and a nice function "backpermute" which can be used to select the individuals for a new population (given their indices from a selection operator) in time linear in the population size. Sequences have an amortized O(log(min(i, n-i))) split operation, where i is the index to split on and n is the size of the sequence, and an O(log(min(n, m))) concatenation operation (n and m being the sizes of the two sequences). This is important because point mutation, crossover, and rotation are really not about copying, but about splitting and concatenating (or adjusting single points), and should take less than linear time.
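
As a rough sketch of why these structures fit (my own illustrative code, not the actual implementation; the crossover and rotation points are passed in as arguments rather than drawn randomly, and Char stands in for whatever symbol type RGEP uses):

```haskell
import qualified Data.Vector as V
import qualified Data.Sequence as S
import Data.Sequence (Seq, (><))

type Ind = Seq Char       -- an individual; Char stands in for RGEP symbols
type Pop = V.Vector Ind   -- the usual multiset population

-- One-point crossover: split both parents at point i and swap tails.
-- The cost is dominated by splitAt and (><), both logarithmic.
crossover1 :: Int -> Ind -> Ind -> (Ind, Ind)
crossover1 i a b = (a1 >< b2, b1 >< a2)
  where (a1, a2) = S.splitAt i a
        (b1, b2) = S.splitAt i b

-- Rotation: move the first i symbols to the back; again just a split
-- and an append rather than a copy.
rotate :: Int -> Ind -> Ind
rotate i ind = back >< front
  where (front, back) = S.splitAt (i `mod` S.length ind) ind

-- Given the indices produced by a selection operator, build the new
-- population with a single backpermute, linear in the population size
-- (the individuals themselves are shared, not copied).
selectBy :: V.Vector Int -> Pop -> Pop
selectBy indices pop = V.backpermute pop indices
```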

I have learned that roulette wheel selection can be done in O(p) time, sampling p times from a population where each sample can take O(1) time, using either the algorithm in "Roulette-Wheel Selection via Stochastic Acceptance" or, in my case, maybe the one in "A Linear Algorithm for Generating Random Numbers with a Given Distribution". Even using tournament selection, which is my personal preference, we need only time linear in the population size to do selection.
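
As an illustration of that last point, a two-way tournament selection over a vector of precomputed fitness values might look something like this (my own sketch, not from either paper; it just produces the vector of winning indices that a backpermute-style rebuild of the population would consume):

```haskell
import qualified Data.Vector as V
import Control.Monad (replicateM)
import System.Random (randomRIO)

-- Two-way tournament selection: p independent tournaments, each O(1),
-- so producing the p winning indices takes time linear in p.
tournament :: V.Vector Double    -- fitness of each individual
           -> IO (V.Vector Int)  -- indices of the selected individuals
tournament fits = do
  let p = V.length fits
  winners <- replicateM p $ do
    i <- randomRIO (0, p - 1)
    j <- randomRIO (0, p - 1)
    return (if fits V.! i >= fits V.! j then i else j)
  return (V.fromList winners)
```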

I've talked before about doing point mutation by sampling from a geometric distribution: the distribution only needs to be set up once, each sample takes O(1) time, and the total cost depends only on the number of mutations in the population. The mutation rate is usually chosen so that each individual is mutated only a couple of times, so mutations occur with about the same frequency regardless of the individual's size (and the cost therefore shouldn't depend on that size). It is unfortunate that this is the one place where a vector would be better than a sequence (assuming a mutable vector), but I'm not terribly worried about it. If necessary, the mutations can be applied either in a single pass down the whole population, changing only those points that were chosen as mutation points, or by a series of applications of the function "adjust". The latter is where I get what I'm going to call O(p*log(n)) complexity: the number of mutations is proportional to the population size, and each one costs about the price of splitting and rejoining (or adjusting) an individual at a given point.
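
To make the "adjust" version concrete, here is a rough sketch of mutating a single individual by jumping between mutation points drawn from a geometric distribution. This is my own illustration, not the actual RGEP code: geometricGap and pointMutate are hypothetical names, and flipSymbol stands in for whatever per-point mutation the encoding actually uses.

```haskell
import qualified Data.Sequence as S
import Data.Sequence (Seq)
import System.Random (randomRIO)

-- Distance to the next mutation point, drawn from a geometric
-- distribution with per-point mutation rate pm (assumed 0 < pm < 1).
geometricGap :: Double -> IO Int
geometricGap pm = do
  u <- randomRIO (0, 1) :: IO Double
  -- the 'max' just guards against u landing exactly on 1
  return (floor (log (max 1e-12 (1 - u)) / log (1 - pm)))

-- Walk one individual, touching only the sampled mutation points with
-- 'adjust'; each adjust is O(log(min(i, n-i))) and the number of
-- adjusts depends on the mutation rate, not directly on the size n.
pointMutate :: Double -> (a -> a) -> Seq a -> IO (Seq a)
pointMutate pm flipSymbol ind = go 0 ind
  where
    go pos s = do
      gap <- geometricGap pm
      let pos' = pos + gap
      if pos' >= S.length s
        then return s
        else go (pos' + 1) (S.adjust flipSymbol pos' s)
```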

Given these observations, the time complexity is O(g * (mut + cross + rot + select)), where the complexities of mut, cross, rot, and select are O(p*log(n)), O(p*log(n)), O(p*log(n)), and O(p) respectively. This makes the total complexity O(g*p*log(n)).

This may not be the best we can get (I'm not sure, but I'm looking into it), but it's pretty dang good compared to my usual naive implementations.