Monday, December 31, 2012

Tropical Semirings and Time Complexity

I've have been thinking about how a compiler might track the cost of a computation, and assuming that it is known to terminate, compute it at compile time. I have no idea if this or something like this is usually done, but its basically constant folding.

What I was thinking is that many operations, arithmetic operations being the best example, can be tagged with a cost (say, 1 for all of the languages primitive operations). Any operation that takes an unbounded amount of time would be tagged with an "infinite" cost. The interesting thing here is that this amounts to a morphism from programs in some language to the tropical semiring (in this case, of natural numbers extended with infinity, with the additive operator the max function, and the multiplicative operator as addition), as I show below. Note that this assumes that they available control structures are sequencing and conditionals.

First off, primitive operations are assigned a fixed cost- perhaps just 1 for each operation. This could be more fine grained if desired, especially if the languages primitive operations are not simple.

Sequencing two operations translates to the addition of them. This seems appropriate as sequencing is like the product of programs. This is because the cost of doing two operations in sequence is the cost of doing the first operation plus the cost of doing the second.

Conditionals have at least two branches, as a one branch conditional statement is like a two branch condition with the empty operation as the second branch. Only one of the conditions will be executed, but we have to assume pessimistically that the more expensive one will be followed. This means that the cost of a conditional is the maximum of cost of the two branches.

Other control structures could perhaps be accommodated. Well founded recursion, or loops of certain forms, could be added, though in general they would have infinite cost. It seems like one would end up evaluating the largest expressions possible that is not tagged with an unknown (infinite) time cost.

So thats all. A trivial observation, and one that I'm sure is well known. To finish this off, note that something like this is also true for determining the size of algebraic data structures. Infinity comes up for structures that may have unbounded size (lists or trees), but otherwise the equations that define the structures (in the algebra of data types), when interpreted as a number is essentially its size, with variables acting as expected.

Monday, December 3, 2012

Robust Gene Expression Programming + Let Expressions

I've been thinking about (and implementing) Robust Gene Expression Programming recently. I've posted about this technique before, but as a refresher it can be thought of as either a Genetic Algorithm that evolves expressions in a given (unityped) language or as a postfix form of linear Genetic Programming (with the intent that the language is purely functional rather than imperative as linear GPs sometimes are). One of the things I noticed while developing this algorithm was that there is no reason one couldn't include stack manipulation operators into the expression language, regardless of the language one is interested in, and more importantly these operators add power to the algorithm essentially for free. Recently I've realized that "let" expressions could be similarly added, and that they may improve the algorithms ability to find useful programs in a nicely clean way.

Adding stack operators and "let" as an extension to a language is motivated by the observation that useful programs consist of useful subexpressions. This means that program with good subexpressions should propagate throughout generations allowing them to get mixed into individuals and so on. There are several schemes in the Gene Expression Programming world for helping useful subexpressions get reused. I tend to like simple schemes when dealing with these algorithms, and I think it is pretty cool that a stack operator like "dup" can be added to, say, a language for arithmetic expressions to duplicate it in the resulting expression tree, allowing it to be reused. Something like "tuck" could perform a similar function of duplicating subexpressions, and for allowing the expression tree to be essentially edited to move subtrees around and combine them in different ways.

Now for the point of this post- in addition to adding stack operators I think one could easily add "let" expressions to any expression language. Again we get them essentially for free in the sense that we can add them without changing the type of the expression language (they can be expanded out before the expression is evaluated for fitness). The scheme I have in mind involves adding variables to the terminals of the language and, for each variable added, a "let" operator for that variable taking two expressions and replacing the first expression for all occurrences of the variable in the second term. Variables that occur free in an term (not bound by a surrounding "let") are ignored during evaluation so they do not disrupt the usual evaluation (which is set up to handle problematic cases like this and still result in a valid expression).

Note that I've been calling these "let" expressions rather then lambda because they never form actual functions but must rather be immediately applied to a value to be included in the expression tree. One could do otherwise by designing an expression language with lambda terms, but as an extension to a language we must enforce the rule that each expression has the same type. Since there is only one type, and the "let" terms are immediately evaluated, these are more like "let" then lambda.

The main problem with this idea that I see is the somewhat ad-hoc way we have to decide beforehand how many variables we want to have. I don't have a big problem with this, and I think for many problems a small number would be sufficient to add some value, but it does feel a bit odd. On the other hand, this strategy is much simpler than some other ideas that have been looked at, and sometimes one exchanges simplicity for perfection.

It will probably be a while before I get around to implementing this to see if it actual helps at all, but I like that RGEP seems to have some simple, optional, and lightweight ways to extend itself without changing the core algorithm. Now if I could only figure out a clean way to allow more then one type...

Tuesday, November 6, 2012

SImply Easy! An Implementation of a Dependantly Typed Lambda Calculus

I like to mention papers that I've enjoyed on this blog, and I thought "Simply Easy! An Implementation of a Dependantly Typed Lambda Calculus" by Andres Loh, Conor McBride and Wouter Swierstra was a nice read. It makes use of De Bruijn Indices and Higher Order Abstract Syntax in an implementation of a Dependant lambda calculus in Haskell to show that its isn't so daunting or complex as one might expect. Its nice to see discussion of these techniques and to see how they simplify the implementation, and I appreciated that they took the time to discuss every line code. It was easy to read, assuming basic knowledge of each of the relevant subjects, and was successful in convincing me that a Dependant Type System is not black magic.

I definitely recommend this one.

BerkeleyX: CS188.1x Artificial Intelligence

I've been taking a great free online class on edx.org called BerkeleyX: CS188.1x Artificial Intelligence. The lectures are interesting and move at a reasonable pace, the projects are well thought out and instructive, the professor is interesting, and there are occasionally some more advanced topics discussed.

I'm totally a fan, and I may take more of these in the future.

Wednesday, October 24, 2012

Sampling a Geometric Distribution from a Uniform One

This post merely records a useful fact I may use in a program I've been thinking about: given a number sampled from a uniform distribution for [0..1] we can translate that into a value in a geometric distribution with the equation floor(ln(U) / ln(1 - p)) where U is the value from the uniform distribution and p is the desired probability of success. Without the floor function this generates a value from an exponential distribution with lambda value equal to -ln(1-p), so ln(U) / (-lambda) is the value generated.

Friday, October 5, 2012

The All Encompassing Forth

Note- ramblings ahead.

Programming in Forth is different from programming in most other languages. It seems to expand to fill whatever void it encounters. In most languages one writes a program that is run in a compiled or interpreted form different from the language it was written in. It is possible to write a Forth program in this style, but it is also possible to take another path- to extend the Forth programming language until it includes the capabilities we want. If you want a text editor you can extend Forth until it is a text editor. If you want to communicate with a piece of hardware, extend Forth until it includes that hardware. I've been thinking about this recently, and had some thoughts to record.

With Forth, the language itself is in some sense an extension of the inner interpreter. The inner interpreter is essentially just a finite control- a finite automata that executes some primitive instructions and read and writes to an infinite store (the return and parameter stack and memory). The outer interpreter is a useful extension of this machine because it gives a way for the system to extend itself, and once a system can extend itself we have a very cool thing. By feeding this system text it is able to add to its abilities by adding to its store of actions (the dictionary). Add definitions that process strings, and we have a string processor. If we describe a text editor to this system in the language that extends it, then it has a text editor. The feeling that we are extending the system and not creating a separate program comes from the fact that the system does not stop accepting input. We can feed it as much as we want, extending it with any functionality we want, and it will work through it and wait for user input when it is done. When it is reading user input any functionality added to it can be invoked.

A disadvantage to this model is that it is essentially a global namespace. This is not to say there is no modularity mechanism, as there are vocabularies, but you do have to be very careful with Forth. You can change the interpreter any time you want, and it has a global effect on the system. This is not like what I've heard smalltalk can be like, where I believe the environment is persistent. Usually with Forth we reinterpret everything from the start.

Its an interesting and possibly pretty cool system, and I always feel like one could try placing Forth as the heart of a system- like a Forth operating system (which of exists, and in fact it seems like traditionally Forths have seen themselves as the center of the system)- and just let it expand to fill any void it encounters. I seems like you would end up with a monolithic system (perhaps this is avoidable?) that includes everything one would need for a computing platform. I'm sure this has happened before, though I wonder what heights the idea has been taken to.

Sunday, September 16, 2012

Controlling an Arduino from GForth

I have been working recently on a way to control my Arduino Uno from GForth, and it is turning out pretty awesome. I've written a sketch for the Arduino that reads simple commands over the serial bus, performs their action (using a large switch statement) and returns a result. From the computer one can type something like "13 HIGH digitalWrite" to turn the LED on, or "10 arduinoMalloc" to allocate memory and get the address of the allocated block.

This project has come out of my failed attempts to write a Forth to run *on* the Arduino. Instead of accepting that the limited resources on the device restrict what you can do, I decided that it would be nicer to use a much more powerful computer and a fully general programming language and treat the Arduino as a peripheral that can be commanded. This lead me to the command-reponse protocol I've implemented.

Of course, you have the full power of Forth, so you can do anything you want, but the functionality that the Arduino currently exposes this way is: reading and writing SRAM, reading and writing EEPROM, allocating and freeing memory, setting the mode of a pin, reading and writing digital pins, reading and writing analog pins, delaying by a given number of microseconds, generating random numbers and random numbers from a given range, and setting the baud rate of the usb connection. This is pretty much all the functionality I can test right now without finding some hardware, but it is really easy to add new commands so I certainly will expand this set when I can think of something to add (suggestions?).

The tool as its stands only works in Linux (I had some trouble on Windows, but I may come back to it) and can take a device name and baud rate on the commands line. There is also a command line switch to compile and upload the current version of the command interpreter to the Arduino. Once started it is simply the GForth interpreter with some words for sending the commands mentioned above, as well as error handling on both the Arduino and the GForth side of things.

The Arduino sketch will currently reject invalid commands, invalid command lengths, and invalid arguments with an error code. The GForth side throws an exception explaining the problem and printing the result given (the length received by the Arduino in the case of an invalid length, for example). The Forth words that send commands will also check if the command has been defined yet, and will complain if it doesn't understand the result the Arduino sends back.

There are at least two ways I can think of this program being useful. First is to command the Arduino to do something complex or that may change over time or depends on information available to a laptop, say, but not the device. I'm hoping to think of a project like this to demo. The second use is in the fact that the whole thing occurs in the serialEvent function. This means that if you have a main loop that does something with the Arduino, but you would like to debug you program at runtime with a full interactive programming language that can peek and poke memory and sample inputs and outputs, then you can send commands at any time to do this. I feel like this might even be the most useful application, but I don't have anything right now that needs this kind of interactivity.

Things I would like to add- 32 bit arguments (currently everything is 16 bits), saving commands to eeprom to read at startup, control structure commands, grouping commands to send all at once instead of one at a time, some way to register commands to run repeatedly so you don't have to keep sending them, and an assembler that uploads to flash or eeprom and reads the uploaded program at startup. I think this might be one of the cooler applications- you could program in assembly on the Arduino and still use sketches, and without overwriting the bootloader. This relies on being able to write to pages of flash, and I don't know how to do that yet, but still a cool idea.

Thats it for now. This project so far has been very pleasant and opened me up to how nice Forth can be if you use the Forth Foundation Library. I have had some bugs, but interactive debugging and a quick testing cycle has made them much easier to track down then I would have expected. As usual with my posts- I hope to write more about this soon!

Forth Foundation Library

I've been writing a lot of Forth code recently, and I've discovered that using the Forth Foundation Library has been incredible boost in productivity and simplicity for me. It raises the level of abstraction beyond what I'm used to in C, and makes Forth really pleasant to program in. Anyone interested in using Forth beyond basic toy programs needs to consider this library. I'm currently using the linked list, cell array, argument parsing, enumeration, and format string modules, but I have plans to use at least the random number generation one and possible n-ary tree ones as well. I may even consider playing around with their GTK module, which looks interesting, and the finite state machine ones (there is both deterministic and nondeterministic).

Category of Stacks

In the usual category of types as objects with morphisms the terms of some language, the types are the usual base types and all types constructable from a set of type constructors. It seems that we would need a different kind of category for a concatenative language- one where the types are stacks with basics types, perhaps polymorphic types, and always ending in "Stack variables" indicating a "rest of the stack". This seems to be the usual way to consider type systems for these language, but I've never seem an investigation of the category this would form. Unfortunately, since I understand only the basics of even elementary category theory, all I can do is make some basic observations.

First off- The stack consisting of only a stack variable is terminal. It is terminal because you can always pop off anything from the defined part of a stack until you get to its stack variable. At first I thought it was initial too, because you can always add to the empty stack to get to a given type, but what would you add? At the least there is no unique term, and with polymorphic stack elements there is nothing reasonable to add at all. I don't believe that there is an initial object, unless there is some sense in which we can bind the "rest of the stack" to a given type, instantiating at any type we want. I'm not sure if this really makes sense, however.

Secondly- a stack with the elements of two stacks, the first stacks elements followed by the seconds elements, seems to form a product. We can always drop the elements of one of the two stacks to get the other one, which form the projections. Just as with regular type products there is an arbitrary order and an isomorphism swapping that order. Note that the terminal stack does seem to be the identity of the product, so that encouraging.

Thirdly- sums are a bit tricky. The only reasonable way I can think to form a sum is to take two stacks and push them as a single element of a sum type of stacks. This makes it a stack of stacks. Otherwise you would need to somehow "tag" parts of a stack that may be different lengths, and I don't seem how this would work. Its not a big deal to make this a stack of stacks, and I believe that has been done before, but it is interesting that if we don't do it that way then there are no sum types that I can think of. If we do allow this, it also seems like a similar product of stacks exists, But then we need a separate unit type as its identity as the empty stack would become the identity of the sum and the zero of the product.

A similar situation appears to occur with pushout and pullbacks, and limits and colimits. I can't make any of this rigorous, but it does seems to behave differently then regular categories of types, and is perhaps worth some study.

Thursday, September 6, 2012

Monads on the Stack

What would a monad look like in a concatenative language? It seems that each of the usual suspects from Haskell have some sort of interpretation in this context, and may provide a useful addition to programs in such a language. However, so far I have not be able to see how to show that my interpretation actually forms a monad over a suitable category. So- how to formalize this and what does it look like?

Working somewhat backwards, I think I know what I want this to look like. Hopefully this will lead either to an understanding of how to formalize this or to some other formalization that is more suited to a concatenative language. I imagine that syntactically the space character is overloaded and is the equivalent of >>= in Haskell. The default monad is Id and has no extra properties, but you can select a particular monad, perhaps using the technique in the blog article (or blarticle) Scrap Your Type Classes.

To fmap a word into the state monad simply use Factor's dip or [ id swap | ] if "|" is the vertical composition operator I discussed in a previous post. The function "return" is slightly harder- it may be a word that take the top of the stack and returns a closure that puts that item under the top of the stack ( s -- a s ). The problem with this is that it seems to imply that while we are using the state monad we can only use the state and one item, which is a bit of a restriction if you are used to having a stack to manipulate. I was hoping this would be a way to have a state if you wanted it but otherwise to write normal words, but to do that it seems like you have to make the state monad "stack polymorphic" taking an arbitrary number of items from the stack. but then what does "return" look like? Does it save an arbitrary number of items? Or am I completely on the wrong track here?

Moving on for now, the reader monad is pretty simple. It simply provides a top of stack that you don't have to use, and will not change through a computation. For fmap we have "id swap |" again, but for return we have "drop ." if "." is the composition of words on the stack.

The writer monad should consist of words that leave elements of some monoid on the stack that are combined by the action of the monoid and are no visible to the next word. Simple enough, fmap is "0 ." if 0 is the identity of the monoid, and return simply puts an item on the stack followed by the monoid's identity element.

Maybe is also easy if we focus on the top of the stack- it is just like Haskell where you adjoint an element to a type and the presence of this element, in this case having it on the top of the stack, short circuits the computation. Otherwise the top of the stack is a Just value and is used as normal.

Thats all for now I think- I hope to write next time about an attempt to investigate what this means and in what category we are. Maybe then it will Just illuminate itself, or maybe I will have Nothing. Tune in next time to find out.

Vertical Composition

I've been thinking lately about designing a pure concatenative language influenced by Haskell to explore an idea I have about concatenative programs as diagrams. It would involve investigating how monads would work out in this type of language, as well as optimization and reasoning in terms of diagrams. I would probably try embedding it in Haskell and perhaps one day compiled to C. As part of this I've noticed an operator that I haven't seen in any concatenative languages (not even Factor, with all its awesome combinators) but seems like it might be useful. I call this operators "vertical composition".

Vertical composition is similar to the product of functions- if f:a->b and g:c->d then fxg : axc -> bxd (also written as fxg : (a,c) -> (b,d)). If these functions effect the stack, then their usual composition (which I'm calling horizontal composition and is written for f:a->b and g:b->c as g.f:a->c or f;g:a->c) creates a function that performs the the first function, and then the second. This means that the composition is valid if the stack after the first operator is a valid input to the second (possible requiring extra items deeper in the stack then the first function used). Vertical composition, on the other hand, is always valid just as the pointwise action of a pair of functions is always valid. Instead of operating sequentially, vertical composition acts in parallel (at least, conceptually) by performing the first function on the part of the stack that it needs, and the second function on the part of the stack just under that part used by the first function. The resulting stack is the result of the first function followed by the result of the second. I will write vertical composition as f|g, though I'm considering f/g, so we pronounce it "f over g". I will try to clarify this below, but the point is that you can reach under part of the stack without rearranging it. This would involve some amount of rearranging to arrange so the stack ends up with no "holes" where, for example, if f has stack effect ( a b - ) and g has stack effect ( c - c ) then f|g has stack effect ( a b c - c ). This would mean that the value c should be two places lower after f|g, which can't be known beforehand in a language like forth. If implemented naively as a stack this is a problem but the diagrams I mentioned earlier should remove any overhead of that kind. I hope to post about this strategy soon.

Something interesting about this operator is that the operations it composes could be performed in parallel. This isn't something I expect to see, but it is interesting that the horizontal form of composition is sequential and the vertical form is parallel.

Note that the vertical composition has the empty word as its identity and is associative, just like horizontal composition.

So, this operator is definable. Is it useful? I will write about that in a future post. The point is that this is the first step to a postfix program as a diagram.

Tuesday, August 21, 2012

Generality is Correctness

At work I came across a situation where the length (and other properties) of a data structure was hard-coded in many similar functions across a project, where each function acts on a different instance of the structure. At first I was reluctant to implement my own version of the function in this way. Lacking a convenient way to abstract this function I'm okay with reimplementing it in my own code, but it is possible for the length to be calculated at run-time instead of hard-coded.

I reasoned that all the functions be implemented by calculating the length, as it would make them more generic. The function would work correctly if the sizes changed, or if it were using in a different situation then the one originally intended. In this sense the more generic the function the fewer properties it has, and the more likely the features it does have will be correct. Being very generic should make a function easy to understand (there are fewer things to be aware of) but should also restrict its behavior to exclude extra information (the length used in that function is part of its behavior not described by its type) or corner cases. An more extreme example is Parametric Polymorphism, where there are even derivable laws that such functions must satisfy, but it seems like this concept occurs in all levels of design.

In the end we are hard-coding all of these sizes. This is also a form of correctness, as all function do one and only one thing, and can be used in only one way. The main reason for doing it this way is that we can show that throughout the project the sizes are used in a consistent way and that nowhere does the size get misused. This means that we will not see a bug that a size is calculated incorrectly and ends up overwriting memory, getting memory access exceptions, or taking a huge amount of time to process a structure that appears larger than it is. On the other hand, these functions are specific to a single use, and it is difficult to make global changes to them if we needed to. In this particular case that is okay, but it makes maintenance more difficult.

The main thing I've learned here is that the tradeoffs where I work are not the ones I'm used to. We do not build small, reusable components, and there are not many mechanisms for abstraction or specification. Documentation becomes extremely important, and there are often subtle bugs (in my code included) involving memory management.

C Macros and Let Polymorphism

We use macros for constants quite a lot at work, and it occurred to me that since a macro is replaced at a token level (before parsing and certainly before type checking) and since C will do coercive casting on literals, macros are somewhat like a limited form of Let Polymorphism. Obviously they not the same thing, but in practice you sometimes get the advantages of Let Polymorphism in that you get the same expression at several types.

Wednesday, August 8, 2012

Blind Crowdsourcing of Captchas

I feel like someone must have thought of this before, but it seems like one way to break captchas would be to create a service that many people want, but needs to be protected against scripted use. Give your users captchas from other sites that you want to break, and they will gladly solve them. You then simply forward the answers along to a script that is doing something nefarious- whatever is it captchas are protecting against. Distributed organic computing for solving the captcha problem.

Wednesday, July 18, 2012

Swarms Rule

I've been thinking a bit about what rules make a swarm look like a swarm. The three rules that boids follow, cohesion, alignment, and separation, make a nice flock, but a swarm isn't really the same as a flock. A swarm's parts moves around even when it is still, if may change shape at any time, and it more dynamic and less structured then a flock. So what rules do we need to produce swarm behavior?

First off, there appear to be more then one type of swarm, or at least more then one thing a swarm may be doing. A bait ball is a circle of beings all circling in the same direction and staying close together but not converging to a point. A large, mostly stable swarm may appear to grow limbs or briefly split. A moving swarm may be more like a flock then a standing one. Perhaps there are other forms we may wish to create that are swarmlike?

So, the rules. A standing swarm that does not change shape seems to need only two rules- move randomly and stay inside a sphere. Luckily these are both O(n) time operations for a swarm of size n, so we can get away with an efficient little swarm if we allow individuals to move through each other. To get the shape changing effect I would guess we could create invisible attractive or repulsive forces and move them around, orbiting the center of the swarm and possibly reversing their effect occasionally. Another possibility is temporary leaders, either with new rules or with the same rules as others, that are followed by close by individuals. Another possibility is that individuals always sort of follow each other but the strength of that influence varies.

For a bait ball I would guess we would need separation from a central point, random movement, and a strong need for alignment. I'm not sure if this is enough, but I imagine the alignment would keep them in check and the random movement (possibly also a need to move forward) would keep them circling.

Finally the flocking behavior of a moving swarm may simply be the above standing swarm rules with a force pulling the individuals to a certain location. This may be enough to create a nice looking moving swarm.

Thats all I can think of for now. Any thoughts? Other types of swarms? Other rule sets that seem plausible?

A Very Cool Method for Computing Shortest Paths

I've been reading a very cool article A Very General Method of Computing Shortest Paths, which gives a nice general way to compute several things by solving an equation in a *-semiring or Kleene Algebra. It ties together shortest paths, regular expressions from automata, solving linear equations, and several other problems into one fairly simple algorithm. Definitely worth reading.

Sensor Controller Interpreter

I've started programming a prototype for MicroSwarmigism, and while things are going slowly, I've got a general notion in place that I want to record here. The idea is that changes to the game occur in three stages, consisting of a sensing stage where data is collected, a controlling stage where changes are calculated, and an interpreting or application stage where instructions are carried out. This is more general then what I'm describing here, and in fact it seems to be a nice way to decompose things like genetic operators or generally operations on a data structure. For now I'm focusing on how I'm planning on using it in Unity.

All three of Sensors, Controllers, and Interpreters are parametrized by their input and output, so they are really just unary functions. The difference is in how they are used, and in Unity, when they are called. Sensors are intended to be used to determine information about the state of the world such as mouse input or positions. Sensors run in Update, and I plan on only running Sensors when their data is asked for, so they are lazy. Controllers are intended to produce the data that is used to make changes. I call this data instructions, but really it can be anything. The Interpreters are intended to apply the instructions to some piece of data. This is done in LateUpate, so all sensors will have pulled their data and ordering of Controllers will not effect Sensors. The ordering on Interpreters is not controlled, so their effects should be commutative if possible.

An example of this that I'm working out right now is this: a Sensor collects mouse input information, a Controller reduces this information to a delta in mouse position, and an Interpreter merges the change in position with the position of a cursor. This would be how the mouse cursor works. Adding a collision sensor and left click sensor would allow a controller to determine is something is selected, and an interpreter to change the cursor's current selection.

Another example would be a swarm with a sensor telling it the position of each of its gisms, a controller averaging that position, and an interpreter on each gism using this information to effect its next move.

Yet another example would be mating- a sensor on a gism would determine close by gisms, another would determine energy level, a controller would determine if mating should occur. The given gisms should then have a sensor providing genetic material, and an interpreter applying crossover.

The one downside to this setup that I can think of is that since Update is used so collect data and produce instructions, and LateUpdate to apply them, there is no time that is used to enforce invariants the way LateUpdate is sometimes used. For example, if the mouse cursor goes off the edge of the screen, we could use LateUpdate to place it back on the screen after all other movements have been applied. I plan on simply being aware of this and programming carefully to mitigate the problem.

Wednesday, July 4, 2012

A Proof of Spirits

In addition to his electronics blog, To The Rails, my friend Chris now has a booze blog Proof of Spirits. Check it out.

Also check out my brother art blog Sculptor by Day for awesome drawings and computer graphics, as well as MicroSwarmigism updates.

Saturday, June 30, 2012

Possible Improvement on the Last Post's Algorithm

This post describes what may be an improvement of the efficient way to do the canonical Genetic Algorithm described in the last post. I'm not sure how well it would work in practice, but for a mostly converged population it could be a huge help. It can also help avoid recalculating fitness for two individual that happen to be the same. This will involve hash functions, and the downside is that it may result in some crossovers, selections, and fitness evaluations not being performed correctly when distinct sections of individuals hash to the same value.

Recall that in the last post we made use of a complete 2-tree representation of our individuals. What we will add here is one single piece of data to each internal node of the tree.

Assume that we can hash the contents of an individual. This is not much of an assumption given the usual fare for a GA- natural numbers, integers, bits, doubles, floats, etc. We calculate a hash of each location, and, moving up the tree, combine the hashes of the two leaves, fingertree style. Each genetic operator will now be slightly modified from the description in the last post by this change.

Mutation is the same, except for the unfortunate fact that we have to propagate changes to a location up the tree to the top. This makes each mutation take logarithmic time.

Crossover is where we start to see some advantages. When traversing down two individual's trees, if we reach a node where hashes agreed, we stop doing crossover. We assume that if the hashes agree than the subtrees are the same and don't require any copying. Note that there can be false positives here if two subtrees just happen to hash to the same value.

Selection is done similarly- if we are copying over a losing individual with a winning won as described in the last post, and we come to subtrees sharing the same hash, we again assume they are the same, and can stop the copying. Again, false positives can occur.

We could try to use this to avoid recalculating fitness for individuals by placing their hashes in a hash table with their fitness, although this is not a good idea in general. This is the one case where I would be very wary of collisions, especially as we may want to only save one result per location in the table to avoid having it grow too much (which is what happens in my experience if you try to hash every fitness evaluation). Again, not really recommended unless we really want to avoid doing evaluations and we can safely report the wrong fitness or gene expression sometimes (like an a certain game).

Unless I actually implement these algorithms, which I may, I think this is it for this idea. The 2-tree representation is something I've been thinking about for a long time, and I think it is really a good idea, despite the extra complexity and logarithmic access to individual locations (just use maps and folds like a regular person and it will be fine).

Efficient Genetic Algorithm Idea

This post describes a pretty efficient way to do Genetic Algorithms in an imperative language. Its pretty much like the pure way I talked about in an earlier post (using Haskell's Sequence and Vector types), except should be pretty straightforward to implement with destructive updates. Its a little harder to say exactly what the complexity is in practice, as it depends on how close the population is to convergence. However, even in the worst case it should be a good bit faster then a naive implementation. Note that this assumes the usual genetic operators and a population of linear individuals.

The representation of the population is as a collection (an array, for example) of complete 2-trees (trees with 2 children where all but the last level is full except possibly the last, and all nodes on the last are as leftmost as possibly). The nodes of the trees simply contain pointers to their children, and only the leaves contain values (the elements of the individual). The shape never changes, and it can be allocated as an array of pairs of pointers to tree nodes. The genetic material itself can be in a separate array or actually held in the leaves, as long as it is easy to get access to the leaves as an array. We will see why we need each piece of this setup soon.

Mutation is easy, as long as the locations can be accessed as a single array (or 2d array, or anything that can be accessed in constant time). As I've described before, rather then testing each location, simply calculate the number of failures you will have before the next success, skip over that many elements moving left to right through the population, and mutate that location. This calculation requires sampling from the geometric distribution describing failed mutation attempts before the next successful one. The only tiny trick here is to use an efficient sampling algorithm (Vose Alias Algorithm would be my choice) and to lump the probabilities of all indices beyond the end of the population into one. For example, if there are 10 indices in total, then 0-9 are given by the geometric distribution, and the 10th has the sum of all possibilities past 9 to infinity. If you generate something past the range of locations, you are done with mutation, so thing turns the infinite geometric distribution into a finite one.

Crossover is where we make use of the tree structure. This works essentially for the same reason the Sequence type in Haskell works for this, although its implementation is simpler. This is slightly tricky to explain, but it isn't really very complex. First off, note that we can usually consider the random pairs to be simply individuals next to each other in the population, as selection usually randomly picks out individuals, so they would already be shuffled. In this case we have to do this ourselves, which only takes O(n) time (shuffle a list of 0..n-1 and use the locations next to each other in this list as pairs).

Now we want to actually do crossover- we have our pair of individuals and a crossover point chosen. I'm only going to describe 1 point crossover, but 2 or n point would be essentially the same. What we want to do is to swap the child pointers in the two trees representing our individuals such that the range of indices from 0 to c (the cross point) and from c to m (the end of the individuals) are swapped. This proceeds as follows- traverse the two trees from their roots by checking if the crosspoint c is less then or greater then the midpoint of the range of indices in leave nodes under the current node- if less than, go to the left child, if greater then go to the right child. This information can be precalculated or stored at the internal nodes. When moving down a level, if you move left, simply follow the left pointer/reference. If you move right, then before recursing on the right child, swap the child pointers for the left children at the current node between the two trees. In other words, we swap all pointers on the left as we move down. If you get to a leaf node, swap the pointers for that node in the internal node that pointers to it.

This process is again hard to describe, but it is truly easy, especially if you look at something like http://www.haskell.org/haskellwiki/Zipper. The cool thing here is that it is a pointer update to do crossover, and requires only logarithmic time in the size of the individuals.

Finally we get to selection. Selection isn't really any different then usual here- we can do tournament selection and roulette wheel selection efficiently as I've discussed in previous posts. There is the problem of indexing into the individual- while in mutation we could consider the population on large array regardless of which individual a location belongs too, when inspecting a particular individual we need to traverse its tree to get the actual location of an index as they have been mixed up through crossover. Accessing a single location would take logarithmic time, but accessing the locations one after the next would only take linear time, so we many want to copy the individual off somewhere for evaluation.

The only other optimization I've considered is to do the replacement of winners with loser in place, leaving winners where they are in the population. We might do this as follows- assume we have the individuals in an array, and we know for each individual how many times they are to be represented in the next generation. We move two pointers through the population, one keeping track of the current individual with more then one representation, and the other pointing to the next individual with 0 representations in the next generation. Simply copy the winning individual into the losing one and decrement the representation count. This is repeated until the winning individual has a count of one, and we move to the next repeat winner until we get to the end of the population. This doesn't save a huge amount of copying, except perhaps in a mostly converged population using roulette wheel selection, where many individuals will have some representation in the next generation.

Thursday, June 28, 2012

Thoughts on C Callbacks

I've come across several examples recently at work of places where a C function pointer can be provided as a callback, and a single argument is (often optionally) provided. The argument is given when the callback is registered, and the function is called with the same argument each time, usually given as an "int". These are just some thoughts about this technique.

First off, the whole "only one argument" is not a big deal. These are essentially higher order functions, and as we all know, one can curry and uncurry functions. Of course, in C this is done by hand, but its still possible and done all the time. It is common to define a struct containing the values for a callback, and passing a pointer to such a struct in the callback to provide multiple arguments. As structs are product types, this is essentially uncurrying a callback of multiple arguments.

The whole "called with the same argument every time" is not really a big deal either- if you want to have a value that changes, just pass it a double-indirect pointer and change the pointer value. This must be done carefully (like everything in C) as in some situations your callback can be used in a context that can interrupt any context that changes the value of its argument, and synchronization primitives may not work the same in the two contexts (ISRs in VxWorks for example).

This is not the only situation I've seen, however. In VxWorks, tasks can have up to 10 arguments when they are spawned. I believe the reason they can be optional is that if ABI specifies that the caller cleans up the stack, the callee not using some extra values is not a problem. I imagine they would have to be placed with the leftmost argument on the top of the stack for this to work, but I haven't peeked around in memory to make sure this is true yet.

Monday, June 18, 2012

MicroSwarmigism

Joel and I have been talking about making a game together that has to do with controlling a swarm or swarms of tiny organisms. This is going to involve flocking behavior, Genetic Algorithms for traits, and possibly several other techniques (I may work in RGEP if it has a place), as well as lots of art, animation, and creative awesomeness. Joel has already done some great videos to show off ideas for the game and to give people a feel for it, as well as develop our understanding of it. You can see these videos on his blog, http://sculptorbyday.blogspot.com/.

The current plan is to use Unity, which is completely new to me, so I need to get a copy of Windows and start programming. I will probably use C#, but Javascript and Boo are not out of the question- I just don't know enough about any of them to have very strong opinions (opinions are very welcome- please!). I will have to learn quite a lot to get anything working, which I'm excited about. Games are such an expressive, visual, and interactive type of programming that I have only some small experience with, and hopefully a nice balance to the embedded systems programming I do for a living. I've always loved visualization of complex things like Genetic Algorithms and Particle Swarms, and the beautiful images (and games?) they give rise to.

MicroSwarmigism! As Fry from Futurama once said- girls like swarms of things, right?

Linear Lambda Calculus and PTIME-Completeness

I've just finished reading (the parts I could understand of) a paper with the same name as the title of this post, and I'm very impressed. It turns out that the linear lambda calculus has all the computing power of a boolean logic circuit and that the types in linear lambda terms uniquely determine their terms and since they are all typeable, there is an isomorphism between types and terms.

I definitely recommend the paper, especially the part where they do a simple embedding of boolean circuits in the simply typed lambda calculus and find that typing the resulting term results in the type of the embedding of true or false, which is also the normal form of the resulting lambda term. Good stuff all around!

One last cool thing- as I understood it, they also said that the type system of the simply typed lambda calculus is the linear lambda calculus. This seems to be common with systems of logic, where the upper layers are some restricted form of the lower.

Accuracy Based Learning Classifier Systems

I have always found learning classifier systems to be extremely interesting. A LCS is an agent that is able classify its environment (say, through sensor input, or given problem instances) and decide on an action based on a set of rules. The crazy thing about its decision making is that it arises from a collection of competing rules evolved by a Genetic Algorithm- a collective but singular mind. Even better than a LCS is an Accuracy Based Learning Classifier Systems (XCS), which attempts not just to find the actions that give the highest reward from its environment, but to find rules that accurately predict the actual reward. The nice thing here is that while in both systems you end up with a collection of rules that can be used as a classifier, with an XCS you end up with a collection of rules that map out the reward space. Either way, the result is the whole population (there are variations on this, but I'm not going to go into that), so we get a swarm of weak classifiers that collective can solve problems. The individuals compete with each other, but they also share rewards.

I won't go into huge detail on describing these systems- there is no single part that is terribly complicated, but they are altogether more complex then some of the other systems. If nothing else, they contain a GA component, and so they are more complex than a GA by itself. I will give a brief overview of XCS (for me, the more interesting of the two), as much for myself as for the reader. The description follows the very good explanation in the paper "Tournament Selection in XCS".

A XCS has some input that it must learn to classify using its internal rules (individuals), which consist of a classifier (some pattern to match to the input to determine if the individual has something to say about the input). Each individual must also have an action to perform if their classifier does match an input, as well as a predicted reward for the action along with an error in that reward prediction. They also keep a fitness value with them, which goes up and down depending on their ability to accurately predict high-reward actions.

A simple case of input is a fixed length bit vector, so that will be the example. The input is an element of {0,1}^L, bit vectors of length L, which are matched against classifiers in {0, 1, #}^L. If you are not familiar with this notations, the # is commonly used as a "don't care" symbols, matching anything it is compared to. So if we imagine that the system get an input 10101, and the individual has a classifier like 10101, 10###1, 1#####, ##101, or something like that, then the individual matches the input. We collect all individuals matching the input into the matching set of individuals, |M|. There will be some set of actions advocated by individuals in |M|, call it A, and we chose the action to take by the greatest fitness weighted average of predicted payoff for each action in A. That gets us to the action set |A|, the set of individuals that advocate (have as their genetic material the action) that has the highest average payoff weighted by the individuals fitness' that advocate that action. Having now chosen an action, the action is taken and the reward is reaped from the environment (the action may be a classification of input, and the reward an indication of how close the classification was to the correct one).

Having received a reward, the classifiers that advocated the reward must it. In addition to merely being rewarded, they are modified by how accurate their predictions of the reward was, and their error is modified if the reward was outside their prediction +- their error and beyond some predetermined threshold for error tolerance. I will not describe these updates- just look at the paper. <>A GA is then evoked for individuals that were in the action set and have not been part of the GA for a certain amount of time. This is one of the many parameters to this algorithm, and the time since an individual last took part in a GA is one of several extra values that are kept track of in individuals (also an experience and numerosity count- again, just read the paper). The individuals are selected from this subset of the action set by some selection mechanism (tournament being advocated in the paper), and mutated and crossed as usual. It seems like uniform crossover is favored here.

Interestingly, after the GA a "subsumption" operator is applied to delete individuals that are completely subsummed by others in the action set, are experience (have had their parameters adjusted a certain number of times) enough to be deleted, and are less accurate then the subsumping individual.

So there you go! XCS! Very cool algorithm, somewhat complex, lots to keep track of, but conceptually simple enough and a very awesome idea. I have never programmed one, but given that I'm making lots of good progress on my evolutionary algorithm framework, I may give it a try.

Wednesday, June 6, 2012

Condensed Table Method for Sampling Discrete Proability Distributions

My brother once told me that when faced with a problem in computer science, the answer is to use a hash table. It turns out that a mere lookup table, along with some cleverness, can be just as good. In this case, a lookup tables gives a pretty nice constant time algorithm for sampling discrete probability distributions. In this post I will describe the Condensed Table method and its implementation in the Haskell library "statistics". I was just about to implement Vose Alias algorithm, but seeing this already implemented method with constant time sampling may convince me otherwise (depending on its space cost for the distributions I'm interested in, which I think will be easily justifiable). The original paper (called "Fast Generation of Discrete Random Variables" and found at http://ideas.repec.org/a/jss/jstsof/11i03.html) describes this algorithm nicely, and I recommend it more then my explanation.

This algorithm requires the creation of a series of tables, one for each digit of the probabilities we are interested in. This may seem like a lot of tables if we want to resolve down to many digits of each probability, but we are free to chose any base system. The implementation in the statistic library uses base 256 and 32 bit numbers, so it only needs 4 tables (each digit is one byte of the 32 bit number). The implementation first multiples the probabilities by 2^32 and gets their integer equivalents, but we can imagine that we are inspecting the probabilities expressed in the desired base system one digit at a time.

Starting with some set of items paired with weights normalized so that they are between 0 and 1 and sum to 1 (a discrete probability distribution with finite support). Notice that if we multiple the probabilities of each item by, assuming base 10, 1000 (truncating after the decimal place) then we are left with numbers between 0 and 1000. If we allocate places in some huge array such that for each item "i" each number n between 0 and 1000, i gets put in n places. Sampling this array with a uniform distribution will get us an item i with probability equal to the original truncated after a certain number of digits (4 in this case). In applications where speed is more important that correctness, this cost of accuracy is certainly worth-while, especially as we can set it up so we get as much resolution as we could responsibly expect without resorting to much more complex and correct algorithms. Therefore, the main problem is that we need (in this case) O(1000*m) space, where m is the number of items in the original distribution. The more digits we use and the more items in the distribution, the more space we need, and the growth in space is fairly quick.

The trick in this algorithm can be seen by reorganizing the item in the large array. For an item that started with probability 0.252352 we multiple by 1000 and truncate, getting 2523. Instead of putting all 2523 of that item together, put 2000 of it. Do this for each item, so we are left with an array ending with groups of less than 1000 (all the groups larger then 1000 are grouped together in the front of the array). We repeat this process so that the 523 of our example item is grouped into 500 of that item and 23 at the end of the array. Doing this once more splits out the 20 together and the last 3 are in the last group of groups (which are all less then 10).

Having grouped this items by order of magnitude, we can reduce the 2000 to 2, the 500 to 5, the 20 to 2 and the 3 to 3. We have to keep these separately so we know which groups have been reduced by what number. We can then sample the distribution similar to before, but we are careful to search the group corresponding to where in the original large array we would have fallen even though we are not storing this array. Obviously we don't have to store this original array at all, but rather a series of arrays with numbers of representations in each array corresponding to the number found in each digit of the probabilities.

These array will still be somewhat large, but by my calculations, they will be on the order of kilobytes or maybe a couple of megabytes for the applications I have in mind. I have no problem paying in space for an advantage in time, especially since I expect to sample a single distribution many tens of thousands of times.

Again, I recommend the original paper- it walks through an example in much more depth then I bothered to.

Wednesday, May 30, 2012

Random Sampling from a Discrete Distribution

I can't recommend enough the article "Darts, Dice, and Coins: Sampling from a Discrete Distribution". It develops Vose Alias Algorithm step by step from a series of algorithms with increasing performance, proving the correctness of each algorithm along the way. I feel like I have a much better intuition about this awesome algorithm now, and the pictures and examples went a long way in helping along the way.

Good stuff.

Power vs Safety

I heard one time the idea that one should use the least powerful tool powerful enough to solve a problem. I didn't understand this at the time, but I've recently come to an understanding (not necessarily the meaning of the person who wrote it).

My first interpretation of this idea was that one should use the tool (I'm thinking in particular about programming languages, but this is more general then just that) with the fewest features. I thought I would rather use the *most* powerful tool I could find. I can understand, for example, not embedding a full programming language into a program if all you want is to let users enter arithmetic expressions, but not, for example, using BASIC rather than Haskell if its an option. You can benefit from simplicity, but I don't see an advantage to using languages without useful features just because you don't strictly *need* them.

Recently I came to another interpretation of this idea. It is common in computer science for there to be a series of systems of varying power that can solve increasingly difficult problems. The catch is that the more powerful the systems the weaker the statements one can make about them. This is true in automata theory (FSM < DFM < Pushdown Automata < Nondeterministic Pushdown Automata < Turing Machine, (although DFM < FSM as well)), the typed lambda calculus (the hierarchy is less obvious (at least to me), but much research goes into coming up with powerful systems (the calculus of constructions, for example) that are still weaker than the untyped lambda calculus to retain properties such as strong normalization or confluence), optimization theory (where you classify problems by the weakest form of optimization that can solve them), recursion theory (primitive recursion < mu-recursion (that about all I know about recursion theory)), abstract grammars (regular grammars < context free grammars < context sensitive grammar < universal grammar). Each of these situations (and more, I'm sure) have many more levels between the ones I happen to have heard of.

Taking automata as an example, each level in that hierarchy can solve all problems solvable by ones lower down, but can't solve all problems solvable by ones at higher levels. However, at each level we lose some assurances about termination or resource use. Finite automata are guaranteed to terminate, but the more powerful Turing Machines are not. Expressions typeable in the polymorphic lambda calculus reduce to a normal form, where untyped expressions do not always have such a form. Equality of regular grammars is decidable, but not so for context free grammars in general. We won't have strange optimization techniques like Genetic Algorithms if every problem were linear.

This is really a very common situation, where we would like to determine the least powerful system powerful enough to solve our problem simply so that we can make as many assurances about the solution as possible. This is also one of the ways that things like closure operators and adjunctions come into computer science ("the least thing great enough to" or "the greatest thing less then"). I think it may even be a fundamental part of computation in some way that I don't know how to formalize. Regardless, it is certainly true that a great deal of effort goes into doing exactly this- using the least powerful tool powerful enough to solve a problem.

Tuesday, May 29, 2012

The R Language for Machine Learning

The other day I bought a book called "Machine Learning for Hackers," which uses R to show off some machine learning algorithms. There is a lot I don't know about machine learning, as I've only studied a subfield and not really the main statistical methods that are useful for the vast majority of problems. One reason I bought this book (in addition to having lots of gift cards) is that a friend of mine mentioned learning R recently, and I would like to know more about it. I don't know almost anything about R, so beware this post- its my initial impressions and misunderstandings.

My understanding is that R has a hodgepodge of properties that make it a bit strange. The basic data type is a vector (which is pretty cool and probably allows an efficient implementation of a lot of primitive operators on vectors), all data seems to have metadata attached, it is dynamically typed (properties of data are determined by metadata modifiable at runtime), there is an object system build on that metadata, it has higher-order function (terribly important, if you ask me), it is lazy (although common operations seem to force evaluation), its call-by-value in the sense of copying (although its looks like copying is done only when necessary), and it allows mutation of structures (although it limits propagation of effects on data through copying).

Overall its probably a good language for its domain- it got a lot attention to convenience, its designed to manipulate lots of complex data, and it supports some functional programming. I expect that it could get very messy with edge cases, odd interactions between mechanisms, possibly clumsy support for more regular programming tasks (I don't know either way about this one), and undefined semantics. The last one is a shame- languages should really know what their semantics or its programmings won't know what their programs really mean, and its hard to come up with variations of the language.

I'm pretty excited to learn these techniques, although this book does not go deep into theory. Its more about getting interested and involved in these algorithms with case studies and practical concerns. I think this is perfect- I'll learn what theory I want to know somewhere else. For now, I'm just glad to get a feel for the broader reach of this field that I like so much.

Wednesday, May 2, 2012

Genetic Algorithm and RGEP Complexity

I've been looking at efficient implementations of genetic operators lately. Obviously the actual complexity depends on the fitness function (sometimes more then anything else), the choice of genetic operators, the implementation of those operators, and the representation of the individuals. However, I'm really interested in the ones used in RGEP, so point mutation, one and two point crossover, rotation, and tournament selection.

I'm slowly converging on a set of structures and algorithms that seem to give a complexity on the order of O(g*p*log(n)) with g the number of generations, p the population size, and n the size of the individuals. This also applies to RGEP, as it is pretty much just a genetic algorithm with a Rotation operator. A genetic algorithm would normally have O(g*p*n) instead, assuming the usual genetic operators, the usual array of arrays representation, and a reasonable implementation of the genetic operators.

To get this time complexity, I plan on using a Vector (the Haskell library Data.Vector) for the population (assuming the usual multiset population) and a Sequence (the Haskell library Data.Sequence, which implements 2-3 finger trees) for the individuals. These are chosen because Vectors have efficient indexing and a nice function "backpermute" which can be used to select the individuals for a new population (given their indices from a selection operator) in O(n) time. Sequences have an amortized O(min(log(i, n-i))) split operator where i is the index to split on and n is the size of the sequence. They also have a O(min(n,m)) concatenation operation (n and m are the sizes of the two sequences). This is important as all of point mutation, crossover, and rotation are really not about copying, but about splitting and concatenating, and should really take less then linear time.

I have learned that roulette wheel selection can be done in O(n) time, sampling n times from a population where each sample can take O(1) time using either the algorithm in "Roulette-Wheel Selection via Stochastic Acceptance", or in my case maybe the one in "A Linear Algorithm for Generating Random Numbers with a Given Distribution". Even using Tournament Selection, which is my personal preference, we need only linear time in the population size to do selection.

I've talked before about point mutation by sampling from a geometric distribution, and given that we need only set up the distribution once and each sample will take O(1) time depends only on the number of mutations in the population. This is usually chosen so that each individual is mutated only a couple of times, so it occurs with about the same frequency regardless of the individuals size (and therefore shouldn't depend on that size). It is unfortunate that this is the one place where a vector would be better than a sequence (assuming it is a mutable vector), but I'm not terribly worried about this. If necessary, all the points can be either mutated in a single pass down the whole population, only changing those points that were determined to be mutation points, or by a series of applications of the function "adjust", which is where I'm getting what I'm going to call p*logn complexity because the number of mutations is determined by the size of the population and the complexity of splitting and rejoining an individual at a given point.

Given these observations, the time complexity is O(g * (mut + cross + rot + select)) where the complex of mut, cross, rot, and select are plogn, plogn, plogn, and n respectively. This makes the total complexity only O(g*p*logn).

This may not be the best we can get (I'm not sure, but I'm looking in to it) but its pretty dang good compared to my usually naive implementations.

Tuesday, April 17, 2012

Efficient Roulette Wheel Selection

I ran into a pretty nice way of doing roulette wheel selection the other day. It requires O(n) set up time, and each selection takes O(log(n)) time, for a total of O(n+nlog(n)) or O(nlog(n)) time for the whole selection process. This is a lot better than the completely naive way (generating the "spin" of the wheel, and then running through the individuals looking for the one that was "hit") which takes O(n^2), and which I've been guilty of in the past.

First off is the setup. Given a population of individuals, paired with their fitnesses. Lets assume that this is in some linear structure like a linked list or vector. Go through each individual, setting their fitness value to the total sum of all previous individuals plus their own fitness. The summing of the fitnesses takes time linear in the population size.

Next is doing the actual selection. This is amazingly easy- generate a value between 0 and the sum of all fitness values. Then do a binary search through the population with the new (summed) fitness values. Each selection takes log(n) time (to do the binary search) and you need to do it n times, so the second part of this selection technique takes nlog(n) time.

So there you have it- an easy and quick O(nlog(n)) time implementation of roulette wheel selection. I will probably end up using this soon enough.

Friday, April 13, 2012

Haskell Evolutionary Algorithm Library, take 2

I've been working on rewriting the program I did for my thesis- an evolutionary algorithm library in Haskell. It should be more general then just evolutionary algorithms, but I can't think of a better name involving Haskell and Machine Learning. This post is a progress report.

So far its going great. The structure of the library is a mixture of functional lenses (from Data-Lens), pipes (from Conduits), random variables/probability distributions (from RVar), and a little monad for keeping state and doing things randomly while allowing mutable references. I'm not completely sold up on the mutable references thing (ST monad) for this situation but I would like to play around with using it for all operators and seeing if the efficiency is worth the extra complexity and type variable in all my definitions.

The plan is to provide a couple of implementations of the usual genetic operators- a very general one that should work with different structure nicely and can be easily used, a very efficient one that can be used on fewer structures (probably vectors of unboxed types) but is hopefully insanely fast compared to what I've done before (see my previous posts), and one thats mostly for fun where everything is done entirely in terms of pipes. The last one would probably be slow, but it would be interesting to see a GA defined not in terms of populations and individuals, but completely in terms of pipes of elements (bits for example).

One of my major ease-of-use goals is that the user should be able to run a Genetic Algorithm simply by providing a fitness function. It should be easy to change parameters and all that, and slightly less easy to add operators or define a new algorithm, but for a quick run the fitness function should be it. I don't just mean a fitness function of a fixed type- I would like the generic GA to be so generic that you can give it a fitness function that expects lists of Bools and get lists of Bools, or if you want vectors of Ints, you get vectors of Ints, etc.

The current next step is finishing the implementation of point mutation and crossover, and then starting on some selections. I've notice that the pipe concept (and the nice implementation from Conduits) vastly simplifies this sort of algorithm. I really believe it is the Right Thing for the job.

Tuesday, April 10, 2012

Efficient Crossover and Selection for Genetic Algorithms

In the last post I discussed an efficient implementation of point mutation for a Genetic Algorithm. This time we look at crossover and selection (and rotation, but thats for PGEP and RGEP, not a GA). These operators take linear time each, and we do no better than that here. However, we can reduce the coefficient nicely, which is important in practice.

First off, lets look at crossover- especially 1 or 2 point crossover (or both). Crossover is not a terribly expensive operator- we simply need to generate a random value between 0 and 1 (to determine if we will cross a pair of individuals), and then a cross point between 0 and the length of the individual. The only somewhat expensive thing here is moving pieces of the individuals around, so lets not do this yet. Instead, we just make a list/array/vector of the ranges each individual occupies. One way to do this would be to consider the population a single large vector (or a 2 dimensional vector) and each individual a list of ranges from their start to end. After 1 point crossover an individual is a 2 element list of pairs- start to cross point, other side of crosspoint to end. In other words, we record what bits in the population make up each individual without actually changing the population at all. If we do 2 point crossover, or multiple crossovers in a single generation, these ranges may consist of more pairs of locations, but are still basically the same. We don't want the individuals to become too fragmented like this, as with completely fragmented individual this would be less efficient then the usual implementation, but for a single generation fragmentation shouldn't be too bad.

Selection can be similarly performed without changing the population. We just record the winner of each tournament or the result of spins of the roulette wheel (or whatever selection one desires). One way to do this is an initially all 0 vector, counting the number of occurrences of each individual in the new population. We add one to a location if the individual corresponding to that location is selected.

While this post is mainly about Genetic Algorithms, I want to mention that the rotation operator from PGEP and RGEP can be performed in a similar fashion. We either record the number of locations to rotate for each individual (with 0 for individuals that aren't rotated that generation) or modify the ranges we created during crossover described above.

So, we have an untouched population, a vector of ranges for each individual that describes what parts of the population they occupy, and a count saying how many times they occur in the next generation. It might be nice to be able to do the operators this way several times before touching the population (except mutation, which can be done independently however we desire), but since we need the full individual for fitness evaluation anyway, we might as well perform the crossover and selection we recorded at the end of each generation.

Now all thats left is to copy each individual from the old generation to the new one in one single pass over the population. This involves inspecting each individual and copying them the number of times they were selected into successive places in the new population (I'm thinking a double buffering style scheme where you copy back and forth between two equal sized populations). Each individual consists of a series of ranges, so getting the individual to the new population means copying each range one after the next to successive locations. An individual [(0, 10), (103, 107), (23, 26)] would copy the values between the locations (assuming the population is considered one big vector) 0 and 10, 103 and 107, and 23 to 26 to a single vector from, say, 0 to 20.

The advantage here is that while each copy is somewhat complex, we end up doing a small amount of work to determine what actions we need to take are (recording how to do each operator instead of doing it) which is sparse data about rearranging the individuals genetic material, and then actually copying each location only once per generation. I'm hoping to try this out when I get around to actually implementing this evolutionary algorithm library, although I will provide the simpler implementations as well.

Monday, April 9, 2012

Point Mutation with Huffman Encoding and the Geometric Distribution

The usual form of point mutation in a Genetic Algorithm is a pretty inefficient affair- the most straightforward approach is to check each location (usually a bit) in each individual, and mutate it with a given probability. This means generating a huge number of random values between 0 and 1. I've noticed that this gives a performance hit in my own GA experiments, and I would like to give a general way of doing mutation without losing randomness or generating so many random values. I've talked about this before, but I think this is a much better technique then my other suggestions. It is efficient, general (doesn't rely on any particular structure except vector individuals that contains something to mutate), and easyish.

The first thing to notice about mutation is that it is somewhat like a Poisson process, except in discrete time. As we walk down an individual (or the whole population if we want to think of it as a huge vector), we will mutate some locations and leave others alone. Each location is like a time tick, and whether or not we mutate is like an event occurring during that time it or not. So, how many locations do we skip between each mutation? And, more to the point, can we skip these without having to check to any work with them at all? We can assume that for the most part the probability of mutation is very low, so skipping over all the locations that won't mutate would be a huge gain if we could do it in less work then just checking each one.

Recall that (one definition of) a geometric distribution gives the number of failures before a single success from a Bernoulli process. With a probability of success equal to the probability of mutation, this tells us, given some number n, the probability that we will skip n spaces before the next mutation. The cumulative distribution function is then the sum of 0 < i < n of the geometric distribution at each number from 0 to i. Now we need to use this to generate a single number telling us how many spaces to skip. There may be a distribution for this too, but I don't know what it is (I would be happy to hear about it if one exists).

This is where a post on A Neighborhood of Infinity called "Lossless decompression and the generation of random samples" comes in. This describes how, given the cumulative distribution function for a finite probability distribution we can create a structure for efficiently sampling the distribution. There may be a better way to do this, but the way I'm planning on using for now is to build a Huffman encoding for the set of elements of the distribution, where the code word for an element is a path down a binary tree label with probabilities that leads to that element.

In our case the distribution is infinite in the sense that it is possible for any number of failures to occur before the first success. However, the individual or population has only so many positions, so really it is only the probabilities 0 to n where n is the number of locations being looked at. Since we are cutting off the tail of this distribution, we should sum all other possible numbers from n to infinity and assign the action of doing no mutations this value.

Now that we have the tools, we simply calculate the cumulative distribution function for the geometric distribution with p = the probability of mutation (this code should do it "
fmap sum $ tail $ inits $ [((1-p)**x)*p | x <- [0..]]"). We build the Huffman encoding, which is very simple. Finally, we generate bits and use them as paths down the encoding tree. The element at the end is the number of bits to skip and gives the next location to mutate. This requires only enough bits to go down the tree, and may skip as much as the whole population.

It seems common for the probability of point mutation to be set so that about 1 or 2 mutations occur on each individual each generation. This means that that expectation of the distribution is around the size of the individual, so we will end up generating very very few numbers. We do need enough random bits to get to the end of the tree (to construct a path to a leaf), but it will be balanced so that common elements have shallow paths. This means that this would be a huge advantage over the naive implementation, possibly but 1 or 2 orders of magnitude, depending on individual size and point mutation probability.

Saturday, March 3, 2012

Possible Design for a Machine Learning/Evolutionary Algorithm Library

My newest idea for a Machine Learning/Evolutionary Algorithm library in Haskell is to separate the problem into two levels.

The first is a level of plumbing to connect genetic operators, for example, that allows the user to store state, parameters to the algorithm, to write to a log, and, since these algorithm are stochastic, random number generation. This will be implemented as what I'm going to call the Stochastic Reader Writer State Monad (SRWS), which is just the RWS monad with another State keeping a seed for the random number generator. This provides the functionality one would expect for one of these algorithms, but I don't plan on enforcing any conditions on exactly what context, state, or log one actually keeps. These will have to be filled in for each algorithm, although I plan on writing some default ones for basic Genetic Algorithms and maybe Gene Expression Programming.

The burden is on the user to make sure that parameters that don't change are stored in some sort of context, that they have a data structure to hold their state, and that they insert writes to the log where they want. This is the weakness of this approach that I tried to avoid in my last attempt at writing a library like this. It seems like each algorithm will have to duplicate a lot of data and have its own, possibly large, data structures for things like generation count or probability of mutation, even if it is only a small extension of Genetic Algorithms. I think the way around this is to provide a separate specialization of the generic plumbing for each algorithm type, along with some classes like GenerationCount with functions like s -> Int with s the state, which says it has some generation count without specifying how to extract it. This is basically like a getter in OO, and I'm not sure that I'm not just regressing to my imperative thinking with that design. Perhaps there is something nicer out there? Lenses, maybe?

The second layer of this library would be some sort of mechanism for constructing these algorithms. I don't think that this layer has somehow capture the nature of the algorithm, or even be the most direct was of specifying it. Right now, my focus is just to give a single, coherent way to build bigger algorithms out of little pieces. If the ideas are simple and composable then I would be satisfied, even if I expect they will not be the most efficient. My current plan is to use this idea of composing machines to build the necessary operators. I'm not entirely sure what form the machines will take, but I've been playing around with several possibilities, and I hope to post about some of them soon.

To give a simple example of how the concept of little machine could be applied here, I'm going to describe how I envision building the usual GA. Assuming a bit vector representation for individuals, the bits would be (conceptually) little machines that, when provided a number between 0 and 1, would output a 0 or a 1, possibly changing some internal state. This internal state will probably be accessible, but with some of the schemes I've looked at it is not visible (see http://en.wikibooks.org/wiki/Haskell/StephensArrowTutorial). Selection will be a function that, given a population with fitnesses for the individuals, returns a machine that, given a number between 0 and 1, will return an individual. Repeating this process to get pairs of individuals, we would construct a machine that, given a number between 0 and the length of the individuals, would return a pair of crossed individuals. Notice that these machine are really just functions, but they store information about what actions they take along with their data. This is, again, somewhat like OOP. The advantage here is that it provides a single concept with which to express the operators, and it localizes the information about how they work, so that it may be easy to automate.

In addition to developing some of the ideas for how to define these ideas, I hope to make clear why I think this idea is useful at all (which I imagine isn't clear at first, even with (maybe because of?) this example). Just a simple look ahead- the result may end up just being the SRWS monad just like the outer layer, but perhaps with a little extra structure or another wrapping.

Friday, March 2, 2012

Finding 1s in a Machine Word

I stumbled the other day on a pretty neat paper called "Using de Bruijn Sequences to Index a 1 in a Computer Word" about a very cool algorithm for finding the index of the least 1 in a bit string. I wanted to share it here because it feels elegant to me. Note that finding the index of the first one can be repeated to find the index of all ones, and to find the number of 1s.

First, some background- there is a surprisingly difficult problem called Population Count, which the the problem of finding the number of 1s in a bit string (usually a single machine word). This is not difficult in the sense of finding an algorithm for computing it (shifting and masking the for each bit would work) but rather in finding an solution better than linear in the size of the machine word. This is sometimes provided in hardware, but not all instruction sets have it, and it may not be exposed to the programmer even if it exists (outside of, say, inline assembly in C). This may seem like a strange thing to want, but it (and the problem of finding the index of 1s) comes up in a variety of situations. One example would be if you were keeping a bit-map of blocks in memory, and you wanted to know how many bits are set in a single word (this came up at work the other day). In this case you may also want to know the first bit set, say to find the first bad block when managing flash storage. Another is in the implementation of an efficient immutable trie, which is very nice and useful data structure. In this case, the first bit set to 1 would indicate the first symbol that a branch has a child for.

The algorithm takes three steps- first isolate the first 1, than hash it to a unique key, than finally index into a lookup table for solutions. This lookup table is very small it turns out- one index per bit in the word. For a 32 bit machine, this only has to take 32 bytes, which is basically nothing and can be (and is assumed to be) pre-calculated very easily. Lets take it one step at a time.

First, isolating the first one. This algorithm assumes there is at least one 1, which is easy to check beforehand. What we want to do is set all 1s to 0s except the least significant one. This is very easy, and is given as "x & -x" for a machine word x, with "&" the bit-wise and operator and "-" twos complement inverse. Notice that the twos complement inverse inverts all bits except the first one, so anding it with the original value makes that bit the only bit still set. The example they give is 01101000, with inverse 10010111, add one to get the twos complement 10011000, and them together 01101000 & 10011000 to get 00001000. Nice.

The next step is to produce a has of the resulting number. The has must be unique, so there are no collisions given values that are powers of 2, so that each possible first bit position with a one produces a unique key to index the lookup table with. This is where de Bruijn sequences come in. These sequences are just bit strings that, for some number n (a power of 2), contain exactly one instance of all log2(n) bit strings, if you allow yourself to wrap around. For example, for n=8, the sequence 00011101 contains all 3 bit (log2(8)=3) sequences, although to get 100, for example, you need to start at the rightmost bit and wrap around. They chose a sequence that starts with 000, for reasons we will get to soon. Now, for this fixed sequence, the hash function is given as: hash(x) = (x*s) >> (n-log2(n)) where s is the chosen de Bruijn sequence.

Now the cool part of the paper- this function must necessarily produce a unique key! To see this, we need to make a couple of observations. First off, since the input number, x, if a power of 2, multiplying by x is the same as a left shift. We do this multiplication modulo 2^n, so we ignore any bits after the nth. This means that we have lopped off some of the higher order bits of our de Bruijn sequence. Next, notice that the right shift operator will shift out all but the bits necessary to index into our lookup table. This means that we have lopped off some of the lower order bits now, leaving us with a subsection of the original sequence. At first I didn't see how we would get the bit sequences where we would have wrapped around, as the shifts that we are doing are not wrap-around shifts, but that is why we specified that the sequence should start with 0s! If it starts with zeros, and the multiplication by x adds zeros to the right, then get those sequences "for free" in the sense that they are still possible even without wrapping.

The next step is pretty obvious- we have a unique index, so just look up the answer in the table, and we are done.

This algorithm may not be the best in all situations, and they do some performance tests on it and discuss extensions and limitations in the paper, but its is pretty neat, I think. I like how we can derive a hash function with no collisions for this input, and how it cleverly ensures all indices are possible.

Incidentally, I found this paper while looking for a reference on de Bruijn indices, which are also very nice, but are completely different.

Tuesday, February 21, 2012

Denotational Programming and Real Life

I just watched a pretty interesting lecture called "Inventing on Principle", http://vimeo.com/36579366. The lecturer presents the idea that inventors should have a principal in their work, something like what artists may develop for their work. The principal of the speaker is that ideas are important, and that the creative process requires immediate interaction with their work- without which ideas often die. As example he shows some nicely interactive programs where things that are interacted with in the static medium of text are given an image to play with in the form of an image in one program, a game in another, and a bunch of variable assignments for a binary search. One of the cool ones was investigating a circuit an seeing its properties change in "real-time".

One thing that strikes me about this is the particular implementation is that it seems to rely on state. Certainly, there is a notion of state in animation and games, there is a notion of state in imperative programming, and even the physical experience of the computer (its changing state in time). This brings up a possible weakness of the denotational/declarative programming that I love so much. This is similar to my post "A Case for Imperative Reasoning". In that case, his objection was that denotational programming gives up the sometimes vital ability to specify certain aspects of the execution of a program (see "Lazy Functional State Threads" however). Here we have a different problem, which is that declarative programs are in a certain sense static or finished. The reason this is a problem is that we can't interact with it or easily see the steps between the starting program and the result.

It is interesting that the fundamental problem here is one of reasoning. The whole point of declarative programming is that we have well understood principals with which to reason so we can understand ours programs in a way that is impossible in certain languages. One way to look at this problem is that we want to provide at least the tools for reasoning about our programs that we expect in imperative languages without giving up the awesomeness of the declarative tradition.

One thing I want to mention is that in a language like Haskell, we can definitely interact with the environment, and we can create stateful programs, and we can create reactive programs. However, the language may reorder statements in any way, and it is not easy to predict (for me at least) what statements will be evaluated and when. This is also true of something like the type checker, which is similarly opaque. There has been work trying to remove this difficulty, see "Compositional Explanations of Types and Algorithm Debugging of Type Errors", and to adding debugging capabilities to Haskell or at least doing post-mordem traces. In some sense this isn't as much a problem as it is in other languages- in a pure language a piece of functional code never acts strangely or misbehaves- if it works in isolation then it works everywhere for all time. Also, algebraic data structures are normally easy to visualize, and there are programs for drawing them.

Despite the existence of these tools in particular and what seems to be a generally improving exosystem of tools for functional languages in general, this lecturer's ideas may be applicable in some way to pure functional language, and I expect that it would manifest itself differently in Haskell then in, for example, Javascript. Perhaps a term reducer could show a reduction sequence as a tree, or a heap-visualizer (these already exist, btw), or something like the speaker showed, but with the values of bound variables instead of the state of assigned variables, or maybe something much more exotic? I like visualizations of complex things like programs, but for large scale things they are often just pretty pictures- maybe there is a form for something like a stack trace (in languages where you get actually get such a thing) or the heap that is meaningful and could be inspected on a sample input to get feedback on, for example, performance and memory usage?

Wednesday, February 15, 2012

Spaces in Genetic Algorithms

There are many spaces involved in a Genetic Algorithm- the space of solutions we want to search, the space of representations that are our individuals, the space of values that make up the individuals (the space {0, 1} for bit vector representations), the population as a space (normally a set), the space of populations (the set of all possible populations), and the space of fitnesses (usually R+). Some of the spaces can be ordered in different ways, and often given different distance metrics. Their shape and the mappings between them have complex effect on the run of a GA.

I've been trying to describe Genetic Algorithms in terms of these spaces, so that a GA is a collection of movements in them. Since it is also stochastic, and some movements must be more likely then other in some situations, it seems like movement and sampling from (usually discrete) probability distributions is all there is to a GA. If we can describe GAs this way we may get two nice things: a way of describing and unifying our treatment of different variations (although some may be easier to describe this way), and possibly some obvious generalizations that suggest modifications to GA. There may also have properties that aren't obvious otherwise, although I don't know much about the literature on the formal study of GAs except vaguely that Markov Chains have been used to represent them, so I don't really know how to proceed there.

If nothing else, I like the idea of visualizations of GAs, and this suggests some ways to make them approachable by intuitive descriptions. Its common to talk about the solution space as a landscape with mountains and valleys, but this doesn't really give the whole story at all. This is especially true when the solutions and the individuals are not the same, which is actually very common.

Books Have

I just got some books in the mail: "Categories of Types", "Category Theory", "Types and Programming Languages", "Advanced Topics in Types and Programming Languages". Each one seems pretty amazing, and they cover types from practical implementation to to advanced type systems with lots of nice features, to their category theory interpretations, to pure category theory.

I have a lot to read.

Tuesday, January 31, 2012

Turing's Cathedral

I've just watched a google talk called Turing's Cathedral, and there was a part I thought was cool enough to record here. The speaker mentions the possibility of different types of computation besides the sequential access of instructions that perform actions.
The problem he mentioned with this approach is that the machines are so fast we need techniques like recursion just to provide enough instructions for them. I never thought about it that way.

Sunday, January 29, 2012

Evolutionary Algorithm Framework

I've been trying to come up with some satisfying framework/library for doing evolutionary algorithms and possibly other types of stochastic optimization for a while now. Part of the problem is that the space of possible algorithms is huge and diverse, and its easy to see how any design decision restricts the ability to express certain types of EAs. Another problem is that I've not been very organized about this. To combat the second problem I've decided to record some of my ideas before I forget them.

First off is the structure of the program. Should it be a library, like a set of combinators with which one constructs the desired algorithm? This would be nice, but I think ultimately it would be a bunch of individual libraries of functions to build different types of algorithms. The problem is that there is so much diversity in possible structures that one might use that to be truly general we can't say much at all about the stochastic operators and what structure they require of the population or individuals.

This leads to the second possibility, which is some definition of EAs in particular as a series of constraints. This would mean a class of some sort describing operations necessary to describe an EA. Part of the problem there is that it is hard to see what these really are, and how they are used to build an EA. This is true for two reasons that I've seen, the main one being state. Its hard to say what state will be necessary, and every algorithm might need slightly more or less than any other. The other reason is that even within GAs its hard to exactly specify the general concept of mutation, for example. It can mean several different things in different techniques, and the more general the definition the less it really says. This line of thinking always leads me to ideas that don't really help the user at all. They say the more abstract you go the more work you need to do to get something concrete, and I agree.

The other idea is to be more explicit about what I want, rather then getting caught up in generality. This would mean defining what I believe an EA really is, and then working on that problem instead, where hopefully the extra structure gives some inspiration. So what does an EA consist of? I'm my ponderings I have often thought that it is important to separate the parts out as much as possible to prevent unnecessary couplings between them. For example, the individuals should not "know" how to mutate themselves, in the sense that it shouldn't be part of their structure. One problem with that sort of coupling is when you go from, say, a regular GA to a GEP, where there are several mutation operators. Perhaps you could say mutation is the composition of each of the several operators, but this still doesn't help much when you want the same structure to mutate in different ways in different algorithms or different times in an algorithm.

Starting with the notion of a population, what sort of thing do we have? It would be easy to say that a population is a *set* of *lists* (or fixed length vectors) of *symbols*, but there are situations in which all of these things do not necessarily hold. It would be nice to include linear or grid populations, individuals with multiple "domains", and things like number or arbitrary structures rather then just symbols. Instead we might say that populations and individuals are combinatorial species, which is very generic, that they satisfy some predicate (are members of some classes) that describe "enough" structure to do EAs, or that they are members of some specific data type which should be defined to have as many possible structures as possible.

The last idea requires some single type, recursively defined, that covers enough ground to express the usual sort of EAs (I don't expect to cover all possibilities- I'm not sure thats really possible in a single library without generalizing to the point of uselessness). This would be something like "data Structure = R Double | Integer Int | Str String | Linear [Structure] | Pair Structure Structure | None", or something like that.

The idea of doing this with species is always intriguing, but I don't know enough about them to know if this is useful or approachable. This gets at the main problem here, which is that EAs are so general that we really need a general theory of container types to work within. The reason that this is a problem is that such a theory is a subject of study in its own right, and I don't know much about it.

Okay, I guess thats enough of that for now. Sorry about the rambling and aimless post, but I want this to go somewhere eventually, and I need to start researching and organizing my thoughts.