Thursday, December 29, 2011

The Sizes of Data Types

I noticed today that when considering the sizes of data types (in C in this case) we get a morphism from the algebra of types to a positive semiring of sizes.

The tropical semiring is given by (R U {*}, +, 0, min, *), with the underlying set the reals together with infinity (written here as *). If we take max instead of min, and take negative infinity as its identity, we get a max-plus algebra. In this case I'm going to restrict us to the regular old natural numbers, as all sizes of things are whole, and take 0 instead of negative infinity (so it is not much like a tropical semiring at all, but this was the source of my thought process). The morphism from algebraic types into this structure is straightforward: for every primitive type we have some natural number of bytes, or words, or bits, or whatever we want to measure (holes?), and for types M and N, size(NxM) = size(N)+size(M) and size(N+M) = max(size(N), size(M)), as they are represented in C. Assuming a Void type, size(Void) = 0, which works okay, and for a Unit type size(Unit) = 1.
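
To see the product and sum rules concretely, here is a small C check (the type names are just illustrations). The exact numbers depend on the platform, and alignment padding can make a struct somewhat larger than the plain sum, so on real hardware the morphism only holds approximately:

#include <stdio.h>

struct Pair { int a; double b; };    /* a product of int and double */
union Either { int a; double b; };   /* an (untagged) sum of the same types */

int main(void) {
    printf("int: %zu, double: %zu\n", sizeof(int), sizeof(double));
    printf("struct (product): %zu\n", sizeof(struct Pair));   /* about 4 + 8, plus padding */
    printf("union (sum): %zu\n", sizeof(union Either));       /* max(4, 8) */
    return 0;
}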

Pointers in C are all represented by the same size thing, so there doesn't seem to be a need to have a star operator in this structure. Instead, I suppose we can just embed all pointer types as primitives, or just specify that the star operator in the type system acts specially with respect to the size morphism.

This is pretty trivial (and probably wrong), but I wanted to record it as I like to think about the shapes of things and the properties of shapes.

Tuesday, December 27, 2011

Efficient Mutation for Genetic Algorithms

In the next couple of posts I'm going to investigate some ways of implementing mutation in a Genetic Algorithm (or really, any Evolutionary Algorithm). Let's start off with some well known and commonly used ones, and leave the more interesting ones (which are probably common knowledge, but I've not seen used) for later posts.

The naive and most general way to implement mutation is to generate a number in [0, 1] for each location in each individual in the population, and mutate the location if the number is less than some preset parameter (the probability of mutation). This is general in the sense that no matter the content of each location, this will tell you whether or not to perform some form of mutation operation on it. Notice that while this will certainly work on a bit vector, it may be very slow, as for a population of size m with individuals having n bits we need m*n random numbers.
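
A sketch of this approach in C, assuming (just for illustration) that each individual is a vector of n symbols stored one per char and that "mutate" means flipping the low bit:

#include <stdlib.h>

/* One random draw per location: flip each of the m*n loci independently
   with probability p. */
void mutate_naive(char *population, size_t m, size_t n, double p) {
    for (size_t i = 0; i < m * n; i++) {
        if ((double)rand() / RAND_MAX < p)
            population[i] ^= 1;
    }
}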

Another technique I've seen is to simply find the expected number of mutations, and change that many locations. This involves generating exactly that number of random locations (probability of mutation times number of locations) and mutating only those locations. This is also general in the same sense, but it does not seem morally correct. With the first method there is a chance that no mutations will occur, and a chance that every single location will mutate. The chances are small, but the point of mutation is small random change, and I just don't feel right making it so deterministic, despite the increased performance (assuming a small mutation rate and a large number of locations). To be honest I was a little appalled the first time I saw this used. You can make it slightly better by spreading those changes over the whole population, as each individual will then experience something closer to the intended probability of mutation. For example, with a population of size 1000, individuals of length 100, and probability of mutation 0.001 (0.1 percent), we get 1000*100*0.001 = 100 mutations spread over the whole population, meaning we need only generate 100 locations to mutate by generating numbers between 0 and n*m. We can also randomly choose individuals and randomly choose locations in those individuals, so we need 200 random numbers.
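
The expected-count variant looks something like this (same hypothetical representation as in the sketch above); note how few random numbers it needs compared to the naive version:

#include <stdlib.h>

/* Flip exactly round(m*n*p) randomly chosen loci across the whole population. */
void mutate_expected(char *population, size_t m, size_t n, double p) {
    size_t flips = (size_t)(m * n * p + 0.5);
    for (size_t i = 0; i < flips; i++) {
        size_t loc = (size_t)(((double)rand() / RAND_MAX) * (m * n));
        if (loc >= m * n) loc = m * n - 1;   /* guard the rare edge case */
        population[loc] ^= 1;
    }
}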

Okay, those were the two most common and simplest ways to do mutation. I can think of only two other ways, and I will go over those soon.

Wednesday, December 21, 2011

Systems Programming

I've come to a realization about programming recently which makes me appreciate my job all the more. What I've realized is that the type of programming we do there isn't about abstractions (which are helpful nonetheless) but rather about hardware. This is programming not for an abstract machine or a clear denotational semantics, but for a semantics given by the actual hardware sitting in front of you- the execution on one particular machine and maybe a small class of similar machines. Some parts of the program may be so specialized that they will only ever work on your particular project because of unique hardware or some special case that you didn't expect. In this case, C is actually kind of nice. It isn't safe at all, and maybe another language could do better, but the ability to be unrestricted in what you do is a necessity. It would be nice to have more safety in this context, and I'm interested in that concept- I'm reading some papers on type safe systems programming- but this kind of work has a clarity of purpose that I appreciate.

This has led me to realize that my problem with Object Oriented Programming isn't so much the idea, which is mostly about a class of data structures and their relationships with each other, but rather the practice- where I see abstractions that have no foundation in math, that leak, that are poorly conceived and ambiguous, and whose authors don't bring the powerful tools of mathematics and computer science to bear. I'm over-generalizing of course, and I'm not saying that math solves everything, but it is no doubt powerful, and it at least gives you a leg to stand on when you program. Starting with something that is well understood and composes well allows you to build something understandable that may compose, but starting with something that is not well understood and does not connect or compose at the primitive level does not give much hope of building reusable software at the abstract level.
So, that's my rant for today!

Tuesday, December 20, 2011

Haskell News Quotes

Some quotes from the quote of the week on #haskell, posted on Contemplating Code.

hpc: and when you want to get really ridiculous, agda has universe polymorphism, which is when you don't know how high up the tower of turtles you have managed to climb
RaptorRarr: Analogy puzzles are to intelligence as __________ is to elephants
[djahandarie] Group projects are stupid [shachaf] Try a semigroup project sometime. You need to lose your identity.
heatsink: Maybe, (), and Bool go to the Lone Star Bar. The bouncer stops Maybe and says, "we don't serve your kind here."
Jafet: Can oleg create a term so complicated that even he could not type-check it?
elliott: i'm here to prove theorems and compile code and I'm all out of code
kmc: my algorithm is O(1), but the constant factor grows quadratically as the problem size increases
luqui: Haskell is a DSL for writing monad tutorials.
spoolio: Haskell programmers know the cost of nothing, the type of everything, and might know the value of a few things later.
James Cook: I think Haskell questions on SO tend to the opposite extreme; no matter how poorly thought-out the question, the Haskell community will descend on it like a swarm of helpful piranhas.
djahandarie: Category theorists are morphisms for turning cofree theorems into free theorems.
monochrom: the road to haskell is paved with good abstractions
ddarius: Attempting to join #not-not-math sent me to #math. Freakin' Boole.
monochrom: @let germanize = concat . words

Genetic Algorithms Through Structures

I've noticed that it is possible to describe the genetic operators of Genetic Algorithms (and other evolutionary algorithms) through simple data structures. There does not seem to be any advantage in speed or even simplicity, but I like to think about GAs in different ways and try to decouple all the parts from each other, and this is a way to decouple the randomness from the action of a genetic operator.

First off- mutation. There are many ways to do mutation, but the simplest is to test each symbol of each individual, and with a certain probability replace that symbol with a random one from the symbol set. In particular you will often simply flip a bit. The knowledge needed to change a symbol is whether or not you need to change it, therefore a boolean contains enough information to make this decision. For an individual of some structure, that same structure containing boolean values is enough to perform mutation, so given, say, a vector of bits (the individual) and another vector of bits (the mutations), we can deterministically flip the bits in the individual that correspond to the 1s in the mutation vector (by xoring them in this case).
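
A minimal sketch of the deterministic half in C, assuming the individual and the mutation mask are both packed bit vectors (names are mine):

#include <stdint.h>
#include <stddef.h>

/* Apply a precomputed mutation mask: a 1 bit in the mask means "flip here".
   All of the randomness lives in whatever generated the mask. */
void apply_mutation(uint32_t *individual, const uint32_t *mask, size_t words) {
    for (size_t i = 0; i < words; i++)
        individual[i] ^= mask[i];
}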

For crossover all we need is a single number (for one point crossover) to indicate the cut point. To cross over a whole population we need a structure containing numbers, and a structure containing pairs of individuals, and we can create a structure containing paired individuals that are crossed. We can then unpair them and get a population. We could also create some selection structure, like pairs of indices into the original population indicating who to cross. For uniform crossover, or crossover with multiple individuals, we could create a list of individuals and a list of numbers where each number indicates which individual the position it is in should be filled from. In other words, each index's symbol is chosen from that index in the individual indicated by the number in our crossover number list.
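
For one point crossover the deterministic part is just as small; a sketch in C, with the cut point supplied from outside:

#include <stddef.h>

/* Swap the tails of two individuals of length n at position cut.
   Choosing cut randomly is a separate concern. */
void cross_one_point(char *a, char *b, size_t n, size_t cut) {
    for (size_t i = cut; i < n; i++) {
        char tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}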

Selection is no harder than any other mechanism. We can do tournament selection with a structure of pairs of numbers, the indices into the population to pick the individuals for the tournament from. Notice that this means that, like all the other operators, the genetic operator structure can be determined without looking at the population at all. These pairs would have to be tagged with a boolean indicating whether the more or less fit individual should be chosen when the selection structure is applied.

In RGEP and PGEP there is a genetic operator called rotation, which is basically a shift with wraparound. Clearly this can be indicated by the number of positions to shift by, which is a single natural number.

So those are the ones I can think of right now, but it does seem like any genetic operator can be decomposed into a random part which generates data for a deterministic part. Given the generated structure and the population you can perform the operator without generating any random numbers.

Sunday, December 18, 2011

Forth Interpreters and the C Stack

I haven't been working on my Forth project for a while, but I'm coming back around to it recently, and there are some things I want to change. One of these things is to start using a variation on a technique I saw in the paper "Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine". In this paper they give a single line interpreter for their machine: "while (TRUE) { cont = (*cont)(); }" which simply calls the function pointer cont, and expects the function to return a new function pointer. This has the advantage of always cleaning the stack up after every function call, preventing it from overflowing or from having to throw it away occasionally.
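
The same trick can be sketched in C roughly as follows. C can't declare a function type that returns its own type, so the usual workaround is to return a generic pointer and cast:

typedef void *(*cont_fn)(void);

/* Each step does a bounded amount of work and returns the next step to run,
   so the C stack is fully unwound between steps and never grows. */
void interpret(cont_fn cont) {
    while (1) {
        cont = (cont_fn)(*cont)();
    }
}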

The plan as of right now is that instead of always returning the next thing to call, the looping interpreter will simply do the action of "next"- it will copy the current IP (Instruction Pointer) onto the return stack, copy W into IP, increment IP, and call the function W points to. All functions will return nothing and the threading will be taken care of for them- they do not have to each have the code of "next" as I've done in the past.

I'm also going to change how the original dictionary is compiled. Before I was doing some pretty tedious manipulation of an array of cells (integers) for some reason. Instead of this, I will use a word struct of fixed length (the name will be a pointer to a character string rather than the string itself- this makes finding the xt of a word easier). I don't know why I didn't do this before, but it should be simpler and cleaner now.

I've been thinking about what I'm going to do for structures in this Forth. I don't want to use objects particularly, so I'm going to be thinking about how to do reasonable algebraic datatypes in Forth. It will probably involve runtime checking of types. I've devised a scheme for doing classes in the Haskell sense which I will post about, but it's in the planning stage and I may abandon the idea at any moment.

I'm looking forward to cleaning up the implementation and getting it actually running so I can start exposing hardware functionality and adding threading, interrupts, etc.

Monday, November 21, 2011

Structure of C Types

It seems like it is less common in imperative languages to investigate the structure of a language type system and try to understand its power. I think it is important when learning a language to know how it expresses things, and the limits of its ability to express things. In a previous post, I talked a little about some basic structure of Java's type system, so in this post I want to record an observation about C's.

There are several base types in C- int, char, double, as well as different variations like unsigned and const. None of these base types are subtypes of each other. The type void is the bottom type- there are no objects that actually have that type.

Structs are products of types. Products of types are like Cartesian products of sets. A simple example would be the struct defined like so:
struct Point {
int x;
int y;
};
Which gives the type of pairs of ints.

Unions of types are unions of sets. These aren't really all that useful- we need some way to determine which of the types in the union a particular element of the union is. Without this information an element of a Union is pretty useless. This is because we can only safely use the properties held by all types of a union, and since the type system is pretty flat there aren't many properties we can use for any given union of types. While structs are products, unions are sums (although to be proper sums they would need to be disjoint unions).

Enums are each the same as some natural number (the number of elements in the enumeration) which gives us sums, products, and natural numbers in the type system (with void as 0 in the sense that there are zero elements to the type). Notice that the number of possible elements of a struct of two enums (using only elements of the enum as C won't stop us from doing what we want) for enums of size n and m, is n*m. The number of elements in the union is n+m assuming they are disjoint. If they are not then the number of elements is like the union in set theory- we "remove" duplicates. In true disjoint unions elements that are otherwise the same are tagged with something to make them different.

C's only unary type constructor is the pointer/array. In some sense there are an infinite number of array constructors, as we can provide a size, although since this isn't enforced at all I'm going to ignore it. For any type we can make a pointer to it, and a pointer to that, and so on. Therefore the set of types is closed under pointing. Very often when one doesn't know the type of something we use a pointer to void. This is no coincidence- void pointers are the top type- the type that contains all other types (well, all other pointer types).

I think it is pretty interesting that void is 0 and void* is 1 in this system. Of course, with arbitrary casting void* can actually be useful; the properties of elements of the top type are only those properties shared by all other types, which is basically nothing. Otherwise a struct with a void* like:
struct Nat {
int n;
void* v;
};

Would be the same as:
struct Nat {
int n;
};

Because we couldn't do anything with the variable v, which, translated to type theory, is like saying that N*1 = N. Similarly:
union Nat {
int n;
void v;
};

If this were a valid type then it would be N+0 which is equal to N as the void contributes nothing.

I think that's enough for now. I didn't get to function types, but maybe there will be a follow up post on this.

Turing Machines, cont

A little more about Turing Machines.

One way to talk about a "problem" or computation is as a set of pairs- the inputs paired with the outputs. For example, the problem of adding one to a natural number is the set {(n, n+1) | n a natural number}. For more complex problems the input is some encoding of the problem, like a list of cities for the Traveling Salesman Problem, or a list of numbers that we want the average of. No matter what problem we choose, it is possible to encode the inputs as strings of bits ({0, 1}*). The output will be similarly encoded, and can be in whatever form we want. If the problem is a decision problem (given a number encoded in decimal, is this number prime?) then the answer may be interpreted by running the machine and looking at the first symbol on the tape (assuming we are using a tape infinite to the right but not to the left): if it is a 1 then the number is prime and if it is a 0 then the number is composite. We could also spell out "the number is prime" or any other way we want. Notice, by the way, that the pairs of inputs and outputs are very much like functions, although they may be partial in the sense that for some input there may not be a solution (or the computation may not halt on that input).

Notice that a machine may be given a string of symbols that doesn't encode an instance of the problem we expect it to solve. This is a perfectly well defined situation- if it halts then the machine in some sense defines a larger problem than we defined for it (it comes up with some solution to some problem, although probably not a very interesting one). We don't concern ourselves with the action of a machine on arbitrary inputs for the most part- we just ignore the extra possible solutions.

Similarly, a Turing Machine can be encoded by writing down each part of its data, as we saw in a recent post. It doesn't matter how we do it- it's only important that there be at least one way. This means that we can encode a machine and its input and concatenate them, resulting in a single string. This suggests that we could pass this encoded machine + input into another machine. This brings us to the idea of a Universal Turing Machine- a Turing Machine is called universal if every other machine and input to that machine can be encoded such that the result of the universal machine U on encoded machine M and input I is the same as running M on I. In other words U(M, I) = M(I). These machines are interesting because they can solve every possible problem if given an instance of the problem and a solver. Universal machines play an important role in automata theory, although I know nearly nothing about that part of Computer Science.

One other thing I wanted to mention about Turing Machines is blank tape machines. A blank tape machine is a machine that expects to start out with a tape of all blanks. It turns out that given some machine M and input I, we can construct a machine M' such that if we run M' on a blank tape we get the result of running M on I- M'(_) = M(I). This can be interpreted as saying that a program can take input, or we can hard-code the input in the source code.

Sunday, November 20, 2011

Prefix-free Universal Computable Functions

Next on the agenda is prefix-free universal computable functions. First I will talk about universal functions.

A function F (from N->N, taking a natural number and returning a natural number) is called universal if there is some way to encode any other function f and an input to f, call it x, such that when passed to F we get F(f, x) = f(x). Another way to think of this is to encode f and x, concatenate them, and pass them to F as F(f ++ x) (using Haskell's "++" for concatenation). This makes F a function of one variable that can perform the action of any other function. In other words, this type of function has the same place in recursion theory as a Universal Turing Machine has in automata theory- it shows that there are things in the theory which in some sense encode every other object in the theory, so that the theory contains itself. This kind of self reference appears to be essential to computation and we see it everywhere, the simplest example being how programs are stored in memory the same way data is- things that compute can generally take encodings of other ones and run them.

A prefix-free code is a fairly simple concept. Given some set of symbols, a prefix-free code is a set of strings of symbols such that for every pair of strings, neither is a prefix of the other. One example of such a code would be the following encoding of the natural numbers: e(0) = 1, and for all n of the form k+1, e(n) = 0 ++ e(k). Written out this looks like {1, 01, 001, 0001, 00001, 000001, ...}, where every natural number n is encoded by n zeros followed by a one. Clearly no number's encoding is a prefix of another's: for that to happen the shorter string would have to agree with the longer one at every position, but any string longer than a given string has a 0 in the one place the shorter string has a 1, so the shorter string cannot be its prefix.
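
Decoding such a code is simple enough to sketch in C; the point is that the decoder always knows where a number ends (next_bit here is just a placeholder for whatever supplies the bit stream):

/* Decode one number from the unary prefix-free code described above:
   count zeros until the terminating 1. */
int decode_unary(int (*next_bit)(void)) {
    int n = 0;
    while (next_bit() == 0)
        n++;
    return n;
}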

So, what is a prefix-free universal computable function? This is a universal function whose domain (the numbers for which it is defined) is a prefix-free code. In other words, this allows us to distinguish between the function f encoded as input to the universal function F and the input to f, as we always know when a string ends in a prefix-free code.

There is another piece here- Chaitin's number is best described as having to do with infinite bit streams. However, the part of each stream we are interested in is finite, and we need a way to know where the interesting part of the stream ends. For this reason, we decode the bits of the stream as a prefix-free code, and we then know where to finish looking for the interesting portion.

That's enough of that for now.

Saturday, November 19, 2011

Mu Recursion

The next topic in our discussion of Chaitin's number is Recursion Theory- another topic I don't talk about much on this blog. Recursion Theory is concerned with (possibly partial) functions from natural numbers to natural numbers. The recursive functions are those functions which are computable by a Turing Machine, and all Turing Machines can be expressed (albeit indirectly) by some (Mu) recursive function. Like the Lambda Calculus, this is just one more way to talk about computation- many systems give rise to computation, and they all share some overarching properties while differing in the details.

More formally, a function from natural numbers to natural numbers (N^m -> N with m a natural number) is called primitive recursive if it can be expressed using the following rules. The function plus(n, m) = n+m is primitive recursive (addition is clearly computable), as are the constant zero function and the successor function succ(n) = n+1. A function that selects one of its arguments, fi(x0, x1, ..., xn) = xi for 0 <= i <= n, is primitive recursive (these are sometimes called projections because they are suggestive of projecting a point in the plane onto one of its axes). Given a function f of m arguments and functions g1, ..., gm each of k arguments, the composition h(x0, x1, ..., xk) = f(g1(x0, x1, ..., xk), ..., gm(x0, x1, ..., xk)) is primitive recursive. Finally, there is primitive recursion itself: given a function f of k arguments and a function g of k+2 arguments, the function h defined by h(0, xs) = f(xs) and h(n+1, xs) = g(n, h(n, xs), xs) is primitive recursive. No function is primitive recursive unless it can be built using the rules just stated.

Many functions can be described using only the rules given so far, even though they may seem restrictive. However, there are functions that can be computed by a Turing Machine that aren't in the set of primitive recursive functions- this type of function is not quite powerful enough. We need one more rule to get all the power we need (and all the power possible- this will give us the full power of universal computation). This rule is the minimization operator, often written as Mu, which I will write as u, and is given as follows: for a function f that takes k+1 arguments, we construct the function uf(x1, x2, ..., xk) = z, where z is the smallest number such that f(z, x1, x2, ..., xk) = 0. I find this last rule less intuitive than the previous ones, but it isn't so bad when you get to know it.

Consider trying to compute the mu function for some easily computable function f. We have some fixed inputs that we were given and we want to find the smallest number that causes f to be 0 when given this number along with the fixed inputs. Clearly we would start with 0, compute the result, and if the result was itself 0, we would be done. If not we would try 1, then 2, then 3, and so on. If we found a number that resulted in a 0 we would be done, but if no such number exists then we would be searching forever. This is why mu takes us from the primitive recursive functions, whose computation always halts, to the full power of computation, where we may have some function and some inputs whose result can't be calculated (the calculation will not halt).
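
The search described above is short enough to write down; a sketch in C for a function of one fixed input (purely illustrative):

/* The mu operator: search upward from 0 for the least z with f(z, x) == 0.
   If no such z exists this loops forever, which is exactly the point. */
unsigned mu(unsigned (*f)(unsigned, unsigned), unsigned x) {
    unsigned z = 0;
    while (f(z, x) != 0)
        z++;
    return z;
}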

Turing Machines

I've been reading up on Chaitin's number recently, and I thought I would record my understanding of it. This is a (real) number that has an interesting construction and some pretty far reaching consequences in logic and philosophy of math. Informally it can be thought of as the probability that a randomly chosen string of bits that encodes a program will halt. I don't normally talk about automata theory on this blog, so let me go slowly. This will be the start of some posts that I hope will culminate in a passable discussion of Chaitin's number and its relation to incompleteness. Starting out, however, I will give a pretty handwavy and informal definition of a Turing Machine.

A Turing Machine is an abstract machine that can perform any computation whatsoever, by definition. It is the basic assumption of computer science that if something can be computed then it can be computed by a Turing Machine. A Turing Machine is given as follows- (W, R, d, S, F, i) where W is the working set of symbols, and contains the set of symbols R as well as the special symbols _ and |. S is the set of states the machine can be in, with F a subset called the halting states (if the machine gets into one of these states it has finished computing). The state i is an element of S and is called the initial state (for obvious reasons). The function d is defined as d:SxW->SxWx{->, <-} (the arrows indicate a direction to move); given the current state of the machine and a symbol, it gives the new state, the symbol the machine should "output" (write to its tape), and the direction the read head should move. The idea of the tape is a sequence of places that each have a symbol from the working set written on them. The machine is, at any time, looking at one of these places. It uses its current state and the symbol it is looking at to write a symbol from its symbol set R at that location, move left or right, and change its internal state.

The tape can be given by a finite string of symbols from W followed by an infinite string of _ symbols (which represent blank locations). The symbol | is at the very left-most location, and the function d must write | when it sees | and must move to the right. This is one way to stop a machine from "falling off" the left end of the tape (we can also make it infinite in both directions). Notice that the tape doesn't really have to be infinite- it must only be "long enough" to finish whatever computation we are doing. We can imagine that if the machine is ever about to move to the right and there is no more tape there, we simply tack on some additional locations. This is important because for a machine that eventually halts on its input, we only need a finite tape, and therefore the machine itself isn't really infinite, yet we sidestep the need to say exactly how large it is. If we had to give a fixed size, there would always be a problem that needed just a little more than we gave ourselves, so we just assume we always have more tape lying around (all of this is informal of course).

Initially the machine is in its initial state i and its location is the head of the tape. Execution proceeds by passing the current state (the machine itself can be thought of as a finite state machine with this infinite tape as memory) and the symbol in the current location to the transition function d, then writing the resulting symbol to the location, moving into the resulting state, and moving in the resulting direction. If the new state is in the set of halting states, then the machine does not continue executing. More formally, if at time t the machine is in a state s that is a member of F then the tape at time t+n for all n is the same as at time t (no changes can be made to it).

So- this is the first step in understanding Chaitin's number. If this is interesting to you, please find a more detailed and rigorous definition before accepting this one. There are many ways to do this construction and some are easier in different situations.

Friday, November 18, 2011

Lagrange Interpolation in Haskell

Here is a naive translation of the formula for a Lagrange Polynomial into Haskell:
interp xys x = sum $ zipWith (*) ys $ map f xs where
    xs = map fst xys
    ys = map snd xys
    f xj = product $ map (p xj) xs
    p xj xm = if xj == xm then 1 else (x - xm) / (xj - xm)

This is a cool little function- given a set of points, and a new point not in the set (list), it will give the value at that new point as interpolated by a Lagrange Polynomial- the polynomial of least degree that passes through every (x, y) pair in the given list.

I've done symbolic regression with Gene Expression Programming, where you find a polynomial of arbitrary degree to fit a set of points, so it is cool to me to see such a simple function to compare against that is guaranteed to be of the lowest possible order.

McCarthy's 91 Function in Haskell

I've been reading a lot of articles on Wikipedia on different topics in the theory of computation today, and I happened across one on the McCarthy 91 function. This function over the natural numbers is very simple- M(n) = n - 10 if n > 100 and M(n) = M(M(n + 11)) otherwise.

A proof that M(n) = 91 for all 0 <= n <= 101 in Haskell is pretty concise:
m n | n > 100 = n - 10 | otherwise = m $ m $ n + 11
all ((91 ==) . m) [0..101]
Which evaluates to true.

Thursday, November 17, 2011

Hiding the Implementation of a Struct in C

As a novice C programmer, I've been learning a lot of interesting things lately. One thing I've just learned is that you can hide the implementation of a struct by declaring it as a typedef for a struct that does not exist, and then using only pointers to it in the header file. The actual struct is defined in the implementation file, and its internal structure is completely hidden. The only problem with this of course is that you must provide some way of getting and setting fields in the structure, which is a pain Java programmers know well.

In the .h file you might have a line like:

typedef struct Point_t Point;

Which is implemented in the .c file as:

struct Point_t {
int x;
int y;
};
Then you need functions like:


int point_get_x(Point *point) {return point->x;}
int point_get_y(Point *point) {return point->y;}
void point_set_x(Point *point, int x) {point->x = x;}
void point_set_y(Point *point, int y) {point->y = y;}

So, why would we go through all this trouble just to hide things from users? First off, some people really like hiding things from other people- it's easier to change implementation details and it can assist separation of concerns. On the other hand, it can also make code hard to understand and debug, and take control away from the user. On yet a third hand, it may be necessary for structural reasons, like something I ran into recently.

For a project I'm working on, I've wrapped an API in a slightly higher-level API to make it more palatable for its users. This wrapper makes use of types defined by the API which I don't want to expose to the end-user, who should not have to interact with the low level API at all. The problem is that I have structs whose fields include structs defined in the lower level API. This means that the user can't access the structs directly without having at least the header files from the base program, which I want to avoid, mostly because it feels messy- as if information is leaking out of my abstraction.

To fix this problem, I've hidden the implementation and exposed only getters (aka accessors) and setters (aka mutators) for the fields that I want the user to be able to control. The details are tedious, but it does give you a more fine grained way to expose information and it does hide the details of the underlying implementation better than I was doing before.

So I hope that was a little bit interesting- there is plenty of further reading found by googling "struct hiding in c" that gives more details and examples.

Tuesday, November 15, 2011

The Size of an Array in C

Recently I saw a piece of code that looked a little something like this:

void zero_array(char array[])
{
    memset(array, 0, sizeof(array));
}

I have to say, I was surprised by this. How could a program know the size of an arbitrary array? I expected sizeof to always be determined at compile time, and furthermore I wouldn't have thought that the size of an array could be determined at runtime anyway. These aren't Forth-style arrays, where the length is stored before the elements- you only have a pointer with no additional information.

As it turns out, C99 added semantics to sizeof such that, given an array as an argument, the size can be determined at runtime in the right circumstances. This is because a chunk of allocated memory has some (implementation specific) information associated with it by the memory manager.

This of course means that this will only work on pointers returned from the memory manager (or of a static size), which can't be expressed by the type and is entirely implicit in the structure of the program. I'm not sure I would use this very often, especially if I didn't have strict control over the origin of a pointer, but it's interesting and possibly useful, and I wanted to record it here for future reference.

Thursday, November 10, 2011

Neural Networks- Training

A Neural Network with n input neurons, k neurons in the hidden layer, and o neurons in the output layer can be considered a pair of matrices- one nxk matrix and one kxo matrix. These matrices contain the weights on each neuron's inputs and are initially randomly generated. We multiply the input vector by the hidden neuron weights, apply the activation function to each result, and then multiply the resulting vector by the output neuron weights, again applying the activation function, to get the output vector.
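
A sketch of that forward pass in C (the names, the row-major layout, and the choice of sigmoid are all just illustrative assumptions):

#include <math.h>

double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* in: n inputs; w_hidden: n x k weights (row-major); w_out: k x o weights.
   hidden and out are caller-provided buffers of size k and o. */
void feed_forward(const double *in, int n,
                  const double *w_hidden, int k,
                  const double *w_out, int o,
                  double *hidden, double *out) {
    for (int j = 0; j < k; j++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += in[i] * w_hidden[i * k + j];
        hidden[j] = sigmoid(sum);      /* hidden layer activations */
    }
    for (int j = 0; j < o; j++) {
        double sum = 0.0;
        for (int i = 0; i < k; i++)
            sum += hidden[i] * w_out[i * o + j];
        out[j] = sigmoid(sum);         /* output layer activations */
    }
}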

So we have these random weights on the inputs of our neurons, and we can calculate an output from them. We got the inputs from some problem, and they describe some function we want to approximate, some system we want to control (inputs as measurements of the system, output describing some action to take), some decision we want to make (which can be done in several ways), etc. Given that the weights are random they are pretty much guaranteed not to solve any particular problem we have (or at least, they won't give a very good solution). This means we need some way of adjusting the weights based on the output we got and the output we expected for the input. We know the correct output only for some inputs (the ones we have collected to train the network with), and we want a network that, given the inputs we have, gives outputs close to (or exactly) the ones we collected, and, for inputs we don't have, gives outputs close to the ones we want.

One way to do this is a "gradient descent" algorithm, which basically moves the network in the direction of greatest improvement, like standing on the side of a hill and taking a step in the steepest direction (the direction that will change your height the most). This will not necessarily find the best network (it won't in general) but it is relatively simple. The basic idea (using the epoch approach) is to sum the errors on each neuron over the whole training set, and to "propagate" this error backwards through the network (from output neurons to hidden layer neurons to input layer neurons). This is done by adjusting each weight of each neuron's inputs, adding to the previous weight the error (the summed squared difference between the actual and the desired outputs) multiplied by the derivative of the activation function at that error, and scaled by a fixed value between 0 and 1 called the learning rate (set at the start of the run). In symbols, w(t+1) = w(t) + a*e*f'(e), with w(t) the weight at time t, a the learning rate, e the error, and f' the derivative of the activation function.

There are many other ways to do training. One particular technique that I think is pretty cool is to use a Genetic Algorithm. The individuals are vectors of weights, one for each input of each neuron. The fitness of an individual is simply the sum of the squared differences between actual and desired outputs (possibly summing up the errors of all neurons to produce a single fitness value, or using some problem specific way of determining error). This technique has the advantages and disadvantages of Evolutionary Algorithms in general- it is very general, it won't make assumptions about the solution space, and it can potentially find the global optimum. On the other hand, it may take a long time.

Wednesday, November 9, 2011

Neural Networks

The perceptron by itself can be useful. It can be used for simple problems, as a simple classifier to produce features for more advanced ones, or as part of a boosting algorithm that makes use of a group of perceptrons (adaboost for example). A really cool thing you can do with perceptrons is to connect them up with each other.

There are many ways to do this, the simplest being the feed forward topology, where we have several groups of neurons and each neuron in a group is connected to each neuron in the next group, with one group the input and another the output. Activation flows from the input to the output, with no loops, which is why it's called feed forward. We could also connect them all to each other in a mesh, or in other arrangements. Each perceptron takes n inputs and produces 1 output.

If we have m input values, and we place each of these m values on the inputs of k neurons, we will get k outputs (one from each neuron). If we connect each of these k outputs to n neurons (with n the number of outputs desired), then we have a feed forward neural network with one so-called "hidden" layer. The hidden layer is very, very important- having one will allow us to solve almost all problems (in principle), as long as we choose a good activation function (which I'll get to later). Having two hidden layers allows us to approximate any function, which is one of the crazy things about neural networks. We aren't in general going to be able to find the best network for a given problem, but it's nice to know that it exists somewhere in the neural-network-space.

For this network to be able to approximate nicely we need to add one other thing. The output of each neuron will be the sum of the products of each input and the weight for that input, fed into some function. This function can be any real valued function that you like, but certain ones are better than others. If we choose a non-linear function then we get the universal approximating behavior I mentioned earlier. Commonly a sigmoid function is chosen, as it is something like a neuron firing after a sufficiently high input. To get a complete on/off activation we can use what's called a step function, which jumps from, for example, 0 to 1 at a certain point (the activation point).

With these new ideas, a topology of neurons and an activation function, we have a nice new classifier. But the classifier itself isn't enough- we must be able to train it on a data set. In the next post I will describe a simple training mechanism called back propagation, and hopefully get into some of the other ways to do training.

Thursday, November 3, 2011

Perceptrons

I just got back from the conference Complex Adaptive Systems in Chicago, where I presented a paper called Robust Gene Expression Programming. I've posted about this algorithm before, and I won't go into it here. The conference was very interesting- I learned about some techniques I hadn't heard of (like Honey Bee Optimization and Spiking Neural Networks), and saw some variations and investigations of techniques I know a bit about. What I found was that I have a terribly inadequate understanding of these methods and the underlying mathematics. This has prompted me to try to learn more about these techniques. I will be posting a little about different ideas and basic AI/machine learning as I learn and relearn it.

The first thing I want to talk about is the humblest classifier, the perceptron. This little guy can only handle very simple problems when applied directly, but is nonetheless interesting and useful. It is the building block of neural networks, and it can be used when a simple classifier is needed such as for boosting or for feature extraction for a more advanced classifier.

Before giving the details, I want to discuss the notion of classification. Given some data points (medical history about each person in some group for example) we may want to label each point based on some attribute (likely to get a certain disease for example). Often we will have some data points with this extra attribute already known, and we will want some way of telling, given a new point of data, which label to give it based on what we know about previous points. There may be many labels (classes), but there must be at least two. One way to visualize this situation is a table, where each row is a sample and the last column is a class or label or group that in some sense describes the sample. Now we add a row with data in all columns but the last, and we want to fill it in. We will want to make use of the previous rows to do this, based on which ones are similar to the new data.

We may know the number of classes beforehand, like if we are classifying football games into win vs lose. In other situations we will not know how many classes to use, such as if we are clustering groups of data. In that case each cluster (collection of similar data points) will be given a class, and we end up with new knowledge of what groupings exist in a data set. For problems where we know the number of classes, notice that the number two is a bit of a magic number- we can break a problem of classifying into n classes into n problems classifying into two classes (for each class, is the data point in the class or not).

The perceptron can only classify into two classes. The perceptron itself consists of a vector of n+1 real numbers, where n is the number of inputs for the particular problem (different device measurements for example, like temperature, pressure and humidity). The extra number is called the bias, and its corresponding input can just be set to 1. These n+1 numbers are called the weights, and can be set initially to 0.

The data is a set of points, with the inputs and the expected outputs. Continuing the example, say we have a set of measurements with the expected output being whether or not it will rain. These are known values, and will be used to train the perceptron to be able to predict whether or not it will rain based on a given day's measurements in our example. On seeing a data point, the neuron will output either a positive or negative number, which indicates which of the two classes the data belongs to.

Training our little perceptron, with all its 0 weights, proceeds by going through the data and taking the dot product of the weights and the input data. This gives us the actual output for this piece of data, which we can compare to the expected output that we gathered as part of our data. We then update the weights based on the difference between the expected and the actual output. In particular, for each weight, the value at time t+1 is w(t+1) = w(t) + a(d-y)x, where w is the weight, a is a constant between 0 and 1 (called the learning rate), d is the expected output, y is the actual output, and x is the input.

We can do this learning step for each data point, going through the whole data set until the error on each point is below a value set by the user. In simple problems the error will go to 0- in the case where the classes can be separated by a hyperplane in the problem domain. This is all much easier to see graphically than in text.
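
One pass over the data might look like this in C; the layout (each row is n inputs followed by the desired output of +1 or -1) and the sign activation are illustrative assumptions, not anything from a particular library:

#include <stddef.h>

/* weights[0] is the bias weight (its input is always 1), a is the learning rate. */
void train_epoch(double *weights, size_t n,
                 const double *data, size_t samples, double a) {
    for (size_t s = 0; s < samples; s++) {
        const double *x = &data[s * (n + 1)];
        double d = x[n];                        /* desired output */
        double sum = weights[0];                /* bias times its constant input, 1 */
        for (size_t i = 0; i < n; i++)
            sum += weights[i + 1] * x[i];
        double y = (sum >= 0.0) ? 1.0 : -1.0;   /* actual output */
        weights[0] += a * (d - y);              /* w(t+1) = w(t) + a(d-y)x, with x = 1 */
        for (size_t i = 0; i < n; i++)
            weights[i + 1] += a * (d - y) * x[i];
    }
}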

That's all I want to say for now. I hope this wasn't too rambling and unhelpful.

Friday, October 28, 2011

Mil Std 1553

I've been playing with a Mil Std 1553 bus for several weeks now, and I wanted to record what I've learned here.

1553 is a shared bus consisting of a Bus Controller (BC) and one or more Remote Terminals (RTs). It is possible to go into "dynamic mode" where any device can act as a BC, but that is not how it is usually used. Instead, the BC normally knows everything about the system- each RT, which messages they want to send, when and how often they can send what messages, etc. The BC is the only one allowed to initiate traffic on the bus- it polls the RTs for specific messages and the RTs must respond then and only then. It is possible for two RTs to talk if allowed by the devices themselves, but even then the BC asks one to talk to the other- they never start talking by themselves.

The polling that the BC does tells an RT either to send or to receive a data message of a maximum of 32 words (with a word being 2 bytes). This poll is called a command word, which consists of 20 bits: a 3 bit sync pattern at the beginning, 16 bits of data, and an odd parity bit at the end. The data section contains a 5 bit RT address, where 31 is reserved as the broadcast address (although you can play with that a little), a bit indicating whether the RT should transmit or receive, a 5 bit subaddress (which I'll get into in a minute), and 5 bits indicating the length of the message (where 0 indicates a message of length 32, so no 0 length messages are possible).
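
The 16 data bits can be packed up in C along these lines (a sketch of the layout described above, not a compliant implementation- the sync pattern and parity are handled by the bus hardware):

#include <stdint.h>

/* 5-bit RT address, transmit/receive bit, 5-bit subaddress, 5-bit word count
   (or mode code), from the most significant bits down. */
uint16_t make_command(uint8_t rt_addr, uint8_t transmit,
                      uint8_t subaddr, uint8_t count) {
    return (uint16_t)(((rt_addr & 0x1F) << 11) |
                      ((transmit & 0x01) << 10) |
                      ((subaddr & 0x1F) << 5) |
                      (count & 0x1F));
}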

The subaddress is what tells the RT what type of message to send. We might have 32 buffers of 32 words on the RT, where it just sends out the data in the requested 32 word section. In practice it's a lot more complex, but conceptually the subaddress is just an index into an array of 32 word message structures. If the subaddress is given as 31 then it is not a request for data, but a mode code. For mode codes, the 5 length bits indicate which mode message is being sent rather than the length of a data transfer. Some mode messages carry a single word of data, but that is simply specified in the standard, and must be known beforehand as it is not contained in the message itself except implicitly in the meaning of the code given. The mode codes are mostly for checking status, setting options, etc. The only one I've actually made use of is mode code 17, sync with data. This can be used periodically by the BC to keep the RTs in sync with it and to tell them a single word of data. The BC might, for example, be polling in a predetermined fashion, and the sync word might tell the RTs where it is in its schedule of message polls.

32 words of data is not enough to transport most messages. To extend the size of a message one can place it in multiple subaddresses and poll each one in turn, or poll the same subaddress multiple times and concatenate the results. The only problem with polling the same subaddress is that the RT or BC must be able to handle the rate at which it is requested, which can be fairly fast. A hardware mechanism for handling this is the easiest to deal with.

It is common for the BC to have a schedule giving a certain time slot to each message (1 ms for example) and to repeat the same schedule each time period (1 second, for example). This schedule can change to only poll for messages that are needed, or to account for aperiodic messages like an occasional file transfer. Again, hardware support is important here, as the timing may have to be very accurate.

I think that's all I will say about Military Standard 1553 for now.

Tuesday, October 11, 2011

Breaking the Forth Wall

I've made yet more progress on my stand-alone Forth for the x86. I wanted to mention some interesting things about forth real quick while I work out some problems with the call-gate style of key handling and add DOES> words to the language.

There are some very interesting things about Forth as a language. It can very easily support simplified versions of some very powerful features from other languages, such as coroutines, continuations, exception handling, closures, and object-oriented features. Often these sorts of things are implemented in a similar spirit to other features in a Forth, like memory management- the programmer takes more responsibility for her code than usual and the implementation is a simple one that has good enough characteristics to work for what the programmer needs. I could certainly program in Haskell and get all of the power I want in a language for most tasks, but Forth is interesting because it does these things in such a low level context, where we have access to every part of the system and can play with them as needed for whatever we want. For example, we can use the return stack to implement coroutines with a word as simple as ": yield r> r> swap >r >r ;". This will continue the word that called us, which can do the same thing to continue the word it called. We might not get everything we want out of this type of implementation, and we have to be very careful not to mess things up, but it's pretty good considering we are just above assembly language in terms of abstraction and we have to build everything ourselves.

Another thing I wanted to mention is that there seems to be some relationship between stack languages and functional programming. It is very natural for a concatenative language to be seen as a functional language where the main operation is composition rather than application. Notice that whitespace represents application in Haskell, but represents composition in Factor, Cat, Joy or Forth. Also, the concept of a graph reduction machine (which is used to give operational semantics to functional programming languages) apparently has found fast and very natural implementations on stack machines, where the graph is compiled code (rather than traversing the graph structure with separate code).

In addition, I've been reading Moving Forth's section about DOES>, explaining probably the most complex part of Forth, and they mention that the use of code fields in some ways anticipates object oriented programming. This reminded me of the STGM (Spineless Tagless Graph-Reduction Machine), which is Haskell's abstract machine and is implemented as its run-time system. I remember reading that they used essentially a code field to determine what to do with their objects (in the more general sense of entities in a program's memory) rather than tagging them with types and switching on the type (the tag is just a pointer to code that does the Right Thing). This is exactly how variables, constants, does words, and regular words are implemented in Forth. It's probably a well known technique, but Forth must have pioneered it.

Hopefully next time I will have an actual progress update and won't have to resort to a pun for a title.

Wednesday, September 28, 2011

Forth Progress Update

My standalone Forth for the x86 is coming along nicely. I make little bits of progress every time I work on it. Right now it can compile new words, variables, and constants, as well as do simple arithmetic. The command line interface needs work, but it already displays the system state (compile = 0, interpret = 1) in the bottom right corner of the screen, and shows the top 5 items on the stack on the bottom left corner (with the top of the stack on the right).

The next things I want to tackle are to factor out the screen information and the command line information (to which I want to add a command history so I can press up and down to cycle through commands), and to attach to each key a Forth word callback. I want to talk a little bit about that in this post.

I want different keys to do different things, and I would like to be able (when the system is better developed) to attach an arbitrary Forth word to a key. When I say attached I mean that each key gives you a fixed integer value, which I use as the index into an array of characters to get the character value (the keyboard doesn't just give an ASCII value). In addition to this use of the key as an index, I want to also index the integer given into an array of execution tokens (which are basically Forth's equivalent of function pointers). The interpreter would read in a single character from the keyboard and call the word (function) at the character's index. This would allow me to customize the keyboard easily, and may help when I (waaay in the future) add a text editor. Most of the time keys will be attached to a word that calls putch (put the character on the top of the stack to the screen and move the cursor one position over), but keys like enter, the arrows, backspace, and possibly others will need special handling that is done as part of the outer interpreter right now and can't be configured easily.
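
A sketch of what I mean, in C with made-up names (the real key handling will differ):

#define NUM_KEYS 128

typedef void (*xt_t)(void);          /* an execution token */

char key_chars[NUM_KEYS];            /* key code -> character */
xt_t key_words[NUM_KEYS];            /* key code -> word to run */

void handle_key(unsigned char code) {
    if (code < NUM_KEYS && key_words[code])
        key_words[code]();           /* usually a word that calls putch */
}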

One interesting use of this is to have many threads, each with a piece of memory for a screen and their own command line information, and to tie one to each of the function keys (F1-F12) to get virtual terminals. This would be a fun feature even while this is still a single threaded system. Way in the future I will read the APIC documentation (possibly HPET also?) to get a preemptive round robin going, but I would like at least a simple memory manager going before I allocate much fixed stuff on the heap (which is everything above the 3 MB mark right now, with the kernel at 1 MB and the system dictionary at 2 MB).

Well, that is enough for today- I have lots of fun things to work on! I love the response you get from adding features to a system like this with quick turnaround and useful results.

Monday, September 26, 2011

Types and Abstractions

I was talking to a friend the other day about what computation is. I mentioned how computation is basically "what a Turing Machine does", that it tries to formalize the intuitive notion of performing a calculation or proof (or any sequence of actions), and how there are a lot of systems that give rise to this phenomenon. In fact, computation is surprisingly easy to come by- we can get it out of a lot of substrates like machines, proof simplification (finding normal forms in certain logics), patterns in Conway's game of life, rewriting strings of symbols in the various rewriting systems like universal grammars, and many more.

It is interesting to me that the substrate affects the way we can construct computations and the nature of the structures we construct, and yet there are underlying principles that unite the various systems of universal computation (systems equivalent to Turing Machines). There is always a way to have a computation blow up- nontermination or the lack of a normal form, for example- and there is never a way to tell when this will happen in the general case (if there is, then there is no way to compute all things- the system gives up being universal). This is not always such a bad thing, really, as it is interesting to work within more restricted systems with nice properties (strong normalization/guaranteed termination). This is analogous to what we do when we introduce layers of abstraction in programming- we introduce a layer of restrictions on the actions of the lower layer, simplifying its use but often removing some things that are useful. The point I want to make here is that type systems are essentially restrictions on untyped universes of structures (bits, lambda terms, machines, strings of symbols, etc) that describe computations (I'm only interested in computation here) in a way that is similar to abstracting over a lower level interface. We can impose rules on the structures and ensure certain properties, while at the same time losing some information and power from the lower layer.

This is a pretty unformed thought, but the point stands- type systems and abstraction seem essentially the same. It seems like the desire to expose all possible power of the lower layer is only possible if we break an abstraction, which is analogous to doing things like adding Y to a type system to get universal computation- we have to give up our clean abstraction and deal with the substrate directly with all its power and complexity. Resist the desire and lose power, give in and lose any nice property you could have gained (this is a simplification, but it is often true).

Friday, September 23, 2011

Inner Interpreter Works!

Progress update on my standalone Forth- the inner interpreter works! I can compile primitive words and the threaded interpreter will run them to completion. If I start out with QUIT on the return stack, put some word's execution token (the location of its array of function pointers) in the IP register of the virtual machine (meaning the IP variable of a struct), and call next, then the word will run and the program will block waiting for user input. This is because QUIT clears the return stack and starts the outer interpreter. The outer interpreter needs work, and I'm trying to get word lookup, compilation, and interpretation working.

Right now a word is arranged in memory like so- last_word, flags, length, characters..., function pointers..., exit.

The last_word is a pointer to the previous word in the dictionary of words. This means that each definition is linked to the last, back to the very first, which is linked to NULL. The characters are actually held one per cell (4 bytes each) for convenience. I'm willing to throw away memory for this system because I expect to use less than a single percent of my 2 gigs, so space is not a problem. If I packed them tighter then I would want to align the function pointers to 4 bytes, which would mean calculating the length of the word's name mod 4 and padding with unused bytes to the next cell boundary at the end of the characters part of the definition. This is simply more work to save some bytes.
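
As a rough sketch of how such an entry might be laid down in C (cell-sized fields throughout; the names define, here, and latest are Forth-flavored but invented for this sketch, and it assumes a 32-bit system where pointers fit in a 4-byte cell):

#include <stdint.h>
#include <string.h>

typedef void (*prim)(void);       /* a primitive: pointer to a C function */
typedef uintptr_t cell;           /* one cell; 4 bytes on the 32-bit system described here */

cell dict[1024];                  /* a tiny stand-in for the system dictionary */
cell *here = dict;                /* next free cell */
cell *latest = 0;                 /* most recent definition */

/* Lay down one definition: last_word, flags, length, characters..., function pointers..., exit. */
void define(const char *name, prim code, prim exit_word)
{
    cell *entry = here;
    *here++ = (cell)latest;                /* last_word: link to the previous definition */
    *here++ = 0;                           /* flags */
    *here++ = (cell)strlen(name);          /* length of the name */
    while (*name) *here++ = (cell)*name++; /* name, one character per cell */
    *here++ = (cell)code;                  /* function pointers... (just one here) */
    *here++ = (cell)exit_word;             /* exit */
    latest = entry;
}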

I'm working on getting numbers and words to be recognized and for words to be looked up and compiled or run. After that I will be able to extend the dictionary beyond a couple of single words and get this project on its feet. I'm hoping to also look into doing a preemptive round robin threading mechanism, as well as multiple terminals, a block file system, and a text editor. Big plans!

Wednesday, September 14, 2011

Forth Indirect Threading

Every time I want to remember how the inner interpreter works in Forth, I have to read the first part of Moving Forth again, so I thought I would record my thoughts this time for future reference. This is going to be in the context of my current project- YMMV.

The Forth inner interpreter consists of: next, exit, and docol (aka enter). There are two types of words- primitives and higher order words. Primitives are machine code (C for my project) and higher order words are lists of other words (possibly also higher order). In ITC all words start with the location of some primitive function. For primitives this is the code to execute the word, which must end in a call to next. For higher order words the first pointer is to the function docol, and the last is a pointer to exit. The Forth vm has an instruction pointer, IP, and a working register W (I plan on also having registers A and B, and possibly X and Y for indexing). When the outer interpreter has a word it needs to execute (entered by the user and looked up in the dictionary) it places the quit word on the return stack, sets IP to the execution token of the looked up word, and calls next.

Next copies what IP points to into W, and moves IP to the next cell over. It then calls the function that W now contains a pointer to. If this is a primitive then that function is the primitive itself, which does its work and must end with a call to next, which starts the process again. If all words were primitives, this would move IP along the array of pointers until hitting exit. If a word is higher order, however, then the first function is docol. This function pushes IP onto the return stack, copies W to IP (as W contains the pointer to the docol pointer) and moves IP over one. It then calls next, which starts executing the current word just as before.

When exit is reached, the top of the return stack is popped into IP, and next is called to continue executing the next word up in the call tree.

If the return stack reaches the very bottom, where the outer interpreter places the pointer to quit, then the last exit will call the function quit. This function will clear the return stack (in case it is called in some nested context) and will clear the C frame stack (in my case) with some inline assembly. It will then call the outer interpreter to start up again, which will wait for user input.
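
Here is a compressed sketch of that machinery in C, roughly how I picture it. The names and types are my own for this post, and it ignores the fact that every next call grows the C frame stack (which is exactly what the inline assembly in quit cleans up):

typedef void (*code_fn)(void);      /* what the first cell of every word points to */
typedef code_fn *xt;                /* execution token: the address of that cell */

xt  *ip;                            /* instruction pointer: walks an array of tokens */
xt   w;                             /* working register */
xt  *rstack[64];                    /* return stack of saved IPs */
int  rsp = 0;

void next(void)      { w = *ip++; (*w)(); }                              /* fetch a token, run its code */
void docol(void)     { rstack[rsp++] = ip; ip = (xt *)(w + 1); next(); } /* enter a higher order word */
void exit_word(void) { ip = rstack[--rsp]; next(); }                     /* return to the caller's body */

/* A primitive is just a C function that ends by calling next: */
void drop_code(void) { /* drop the top of the data stack */ next(); }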

Wednesday, September 7, 2011

Forth Threading Models

In the last post I gave one possible motivation for threading. Now I would like to go over a couple of techniques that can be used to achieve this effect. For more detail, see "Moving Forth" at http://www.bradrodriguez.com/papers/index.html, which is an amazing reference.

The first model is Direct Threaded Code (DTC). In this model, each subroutine (word) contains machine code. If the word is a primitive, this code is just the implementation as an assembly subroutine. If this is a high level word, and therefore only contains calls to other words, then the machine code is a push of the next instruction and a jmp to a subroutine that will start calling each address, one after the next. This would look like:

call DOCOL; a; b; c; NEXT;

Where DOCOL is the subroutine that calls each of a, b, and c (addresses of subroutines) in turn. NEXT is the address of the instruction that keeps this whole process moving- if we have entered a word from another word it will leave the current word and move one step outwards in the nesting. I will be explaining the various stacks in a later post, so just know that NEXT handles cleaning up at the end of a word.

Another model is Indirect Threaded Code (ITC). In this model, a word must always contain a list of addresses of other words. This means that even the most primitive words must contain a pointer. Such a word might look like this (assuming the word starts at address 0x8000 with 32 bit cell sizes):

0x8004; add r1, r0, r1; call NEXT

This word's first cell simply points to the very next instruction, as it is required to point somewhere. The actual action is to add two registers, and then call NEXT.

Another technique is Subroutine Threaded Code (STC). This one is a little bit ironic, as it "threads" the addresses by simply making each one a call. The disadvantage is that it may take up more space than other techniques, as each address is actually a call instruction. Remember, however, that the original motivation here was to remove these redundant calls. While at one time it made sense to save that little bit of memory, now it may make more sense to increase speed by removing this extra level of indirection and just making all jumps subroutine calls. Another advantage here is that primitives can just be machine code- we always jump to a word, as in DTC. The word NEXT may literally be just "ret": since a call already stores the instruction pointer for us, we can do the usual return to the last subroutine by popping the return stack into the instruction pointer.

Yet another technique is token threading. In this technique a word is an array of values that aren't themselves addresses to other words, but are rather indices into another array of addresses of words. The only real advantage to this technique is that it can be used to get very very small compiled words- a table of only 256 words can fit each "token" (index) in a single byte. The problem is we have yet another level of indirection to wade through, and the number of words is limited to the size of the table. I don't like this technique very much as it also makes the structure of the dictionary particularly fixed (we will get to the dictionary later) and that makes some interesting things much harder.

While there are more forms of threading, I'm only going to discuss one more- switch threading. This technique is basically what a lot of bytecode interpreters do- they have a massive switch statement that determines what to do with each compiled bytecode.
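
The shape of a switch threaded interpreter is familiar from bytecode VMs. A tiny sketch in C, with the opcodes and stack made up for illustration:

enum { OP_PUSH1, OP_ADD, OP_HALT };

int run(const unsigned char *code)
{
    int stack[32], sp = 0;
    for (;;) {
        switch (*code++) {                               /* dispatch on the next bytecode */
        case OP_PUSH1: stack[sp++] = 1;                  break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT:  return stack[sp - 1];
        }
    }
}

Running it on {OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT} would return 2.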

Well, that's it for now. I believe I will be doing an ITC interpreter for my new Forth, simply because it is easy to place pointers to functions in C. I'm considering playing around with the C stack a bit, but for now I'm looking at each word as an array of pointers to void functions with no arguments.

Forth Threading Models, motivation

I will be posting a lot more about Forth nowadays, as I work on my standalone Forth for the x86. First off I would like to talk about threading models, starting with some motivation.

Many languages have some abstract machine model that gives them some sort of semantics (sometimes rigorously, as in the operational semantics of Haskell given by the STG machine, and sometimes informally, as in Java's JVM). Forth has such a machine, which gives a basic idea of the things that a Forth system should do without specifying implementation details. This post, and hopefully many more in the future, is about that abstract machine as I understand it.

When writing a program in assembly, we may write a large number of subroutines to do trivial tasks, and reuse these simple subroutines to define higher level routines. The higher level routines will consist largely of calls to other routines. Imagine we have a pretty resource constrained machine, and we are looking at a program that looks like this:

call a

call b

call c

ret

Which uses some "call" instruction, or if there is no "call" then it must push the instruction pointer (so we can return to it later), and then jump to the subroutine. We are imagining that a major concern is memory usage, so we immediately notice that there are a lot of call instructions. Why keep all of them around if we are just going to call each address anyway? Instead of having these explicit calls, we could just have

call run

a

b

c

end

This requires a little explanation, especially as I've just made up "run" and "end". What this code is supposed to suggest is that we call a subroutine called "run", which is written once and used many times, which will look at the return address on the stack (pushed by the call instruction), treat the words following it as a list of addresses, and run the code at the first one, a. When that code returns, run will call b, then c, then end. The subroutine "end" is another one that is defined only once, and it will do any clean up necessary, which may just be a ret like a normal subroutine.

So we added a little wrapping around the bunch of calls, with the call to a special function to start and the address of a special function to end it off. If your program consists mostly of such lists of calls, then there is some savings of memory here, but that turns out to be a side note today. Now we have plenty of memory, often even in embedded systems. This technique turns out to have other advantages, however, and gives rise to the basic notions of the Forth language.

Okay! That's enough for now I guess. I was hoping to get into threading techniques, but maybe next time!

Monday, September 5, 2011

Binary Operators

There is something that has bothered me for a long time in mathematics: what is up with the number 2? Why do we have "and" and "or" (conjunction and disjunction) in logic and type theory, meet and join in lattice theory, unions and intersections in topology (and set theory), Cartesian products and disjoint unions as polynomial data types (and set theory), products and sums in abstract algebra (sometimes we even have two "layers" of two operators which take two arguments, as in an algebra), and categorical products and coproducts in category theory. There are two things going on here- why are these binary operators (in the sense that they act on two objects) and why are there always two of them (why are they paired up)?

The reason this interests me is that we can easily define unary operators, or ternary, or what-ever-ary we want, so why is binary so common? Also, certainly not all structures have binary operators, and just having one doesn't always imply having the other, in general. For some, it's because of their relationship with another pair of operators, like how lattice theory is historically a model theoretic view of truth-valued logic. However, the point is that there are often two binary operators, not four ternary operators or one 5-ary operator or anything else. I don't know why this is the case, but I will take a stab at it.

One of the reasons is that it's often really just one operator and its dual. This is particularly clear in category theory, where they are dual constructs, but similar duality conditions seem to hold for the others (perhaps not exactly? I don't know). This covers why there are two operators, but why are they binary?

The answer seems to be that they aren't, really. They can be given as binary operators, but as long as they are associative there is no ambiguity when defining them over finite lists of objects. Some can even be defined over infinite lists/sets. Giving them as binary operators makes them easy to define, but we can just as well define V{X, Y}, the join of all elements of a set (in this case X and Y), as we can define X V Y (with V the join), the join of X and Y.

So why two binary operators? It seems like we are really just looking at some operation, its dual, and the special case of defining it over two things instead of groups of things, which is just for convenience.

First Ever Successful Boot!

I'm restarting a project that I tried a few years ago- to write a stand-alone forth operating system. The complexity of the x86 processor has stopped me in the past, but this time I am more prepared. Another thing that has stopped me is getting a kernel to actually boot on a real computer. I've gotten them to boot on emulators, but for some reason I could never do it on my desktop. Well, this has finally changed! I got grub2 to boot a kernel image using tutorials from osdev, where the kernel simply prints NOAH to the screen (the hello world of hobby os development seems to be to print your own name to the screen).

This is great news! All the starting up stuff has been keeping me back for a while, and I still have a lot to get worked out (interrupts, memory model, memory protection, and keyboard controlling) but now I can start actually working on it and not have that nagging worry that I won't be able to boot it even if it works in an emulator.

I am trying to keep everything simple here- there will be no restrictions to memory, it will be mono-tasking at first (Forth multitasking isn't super hard, but I need something to work with before I go there), and I don't intend to support anything but a keyboard. One major difficulty is disk access- I would like to support a block file system, possibly with largish blocks (4KB rather than the usual 80*25 used to store a screen), or maybe not, I don't know. The problem is accessing the hard disk- a possibly complex operation that I don't know anything about. I would be okay with a partition that is organized into blocks, or a file on another file system that is loaded as a full block file system- which would be really neat because I could edit it in Linux as a text file, and load it as part of the operating system. The problem of course is I don't know how to do this either.

So in conclusion- YAY! I finally started this project! It's been forever! I'm planning on using it mostly for fun, but also as a way of getting more comfortable with C, which I will be using a lot in the future. I was going to just use assembly, to take complete control of the system and make the Forth registers the actual machine registers and all that, like in my previous Forth system for the z80 of my ti-83 calculator, but I decided it would be easier, and possibly more instructive, to use C and learn about using it as a systems programming language.

Closed Categories, Properties, Philosophy of Mathematics

Warning- this post is just my recent thoughts about math. I don't know how to make them rigorous, and they may be completely wrong.

I've been thinking about how the subfields of math each center around the theory of some object. The object is normally given as a set theoretic construction satisfying some axioms, and the field of study is concerned with the properties of these objects, what objects in other fields satisfy the axioms (and are therefore part of the field), and with transformations between the chosen objects. Category theory in some sense captures this idea by making this situation the object of study- a category can be thought of as containing the objects of some field of study, and the morphisms are the transformations between the objects. Categories can just be thought of as another object, defined like any other field's chosen object, but they can be considered the embodiment of a field of mathematics. This makes it a little less surprising (though not much, this is pretty surprising!) that categories are so intimately related to theories (the object "a theory" consisting of types, function symbols, axioms, and relation symbols). In fact, the "internal language" of a category is a theory, and the model of a theory is a category. This is an amazing thing that I'm just beginning to grasp, really.

In most theories of some object, there are important properties that we want to show that a particular object has. In group theory, for example, we want a set to have, among other things, an identity element for it to be a group. The group homomorphism from the one element group to a given group points out that identity element, and therefore the one element group embodies the property of identity. The proof that a group has an identity is then embodied in the existence of a homomorphism between the group that embodies the identity and the chosen group.

What I'm wondering is- how general is this idea? Are all properties of objects embodied in some object, and all proofs embodied in the existence of morphisms from that object? In closed categories, where the morphisms are objects of the category, this means that the theory of the objects of the category contains its own proofs. This sounds like topos theory, which requires a couple more properties than just being closed, and which studies categories that have enough structure to contain systems of mathematics (as I understand it). Not all categories contain their morphisms in this way, and so not all theories contain their proofs, but it looks like toposes and their associated theories do this.

Interestingly, being closed appears to be fundamental to categorical notions of computation. The categories whose theories are type systems over the lambda calculus are Cartesian closed categories, although I think that can be weakened in some ways. This seems to be a basic fact of computation- Turing machines have their universal Turing machines which run encodings of other machines, lambda terms are defined over other lambda terms, and actual machines can't distinguish (in general) between data and code. The Curry-Howard isomorphism and the propositions-as-types interpretation give us the fact that programs encode proofs of a proposition in some system of logic (in the case of the lambda calculus, lambda terms can be read as programs to be reduced, or as proofs- it just depends on how you read them). It seems to me that in a constructive mathematics all proofs would indeed be some form of transformation which can be encoded in an object (functors, homomorphism groups, etc), meaning that objects and their proofs are the same thing. This puts a perspective on things like naive set theory- if you can describe objects that don't make sense (contradictory sets) then you have to expect contradictory proofs. It may be that the safest thing to found a system of mathematics on is something where we can construct all objects, where the transformations between objects are contained in the theory, and therefore we can construct the object embodying all proofs (if we couldn't construct it, it wouldn't be true).

I normally ignore the philosophy of constructive mathematics as my interest in it is really just as a computer scientist. I think of math as being a human activity, so classical logic isn't really "wrong" and people are welcome to do it if they want (it is certainly powerful, even if it's a bit odd at times). On the other hand, it does seem that constructive math, as I understand it, is on a good foundation where it safely contains its own proofs, and its lack of freedom in defining objects (as opposed to classical math where we can define some very strange things) prevents the weirdness of classical math. Type theory started as a way to prevent set theoretic paradoxes, and intuitionistic type theory may be the way to go for a system of math that constrains its objects and proofs enough to keep math down to earth.

Okay, enough ranting. Sorry for the half formed thoughts, but I wanted to record them somewhere.

New Job = New Posts?

I have my first real, out-of-school, full-time type job at SSAI working on embedded systems for a NASA instrument for the International Space Station. I am enjoying it a lot so far, but I've been pretty busy and haven't posted for a while.

But that may be about to change! I'm learning a lot about real-time operating systems, embedded systems programming, network protocols used in space, radiation hardened devices, and programming in C (which I will be doing a lot of). This gives me lots of new material to talk about. I also have some side projects I'm starting up that might make for some good posts.

For right now I just wanted to comment on my experience with C. In my internship this summer I started to really like Perl, despite all my programming language snobbery with Haskell. I would definitely use it for text processing and as a replacement for shell scripts. Similarly, I've been discovering the good things about C. The attitude that I have right now, which I heard somewhere and seems to be true, is that C is a very nice assembly language. What I mean is that its abstraction (its abstract machine) is a fairly thin layer above the actual machine.

While the Haskellian in me doesn't like union types and arbitrary casting, I'm starting to see how they are useful in systems programming. Bitfields, while possibly dangerous (compiler specific behavior has already come up in the last two weeks), are helpful with network programming and other densely packed structures.
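
For example, a bitfield lets you name the pieces of a packed header directly. This header is made up for illustration- and the caveat above applies, since the bit order and padding of bitfields are implementation-defined, which is why careful code often falls back to shifts and masks:

#include <stdint.h>

/* A hypothetical packed message header: 3 + 5 + 8 bits in two bytes. */
struct header {
    uint16_t version : 3;
    uint16_t type    : 5;
    uint16_t length  : 8;
};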

So- the future. I hope to post about embedded systems, C, space, and my (second attempt at a) project to write a standalone Forth operating system.

Monday, July 25, 2011

Encoding in Robust Gene Expression Programming

I realized that I've never actually completely explained in this blog the decoding mechanism in my version of Gene Expression Programming- all my posts about it mention that I'm leaving out a bit of detail. So- here it goes.

We are given the set of terminal symbols and the set of operator symbols, which are problem specific and must be supplied by the user, as well as the number of genes per individual (the length). The individuals can then be generated as random bit vectors of size length*genesize, where genesize is the minimum number of bits we can use to encode the symbols in the terminal and operator sets (plus one). We can find this by 1+log2(max(|terminals|, |operators|)), rounding the log up to a whole number of bits. The plus one is because we need one extra bit per gene to determine if it will encode a terminal or operator.

Imagine we have a random individual. We decode first into the "raw symbol list" by splitting the bit vector into equal length chunks (the length is the result of the previous calculation) and decoding each chunk. Decoding a chunk is as follows: determine the type of the symbol by inspecting the first bit (0 for operator and 1 for terminal), and then turn the rest of the bits into a natural number in the usual binary encoding. This number is then used as the index into the set (or really, list) of symbols given by its type. In the case that the number is larger than the size of the list, we wrap around, treating it as a circular list. In other words we take syms[n % len(syms)], if syms is the symbol list, n is the decoded value of the bits in the gene, and len gets the length of the provided list.
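
A sketch of decoding a single gene in C. The representation (one bit per byte) and the names are invented here just to show the steps:

typedef struct { const char **syms; int len; } symlist;   /* a symbol list and its length */

const char *decode_gene(const unsigned char *bits, int genesize,
                        symlist terminals, symlist operators)
{
    int is_terminal = bits[0];              /* first bit: 0 -> operator, 1 -> terminal */
    int n = 0;
    for (int i = 1; i < genesize; i++)      /* remaining bits as a binary number */
        n = (n << 1) | bits[i];
    symlist s = is_terminal ? terminals : operators;
    return s.syms[n % s.len];               /* wrap around: the list is circular */
}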

Having done this we have a list of symbols from the terminal and operator sets, where the length is the given size of the individual. We must then turn this into an expression tree. While it is not necessary to actually build the tree, in principle we always can if we want. To do this we do a reverse Polish (postfix) evaluation of the symbol list as an expression.

This evaluation starts with an empty stack, and evaluates each symbol in turn, updating the stack along the way. If a terminal is encountered then it is pushed onto the stack. If an operator is encountered, then the necessary number of arguments are popped off the stack (the arity of each operator is fixed and known ahead of time) and the result of the operator applied to the arguments is pushed as a result. If we are building a tree then the "result" is just a small tree with the operator at the root and the popped items as its children (which may be terminals or subtrees). In the case that the size of the stack is less than the number of required arguments, the operator is simply skipped. This corrects for improperly defined trees, as there is no way for a partial tree to be constructed.
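
In C the evaluation might look like the sketch below, here specialized to evaluating numeric expressions directly rather than building trees. The symbol type and its fields are made up for this sketch:

typedef struct symbol {
    int    is_op;
    int    arity;                            /* 0 for terminals */
    double value;                            /* terminal value */
    double (*apply)(const double *args);     /* operator implementation */
} symbol;

double eval_postfix(const symbol *syms, int count)
{
    double stack[256];
    int sp = 0;
    for (int i = 0; i < count; i++) {
        if (!syms[i].is_op) {
            stack[sp++] = syms[i].value;             /* terminals are pushed */
        } else if (sp >= syms[i].arity) {
            sp -= syms[i].arity;                     /* pop the arguments... */
            stack[sp] = syms[i].apply(&stack[sp]);   /* ...and push the result */
            sp++;
        }
        /* else: not enough arguments on the stack, so the operator is skipped */
    }
    return sp > 0 ? stack[sp - 1] : 0.0;             /* top of stack is the result */
}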

After each symbol has been evaluated the top of the stack is taken as the result, and this object (possibly an expression tree) is the phenotype of the individual. This complex tree structure is much removed from the bit vector it started out as, but the process is really pretty painless. The resulting tree can be compiled or evaluated or inspected in some way to produce a fitness value for the individual.

Notice that it is possible for the stack to have many values at the end of evaluation, which will be thrown out. These garbage expressions may be more complex and interesting than the one chosen as the result, but it is not easy to know this in advance. It would be costly to evaluate each of them, and in common cases it is easy to see that the top of the stack will actually be very interesting.

Well, there it is, the full decoding process for RGEP. It might seem complex, but it is much nicer than some GEP mechanisms (IMHO) and has lots of advantages on several levels. I am not going to go into all the advantages of this encoding in this post, but suffice it to say it is well motivated.

Wednesday, July 6, 2011

Lunch Assignment Problem

Last week someone I work with was assigned the task of setting up lunch groups for the interns in my program. This is something we do every week to meet new people, and it is intended that people eat with new and different people every week if possible. It should be clear that after several weeks there may not be a perfect solution, but we can still find a solution with as small a number of "collisions" as possible (people eating together that have been assigned the same lunch group in a previous week). Perhaps you can already see where I'm going with this?

This problem involves partitioning a set of people into n disjoint subsets with the minimal number of collisions. If each person is taken to be a node in a simple undirected graph then there is an edge between two people iff (if and only if) they have had lunch together before, and the edges can be weighted such that the weight is the number of times they have had lunch together. Alternatively each pair of people has a weight, which is simply 0 if they have never been in the same group. The cost of a proposed grouping is then the sum of the weights of the pairs that end up in the same group, which provides a measure of the quality of a solution, and therefore can be used as a fitness in an evolutionary algorithm!
I wrote a simple Random Mutation Hill Climber (RMHC) which we have been using the last two weeks to come up with our lunch assignments, and so far it has found perfect solutions. An RMHC simply mutates a randomly generated solution, and replaces the current solution with the mutated solution if the new one is better than or equal to the previous one according to the fitness measure. The initial solution is generated by shuffling the list of people randomly and cutting the list into sublists of a fixed length. After that, mutations occur by choosing two people and swapping the groups they are in.
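
The whole thing fits in a few lines. A sketch in C (the sizes, the met table, and the names are made up for illustration- the actual script differs):

#include <stdlib.h>

#define N      24          /* number of people */
#define GROUP  4           /* people per lunch group */

int met[N][N];             /* met[i][j] = times i and j have shared a group before */

/* A solution is a permutation of 0..N-1 cut into consecutive groups of GROUP. */
int collisions(const int *sol)
{
    int total = 0;
    for (int g = 0; g < N; g += GROUP)
        for (int i = g; i < g + GROUP; i++)
            for (int j = i + 1; j < g + GROUP; j++)
                total += met[sol[i]][sol[j]];
    return total;
}

void rmhc(int *sol, int iterations)
{
    for (int it = 0; it < iterations; it++) {
        int before = collisions(sol);
        int a = rand() % N, b = rand() % N;
        int tmp = sol[a]; sol[a] = sol[b]; sol[b] = tmp;   /* mutate: swap two people */
        if (collisions(sol) > before) {                    /* strictly worse: undo the swap */
            tmp = sol[a]; sol[a] = sol[b]; sol[b] = tmp;
        }
    }
}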

I also tried a Genetic Algorithm for this problem, but it takes longer to run and seems to perform worse. Interestingly, next week it looks like the problem has gotten much harder in the sense that I doubt there is going to be a perfect solution. For this new instance of the problem the GA actually seems to work better! This is probably because the solution space has gotten much more complex and the RMHC, being merely a hill climber, has trouble exploring it even when run many times (so that many possible hills are climbed).

So! Evolutionary Algorithms tackling a very simple, real-life-style problem! Despite being a simple problem (simple to describe) the set of possible solutions is huge. It is infeasible to enumerate the solutions, and it doesn't seem to have an efficient deterministic solution, so it is the sort of problem that AI is often used to tackle. If anyone can classify this problem exactly I would be interested to hear about it- it doesn't remind me of anything off hand.

Sunday, July 3, 2011

Compiled My First Program!

More progress! I successfully compiled and ran a program in my little language for the Arduino! The program that I got running is this:

: wait 100 delay ;
: blink 13 on wait 13 off wait ;
: setup 13 output ;
: loop blink ;

This program simply blinks the LED attached to pin 13 every 100ms. Nothing interesting about the program, but the fact that even such a simple program compiles is nice to see.

The next step is to add a couple of control structures, and then on to a type checking system!

Parsing with Parsec

I've made some progress with my little concatenative language. I'm using Parsec, a Haskell parsing library, to parse an input string into the three types of top-level definitions that language currently has- words, variables, and constants. The next step I think will be actually compiling to C, and then trying to write a type checker.

Sunday, June 26, 2011

Video Lectures

I've been watching video lectures recently, and I have to recommend SICP: http://groups.csail.mit.edu/mac/classes/6.001/abelson-sussman-lectures/ and http://videolectures.net/ssll09_slaney_fom/, which is an introduction to meta-logic.

Right now I'm watching http://videolectures.net/ssll09_martin_cai/, which is about computability and logic.

The videolectures.net site is pretty awesome in general. It has good AI lectures, and a lot of other fun subjects.

Thursday, June 9, 2011

Arduino is Coming!

I ordered an Arduino last week, and it should come fairly soon. I finally committed to getting an Arduino when I was playing around with my friend's Arduino and we implemented Pulse Width Modulation (which I learned about from him and his blog post: http://totherails.blogspot.com/2011/04/pulse-width-modulation.html) without using the built in PWM, but rather by turning a port on and off ourselves (which gives you more control than the built in ability, which is only available on some of the ports).

One of my main interests in having an Arduino is actually making programming languages for it. I don't think it would be hard to implement a simple Forth (without a command line interpreter- I'm not really interested in tackling that right now), especially if it was just in C and maybe some assembly. I might do that, but I have another idea I'm looking at right now- I want to implement a concatenative language, like Forth, but with a strong, safe, static type system with polynomial types (ideally). My other main interest is giving it easy-to-use combinators, including anonymous ones. I may have to look over Factor, Cat, or Joy and see if there is some inspiration there.

I want to write a compiler in Haskell that outputs a C program (really in the C-like language provided with the Arduino). An example program might (I imagine) look like this:


: inc dup @ 1+ ! ;
: wait 50 ms delay ;
: blink 14 port dup on wait off wait ;
0 count =
[ blink count inc ] [ count @ 10 != ] while

Which would turn port 14 on and off 10 times, with a delay of 50 ms between each change in state. The stuff between the [] is an anonymous combinator. I would like the usual cast of higher order functions and the special ones for stack-based languages (I remember Factor having some very nice ones). It won't be as sophisticated as a real general purpose language, but if I could get it to run the sort of simple programs I imagine myself wanting to execute on the Arduino that would be great.

I will of course report my successes and failures here. I want to note here that the language may end up being less powerful than I would like as I would be willing to remove features to ensure correctness and beauty in this project. Good things come from formalism and from well understood and controlled programming languages.

Tuesday, June 7, 2011

Object Oriented Perl

I will be doing some work involving Perl this summer. I've been reading up on the language- I'm not terribly familiar with it. One thing that has become clear is that it is sort of an anti-Haskell in some ways. It is very ad-hoc, there is no attention to formalism or static analysis (there is very little we can say statically), and of course the type system is dynamic.

I am very interested in programming languages- their expressiveness and the tools each offers- and looking at Perl from that perspective there is one interesting thing that I came across: the object system. As far as I can tell, an object is simply a reference that has been tagged as an object (blessed) into a package, so that method calls through that reference look up subroutines in that package and pass the reference as the first parameter. The interesting thing here is that the reference may be to anything. It looks like it is often a hash because of their versatility, but it does not have to be- it could be any value.

The reason this is interesting is that it adds a little something to the concept of an object. I'm not sure what the implications are, but I like that it's a little different and gives Perl's objects some personality. They need the extra boost, because otherwise they have a bit of an identity crisis I think- they are very implicit and ad-hoc (there isn't even explicit class syntax) and it's a little hard to see what they even are. Interestingly, it is useful for the reference to be to a closure, and we can get some information hiding (important in OOP) from this.

Sunday, June 5, 2011

Robust Gene Expression Programming as a Genetic Algorithm

I've posted before about the structure of Robust Gene Expression Programming (RGEP), but I wanted to go over the sense in which RGEP is a variation on the standard Genetic Algorithm (GA). Of course, it is closely related to Gene Expression Programming (GEP), Prefix Gene Expression Programming (PGEP), and Binary Gene Expression Programming (BGEP), and it takes inspiration from Genotype-Phenotype Mapping systems (GPM), but because of its simplicity it is closer to a simple GA than these other systems.

RGEP can be considered a bit vector encoded GA with point mutation, one and two point crossover, tournament selection, and a special "rotation" operator that does a left wrap around shift of its bit vector in multiples of the length of the encoding of a single symbol. Besides the rotation operator the main difference between RGEP and a GA is the expression mechanism.

Expression is very important in RGEP. It takes the bit vector and turns it into an expression tree, giving the genotype/phenotype distinction with the nice expressive structure of trees as the result and the simple and easy to manipulate structure of bit vectors as the encoding that we actually operate on with the genetic operators.

Expression takes several stages- first we group bits into fixed size chunks; the first bit of each group determines the type of symbol (operator or terminal) and the rest indexes into the list of that type of symbol. If the index is higher than the number of symbols, we wrap back around (so it's an index modulo the size of the list). This new list of symbols is called the raw symbol list. Then we do a postfix evaluation of the raw symbol list, and if any operator would cause a stack underflow we ignore it. This implicitly does a GPM-style editing of the symbol list and results in either some object whose fitness we evaluate, or builds a tree structure that we can use. Notice that we don't have to create the tree if we don't want to- it is okay to just do a postfix evaluation and create the object directly.

The nice thing about RGEP is that this is really it. That's all! GEP is much more complex, and I tried to make the most natural choice for each part so that it would be easy to code and easy to explain. The only thing missing in this post is some details about encoding. A Genetic Algorithm library should support this technique with very little work, and we get to solve problems with nice tree structured solutions nearly for free if we have a good GA library!

Gene Expression Programming

Some papers on Gene Expression Programming that I found useful for my thesis (I am trying to get a paper on my variation of GEP called Robust Gene Expression Programming accepted at the Complex Adaptive Systems conference, and I will post a reference to that paper if it is accepted):

X. Li, C. Zhou, W. Xiao, and P.C. Nelson, "Prefix Gene Expression Programming", in Late Breaking Paper at the Genetic and Evolutionary Computation Conference (GECCO-2005), Washington, D.C., 2005.

Xin Li, Chi Zhou, Weimin Xiao, and Peter C. Nelson. "Direct Evolution of Hierarchical Solutions with Self-Emergent Substructures." In Proceedings of the 4th International Conference on Machine Learning and Applications (ICMLA'05), pp. 337-342. Dec 15-17, 2005. Los Angeles, CA, USA.

Xin Li. "Self-Emergence of Structures in Gene Expression Programming." In
AAAI/SIGART Doctoral Consortium 2005. July 9-10, 2005. Pittsburgh, Pennsylvania, USA

Xin Li, Chi Zhou, Peter C. Nelson, and Thomas M. Tirpak. "Investigation of Constant Creation Techniques in the Context of Gene Expression Programming." In Late Breaking Paper at Genetic and Evolutionary Computation Conference (GECCO-2004). June 26-30, 2004. Seattle, Washington, USA

Ferreira, C. Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. Complex Systems, 13, 2 (2001)

Ferreira, C. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Angra do Heroismo, Portugal, 2002.

J. G. Moreno-Torres, X. Llorà, and D. E. Goldberg. Binary representation in gene expression programming: Towards a better scalability. Technical report, Illinois Genetic Algorithms Lab (IlliGAL) at UIUC, 2009.

Banzhaf, W., Genotype-Phenotype-Mapping and Neutral Variation – A Case Study in Genetic Programming. In Y. Davidor, H.-P. Schwefel, and R. Männer, eds., Parallel Problem Solving from Nature III, Lecture Notes in Computer Science, 866: 322-332, Springer-Verlag, 1994.

R. E. Keller and W. Banzhaf, "Genetic Programming Using Genotype-phenotype Mapping from Linear Genomes into Linear Phenotypes", in J. R. Koza, D. E. Goldberg, D. B. Fogel, W. Banzhaf, K. Chellapilla, M. Dorigo, and R. L. Riolo, eds., Genetic Programming 1996.


My favorite of these is the Prefix Gene Expression Programming paper- PGEP is (imho) a much cleaner and easier to use system than the original GEP, and I think it should be used as the basis of future extensions to GEP (I definitely based RGEP on PGEP (and on BGEP), not GEP).

Notice the papers on Genotype-Phenotype Mapping Systems- they are the basis of GEP, Grammatical evolution, Binary Genetic Programming, Developmental Genetic Programming, Cartesian Genetic Programming, and probably many other systems.

Saturday, June 4, 2011

Computation is Simplifying Proofs

I've been reading "Lectures on the Curry-Howard Isomorphism" recently. It covers all sorts of awesome things like intuitionistic logic, the lambda calculus, combinatory logic, as well as things I haven't gotten to yet like dependent types (which are very neat), the lambda cube, and pure type systems.

This paper has made something clear to me which I've wondered for a while now, which is the question of why the lambda calculus would be the term language for a type system. What is it about the lambda calculus that makes it appropriate for determining the truth of propositions in the logic that a type theory corresponds to? Can we come up with other term languages that are just as good? I often have trouble with questions of why things are the way they are in math, as they can seem arbitrary at times, but every once in a while I stumble across an answer and all is well in the world.

In this case the paper mentions that normalizing in the typed lambda calculi is like proof simplification, hence the name of the post. Therefore computation is just simplifying a proof, if we take normalization in the lambda calculus as our basis for computation. This is pretty awesome by itself, but it has one other surprising facet- the lambda terms are simply a linear representation of proofs in some logic! Amazing! I never saw it this way (perhaps it is obvious? certainly not to me). I always took the idea that a term in a type makes the corresponding proposition true loosely, as if the existence of the term simply validated the type by showing a construction for it, but it is a literal statement- the term actually, literally, is a proof of the statement of the type! Beta reduction is then simplifying the term, which simplifies the proof.

When I get to dependent types and the lambda cube I will certainly post more about this.

Learning Genetic Algorithms, encodings

We have got all the basics of GAs down so far. We have an encoding (a representation of solutions, often as bit vectors), a mutation, a crossover, and a selection. Often this is enough, but if we want to use this technique for a particular problem we have to determine how to encode solutions to the problem as a linear structure. There are a series of problems that may arise here, the main one being that the easiest encoding may be too "loose" in the sense that it may be able to express invalid individuals.

A common example is the traveling salesman problem (TSP), where we want a tour of a set of cities such that we visit each city once and only once. If this is represented as a list of cities then it is hard to enforce the constraints. Even starting out with only valid tours, a single point mutation will almost certainly result in a tour that visits one city twice and another city zero times.

So what do we do about it? We have talked about editing as a way to ensure each individual is valid, and for the case of expression trees I believe that editing (if done simply and easily, as in RGEP) is a very nice choice.

Other possibilities include modifying the fitness function such that invalid individuals take a fitness hit, often proportional to some measure of how invalid they are. This may work in some cases, but in others there is no meaning for invalid individuals and we would have to turn them into valid ones (which is basically editing). Another possibility is to modify the operators to create problem specific mutations and crossovers (or sometimes adding different operators or removing ones that don't seem to work- often crossover is the casualty here). For the traveling salesman example we could replace point mutation by an inversion operator that inverts the path between two indices. We could also add a swap operator that just swaps the cities at two indices (both are sketched below). Crossover is a bit difficult to modify here, but it has been done in several ways. This turns a general GA into a problem specific algorithm that may perform very well for some small (but possibly interesting) set of problems.
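
Both operators preserve the "visit every city exactly once" property, which is the whole point. A sketch in C, with the tour just an array of city indices (names invented for illustration):

#include <stdlib.h>

void swap_mutation(int *tour, int n)
{
    int i = rand() % n, j = rand() % n;
    int tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;    /* swap the cities at two indices */
}

void inversion_mutation(int *tour, int n)
{
    int i = rand() % n, j = rand() % n;
    if (i > j) { int t = i; i = j; j = t; }
    while (i < j) {                                         /* reverse the path between the indices */
        int tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;
        i++; j--;
    }
}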

Another possibility is to come up with some clever encoding that cannot even express invalid individuals. This is very difficult in general, as we are (very) often concerned with a subset of some obvious set of solutions, like the valid tours within the set of all paths for the TSP. An example of such an encoding would be an (even length) list of numbers where each pair of numbers in the list (the first and the second, the third and the fourth, etc) is a swap applied to some fixed ordering of the cities. This cannot express an invalid individual, as all swaps preserve validity, and we can use the regular genetic operators we know so well. The only problem is that a level of indirection can have consequences that are hard to predict and reason about. For example, a small change from point mutation on this list may make a large change in the resulting tour, so point mutation is more disruptive than we might expect it to be.
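
Decoding that encoding is just applying the swaps to a fixed base ordering, so every chromosome yields a valid tour. A sketch in C (names invented for illustration):

void decode_swaps(const int *pairs, int npairs, int *tour, int ncities)
{
    for (int c = 0; c < ncities; c++)
        tour[c] = c;                          /* the fixed base ordering of the cities */
    for (int k = 0; k < npairs; k++) {
        int i = pairs[2 * k]     % ncities;   /* each pair of numbers names two positions... */
        int j = pairs[2 * k + 1] % ncities;
        int tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;   /* ...to swap */
    }
}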

Another problem with encoding is that it may not be obvious that the problem can even be expressed as a vector of symbols. In Gene Expression Programming (GEP) and Linear Genetic Programming the linear representation allows us to do a GA style algorithm on a tree structure (especially with Robust Gene Expression Programming), but these algorithms also specialize their operators to work with the resulting expressions. For other problems we may have to be very indirect or clever just to create an appropriate encoding, and we may have to simplify the problem or exclude certain possibilities. If we are laying out a factory, how do we determine where each component goes without duplicates (which may or may not be allowed)? If solving sudoku, maybe the linear encoding of reading the board off row by row is not the best, because there is a lack of locality (squares nearby in the game board are not necessarily nearby in the representation), which makes crossover less helpful and more disruptive. Perhaps we need an inversion operator that only inverts with respect to crossover but not with respect to expression into a sudoku board (an interesting possibility usually done with a type of inversion operator which evolves the encoding along with the solutions).

So, we have seen some possible difficulties with encodings in a Genetic Algorithm. I would like to post some of my musings about the nature of GAs and evolutionary algorithms in general, as I have an interest in generalizing them and making them more rigorous.