Monday, November 21, 2011

Structure of C Types

It seems to be less common in imperative languages to investigate the structure of the type system and try to understand its power. I think it is important when learning a language to know how it expresses things, and the limits of what it can express. In a previous post, I talked a little about some basic structure of Java's type system, so in this post I want to record an observation about C's.

There are several base types in C- int, char, double- as well as variations like unsigned and const qualified versions of them. None of these base types are subtypes of each other. The type void is the bottom type- there are no objects that actually have that type.

Structs are products of types. Products of types are like Cartesian products of sets. A simple example would be the struct defined like so:
struct Point {
    int x;
    int y;
};
Which gives the type of pairs of ints.

Unions of types are unions of sets. On their own these aren't really all that useful- we need some way to determine which of the types in the union a particular element actually has. Without this information an element of a union is pretty useless. This is because we can only safely use the properties shared by all types in the union, and since the type system is pretty flat there aren't many such properties for any given union of types. While structs are products, unions are sums (although to be proper sums they would need to be disjoint unions).

Enums are each equivalent to some natural number (the number of elements in the enumeration), which gives us sums, products, and natural numbers in the type system (with void as 0, in the sense that the type has zero elements). Notice that the number of possible values of a struct of two enums of sizes n and m (using only elements of the enums, as C won't stop us from doing what we want) is n*m. The number of values of the union is n+m, assuming the enums are disjoint. If they are not, then the count is like a union in set theory- we "remove" duplicates. In a true disjoint union, elements that are otherwise the same are tagged with something to make them different.

C's only unary type constructor is the pointer (and the array). In some sense there are infinitely many array constructors, since we can provide a size, but as the size isn't enforced at all I'm going to ignore it. For any type we can make a pointer to it, and a pointer to that, and so on- the set of types is closed under pointing. Very often when we don't know the type of something we use a pointer to void. This is no coincidence- void* is the top type, the type that contains all other types (well, all other pointer types).

I think it is pretty interesting that void is 0 and void* is 1 in this system. Of course, with arbitrary casting a void* can actually be useful; without casting, the properties of an element of the top type are only those shared by all other types, which is basically nothing. Otherwise a struct with a void*, like:
struct Nat {
    int n;
    void* v;
};

Would be the same as:
struct Nat {
    int n;
};

Because we couldn't do anything with the variable v, which, translated to type theory, is like saying that N*1 = N. Similarly:
union Nat {
    int n;
    void v;
};

If this were a valid type then it would be N+0 which is equal to N as the void contributes nothing.

I think that's enough for now. I didn't get to function types, but maybe there will be a follow-up post on this.

Turing Machines, cont

A little more about Turing Machines.

One way to talk about a "problem" or computation is as a set of pairs- the inputs paired with the outputs. For example, the problem of adding one to a natural number is the set {(n, n+1) | n a natural number}. For more complex problems the input is some encoding of the problem, like a list of cities for the Traveling Salesman Problem, or a list of numbers that we want the average of. No matter what problem we choose, it is possible to encode the inputs as strings of bits ({0, 1}*). The output will be similarly encoded, and can be in whatever form we want. If the problem is a decision problem (given a number encoded in decimal, is this number prime?) then the answer may be read off by running the machine and looking at the first symbol on the tape (assuming we are using a tape infinite to the right but not to the left)- if it is a 1 then the number is prime, and if it is a 0 then the number is composite. We could also spell out "the number is prime" or use any other convention we want. Notice, by the way, that these sets of input/output pairs are very much like functions, although they may be partial in the sense that for some inputs there may be no solution (or the computation may not halt on that input).

Notice that a machine may be given a string of symbols that doesn't encode an instance of the problem we expect it to solve. This is a perfectly well defined situation- if the machine halts then it in some sense defines a larger problem than the one we defined for it (it comes up with some solution to some problem, although probably not a very interesting one). We don't concern ourselves with the action of a machine on arbitrary inputs for the most part- we just ignore the extra possible solutions.

Similarly, a Turing Machine can be encoded by writing down each part of its data, as we saw in a recent post. It doesn't matter how we do it- it's only important that there be at least one way. This means that we can encode a machine and its input and concatenate them, resulting in a single string. This suggests that we could pass this encoded machine + input into another machine, which brings us to the idea of a Universal Turing Machine. A Turing Machine U is called universal if every machine M and input I to that machine can be encoded such that the result of running U on the encoded M and I is the same as running M on I- in other words, U(M, I) = M(I). These machines are interesting because they can solve every possible problem if given an instance of the problem and a solver. Universal machines play an important role in automata theory, although I know nearly nothing about this part of Computer Science.

One other thing I wanted to mention about Turing Machines is blank tape machines. A blank tape machine is a machine that expects to start out with a tape of all blanks. It turns out that given some machine M and input I, we can construct a machine M' such that running M' on a blank tape gives the result of running M on I- M'(_) = M(I). This can be interpreted as saying that a program can take input, or we can hard-code the input in the source code.

Sunday, November 20, 2011

Prefix-free Universal Computable Functions

Next on the agenda is prefix-free universal computable functions. First I will talk about universal functions.

A function F (from N->N, taking a natural number and returning a natural number) is called universal if there is some way to encode any other function f and an input to f, call it x, such that when passed to F we get F(f, x) = f(x). Another way to think of this is to encode f and x, concatenate them, and pass them to F as F(f ++ x) (using Haskell's "++" for concatenation). This makes F a function of one variable that can perform the action of any other function. In other words, this type of function has the same place in recursion theory as a Universal Turing Machine has in automata theory- it shows that there are objects in the theory which in some sense encode every other object in the theory, so that the theory contains itself. This kind of self reference appears to be essential to computation, and we see it everywhere- the simplest example being how programs are stored in memory the same way data is: things that compute can generally take encodings of other such things and run them.

A prefix-free code is a fairly simple concept. Given some set of symbols, a prefix-free code is a set of strings of those symbols such that for every pair of strings, neither is a prefix of the other. One example of such a code would be the following encoding of the natural numbers: e(0) = 1, and for all n of the form k+1, e(n) = 0 ++ e(k). Written out this looks like {1, 01, 001, 0001, 00001, 000001, ...}, where every natural number n is encoded by n zeros followed by a one. Clearly no number's encoding is a prefix of another's: for that to happen, a longer codeword would have to agree with a shorter one at every position of the shorter one, but every codeword longer than a given codeword has a 0 in the one place the shorter codeword has its 1, so the shorter codeword cannot be its prefix.

So, what is a prefix-free universal computable function? This is a universal function whose domain (the numbers for which it is defined) is a prefix-free code. In other words, this allows us to distinguish between the function f encoded as input to the universal function F and the input to f, as we always know when a string ends in a prefix-free code.

There is another piece here- Chaitin's number is best described in terms of infinite bit streams. However, the part of each stream we are interested in is finite, and we need a way to know where the interesting part of the stream ends. For this reason, we decode the bits of the stream as a prefix-free code, and we then know where to stop looking for the interesting portion.

That's enough of that for now.

Saturday, November 19, 2011

Mu Recursion

The next topic in our discussion of Chaitin's number is Recursion Theory- another topic I don't talk about much on this blog. Recursion Theory is concerned with (possibly partial) functions from natural numbers to natural numbers. The recursive functions are those functions which are computable by a Turing Machine, and all Turing Machines can be expressed (albeit indirectly) by some (Mu) recursive function. Like the Lambda Calculus, this is just one more way to talk about computation- many systems give rise to computation, and they all share some overarching properties while differing in the details.

More formally, a function from natural numbers to natural numbers (N^m -> N with m a natural number) is called primitive recursive if it can be expressed using the following rules. The function plus(n, m) = n+m is primitive recursive (addition is clearly computable). A function that selects one of its arguments, fi(x1, ..., xn) = xi for 1 <= i <= n, is primitive recursive (sometimes called a projection, because it is suggestive of projecting a coordinate of a point on a plane onto its axis). A function f that takes m arguments, along with m functions g1, ..., gm each taking k arguments, can be composed to construct a function of k arguments: h(x1, ..., xk) = f(g1(x1, ..., xk), ..., gm(x1, ..., xk)). Finally, there is the rule the class is named for- definition by recursion: given primitive recursive f and g, the function h with h(0, x1, ..., xk) = f(x1, ..., xk) and h(n+1, x1, ..., xk) = g(h(n, x1, ..., xk), n, x1, ..., xk) is primitive recursive. No function is primitive recursive if it can't be described using the rules just stated.

Many functions can be described using only the rules given so far, despite how restrictive they may seem. However, there are functions that can be computed by a Turing Machine that aren't in the set of primitive recursive functions- this class is not quite powerful enough. We need one more rule to get all the power we need (and all the power possible- this will give us the full power of universal computation). This rule is usually written as Mu, which I will write as u, and is given as follows: for a function f that takes k+1 arguments, we construct the function uf(x1, x2, ..., xk) = z, where z is the smallest number such that f(z, x1, x2, ..., xk) = 0. I find this last rule less intuitive than the previous ones, but it isn't so bad when you get to know it.

Consider trying to compute the mu function for some easily computable function f. We have some fixed inputs that we were given, and we want to find the smallest number that causes f to be 0 when given this number along with the fixed inputs. Clearly we would start with 0, compute the result, and if the result was itself 0, we would be done. If not we would try 1, then 2, then 3, and so on. If we found a number that resulted in a 0 we would be done, but if no such number exists then we would be searching forever. This is the reason that mu takes us from the primitive recursive functions, whose computation always halts, to the full power of computation, where we may have some function and some inputs for which the result can't be calculated (the calculation will not halt).

Turing Machines

I've been reading up on Chaitin's number recently, and I thought I would record my understanding of it. This is a (real) number that has an interesting construction and some pretty far reaching consequences in logic and philosophy of math. Informally it can be thought of as the probability that a randomly chosen string of bits that encodes a program will halt. I don't normally talk about automata theory on this blog, so let me go slowly. This will be the start of some posts that I hope will culminate in a passable discussion of Chaitin's number and its relation to incompleteness. Starting out, however, I will give a pretty handwavy and informal definition of a Turing Machine.

A Turing Machine is an abstract machine that can, by definition, perform any computation whatsoever. It is the basic assumption of computer science that if something can be computed then it can be computed by a Turing Machine. A Turing Machine is given as follows- (W, R, d, S, F, i), where W is the working set of symbols, and contains the set of symbols R as well as the special symbols _ and |. S is the set of states the machine can be in, with F a subset called the halting states (if the machine gets into one of these states it has finished computing). The state i is an element of S and is called the initial state (for obvious reasons). The function d is defined as d:SxW->SxWx{->, <-} (the arrows indicate a direction to move), which, given the current state of the machine and a symbol, gives the new state, the symbol the machine should "output" (write to its tape), and the direction the read head should move. The idea of the tape is a sequence of locations, each with a symbol from the working set written on it. The machine is, at any time, looking at one of these locations. It uses its current state and the symbol it is looking at to write a symbol from its symbol set R at that location, move left or right, and change its internal state.

The tape can be given by a finite string of symbols from W followed by an infinite string of _ symbols (which represents a blank location). The symbol | is at the very left-most location, and the function d must write | when it sees | and must move to the right. This is one way to stop a machine from "falling off" the left end of the tape (we can also make the tape infinite in both directions). Notice that the tape doesn't really have to be infinite- it must only be "long enough" to finish whatever computation we are doing. We can imagine that if the machine is ever about to move to the right and there is no more tape there, we simply tack on some additional locations. This is important because for a machine that eventually halts on its input we will only need a finite tape, and therefore the machine itself isn't really infinite, yet we sidestep the need to say exactly how large it is. If we had to give a fixed size, there would always be a problem that needed just a little more than we gave ourselves, so we just assume we always have more tape lying around (all of this is informal, of course).

Execution starts with the machine in its initial state i, looking at the head of some initial tape. Execution proceeds by passing the current state (the machine itself can be thought of as a finite state machine with this infinite tape as its memory) and the symbol at the current location to the transition function d, writing the resulting symbol to the location, moving into the resulting state, and moving in the resulting direction. If the new state is in the set of halting states, then the machine does not continue executing. More formally, if at time t the machine is in a state s that is a member of F, then the tape at time t+n, for all n, is the same as at time t (no further changes can be made to it).

So- this is the first step in understanding Chaitin's number. If this is interesting to you, please find a more detailed and rigorous definition before accepting this one. There are many ways to do this construction, and some are easier in different situations.

Friday, November 18, 2011

Lagrange Interpolation in Haskell

Here is a naive translation of the formula for a Lagrange Polynomial into Haskell:
interp xys x = sum $ zipWith (*) ys $ map f xs where
    xs = map fst xys
    ys = map snd xys
    f xj = product $ map (p xj) xs
    p xj xm = if xj == xm then 1 else (x - xm) / (xj - xm)

This is a cool little function- given a list of points and the x value of a new point not in the list, it will give the interpolated y value at that x, as given by the Lagrange Polynomial: the polynomial of least degree passing through every (x, y) pair in the given list.

I've done symbolic regression with Gene Expression Programming, where you find a polynomial of arbitrary degree to fit a set of points, so it is cool to me to see such a simple function to compare against that is guaranteed to be of the lowest possible order.

McCarthy's 91 Function in Haskell

I've been reading a lot of articles on Wikipedia about different topics in the theory of computation today, and I happened across one on the McCarthy 91 function. This function over the natural numbers is very simple- M(n) = n - 10 if n > 100, and M(n) = M(M(n + 11)) otherwise.

A proof that M(n) = 91 for all 0 <= n <= 101 in Haskell is pretty concise:
m n | n > 100 = n - 10 | otherwise = m $ m $ n + 11
all ((91 ==) . m) [0..101]
Which evaluates to True.

Thursday, November 17, 2011

Hiding the Implementation of a Struct in C

As a novice C programmer, I've been learning a lot of interesting things lately. One thing I've just learned is that you can hide the implementation of a struct by declaring a typedef for a struct type that is never defined in the header, and then using only pointers to it in the header file. The actual struct is defined in the implementation file, and its internal structure is completely hidden. The only problem with this, of course, is that you must provide some way of getting and setting fields in the structure, which is a pain Java programmers know well.

In the .h file you might have a line like:

typedef struct Point_t Point;

Which is implemented in the .c file as:

struct Point_t {
    int x;
    int y;
};
Then you need functions like:

int point_get_x(Point *point) {return point->x;}
int point_get_y(Point *point) {return point->y;}
void point_set_x(Point *point, int x) {point->x = x;}
void point_set_y(Point *point, int y) {point->y = y;}

So, why would we go through all this trouble just to hide things from users? First off, some people really like hiding things from other people- it's easier to change implementation details, and it can assist separation of concerns. On the other hand, it can also make code harder to understand and debug, and it takes control away from the user. On yet a third hand, it may be necessary for structural reasons, like something I ran into recently.

For a project I'm working on, I've wrapped an API in a slightly higher-level API to make it more palatable for its users. This wrapper makes use of types defined by the lower-level API, which I don't want to expose to the end user, who should not have to interact with the low-level API at all. The problem is that I have structs whose fields include structs defined in the lower-level API. This means that the user can't access my structs directly without having at least the header files from the base program, which I want to avoid- mostly because it feels messy, as if information is leaking out of my abstraction.

To fix this problem, I've hidden the implementation and exposed only getters (aka accessors) and setters (aka mutators) for the fields that I want the user to be able to control. The details are tedious, but it gives you a more fine-grained way to expose information, and it hides the details of the underlying implementation better than before.

So I hope that was a little bit interesting- there is plenty of further reading found by googling "struct hiding in c" that gives more details and examples.

Tuesday, November 15, 2011

The Size of an Array in C

Recently I saw a piece of code that looked a little something like this:

void zero_array(char array[])
{
    memset(array, 0, sizeof(array));
}

I have to say, I was surprised by this. How could a program know the size of an arbitrary array? I expected sizeof to always be determined at compile time, and furthermore I wouldn't have thought that the size of an array could be determined at runtime anyway. These aren't Forth-style arrays, where the length is stored before the elements- you only have a pointer with no additional information.

As it turns out, C99 did add one case where sizeof is computed at runtime: variable-length arrays, whose size is only fixed once the program is running. That isn't what is happening in the code above, though. A function parameter declared as char array[] is really just a char pointer- the array "decays"- so sizeof(array) there is the size of a pointer, not the length of the caller's array, and the memset only zeroes the first few bytes.

This means that sizeof only reports a real element count when the actual array type is in scope (a fixed-size array, or a VLA), a property that can't be expressed in the type of a pointer parameter and is entirely implicit in the structure of the program. I'm not sure I would rely on this very often, especially if I didn't have strict control over where a pointer came from, but it's interesting and I wanted to record it here for future reference.

Thursday, November 10, 2011

Neural Networks- Training

A Neural Network with n input neurons, k neurons in the hidden layer, and o neurons in the output layer can be considered a pair of matrices- one nxk matrix and one kxo matrix. These matrices contain the weights on each neuron's inputs and are initially randomly generated. We multiply the input vector by the hidden neuron weights, apply the activation function to each component, and then multiply the result by the output neuron weights, again applying the activation function, to get the resulting vector.

So we have these random weights on the inputs of our neurons, and we can calculate an output from them. The inputs come from some problem, and they describe some function we want to approximate, some system we want to control (inputs as measurements of the system, output describing some action to take), some decision we want to make (which can be done in several ways), etc. Given that the weights are random, they are pretty much guaranteed not to solve any particular problem we have (or at least, they won't give a very good solution). This means we need some way of adjusting the weights based on the output we got and the output we expected for the input. We know the correct output only for some inputs (the ones we have collected to train the network with), and we want a network that, given the inputs we have, gives outputs close or equal to the ones we collected, and that, for inputs we don't have, gives outputs close or equal to the ones we want.

One way to do this is a "gradient descent" algorithm, which basically moves the network in the direction of greatest improvement, like standing on the side of a hill and taking a step in the steepest direction (the direction that will change your height the most). This will not necessarily find the best network (in general it won't), but it is relatively simple. The basic idea (using the epoch approach) is to sum the errors on each neuron over the whole training set, and to "propagate" this error backwards through the network (from output neurons to hidden layer neurons to input layer neurons). Each weight is then adjusted by adding to its previous value the neuron's error, passed through the derivative of the activation function, multiplied by the input on that connection, and scaled by a fixed value between 0 and 1 called the learning rate (set at the start of the run). In symbols, w(t+1) = w(t) + a*e*f'(z)*x, with w(t) the weight at time t, a the learning rate, e the error, z the neuron's weighted input sum, f' the derivative of the activation function, and x the input on that connection.

There are many other ways to do training. One technique that I think is pretty cool is to use a Genetic Algorithm. The individuals are vectors of weights, one weight for each input of each neuron. The fitness of an individual is simply the sum of the squared differences between actual and desired outputs (possibly summing the errors of all output neurons to produce a single fitness value, or using some problem-specific way of determining error). This technique has the advantages and disadvantages of Evolutionary Algorithms in general- it is very general, it won't make assumptions about the solution space, and it can potentially find the global optimum. On the other hand, it may take a long time.

Wednesday, November 9, 2011

Neural Networks

The perceptron by itself can be useful. Perceptrons can be used for simple problems, as simple classifiers producing features for more advanced ones, or as part of a boosting algorithm that makes use of a group of perceptrons (AdaBoost for example). A really cool thing you can do with them is to connect them up with each other.

There are many ways to do this, the simplest being the feed forward topology, where we have several groups of neurons and each neuron in a group is connected to each neuron in the next group, with one group the input and another the output. Activation flows from the input to the output, with no loops, which is why it's called feed forward. We could also connect them all to each other in a mesh, among other topologies. Each perceptron takes n inputs and produces 1 output.

If we have m input values, and we place each of these m values on the inputs of k neurons, we will get k outputs (one from each neuron). If we then connect each of these k outputs to n neurons (with n the number of outputs desired), we have a feed forward neural network with one so-called "hidden" layer. The hidden layer is very, very important- having one allows us to solve almost all problems (in principle), as long as we choose a good activation function (which I'll get to later). Having two hidden layers allows us to approximate any function, which is one of the crazy things about neural networks. We aren't in general going to be able to find the best network for a given problem, but it's nice to know that it exists somewhere in the neural-network-space.

For this network to be able to approximate nicely we need to add one other thing. The output of each neuron will be the sum of the products of each input and the weight for that input, fed into some function. This function can be any real valued function that you like, but certain ones are better than others. If we choose a non-linear function then we get the universal approximating behavior I mentioned earlier. Commonly a sigmoid function is chosen, as it is something like a neuron firing after a sufficiently high input. To get a complete on/off activation we can use what's called a step function, which jumps from, for example, 0 to 1 at a certain point (the activation point).

With these new ideas- a topology of neurons and an activation function- we have a nice new classifier. But the classifier itself isn't enough: we must be able to train it on a data set. In the next post I will describe a simple training mechanism called back propagation, and hopefully get into some of the other ways to do training.

Thursday, November 3, 2011

Perceptrons

I just got back from the conference Complex Adaptive Systems in Chicago, where I presented a paper called Robust Gene Expression Programming. I've posted about this algorithm before, and I won't go into it here. The conference was very interesting- I learned about some techniques I hadn't heard of (like Honey Bee Optimization and Spiking Neural Networks), and saw some variations and investigations of techniques I know a bit about. What I found was that I have a terribly inadequate understanding of these methods and the underlying mathematics. This has prompted me to try to learn more about these techniques. I will be posting a little about different ideas and basic AI/machine learning as I learn and relearn it.

The first thing I want to talk about is the humblest classifier, the perceptron. This little guy can only handle very simple problems when applied directly, but is nonetheless interesting and useful. It is the building block of neural networks, and it can be used when a simple classifier is needed such as for boosting or for feature extraction for a more advanced classifier.

Before giving the details, I want to discuss the notion of classification. Given some data points (medical history about each person in some group for example) we may want to label each point based on some attribute (likely to get a certain disease for example). Often we will have some data points with this extra attribute already known, and we will want some way of telling, given a new point of data, which label to give it based on what we know about previous points. There may be many labels (classes), but there must be at least two. One way to visualize this situation is a table, where each row is a sample and the last column is a class or label or group that in some sense describes the sample. Now we add a row with data in all columns but the last, and we want to fill it in. We will want to make use of the previous rows to do this, based on which ones are similar to the new data.

We may know the number of classes beforehand, like if we are classifying football games into win vs lose. In other situations we will not know how many classes to use, such as if we are clustering groups of data. In that case each cluster (collection of similar data points) will be given a class, and we end up with new knowledge of what groupings exist in a data set. For problems where we know the number of classes, notice that the number two is a bit of a magic number- we can break a problem of classifying into n classes into n problems classifying into two classes (for each class, is the data point in the class or not).

The perceptron can only classify into two classes. The perceptron itself consists of a vector of n+1 real numbers, where n is the number of inputs for the particular problem (different devices' measurements, for example, like temperature, pressure, and humidity). These n+1 numbers are called the weights, and can be set initially to 0. The extra weight is called the bias; the input it is paired with is fixed at 1.

The data is a set of points, with the inputs and the expected outputs. Continuing the example, say we have a set of measurements with the expected output being whether or not it rained. These are known values, and will be used to train the perceptron to predict whether or not it will rain based on a given day's measurements. On seeing a data point, the neuron will output either a positive or negative number, which indicates which of the two classes the data belongs to.

Training our little perceptron, with all its 0 weights, proceeds by going through the data and taking the dot product of the weights and the input data. This gives us the actual output for this piece of data, which we can compare to the expected output that we gathered as part of our data. We then update the weights based on the difference between the expected and the actual output. In particular, for each weight, the next value (at time t+1) is w(t+1) = w(t) + a(d-y)x, where w is the weight, a is a constant between 0 and 1 (called the learning rate), d is the expected output, y is the actual output, and x is the input.

We can do this learning step for each data point, going through the whole data set repeatedly until the error on each point is below a value set by the user. In simple problems the error will go to 0- specifically in the case where the classes can be separated by a hyperplane in the problem domain. This is all much easier to see graphically than in text.

That's all I want to say for now. I hope this wasn't too rambling and unhelpful.