Thursday, November 3, 2011

Perceptrons

I just got back from the conference Complex Adaptive Systems in Chicago, where I presented a paper called Robust Gene Expression Programming. I've posted about this algorithm before, and I won't go into it here. The conference was very interesting: I learned about some techniques I hadn't heard of (like Honey Bee Optimization and Spiking Neural Networks), and saw some variations and investigations of techniques I know a bit about. What I found was that I have a terribly inadequate understanding of these methods and the underlying mathematics. This has prompted me to try to learn more about them. I will be posting a little about different ideas and basic AI/machine learning as I learn and relearn it.

The first thing I want to talk about is the humblest classifier, the perceptron. This little guy can only handle very simple problems when applied directly, but is nonetheless interesting and useful. It is the building block of neural networks, and it can be used on its own when a simple classifier is needed, such as for boosting or for extracting features to feed a more advanced classifier.

Before giving the details, I want to discuss the notion of classification. Given some data points (medical history for each person in some group, for example) we may want to label each point based on some attribute (likely to get a certain disease, for example). Often we will have some data points with this extra attribute already known, and we will want some way of telling, given a new point of data, which label to give it based on what we know about the previous points. There may be many labels (classes), but there must be at least two. One way to visualize this situation is as a table, where each row is a sample and the last column is a class (or label, or group) that in some sense describes the sample. Now we add a row with data in all columns but the last, and we want to fill that last column in, making use of the previous rows based on which ones are similar to the new data.
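To make that picture concrete, here is a minimal sketch in Python of what such a table might look like. The feature names and values are invented purely for illustration:

```python
# Each row is a sample: a list of measurements plus a known label.
# The features and values here are made up for illustration.
known_samples = [
    # [age, blood_pressure, cholesterol], has_disease
    ([63, 140, 260], 1),
    ([41, 118, 190], 0),
    ([57, 152, 230], 1),
]

# The new row with the last column missing: we want to predict its label.
new_sample = [49, 127, 205]
```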

We may know the number of classes beforehand, like if we are classifying football games into win vs lose. In other situations we will not know how many classes to use, such as when we are clustering groups of data. In that case each cluster (collection of similar data points) is given a class, and we end up with new knowledge of what groupings exist in a data set. For problems where we know the number of classes, notice that the number two is a bit of a magic number: we can break a problem of classifying into n classes into n problems of classifying into two classes (for each class, is the data point in that class or not?).
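A quick sketch of that reduction, with made-up labels: for each class we relabel the data as +1 (in the class) or -1 (not in the class), and solve the resulting two-class problem.

```python
def one_vs_rest(labels, target):
    """Relabel an n-class problem as a binary one:
    +1 if a sample belongs to `target`, -1 otherwise."""
    return [1 if label == target else -1 for label in labels]

labels = ["win", "lose", "draw", "win"]
print(one_vs_rest(labels, "win"))   # [1, -1, -1, 1]
print(one_vs_rest(labels, "draw"))  # [-1, -1, 1, -1]
```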

The perceptron can only classify into two classes. The perceptron itself consists of a vector of n+1 real numbers, where n is the number of inputs for the particular problem (measurements from different devices, for example, like temperature, pressure, and humidity). These n+1 numbers are called the weights, and can be set initially to 0. The extra weight is called the bias; it goes with an extra input whose value is always just set to 1.
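In code, the perceptron is just that weight vector plus a rule for turning a dot product into a class. A minimal sketch, assuming the conventional choices of a threshold at 0 and outputs of +1 and -1:

```python
n = 3  # number of inputs, e.g. temperature, pressure, humidity
weights = [0.0] * (n + 1)  # n input weights plus the bias weight, all starting at 0

def classify(weights, inputs):
    # Append the constant bias input of 1, then take the dot product.
    x = inputs + [1]
    activation = sum(w * xi for w, xi in zip(weights, x))
    # The sign of the dot product picks one of the two classes.
    return 1 if activation >= 0 else -1
```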

The data is a set of points, with the inputs and the expected outputs. Continuing the example, say we have a set of measurements with the expected output being whether or not it rained. These are known values, and will be used to train the perceptron to predict whether or not it will rain given a day's measurements. On seeing a data point, the perceptron will output either a positive or negative number, which indicates to which of the two classes the data belongs.

Training our little perceptron, with all its 0 weights, proceeds by going through the data and taking the dot product of the weights and the input data. This gives us the actual output for this piece of data, which we can compare to the expected output that we gathered as part of our data. We then update the weights based on the difference between the expected and the actual output. In particular, for each weight, the next value (at time t+1) is w(t+1) = w(t) + a(d - y)x, where w is the weight, a is a constant between 0 and 1 (called the learning rate), d is the expected output, y is the actual output, and x is the input that the weight applies to.
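Putting the update rule into a full training loop looks something like the following. This is a sketch under the usual conventions (labels in {-1, +1}, a constant bias input of 1, a fixed maximum number of passes over the data):

```python
def train(data, learning_rate=0.1, max_epochs=100):
    """Train a perceptron on `data`, a list of (inputs, expected) pairs
    with expected in {-1, +1}. Returns the learned weights."""
    n = len(data[0][0])
    weights = [0.0] * (n + 1)  # n input weights plus the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, expected in data:
            x = inputs + [1]  # constant bias input
            # The sign of the dot product is the actual output y.
            actual = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1
            if actual != expected:
                mistakes += 1
                # The update rule: w(t+1) = w(t) + a*(d - y)*x
                weights = [w + learning_rate * (expected - actual) * xi
                           for w, xi in zip(weights, x)]
        if mistakes == 0:
            break  # every point is classified correctly
    return weights
```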

We can do this learning step for each data point, going through the whole data set repeatedly until the error on each point is below a value set by the user. In simple problems the error will go to 0; this is exactly the case where the classes can be separated by a hyperplane in the input space. This is all much easier to see graphically than in text.
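As a quick check of that separable case, here is the `train` sketch from above run on logical AND, which is linearly separable, so training should reach zero errors:

```python
# Logical AND, encoded with -1/+1 labels: only (1, 1) is in the positive class.
data = [
    ([0, 0], -1),
    ([0, 1], -1),
    ([1, 0], -1),
    ([1, 1],  1),
]

weights = train(data)
for inputs, expected in data:
    x = inputs + [1]
    actual = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1
    print(inputs, "->", actual, "(expected", expected, ")")
```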

That's all I want to say for now. I hope this wasn't too rambling and unhelpful.

3 comments:

  1. The Humblest Perceptron: One lonely perceptron defies convention to head out across the great, unknown hyperplane in search of his father.

  2. Meanwhile, the aliens were killed by god's humblest creature... the Tyrannosaurus Rex.

  3. A classic bildungsroman among classifiers.
