This notebook illustrates how to use the NeuralNets module to teach a neural network to recognize digits. It also explores the different optimization and regularization methods supported by the module.

An Artificial Neural Network is a machine learning model that is inspired by the way biological nervous systems, such as the brain, process information. The building block of neural networks is called a neuron. All a neuron does is take a weighted sum of its inputs and pass it through some non-linear function (activation function) to produce its output. A (feed-forward) neural network is a bunch of neurons arranged in layers, where each neuron in layer *i* takes its input from all the neurons in layer *i-1*. For more information on how neural networks work, follow this link.

In this notebook, we'll look at how a neural network can be used to recognize digits. We'll train the network on the USPS dataset of handwritten digits.

We'll start by loading the data and dividing it into a training set, a validation set, and a test set. The USPS dataset has 9298 examples of handwritten digits. We'll intentionally use just a small portion (1000 examples) of the dataset for training . This is to keep training time small and to illustrate the effects of different regularization methods.

In []:

```
from scipy.io import loadmat
from modshogun import RealFeatures, MulticlassLabels
# load the dataset
dataset = loadmat('../../../data/multiclass/usps.mat')
Xall = dataset['data']
# the usps dataset has the digits labeled from 1 to 10
# we'll subtract 1 to make them in the 0-9 range instead
Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1
# 1000 examples for training
Xtrain = RealFeatures(Xall[:,0:1000])
Ytrain = MulticlassLabels(Yall[0:1000])
# 4000 examples for validation
Xval = RealFeatures(Xall[:,1001:5001])
Yval = MulticlassLabels(Yall[1001:5001])
# the rest for testing
Xtest = RealFeatures(Xall[:,5002:-1])
Ytest = MulticlassLabels(Yall[5002:-1])
```

To create a neural network in shogun, we'll first create an instance of NeuralNetwork and then initialize it by telling it how many inputs it has and what type of layers it contains. To specifiy the layers of the network a DynamicObjectArray is used. The array contains instances of NeuralLayer-based classes that determine the type of neurons each layer consists of. Some of the supported layer types are: NeuralLinearLayer, NeuralLogisticLayer and NeuralSoftmaxLayer.

We'll create a feed-forward, fully connected (every neuron is connected to all neurons in the layer below) neural network with 2 logistic hidden layers and a softmax output layer. The network will have 256 inputs, one for each pixel (16*16 image). The first hidden layer will have 256 neurons, the second will have 128 neurons, and the output layer will have 10 neurons, one for each digit class. Note that we're using a big network, compared with the size of the training set. This is to emphasize the effects of different regularization methods. We'll try training the network with:

- No regularization
- L2 regularization
- L1 regularization
- Dropout regularization

Therefore, we'll create 4 versions of the network, train each one of them differently, and then compare the results on the validation set.

In []:

```
from modshogun import NeuralNetwork, NeuralInputLayer, NeuralLogisticLayer, NeuralSoftmaxLayer
from modshogun import DynamicObjectArray, MSG_INFO
# setup the layers
layers = DynamicObjectArray()
layers.append_element(NeuralInputLayer(256)) # input layer, 256 neurons
layers.append_element(NeuralLogisticLayer(256)) # first hidden layer, 256 neurons
layers.append_element(NeuralLogisticLayer(128)) # second hidden layer, 128 neurons
layers.append_element(NeuralSoftmaxLayer(10)) # output layer, 10 neurons
# create the networks
net_no_reg = NeuralNetwork(layers)
net_no_reg.quick_connect()
net_no_reg.initialize()
net_l2 = NeuralNetwork(layers)
net_l2.quick_connect()
net_l2.initialize()
net_l1 = NeuralNetwork(layers)
net_l1.quick_connect()
net_l1.initialize()
net_dropout = NeuralNetwork(layers)
net_dropout.quick_connect()
net_dropout.initialize()
```

We can also visualize what the network would look like. To do that we'll draw a smaller network using networkx. The network we'll draw will have 8 inputs (labeled X), 8 neurons in the first hidden layer (labeled H), 4 neurons in the second hidden layer (labeled U), and 6 neurons in the output layer (labeled Y). Each neuron will be connected to all neurons in the layer that precedes it.

In []:

```
# import networkx, install if necessary
try:
import networkx as nx
except ImportError:
import pip
pip.main(['install', '--user', 'networkx'])
import networkx as nx
G = nx.DiGraph()
pos = {}
for i in range(8):
pos['X'+str(i)] = (i,0) # 8 neurons in the input layer
pos['H'+str(i)] = (i,1) # 8 neurons in the first hidden layer
for j in range(8): G.add_edge('X'+str(j),'H'+str(i))
if i<4:
pos['U'+str(i)] = (i+2,2) # 4 neurons in the second hidden layer
for j in range(8): G.add_edge('H'+str(j),'U'+str(i))
if i<6:
pos['Y'+str(i)] = (i+1,3) # 6 neurons in the output layer
for j in range(4): G.add_edge('U'+str(j),'Y'+str(i))
nx.draw(G, pos, node_color='y', node_size=750)
```

NeuralNetwork supports two methods for training: LBFGS (default) and mini-batch gradient descent.

LBFGS is a full-batch optimization methods, it looks at the entire training set each time before it changes the network's parameters. This makes it slow with large datasets. However, it works very well with small/medium size datasets and is very easy to use as it requires no parameter tuning.

Mini-batch Gradient Descent looks at only a small portion of the training set (a mini-batch) before each step, which it makes it suitable for large datasets. However, it's a bit harder to use than LBFGS because it requires some tuning for its parameters (learning rate, learning rate decay,..)

Training in NeuralNetwork stops when:

- Number of epochs (iterations over the entire training set) exceeds max_num_epochs
- The (percentage) difference in error between the current and previous iterations is smaller than epsilon, i.e the error is not anymore being reduced by training

To see all the options supported for training, check the documentation

We'll first write a small function to calculate the classification accuracy on the validation set, so that we can compare different models:

In []:

```
from modshogun import MulticlassAccuracy
def compute_accuracy(net, X, Y):
predictions = net.apply_multiclass(X)
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(predictions, Y)
return accuracy*100
```

**Training without regularization**

We'll start by training the first network without regularization using LBFGS optimization. Note that LBFGS is suitable because we're using a small dataset.

In []:

```
# turn on INFO logging in order to monitor the progress of training
net_no_reg.io.set_loglevel(MSG_INFO)
net_no_reg.epsilon = 1e-6
net_no_reg.max_num_epochs = 600
net_no_reg.set_labels(Ytrain)
net_no_reg.train(Xtrain) # this might take a while, depending on your machine
# compute accuracy on the validation set
print "Without regularization, accuracy on the validation set =", compute_accuracy(net_no_reg, Xval, Yval), "%"
```

**Training with L2 regularization**

We'll train another network, but with L2 regularization. This type of regularization attempts to prevent overfitting by penalizing large weights. This is done by adding \(\frac{1}{2} \lambda \Vert W \Vert_2\) to the optimization objective that the network tries to minimize, where \(\lambda\) is the regularization coefficient.

In []:

```
# turn on L2 regularization
net_l2.l2_coefficient = 3e-4
net_l2.epsilon = 1e-6
net_l2.max_num_epochs = 600
net_l2.set_labels(Ytrain)
net_l2.train(Xtrain) # this might take a while, depending on your machine
# compute accuracy on the validation set
print "With L2 regularization, accuracy on the validation set =", compute_accuracy(net_l2, Xval, Yval), "%"
```

**Training with L1 regularization**

We'll now try L1 regularization. It works by by adding \(\lambda \Vert W \Vert_1\) to the optimzation objective. This has the effect of penalizing all non-zero weights, therefore pushing all the weights to be close to 0.

In []:

```
# turn on L1 regularization
net_l1.l1_coefficient = 3e-5
net_l1.epsilon = 1e-6
net_l1.max_num_epochs = 600
net_l1.set_labels(Ytrain)
net_l1.train(Xtrain) # this might take a while, depending on your machine
# compute accuracy on the validation set
print "With L1 regularization, accuracy on the validation set =", compute_accuracy(net_l1, Xval, Yval), "%"
```

**Training with dropout**

The idea behind dropout is very simple: randomly ignore some neurons during each training iteration. When used on neurons in the hidden layers, it has the effect of forcing each neuron to learn to extract features that are useful in any context, regardless of what the other hidden neurons in its layer decide to do. Dropout can also be used on the inputs to the network by randomly omitting a small fraction of them during each iteration.

When using dropout, it's usually useful to limit the L2 norm of a neuron's incoming weights vector to some constant value.

Due to the stochastic nature of dropout, LBFGS optimization doesn't work well with it, therefore we'll use mini-batch gradient descent instead.

In []:

```
from modshogun import NNOM_GRADIENT_DESCENT
# set the dropout probabilty for neurons in the hidden layers
net_dropout.dropout_hidden = 0.5
# set the dropout probabilty for the inputs
net_dropout.dropout_input = 0.2
# limit the maximum incoming weights vector lengh for neurons
net_dropout.max_norm = 15
net_dropout.epsilon = 1e-6
net_dropout.max_num_epochs = 600
# use gradient descent for optimization
net_dropout.optimization_method = NNOM_GRADIENT_DESCENT
net_dropout.gd_learning_rate = 0.5
net_dropout.gd_mini_batch_size = 100
net_dropout.set_labels(Ytrain)
net_dropout.train(Xtrain) # this might take a while, depending on your machine
# compute accuracy on the validation set
print "With dropout, accuracy on the validation set =", compute_accuracy(net_dropout, Xval, Yval), "%"
```

We can see that, dropout regularization works best in our case, and L1 and L2 regularization have very similar performance. Now we can measure the classification accuracy on the test set using the dropout network:

In []:

```
print "With dropout, accuracy on the test set =", compute_accuracy(net_dropout, Xtest, Ytest), "%"
```

We can also look at some of the images and the network's response to each of them:

In []:

```
predictions = net_dropout.apply_multiclass(Xtest)
_=figure(figsize=(16,6))
# plot some images, with the predicted label as the title of each image
# this code is borrowed from the KNN notebook by Chiyuan Zhang and SÃ¶ren Sonnenburg
for i in range(10):
ax=subplot(2,5,i)
title(int(predictions[i]))
ax.imshow(Xtest[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax.set_xticks([])
ax.set_yticks([])
```