A (single layer) autoencoder is a neural network that has three layers: an input layer, a hidden (encoding) layer, and a decoding layer. The network is trained to reconstruct its inputs, which forces the hidden layer to try to learn good representations of the inputs.

In order to encourage the hidden layer to learn good input representations, certain variations on the simple autoencoder exist. Shogun currently supports two of them: Denoising Autoencoders [1] and Contractive Autoencoders [2]. In this notebook we'll focus on denoising autoencoders.

For denoising autoencoders, each time a new training example is introduced to the network, it's randomly corrupted in some mannar, and the target is set to the original example. The autoencoder will try to recover the orignal data from it's noisy version, which is why it's called a denoising autoencoder. This process will force the hidden layer to learn a good representation of the input, one which is not affected by the corruption process.

A deep autoencoder is an autoencoder with multiple hidden layers. Training such autoencoders directly is usually difficult, however, they can be pre-trained as a stack of single layer autoencoders. That is, we train the first hidden layer to reconstruct the input data, and then train the second hidden layer to reconstruct the states of the first hidden layer, and so on. After pre-training, we can train the entire deep autoencoder to fine-tune all the parameters together. We can also use the autoencoder to initialize a regular neural network and train it in a supervised manner.

In this notebook we'll apply deep autoencoders to the USPS dataset for handwritten digits. We'll start by loading the data and dividing it into a training set and a test set:

In [1]:

```
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from scipy.io import loadmat
from shogun import RealFeatures, MulticlassLabels, Math
# load the dataset
dataset = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))
Xall = dataset['data']
# the usps dataset has the digits labeled from 1 to 10
# we'll subtract 1 to make them in the 0-9 range instead
Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1
# 4000 examples for training
Xtrain = RealFeatures(Xall[:,0:4000])
Ytrain = MulticlassLabels(Yall[0:4000])
# the rest for testing
Xtest = RealFeatures(Xall[:,4000:-1])
Ytest = MulticlassLabels(Yall[4000:-1])
# initialize the random number generator with a fixed seed, for repeatability
Math.init_random(10)
```

Similar to regular neural networks in Shogun, we create a deep autoencoder using an array of NeuralLayer-based classes, which can be created using the utility class NeuralLayers. However, for deep autoencoders there's a restriction that the layer sizes in the network have to be symmetric, that is, the first layer has to have the same size as the last layer, the second layer has to have the same size as the second-to-last layer, and so on. This restriction is necessary for pre-training to work. More details on that can found in the following section.

We'll create a 5-layer deep autoencoder with following layer sizes: 256->512->128->512->256. We'll use rectified linear neurons for the hidden layers and linear neurons for the output layer.

In [2]:

```
from shogun import NeuralLayers, DeepAutoencoder
layers = NeuralLayers()
layers = layers.input(256).rectified_linear(512).rectified_linear(128).rectified_linear(512).linear(256).done()
ae = DeepAutoencoder(layers)
```

Now we can pre-train the network. To illustrate exactly what's going to happen, we'll give the layers some labels: L1 for the input layer, L2 for the first hidden layer, and so on up to L5 for the output layer.

In pre-training, an autoencoder will formed for each encoding layer (layers up to the middle layer in the network). So here we'll have two autoencoders: L1->L2->L5, and L2->L3->L4. The first autoencoder will be trained on the raw data and used to initialize the weights and biases of layers L2 and L5 in the deep autoencoder. After the first autoencoder is trained, we use it to transform the raw data into the states of L2. These states will then be used to train the second autoencoder, which will be used to initialize the weights and biases of layers L3 and L4 in the deep autoencoder.

The operations described above are performed by the the pre_train() function. Pre-training parameters for each autoencoder can be controlled using the pt_* public attributes of DeepAutoencoder. Each of those attributes is an SGVector whose length is the number of autoencoders in the deep autoencoder (2 in our case). It can be used to set the parameters for each autoencoder indiviually. SGVector's set_const() method can also be used to assign the same parameter value for all autoencoders.

Different noise types can be used to corrupt the inputs in a denoising autoencoder. Shogun currently supports 2 noise types: dropout noise, where a random portion of the inputs is set to zero at each iteration in training, and gaussian noise, where the inputs are corrupted with random gaussian noise. The noise type and strength can be controlled using pt_noise_type and pt_noise_parameter. Here, we'll use dropout noise.

In [3]:

```
from shogun import AENT_DROPOUT, NNOM_GRADIENT_DESCENT
ae.pt_noise_type.set_const(AENT_DROPOUT) # use dropout noise
ae.pt_noise_parameter.set_const(0.5) # each input has a 50% chance of being set to zero
ae.pt_optimization_method.set_const(NNOM_GRADIENT_DESCENT) # train using gradient descent
ae.pt_gd_learning_rate.set_const(0.01)
ae.pt_gd_mini_batch_size.set_const(128)
ae.pt_max_num_epochs.set_const(50)
ae.pt_epsilon.set_const(0.0) # disable automatic convergence testing
# uncomment this line to allow the training progress to be printed on the console
#from shogun import MSG_INFO; ae.io.set_loglevel(MSG_INFO)
# start pre-training. this might take some time
ae.pre_train(Xtrain)
```

In [4]:

```
ae.set_noise_type(AENT_DROPOUT) # same noise type we used for pre-training
ae.set_noise_parameter(0.5)
ae.set_max_num_epochs(50)
ae.set_optimization_method(NNOM_GRADIENT_DESCENT)
ae.set_gd_mini_batch_size(128)
ae.set_gd_learning_rate(0.0001)
ae.set_epsilon(0.0)
# start fine-tuning. this might take some time
_ = ae.train(Xtrain)
```

In [5]:

```
# get a 50-example subset of the test set
subset = Xtest[:,0:50].copy()
# corrupt the first 25 examples with multiplicative noise
subset[:,0:25] *= (random.random((256,25))>0.5)
# corrupt the other 25 examples with additive noise
subset[:,25:50] += random.random((256,25))
# obtain the reconstructions
reconstructed_subset = ae.reconstruct(RealFeatures(subset))
# plot the corrupted data and the reconstructions
figure(figsize=(10,10))
for i in range(50):
ax1=subplot(10,10,i*2+1)
ax1.imshow(subset[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax1.set_xticks([])
ax1.set_yticks([])
ax2=subplot(10,10,i*2+2)
ax2.imshow(reconstructed_subset[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax2.set_xticks([])
ax2.set_yticks([])
```

The figure shows the corrupted examples and their reconstructions. The top half of the figure shows the ones corrupted with multiplicative noise, the bottom half shows the ones corrupted with additive noise. We can see that the autoencoders can provide decent reconstructions despite the heavy noise.

Next we'll look at the weights that the first hidden layer has learned. To obtain the weights, we can call the get_layer_parameters() function, which will return a vector containing both the weights and the biases of the layer. The biases are stored first in the array followed by the weights matrix in column-major format.

In [6]:

```
# obtain the weights matrix of the first hidden layer
# the 512 is the number of biases in the layer (512 neurons)
# the transpose is because numpy stores matrices in row-major format, and Shogun stores
# them in column major format
w1 = ae.get_layer_parameters(1)[512:].reshape(256,512).T
# visualize the weights between the first 100 neurons in the hidden layer
# and the neurons in the input layer
figure(figsize=(10,10))
for i in range(100):
ax1=subplot(10,10,i+1)
ax1.imshow(w1[i,:].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax1.set_xticks([])
ax1.set_yticks([])
```

In [7]:

```
from shogun import NeuralSoftmaxLayer
nn = ae.convert_to_neural_network(NeuralSoftmaxLayer(10))
nn.set_max_num_epochs(50)
nn.set_labels(Ytrain)
_ = nn.train(Xtrain)
```

Next, we'll evaluate the accuracy on the test set:

In [8]:

```
from shogun import MulticlassAccuracy
predictions = nn.apply_multiclass(Xtest)
accuracy = MulticlassAccuracy().evaluate(predictions, Ytest) * 100
print("Classification accuracy on the test set =", accuracy, "%")
```

Convolutional autoencoders [3] are the adaptation of autoencoders to images (or other spacially-structured data). They are built with convolutional layers where each layer consists of a number of feature maps. Each feature map is produced by convolving a small filter with the layer's inputs, adding a bias, and then applying some non-linear activation function. Additionally, a max-pooling operation can be performed on each feature map by dividing it into small non-overlapping regions and taking the maximum over each region. In this section we'll pre-train a convolutional network as a stacked autoencoder and use it for classification.

In Shogun, convolutional autoencoders are constructed and trained just like regular autoencoders. Except that we build the autoencoder using CNeuralConvolutionalLayer objects:

In [9]:

```
from shogun import DynamicObjectArray, NeuralInputLayer, NeuralConvolutionalLayer, CMAF_RECTIFIED_LINEAR
conv_layers = DynamicObjectArray()
# 16x16 single channel images
conv_layers.append_element(NeuralInputLayer(16,16,1))
# the first encoding layer: 5 feature maps, filters with radius 2 (5x5 filters)
# and max-pooling in a 2x2 region: its output will be 10 8x8 feature maps
conv_layers.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 5, 2, 2, 2, 2))
# the second encoding layer: 15 feature maps, filters with radius 2 (5x5 filters)
# and max-pooling in a 2x2 region: its output will be 20 4x4 feature maps
conv_layers.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 15, 2, 2, 2, 2))
# the first decoding layer: same structure as the first encoding layer
conv_layers.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 5, 2, 2))
# the second decoding layer: same structure as the input layer
conv_layers.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 1, 2, 2))
conv_ae = DeepAutoencoder(conv_layers)
```

Now we'll pre-train the autoencoder:

In [10]:

```
conv_ae.pt_noise_type.set_const(AENT_DROPOUT) # use dropout noise
conv_ae.pt_noise_parameter.set_const(0.3) # each input has a 30% chance of being set to zero
conv_ae.pt_optimization_method.set_const(NNOM_GRADIENT_DESCENT) # train using gradient descent
conv_ae.pt_gd_learning_rate.set_const(0.002)
conv_ae.pt_gd_mini_batch_size.set_const(100)
conv_ae.pt_max_num_epochs[0] = 30 # max number of epochs for pre-training the first encoding layer
conv_ae.pt_max_num_epochs[1] = 10 # max number of epochs for pre-training the second encoding layer
conv_ae.pt_epsilon.set_const(0.0) # disable automatic convergence testing
# start pre-training. this might take some time
conv_ae.pre_train(Xtrain)
```

And then convert the autoencoder to a regular neural network for classification:

In [11]:

```
conv_nn = ae.convert_to_neural_network(NeuralSoftmaxLayer(10))
# train the network
conv_nn.set_epsilon(0.0)
conv_nn.set_max_num_epochs(50)
conv_nn.set_labels(Ytrain)
# start training. this might take some time
_ = conv_nn.train(Xtrain)
```

And evaluate it on the test set:

In [12]:

```
predictions = conv_nn.apply_multiclass(Xtest)
accuracy = MulticlassAccuracy().evaluate(predictions, Ytest) * 100
print("Classification accuracy on the test set =", accuracy, "%")
```

- [1] Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, Vincent, 2010
- [2] Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, Rifai, 2011
- [3] Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction, J. Masci, 2011