An RBM is an **energy based probabilistic model**. It consists of two groups of variables: the **visible variables** $ v $ and the **hidden variables** $ h $. The key assumption that RBMs make is that the hidden units are conditionally independent given the visible units, and vice versa.

The RBM defines its distribution through an **energy function** $E(v,h)$, which is a function that assigns a number (called energy) to each possible state of the visible and hidden variables. The probability distribution is defined as:

$$ P(v,h) := \frac{\exp(-E(v,h))}{Z} , \qquad Z = \sum_{v,h} \exp(-E(v,h))$$

where $Z$ is a constant that makes sure that the distribution sums/integrates to $1$. This distribution is also called a Gibbs distribution and $Z$ is sometimes called the **partition function**.

From the definition of $P(v,h)$ we can see that the probability of a configuration increases as its energy decreases. Training an RBM in an unsupervised manner involves manipulating the energy function so that it would assign low energy (and therefore high probability) to values of $v$ that are similar to the training data, and high energy to values that are far from the training data.

For an RBM with binary visible and hidden variables the energy function is defined as:

$ E(v,h) = -\sum_i \sum_j h_i W_{ij} v_j - \sum_i h_i c_i - \sum_j v_j b_j $

where $b \in \mathbb{R^n} $ is the bias for the visible units, $c \in \mathbb{R^m}$ is the bias for hidden units and $ W \in \mathbb{R^{mxn}}$ is the weight matrix between the hidden units and the visible units.

Plugging that definition into the definition of the probability distribution will yield the following conditional distributions for each of the hidden and visible variables:

$$ P(h=1|v) = \frac{1}{1+exp(-Wv-c)}, \quad P(v=1|h) = \frac{1}{1+exp(-W^T h-b)} $$

We can do a quick visualization of an RBM:

In []:

```
%pylab inline
%matplotlib inline
import networkx as nx
G = nx.Graph()
pos = {}
for i in range(8):
pos['V'+str(i)] = (i,0)
pos['H'+str(i)] = (i,1)
for j in range(8): G.add_edge('V'+str(j),'H'+str(i))
figure(figsize=(7,2))
nx.draw(G, pos, node_color='y', node_size=750)
```

If an RBM is properly trained, the hidden units learn to extract useful features from training data. An obvious way to go further would be transform the training data using the trained RBM, and train yet another RBM on the transformed data. The second RBM will learn to extract useful features from the features that the first RBM extracts. The process can be repeated to add a third RBM, and so.

When stacked on top of each other, those RBMs form a Deep Belief Network [1]. The network has directed connections going from the units in each layer to units in the layer below it. The connections between the top layer and the layer below it are undirected. The process of stacking RBMs to form a DBN is called pre-training the DBN.

After pre-training, the DBN can be used to initialize a similarly structured neural network which can be used for supervised classification.

We can do a visualization of a 4-layer DBN:

In []:

```
G = nx.DiGraph()
pos = {}
for i in range(8):
pos['V'+str(i)] = (i,0)
pos['H'+str(i)] = (i,1)
pos['P'+str(i)] = (i,2)
pos['Q'+str(i)] = (i,3)
for j in range(8):
G.add_edge('H'+str(j),'V'+str(i))
G.add_edge('P'+str(j),'H'+str(i))
G.add_edge('Q'+str(j),'P'+str(i))
G.add_edge('P'+str(j),'Q'+str(i))
figure(figsize=(5,5))
nx.draw(G, pos, node_color='y', node_size=750)
```

RBMs in Shogun are handled through the CRBM class. We create one by specifying the number of visible units and their type (binary, Gaussian, and Softmax visible units are supported), and the number of hidden units (only binary hidden units are supported).

In this notebook we'll train a few RBMs on the USPS dataset for handwritten digits. We'll have one RBM for each digit class, making 10 RBMs in total:

In []:

```
from modshogun import RBM, RBMVUT_BINARY, Math
# initialize the random number generator with a fixed seed, for repeatability
Math.init_random(10)
rbms = []
for i in range(10):
rbms.append(RBM(25, 256, RBMVUT_BINARY)) # 25 hidden units, 256 visible units (one for each pixel in a 16x16 binary image)
rbms[i].initialize()
```

Next we'll load the USPS dataset:

In []:

```
from scipy.io import loadmat
dataset = loadmat('../../../data/multiclass/usps.mat')
Xall = dataset['data']
# the usps dataset has the digits labeled from 1 to 10
# we'll subtract 1 to make them in the 0-9 range instead
Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1
```

In []:

```
from modshogun import RealFeatures, RBMMM_PSEUDO_LIKELIHOOD
# uncomment this line to allow the training progress to be printed on the console
#from modshogun import MSG_INFO; rbms[0].io.set_loglevel(MSG_INFO)
for i in range(10):
# obtain the data for digit i
X_i = Xall[:,Yall==i]
# binarize the data for use with the RBM
X_i = (X_i>0).astype(float64)
# set the number of contrastive divergence steps
rbms[i].cd_num_steps = 5
# set the gradient descent parameters
rbms[i].gd_learning_rate = 0.005
rbms[i].gd_mini_batch_size = 100
rbms[i].max_num_epochs = 30
# set the monitoring method to pseudo-likelihood
rbms[i].monitoring_method = RBMMM_PSEUDO_LIKELIHOOD
# start training
rbms[i].train(RealFeatures(X_i))
```

In []:

```
samples = zeros((256,100))
for i in range(10):
# initialize the sampling chain with a random state for the visible units
rbms[i].reset_chain()
# run 10 chains for a 1000 steps to obtain the samples
samples[:,i*10:i*10+10] = rbms[i].sample_group(0, 1000, 10)
# plot the samples
figure(figsize=(7,7))
for i in range(100):
ax=subplot(10,10,i+1)
ax.imshow(samples[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax.set_xticks([])
ax.set_yticks([])
```

Now we'll create a DBN, pre-train it on the digits dataset, and use it initialize a neural network which we can use for classification.

DBNs are handled in Shogun through the CDeepBeliefNetwork class. We create a network by specifying the number of visible units it has, and then add the desired number of hidden layers using add_hidden_layer(). When done, we call initialize() to initialize the network:

In []:

```
from modshogun import DeepBeliefNetwork
dbn = DeepBeliefNetwork(256) # 256 visible units
dbn.add_hidden_layer(200) # 200 units in the first hidden layer
dbn.add_hidden_layer(300) # 300 units in the second hidden layer
dbn.initialize()
```

In []:

```
# take 3000 examples for training, the rest for testing
Xtrain = Xall[:,0:3000]
Ytrain = Yall[0:3000]
Xtest = Xall[:,3000:-1]
Ytest = Yall[3000:-1]
# set the number of contrastive divergence steps
dbn.pt_cd_num_steps.set_const(5)
# set the gradient descent parameters
dbn.pt_gd_learning_rate.set_const(0.01)
dbn.pt_gd_mini_batch_size.set_const(100)
dbn.pt_max_num_epochs.set_const(30)
# binarize the data and start pre-training
dbn.pre_train(RealFeatures((Xtrain>0).astype(float64)))
```

In []:

```
# obtain the weights of the first hidden layer
w1 = dbn.get_weights(0)
# plot the weights between the first 100 units in the hidden layer and the visible units
figure(figsize=(7,7))
for i in range(100):
ax1=subplot(10,10,i+1)
ax1.imshow(w1[i,:].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
ax1.set_xticks([])
ax1.set_yticks([])
```

In []:

```
from modshogun import NeuralSoftmaxLayer, MulticlassLabels
# get the neural network
nn = dbn.convert_to_neural_network(NeuralSoftmaxLayer(10))
# add some L2 regularization
nn.l2_coefficient = 0.0001
# start training
nn.set_labels(MulticlassLabels(Ytrain))
_ = nn.train(RealFeatures(Xtrain))
```

And finally we'll measure the classification accuracy on the test set:

In []:

```
from modshogun import MulticlassAccuracy
predictions = nn.apply_multiclass(RealFeatures(Xtest))
accuracy = MulticlassAccuracy().evaluate(predictions, MulticlassLabels(Ytest)) * 100
print "Classification accuracy on the test set =", accuracy, "%"
```