The most common and the most widely used representation of document collections is the Bag of Words model.The BoW representation considers each document as a collection of tokens. A token is defined as the minimum arbitrary sequence of characters that can be considered as atomic. Tokens depending on the choice of the user can be either whole words or n-grams, etc. A simple approach is to consider every possible token in your document collection as a different dimension in a feature space and then vectorize each document by assigning 1 to every dimension that corresponds to a token contained in that document and 0 to every other. This is the simpler approach and is known as the Boolean Vector Space Model. Another approach is, instead of using {1,0}, to use a weight for every term, like their frequencies, but this is out of our scope.

Let's now consider a document collection consisting of the following documents:

**Document 1** = "This is the first document"

**Document 2** = "Document classification: Introduction"

**Document 3** = "A third document about classification"

and suppose that we consider as tokens, every word that appears in the collection.

Then we would arrive to the following representation:

Document 1 | Document 2 | Document 3 | |

[0] this | 1 | 0 | 0 |

[1] is | 1 | 0 | 0 |

[2] the | 1 | 0 | 0 |

[3] first | 1 | 0 | 0 |

[4] document | 1 | 1 | 1 |

[5] classification | 0 | 1 | 1 |

[6] introduction | 0 | 1 | 0 |

[7] a | 0 | 0 | 1 |

[8] third | 0 | 0 | 1 |

[9] about | 0 | 0 | 1 |

- First of all, the dimensionality of the feature space (or else, the number of distinct tokens in the collection) is usually
**not known beforehand**and requires a pass over the entire dataset to be calculated. This can make the creation of the document-term matrix trickier, especially in online learning scenarios. - Secondly, this dimensionality can grow to be
**very large**; imagine considering as tokens the set of every possible word, or every possible 8-gram. - Thirdly, this approach requires an
**additional dictionary**in order to work that maps every token to its appropriate dimension, for instance every appearance of 'this' above should always be mapped to dimension 0, and this mapping data structure will also grow to be**very large.**

This way:

- the dimensionality of the target feature space is now
**known beforehand**, as it's simply the number of all possible outcomes of the hash functions, restricted to a specific range [0, n-1] by using the modulo function. No extra passes over the dataset are needed this way. - In addition, the dimension can now be
**restricted to a certain size**that is convenient depending on the limitations of the available hardware - The dictionary is also
**not needed**anymore since each token is directly mapped to the appropriate dimension through the outcome of the hash function.

The above approach will result in a hashed document-term matrix, similar to the one shown above, of a pre-defined dimension size (number of rows).

Each token will now be assigned to a random dimension, although always the same for the same token, that is determined from the hash function.

This will probably incure some collisions, but if the hash functions is good enough and our choice for the dimension size is sufficiently large it will not be a problem. Especially if we are considering large collections.

After the creation of the hashed document-term matrix, the intuition is to pass it as a dataset to a linear classifier that will take care of the rest!

The response to that is to read our collection as it is and compute the hash of every token only when it's required, on-the-fly.

First of all we import the required components from the modshogun library.

In [1]:

```
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from modshogun import StringCharFeatures, RAWBYTE, HashedDocDotFeatures, NGramTokenizer
```

RAWBYTE is simply an enum that specifies that every possible character can be found in the collection.

The HashedDocDotFeatures is where all the magic happens! This class is responsible for encapsulating the document collection and providing the calculation of the hashed representation of each document whenever it is required.

The NGramTokenizer is the tokenizer we will use on our collection. Its job is to parse a document and create the tokens. As its name suggests, it creates a sequence of ngrams. Another option would be the DelimiterTokenizer, which allows the user to specify the delimiter that will be used in order to create the tokens.

Suppose now that we have the following documents:

In [2]:

```
doc_1 = "this is the first document"
doc_2 = "document classification introduction"
doc_3 = "a third document about classification"
document_collection = [doc_1, doc_2, doc_3]
```

In [3]:

```
from modshogun import BinaryLabels
from numpy import array
labels = BinaryLabels(array([-1, 1, 1]))
```

We now create our document collection:

In [4]:

```
string_features = StringCharFeatures(document_collection, RAWBYTE)
```

In [5]:

```
hash_size = 8
tokenizer = NGramTokenizer(5)
normalize = True
hashed_feats = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize)
```

The hashed_feats object now has all the required parameters and knows how to communicate and provide the hashed representation **only** when it is needed by the various algorithms of Shogun.

So how do we proceed to actually learn a model that can classify documents?

We said before that the idea after we have taken care of the hashing is to use a linear classifier.

Shogun has many linear classifiers available, but for the moment we will consider SVMOcas and we will see how to use it:

In [6]:

```
from modshogun import SVMOcas
C = 0.1
epsilon = 0.01
svm = SVMOcas(C, hashed_feats, labels)
svm.set_epsilon(epsilon)
```

Now, let's train our svm:

In [7]:

```
_=svm.train()
```

Now that we have a model ready, we would like to see how well it does. Let's see what it predicts for our training data:

In [8]:

```
predicted_labels = svm.apply()
print (predicted_labels.get_labels())
```

We can see that it misclassified the first document. This has to do with the nature of our overly-simplified toy dataset which doesn't provide enough information. However, another option of the HashedDocDotFeatures class will allow us to extract some more information from the same dataset!

For example, if we have a dataset consisting of the following example vector:

A | B | C | |

x: | 2 | 3 | 1 |

A | B | C | AA | AB | AC | BB | BC | CC | |

x: | 2 | 3 | 1 | 2*2=4 | 2*3=6 | 2*1=2 | 3*3=9 | 3*1=3 | 1*1=1 |

The idea behind the quadratic features is very simple and can be easily implemented for numerical features. However in the case of text collections, where we have tokens instead of numbers, things are not so straightforward.

The approach we have selected in Shogun for the text collections does not compute all the quadratic features, but only a subset of them defined using a k-skip n-grams rule. N-grams in this context refer to a collection of up to n consecutive tokens and should not be confused with our choice for what consists a token, where it can be whole words or ngrams themselves. The k-skip in front allows us to skip up to k tokens when we group them in up-to n sized groups. Although this is a simple idea it can be a bit confusing when introduced through a definition, so let's have a look at an example: Suppose we have the following

Tokens : ["a", "b", "c", "d"]

and that we want to consider a 2-skip 2-grams approach. This means that we can combine up to 2 tokens (that is a single token or two tokens) while skipping up to 2 tokens between them (that means we can skip 0,1 and 2 tokens). Therefore, we obtain the following combinations:

["a" (0-skips, 1-gram), "ab" (0-skips, 2-gram), "ac" (1-skip 2-gram), "ad" (2-skips, 2-gram), "b" (0-skips, 1-gram), "bc" (0-skips, 2-gram), "bd" (1-skip, 2-gram), "c" (0-skip, 1-gram), "cd" (0-skip, 2-gram), "d" (0-skip, 1-gram)]

These newly created tokens are then treated like our regular ones and hashed to find their appropriate dimension.

As discussed above this idea may enhance our information on the dataset, but it also requires more time to be computed. So if used, the choice of k and n should be kept to small numbers.

In order to use the k-skip n-gram approach explained above, one only has to specify his choice for k and n to the HashedDocDotFeatures object he is creating.

So, by keeping our previous choice of parameters intact and introducing the k and n we obtaing the following piece of code :

In [9]:

```
k = 3 # number of tokens, up to which we allow it to skip
n = 3 # number of tokens, up to which we allow it to combine
hashed_feats_quad = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize, n, k)
```

Let's see if there was any improvement in our predicting capabilities.

By keeping the same svm settings, we will train on the new HashedDocDotFeatures object we have just created. Although we have the same document collection underneath, we are now considering a different representation model because of the quadratic features.

In [10]:

```
svm.set_features(hashed_feats_quad)
svm.train()
predicted_labels = svm.apply()
print(predicted_labels.get_labels())
```

Better!

**Attention!** Some clarifications now will follow on the Quadratic Features to clear any misunderstandings:

- The term "ngram" can come up on two different levels.

The first one is in the tokenization process of the documents, where we have a raw stream of characters and our job there is to split the document on sequences of characters that are considered atomic. For instance, we may decide that sequences of three characters are the basic units which make up a document and therefore we are considering 3-grams (for tokens).

The second level, is when we are combining the tokens generated in the previous phase to create quadratic features, by following the aforementioned k-skip n-grams rule. Now we are not considering anymore the document as a sequence of characters, but rather as a sequence of tokens. And it is on those tokens that we are now applying the rule. Therefore, if we decide to apply a 0-skip 3-grams rule then we may combine up to 3 tokens (not characters). In conclusion, the way the quadratic features work is completely independent of the Tokenizer used on the documents. We could have chosen a DelimiterTokenizer and still consider ngrams (of tokens) on the quadratic level. - The "quadratic" in front of "quadratic features" also refers to the way the computing time increases when we select this approach! So it should only be used when it is necessary.

The example that was demonstrated before was very small and simple and was manipulated for demonstration purposes, to show the capabilities of the HashedDocDotFeatures class.

We demonstrate results when compared against SVMLight, using the Spectrum Kernel.

We consider the following settings:

number of bits in hash = 16, 8-grams for tokens, SVM epsilon = 0.01.

In [11]:

```
from pylab import *
# HashedDocDotFeatures results
hashed_training_examples = [5000, 10000, 15000, 20000, 25000, 30000, 50000, 100000]
# For C=1
hashed_C_1_sec = [2682.750000,5202.690000,8120.460000,10846.410000,13944.200000,17016.840000,30496.720000,66302.950000]
hashed_C_1_roc = [0.980730,0.986382,0.988894,0.990666,0.991602,0.991957,0.993680,0.995184]
# For C=0.1
hashed_C_01_sec = [1074.130000,2142.390000,3434.710000,4641.380000,5984.530000,7206.040000,12864.270000,28393.540000]
hashed_C_01_roc = [0.976560,0.982660,0.985251,0.987380,0.988368,0.989022,0.990950,0.993197]
# Spectrum kernel results
kernel_training_examples = [5000, 10000, 15000, 20000, 25000]
# For C=1
kernel_C_1_sec = [2912.410000,6543.220000,10840.550000,16108.360000,19899.610000]
kernel_C_1_roc = [0.971284,0.976628,0.979715,0.982084,0.984355]
# For C=0.1
kernel_C_01_sec = [1441.380000,3261.870000,5071.040000,7568.130000,10436.430000]
kernel_C_01_roc = [0.946308,0.955245,0.961576,0.965204,0.968264]
figure(figsize=(12,6))
subplot(1,2,1)
plot(hashed_training_examples, hashed_C_1_sec, 'b')
plot(kernel_training_examples, kernel_C_1_sec, 'r')
title("Time comparison for C=1")
xlabel("Number of examples")
ylabel("Time in seconds")
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)
subplot(1,2,2)
plot(hashed_training_examples, hashed_C_1_roc, 'b')
plot(kernel_training_examples, kernel_C_1_roc, 'r')
title("Area under ROC comparison for C=1")
xlabel("Number of examples")
ylabel("auROC")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)
```

In [12]:

```
clf
figure(figsize=(12,6))
subplot(1,2,1)
plot(hashed_training_examples, hashed_C_01_sec, 'b')
plot(kernel_training_examples, kernel_C_01_sec, 'r')
title("Time comparison for C=0.1")
xlabel("Number of examples")
ylabel("Time in seconds")
ylim((0,70000))
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)
subplot(1,2,2)
plot(hashed_training_examples, hashed_C_01_roc, 'b')
plot(kernel_training_examples, kernel_C_01_roc, 'r')
title("Area under ROC comparison for C=0.1")
xlabel("Number of examples")
ylabel("auROC")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)
```

Another experiment we conducted was to create a language detection application. For that purpose we created a dataset consisting of 10,000 documents for 5 different languages; English, Greek, German, Italian and Spanish.

The demo uses underneath it a svm model trained on 30,000 documents with Ngrams of size 4, bits of hash size = 18, quadratic features with options: 2-skips 3-grams and svm C = 0.1. The demo can be found here and you can test it on your own how well it does!

The hashing trick has also been implemented in Shogun for the DenseFeatures and SparseFeatures classes, as HashedDenseFeatures and HashedSparseFeatures classes. These classes also support quadratic features and provide the option to maintain or drop the linear terms.

For online algorithms or for cases when the data do not fit in memory there exist similar classes with the prefix "Streaming" that support reading examples from the disk and providing them to some algorithms one at a time. The classes specifically are StreamingHashedDocDotFeatures, StreamingHashedDenseFeatures and StreamingHashedSparseFeatures. If one has mixed features, that are not just numerical or not just text, then he can use the CombinedDotFeatures class to combine objects of the aforementioned classes!

Another option is to use the Vw* algorithms and the VwExample class that require the input to be in vw format and are a bit trickier to use.

In addition, a HashedDocConverter class exists that allows the user to get the hashed BoW representation of a document collection as a CSparseFeatures object.

- Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, Alex Smola : Feature Hashing for Large Scale Multitask Learning
- Olivier Chapelle, Eren Manavoglu, Romer Rosales : Simple and scalable response prediction for display advertising
- S̈oren Sonnenburg, Gunnar R̈atsch, Konrad Rieck : Large Scale Learning with String Kernels Vowpal Wabbit
- A post in metaoptimize </ol>