SHOGUN  4.1.0
SubsequenceStringKernel.h File Reference

Go to the source code of this file.

Classes

class  CSubsequenceStringKernel
class SubsequenceStringKernel that implements String Subsequence Kernel (SSK) discussed by Lodhi et. al.[1]. A subsequence is any ordered sequence of $$n$$ characters occurring in the text, though not necessarily contiguous. More formally, string $$u$$ is a subsequence of string $$s$$, iff there exists indices $$\mathbf{i}=(i_{1},\dots,i_{|u|})$$, with $$1\le i_{1} \le \cdots \le i_{|u|} \le |s|$$, such that $$u_{j}=s_{i_{j}}$$ for $$j=1,\dots,|u|$$, written as $$u=s[\mathbf{i}]$$. The feature mapping $$\phi$$ in this scenario is given by

$\phi_{u}(s)=\sum_{\mathbf{i}:u=s[\mathbf{i}]}\lambda^{l(\mathbf{i})}$

for some $$lambda\le 1$$, where $$l(\mathbf{i})$$ is the length of the subsequence in $$s$$, given by $$i_{|u|}-i_{1}+1$$. The kernel here is an inner product in the feature space generated by all subsequences of length $$n$$.

$K_{n}(s,t)=\sum_{u\in\Sigma^{n}}\langle \phi_{u}(s), \phi_{u}(t)\rangle = \sum_{u\in\Sigma^{n}}\sum_{\mathbf{i}:u=s[\mathbf{i}]} \sum_{\mathbf{j}:u=t[\mathbf{j}]}\lambda^{l(\mathbf{i})+l(\mathbf{j})}$

Since the subsequences are weighted by the exponentially decaying factor $$\lambda$$ of their full length in the text, more weight is given to those occurrences that are nearly contiguous. A direct computation is infeasible since the dimension of the feature space grows exponentially with $$n$$. The paper describes an efficient computation approach using a dynamic programming technique. More...

SHOGUN Machine Learning Toolbox - Documentation