Shogun's ML functionality is currently split into feature representations, feature preprocessors, kernels, kernel normalizers, distances, classifiers, clustering algorithms, distributions, performance evaluation measures, regression methods, and structured output learners. The following gives a brief overview of all the ML-related algorithms, classes, and methods implemented within Shogun.
Feature Representations
Shogun supports a wide range of feature representations. Among them are the so-called simple features (cf. CSimpleFeatures), which are standard 2-d matrices; strings (cf. CStringFeatures), which, in contrast to other meanings of the term, are just lists of vectors of arbitrary length; and sparse features (cf. CSparseFeatures), which efficiently represent sparse matrices.
Each of these feature objects
- Simple Features (CSimpleFeatures)
- Strings (CStringFeatures)
- Sparse Features (CSparseFeatures)
supports any of the standard types from bool to floats:
Supported Types
- bool
- 8bit char
- 8bit Byte
- 16bit Integer
- 16bit Word
- 32bit Integer
- 32bit Unsigned Integer
- 32bit Float
- 64bit Float
- 96bit Float
Many other feature types are available. Some of them are based on the three basic feature types above, like CTOPFeatures (TOP kernel features from CHMM), CFKFeatures (Fisher kernel features from CHMM) and CRealFileFeatures (vectors fetched from a binary file). It should be noted that all feature objects are derived from CFeatures.
More Complex Feature Types
- CAttributeFeatures - Features of attribute value pairs.
- CCombinedDotFeatures - Features that allow stacking of dot features.
- CCombinedFeatures - Features that allow stacking of arbitrary features.
- CDotFeatures - Features that support a certain set of operations (like multiplication with a scalar and addition to a dense vector). Examples are sparse and dense features.
- CDummyFeatures - Features without content; only the number of vectors is known.
- CExplicitSpecFeatures - Implement spectrum kernel feature space explicitly.
- CImplicitWeightedSpecFeatures - DotFeatures that implicitly implement weighted spectrum kernel features.
- CWDFeatures - DotFeatures that implicitly implement weighted degree kernel features.
In addition, labels are represented in CLabels and the alphabet of a string in CAlphabet.
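The practical difference between the dense and sparse representations above is how the feature matrix is stored. A minimal sketch in plain Python (not the Shogun API; all names here are illustrative) of the idea behind CSparseFeatures-style storage:

```python
# Illustrative sketch: dense storage keeps every entry, sparse storage keeps
# only (index, value) pairs for the nonzero entries of each example vector.

def to_sparse(matrix):
    """Convert a list of example vectors to sparse (index, value) form."""
    return [[(i, v) for i, v in enumerate(vec) if v != 0] for vec in matrix]

def sparse_dot(sparse_vec, dense_vec):
    """Dot product that touches only the nonzero entries."""
    return sum(v * dense_vec[i] for i, v in sparse_vec)

# Two 4-dimensional example vectors, mostly zero.
dense = [[0.0, 2.0, 0.0, 1.0],
         [3.0, 0.0, 0.0, 0.0]]
sparse = to_sparse(dense)
w = [1.0, 1.0, 1.0, 1.0]
```

For data where most entries are zero, this cuts both memory and the cost of the dot products that the CDotFeatures interface is built around.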
Preprocessors
The aforementioned features can be preprocessed on the fly, e.g. to subtract the mean or to normalize vectors to norm 1. The following preprocessors are implemented:
- CNormOne - Normalizes vectors to norm 1.
- CLogPlusOne - Adds 1 and applies log().
- CPCACut - Keeps eigenvectors with the highest eigenvalues.
- CPruneVarSubMean - Removes dimensions with little variance, subtracting the mean.
- CSortUlongString - Sorts vectors.
- CSortWordString - Sorts vectors.
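The two simplest preprocessors above can be sketched in a few lines of plain Python (illustrative only, not the Shogun API):

```python
import math

def norm_one(vec):
    """CNormOne idea: scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def log_plus_one(vec):
    """CLogPlusOne idea: apply log(1 + x) elementwise."""
    return [math.log(1.0 + x) for x in vec]
```

Applying such a transform on the fly means the stored features stay untouched; only the vectors handed to a kernel or classifier are modified.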
Classifiers
A multitude of classifiers are implemented in Shogun. Among them are several standard 2-class classifiers, 1-class classifiers and multi-class classifiers. Several of them are linear classifiers and SVMs. Among the fastest linear SVM classifiers are CSGD, CSVMOcas and CLibLinear, all capable of dealing with millions of examples and features.
Linear Classifiers
- CPerceptron - Standard online perceptron
- CLDA - Fisher's linear discriminant
- CLPM - Linear programming machine (1-norm regularized SVM)
- CLPBoost - Linear programming machine using boosting on the features
- CSVMPerf - A linear SVM with L2-regularized bias
- CLibLinear - A linear SVM with L2-regularized bias
- CSVMLin - A linear SVM with L2-regularized bias
- CSVMOcas - A linear SVM with L2-regularized bias
- CSubgradientSVM - SVM based on steepest subgradient descent
- CSubgradientLPM - LPM based on steepest subgradient descent
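The simplest entry in this list, the online perceptron, can be sketched in plain Python (illustrative only, not CPerceptron's actual interface): on every misclassified example, the weight vector is nudged toward that example.

```python
def train_perceptron(X, y, epochs=10, lr=1.0):
    """Online perceptron: on a mistake, update w <- w + lr * y_i * x_i
    and b <- b + lr * y_i. Labels are +1/-1."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

# Linearly separable toy data.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

On linearly separable data such as this, the update rule is guaranteed to converge to a separating hyperplane.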
Support Vector Machines
- CSVMLight - A variant of SVMlight using pr_loqo as its internal solver.
- CLibSVM - LibSVM modified to use Shogun's kernel framework.
- CMPDSVM - Minimal Primal Dual SVM
- CGPBTSVM - Gradient Projection Technique SVM
- CWDSVMOcas - CSVMOcas based SVM using explicitly spanned WD-Kernel feature space
- CGMNPSVM - A true multiclass one vs. rest SVM
- CGNPPSVM - SVM solver based on the generalized nearest point problem
- CMCSVM - An experimental multiclass SVM
- CLibSVMMultiClass - LibSVM's one vs. one multiclass SVM solver
- CLibSVMOneClass - LibSVM's one-class SVM
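All of these kernel SVMs share the same form of decision function; the solvers differ in how they find the coefficients and bias. A plain-Python sketch of that shared form (names are illustrative, not the Shogun API):

```python
def linear_kernel(u, v):
    """Plain dot product, standing in for any kernel."""
    return sum(a * b for a, b in zip(u, v))

def svm_decision(support_vectors, alphas, labels, b, kernel, x):
    """Decision function shared by kernel SVMs:
    f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * yi * kernel(sv, x)
               for sv, a, yi in zip(support_vectors, alphas, labels)) + b

# Hand-picked support vectors and coefficients, purely for illustration.
f = svm_decision([[1.0, 0.0], [-1.0, 0.0]], [1.0, 1.0], [1, -1],
                 0.0, linear_kernel, [2.0, 0.0])
```

The sign of f(x) gives the predicted class; its magnitude can serve as a confidence score for the evaluation measures listed further below.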
Distance Machines
- CKNN - Standard k-Nearest Neighbor classifier
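k-NN classifies a point by a majority vote among its k closest training examples under some distance (see the Distances section below). A minimal plain-Python sketch (illustrative, not the CKNN interface):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Vote among the k nearest training points, by squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated clusters of toy points.
train_X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
train_y = [0, 0, 1, 1]
```

Since training is just storing the data, all the work happens at prediction time, which is why distances (rather than kernels) are the pluggable component here.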
Regression
Support Vector Regression
- CSVRLight - SVMLight based SVR
- CLibSVR - LIBSVM based SVR
Others
- CKRR - Kernel Ridge Regression
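Kernel ridge regression has a closed-form solution: the expansion coefficients solve (K + tau*I) alpha = y, where K is the kernel matrix and tau the regularization strength. A small plain-Python sketch of this (illustrative only, not the CKRR interface):

```python
def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting (illustration only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def krr_fit(K, y, tau=1e-3):
    """Solve (K + tau*I) alpha = y for the expansion coefficients."""
    n = len(K)
    A = [[K[i][j] + (tau if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, y)

# Toy data: 1-d inputs, linear kernel k(a, b) = a*b, targets y = 2*x.
x = [1.0, 2.0, 3.0]
K = [[a * b for b in x] for a in x]
alpha = krr_fit(K, [2.0, 4.0, 6.0])
pred = sum(a * xi * 4.0 for a, xi in zip(alpha, x))  # predict at x = 4
```

The prediction at a new point is the kernel expansion sum_i alpha_i * k(x_i, x); with the linear kernel above it recovers the underlying line up to the small regularization bias.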
Distributions
- CHMM - Hidden Markov Models
- CHistogram - Histogram
- CLinearHMM - Markov chains (embedded in "Linear" HMMs)
Clustering
- CHierarchical - Agglomerative hierarchical single linkage clustering.
- CKMeans - k-Means Clustering
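k-means alternates between assigning each point to its nearest center and recomputing each center as the mean of its assigned points (Lloyd's algorithm). A plain-Python sketch on 1-d data (illustrative, not the CKMeans interface):

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm on 1-d data: alternate nearest-center assignment
    and mean recomputation until the centers settle."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious clusters around 1.0 and 8.0.
centers = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], [0.0, 10.0])
```

Hierarchical clustering (CHierarchical) instead starts from singleton clusters and repeatedly merges the closest pair, so it needs a distance but no fixed k.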
Multiple Kernel Learning
- CMKLRegression - q-norm MKL for regression
- CMKLOneClass - q-norm 1-class MKL
- CMKLClassification - q-norm 2-class MKL
- CGMNPMKL - 1-norm multi-class MKL
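The core idea behind all of these solvers is that the learned kernel is a weighted combination of base kernels. A plain-Python sketch of just the combination step (illustrative; the solvers above additionally optimize the weights jointly with the SVM):

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def constant_kernel(u, v):
    return 1.0

def combined_kernel(kernels, betas, x, y):
    """MKL combination: k(x, y) = sum_m beta_m * k_m(x, y).
    Here the weights beta are fixed by hand for illustration."""
    return sum(beta * kern(x, y) for kern, beta in zip(kernels, betas))

val = combined_kernel([linear_kernel, constant_kernel], [0.5, 0.5],
                      [1.0, 2.0], [3.0, 4.0])
```

The q-norm constraint on the weights controls how sparse the learned combination is: 1-norm MKL tends to select few kernels, larger q spreads weight across many.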
Kernels
- CAUCKernel - To maximize AUC in SVM training (takes a kernel as input)
- CChi2Kernel - Chi^2 Kernel
- CCombinedKernel - Combined kernel to work with multiple kernels
- CCommUlongStringKernel - Spectrum Kernel with spectrums of up to 64bit
- CCommWordStringKernel - Spectrum kernel with spectrum of up to 16 bit
- CConstKernel - A "kernel" returning a constant
- CCustomKernel - A user supplied custom kernel
- CDiagKernel - A kernel with nonzero elements only on the diagonal
- CDistanceKernel - A transformation to transform distances into similarities
- CFixedDegreeStringKernel - A string kernel
- CGaussianKernel - The standard Gaussian kernel
- CGaussianShiftKernel - Gaussian kernel with shift (inspired by the Weighted Degree shift kernel)
- CGaussianShortRealKernel - Gaussian Kernel on 32bit Floats
- CHistogramWordStringKernel - A TOP kernel on Sequences
- CLinearByteKernel - Linear Kernel on Bytes
- CLinearKernel - Linear Kernel
- CLinearStringKernel - Linear Kernel on Strings
- CLinearWordKernel - Linear Kernel on Words
- CLocalAlignmentStringKernel - The local alignment kernel
- CLocalityImprovedStringKernel - The locality improved kernel
- CMatchWordStringKernel - Another String kernel
- COligoStringKernel - The oligo string kernel
- CPolyKernel - The polynomial kernel
- CPolyMatchStringKernel - Polynomial kernel on strings
- CPolyMatchWordStringKernel - Polynomial kernel on word strings
- CPyramidChi2 - Pyramid Chi^2 kernel (from image analysis)
- CRegulatoryModulesStringKernel - Regulatory modules string kernel
- CSalzbergWordStringKernel - Salzberg-features-based string kernel
- CSigmoidKernel - Tanh sigmoidal kernel
- CSimpleLocalityImprovedStringKernel - A variant of the locality improved kernel
- CSparseGaussianKernel - Gaussian Kernel on sparse features
- CSparseLinearKernel - Linear Kernel on sparse features
- CSparsePolyKernel - Polynomial Kernel on sparse features
- CTensorProductPairKernel - The Tensor Product Pair Kernel (TPPK)
- CWeightedCommWordStringKernel - A weighted (or blended) spectrum kernel
- CWeightedDegreePositionStringKernel - Weighted Degree kernel with shift
- CWeightedDegreeStringKernel - Weighted Degree string kernel
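Two representative entries from this list, one vectorial and one string kernel, can be sketched in plain Python (illustrative; the exact parameterization of Shogun's classes may differ, e.g. the width convention of the Gaussian kernel):

```python
import math
from collections import Counter

def gaussian_kernel(x, y, width=1.0):
    """Gaussian/RBF kernel, one common convention:
    k(x, y) = exp(-||x - y||^2 / width)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / width)

def spectrum_kernel(s, t, k=2):
    """Spectrum-kernel idea (cf. CCommWordStringKernel): the inner product
    of the two strings' k-mer count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)
```

The string kernels in the list are mostly refinements of the spectrum idea: weighting k-mers by length (weighted spectrum), requiring matches at fixed positions (weighted degree), or allowing positional shifts.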
Kernel Normalizers
Since several of the kernels pose numerical challenges to SVM optimizers, kernels can be "normalized", for example to have ones on the diagonal.
- CSqrtDiagKernelNormalizer - Divides the kernel by the square root of the product of the diagonal entries
- CAvgDiagKernelNormalizer - Divides by the average diagonal value
- CFirstElementKernelNormalizer - Divides by the first kernel element k(0,0)
- CIdentityKernelNormalizer - No normalization
- CDiceKernelNormalizer - Normalization inspired by the Dice coefficient
- CRidgeKernelNormalizer - Adds a ridge on the kernel diagonal
- CTanimotoKernelNormalizer - Tanimoto-coefficient-inspired normalizer
- CVarianceKernelNormalizer - Normalizes vectors in feature space to norm 1
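The first normalizer in the list is the most common one; a plain-Python sketch of what it computes (illustrative, not the Shogun API):

```python
import math

def sqrt_diag_normalize(K):
    """CSqrtDiagKernelNormalizer idea: k'(i,j) = k(i,j) / sqrt(k(i,i)*k(j,j)),
    so the normalized kernel matrix has ones on the diagonal."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    return [[K[i][j] / (d[i] * d[j]) for j in range(len(K))]
            for i in range(len(K))]

K = [[4.0, 2.0], [2.0, 9.0]]
Kn = sqrt_diag_normalize(K)
```

Geometrically this rescales every example to unit norm in feature space, which keeps kernel values in a comparable range and tends to make SVM training better conditioned.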
Distances
Distance measures quantify the distance between objects. They can be used in CDistanceMachine subclasses like CKNN. The following distances are implemented:
- CBrayCurtisDistance - Bray-Curtis distance
- CCanberraMetric - Canberra metric
- CChebyshewMetric - Chebyshev metric
- CChiSquareDistance - Chi^2 distance
- CCosineDistance - Cosine distance
- CEuclidianDistance - Euclidean distance
- CGeodesicMetric - Geodesic metric
- CHammingWordDistance - Hamming distance
- CJensenMetric - Jensen metric
- CManhattanMetric - Manhattan metric
- CMinkowskiMetric - Minkowski metric
- CTanimotoDistance - Tanimoto distance
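Several of these are special cases of one family: the Minkowski metric covers Manhattan (p=1) and Euclidean (p=2) distances, and approaches the Chebyshev metric as p grows. A plain-Python sketch (illustrative, not the Shogun API):

```python
def minkowski(x, y, p):
    """Minkowski metric: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Chebyshev metric: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```

The choice of distance changes which neighbors a CDistanceMachine like CKNN considers "close", so it is as important a modeling decision as the kernel is for an SVM.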
Evaluation
Performance Measures
Performance measures assess the quality of a prediction and are implemented in CPerformanceMeasures. The following measures are implemented:
- Receiver Operating Characteristic curve (ROC)
- Area under the ROC curve (auROC)
- Area over the ROC curve (aoROC)
- Precision Recall Curve (PRC)
- Area under the PRC (auPRC)
- Area over the PRC (aoPRC)
- Detection Error Tradeoff (DET)
- Area under the DET (auDET)
- Area over the DET (aoDET)
- Cross Correlation coefficient (CC)
- Weighted Relative Accuracy (WRAcc)
- Balanced Error (BAL)
- F-Measure
- Accuracy
- Error
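Two of these measures can be sketched in plain Python to make their definitions concrete (illustrative only, not the CPerformanceMeasures interface). The area under the ROC curve is computed here via its rank (Wilcoxon-Mann-Whitney) form rather than by integrating the curve:

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(y_true, scores):
    """auROC via its rank form: the fraction of (positive, negative) pairs
    that the classifier ranks correctly, counting ties as half.
    Labels are +1/-1; scores are real-valued classifier outputs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The "area over" variants listed above are simply one minus the corresponding "area under" value, and Error is one minus Accuracy.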