Improving accessibility to shogun

Various usability improvements
Mentor: Soeren Sonnenburg Difficulty: Easy to Difficult Requirements: C++ programming, python, swig, some machine learning background
Currently Shogun has a number of rough edges, which polishing would make it more accessible to others. The students task would be to
 Implement customizable SWIG interface classes (so called director classes [1,2]) making it possible to extend certain class methods using some interface language, e.g. for fast prototyping one could implement a kernel in python (or java or octave), which can be used by any of shogun's SVMs. Other usecases of such director classes include: using python classifiers as shogun machines with shogun's pythonmodular interface, prototyping new preprocessors, implementing features with database support with some clue code and various more.
 Implement a better python integration, i.e. for crossvalidation it is currently very complex to specify the parameter settings to try out. In addition, python since version 2.6 supports the python buffer protocol [3], which when implemented for e.g. shoguns CSimpleFeatures or Labels would enable accessing shogun objects like python numpy arrays (to a certain extend). python array protocol. Then we could mention various other known
 The octave modular interface currently has no support for sparse matrices. The task would be to implement these in the swig_typemaps.i based on the python_modular typemaps (file swig_typemaps.i in interfaces/python_modular and interfaces/octave_modular)
 Various code contains ASSERT()'s which should be replaced with more understandable SG_ERROR() messages
 The current implementation of SGVector [4] should be converted to using reference counts.
 Various minor usability improvements.
This task offers good opportunities to see various aspects of shogun. A particular (fun!) challenge is the rather complex swig crosslanguage development process.
[1] http://www.swig.org/Doc2.0/Python.html#Python_nn34 [2] www.swig.org/Doc2.0/Python.html#Python_nn34 [3] http://www.python.org/dev/peps/pep3118/ [4] http://www.shoguntoolbox.org/doc/en/current/classshogun_1_1SGVector.html

Mentor: Cheng Soon Ong
Difficulty: Medium Requirements: C++ programming, python, (optional django)
The machine learning data set repository mldata.org provides data and task downloads in a standardized format. Make shogun to natively be able to read these data formats and interact with the mldata.org website, i.e., by
 read and write support for the hdf5 based data sets and tasks (see http://mldata.org/about/hdf5/)
 contribute features to mldata.org for commandline based up/downloading of data,task,methods,solutions
 contribute features to mloss.org for commandline based up/downloading of methods and other community features
Note that the optional part of this task aims to help the machine learning community portals for which python and django skills are needed (both portal website's source code are also open source).

Reading and writing of shogun features / objects in .hdf5 (pytables/matlab 7.0/matlab 7.X/octave 3.x/h5py/arff) format
Mentor: Soeren Sonnenburg Difficulty: Easy to Medium Requirements: C++ programming
Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python,r,octave,matlab,... sometimes providing better hdf5 support than any of the native implementions. The task would be to extend hdf5 support by implementing
 varialbe length data types support
 symbolic links
 compression support
 slicing support
 attribute support
Extensive information about hdf5 can be found in its reference manual.
Matlab 7 file format specifications are publicly available online. In addtion Octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that allows for reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/lsmat4.cc and loadsave.cc).
Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files lsoctascii.{cc,h} and lsoctbinary.{cc,h}. So this task involves a lot of code borrowing.

Framework Improvements

Integrate Existing and Implement New Machine Learning Algorithms

Integrate C5.0
Mentor: Soeren Sonnenburg Difficulty: Medium Requirements: C++ programming, basic machine learning
C5.0 is among the most famous decision trees. It has just recently been released under the GPL  and thus it would be great to integrate it as a ``machine'' in shogun. Since the source code of C5.0 is in C, your task is to encapsulate C5.0 in C++ classes as shogun machine using shogun features as inputs.

Implement Olivier Chapelles fast newton based SVM solver and integration of updated versions of liblinear/ocas
Mentor: Soeren Sonnenburg Difficulty: Easy to Medium Requirements: C++ programming, basic octave or matlab, basic machine learning
Code is available in matlab from http://mloss.org/software/view/30/. The task is to translate this code to C++ and integrate it in shogun. This involves
 Defining a new CLinearMachine derived class CNewtonSVM
 Translating the few lines of matlab code
Once familiar with the shogun, students could integrate updated versions of liblinear and svmocas: New versions of liblinear and libocas are availabe that now support multiclass classification. The task is to
 Merge the newer versions into shogun
 To integrate the new multiclass linear SVMs utilizing shogun's COFFIN framework implemented in CDotFeatures
 (Optionally) Implement the Local Binary Pattern (LBP) features, useful in gender recognition based frontal face pictures.

Bundle method solver for structured output learning
Mentor: Vojtech Franc Difficulty: Medium to Difficult Requirements: C++ programming, some background in optimization, machine learning.
 Get familiar with Shogun.
 Implement the classical bundle method for SO learning.
 Integrate the algorithm to Shogun.
 Compare experimentally the implemented algorithm with the StructSVM[2] and other SO solvers already integrated in Shogun.
Learning of the structured output classifiers leads to solving a convex minimization problem which is not tractable by standard algorithms. A significant effort in ML community has been put to development of specialized solvers among which the Bundle Method for Risk Minimization (BMRM) [1], implemented e.g. in popular StructSVM [2], is the current the stateoftheart. The BMRM is a simplified variant of bundle methods which are standard tools for nonsmooth optimization [3]. The simplicity of the BMRM is compensated by its reduced efficiency. Experiments show that a careful implementation of the classical bundle methods perform significantly faster (speedup ~ 510) than their variants (like BMRM) adopted by ML community. The goal will be an OS library implementing the classical bundle method for the SO learning and its integration to Shogun.
Tasks: Literature: [1] C.H. Teo, S.V.N. Vishwanthan, A.J. Smola, Q.V. Le. Bundle Methods for Regularized Risk Minimization. JMLR, 11(Jan):311−365, 2010. [2] T.Joachims. StructSVM. [3] C. Lemarechal, A.Nemirovskii, Y. Nesterov. New variants of bundle methods. Mathematical Programming. 69, 111147. 1995

Implement Gaussian Processes and regression techniques
Mentor: Oliver Stegle Difficulty: Medium Requirements: C++ programming, machine learning methods for regression, understand octave or matlab code.
While Shogun provides for a number of scalable models for classification, flexible and accurate regression techniques are partially missing at this point. (currently only (Kernel)Ridge Regression, Support Vector Regression and Least Squares Regression are implemented). Your task will be to adapt and translate established Gaussian process models for usage in Shogun, integrating ideas from existing matlab and python toolboxes (see http://www.gaussianprocess.org/#code) and in particular the GPML Toolbox. To start out, we will implement stateoftheart Gaussian process regression models and adapt shogun as needed for efficient inference in Gaussian process models: Gaussian process regression using a standard Gaussian kernel (for example [1], chapter 1 and 2)
 Extend key kernels in shogun to provide for parameter derivatives and add further interfaces needed to speed up Gaussian process parameter learning.
 In the second part of the project, we aim to implement extensions that make shogun even more useful for advanced regression models. Depending on your interests, we suggest to either extend the models to cope with outliers in the data (robust regression) or implement the Gaussian process latent variable model for dimensionality reduction in highdimensional datasets: Implement the Gaussian proces latent variable model in Shogun [2]
 Add support for robust Gaussian process regression [3]
 In addition, the interested student could enhance the regression methods currently available in shogun (e.g. by developing more numerically robust methods) and implement other stateoftheart regression techniques.
This project allows the student to get familiar with various regression methods often used in machine learning and in particular GPs.
[1] C. Rasmussen, C. Williams Gaussian processe for machine learning [2] N. Lawrence, Probabilistic nonlinear principle component analysis with Gaussian Process Latent Variable Models, JMLR 2005 [3] P. Jylänki, J. Vanhatalo, A. Vehtari, Robust Gaussian process regression with a Studentt Likelihood, JMLR 2011

Implement multitask and domain adaptation algorithms
Mentor: Christian Widmer Difficulty: Medium Requirements: C++ programming, advanced machine learning
Multitask Learning and Domain Adaptation are active fields of research in the Machine Learning Community. The main aim is to combine information from similar prediction tasks in order to come up with better classifiers. As an example consider building a classifier that predicts whether a person in Germany likes a particular movie, given their purchase history. The resulting classifier could be very similar to the one you would get for, say, Dutch customers, however quite different from Japanese customers. Yet, one possibly could still exploit useful information in the Japanese data set even for the German or Dutch customers. This is exactly where multitask learning comes in: Related problems are coupled together, leading to classifiers that exploit all available information. In practice, this is done by certain modification to the standard formulations of classifiers such as the SVM. In this project, the student would work close to active research. As a first step they would be closely supervised in order to implement a few Multitask Learning algorithms available in literature, for example:
 Convex Multitask Feature Learning (paper)
 Learning Multiple Tasks using Manifold Regularization (paper)
Depending on the speed of progress, the student will then be involved in the efficient implementations of active research projects. The working prototypes will be provided in python and the task will be to come up with fast, efficient, welltested and welldocumented C++ implementations as part of SHOGUN. This projects gives the student the chance to work close to ongoing academic research. It offers the student to learn about the optimization techniques needed for machine learning, getting from prototypes to "productionready" implementations and of course the research field of multitask learning.

Kernel Twosample/Dependence test
Mentor: Arthur Gretton, Soeren Sonnenburg Difficulty: Medium to Difficult Requirements: C++ programming, Numerical algorithm implementations, Knowledge of probability distributions/statistical hypothesis tests and kernel methods, Reading Matlab Code
Hypothesis tests are an important tool for statistical analysis. Given samples from two distributions p and q, the goal is to test whether p=q. A teststatistic is computed from samples and being compared to a threshold. Based on this, the nullhypothesis (p=q) is or is not rejected.
For this poject, kernel versions of hypothesis tests are to be implemented, based on the "Maximum Mean Discrepancy" (MMD), (see [1], [2], [3], [4]). These would have for example application in the area of bioinformatics (microarray data distributions) or database attribute matching where data is highdimensional. In addition, they would allow to test whether data comes from two different distributions in fields where data is not given in vector form (for example graphs) The idea of these tests is to find a well behaved (e.g. smooth) function which is large on the points drawn from p, and small (as negative as possible) on the points from q. The mean function values on the two samples is used to compute the teststatistic. When this is large, the samples are likely from different distributions.
Related to the two sample test are dependence tests, i.e. telling whether two samples are statistically (in)dependent, (see for example this wikipedia article). For this project, a kernel dependence test based on the Hilbert Schmidt Independence Criterion (HSIC), as described in [5], is to be implemented. As for the MMD tests, there exist multiple method of computing the significace threshold. See [6] for this and some slides for an intuitive description
There already exist Matlab implementations for both MMD [4] and HSIC [5]. For MMD, a good start would be the linear time statistic in [1, Corollary 22], where the Nulldistribution is a Gaussian. A more complicated test is based on the asymptotic distribution of the Unbiased MMD Statistic [4, section 5]. A difficulty here is to compute the significance threshold. This should be done via two possibilities: a method based on bootstrap resampling (see [1], [2]) and a method based on the Eigenspectrum of the Gram matrix on the aggregate sample from p and q (see [3]), and a cheap approximation of the latter [3]. For HSIC, there are multiple versions for computing the significance threshold, in paritcular one based on bootstrapping [5] [6], and one faster consistent version [reference to come]
Another (optional, if the rest works) difficulty would be to pay a bit more attention to kernel (parameter) selection in context of the above tests. However, some usable heuristics exist.
The first practical step for the student is read and understand the mentioned papers to understand the ideas behind the methods. It is not important to gain deep mathematical insight but rather to know what is important for the implementation. There already exists a Matlab implementation which can be used as an orientation. Looking at/understanding the code would be the second step. Since Matlab has many numerical algorithms builtin, it might be necessary to include some of them into SHOGUN. There are yet no hypothesis/dependence tests in SHOGUN, therefore, a framework concept for these has to be thought of. This includes creating new classes/interfaces to make this as flexible as possible. All this has to be done in respect to existing interfaces, so a good knowledge of the SHOGUN framework is crucial/has to be acquired. The algorithms/interfaces have to be documented well and examples/tests for their usage have to be created. Creating examples may be nontrivial here since the data should be more complex in order. Finally, SHOGUN's usual interfaces to other languages have to be created and examples should be translated to at least some of the language bindings.
This project offers a great opportunity to get familiar with kernel statistical hypothesis testing. The mentioned methods were all developed quite recently by wellknown scientists. Since there already exists a working implementation in Matlab, coding has not to be done completely from scratch. From the implementation side, students will get to know the SHOGUN framework since the are expanding it. Also experience writing numerical algorithms will be gathered. Implementing algorithms for scientific opensource software in any case is a great experience.
[1] Main reference for MMD (long technical report): "A Kernel Method for the TwoSample Problem" Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schökopf, Alexander J. Smola [2] Short summary of [1]: "A KernelMethod for the TwoSampleProblem" Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola [3] A better method for approximating the nulldistribution of the asymptotic distribution of the Unbiased MMD Statistic: "A Fast, Consistent Kernel TwoSample Test", Gretton, Arthur Fukumizu, Kenji Harchaoui, Zaid [4] Matlab implementation and references: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm [5] Main reference for HSIC: "A Kernel Statistical Test of Independence" Gretton, A. K. Fukumizu C.H. Teo L. Song, B Schoelkopf A. Smola NIPS 21, 2007. [6] References and implementation of the kernel dependence test: http://people.kyb.tuebingen.mpg.de/arthur/indep.htm

Implementation of / Integration via existing GPL code of latent SVMs.
Mentor: Alexander Binder Requirements: C++ programming, machine learning Difficulty: Difficult
SVMs with latent variables
 checking what implementations licensed under the GPL exist in C/C++
 to what kind of problems are they practically applicable
 if necessary implement one latent SVM model
 implement templates for minimization over userdefined latent variables (difficult!)
 implement dynamic interfaces
 implement static interfaces for important application cases
 implement more than one solution strategy
 implement heuristics for dealing with large sample sizes
 check such heuristics for their efficiency on large scale data
http://www.cs.cornell.edu/~cnyu/latentssvm/, http://research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf are an ongoing success story in Object detection approaches where the latent variable is the unknown position of an object center and object parts. Felzenswalbs approach received the "Lifetime Achievement" Prize based on its impact in research (see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html) at the Pascal VOC2010 workshop. Latent parameters are equally appealing for disccriminative methods in other application areas as well.
The task consists of The challenge in implementation lies in solving the non(!)convex optimization problem in an way which is acceptable for practitioners , including allowing for templates for minimization over the latent variable, which are desried to be consistent with the interfaces to other languages, termination after a maximal runtime or heuristic frameworks which allow to iterate it on smaller chunks of data while keeping only the hardest samples.

Implementation of Boosting type learning method for nonconstantly dimensional features defined as sets of fixed dim vectors with varying set size.
Mentor: Alexander Binder Difficulty: Medium Requirements: C++ programming, machine learning
http://www.cs.cmu.edu/~rahuls/pub/cvpr2008unifiedrahuls.pdf is a method generally usable (outside its original domain) for features which are sets of fixed dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of constantly (fixed) dimensional features which could extend the usability of the Shogun toolbox at this edge.
The task consists of
 implement the model and solution algorithm
 implement a parallelization of the algorithm
 implement heuristics for dealing with large sample sizes
 check such heuristics for their efficiency on large scale data
 implement dynamic and static interfaces
The challenge here would be to implement it in a scalable way for larger sample sizes, possibly by parallelization or several heuristics to run it over chunks of the training data.
