Improving accessibility to shogun
Various usability improvements
Mentor: Soeren Sonnenburg
Difficulty: Easy to Difficult
Requirements: C++ programming, python, swig, some machine learning background
Currently Shogun has a number of rough edges; polishing them would make it more accessible to others. The student's task would be to:
- Implement customizable SWIG interface classes (so-called director classes [1,2]), making it possible to extend certain class methods from an interface language. For fast prototyping, one could then implement a kernel in python (or java or octave) and use it with any of shogun's SVMs. Other use cases of such director classes include: using python classifiers as shogun machines within shogun's python-modular interface, prototyping new preprocessors, implementing features with database support via some glue code, and various more.
- Implement better python integration. For example, specifying the parameter settings to try during cross-validation is currently very complex. In addition, python (since version 2.6) supports the buffer protocol [3], which, when implemented for e.g. shogun's CSimpleFeatures or Labels, would enable accessing shogun objects like python numpy arrays (to a certain extent).
- The octave_modular interface currently has no support for sparse matrices. The task would be to implement the corresponding typemaps based on the python_modular ones (file swig_typemaps.i in interfaces/python_modular and interfaces/octave_modular).
- Various code contains ASSERT()s, which should be replaced with more understandable SG_ERROR() messages.
- The current implementation of SGVector [4] should be converted to use reference counting.
- Various minor usability improvements.
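To illustrate the director-class bullet above: shogun's director support is the task itself, so the names below (Kernel, gram_matrix) are hypothetical stand-ins that merely mimic how a SWIG director base class lets C++ code dispatch a virtual call back into a Python subclass.

```python
# Sketch of the intended director-class usage pattern (pure Python;
# Kernel and gram_matrix are hypothetical stand-ins, not shogun API).

class Kernel:
    """Stand-in for a SWIG director base class (think shogun's CKernel)."""
    def compute(self, x, y):
        raise NotImplementedError  # meant to be overridden in Python

class LinearKernel(Kernel):
    """A kernel prototyped entirely in Python, as a director would allow."""
    def compute(self, x, y):
        return sum(a * b for a, b in zip(x, y))

def gram_matrix(kernel, data):
    """Mimics C++ code that only knows the Kernel interface: the virtual
    compute() call is dispatched back to the Python subclass."""
    return [[kernel.compute(a, b) for b in data] for a in data]

data = [[1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(LinearKernel(), data)
# K == [[1.0, 0.0], [0.0, 4.0]]
```

With real director classes, gram_matrix would live in C++ inside an SVM, and only LinearKernel would be user code.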
This task offers a good opportunity to see various aspects of shogun. A particular (fun!) challenge is the rather complex swig cross-language development process.
[1] http://www.swig.org/Doc2.0/Python.html#Python_nn34
[2] http://www.swig.org/Doc2.0/Python.html#Python_nn34
[3] http://www.python.org/dev/peps/pep-3118/
[4] http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1SGVector.html
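The buffer-protocol bullet above can be illustrated with the stdlib alone: a memoryview gives zero-copy, typed access to existing storage, which is exactly what numpy would obtain from a buffer-enabled CSimpleFeatures or Labels object (the `labels` array here is just a stand-in for shogun-owned memory).

```python
# Rough stdlib-only illustration of what PEP 3118 buffer support buys.
from array import array

labels = array('d', [1.0, -1.0, 1.0])  # imagine this memory lives inside shogun
view = memoryview(labels)              # zero-copy view, as numpy.asarray would use

view[1] = -2.0                         # writes through to the underlying storage
assert labels[1] == -2.0
assert view.format == 'd' and view.itemsize == 8
```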
-
Mentor: Cheng Soon Ong
Difficulty: Medium
Requirements: C++ programming, python, (optional django)
The machine learning data set repository mldata.org provides data and task
downloads in a standardized format. Make shogun natively able to
read these data formats and interact with the mldata.org website, i.e., by
- adding read and write support for the hdf5-based data sets and tasks (see http://mldata.org/about/hdf5/)
- contributing features to mldata.org for command-line based up-/downloading of data, tasks, methods and solutions
- contributing features to mloss.org for command-line based up-/downloading of methods and other community features
Note that the optional part of this task aims to help the machine learning community portals, for which python and django skills are needed (both portals' source code is also open source).
Reading and writing of shogun features / objects in .hdf5 (pytables/matlab 7.0/matlab 7.X/octave 3.x/h5py/arff) format
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming
Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python, r, octave, matlab, ..., sometimes providing better hdf5 support than the native implementations. The task would be
to extend hdf5 support by implementing
- variable length data type support
- symbolic links
- compression support
- slicing support
- attribute support
Extensive information about hdf5 can be found in its reference manual.
Matlab 7 file format specifications are publicly available online. In addition, Octave has support for reading files up to version 7.2. The task is to develop a file I/O class based on these specifications that
allows reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc).
Natively support loading and writing of octave files. One can make use of the functionality already implemented in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}), so this task involves a lot of code borrowing.
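To give a feel for the .mat parsing task: below is a minimal sketch of the level-5 MAT-file header (first 128 bytes) as I understand the published format description: 116 bytes of descriptive text, an 8-byte subsystem offset, a 2-byte version and a 2-byte endian indicator. It is written and read back with the stdlib struct module; treat it as an illustration of the approach, not a complete or authoritative reader, and consult the official specification for the real implementation.

```python
# Write and parse a minimal MAT-5 header sketch (layout per the public
# format description; verify against the official spec before relying on it).
import io
import struct

def write_header(buf, text=b'MATLAB 5.0 MAT-file, written by a sketch'):
    # 116s text, 8s subsystem offset, H version, 2s endian indicator
    buf.write(struct.pack('<116s8sH2s', text, b'\x00' * 8, 0x0100, b'IM'))

def read_header(buf):
    text, _subsys, version, endian = struct.unpack('<116s8sH2s', buf.read(128))
    # b'IM' in the byte stream means the 16-bit value 'MI' was written
    # little-endian, so no byte swapping is needed on a little-endian reader.
    byte_order = 'little' if endian == b'IM' else 'big'
    return text.rstrip(b'\x00').decode('ascii'), version, byte_order

buf = io.BytesIO()
write_header(buf)
buf.seek(0)
text, version, order = read_header(buf)
assert version == 0x0100 and order == 'little'
```

After the header, a real reader would loop over tagged data elements (type/size pairs), which is where the octave sources mentioned above help.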
Framework Improvements
Integrate Existing and Implement New Machine Learning Algorithms
Integrate C5.0
Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, basic machine learning
C5.0 is among the most famous decision tree learners. It has recently been released under the GPL, and thus it would be great to integrate it as a ``machine'' in shogun. Since the source code of C5.0 is in C, your task is to encapsulate C5.0 in C++ classes as a shogun machine using shogun simple features as inputs.
Implement Olivier Chapelle's fast Newton-based SVM solver and integrate updated versions of liblinear/ocas
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming, basic octave or matlab, basic machine learning
Code is available in matlab from
http://mloss.org/software/view/30/.
The task is to translate this code to C++ and integrate it in shogun. This involves
- Defining a new CLinearClassifier-derived class CNewtonSVM
- Translating the few lines of matlab code
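To sketch what those few lines amount to, here is a toy numpy re-derivation of primal Newton training with the squared hinge loss, J(w) = lam*||w||^2 + sum over margin violators of (1 - y_i w.x_i)^2. This is an illustration of the idea, not the mloss.org matlab code translated verbatim; names and the stopping rule are my own.

```python
# Toy Chapelle-style primal Newton SVM sketch (squared hinge, no bias).
import numpy as np

def newton_svm(X, y, lam=0.1, max_iter=20):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        sv = y * (X @ w) < 1                    # current margin violators
        Xsv, ysv = X[sv], y[sv]
        # Newton step for the squared hinge: (lam*I + Xsv'Xsv) w = Xsv'ysv
        w_new = np.linalg.solve(lam * np.eye(d) + Xsv.T @ Xsv, Xsv.T @ ysv)
        done = np.allclose(w, w_new)
        w = w_new
        if done:
            break
    return w

# toy linearly separable problem
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = newton_svm(X, y)
assert np.all(np.sign(X @ w) == y)
```

The C++ version would express the same linear solve with shogun's feature and label classes.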
Once familiar with shogun, students could integrate updated versions of liblinear and svmocas:
new versions of liblinear and libocas are available that now support multiclass classification. The task is to
- Merge the newer versions into shogun
- To integrate the new multi-class linear SVMs utilizing shogun's COFFIN framework implemented in CDotFeatures
- (Optionally) Implement Local Binary Pattern (LBP) features, useful for gender recognition based on frontal face pictures.
Bundle method solver for structured output learning
Mentor: Vojtech Franc
Difficulty: Medium to Difficult
Requirements: C++ programming, some background in optimization, machine learning.
Learning structured output (SO) classifiers leads to a convex minimization problem which is not tractable by standard algorithms. The ML community has put significant effort into developing specialized solvers, among which the Bundle Method for Risk Minimization (BMRM) [1], implemented e.g. in the popular StructSVM [2], is the current state-of-the-art. BMRM is a simplified variant of the bundle methods which are standard tools for non-smooth optimization [3]. The simplicity of BMRM, however, comes at the cost of reduced efficiency: experiments show that a careful implementation of the classical bundle methods performs significantly faster (speedup ~5-10) than the variants (like BMRM) adopted by the ML community. The goal will be an open-source library implementing the classical bundle method for SO learning and its integration into Shogun.
Tasks:
- Get familiar with Shogun.
- Implement the classical bundle method for SO learning.
- Integrate the algorithm into Shogun.
- Experimentally compare the implemented algorithm with StructSVM [2] and other SO solvers already integrated into Shogun.
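The tasks above revolve around one idea that a tiny 1-D toy can show: cutting-plane methods lower-bound the risk R(w) by planes built from subgradients and repeatedly minimize lam/2*w^2 + max_j(a_j*w + b_j). The sketch below is my own simplification (the convex 1-D master problem is solved by ternary search; real bundle methods solve a QP and add stabilization, which is precisely what this project is about).

```python
# 1-D toy of the BMRM-style cutting-plane loop (illustration only).

def bmrm_1d(risk, subgrad, lam=0.5, iters=30):
    planes, w = [], 0.0
    for _ in range(iters):
        a = subgrad(w)
        planes.append((a, risk(w) - a * w))   # plane touching R at current w
        master = lambda v: 0.5 * lam * v * v + max(aj * v + bj for aj, bj in planes)
        lo, hi = -100.0, 100.0                # ternary search: master is convex
        for _ in range(200):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if master(m1) < master(m2):
                hi = m2
            else:
                lo = m1
        w = (lo + hi) / 2
    return w

# R(w) = |w - 2|: the minimizer of 0.25*w^2 + |w - 2| is w = 2
w = bmrm_1d(lambda w: abs(w - 2), lambda w: 1.0 if w >= 2 else -1.0, lam=0.5)
assert abs(w - 2.0) < 0.05
```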
Literature:
[1] C.H. Teo, S.V.N. Vishwanathan, A.J. Smola, Q.V. Le. Bundle Methods for Regularized Risk Minimization. JMLR, 11(Jan):311-365, 2010.
[2] T. Joachims. StructSVM.
[3] C. Lemarechal, A. Nemirovskii, Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69:111-147, 1995.
Implement Gaussian Processes and regression techniques
Mentor: Oliver Stegle
Difficulty: Medium
Requirements: C++ programming, machine learning methods for regression, ability to understand octave or matlab code.
While Shogun provides a number of scalable models for classification, flexible and accurate regression techniques are partially missing at this point
(currently only (kernel) ridge regression, support vector regression and least squares regression are implemented).
Your task will be to adapt and translate established Gaussian process models for usage in Shogun, integrating ideas from existing matlab and python toolboxes (see http://www.gaussianprocess.org/#code) and in particular the GPML Toolbox.
To start out, we will implement state-of-the-art Gaussian process regression models and adapt shogun as needed for efficient inference in Gaussian process models:
- Gaussian process regression using a standard Gaussian kernel (for example [1], chapters 1 and 2)
- Extend key kernels in shogun to provide for parameter derivatives and add further interfaces needed to speed up Gaussian process parameter learning.
In the second part of the project, we aim to implement extensions that make shogun even more useful for advanced regression models. Depending on your interests, we suggest to either extend the models to cope with outliers in the data (robust regression) or implement the Gaussian process latent variable model for dimensionality reduction in high-dimensional datasets:
- Implement the Gaussian process latent variable model in Shogun [2]
- Add support for robust Gaussian process regression [3]
In addition, the interested student could enhance the regression methods currently available in shogun (e.g. by developing more numerically robust methods) and implement other state-of-the-art regression techniques.
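As a reference point for the first milestone, here is a compact numpy sketch of standard GP regression with an RBF kernel, following the textbook Cholesky recipe from [1], chapter 2 (predictive mean k*' (K + sn^2 I)^{-1} y). Hyperparameters are fixed; learning them via kernel parameter derivatives is exactly the extension the task list asks for.

```python
# GP regression sketch: Cholesky-based predictive mean and variance.
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, Xstar, noise=1e-3):
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Xstar, X)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf(Xstar, Xstar) - v.T @ v      # predictive covariance
    return mean, np.diag(var)

X = np.linspace(0, 3, 10)[:, None]
y = np.sin(X).ravel()
mu, var = gp_predict(X, y, np.array([[1.5]]))
assert abs(mu[0] - np.sin(1.5)) < 0.05
```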
This project allows the student to become familiar with various regression methods often used in machine learning, in particular GPs.
[1] C. Rasmussen, C. Williams. Gaussian Processes for Machine Learning.
[2] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 2005.
[3] P. Jylänki, J. Vanhatalo, A. Vehtari. Robust Gaussian process regression with a Student-t likelihood. JMLR, 2011.
-
Implement multitask and domain adaptation algorithms
Mentor: Christian Widmer
Difficulty: Medium
Requirements: C++ programming, advanced machine learning
Multitask learning and domain adaptation are active fields of research in the machine learning community. The main aim is to combine information from similar prediction tasks in order to come up with better classifiers.
As an example, consider building a classifier that predicts whether a person in Germany likes a particular movie, given their purchase history. The resulting classifier could be very similar to the one you would get for, say, Dutch customers, yet quite different from Japanese customers. Still, one could possibly exploit useful information in the Japanese data set even for the German or Dutch customers.
This is exactly where multitask learning comes in: related problems are coupled together, leading to classifiers that exploit all available information. In practice, this is done by certain modifications to the standard formulations of classifiers such as the SVM.
In this project, the student would work close to active research. As a first step, they would be closely supervised while implementing a few multitask learning algorithms from the literature, for example:
- Convex Multitask Feature Learning (paper)
- Learning Multiple Tasks using Manifold Regularization (paper)
Depending on the speed of progress, the student will then be involved in the efficient implementation of active research projects. Working prototypes will be provided in python, and the task will be to come up with fast, efficient, well-tested and well-documented C++ implementations as part of SHOGUN.
This project gives the student the chance to work close to ongoing academic research. It offers the opportunity to learn about the optimization techniques needed for machine learning, about getting from prototypes to "production-ready" implementations, and of course about the research field of multitask learning.
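One of the simplest ways related tasks get coupled in practice is feature augmentation: each example gets a shared feature block plus a task-specific block, so a single linear model jointly learns shared and per-task weight components. This is not the method of the papers listed above, just a minimal stdlib sketch of the coupling idea, trained here with a plain perceptron.

```python
# Multitask learning via feature augmentation (shared + per-task blocks).

def augment(x, task, n_tasks):
    out = list(x)                        # shared block, seen by every task
    for t in range(n_tasks):
        out += x if t == task else [0.0] * len(x)
    return out

def perceptron(data, epochs=20):
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# two related tasks: same sign pattern, slightly shifted inputs
raw = [([1.0, 0.2], 1, 0), ([-1.0, -0.2], -1, 0),
       ([1.2, 0.1], 1, 1), ([-1.2, -0.1], -1, 1)]
data = [(augment(x, t, 2), y) for x, y, t in raw]
w = perceptron(data)
assert all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in data)
```

The cited SVM-based formulations achieve the coupling through regularizers rather than explicit feature copies, but the effect (information sharing across tasks) is the same in spirit.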
Kernel Two-sample/Dependence test
Mentor: Arthur Gretton, Soeren Sonnenburg
Difficulty: Medium to Difficult
Requirements: C++ programming, Numerical algorithm implementations, Knowledge of probability distributions/statistical hypothesis tests and kernel methods, Reading Matlab Code
Hypothesis tests are an important tool for statistical analysis. Given samples from two distributions p and q, the goal is to test whether p=q. A test statistic is computed from the samples and compared to a threshold; based on this, the null hypothesis (p=q) is rejected or not.
For this project, kernel versions of hypothesis tests are to be implemented, based on the "Maximum Mean Discrepancy" (MMD) (see [1], [2], [3], [4]).
These have applications, for example, in bioinformatics (comparing microarray data distributions) or in database attribute matching, where data is high-dimensional. In addition, they allow testing whether data comes from two different distributions in fields where data is not given in vector form (for example graphs).
The idea of these tests is to find a well-behaved (e.g. smooth) function which is large on the points drawn from p and small (as negative as possible) on the points from q. The mean function values on the two samples are used to compute the test statistic; when it is large, the samples are likely from different distributions.
Related to the two-sample test are dependence tests, i.e. telling whether two samples are statistically (in)dependent (see for example http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
For this project, a kernel dependence test based on the Hilbert Schmidt Independence Criterion (HSIC), as described in [5], is to be implemented.
As for the MMD tests, there exist multiple methods of computing the significance threshold; see [6] for this and for some slides with an intuitive description.
There already exist Matlab implementations for both MMD [4] and HSIC [5].
For MMD, a good start would be the linear time statistic in [1, Corollary 22], where the Null-distribution is a Gaussian.
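That linear-time statistic is simple enough to sketch with the stdlib: average h((x1,y1),(x2,y2)) = k(x1,x2)+k(y1,y2)-k(x1,y2)-k(x2,y1) over disjoint sample pairs, here with a Gaussian kernel on scalars. The threshold computation (the Gaussian null distribution from [1]) is deliberately left out; this only shows the statistic itself.

```python
# Linear-time MMD statistic sketch (cf. [1, Corollary 22]), 1-D data.
import math
import random

def k(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def linear_mmd(x, y):
    m = len(x) // 2 * 2
    h = [k(x[i], x[i + 1]) + k(y[i], y[i + 1])
         - k(x[i], y[i + 1]) - k(x[i + 1], y[i])
         for i in range(0, m, 2)]
    return sum(h) / len(h)

random.seed(0)
p = [random.gauss(0, 1) for _ in range(2000)]
q_same = [random.gauss(0, 1) for _ in range(2000)]
q_diff = [random.gauss(3, 1) for _ in range(2000)]
# the statistic is near zero for equal distributions, clearly larger otherwise
assert linear_mmd(p, q_diff) > linear_mmd(p, q_same)
```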
A more complicated test is based on the asymptotic distribution of the unbiased MMD statistic [4, section 5]. A difficulty here is to compute the significance threshold. This should be done in several ways: a method based on bootstrap resampling (see [1], [2]), a method based on the eigenspectrum of the Gram matrix on the aggregate sample from p and q (see [3]), and a cheap approximation of the latter [3].
For HSIC, there are also multiple versions of computing the significance threshold, in particular one based on bootstrapping [5][6] and a faster consistent version [reference to come].
Another (optional, if the rest works) task would be to pay more attention to kernel (parameter) selection in the context of the above tests. However, some usable heuristics exist.
The first practical step for the student is to read the mentioned papers and understand the ideas behind the methods. It is not important to gain deep mathematical insight, but rather to know what matters for the implementation.
There already exist Matlab implementations which can be used as an orientation; looking at and understanding that code would be the second step. Since Matlab has many numerical algorithms built in, it might be necessary to port some of them to SHOGUN.
There are as yet no hypothesis/dependence tests in SHOGUN; therefore, a framework concept for these has to be designed. This includes creating new classes/interfaces to make the framework as flexible as possible. All this has to be done with respect to existing interfaces, so a good knowledge of the SHOGUN framework is crucial and has to be acquired.
The algorithms/interfaces have to be well documented, and examples/tests for their usage have to be created. Creating examples may be non-trivial here, since the data should be sufficiently complex.
Finally, SHOGUN's usual interfaces to other languages have to be created and examples should be translated to at least some of the language bindings.
This project offers a great opportunity to get familiar with kernel-based statistical hypothesis testing. The mentioned methods were all developed quite recently by well-known scientists. Since working Matlab implementations already exist, coding does not have to start completely from scratch.
From the implementation side, students will get to know the SHOGUN framework, since they are expanding it. They will also gain experience writing numerical algorithms. Implementing algorithms for scientific open-source software is in any case a great experience.
[1] Main reference for MMD (long technical report): "A Kernel Method for the Two-Sample Problem".
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola.
[2] Short summary of [1]: "A Kernel Method for the Two-Sample-Problem".
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola.
[3] A better method for approximating the null distribution of the asymptotic distribution of the unbiased MMD statistic: "A Fast, Consistent Kernel Two-Sample Test". Arthur Gretton, Kenji Fukumizu, Zaid Harchaoui.
[4] Matlab implementation and references: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm
[5] Main reference for HSIC: "A Kernel Statistical Test of Independence".
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. NIPS 21, 2007.
[6] References and implementation of the kernel dependence test: http://people.kyb.tuebingen.mpg.de/arthur/indep.htm
Implementation of latent SVMs, or integration of existing GPL code.
Mentor: Alexander Binder
Requirements: C++ programming, machine learning
Difficulty: Difficult
SVMs with latent variables (see
http://www.cs.cornell.edu/~cnyu/latentssvm/,
http://research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf)
are an ongoing success story in object detection, where the
latent variable is the unknown position of an object center and object
parts. Felzenszwalb's approach received the "Lifetime Achievement" prize
at the Pascal VOC2010 workshop, based on its impact in research (see
http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html).
Latent parameters are equally appealing for discriminative methods in
other application areas as well.
The task consists of:
- checking which GPL-licensed C/C++ implementations exist
- determining to what kinds of problems they are practically applicable
- implementing one latent SVM model, if necessary
- implementing templates for minimization over user-defined latent variables (difficult!)
- implementing dynamic interfaces
- implementing static interfaces for important application cases
- implementing more than one solution strategy
- implementing heuristics for dealing with large sample sizes
- checking such heuristics for their efficiency on large-scale data
The implementation challenge lies in solving the non(!)-convex
optimization problem in a way which is acceptable for practitioners,
including allowing templates for minimization over the latent
variable (which should be consistent with the interfaces to other languages),
termination after a maximal runtime, and heuristic frameworks which allow
iterating over smaller chunks of data while keeping only the hardest
samples.
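The operation a latent SVM keeps repeating, both during training (imputing the latent variable) and at test time (detection), is maximization of the score over the latent variable. Here is a stdlib sketch of that inference step for the sliding-window case described above; the names (phi, infer_latent) are hypothetical, and the "templates for user-defined latent variables" task would generalize exactly this argmax.

```python
# Latent inference sketch: argmax over the unknown object position h.

def phi(x, h, width=2):
    return x[h:h + width]                       # features of the window at h

def score(w, x, h, width=2):
    return sum(wi * xi for wi, xi in zip(w, phi(x, h, width)))

def infer_latent(w, x, width=2):
    """argmax_h w . phi(x, h): the imputation step inside training and
    the detection step at test time."""
    return max(range(len(x) - width + 1), key=lambda h: score(w, x, h, width))

w = [1.0, 1.0]                                  # a detector that likes bright windows
x = [0.0, 0.2, 5.0, 5.0, 0.1, 0.0]
h = infer_latent(w, x)
assert h == 2 and score(w, x, h) == 10.0
```

The non-convexity comes from alternating this imputation with retraining of w, which is where the solution strategies and heuristics listed above come in.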
Implementation of a boosting-type learning method for features defined as sets of fixed-dimensional vectors with varying set size
Mentor: Alexander Binder
Difficulty: Medium
Requirements: C++ programming, machine learning
http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf
describes a method generally usable (outside its original domain) for features which are sets of fixed-dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of fixed-dimensional features, and it could extend the usability of the Shogun toolbox in this direction.
The task consists of:
- implementing the model and solution algorithm
- implementing a parallelization of the algorithm
- implementing heuristics for dealing with large sample sizes
- checking such heuristics for their efficiency on large-scale data
- implementing dynamic and static interfaces
The challenge here is to implement it in a scalable way for larger sample sizes, possibly via parallelization or heuristics that run it over chunks of the training data.
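To make the "sets of vectors with varying set size" setting concrete, here is a stdlib sketch of one simple way a boosting weak learner can consume such features: pool each set per coordinate with max(), then fit decision stumps on the pooled values inside a standard AdaBoost loop. This is my own simplification for illustration, not the method of the cited paper.

```python
# AdaBoost over max-pooled set features (each example is a set of 2-d vectors).
import math

def pooled(S, j):
    return max(v[j] for v in S)                 # set -> scalar, any set size

def best_stump(data, weights):
    best = None
    d = len(data[0][0][0])
    for j in range(d):
        for thr in sorted({pooled(S, j) for S, _ in data}):
            for sign in (1, -1):
                pred = lambda S, j=j, thr=thr, s=sign: s if pooled(S, j) >= thr else -s
                err = sum(w for (S, y), w in zip(data, weights) if pred(S) != y)
                if best is None or err < best[0]:
                    best = (err, pred)
    return best

def adaboost(data, rounds=5):
    n = len(data)
    weights, ensemble = [1.0 / n] * n, []
    for _ in range(rounds):
        err, pred = best_stump(data, weights)
        err = max(err, 1e-9)                    # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, pred))
        weights = [w * math.exp(-alpha * y * pred(S))
                   for (S, y), w in zip(data, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return lambda S: 1 if sum(a * p(S) for a, p in ensemble) > 0 else -1

# sets of different sizes; the label depends on whether any member
# has a large first coordinate
data = [([[3.0, 0.1]], 1),
        ([[0.1, 0.2], [2.5, 0.0], [0.3, 0.3]], 1),
        ([[0.2, 0.9], [0.1, 0.4]], -1),
        ([[0.5, 0.5]], -1)]
clf = adaboost(data)
assert all(clf(S) == y for S, y in data)
```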