# Ideas for Google Summer of Code 2012 Projects

Shogun toolbox is a software library for machine learning algorithms, with a special focus on large-scale applications. During Summer of Code 2012 we are looking to extend the library in three different ways:
1. Improving accessibility to shogun by developing improving i/o support (more file formats) and mloss.org/mldata.org integration.
2. Framework improvements (frameworks for regression, multiclass, structured output problems, quadratic progamming solvers).
3. Integration of existing and new machine algorithms.
Below is listed a set of suggestions for projects. If you have your own concrete idea or want to solve a task from 2011 - talk to us! We might be interested.

GSoC 2011 ideas

back to shogun toolbox homepage

## Improving accessibility to shogun

• #### Various usability improvements

Mentor: Soeren Sonnenburg
Difficulty: Easy to Difficult
Requirements: C++ programming, python, swig, some machine learning background

Currently Shogun has a number of rough edges, which polishing would make it more accessible to others. The students task would be to
1. Implement customizable SWIG interface classes (so called director classes [1,2]) making it possible to extend certain class methods using some interface language, e.g. for fast prototyping one could implement a kernel in python (or java or octave), which can be used by any of shogun's SVMs. Other use-cases of such director classes include: using python classifiers as shogun machines with shogun's python-modular interface, prototyping new preprocessors, implementing features with database support with some clue code and various more.
2. Implement a better python integration, i.e. for cross-validation it is currently very complex to specify the parameter settings to try out. In addition, python since version 2.6 supports the python buffer protocol [3], which when implemented for e.g. shoguns CSimpleFeatures or Labels would enable accessing shogun objects like python numpy arrays (to a certain extend). python array protocol. Then we could mention various other known
3. The octave modular interface currently has no support for sparse matrices. The task would be to implement these in the swig_typemaps.i based on the python_modular typemaps (file swig_typemaps.i in interfaces/python_modular and interfaces/octave_modular)
4. Various code contains ASSERT()'s which should be replaced with more understandable SG_ERROR() messages
5. The current implementation of SGVector [4] should be converted to using reference counts.
6. Various minor usability improvements.
This task offers good opportunities to see various aspects of shogun. A particular (fun!) challenge is the rather complex swig cross-language development process.
• #### mloss.org and mldata.org integration

Mentor: Cheng Soon Ong
Difficulty: Medium
Requirements: C++ programming, python, (optional django)

The machine learning data set repository mldata.org provides data and task downloads in a standardized format. Make shogun to natively be able to read these data formats and interact with the mldata.org website, i.e., by
Note that the optional part of this task aims to help the machine learning community portals for which python and django skills are needed (both portal website's source code are also open source).
• #### Reading and writing of shogun features / objects in .hdf5 (pytables/matlab 7.0/matlab 7.X/octave 3.x/h5py/arff) format

Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming

Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python,r,octave,matlab,... sometimes providing better hdf5 support than any of the native implementions. The task would be to extend hdf5 support by implementing
1. varialbe length data types support
3. compression support
4. slicing support
5. attribute support
• Extensive information about hdf5 can be found in its reference manual.

Matlab 7 file format specifications are publicly available online. In addtion Octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that allows for reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc).

Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}. So this task involves a lot of code borrowing.

## Framework Improvements

• #### Built generic structured output learning framework

Mentor: Nico Goernitz
Difficulty: Medium to Difficult
Requirements: C++ programming, machine learning

In many cases machine learning problems result in a classification task where one seeks an algorithm that decides whether a given data point is of one or the other class. Among the most powerful and successful methods solving such problems is the famous support vector machine (SVM). Naturally it's input data consist of high-dimensional lists of features capturing properties of the underlying objects (e.g. weight, size, color, ...). Sometimes these objects are more complex and have a build-in structure like trees and sequences. However, the output is restricted to a binary decision and hence, is not helpful when it comes to predicting e.g. the most likely sequence for speech recognition or genome annotation. To overcome this problem but retain the characteristics of the SVM the structured-output support vector machine (SO-SVM) has been proposed (Tsochantaridis et al [3]). Another powerful class of structured output predictors is the conditional random field (CRF, Lafferty et al [4]). It shares important properties with the SO-SVM but performance might differ depending on the application at hand.

Here, the student will implement a generic SO-SVM and CRF framework to enable handling structured output problems and make it an ideal base for specific user-defined implementations. A crucial point is - of course - speed and therefore implementation covers only primal versions of the SO-SVM. However, this is not a big restriction because most applications already have a rich representation. As application she/he will implement a hidden (semi-)Markov SO-SVM which will be used for transcript mapping or gene prediction. There is already an implementation available in Matlab/C [1]. Code for Viterbi-decoding of semi-Markov models is already available in Shogun [2].

While the programming difficulty is only medium a basic understanding in machine learning is mandatory. This renders the whole problem difficult.

Timetable
1. get used to shogun (especially how to read and write data) [2]
2. test HM-SVM toolbox with toy data [1]
3. get a basic grasp of how the SO-SVM algorithm looks like
4. implement the standard SO-SVM algorithm (a QP-solver is provided)
5. test the SO-SVM by implementing a multi-class SVM (here the joint feature map and argmax-function are especially easy, this is also the first application)
6. implement HM-SVM application (define specific loss- and argmax-functions)
7. implement a CRF and test on the same application

[1] an implementation of the hidden Markov SO-SVM in Matlab/C is available at http://mloss.org/software/view/250/
[2] Shogun framework as well as examples and plenty of doumentation can be downloaded from http://www.shogun-toolbox.org
[3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun: Support vector machine learning for interdependent and structured output spaces, ICML 2004
[4] J. Lafferty, A. McCallum, F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001
• #### Built generic multiclass learning framework

Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, some knowledge of octave/matlab, machine learning

While shogun can be considered the state-of-the-art toolbox for binary classifiers, only few multi-class methods are currently available to shogun. This task is about creating a general multiclass framework nicely complementing other areas of shogun:

Error-correcting output codes (ECOC) enable the use of binary classifiers for multi-class classification problems. Various ECOC schemes like one-vs-rest, one-vs-one, binary trees are well established techniques.

Student applying for this task would have to port the various ECOC schemes available in the Error-Correcting Output Codes Library (matlab code, paper) and continue with implementing novel algorithms, like [1,2].

[1,2] were proven to be efficient for massively multiclass problems (i.e., large number of classes) but are currently lacking any open-source implementation.
Utilizing the ECOC library, latest literature on the subject the student is expected to carry out state-of-the-art implementations ( with the help of the mentor and possibly the authors of the ECOC library and the below papers).

This project offers opportunities to contribute to scientific software; to establish new contacts with members of the machine learning and open source community; to get familiar with latest multiclass classification techniques.

[1] "ShareBoost": Efficient Multiclass Learning with Feature Sharing" by Shalev-Shwartz et al.
[2] "Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition" by Deng et al.
• #### Interface and develop the general quadratic solver library libqp

Mentor: Vojtech Franc
Difficulty: Medium to Difficult
Requirements: Some background in optimization, programming in C++

libqp is a C library which implements algorithms for solving two instances of convex Quadratic Programming (QP) frequently used in Machine Learning (see this link for more details). libqp uses a variant of the sequential minimal optimization algorithm able to solve very-large QP problems where the Hessian does not fit to computer memory. The task would be to make this library accessible from within the shogun framework by interfacing it via the modular interfaces and to extend libqp by implementing solvers for new QP problems and by integrating some existing solvers (like pr_loqo) already in shogun.

Tasks (in parallel to interfacing libqp to shogun):
1. Get familiar with Shogun.
2. Implement sequential minimal solver with 2nd order working set selection for the standard SVM dual QP problem [1][2].
3. Implement sequential minimal solver for the QP problem with box constraints [3].
4. Integrate pr_logo solver to libqp.
5. Create a set of benchmark problems and use them to compare libqp solvers with a commercial package like Mosek [4].
References:
[1] V. Franc. Optimization Algorithms for Kernel Methods. PhD thesis, CMP, FEE CTU Prague. Defended in November 2005.
[2] R.-E. Fan, P.-H. Chen, C.-J. Lin. Working Set Selection Using Second Order Information for Training SVM. JMLR. vol 6. 2005.
[3] V. Franc, M.Navara, V.Hlavac. Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. CAIP2005.
[4] MOSEK.

## Integrate Existing and Implement New Machine Learning Algorithms

• #### Integrate C5.0

Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, basic machine learning

C5.0 is among the most famous decision trees. It has just recently been released under the GPL - and thus it would be great to integrate it as a machine'' in shogun. Since the source code of C5.0 is in C, your task is to encapsulate C5.0 in C++ classes as shogun machine using shogun features as inputs.
• #### Implement Olivier Chapelles fast newton based SVM solver and integration of updated versions of liblinear/ocas

Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming, basic octave or matlab, basic machine learning

Code is available in matlab from http://mloss.org/software/view/30/. The task is to translate this code to C++ and integrate it in shogun. This involves
1. Defining a new CLinearMachine derived class CNewtonSVM
2. Translating the few lines of matlab code
Once familiar with the shogun, students could integrate updated versions of liblinear and svmocas: New versions of liblinear and libocas are availabe that now support multiclass classification. The task is to
1. Merge the newer versions into shogun
2. To integrate the new multi-class linear SVMs utilizing shogun's COFFIN framework implemented in CDotFeatures
3. (Optionally) Implement the Local Binary Pattern (LBP) features, useful in gender recognition based frontal face pictures.
• #### Bundle method solver for structured output learning

Mentor: Vojtech Franc
Difficulty: Medium to Difficult
Requirements: C++ programming, some background in optimization, machine learning.

1. Get familiar with Shogun.
2. Implement the classical bundle method for SO learning.
3. Integrate the algorithm to Shogun.
4. Compare experimentally the implemented algorithm with the StructSVM[2] and other SO solvers already integrated in Shogun.
• Learning of the structured output classifiers leads to solving a convex minimization problem which is not tractable by standard algorithms. A significant effort in ML community has been put to development of specialized solvers among which the Bundle Method for Risk Minimization (BMRM) [1], implemented e.g. in popular StructSVM [2], is the current the state-of-the-art. The BMRM is a simplified variant of bundle methods which are standard tools for non-smooth optimization [3]. The simplicity of the BMRM is compensated by its reduced efficiency. Experiments show that a careful implementation of the classical bundle methods perform significantly faster (speedup ~ 5-10) than their variants (like BMRM) adopted by ML community. The goal will be an OS library implementing the classical bundle method for the SO learning and its integration to Shogun.

Literature:
[1] C.H. Teo, S.V.N. Vishwanthan, A.J. Smola, Q.V. Le. Bundle Methods for Regularized Risk Minimization. JMLR, 11(Jan):311−365, 2010.
[2] T.Joachims. StructSVM.
[3] C. Lemarechal, A.Nemirovskii, Y. Nesterov. New variants of bundle methods. Mathematical Programming. 69, 111-147. 1995
• #### Implement Gaussian Processes and regression techniques

Mentor: Oliver Stegle
Difficulty: Medium
Requirements: C++ programming, machine learning methods for regression, understand octave or matlab code.

While Shogun provides for a number of scalable models for classification, flexible and accurate regression techniques are partially missing at this point. (currently only (Kernel)Ridge Regression, Support Vector Regression and Least Squares Regression are implemented). Your task will be to adapt and translate established Gaussian process models for usage in Shogun, integrating ideas from existing matlab and python toolboxes (see http://www.gaussianprocess.org/#code) and in particular the GPML Toolbox. To start out, we will implement state-of-the-art Gaussian process regression models and adapt shogun as needed for efficient inference in Gaussian process models: Gaussian process regression using a standard Gaussian kernel (for example [1], chapter 1 and 2)
1. Extend key kernels in shogun to provide for parameter derivatives and add further interfaces needed to speed up Gaussian process parameter learning.
2. In the second part of the project, we aim to implement extensions that make shogun even more useful for advanced regression models. Depending on your interests, we suggest to either extend the models to cope with outliers in the data (robust regression) or implement the Gaussian process latent variable model for dimensionality reduction in high-dimensional datasets: Implement the Gaussian proces latent variable model in Shogun [2]
3. Add support for robust Gaussian process regression [3]
4. In addition, the interested student could enhance the regression methods currently available in shogun (e.g. by developing more numerically robust methods) and implement other state-of-the-art regression techniques.
This project allows the student to get familiar with various regression methods often used in machine learning and in particular GPs.

[1] C. Rasmussen, C. Williams Gaussian processe for machine learning
[2] N. Lawrence, Probabilistic non-linear principle component analysis with Gaussian Process Latent Variable Models, JMLR 2005
[3] P. Jylänki, J. Vanhatalo, A. Vehtari, Robust Gaussian process regression with a Student-t Likelihood, JMLR 2011

Mentor: Christian Widmer
Difficulty: Medium
Requirements: C++ programming, advanced machine learning

Multitask Learning and Domain Adaptation are active fields of research in the Machine Learning Community. The main aim is to combine information from similar prediction tasks in order to come up with better classifiers.
As an example consider building a classifier that predicts whether a person in Germany likes a particular movie, given their purchase history. The resulting classifier could be very similar to the one you would get for, say, Dutch customers, however quite different from Japanese customers. Yet, one possibly could still exploit useful information in the Japanese data set even for the German or Dutch customers. This is exactly where multitask learning comes in: Related problems are coupled together, leading to classifiers that exploit all available information. In practice, this is done by certain modification to the standard formulations of classifiers such as the SVM.
In this project, the student would work close to active research. As a first step they would be closely supervised in order to implement a few Multitask Learning algorithms available in literature, for example:
1. Convex Multitask Feature Learning (paper)
2. Learning Multiple Tasks using Manifold Regularization (paper)
Depending on the speed of progress, the student will then be involved in the efficient implementations of active research projects. The working prototypes will be provided in python and the task will be to come up with fast, efficient, well-tested and well-documented C++ implementations as part of SHOGUN.
This projects gives the student the chance to work close to ongoing academic research. It offers the student to learn about the optimization techniques needed for machine learning, getting from prototypes to "production-ready" implementations and of course the research field of multitask learning.
• #### Kernel Two-sample/Dependence test

Mentor: Arthur Gretton, Soeren Sonnenburg
Difficulty: Medium to Difficult
Requirements: C++ programming, Numerical algorithm implementations, Knowledge of probability distributions/statistical hypothesis tests and kernel methods, Reading Matlab Code

Hypothesis tests are an important tool for statistical analysis. Given samples from two distributions p and q, the goal is to test whether p=q. A test-statistic is computed from samples and being compared to a threshold. Based on this, the null-hypothesis (p=q) is or is not rejected.

For this poject, kernel versions of hypothesis tests are to be implemented, based on the "Maximum Mean Discrepancy" (MMD), (see [1], [2], [3], [4]). These would have for example application in the area of bioinformatics (microarray data distributions) or database attribute matching where data is high-dimensional. In addition, they would allow to test whether data comes from two different distributions in fields where data is not given in vector form (for example graphs) The idea of these tests is to find a well behaved (e.g. smooth) function which is large on the points drawn from p, and small (as negative as possible) on the points from q. The mean function values on the two samples is used to compute the test-statistic. When this is large, the samples are likely from different distributions.

Related to the two sample test are dependence tests, i.e. telling whether two samples are statistically (in)dependent, (see for example this wikipedia article). For this project, a kernel dependence test based on the Hilbert Schmidt Independence Criterion (HSIC), as described in [5], is to be implemented. As for the MMD tests, there exist multiple method of computing the significace threshold. See [6] for this and some slides for an intuitive description

There already exist Matlab implementations for both MMD [4] and HSIC [5]. For MMD, a good start would be the linear time statistic in [1, Corollary 22], where the Null-distribution is a Gaussian. A more complicated test is based on the asymptotic distribution of the Unbiased MMD Statistic [4, section 5]. A difficulty here is to compute the significance threshold. This should be done via two possibilities: a method based on bootstrap resampling (see [1], [2]) and a method based on the Eigenspectrum of the Gram matrix on the aggregate sample from p and q (see [3]), and a cheap approximation of the latter [3]. For HSIC, there are multiple versions for computing the significance threshold, in paritcular one based on bootstrapping [5] [6], and one faster consistent version [reference to come]

Another (optional, if the rest works) difficulty would be to pay a bit more attention to kernel (parameter) selection in context of the above tests. However, some usable heuristics exist.

The first practical step for the student is read and understand the mentioned papers to understand the ideas behind the methods. It is not important to gain deep mathematical insight but rather to know what is important for the implementation. There already exists a Matlab implementation which can be used as an orientation. Looking at/understanding the code would be the second step. Since Matlab has many numerical algorithms built-in, it might be necessary to include some of them into SHOGUN. There are yet no hypothesis/dependence tests in SHOGUN, therefore, a framework concept for these has to be thought of. This includes creating new classes/interfaces to make this as flexible as possible. All this has to be done in respect to existing interfaces, so a good knowledge of the SHOGUN framework is crucial/has to be acquired. The algorithms/interfaces have to be documented well and examples/tests for their usage have to be created. Creating examples may be non-trivial here since the data should be more complex in order. Finally, SHOGUN's usual interfaces to other languages have to be created and examples should be translated to at least some of the language bindings.

This project offers a great opportunity to get familiar with kernel statistical hypothesis testing. The mentioned methods were all developed quite recently by well-known scientists. Since there already exists a working implementation in Matlab, coding has not to be done completely from scratch. From the implementation side, students will get to know the SHOGUN framework since the are expanding it. Also experience writing numerical algorithms will be gathered. Implementing algorithms for scientific open-source software in any case is a great experience.

[1] Main reference for MMD (long technical report): "A Kernel Method for the Two-Sample Problem" Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schökopf, Alexander J. Smola
[2] Short summary of [1]: "A KernelMethod for the Two-Sample-Problem" Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola
[3] A better method for approximating the null-distribution of the asymptotic distribution of the Unbiased MMD Statistic: "A Fast, Consistent Kernel Two-Sample Test", Gretton, Arthur Fukumizu, Kenji Harchaoui, Zaid
[4] Matlab implementation and references: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm
[5] Main reference for HSIC: "A Kernel Statistical Test of Independence" Gretton, A. K. Fukumizu C.-H. Teo L. Song, B Schoelkopf A. Smola NIPS 21, 2007.
[6] References and implementation of the kernel dependence test: http://people.kyb.tuebingen.mpg.de/arthur/indep.htm
• #### Implementation of / Integration via existing GPL code of latent SVMs.

Mentor: Alexander Binder
Requirements: C++ programming, machine learning
Difficulty: Difficult

SVMs with latent variables
1. checking what implementations licensed under the GPL exist in C/C++
2. to what kind of problems are they practically applicable
3. if necessary implement one latent SVM model
4. implement templates for minimization over user-defined latent variables (difficult!)
5. implement dynamic interfaces
6. implement static interfaces for important application cases
7. implement more than one solution strategy
8. implement heuristics for dealing with large sample sizes
9. check such heuristics for their efficiency on large scale data
• http://www.cs.cornell.edu/~cnyu/latentssvm/, http://research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf are an ongoing success story in Object detection approaches where the latent variable is the unknown position of an object center and object parts. Felzenswalbs approach received the "Lifetime Achievement" Prize based on its impact in research (see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html) at the Pascal VOC2010 workshop. Latent parameters are equally appealing for disccriminative methods in other application areas as well.

The task consists of The challenge in implementation lies in solving the non(!)-convex optimization problem in an way which is acceptable for practitioners , including allowing for templates for minimization over the latent variable, which are desried to be consistent with the interfaces to other languages, termination after a maximal runtime or heuristic frameworks which allow to iterate it on smaller chunks of data while keeping only the hardest samples.
• #### Implementation of Boosting type learning method for non-constantly dimensional features defined as sets of fixed dim vectors with varying set size.

Mentor: Alexander Binder
Difficulty: Medium
Requirements: C++ programming, machine learning

http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf is a method generally usable (outside its original domain) for features which are sets of fixed dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of constantly (fixed) dimensional features which could extend the usability of the Shogun toolbox at this edge.