
Ideas for Google Summer of Code 2011 Projects
Shogun is a software library for machine learning algorithms, with a special focus on large-scale applications. This summer we are looking to extend the library in four different ways: improving interfaces to other machine learning libraries (or integrating them where appropriate), improving I/O support, framework improvements, and new machine learning algorithms. Below is a list of project suggestions.

Table of Contents
- Interfacing and Integrating other Machine Learning Libraries
- Input/Output
- Framework Improvements
- New Machine Learning Algorithms

List of Ideas

Interfacing and Integrating other Machine Learning Libraries
Interfaces to other languages

Implement modular Java interface
Mentor: Mikio Braun. Difficulty: Easy. Requirements: C++ programming, Java, SWIG
Investigate the status of numerical Java and implement SWIG typemaps to translate numerical Java variables (matrices, vectors, sparse vectors, strings) into shogun representations. To begin with, develop SWIG typemaps for jblas. It might also be worthwhile to make things like JRuby, Jython etc. work with the Java bindings of shogun.
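To give an idea of the shape of this task, here is a minimal sketch of a SWIG (Java module) typemap for a dense vector. The jblas unwrapping via DoubleMatrix.toArray() and the (double*, int) target signature are assumptions for illustration; real shogun typemaps would also cover matrices, sparse vectors and strings:

    /* sketch of a SWIG interface (.i) fragment: accept a jblas
     * DoubleMatrix on the Java side and hand its data to C++ */
    %typemap(jni)    (double* vec, int len) "jdoubleArray"
    %typemap(jtype)  (double* vec, int len) "double[]"
    %typemap(jstype) (double* vec, int len) "org.jblas.DoubleMatrix"
    %typemap(javain) (double* vec, int len) "$javainput.toArray()"

    %typemap(in) (double* vec, int len) {
        /* copy the Java array into a C pointer + length pair */
        $2 = JCALL1(GetArrayLength, jenv, $input);
        $1 = (double*) JCALL2(GetDoubleArrayElements, jenv, $input, 0);
    }

Matrices would additionally need their row/column counts passed along, which is where most of the typemap work lies.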

Implement modular ruby interface.
Mentor: Mikio Braun. Difficulty: Easy. Requirements: C++ programming, Ruby, SWIG
Investigate the status of numerical Ruby and implement SWIG typemaps to translate numerical Ruby variables (matrices, vectors, sparse vectors, strings) into shogun representations.

Input/Output

Reading and writing of shogun features / objects in .hdf5 (pytables/matlab/h5py) format
Mentor: Soeren Sonnenburg. Difficulty: Easy to Medium. Requirements: C++ programming
Shogun already has partial support for the Hierarchical Data Format version 5 (HDF5) in CHDF5File. This enables loading and saving files in Python, R, Octave, Matlab, ..., sometimes providing better HDF5 support than the native implementations. The task would be to extend HDF5 support by implementing
- variable-length data type support
- symbolic links
- compression support
- slicing support
- attribute support
Extensive information about HDF5 can be found in its reference manual.
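As a taste of the API work involved, here is a hedged sketch (against the plain HDF5 C API that CHDF5File already uses) of two of the listed items: creating a variable-length datatype and a chunked, deflate-compressed dataset. Dataset name and shapes are made up; error handling is omitted:

    #include <hdf5.h>

    void write_compressed_features(hid_t file)
    {
        hsize_t dims[2]  = {1000, 20};
        hsize_t chunk[2] = {100, 20};

        // variable-length datatype, one of the listed extensions
        // (shown only for the API entry point, not used below)
        hid_t vltype = H5Tvlen_create(H5T_NATIVE_DOUBLE);

        // chunking is a prerequisite for compression in HDF5
        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(plist, 2, chunk);
        H5Pset_deflate(plist, 6);   // gzip level 6

        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/features", H5T_NATIVE_DOUBLE,
                                 space, H5P_DEFAULT, plist, H5P_DEFAULT);

        H5Dclose(dset); H5Sclose(space); H5Pclose(plist); H5Tclose(vltype);
    }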

Reading and writing of matlab 7.0 files
Mentor: Soeren Sonnenburg. Difficulty: Easy to Medium. Requirements: C++ programming
The Matlab 7 file format specification is publicly available online. In addition, Octave has support for reading files up to version 7.2. The task is to develop a file I/O class based on this specification that allows reading and writing .mat files directly. Note that one can borrow a lot of code from Octave here (see the files src/ls-mat4.cc and load-save.cc).
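For orientation, a minimal sketch of parsing the 128-byte header that the level-5 MAT-file specification prescribes (116 bytes of descriptive text, an 8-byte subsystem data offset, a 2-byte version field and the 2-byte endian indicator "MI"); the function name is illustrative, and the real work lies in the tagged, possibly zlib-compressed data elements that follow the header:

    #include <cstdio>
    #include <stdint.h>

    bool read_mat5_header(FILE* f, bool& swap_bytes)
    {
        char header_text[116];
        uint8_t subsys_offset[8];
        uint16_t version, endian;

        if (fread(header_text, 1, sizeof(header_text), f) != sizeof(header_text) ||
            fread(subsys_offset, 1, sizeof(subsys_offset), f) != sizeof(subsys_offset) ||
            fread(&version, sizeof(version), 1, f) != 1 ||
            fread(&endian, sizeof(endian), 1, f) != 1)
            return false;

        // the writer stores the characters 'M','I'; reading them back as
        // "IM" means the file comes from a machine of opposite endianness
        swap_bytes = (endian != (('M' << 8) | 'I'));
        return true;
    }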

Reading and writing of octave 3.x files
Mentor: Soeren Sonnenburg. Difficulty: Easy. Requirements: C++ programming
Natively support loading and writing of Octave files. One can make use of the functionality already implemented in Octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}), so this task involves a lot of code borrowing.

Framework Improvements

Build a well-designed and flexible cross-validation framework into shogun
Mentor: Soeren Sonnenburg. Difficulty: Difficult. Requirements: C++ programming, basic machine learning
This should include N-fold cross-validation, simple validation, repeated sampling, etc. It should be possible to easily exchange the strategy by which examples are split up (e.g., in computational biology this could mean that genes in a particular split come from a certain genomic region); see the interface sketch after the task list. This is considered difficult only because you will need to deeply understand the internals of shogun objects. The involved tasks are:
- Understand shogun's parameter framework and extend it to flag parameters for use in model selection.
- Implement a class CParameterSetting for one particular configuration the learning machine will be run with.
- Implement a class CSettingGenerator that, based on parameter inputs, generates one particular CParameterSetting.
- Implement various model selection schemes (train/validation/test split, N-fold cross-validation, leave-one-out cross-validation, nested cross-validation, ...).
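A minimal sketch of how exchangeable splitting strategies could look. CParameterSetting and CSettingGenerator are named in the tasks above; the splitting classes below are illustrative assumptions:

    #include <vector>

    // one train/validation split over example indices
    struct CSplit
    {
        std::vector<int> train_idx;
        std::vector<int> val_idx;
    };

    // strategy interface: derive and override generate() to change how
    // examples are split up (N-fold, leave-one-out, genomic regions, ...)
    class CSplittingStrategy
    {
    public:
        virtual ~CSplittingStrategy() {}
        virtual std::vector<CSplit> generate(int num_examples) const = 0;
    };

    class CNFoldSplitting : public CSplittingStrategy
    {
    public:
        explicit CNFoldSplitting(int n) : m_n(n) {}

        std::vector<CSplit> generate(int num_examples) const
        {
            std::vector<CSplit> splits(m_n);
            for (int i = 0; i < num_examples; i++)
            {
                int fold = i % m_n;   // a real version would shuffle first
                for (int f = 0; f < m_n; f++)
                    (f == fold ? splits[f].val_idx
                               : splits[f].train_idx).push_back(i);
            }
            return splits;
        }

    private:
        int m_n;
    };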

Implement framework for online learning
Mentor: Soeren Sonnenburg. Difficulty: Easy. Requirements: C++ programming, machine learning
Currently, shogun assumes that the number of data points is known beforehand. This is troublesome for online learning algorithms. Implement 'online features' with calls to get_next_feature_vector() that always return the next feature vector or NULL. Implement it in a way that data is either read from a file, pipe or socket; a sketch is given below. This is closely related to the Vowpal Wabbit integration idea, and one will be able to borrow code from there (http://hunch.net/~vw/). In a next step, convert online algorithms such as SGD, LaRank and LibLinear to work with those features.
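A sketch of what such online features could look like. Only get_next_feature_vector() comes from the task description; the class name, fixed dimensionality and plain-text input format are assumptions:

    #include <cstdio>

    class COnlineDotFeatures
    {
    public:
        // stream may be a file, a pipe or a fdopen()'d socket
        COnlineDotFeatures(FILE* stream, int dim)
        : m_stream(stream), m_dim(dim), m_buf(new double[dim]) {}

        ~COnlineDotFeatures() { delete[] m_buf; }

        // returns the next vector or NULL once the stream is exhausted,
        // so online learners simply loop until NULL
        const double* get_next_feature_vector()
        {
            for (int j = 0; j < m_dim; j++)
                if (fscanf(m_stream, "%lf", &m_buf[j]) != 1)
                    return NULL;   // EOF (or parse error) ends the stream
            return m_buf;
        }

    private:
        FILE* m_stream;
        int m_dim;
        double* m_buf;
    };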

Build a generic structured output learning framework
Mentors: Jonas Behr, Nico Goernitz. Difficulty: Difficult. Requirements: C++ programming, machine learning
In many cases machine learning problems result in a classification task where one seeks an algorithm that decides whether a given data point belongs to one class or the other. Among the most powerful and successful methods for solving such problems is the famous support vector machine (SVM). Naturally, its input data consists of high-dimensional lists of features capturing properties of the underlying objects (e.g. weight, size, color, ...). Sometimes these objects are more complex and have a built-in structure, like trees and sequences. However, the output is restricted to a binary decision and hence is not helpful when it comes to predicting, e.g., the most likely sequence for speech recognition or genome annotation. To overcome this problem while retaining the characteristics of the SVM, the structured-output support vector machine (SOSVM) has been proposed (Tsochantaridis et al. [3]). Another powerful class of structured output predictors is the conditional random field (CRF, Lafferty et al. [4]). It shares important properties with the SOSVM, but performance may differ depending on the application at hand.
Here, the student will implement a generic SOSVM and CRF framework to enable handling structured output problems and make it an ideal base for specific user-defined implementations. A crucial point is, of course, speed, and therefore the implementation covers only primal versions of the SOSVM. However, this is not a big restriction because most applications already have a rich representation. As an application, she/he will implement a hidden (semi-)Markov SOSVM which will be used for transcript mapping or gene prediction. There is already an implementation available in Matlab/C [1]. Code for Viterbi decoding of semi-Markov models is already available in shogun [2].
While the programming difficulty is only medium, a basic understanding of machine learning is mandatory. This renders the whole problem difficult.
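The generic part of the framework boils down to a small interface that applications implement: a joint feature map Psi(x, y), a loss Delta(y, y') and a (loss-augmented) argmax. The primal SOSVM and CRF solvers are then written purely against this interface. The sketch below is illustrative; none of these classes exist in shogun yet:

    #include <vector>

    typedef std::vector<double> Vec;
    typedef std::vector<int>    Label;   // e.g. a state sequence

    class CStructuredApplication
    {
    public:
        virtual ~CStructuredApplication() {}

        // joint feature map Psi(x, y)
        virtual Vec joint_feature(int example_idx, const Label& y) const = 0;

        // loss Delta(y_true, y_pred), e.g. Hamming loss over states
        virtual double loss(const Label& y_true, const Label& y_pred) const = 0;

        // argmax_y { <w, Psi(x,y)> + Delta(y_true, y) }; for the hidden
        // (semi-)Markov SOSVM this is (semi-Markov) Viterbi decoding [2],
        // and y_true is NULL at prediction time (no loss augmentation)
        virtual Label argmax(const Vec& w, int example_idx,
                             const Label* y_true) const = 0;
    };

For the multiclass SVM test case in the timetable, joint_feature() simply places x into the block of the weight vector selected by y, and argmax() is a loop over the classes.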
Timetable:
- Get used to shogun (especially how to read and write data) [2].
- Test the HMSVM toolbox with toy data [1].
- Get a basic grasp of what the SOSVM algorithm looks like.
- Implement the standard SOSVM algorithm (a QP solver is provided).
- Test the SOSVM by implementing a multiclass SVM (here the joint feature map and argmax function are especially easy; this is also the first application).
- Implement the HMSVM application (define specific loss and argmax functions).
- Implement a CRF and test it on the same application.
[1] An implementation of the hidden Markov SOSVM in Matlab/C is available at http://mloss.org/software/view/250/
[2] The shogun framework as well as examples and plenty of documentation can be downloaded from http://www.shogun-toolbox.org
[3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun: Support vector machine learning for interdependent and structured output spaces, ICML 2004.
[4] J. Lafferty, A. McCallum, F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001.

General Solver framework for convex problems.
Mentor: Gunnar Raetsch. Difficulty: Difficult (this might be too ambitious). Requirements: C++ programming, optimization, machine learning
There exists cvx for Matlab, which is useful but lacks some natural options such as limiting the number of iterations. Convexity can be checked via algebraic rules (e.g., sums of convex functions are convex) applied to a set of convex atoms. The aim is to establish a framework with one solving strategy initially, rather than to get perfect solutions for all kinds of convex problems.
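The rule-based convexity check can be pictured as curvature propagation over an expression tree, in the style of disciplined convex programming. A tiny, hedged sketch of two such rules:

    // curvature known for atoms (norms, quadratics, ...) and propagated
    // upward through composition rules
    enum Curvature { CONVEX, CONCAVE, AFFINE, UNKNOWN };

    // rule: a sum of convex functions is convex (affine is both)
    Curvature sum(Curvature a, Curvature b)
    {
        if (a == AFFINE) return b;
        if (b == AFFINE) return a;
        if (a == b) return a;      // convex+convex, concave+concave
        return UNKNOWN;            // convex+concave cannot be classified
    }

    // rule: negation flips curvature
    Curvature negate(Curvature a)
    {
        if (a == CONVEX)  return CONCAVE;
        if (a == CONCAVE) return CONVEX;
        return a;                  // affine/unknown unchanged
    }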

New Machine Learning Algorithms

Implementation of the Reduced Set Method for the RBF kernel
Mentor: Vojtech Franc. Difficulty: Medium. Requirements: C++ programming, Matlab or Octave, basic math/statistics
Gaussian kernels are among the most popular dissimilarity measures used in practical applications of kernel methods. Their popularity is due to the flexibility of the decision rule, which is expressed as a linear combination of Gaussian kernels. The discriminative power of such a rule can be smoothly changed from linear to arbitrarily complex by tuning a single parameter (the kernel width). This eminent flexibility is paid for by the high complexity of the decision rule, i.e. the number of Gaussian kernels appearing in the linear combination. High complexity of the decision rule has a detrimental impact on decision time, which is one of the most important criteria in practice. The Reduced Set (RS) method is a technique which approximates a complex kernel-based decision rule by a simple one using a prescribed number of kernels. The core of the RS method is to solve the so-called pre-image problem. In the case of Gaussian kernels, the pre-image problem leads to highly non-convex smooth optimization. The goal of the project is to implement the existing techniques for its solution, namely:
- B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. DAGM, 1998.
- J.T. Kwok and I.W. Tsang. The pre-image problem in kernel methods. ICML, 2003.
- a plain gradient-based method.
As a starting point, the student can use the Matlab implementations of the RS methods which are available in the STPRtool toolbox.
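For concreteness, a hedged sketch of the classical fixed-point iteration from the first reference above, which finds a single pre-image z approximating sum_i alpha_i phi(x_i) for the Gaussian kernel; names and the scalar kernel-width convention are illustrative:

    #include <cmath>
    #include <vector>

    typedef std::vector<double> Vec;

    static double gauss(const Vec& a, const Vec& b, double width)
    {
        double d = 0.0;
        for (size_t j = 0; j < a.size(); j++)
            d += (a[j] - b[j]) * (a[j] - b[j]);
        return std::exp(-d / width);
    }

    // iterate z <- sum_i alpha_i k(x_i, z) x_i / sum_i alpha_i k(x_i, z)
    Vec preimage(const std::vector<Vec>& x, const Vec& alpha,
                 double width, Vec z, int iters)
    {
        for (int it = 0; it < iters; it++)
        {
            Vec num(z.size(), 0.0);
            double den = 0.0;
            for (size_t i = 0; i < x.size(); i++)
            {
                double w = alpha[i] * gauss(x[i], z, width);
                for (size_t j = 0; j < z.size(); j++)
                    num[j] += w * x[i][j];
                den += w;
            }
            if (std::fabs(den) < 1e-12)
                break;   // degenerate point; restart from another z
            for (size_t j = 0; j < z.size(); j++)
                z[j] = num[j] / den;
        }
        return z;
    }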

Implement Kernel Feature Analysis (KFA)
Mentor: Konrad Rieck. Difficulty: Medium. Requirements: C++ programming, machine learning
Dimension reduction is a prerequisite in many practical applications of machine learning. Kernel Feature Analysis (KFA) is an algorithm for performing dimension reduction using kernel functions. In contrast to other kernel-based techniques for dimension reduction, KFA determines a low-dimensional subspace using sparse components and is thus suitable for large-scale applications. KFA is described in the following two papers. The algorithms can be implemented as sketched in the work of Jiang et al., where further tricks for improving the runtime are discussed.
- Sparse Kernel Feature Analysis. Smola et al. (1999)
- Accelerated Kernel Feature Analysis. Jiang et al. (2006)

ML estimation of the parameters of Gaussian Mixture Models with a carefully implemented EM algorithm
Mentor: Vojtech Franc. Difficulty: Difficult. Requirements: C++ programming, Matlab or Octave, basic math/statistics
The Expectation-Maximization (EM) algorithm for estimating the parameters of a Gaussian Mixture Model is a standard tool for probability density modeling. Despite the simplicity of the EM algorithm itself, programming a robust implementation which is applicable in practice is a tricky problem. The difficulty stems from numerical problems inherently involved in the algorithm, e.g. summation of extremely small numbers or inverting ill-conditioned covariance matrices. The goal of the project is a robust implementation which uses several computational tricks and techniques, e.g. performing summation of small numbers in the log domain, using the SVD decomposition to represent covariance matrices, removing "singular" Gaussian components, etc. An initial Matlab implementation will be provided by the mentor.
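One of the computational tricks named above is summing extremely small probabilities in the log domain, so that E-step responsibilities stay finite even when all component densities underflow. A standard log-sum-exp helper, shown here as a sketch:

    #include <cmath>
    #include <vector>

    // computes log(sum_i exp(log_p[i])) without underflow: after
    // subtracting the maximum, every exponent is <= 0 and the largest
    // term is exactly 1, so the sum never degenerates to log(0)
    double log_sum_exp(const std::vector<double>& log_p)
    {
        double m = log_p[0];
        for (size_t i = 1; i < log_p.size(); i++)
            if (log_p[i] > m)
                m = log_p[i];

        double s = 0.0;
        for (size_t i = 0; i < log_p.size(); i++)
            s += std::exp(log_p[i] - m);

        return m + std::log(s);
    }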

Implement missing machine learning algorithms
Mentor: Christian Widmer. Difficulty: Easy to Medium. Requirements: C++ programming, basic machine learning, potentially basic skills in Octave/Matlab/Python
Shogun provides a rich set of kernel implementations, including (but not limited to) a rich set of string kernels particularly relevant for applications in computational biology. Traditionally, the focus of shogun has been on Support Vector Machines (SVMs); however, there exist numerous other machine learning methods that could also benefit from the set of kernels provided by shogun. In particular, we have the following two in mind:
- Gaussian Processes
- Kernel Logistic Regression
In addition, a whole range of preprocessing methods is available that would greatly complement the features already in shogun. These include:
- Kernel PCA (KPCA)
- Stochastic Neighbor Embedding (SNE)
- Locally Linear Embedding (LLE)
- ISOMAP
- Multidimensional Scaling (MDS)
For this task, the first step is to scan mloss.org and similar sites for suitable implementations. Once these are identified, the student will have to port the respective code to the shogun code base, making sure to respect existing interfaces (and add new ones, if necessary). Lastly, documentation, examples and unit tests will have to be written for all of the integrated methods.
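As an example of the numerical cores these ports contain, the KPCA item above reduces to centering the kernel matrix in feature space before the eigendecomposition. A hedged sketch (function name and row-major layout are illustrative):

    #include <vector>

    // in-place centering of a symmetric n x n kernel matrix K:
    // K' = K - 1n*K - K*1n + 1n*K*1n, where 1n = (1/n) * ones(n,n);
    // KPCA then eigendecomposes the centered matrix
    void center_kernel_matrix(std::vector<double>& K, int n)
    {
        std::vector<double> mean(n, 0.0);
        double total = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                mean[i] += K[i*n + j] / n;               // row mean (= column mean)
                total   += K[i*n + j] / (double(n) * n); // overall mean
            }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                K[i*n + j] += total - mean[i] - mean[j];
    }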
While this project does not involve implementing novel ML methods from scratch, it offers a great opportunity for a student to get familiar with a whole range of highly relevant machine learning methods in theory and practice. On top of that, the student will gather experience in contributing to a large scientific software project.

Implement SVMbright, a GPL'ed version of SVMlight, and make it publicly available
Mentor: Soeren Sonnenburg. Difficulty: Difficult. Requirements: C++ programming, support vector machines, optimization theory
SVMlight is one of the first SVM toolboxes. While efficient, it is not open source. The task is to implement the algorithms described in Joachims (1998) as free software.

Implementation of latent SVMs, or integration of existing GPL code
Mentor: Alexander Binder. Difficulty: Difficult. Requirements: C++ programming, machine learning
For SVMs with latent variables, the task consists of:
- checking what implementations licensed under the GPL exist in C/C++, and to what kinds of problems they are practically applicable
- if necessary, implementing one latent SVM model
- implementing templates for minimization over user-defined latent variables (difficult!)
- implementing dynamic interfaces
- implementing static interfaces for important application cases
- implementing more than one solution strategy
- implementing heuristics for dealing with large sample sizes
- checking such heuristics for their efficiency on large-scale data
http://www.cs.cornell.edu/~cnyu/latentssvm/, http://research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf are part of an ongoing success story in object detection approaches, where the latent variable is the unknown position of an object center and its parts. Felzenszwalb's approach received the "Lifetime Achievement" prize at the Pascal VOC2010 workshop, based on its impact in research (see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html). Latent parameters are equally appealing for discriminative methods in other application areas as well.
The challenge in the implementation lies in solving the non(!)-convex optimization problem in a way that is acceptable for practitioners. This includes templates for minimization over user-defined latent variables (which should be consistent with the interfaces to other languages), termination after a maximal runtime, and heuristic frameworks which allow iterating over smaller chunks of data while keeping only the hardest samples. A sketch of the standard alternating training scheme is given below.
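The linked latent SVM work trains by alternating between imputing the latent variables and solving the then-convex problem; because the overall objective is non-convex, this only reaches a local optimum. A hedged sketch, with App as a hypothetical stand-in for the user-defined templates mentioned above:

    #include <vector>

    typedef std::vector<double> Vec;

    template <class Example, class Latent, class App>
    Vec train_latent_svm(const std::vector<Example>& data, App& app,
                         Vec w, int outer_iters)
    {
        for (int it = 0; it < outer_iters; it++)
        {
            // step 1: complete the data with the best latent assignment
            // under the current model (a user-defined argmax)
            std::vector<Latent> h(data.size());
            for (size_t i = 0; i < data.size(); i++)
                h[i] = app.best_latent(w, data[i]);

            // step 2: with the latent variables fixed, the problem is a
            // convex (structural) SVM that an existing solver can handle
            w = app.solve_convex_svm(data, h);
        }
        return w;
    }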

Implementation of a boosting-type learning method for features of non-constant dimension, defined as sets of fixed-dimensional vectors with varying set size
Mentor: Alexander Binder. Difficulty: Medium. Requirements: C++ programming, machine learning
The method of http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf is generally usable (outside its original domain) for features which are sets of fixed-dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of fixed-dimensional features and could extend the usability of the shogun toolbox in this direction.
The task consists of:
- implement the model and solution algorithm
- implement a parallelization of the algorithm
- implement heuristics for dealing with large sample sizes
- check such heuristics for their efficiency on large-scale data
- implement dynamic and static interfaces
The challenge here would be to implement it in a scalable way for larger sample sizes, possibly via parallelization or via heuristics that run it over chunks of the training data. A sketch of the underlying feature type follows.
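To make the feature type concrete, a hedged sketch of an example represented as a set of fixed-dimensional vectors, with a decision-stump-style weak learner that pools over the set by maximal response. This illustrates the data abstraction only, not the exact algorithm of the linked paper:

    #include <vector>

    typedef std::vector<double> Vec;
    typedef std::vector<Vec>    VecSet;  // varying number of fixed-dim vectors

    struct SetStump
    {
        Vec    direction;   // fixed-dimensional projection
        double threshold;

        int classify(const VecSet& example) const
        {
            double best = -1e300;
            for (size_t i = 0; i < example.size(); i++)
            {
                double r = 0.0;
                for (size_t j = 0; j < direction.size(); j++)
                    r += direction[j] * example[i][j];
                if (r > best)
                    best = r;   // max-pooling over the set members
            }
            return best > threshold ? +1 : -1;
        }
    };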
