|
Ideas for Google Summer of Code 2011 Projects
Shogun is a software library for machine learning algorithms, with a special focus on large-scale applications. This summer we are looking to extend the library in four different ways: Improving interfaces to other machine learning libraries or integrating them when appropriate, improved i/o support, framework improvements and new machine algorithms. Below is listed a set of suggestions for projects.
back to shogun toolbox homepage
|
|
Table of Contents
|
Interfacing and Integrating other Machine Learning Libraries
|
Input/Output
|
Framework Improvements
|
New Machine Learning Algorithms
|
List of Ideas
|
Interfacing or integrating other Machine Learning Libraries
Interfaces to other languages
-
Implement modular Java interface
Mentor: Mikio Braun
Difficulty: Easy
Requirements: C++ programming, java, swig
Investigate the status of numerical java and implement swig typemaps to
translate numerical java variables (matrices, vectors, sparse vectors,
strings) into shogun representations. To begin with, develop swig-typemaps
for jblas. It might be worth to have things
like JRuby, Jython etc. work with the Java bindings of shogun.
Implement modular ruby interface.
Mentor: Mikio Braun
Difficulty: Easy
Requirements: C++ programming, ruby, swig
Investigate the status of numerical ruby and implement swig typemaps to
translate numerical ruby variables (matrices, vectors, sparse vectors,
strings) into shogun representations.
|
Input/Output
Reading and writing of shogun features / objects in .hdf5 (pytables/matlab/h5py) format
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming
Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python,r,octave,matlab,... sometimes providing better hdf5 support than any of the native implementions. The task would be
to extend hdf5 support by implementing
- varialbe length data types support
- symbolic links
- compression support
- slicing support
- attribute support
Extensive information about hdf5 can be found in its reference manual.
Reading and writing of matlab 7.0 files
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming
Matlab 7 file format specifications are publicly available online. In addtion Octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that
allows for reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc.
Reading and writing of octave 3.x files
Mentor: Soeren Sonnenburg
Difficulty: Easy
Requirements: C++ programming
Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}. So this task involves a lot of code borrowing.
|
Framework Improvements
Built well-designed and flexible cross-validation framework into shogun
Mentor: Soeren Sonnenburg
Difficulty: Difficult
Requirements: C++ programming, basic machine learning
This should include N-fold cross-validation, simple validation, repeated sampling etc.
It should be possible to easily exchange the strategy of how examples are split up (e.g. in comp-bio this, could mean that genes in a particular split come from a certain genomic region). This is considered difficult only because you will need to deeply understand the internals of shogun objects. The involved tasks are
- Understand shogun's parameter framework and extend it to flag parameters for use in model selection.
- Implement a class CParameterSetting for one particular configuration the learning machine will be run with
- Implement a class CSettingGenerator class that based on parameter inputs generates one particular CParameterSetting
- Implement various model selection schemes (train,validation,test split, n-fold cross validation, leave one out cross validation, nested cross validation,...)
Implement framework for online learning
Mentor: Soeren Sonnenburg
Difficulty: Easy
Requirements: C++ programming, machine learning
Currently within shogun it is assumed that the number of data points is known beforehand. This may be troublesome for online learning algorithms.
Implement 'online features' with calls to get_next_feature_vector() that always returns another object or NULL. Implemnent it in a way hat is either read from a file, pipe or socket. This is closely related to the vowpal vabbit integration above and one will be able to borrow code from there
http://hunch.net/~vw/.
In a next step convert online algorithms such as sgd, larank, liblinear to work with those features.
Built generic structured output learning framework
Mentor:Jonas Behr,
Nico Goernitz
Difficulty: Difficult
Requirements: C++ programming, machine learning
In many cases machine learning problems result in a classification task where
one seeks an algorithm that decides whether a given data point is
of one or the other class. Among the most powerful and successful methods
solving such problems is the famous support vector machine (SVM).
Naturally it's input data consist of
high-dimensional lists of features capturing properties of the
underlying objects (e.g. weight, size, color, ...). Sometimes these objects
are more complex and have a build-in structure like trees and sequences.
However, the output is restricted to a binary decision and hence, is not
helpful when it comes to predicting e.g. the most likely sequence for speech recognition or genome annotation.
To overcome this problem but retain the characteristics of the
SVM the structured-output support vector machine (SO-SVM) has been proposed
(Tsochantaridis et al [3]). Another powerful class of structured output
predictors is the conditional random field (CRF, Lafferty et al [4]). It shares important properties with the SO-SVM but
performance might differ depending on the application at hand.
Here, the student will implement a generic SO-SVM and CRF framework to enable handling
structured output problems and make it an ideal base for
specific user-defined implementations.
A crucial point is - of course - speed and therefore implementation
covers only primal versions of the SO-SVM. However, this is not a big restriction
because most applications already have a rich representation.
As application she/he will implement a hidden (semi-)Markov SO-SVM which
will be used for transcript mapping or gene prediction.
There is already an implementation available in
Matlab/C [1]. Code for Viterbi-decoding of semi-Markov models is already available in Shogun [2].
While the programming difficulty is only medium a basic understanding in
machine learning is mandatory. This renders the whole problem difficult.
Timetable
- get used to shogun (especially how to read and write data) [2]
- test HM-SVM toolbox with toy data [1]
- get a basic grasp of how the SO-SVM algorithm looks like
- implement the standard SO-SVM algorithm (a QP-solver is provided)
- test the SO-SVM by implementing a multi-class SVM (here the
joint feature map and argmax-function are especially easy,
this is also the first application)
- implement HM-SVM application (define specific loss- and argmax-functions)
- implement a CRF and test on the same application
[1] an implementation of the hidden Markov SO-SVM in Matlab/C is available at http://mloss.org/software/view/250/
[2] Shogun framework as well as examples and plenty of doumentation can be downloaded from http://www.shogun-toolbox.org
[3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun: Support vector machine learning for interdependent and structured output spaces, ICML 2004
[4] J. Lafferty, A. McCallum, F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001
General Solver framework for convex problems.
Mentor: Gunnar Raetsch
Difficulty: Difficult (this might be too ambitious)
Requirements: C++ programming, optimization, machine learning
There exists cvx for matlab which is useful but lacks some natural options like number of iterations. Convexity can be checked via algebraic rules (like sums of convex functions are convex) and convex atoms. The aim is to establish a framework with one solving strategy initially rather then to get perfect solutions for all kinds of convex problems.
|
New Machine Learning Algorithms
Implementation of the Reduced Set Method for RBF kernel.
Mentor: Vojtech Franc
Difficulty: Medium
Requirements: C++ programming, matlab or octave, basic math/statistics
The Gaussian kernels are among the most popular dissimilarity measures used
in practical applications of kernel methods. Their popularity is due to the
flexibility of the decision rule expressed as a linear combination of the
Gaussian kernels. The discriminative power of such rule can be smoothly
changed by tuning a single parameter (the kernel width) from linear to
arbitrary complex one. The eminent flexibility is paid by high complexity of
the decision rule, i.e. the number of the Gaussian kernels appearing in the
linear combination. High complexity of the decision rule has a detrimental
impact on the decision time being one of the most important criteria in
practice. The Reduced Set (RS) Methods is a technique which approximates a
complex kernel-based decision rule by a simple one using a prescribed number
of kernels. The core of the RS methods is to solve so called pre-image
problem. In the case of the Gaussian kernels, the pre-image problem leads to
highly non-convex smooth optimization. The goal of the project is to
implement the existing techniques for its solution, namely:
- B.Schoekopf, P.Knirsch, and A.Smola, C.Burges. Fast
approximation of support vector kernel expansions, and an interpretation of
clustering as approximation in feature spaces. DAGM. 1998.
- J.T.Kwok and I.W.Tsang. The pre-image problem in kernel
methods. ICML. 2003.
- Plain gradient-based method.
As a starting point the student can use Matlab implementations of the
RS methods which are available in STPRtoolbox.
Implement Kernel Feature Analysis (KFA)
Mentor: Konrad Rieck
Difficulty: Medium
Requirements: C++ programming, machine learning
Dimension reduction is a prerequisite in many practical applications of machine learning. Kernel Feature Analysis (KFA) is an algorithm for performing dimension reduction using kernel functions. In contrast to the other kernel-based techniques for dimension reduction, KFA determines a low-dimensional subspace using sparse components and thus is suitable for large-scale application.
KFA is described in the following two papers. The algorithms can be implemented as sketched in the work of Jiang et al., where further tricks for improving the run-time are discusssed.
-
Sparse Kernel Feature Analysis.
Smola et al. (1999)
-
Accelerated Kernel Feature Analysis.
Jiang et al. (2006)
ML estimation of parameters of Gaussian Mixture Models with carefully
implemented EM algorithm.
Mentor: Vojtech Franc
Difficulty: Difficult
Requirements: C++ programming, matlab or octave, basic math/statistics
The Expectation-Maximization algorithm estimating parameters of the Gaussian
Mixture Model is a standard tool used probability density modeling. Despite
the simplicity of the EM algorithm itself programming a robust
implementation which would be applicable in practice is a tricky
problem. The difficulty stems from numerical problems inherently involved in
the algorithm, e.g. summation of extremely low numbers or inverting ill
conditioned covariance matrices. The goal of the project is a robust
implementation which uses several computational tricks and techniques,
e.g. performing summation of low numbers in logarithm domain, using SVD
decomposition to represent covariance matrices, removing "singular" Gaussian
components etc. An initial Matlab implementation will be provided by the mentor.
-
Implement missing machine learning algorithms
Mentor: Christian Widmer
Difficulty: Easy to Medium
Requirements: C++ programming, basic machine learning, potentially basic skills in Ocatve/MATLAB/Python
Shogun provides a rich set of kernel implementations, including (but not limited to) a rich set of string kernels particularly relevant for applications in Computational Biology.
Traditionally, the focus of Shogun has been on Support Vector Machines (SVMs),
however there exist numerous other Machine Learning methods, that could also
benefit from the set of kernels provided by Shogun. In particular, we have the
following two in mind:
- Gaussian Processes
- Kernel Logistic Regression (e.g. this implementation)
In addition to that, a whole range of pre-processing methods are available that would greatly complement the features that are in shogun already. These include:
- Kernel PCA (KPCA)
- Stochastic Neighborhood Embedding (SNE)
- Locally Linear Embedding (LLE)
- ISOMAP
- Multidimensional Scaling (MDS)
For this task, the first step is to scan mloss.org and similar sites for suitable implementations. Once these are identified, the student will have to port the respective code to the Shogun code base, making sure to respect existing interfaces (and add new ones, if necessary). Lastly, documentation, examples and Unit tests will have to be written for all of the integrated methods.
While this project does not involve the implementation of novel ML methods form scratch, it offers a great opportunity for a student to get familiar with a whole range of highly relevant Machine Learning methods in theory and practice. On top of that, the student will gather experience in contributing to a large scientific software project.
-
Implement SVMbright a GPL'ed version of SVMLight and make it publicly available
Mentor: Soeren Sonnenburg
Difficulty: Difficult
Requirements: C++ programming, support vector machines, optimization theory
SVMlight is one of the first SVM toolboxes. While efficient, it is not open source. The task is to implement the algorithms described in Joachims 98 as free software.
Implementation of / Integration via existing GPL code of latent SVMs.
Mentor: Alexander Binder
Requirements: C++ programming, machine learning
Difficulty: Difficult
SVMs with latent variables
http://www.cs.cornell.edu/~cnyu/latentssvm/,
research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf
are an ongoing success story in Object detection approaches where the
latent variable is the unknown position of an object center and object
parts. Felzenswalbs approach received the "Lifetime Achievement" Prize
based on its impact in research (see
http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html)
at the Pascal VOC2010 workshop. Latent parameters are equally
appealing for disccriminative methods in other application areas as well.
The task consists of
- checking what implementations licensed under the GPL exist in C/C++
- to what kind of problems are they practically applicable
- if necessary implement one latent SVM model
- implement templates for minimization over user-defined latent variables (difficult!)
- implement dynamic interfaces
- implement static interfaces for important application cases
- implement more than one solution strategy
- implement heuristics for dealing with large sample sizes
- check such heuristics for their efficiency on large scale data
The challenge in implementation lies in solving the non(!)-convex
optimization problem in an way which is acceptable for practitioners ,
including allowing for templates for minimization over the latent
variable, which are desried to be consistent with the interfaces to other languages,
termination after a maximal runtime or heuristic frameworks which allow
to iterate it on smaller chunks of data while keeping only the hardest
samples.
Implementation of Boosting type learning method for non-constantly dimensional features defined as sets of fixed dim vectors with varying set size.
Mentor: Alexander Binder
Difficulty: Medium
Requirements: C++ programming, machine learning
http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf
is a method generally usable (outside its original domain) for features which are sets of fixed dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of constantly (fixed) dimensional features which could extend the usability of the Shogun toolbox at this edge.
The task consists of
- implement the model and solution algorithm
- implement a parallelization of the algorithm
- implement heuristics for dealing with large sample sizes
- check such heuristics for their efficiency on large scale data
- implement dynamic and static interfaces
The challenge here would be to implement it in a scalable way for larger sample sizes, possibly by parallelization or several heuristics to run it over chunks of the training data.
|