Improving accessibility to shogun
Various usability improvements
Mentor: Soeren Sonnenburg
Difficulty: Easy to Difficult
 Requirements: C++ programming, python, swig, some machine learning background
 
Currently Shogun has a number of rough edges, the polishing of which would make it more accessible to others. The student's task would be to:
 
- Implement customizable SWIG interface classes (so-called director classes [1,2]), making it possible to extend certain class methods from an interface language. For fast prototyping one could, e.g., implement a kernel in python (or java or octave) which can then be used by any of shogun's SVMs (a sketch follows below). Other use-cases of such director classes include: using python classifiers as shogun machines within shogun's python-modular interface, prototyping new preprocessors, and implementing features with database support via some glue code.
- Implement a better python integration; e.g., for cross-validation it is currently very complex to specify the parameter settings to try out. In addition, python since version 2.6 supports the buffer protocol [3], which, when implemented for e.g. shogun's CSimpleFeatures or Labels, would enable accessing shogun objects like numpy arrays (to a certain extent; see the sketch after the references).
- Add sparse matrix support to the octave modular interface. The task would be to implement the required typemaps based on the python_modular ones (file swig_typemaps.i in interfaces/python_modular and interfaces/octave_modular).
- Replace the various ASSERT()'s in the code with more understandable SG_ERROR() messages.
- Convert the current implementation of SGVector [4] to reference counting.
- Various minor usability improvements.

This task offers good opportunities to see various aspects of shogun. A particular (fun!) challenge is the rather complex swig cross-language development process.
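A hypothetical sketch of what the director-class task would enable; DirectorKernel and kernel_function() are illustrative names only, standing in for whatever the director-enabled base class ends up being called:

```python
# Hypothetical sketch only: DirectorKernel and kernel_function() are
# illustrative names, not existing shogun API.
import numpy as np
from shogun.Kernel import DirectorKernel   # assumed director-enabled base class

class GaussianPyKernel(DirectorKernel):
    """A Gaussian kernel prototyped in python, usable by any shogun SVM."""
    def __init__(self, width):
        DirectorKernel.__init__(self)
        self.width = width

    def kernel_function(self, idx_a, idx_b):
        # assumes the base class exposes the attached feature objects
        a = self.get_lhs().get_feature_vector(idx_a)
        b = self.get_rhs().get_feature_vector(idx_b)
        return float(np.exp(-np.sum((a - b) ** 2) / self.width))
```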
		 [1] http://www.swig.org/Doc2.0/Python.html#Python_nn34
[2] http://www.swig.org/Doc2.0/Python.html#Python_nn34
 [3] http://www.python.org/dev/peps/pep-3118/
 [4] http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1SGVector.html
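A sketch of the desired user-facing behavior once the buffer protocol [3] is implemented for CSimpleFeatures; the zero-copy semantics shown here are the goal of the task, not something shogun currently supports:

```python
import numpy as np
from shogun.Features import RealFeatures   # python_modular import path

feats = RealFeatures(np.random.randn(5, 10))
# With PEP 3118 [3] implemented, numpy could wrap the features' memory
# directly instead of copying:
view = np.asarray(feats)   # desired zero-copy view -- the goal of the task
view[0, 0] = 1.0           # would then write through to the shogun object
```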
Integration with mldata.org
Mentor: Cheng Soon Ong
Difficulty: Medium
 Requirements: C++ programming, python, (optional django)
 
 The machine learning data set repository mldata.org provides data and task
downloads in a standardized format. Make shogun natively able to read these data formats and interact with the mldata.org website, i.e., by:
 
- adding read and write support for the hdf5-based data sets and tasks (see http://mldata.org/about/hdf5/)
- contributing features to mldata.org for command-line based up/downloading of data, tasks, methods and solutions
- contributing features to mloss.org for command-line based up/downloading of methods and other community features

Note that the optional part of this task aims to help the machine learning community portals, for which python and django skills are needed (both portal websites' source code is also open source).

Reading and writing of shogun features / objects in .hdf5 (pytables/matlab 7.0/matlab 7.X/octave 3.x/h5py/arff) format
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
 Requirements: C++ programming
 
Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python, r, octave, matlab, ... sometimes providing better hdf5 support than any of the native implementations. The task would be to extend hdf5 support by implementing:
 
- variable-length data type support
- symbolic links
- compression support
- slicing support
- attribute support
	Extensive information about hdf5 can be found in its reference manual.
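For orientation, a minimal h5py sketch exercising exactly the features listed above (file and dataset names are made up); CHDF5File would need to offer equivalents:

```python
# Exercises variable-length data, compression, attributes, symbolic links
# and slicing; file and dataset names are made up.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    vlen = h5py.special_dtype(vlen=np.dtype("float64"))
    ds = f.create_dataset("varlen", (3,), dtype=vlen)
    ds[0] = np.arange(5.0)                        # rows of differing length
    ds[1] = np.arange(2.0)
    dense = f.create_dataset("dense", data=np.random.randn(100, 10),
                             compression="gzip")  # compression support
    dense.attrs["name"] = "toy data"              # attribute support
    f["alias"] = h5py.SoftLink("/dense")          # symbolic link

with h5py.File("example.h5", "r") as f:
    part = f["dense"][10:20, :]   # slicing without loading the full dataset
```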
Matlab 7 file format specifications are publicly available online. In addition, octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that allows for reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc).
Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}), so this task involves a lot of code borrowing.
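For reference behavior, scipy already round-trips the documented matlab formats; a .mat file written like this is what the new C++ reader should parse directly:

```python
# scipy round-trips the documented matlab formats; the new C++ i/o class
# should read and write the same files.
import numpy as np
from scipy.io import loadmat, savemat

savemat("demo.mat", {"X": np.eye(3)})
X = loadmat("demo.mat")["X"]
assert np.allclose(X, np.eye(3))
```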
Framework Improvements
Integrate Existing and Implement New Machine Learning Algorithms

Integrate C5.0
Mentor: Soeren Sonnenburg
Difficulty: Medium
 Requirements: C++ programming, basic machine learning
 
C5.0 is among the most famous decision tree learners. It has just recently been released under the GPL, and thus it would be great to integrate it as a ``machine'' in shogun. Since the source code of C5.0 is in C, your task is to encapsulate C5.0 in C++ classes as a shogun machine using shogun's simple features as inputs.
Implement Olivier Chapelle's fast Newton-based SVM solver and integrate updated versions of liblinear/ocas
Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
 Requirements: C++ programming, basic octave or matlab, basic machine learning
 
 Code is available in matlab from
http://mloss.org/software/view/30/.
The task is to translate this code to C++ and integrate it into shogun. This involves:

- defining a new CLinearClassifier-derived class CNewtonSVM
- translating the few lines of matlab code (a rough numpy sketch of the Newton iteration follows below)
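For orientation, a rough numpy sketch (one common parameterization, not Chapelle's exact code, which adds a line search) of the primal Newton iteration for the L2-loss linear SVM:

```python
# Rough sketch, one common parameterization: minimize
#   0.5 * lam * ||w||^2 + sum_{i in sv} (1 - y_i <w, x_i>)^2
# over the margin violators sv; X is (n, d), y has entries in {-1, +1}.
import numpy as np

def newton_svm(X, y, lam=1.0, n_iter=20):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        sv = (y * X.dot(w)) < 1                        # margin violators
        Xs, ys = X[sv], y[sv]
        H = lam * np.eye(d) + 2.0 * Xs.T.dot(Xs)       # Hessian
        g = lam * w + 2.0 * Xs.T.dot(Xs.dot(w) - ys)   # gradient
        step = np.linalg.solve(H, g)
        w = w - step                    # full Newton step (no line search)
        if np.linalg.norm(step) < 1e-8:
            break
    return w
```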
Once familiar with shogun, students could integrate updated versions of liblinear and svmocas: new versions of liblinear and libocas are available that now support multiclass classification. The task is to:

- merge the newer versions into shogun
- integrate the new multi-class linear SVMs utilizing shogun's COFFIN framework implemented in CDotFeatures
- (optionally) implement the Local Binary Pattern (LBP) features, useful for gender recognition based on frontal face pictures

Bundle method solver for structured output learning
Mentor: Vojtech Franc
Difficulty: Medium to Difficult
 Requirements: C++ programming, some background in optimization, machine learning.
 
 
Learning of structured output (SO) classifiers leads to a convex minimization problem which is not tractable by standard algorithms. A significant effort in the ML community has been put into the development of specialized solvers, among which the Bundle Method for Risk Minimization (BMRM) [1], implemented e.g. in the popular StructSVM [2], is the current state-of-the-art. The BMRM is a simplified variant of the bundle methods which are standard tools for non-smooth optimization [3]. The simplicity of the BMRM comes at the cost of reduced efficiency. Experiments show that a careful implementation of the classical bundle methods performs significantly faster (speedup ~ 5-10) than the variants (like BMRM) adopted by the ML community. The goal will be an open-source library implementing the classical bundle method for SO learning and its integration into Shogun.
Tasks:

- Get familiar with Shogun.
- Implement the classical bundle method for SO learning.
- Integrate the algorithm into Shogun.
- Compare the implemented algorithm experimentally with StructSVM [2] and other SO solvers already integrated in Shogun.
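To make the cutting-plane structure concrete, a schematic numpy/scipy sketch of the basic BMRM loop; risk is a user-supplied callback, and the master problem is solved naively with SLSQP, where a serious implementation would use a dedicated QP solver and, for the classical bundle method, a proximal term and line search:

```python
# Schematic BMRM loop; risk(w) must return (R(w), subgradient of R at w).
import numpy as np
from scipy.optimize import minimize

def bmrm(risk, dim, lam=1.0, eps=1e-3, max_iter=100):
    w = np.zeros(dim)
    A, B = [], []                      # cutting planes: R(w) >= a.w + b
    f_best = np.inf
    for _ in range(max_iter):
        r, a = risk(w)
        f_best = min(f_best, 0.5 * lam * w.dot(w) + r)
        A.append(a)
        B.append(r - a.dot(w))
        # master problem in epigraph form over z = (w, xi):
        #   min lam/2 ||w||^2 + xi  s.t.  xi >= a_j.w + b_j for all planes j
        obj = lambda z: 0.5 * lam * z[:-1].dot(z[:-1]) + z[-1]
        cons = [{'type': 'ineq',
                 'fun': lambda z, aj=aj, bj=bj: z[-1] - aj.dot(z[:-1]) - bj}
                for aj, bj in zip(A, B)]
        z = minimize(obj, np.append(w, 0.0), method='SLSQP',
                     constraints=cons).x
        w = z[:-1]
        if f_best - obj(z) < eps:      # gap: best objective vs. model minimum
            break
    return w
```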
Literature:
[1] C.H. Teo, S.V.N. Vishwanathan, A.J. Smola, Q.V. Le. Bundle Methods for Regularized Risk Minimization. JMLR, 11(Jan):311-365, 2010.
[2] T. Joachims. StructSVM.
[3] C. Lemarechal, A. Nemirovskii, Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69:111-147, 1995.
 
Implement Gaussian Processes and regression techniques
Mentor: Oliver Stegle
Difficulty: Medium
 Requirements: C++ programming, machine learning methods for regression, understand octave or matlab code.
 
While Shogun provides a number of scalable models for classification, flexible and accurate regression techniques are partially missing at this point (currently only (Kernel) Ridge Regression, Support Vector Regression and Least Squares Regression are implemented).
Your task will be to adapt and translate established Gaussian process models for usage in Shogun, integrating ideas from existing matlab and python toolboxes (see http://www.gaussianprocess.org/#code) and in particular the GPML Toolbox.
To start out, we will implement state-of-the-art Gaussian process regression models and adapt shogun as needed for efficient inference in Gaussian process models:

- Gaussian process regression using a standard Gaussian kernel (for example [1], chapters 1 and 2)
- Extend key kernels in shogun to provide parameter derivatives, and add further interfaces needed to speed up Gaussian process parameter learning.
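A minimal numpy/scipy sketch of the standard Cholesky-based GP regression predictor ([1], Algorithm 2.1); the hyperparameters (width, noise) are fixed here, whereas the project would learn them using the kernel parameter derivatives from the list above:

```python
# X: (n, d) training inputs, y: (n,) targets, Xstar: (m, d) test inputs.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(A, B, width=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def gp_predict(X, y, Xstar, width=1.0, noise=0.1):
    K = rbf(X, X, width) + noise ** 2 * np.eye(len(X))
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, y)                 # (K + sigma^2 I)^{-1} y
    Ks = rbf(X, Xstar, width)
    mean = Ks.T.dot(alpha)                  # predictive mean
    v = cho_solve(L, Ks)
    cov = rbf(Xstar, Xstar, width) - Ks.T.dot(v)   # predictive covariance
    return mean, np.diag(cov)
```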
 
 
This project allows the student to get familiar with various regression methods often used in machine learning, and in particular with GPs.
In the second part of the project, we aim to implement extensions that make shogun even more useful for advanced regression models. Depending on your interests, we suggest to either extend the models to cope with outliers in the data (robust regression) or implement the Gaussian process latent variable model for dimensionality reduction in high-dimensional datasets:

- Implement the Gaussian process latent variable model in Shogun [2]
- Add support for robust Gaussian process regression [3]

In addition, the interested student could enhance the regression methods currently available in shogun (e.g. by developing more numerically robust methods) and implement other state-of-the-art regression techniques.
 
 
[1] C. Rasmussen, C. Williams. Gaussian Processes for Machine Learning.
[2] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 2005.
[3] P. Jylänki, J. Vanhatalo, A. Vehtari. Robust Gaussian process regression with a Student-t likelihood. JMLR, 2011.
 
Implement multitask and domain adaptation algorithms
Mentor: Christian Widmer
Difficulty: Medium
 Requirements: C++ programming, advanced machine learning
 
 Multitask Learning and Domain Adaptation are active fields of research in the Machine Learning Community. The main aim is to combine information from similar prediction tasks in order to come up with better classifiers.
As an example, consider building a classifier that predicts whether a person in Germany likes a particular movie, given their purchase history. The resulting classifier could be very similar to the one you would get for, say, Dutch customers, yet quite different from the one for Japanese customers. Still, one could possibly exploit useful information in the Japanese data set even for the German or Dutch customers. This is exactly where multitask learning comes in: related problems are coupled together, leading to classifiers that exploit all available information. In practice, this is done by certain modifications to the standard formulations of classifiers such as the SVM.
In this project, the student would work close to active research. As a first step, they would be closely supervised in implementing a few multitask learning algorithms from the literature, for example:

- Convex Multitask Feature Learning (paper)
- Learning Multiple Tasks using Manifold Regularization (paper)

Depending on the speed of progress, the student will then be involved in the efficient implementation of active research projects. Working prototypes will be provided in python, and the task will be to come up with fast, efficient, well-tested and well-documented C++ implementations as part of SHOGUN. This project gives the student the chance to work close to ongoing academic research. It offers the opportunity to learn about the optimization techniques needed for machine learning, about getting from prototypes to "production-ready" implementations, and of course about the research field of multitask learning. A toy sketch of the task-coupling idea follows below.
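An illustrative toy sketch (not the method of either paper above) of how coupling enters the formulation: per-task ridge regressions whose weight vectors are pulled toward a shared vector:

```python
# tasks: list of (X_t, y_t) pairs; couples per-task weights w_t to a shared
# vector v via gam * ||w_t - v||^2 on top of the usual ridge penalty.
import numpy as np

def multitask_ridge(tasks, lam=1.0, gam=1.0, n_iter=50):
    d = tasks[0][0].shape[1]
    W = [np.zeros(d) for _ in tasks]
    v = np.zeros(d)
    for _ in range(n_iter):                # block coordinate descent
        for t, (X, y) in enumerate(tasks):
            A = X.T.dot(X) + (lam + gam) * np.eye(d)
            W[t] = np.linalg.solve(A, X.T.dot(y) + gam * v)
        v = np.mean(W, axis=0)             # optimal shared vector given w_t
    return W, v
```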
Kernel Two-sample/Dependence test
Mentor: Arthur Gretton, Soeren Sonnenburg
Difficulty: Medium to Difficult
 Requirements: C++ programming, Numerical algorithm implementations, Knowledge of probability distributions/statistical hypothesis tests and kernel methods, Reading Matlab Code
 
 
 
Hypothesis tests are an important tool for statistical analysis. Given samples from two distributions p and q, the goal is to test whether p=q. A test statistic is computed from the samples and compared to a threshold; based on this, the null hypothesis (p=q) is or is not rejected.
 
For this project, kernel versions of hypothesis tests are to be implemented, based on the "Maximum Mean Discrepancy" (MMD) (see [1], [2], [3], [4]).
These have applications, for example, in bioinformatics (comparing microarray data distributions) or in database attribute matching, where data is high-dimensional. In addition, they allow testing whether data comes from two different distributions in fields where data is not given in vector form (for example graphs).
The idea of these tests is to find a well-behaved (e.g. smooth) function which is large on the points drawn from p, and small (as negative as possible) on the points from q. The difference between the mean function values on the two samples yields the test statistic; when this is large, the samples are likely from different distributions.
 
Related to the two-sample test are dependence tests, i.e. telling whether two samples are statistically (in)dependent (see for example http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
For this project, a kernel dependence test based on the Hilbert Schmidt Independence Criterion (HSIC), as described in [5], is to be implemented.
As for the MMD tests, there exist multiple methods of computing the significance threshold; see [6] for this and some slides for an intuitive description.
 
There already exist Matlab implementations for both MMD [4] and HSIC [5].
For MMD, a good start would be the linear time statistic in [1, Corollary 22], where the null distribution is a Gaussian (a numpy sketch follows).
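A numpy sketch of that linear-time statistic, assuming equally sized samples and a Gaussian kernel:

```python
# x, y: (n, d) samples from p and q, assumed to be of equal size.
import numpy as np

def k(a, b, width=1.0):
    return np.exp(-((a - b) ** 2).sum(-1) / (2.0 * width ** 2))

def linear_time_mmd(x, y, width=1.0):
    m = (len(x) // 2) * 2
    x1, x2 = x[0:m:2], x[1:m:2]            # disjoint pairs of points
    y1, y2 = y[0:m:2], y[1:m:2]
    h = k(x1, x2, width) + k(y1, y2, width) \
        - k(x1, y2, width) - k(x2, y1, width)
    # under the null, the statistic is asymptotically Gaussian
    return h.mean(), h.std() / np.sqrt(len(h))
```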
A more complicated test is based on the asymptotic distribution of the unbiased MMD statistic [4, section 5]. A difficulty here is to compute the significance threshold. This should be done in several ways: a method based on bootstrap resampling (see [1], [2], and the sketch below), a method based on the eigenspectrum of the Gram matrix on the aggregate sample from p and q (see [3]), and a cheap approximation of the latter [3].
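The bootstrap approach of [1], [2] amounts to recomputing the statistic on random splits of the pooled sample; a generic sketch:

```python
# statistic(x, y) is any two-sample test statistic, e.g. an MMD estimate.
import numpy as np

def bootstrap_threshold(x, y, statistic, alpha=0.05, n_perm=250):
    pooled = np.vstack([x, y])
    stats = []
    for _ in range(n_perm):
        perm = np.random.permutation(len(pooled))     # shuffle the pool
        xs, ys = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        stats.append(statistic(xs, ys))               # null distribution
    return np.percentile(stats, 100.0 * (1.0 - alpha))
```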
For HSIC (a sketch of the basic estimator is below), there are multiple versions of computing the significance threshold, in particular one based on bootstrapping [5] [6], and one faster consistent version [reference to come].
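A numpy sketch of the biased HSIC estimator from [5] (one common normalization), trace(KHLH)/m^2:

```python
# X, Y: (m, d1) and (m, d2) paired samples; K, L are kernel matrices on
# each sample and H is the centering matrix.
import numpy as np

def rbf_matrix(X, width=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def hsic(X, Y, width=1.0):
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    K, L = rbf_matrix(X, width), rbf_matrix(Y, width)
    return np.trace(K.dot(H).dot(L).dot(H)) / m ** 2
```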
Another (optional, if the rest works) challenge would be to pay a bit more attention to kernel (parameter) selection in the context of the above tests. However, some usable heuristics exist.
The first practical step for the student is to read and understand the mentioned papers, to grasp the ideas behind the methods. It is not important to gain deep mathematical insight but rather to know what is important for the implementation.
There already exists a Matlab implementation which can be used for orientation. Looking at and understanding that code would be the second step. Since Matlab has many numerical algorithms built in, it might be necessary to include some of them in SHOGUN.
There are as yet no hypothesis/dependence tests in SHOGUN; therefore, a framework concept for these has to be thought out. This includes creating new classes/interfaces to make this as flexible as possible. All this has to be done with respect to existing interfaces, so a good knowledge of the SHOGUN framework is crucial/has to be acquired.
The algorithms/interfaces have to be documented well, and examples/tests for their usage have to be created. Creating examples may be non-trivial here, since the data should be complex enough for the tests to be interesting.
Finally, SHOGUN's usual interfaces to other languages have to be created, and examples should be translated to at least some of the language bindings.
 
This project offers a great opportunity to get familiar with kernel statistical hypothesis testing. The mentioned methods were all developed quite recently by well-known scientists. Since there already exists a working implementation in Matlab, coding does not have to be done completely from scratch.
On the implementation side, students will get to know the SHOGUN framework, since they are expanding it, and will gather experience writing numerical algorithms. Implementing algorithms for scientific open-source software is in any case a great experience.
[1] Main reference for MMD (long technical report): "A Kernel Method for the Two-Sample Problem". Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola.
[2] Short summary of [1]: "A Kernel Method for the Two-Sample-Problem". Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander J. Smola.
[3] A better method for approximating the null distribution of the asymptotic distribution of the unbiased MMD statistic: "A Fast, Consistent Kernel Two-Sample Test". Arthur Gretton, Kenji Fukumizu, Zaid Harchaoui.
[4] Matlab implementation and references: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm
[5] Main reference for HSIC: "A Kernel Statistical Test of Independence". A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. NIPS 21, 2007.
[6] References and implementation of the kernel dependence test: http://people.kyb.tuebingen.mpg.de/arthur/indep.htm
 
Implementation of / Integration via existing GPL code of latent SVMs
Mentor: Alexander Binder
Difficulty: Difficult
Requirements: C++ programming, machine learning

SVMs with latent variables (see http://www.cs.cornell.edu/~cnyu/latentssvm/, research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf) are an ongoing success story in object detection, where the latent variable is the unknown position of an object center and object parts. Felzenszwalb's approach received the "Lifetime Achievement" prize based on its impact in research (see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html) at the Pascal VOC2010 workshop. Latent parameters are equally appealing for discriminative methods in other application areas as well.

The task consists of:
 
- checking what implementations licensed under the GPL exist in C/C++, and to what kind of problems they are practically applicable
- if necessary, implementing one latent SVM model
- implementing templates for minimization over user-defined latent variables (difficult!)
- implementing dynamic interfaces
- implementing static interfaces for important application cases
- implementing more than one solution strategy
- implementing heuristics for dealing with large sample sizes
- checking such heuristics for their efficiency on large scale data

The challenge in implementation lies in solving the non(!)-convex optimization problem in a way that is acceptable for practitioners, including allowing for templates for minimization over the latent variable (which should be consistent with the interfaces to other languages), termination after a maximal runtime, and heuristic frameworks which allow iterating on smaller chunks of data while keeping only the hardest samples. A schematic of the usual alternating optimization is sketched below.
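A schematic sketch (not any specific paper's algorithm) of the usual alternating optimization for latent SVMs; psi, latent_space and train_svm are user-supplied placeholders:

```python
# psi(x, h) maps a sample and a latent value to a joint feature vector;
# latent_space enumerates candidate latent values; train_svm is any linear
# SVM trainer returning a weight vector. All three are placeholders.
import numpy as np

def latent_svm(data, labels, psi, latent_space, train_svm, n_rounds=10):
    w = np.zeros(psi(data[0], latent_space[0]).shape[0])
    for _ in range(n_rounds):
        # step 1: impute the best latent value per sample (non-convex part)
        feats = np.array([psi(x, max(latent_space,
                                     key=lambda h: w.dot(psi(x, h))))
                          for x in data])
        # step 2: retrain on the completed data (convex subproblem)
        w = train_svm(feats, labels)
    return w
```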
Implementation of a boosting-type learning method for non-constantly dimensional features defined as sets of fixed-dimensional vectors with varying set size
Mentor: Alexander Binder
Difficulty: Medium
 Requirements: C++ programming, machine learning
 
http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf describes a method generally usable (outside its original domain) for features which are sets of fixed-dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of constantly (fixed) dimensional features and could extend the usability of the Shogun toolbox at this edge.
 The task consists of
 
- implement the model and solution algorithm
- implement a parallelization of the algorithm
- implement heuristics for dealing with large sample sizes
- check such heuristics for their efficiency on large scale data
- implement dynamic and static interfaces

The challenge here would be to implement it in a scalable way for larger sample sizes, possibly by parallelization or several heuristics to run it over chunks of the training data.