| 
		|   | Ideas for Google Summer of Code 2011 ProjectsShogun is a software library for machine learning algorithms, with a special focus on large-scale applications. This summer we are looking to extend the library in four different ways: Improving interfaces to other machine learning libraries or integrating them when appropriate, improved i/o support, framework improvements and new machine algorithms. Below is listed a set of suggestions for projects.back to shogun toolbox homepage
 |  | 
| Table of Contents | 
		
			| Interfacing and Integrating other Machine Learning Libraries  | 
			| Input/Output | 
			| Framework Improvements | 
			| New Machine Learning Algorithms | 
| List of Ideas | 
		| Interfacing or integrating other Machine Learning LibrariesInterfaces to other languages
	Implement modular Java interfaceMentor: Mikio BraunDifficulty: Easy
 Requirements: C++ programming, java, swig
 
 Investigate the status of numerical java and implement swig typemaps to
translate numerical java variables (matrices, vectors, sparse vectors,
strings) into shogun representations. To begin with, develop swig-typemaps
for jblas. It might be worth to have things
like JRuby, Jython etc. work with the Java bindings of shogun.
Implement modular ruby interface.Mentor: Mikio BraunDifficulty: Easy
 Requirements: C++ programming, ruby, swig
 
 Investigate the status of numerical ruby and implement swig typemaps to
	translate numerical ruby variables (matrices, vectors, sparse vectors,
	strings) into shogun representations.
 | 
	| Input/OutputExtensive information about hdf5 can be found in its reference manual.Reading and writing of shogun features / objects in .hdf5 (pytables/matlab/h5py) formatMentor: Soeren SonnenburgDifficulty: Easy to Medium
 Requirements: C++ programming
 
 Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python,r,octave,matlab,... sometimes providing better hdf5 support than any of the native implementions. The task would be
	to extend hdf5 support by implementing
 
		varialbe length data types supportsymbolic linkscompression supportslicing supportattribute supportReading and writing of matlab 7.0 filesMentor: Soeren SonnenburgDifficulty: Easy to Medium
 Requirements: C++ programming
 
 Matlab 7 file format specifications are publicly available online. In addtion Octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that
	allows for reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc.
Reading and writing of octave 3.x filesMentor: Soeren SonnenburgDifficulty: Easy
 Requirements: C++ programming
 
 Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}. So this task involves a lot of code borrowing.
 | 
	| Framework ImprovementsBuilt well-designed and flexible cross-validation framework into shogunMentor: Soeren SonnenburgDifficulty: Difficult
 Requirements: C++ programming, basic machine learning
 
 This should include N-fold cross-validation, simple validation, repeated sampling etc. 
	It should be possible to easily exchange the strategy of how examples are split up (e.g. in comp-bio this, could mean that genes in a particular split come from a certain genomic region). This is considered difficult only because you will need to deeply understand the internals of shogun objects. The involved tasks are
 
		Understand shogun's parameter framework and extend it to flag parameters for use in model selection.Implement a class CParameterSetting for one particular configuration the learning machine will be run withImplement a class CSettingGenerator class that based on parameter inputs generates one particular CParameterSettingImplement various model selection schemes (train,validation,test split, n-fold cross validation, leave one out cross validation, nested cross validation,...)Implement framework for online learningMentor: Soeren SonnenburgDifficulty: Easy
 Requirements: C++ programming, machine learning
 
 Currently within shogun it is assumed that the number of data points is known beforehand. This may be troublesome for online learning algorithms.
Implement 'online features' with calls to get_next_feature_vector() that always returns another object or NULL. Implemnent it in a way hat is either read from a file, pipe or socket. This is closely related to the vowpal vabbit integration above and one will be able to borrow code from there
	http://hunch.net/~vw/.
In a next step convert online algorithms such as sgd, larank, liblinear to work with those features.
Built generic structured output learning frameworkMentor:Jonas Behr, 
Nico GoernitzDifficulty: Difficult
 Requirements: C++ programming, machine learning
 
 In many cases machine learning problems result in a classification task where
		one seeks an algorithm that decides whether a given data point is
		of one or the other class. Among the most powerful and successful methods 
		solving such problems is the famous support vector machine (SVM).
		Naturally it's input data consist of 
		high-dimensional lists of features capturing properties of the 
		underlying objects (e.g. weight, size, color, ...). Sometimes these objects 
		are more complex and have a build-in structure like trees and sequences. 
		However, the output is restricted to a binary decision and hence, is not
	  helpful when it comes to predicting e.g. the most likely sequence for speech recognition or genome annotation.
		To overcome this problem but retain the characteristics of the
		SVM the structured-output support vector machine (SO-SVM) has been proposed
		(Tsochantaridis et al [3]). Another powerful class of structured output
	  predictors is the conditional random field (CRF, Lafferty et al [4]). It shares important properties with the SO-SVM but 
	  performance might differ depending on the application at hand.
 
 Here, the student will implement a generic SO-SVM and CRF framework to enable handling
		structured output problems and make it an ideal base for 
		specific user-defined implementations.
		A crucial point is - of course - speed and therefore implementation 
		covers only primal versions of the SO-SVM. However, this is not a big restriction
		because most applications already have a rich representation.	  
	  As application she/he will implement a hidden (semi-)Markov SO-SVM which
	  will be used for transcript mapping or gene prediction.	
		There is already an implementation available in 
		Matlab/C [1]. Code for Viterbi-decoding of semi-Markov models is already available in Shogun [2].
 
 While the programming difficulty is only medium a basic understanding in
    machine learning is mandatory. This renders the whole problem difficult.
 
 Timetable
 
 
		  get used to shogun (especially how to read and write data) [2]test HM-SVM toolbox with toy data [1]get a basic grasp of how the SO-SVM algorithm looks likeimplement the standard SO-SVM algorithm (a QP-solver is provided)test the SO-SVM by implementing a multi-class SVM (here the 
		  joint feature map and argmax-function are especially easy,
		  this is also the first application)implement HM-SVM application (define specific loss- and argmax-functions)implement a CRF and test on the same application [1] an implementation of the hidden Markov SO-SVM in Matlab/C is available at  http://mloss.org/software/view/250/
 [2] Shogun framework as well as examples and plenty of doumentation can be downloaded from http://www.shogun-toolbox.org
 [3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun: Support vector machine learning for interdependent and structured output spaces, ICML 2004
 [4] J. Lafferty, A. McCallum, F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001
 
General Solver framework for convex problems.Mentor: Gunnar RaetschDifficulty: Difficult (this might be too ambitious)
 Requirements: C++ programming, optimization, machine learning
 
 There exists cvx for matlab which is useful but lacks some natural options like number of iterations. Convexity can be checked via algebraic rules (like sums of convex functions are convex) and convex atoms. The aim is to establish a framework with one solving strategy initially rather then to get perfect solutions for all kinds of convex problems.
 | 
	| New Machine Learning Algorithms
		 Implementation of the Reduced Set Method for RBF kernel.Mentor: Vojtech FrancDifficulty: Medium
 Requirements: C++ programming, matlab or octave, basic math/statistics
 
 The Gaussian kernels are among the most popular dissimilarity measures used
    in practical applications of kernel methods. Their popularity is due to the
    flexibility of the decision rule expressed as a linear combination of the
    Gaussian kernels. The discriminative power of such rule can be smoothly
    changed by tuning a single parameter (the kernel width) from linear to
    arbitrary complex one. The eminent flexibility is paid by high complexity of
    the decision rule, i.e. the number of the Gaussian kernels appearing in the
    linear combination. High complexity of the decision rule has a detrimental
    impact on the decision time being one of the most important criteria in
    practice. The Reduced Set (RS) Methods is a technique which approximates a
    complex kernel-based decision rule by a simple one using a prescribed number
    of kernels. The core of the RS methods is to solve so called pre-image
    problem. In the case of the Gaussian kernels, the pre-image problem leads to
    highly non-convex smooth optimization. The goal of the project is to
    implement the existing techniques for its solution, namely:
 
      As a starting point the student can use Matlab implementations of the 
    RS methods which are available in STPRtoolbox. B.Schoekopf, P.Knirsch, and A.Smola, C.Burges. Fast
      approximation of support vector kernel expansions, and an interpretation of
      clustering as approximation in feature spaces. DAGM. 1998. J.T.Kwok and I.W.Tsang. The pre-image problem in kernel
      methods. ICML. 2003. Plain gradient-based method. Implement Kernel Feature Analysis (KFA)Mentor: Konrad RieckDifficulty: Medium
 Requirements: C++ programming, machine learning
 
 Dimension reduction is a prerequisite in many practical applications of machine learning. Kernel Feature Analysis (KFA) is an algorithm for performing dimension reduction using kernel functions. In contrast to the other kernel-based techniques for dimension reduction, KFA determines a low-dimensional subspace using sparse components and thus is suitable for large-scale application. 
KFA is described in the following two papers. The algorithms can be implemented as sketched in the work of Jiang et al., where further tricks for improving the run-time are discusssed.
 
Sparse Kernel Feature Analysis. 
Smola et al. (1999)
Accelerated Kernel Feature Analysis.
Jiang et al. (2006)ML estimation of parameters of Gaussian Mixture Models with carefully
		implemented EM algorithm.Mentor: Vojtech FrancDifficulty: Difficult
 Requirements: C++ programming, matlab or octave, basic math/statistics
 
 The Expectation-Maximization algorithm estimating parameters of the Gaussian
    Mixture Model is a standard tool used probability density modeling. Despite
    the simplicity of the EM algorithm itself programming a robust
    implementation which would be applicable in practice is a tricky
    problem. The difficulty stems from numerical problems inherently involved in
    the algorithm, e.g. summation of extremely low numbers or inverting ill
    conditioned covariance matrices. The goal of the project is a robust
    implementation which uses several computational tricks and techniques,
    e.g. performing summation of low numbers in logarithm domain, using SVD
    decomposition to represent covariance matrices, removing "singular" Gaussian
    components etc. An initial Matlab implementation will be provided by the mentor.
 Implement missing machine learning algorithmsMentor: Christian WidmerDifficulty: Easy to Medium
 Requirements: C++ programming, basic machine learning, potentially basic skills in Ocatve/MATLAB/Python
 
 Shogun provides a rich set of kernel implementations, including (but not limited to) a rich set of string kernels particularly relevant for applications in Computational Biology.
Traditionally, the focus of Shogun has been on Support Vector Machines (SVMs),
however there exist numerous other Machine Learning methods, that could also
benefit from the set of kernels provided by Shogun. In particular, we have the
following two in mind:
 
 
Gaussian ProcessesKernel Logistic Regression (e.g. this implementation) In addition to that, a whole range of pre-processing methods are available that would greatly complement the features that are in shogun already. These include:
 
Kernel PCA (KPCA)Stochastic Neighborhood Embedding (SNE)Locally Linear Embedding (LLE)ISOMAPMultidimensional Scaling (MDS) For this task, the first step is to scan mloss.org and similar sites for suitable implementations. Once these are identified, the student will have to port the respective code to the Shogun code base, making sure to respect existing interfaces (and add new ones, if necessary). Lastly, documentation, examples and Unit tests will have to be written for all of the integrated methods.
 
 While this project does not involve the implementation of novel ML methods form scratch, it offers a great opportunity for a student to get familiar with a whole range of highly relevant Machine Learning methods in theory and practice. On top of that, the student will gather experience in contributing to a large scientific software project.
 
 Implement SVMbright a GPL'ed version of SVMLight and make it publicly availableMentor: Soeren SonnenburgDifficulty: Difficult
 Requirements: C++ programming, support vector machines, optimization theory
 
 SVMlight is one of the first SVM toolboxes. While efficient, it is not open source. The task is to implement the algorithms described in Joachims 98 as free software.
http://www.cs.cornell.edu/~cnyu/latentssvm/, 
		research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly
		http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf
		are an ongoing success story in Object detection approaches where the 
		latent variable is the unknown position of an object center and object 
		parts. Felzenswalbs approach received the "Lifetime Achievement" Prize 
		based on its impact in research (see 
		http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html) 
		at the Pascal VOC2010 workshop. Latent parameters are equally 
		appealing for disccriminative methods in other application areas as well.Implementation of / Integration via existing GPL code of latent SVMs.Mentor: Alexander BinderRequirements: C++ programming, machine learning
 Difficulty: Difficult
 SVMs with latent variables
The task consists of
 
		The challenge in implementation lies in solving the non(!)-convex 
		optimization problem in an way which is acceptable for practitioners , 
		including allowing for templates for minimization over the latent 
		variable, which are desried to be consistent with the interfaces to other languages, 
		termination after a maximal runtime or heuristic frameworks which allow 
		to iterate it on smaller chunks of data while keeping only the hardest 
		samples.checking what implementations licensed under the GPL exist in C/C++to what kind of problems are they practically applicableif necessary implement one latent SVM modelimplement templates for minimization over user-defined latent variables (difficult!)implement dynamic interfacesimplement static interfaces for important application casesimplement more than one solution strategyimplement heuristics for dealing with large sample sizescheck such heuristics for their efficiency on large scale data Implementation of Boosting type learning method for non-constantly dimensional features defined as sets of fixed dim vectors with varying set size.Mentor: Alexander BinderDifficulty: Medium
 Requirements: C++ programming, machine learning
 
 http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf
		is a method generally usable (outside its original domain) for features which are sets of fixed dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of constantly (fixed) dimensional features which could extend the usability of the Shogun toolbox at this edge.
 The task consists of
 
		implement the model and solution algorithmimplement a parallelization of the algorithmimplement heuristics for dealing with large sample sizescheck such heuristics for their efficiency on large scale dataimplement dynamic and static interfaces The challenge here would be to implement it in a scalable way for larger sample sizes, possibly by parallelization or several heuristics to run it over chunks of the training data.
 |