

Table of Contents 

Machine learning tasks


Infrastructure improvements


List of Ideas 

Machine learning tasks

Essential Deep Learning Modules
Mentor: Theofanis Karaletsos (website: http://www.mskcc.org/research/lab/gunnarratsch/member/theofaniskaraletsos)
Shogun co-mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
Difficulty: Medium
Requirements: C++, Python, machine learning

Deep learning has recently attracted a lot of attention in the machine learning community for its ability to learn features that enable high performance in a variety of tasks, ranging from computer vision to speech recognition and bioinformatics. The main goal of this task is to integrate essential building blocks of deep learning algorithms into Shogun. This includes Restricted Boltzmann Machines (RBMs) and their training algorithms, Deep Belief Networks (DBNs), feedforward networks and convolutional networks. The architecture and software design should allow flexible usage and adaptation of models, so as to build a foundation for integrating many more models, evaluation methods and training algorithms in the future. Concretely, this idea covers implementing a software foundation for deep learning and the first few algorithms (RBMs and their training, stacking of RBMs, wake-sleep for DBNs, discriminative fine-tuning with backpropagation, feedforward networks). It is also worth investigating current implementations (like Caffe [3]) and possibly implementing some wrapping code to use them from Shogun.

Variational Learning for Recommendations with Big Data
Mentor: Mohammad Emtiyaz Khan (website: http://www.cs.ubc.ca/~emtiyaz/)
Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
Difficulty: Medium to Difficult
Requirements: C++, familiarity with optimization methods; preferred (but not required): a basic understanding of Bayesian models (or just GPs) and variational approximations
Useful: Matlab (for reference code), Eigen3, Shogun's GP framework

The era of big data brings calls for methods that scale well.
For example, websites such as Netflix and Amazon collect huge amounts of data about users' preferences. Efficient use of such data can help improve recommendations and user experience through personalization. In this project, our goal will be to develop scalable code for learning from such data.
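To give a flavour of the kind of model involved, here is a toy matrix-factorization recommender trained by alternating gradient steps. This is a plain NumPy sketch for illustration only, not Shogun code, and all names and parameter values in it are made up; the project itself targets scalable variational inference rather than this simple gradient scheme.

```python
import numpy as np

def factorize(R, mask, rank=2, lr=0.01, reg=0.05, iters=3000, seed=0):
    """Minimal matrix factorization R ~ U @ V.T, fit on observed entries only."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(iters):
        E = mask * (R - U @ V.T)       # residual on observed ratings only
        U += lr * (E @ V - reg * U)    # gradient step on user factors
        V += lr * (E.T @ U - reg * V)  # gradient step on item factors
    return U, V

# tiny user-item rating matrix; 0 marks a missing rating
R = np.array([[5., 4., 0.], [4., 0., 1.], [0., 1., 5.]])
mask = (R > 0).astype(float)
U, V = factorize(R, mask)
pred = U @ V.T  # predictions, including the previously missing entries
```

The point of the sketch is the structure of the problem: most entries are missing, so all computation must be restricted to the observed ones, which is exactly what makes naive dense approaches infeasible at Netflix scale.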
You will gain experience working with optimization algorithms, and get a peek into convex optimization theory as well. You will learn about Bayesian models such as matrix factorization and Gaussian processes, and about variational inference. The most fun part is that you will get to play with real-world data, such as that used in recommendation systems. You will come to understand why these datasets are difficult, and what kinds of models are useful for them. The project is quite flexible in terms of how much can be done, and it offers a lot of room for sophisticated extensions while the basic tasks are not too hard. You can play a lot with these things (and we expect you to). Should be quite fun!

Fundamental Machine Learning Algorithms
Mentor: Fernando Iglesias (email: fernando.iglesiasg at gmail.com, irc: iglesiasg)
Difficulty: Easy to Medium
Requirements: experience with C++, Python, Octave/Matlab; understanding of basic machine learning algorithms
Useful: experience with Eigen3

The machine learning research community is fast-paced, and international conferences produce new methods every year. Nevertheless, well-established, fundamental algorithms are an indispensable tool for most practitioners, as they are well understood and their performance has been proven in a wide range of applications. The goal of this project is to implement fundamental ML algorithms that Shogun currently lacks, to polish and improve the existing ones, and to illustrate their use through (multi-language) examples and IPython notebooks on real-life datasets.
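One such fundamental method, kernel density estimation, is small enough to sketch in a few lines. This is a plain NumPy illustration of the idea for 1-D data, not Shogun's API, and the bandwidth value is an arbitrary choice:

```python
import numpy as np

def kde(x_query, samples, bandwidth=0.5):
    """Gaussian kernel density estimate at the query points (1-D data)."""
    # pairwise differences between query points and training samples
    diff = x_query[:, None] - samples[None, :]
    k = np.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return k.mean(axis=1)  # average the per-sample kernel contributions

# a bimodal toy dataset: two Gaussian bumps at -2 and +2
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
grid = np.linspace(-4, 4, 9)
density = kde(grid, samples)
```

A Shogun implementation would additionally need efficient neighbour lookup (e.g. via the KD-trees also listed below) so that evaluation does not cost O(n) per query point.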
References: [1] C4.5. [2] Random forests. [3] KD-trees. [4] Kernel Density Estimation. [5] GPs notebook.

Implementation of Recent One-class SVM Extensions
Mentor: Nico Goernitz (website: http://www.user.tu-berlin.de/nico.goernitz/)
Difficulty: Medium
Requirements: C++, Python, basic machine learning and optimization skills

One-class learning, also known as density level-set estimation, anomaly detection, fraud detection, concept learning, etc., has been around for a while. The most famous algorithms are the one-class support vector machine (Schoelkopf et al., 1999) and the support vector data description (Tax and Duin, 1999). Both share common ideas and, in fact, under fairly general circumstances, are interchangeable. Due to their simplicity, they have been applied in many different areas and in various, even surprising, settings (e.g. multi-class classification, to name one that contradicts the name).

Large Scale Learning: Approximate Kernel Expansions in Log-linear Time (aka Fastfood)
Mentor: Andreas Ziehe (website: http://www.ml.tu-berlin.de/menue/mitglieder/andreas_ziehe/)
Shogun co-mentor: Soeren Sonnenburg (email: sonne@debian.org, irc: sonney2k)
Difficulty: Medium to Difficult
Requirements: C++, machine learning, linear algebra

Large scale learning with kernels is hindered by the fact that kernel classifiers require computing kernel expansions of the form \( f(x)=\sum_{i=1}^N \alpha_i K(x,x_i) \). Obviously, the more nonzero coefficients \( \alpha_i \) there are, the slower the kernel machine. Recently, progress has been made in drastically speeding up kernel machines by approximating the kernel feature space [1,2,3]. The "Fastfood" approach [3], which can be seen as a modification of the "random kitchen sinks" method, offers O(n log d) computation and O(n) storage for n basis functions in d dimensions. The key to these speedups is the use of the fast Hadamard transform for fast matrix-vector multiplication [6].
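The random kitchen sinks baseline that Fastfood modifies can be sketched quite compactly: draw random frequencies, map the data through cosines, and the kernel expansion above becomes a plain dot product of explicit features. This is a hedged NumPy illustration (Fastfood itself would replace the dense Gaussian matrix W below with Hadamard-based products, which is the point of the project; the parameter values are arbitrary):

```python
import numpy as np

def rks_features(X, n_feats=2048, gamma=0.5, seed=0):
    """Random Fourier features approximating the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2)  ("random kitchen sinks")."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_feats))  # random frequencies
    b = rng.uniform(0, 2 * np.pi, n_feats)                       # random phases
    return np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)

# the explicit features turn the kernel expansion into a dot product:
X = np.random.default_rng(1).normal(size=(5, 3))
Z = rks_features(X)
approx = Z @ Z.T                                             # approximate kernel matrix
exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # exact RBF, gamma=0.5
```

With explicit features, \( f(x) \) is evaluated as a single inner product instead of a sum over all support vectors, which is where the speedup comes from.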
References: [1] Efficient Additive Kernels via Explicit Feature Maps (Vedaldi, 2011). [2] Random kitchen sinks. [3] Fastfood: Computing Hilbert Space Expansions in Log-linear Time, Quoc Le, Tamas Sarlos, Alexander Smola; JMLR W&CP 28(3):244-252, 2013. [4] Fastfood video lecture. [5] Fastfood tech talk. [6] SPIRAL WHT Package.

Generic Framework for Markov Chain Monte Carlo Algorithms and Stan Interface
Mentors: Theodore Papamarkou (website: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academicresearch/papamarkou/), Dino Sejdinovic (website: http://www.gatsby.ucl.ac.uk/~dino/)
Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
Difficulty: Medium to Difficult
Requirements: C++, basic understanding of Monte Carlo methods
Useful: Stan, Shogun's GP framework, experience with Hamiltonian Monte Carlo

Monte Carlo methods are used for Bayesian statistical analysis in various disciplines, including machine learning. For this reason, several statistical methods in Shogun's existing toolboxes require interacting with Monte Carlo samplers. An example is fully Bayesian GP classification, which requires sampling from the posterior of the GP hyperparameters. The aim of this project is to provide a coding framework for Monte Carlo sampling in Shogun. Other parts of Shogun would then use this framework either to call Monte Carlo samplers already available in it, or to code new samplers that comply with the framework's unified API. The fully modular framework would allow both adaptive and non-adaptive MCMC methods, and the easy addition of novel adaptation procedures such as Kameleon MCMC [3]. In addition, pseudo-marginal MCMC (based on [1]) will be included. The main application would be MCMC methods for Gaussian process classification, building on two previous GSoC projects on GPs.
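To make the kind of sampler such a unified API would wrap concrete, here is a textbook random-walk Metropolis-Hastings sampler in plain NumPy. This is a generic sketch, not the Shogun API the project would design (step size, sample count, and the standard-normal sanity target are all arbitrary choices):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=5000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal (1-D)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    logp = log_target(x)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.standard_normal()   # propose a local move
        logp_prop = log_target(prop)
        # accept with probability min(1, target(prop) / target(x))
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = prop, logp_prop
        samples[i] = x
    return samples

# sanity check: sample from a standard normal via its log-density
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0)
```

A framework version of this would abstract the proposal (so adaptive schemes like Kameleon MCMC can plug in) and the target (so a GP hyperparameter posterior, or a Stan model, can be sampled through the same interface).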
References: [1] CInferenceMethod in Shogun's Doxygen documentation. [2] Shogun MCMC prototype, in particular the UML diagram. [3] D. Sejdinovic, M. Lomeli Garcia, H. Strathmann, C. Andrieu and A. Gretton, Kernel Adaptive Metropolis-Hastings; working implementation on GitHub.

Output Kernel Learning
Mentor: Cheng Soon Ong (website: http://www.onghome.my/)
Shogun co-mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
Difficulty: Medium
Requirements: C++, linear algebra

In this task a student would implement a few algorithms for output kernel learning. Output kernel learning is a kernel-based technique for solving learning problems with multiple outputs, such as multi-class and multi-label classification or vectorial regression, while automatically learning the relationships between the different output components. This idea is mainly about implementing the algorithm with proper testing routines, and it gives the student a great opportunity to work on state-of-the-art machine learning algorithms and problem formulations.

Testing and Measuring Variable Interactions With Kernels
Mentor: Dino Sejdinovic (website: http://www.gatsby.ucl.ac.uk/~dino/)
Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
Difficulty:
Requirements: strong C++ skills, basic knowledge of kernel methods and hypothesis testing
Useful: Matlab, Python, Eigen3

Testing and measuring dependence between paired data observations is a fundamental scientific task, and is often an important part of the feature selection procedure needed for a more sophisticated ML problem. In recent years, significant progress has been made in both the machine learning and statistics communities to capture nonlinear associations, and to extend the formalism, via the kernel trick, to datasets that belong to more complex, structured domains, such as strings or graphs.
The aim of this project is to extend Shogun's modular implementation of these novel dependence measures and make it applicable to generic data domains. In addition, the corresponding feature selection procedures will be implemented and integrated into Shogun's data preprocessing framework. The project builds on the GSoC 2012 project on kernel methods for two-sample testing [8].
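The best-known measure of this kind, HSIC (used by the independence test of [1]), fits in a few lines of NumPy. This is a toy sketch of the biased estimator with Gaussian kernels and an arbitrary bandwidth, not Shogun's implementation:

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix for 1-D observations."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased HSIC estimate: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H) / n ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x
```

The project's work then sits on top of such estimators: null-distribution approximations for the actual hypothesis test, generic data domains beyond 1-D reals, and the feature selection procedures built from the measure.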
References: [1] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Scholkopf and A. Smola, A kernel statistical test of independence. In Advances in Neural Information Processing Systems (NIPS), 2008. [2] G. Szekely and M. Rizzo, Brownian distance covariance. Ann. Appl. Stat. 3:1236-1265, 2009. [3] D. Sejdinovic, A. Gretton and W. Bergsma, A kernel test for three-variable interactions. In Advances in Neural Information Processing Systems (NIPS), 2013. [4] K. Fukumizu, A. Gretton, X. Sun and B. Scholkopf, Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems (NIPS), 2008. [5] B. Poczos, Z. Ghahramani and J. Schneider, Copula-based kernel dependency measures. In International Conference on Machine Learning (ICML), 2012. [6] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil and K. Fukumizu, Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems (NIPS), 2012. [7] L. Song, A. Smola, A. Gretton, J. Bedo and K. Borgwardt, Feature selection via dependence maximization. J. Mach. Learn. Res. 13:1393-1434, 2012. [8] GSoC 2012 follow-up and CTestStatistic in Shogun's documentation. [9] W. Zaremba, A. Gretton and M. Blaschko, B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems (NIPS), 2013.

Dictionary Learning
Mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
Difficulty: Medium
Requirements: C++, Python

Dictionary learning is an optimization approach to avoiding manual feature engineering in machine learning. This task is a great opportunity to learn the basics of dictionary learning and gradually reach state-of-the-art techniques relevant to various machine learning applications.
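The basic structure behind most dictionary learning methods is an alternation between sparse coding and a dictionary update. A toy NumPy sketch of that alternation (ISTA-style proximal steps for the codes, least squares for the dictionary; all parameter values are arbitrary, and this is far from a production algorithm):

```python
import numpy as np

def dictionary_learning(X, n_atoms=8, lam=0.1, n_iter=30, seed=0):
    """Alternate sparse coding and dictionary updates so that X ~ D @ C."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    C = np.zeros((n_atoms, n))
    for _ in range(n_iter):
        # sparse coding: a few ISTA (proximal gradient) steps on the codes C
        step = 1.0 / (np.linalg.norm(D, 2) ** 2)   # 1 / spectral norm squared
        for _ in range(10):
            Z = C - step * (D.T @ (D @ C - X))     # gradient step
            C = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0)  # soft threshold
        # dictionary update: least squares fit, then renormalize the atoms
        D = X @ np.linalg.pinv(C)
        norms = np.linalg.norm(D, axis=0)
        safe = np.where(norms > 0, norms, 1.0)
        D /= safe
        C *= safe[:, None]                         # keep the product D @ C unchanged
    return D, C

# toy data generated from a true sparse model
rng = np.random.default_rng(1)
D_true = rng.standard_normal((10, 5))
C_true = rng.standard_normal((5, 40)) * (rng.uniform(size=(5, 40)) < 0.3)
X = D_true @ C_true
D, C = dictionary_learning(X, n_atoms=5)
```

State-of-the-art methods (online dictionary learning, K-SVD, etc.) refine both halves of this loop, which is where the project would head.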
LP/QP Optimization Framework
Mentor: Viktor Gal (email: vigsterk at gmail.com, irc: wiking)
Difficulty: Difficult
Requirements: optimization theory, good C++ skills
Useful: OpenCL/CUDA knowledge

This task covers the implementation of a framework for LP/QP optimization within Shogun. It shall be as modular as possible, since we also want a general KKT solver as part of this project. The framework would be a thin wrapper that defines mappings to several existing, well-known optimization libraries (libqp, MOSEK, GLPK, CPLEX, etc.). Using this framework, a cone programming solver (for both LP and QP) should then be implemented.

Large-Scale (Hierarchical) Multi-Label Classification
Mentor: Thoralf Klein (email: thoralf at fischlustig.de, irc: thoralf)
Difficulty: Medium
Requirements: multi-label theory, structured-output learning, good C++ skills

Formally speaking, multi-label classification is about predicting sets of labels for given inputs, i.e. every input can be assigned to none, one, or many labels. It is commonly used in text mining and is related to multi-class classification, where we also have a set of labels, but each input is assigned to exactly one of them. While in multi-class classification the domains of the target labels are not allowed to overlap, multi-label classification can handle those cases as well. This opens up a broad range of new problem settings, but since we are predicting sets of integers instead of single integers, it also yields many interesting computational problems.

The first task in this project is to implement a simple but efficient approach to predicting multi-labels for given inputs. The multi-label alphabet is assumed to have 1000 or more labels, enough to expose the limitations of naive approaches. Solutions may include parallelized training runs, exploiting the sparseness of the multi-labels, and using dimension-reduction techniques such as feature hashing. (Label classes for this already exist and are just waiting for integration.)
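Feature hashing, mentioned above as one dimension-reduction option, is simple enough to sketch. A toy Python illustration (the dimension, hash function, and token example are arbitrary choices, not Shogun code):

```python
import hashlib
import numpy as np

def hash_features(tokens, n_dims=16):
    """Map a variable-length token list into a fixed-size vector (hashing trick)."""
    v = np.zeros(n_dims)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_dims                                 # hashed bucket index
        sign = 1.0 if (h // n_dims) % 2 == 0 else -1.0   # sign hash reduces bias
        v[idx] += sign
    return v

doc = ["multi", "label", "classification", "label"]
x = hash_features(doc)
```

The appeal for this project is that memory becomes independent of the vocabulary size: with 1000+ labels and huge text vocabularies, per-label weight vectors stay fixed-size.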
The second task is to implement some well-known metrics for evaluating multi-label predictions. Applying the developed model to some provided training data should then be straightforward. The last (and most challenging) task is to extend the approach to hierarchical multi-labels, often called taxonomies. Exploiting this structure allows us to implement more efficient algorithms. If, for example, a label may only be set when its parent label is set as well, then we can predict the structure lazily. This saves a lot of computing power, but requires an efficient decoding implementation (dynamic programming, adapting the forward-backward algorithm, also known as Viterbi decoding or belief propagation). Finally, note that the focus of this project is less about researching multi-label classification and more about implementing memory- and time-efficient algorithms that scale well with growing problems; a lot of research has already been done in this field. The project allows the participant to adjust the focus between algorithms, sophisticated models, optimized infrastructure, and efficient data structures.

Implement a Dual Coordinate Solver for Structured Output SVMs
Mentor: Vojtech Franc (website: http://cmp.felk.cvut.cz/~xfrancv/)
Shogun mentor: Soeren Sonnenburg (email: sonne at gmail.com, irc: sonney2k)
Difficulty: Medium
Requirements: optimization theory, C++, basic knowledge of structured output learning

The structured output Support Vector Machine (SO-SVM) [1] is a supervised method for learning the parameters of a linear classifier with a possibly exponentially large number of classes. Learning is translated into a convex optimization program whose size scales with the number of classes, making the problem intractable for off-the-shelf solvers. Dual coordinate ascent (DCA) based methods, well known e.g. from two-class SVMs, have recently been proposed for solving the SO-SVM [2][3].
The DCA algorithms combine the advantages of approximate online solvers and the precise cutting plane methods used so far. In particular, the DCA algorithms process training examples in an online fashion via a simple update rule, while providing a certificate of optimality at the same time.

Structured Output Learning with Approximate Inference
Mentor: Shell Hu (email: dom343 AT gmail DOT com, irc: hushell)
Shogun mentor: Thoralf Klein (email: thoralf AT fischlustig DOT de, irc: thoralf)
Difficulty: Medium to Difficult
Requirements: C++, Python, probabilistic graphical models, structured output learning

A major challenge of every structured output problem is implementing an efficient "inference" method. This method is used in both training and prediction to determine the highest-scoring ("best") output for a given input. Since the output space is usually exponential in size, we cannot evaluate all possible outputs to choose the "best" one. (Examples of such NP-hard inference problems are graphs where one wants to predict outputs for each node, while each node depends on its neighbors.)
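For chain-structured outputs, exact inference is still tractable via dynamic programming, which is the baseline any approximate method is compared against. A generic Viterbi sketch in NumPy (the score matrices are hypothetical toy values, not tied to any Shogun class):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Highest-scoring label sequence for a chain-structured model.
    unary: (T, K) per-position label scores; pairwise: (K, K) transition scores."""
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise       # cand[i, j]: prev label i -> cur label j
        back[t] = np.argmax(cand, axis=0)      # best predecessor per current label
        score = cand[back[t], np.arange(K)] + unary[t]
    path = [int(np.argmax(score))]             # best final label, then backtrack
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())

unary = np.array([[2., 0.], [0., 2.], [2., 0.]])  # per-position label scores
pairwise = np.array([[2., 0.], [0., 2.]])         # favour staying in the same label
path, best = viterbi(unary, pairwise)
```

The cost is O(T K^2) instead of O(K^T); for loopy graphs no such exact recursion exists, which is exactly why this project concerns approximate inference.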

Infrastructure improvements

SVM-bright: SVM^light as an MIT-licensed Rewrite
Mentor: Björn Esser (email: bjoern.esser at gmail.com, irc: besser82)
Difficulty: Difficult
Requirements: C++(11) programming, support vector machines, optimization theory

SVM^light is one of the first and most advanced SVM solvers. While efficient, it is not open source. This idea is about implementing the algorithms described in Joachims '98 as free, MIT-style licensed software.

OpenCV Integration and Computer Vision Applications
Mentor: Kevin Hughes (email: kevinhughes27 at gmail.com, irc: pickle27)
Difficulty: Medium
Requirements: understanding of computer vision and machine learning, good C++ and Python skills

Build OpenCV integration into Shogun using appropriate design patterns (likely a factory) and update any of the surrounding Shogun infrastructure as required. This Gist might be useful. After integrating with OpenCV, create several involved applications using Shogun and OpenCV. Some possibilities include: Optical Character Recognition (OCR), license plate identification, facial recognition, fingerprint identification, or suggest your own application that interests you!

Implement a Framework for a Plugin-Based Architecture
Mentors: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn), Viktor Gal (email: vigsterk at gmail.com, irc: wiking)
Difficulty: Difficult
Requirements: advanced C++ skills (a rather deep understanding of shared libraries, linking, etc.)

Currently, Shogun is a monolithic structure of classes, which is somewhat cumbersome to extend and maintain. We consider a plugin architecture as a possible way to solve these problems. Such an architecture would support dynamic behaviour of plugins: a user could download a new classifier and run it instantly, without any rebuilds. In this task the student has a chance to gain a deep understanding of important low-level details of dynamic libraries and ABIs.
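The shape of such a plugin system is easiest to show in Python, where importing a module plays the role that dlopen/dlsym and ABI-stable factory symbols would play in the actual C++ implementation. A hypothetical sketch (all names here are made up for illustration):

```python
import importlib

class PluginRegistry:
    """Minimal plugin registry: classes register under a name and are
    constructed on demand, without the core knowing them ahead of time."""
    def __init__(self):
        self._factories = {}

    def register(self, name):
        def decorator(cls):
            self._factories[name] = cls   # record the factory under its name
            return cls
        return decorator

    def load_module(self, module_name):
        # importing the module runs its registration decorators,
        # the Python analogue of dlopen()-ing a plugin shared library
        importlib.import_module(module_name)

    def create(self, name, *args, **kwargs):
        return self._factories[name](*args, **kwargs)

registry = PluginRegistry()

@registry.register("my_classifier")
class MyClassifier:
    def train(self, X):
        return "trained on %d samples" % len(X)

clf = registry.create("my_classifier")
```

The hard parts of the real project are precisely what this sketch hides: a stable C ABI for the factory symbols, versioning, and symbol visibility across shared libraries.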
Implement an Easy-to-Use Model Selection API
Mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
Difficulty: Medium
Requirements: medium C++ skills, some sense of syntax beauty

Shogun's model selection framework is a powerful tool for selecting parameters. However, it lacks a user-friendly syntax and is overly verbose. This idea is an endeavour to redesign the model selection framework to be much easier to use. The student assigned to this project would learn how to design cross-language APIs.

Independent Jobs Framework
Mentor: Viktor Gal (email: vigsterkr at gmail.com, irc: wiking)
Difficulty: Difficult
Requirements: distributed computing, OpenMP, C++, Hadoop, Spark

Although Shogun has a huge selection of ML algorithms, it currently lacks support for parallel processing, i.e. running these algorithms using multiple cores or multiple nodes in a cluster.

Shogun Cloud Extensions
Mentor: Viktor Gal (email: vigsterkr at gmail.com, irc: wiking)
Difficulty: Medium
Requirements: distributed computing, Python, Docker

The Shogun cloud service was introduced with release 3.0. It is basically a simple Flask application that serves Docker containers, each containing Shogun with the Python modular interface and an IPython server for the users. The code is available here. There are several issues (extensions) one would need to solve during this task.
Native MS Windows Port
Mentor: Viktor Gal (email: vigsterkr at gmail.com, irc: wiking)
Difficulty: Medium
Requirements: Visual C++

Shogun is missing a native MS Windows port. Because of this we are missing a big user base, as there are still a lot of researchers and developers who use MS Windows primarily.

A Meta-Language for Shogun Examples
Mentors: Soeren Sonnenburg (email: soeren.sonnenburg at shoguntoolbox.org, irc: sonney2k), Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
Difficulty: Medium
Requirements: basic knowledge of the programming languages that Shogun interfaces with (C++, Python, Octave, Java, etc.), basic computer science and formal languages (compilers)
Useful: SWIG, Shogun's examples, basic ML

Shogun uses a unique way of automagically generating access to Shogun's C++ classes from a large number of target languages, based on SWIG. Code in those interfaces is very similar in its syntactic structure; see for example the different modular API examples. In our experience this is one of the most impressive features of Shogun, one which none of the other toolboxes out there have. While this is great in principle, it also creates a lot of overhead, since examples have to be written explicitly in every one of the target languages. This is boring work, and as a result we only have proper examples in Python.

Lobbying Shogun in MLPACK's Automatic Benchmarking System (joint project)
Mentors: Ryan Curtin and Marcus Edel (MLPACK authors), Soeren Sonnenburg (email: soeren.sonnenburg at shoguntoolbox.org, irc: sonney2k), Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
Difficulty: Medium
Requirements: basics in Python, ML, Shogun, MLPACK, web design
Useful: buildbot, advanced ML, Shogun internals, d3.js

Last year, at the GSoC 2013 mentor summit, we met a cool guy named Ryan, who made a great idea happen with his student Marcus: a cross-toolbox benchmarking system.
This is a website where multiple ML toolboxes are compared to each other on various fixed datasets. Currently, the focus is on the runtime and memory footprint of algorithms. The resulting summaries are already very informative. The MLPACK guys will try to push this further this year; see their description. We would like to participate in the development of this project to make it even more useful. Our (Shogun-biased) vision is:
Runtime is an important quality criterion, given that the results are comparable. We therefore think it would be very useful to be able to compare the results of various implementations of the same algorithm. This increases the quality of the diverse ecosystem of ML toolboxes out there and helps everyone: users get a useful tool for choosing from among lots of packages, and developers are pointed to potential problems in their code. In the best case, this gives rise to a friendly competition between different toolboxes to produce better results faster. Having daily results ensures this neat project survives over the years and might attract other projects to participate. We could even think about integrating it with MLOSS and using MLOSS datasets.

Collaboration with MLPACK: this project would be under the main lead of the MLPACK guys, so you would talk to them a lot, trying to identify what is needed for the above extensions. The idea is that you would give them a helping hand in pushing their project's infrastructure (buildbot!) about 30% of the time, spend another 30% on polishing the visualisation and presentation of the results, and spend the remaining 30% on Shogun-specific things, such as polishing the Shogun code used, fixing possible bugs, and adding new algorithms from our toolbox. Have a look at the existing code.
Shogun Binary Packages
Mentors: Viktor Gal (email: vigsterkr at gmail.com, irc: wiking), Soeren Sonnenburg (email: soeren.sonnenburg AT shoguntoolbox.org)
Difficulty: Medium
Requirements: knowledge of OS X, Windows and Linux; experience with packaging software
Useful: compiling software on Windows/Mac (more complicated than on Linux), proper hacking skills :)

Sometimes the attention that Shogun gets evaporates because people really struggle to install it. Compiling code, changing cmake settings, library paths, etc. is not everyone's thing. We therefore aim to offer an easy way to install Shogun on mainstream operating systems via binary packages. An easy three-minute installation has massive potential to increase the number of our users. In particular, we aim to automate the packaging process so that nightly binary packages are built by a buildslave. One big question here is how dependencies are handled: are our binaries self-contained (and on which systems), or do they dynamically link against others? This will obviously depend on the target system.
Why this is cool: the impact of this project on Shogun's world domination is massive, even without any machine learning! The ideal student will be able to use their existing packaging skills and knowledge of multiple operating systems to push the number of Shogun users significantly. Apart from everyone loving you, this project also offers a wide variety of interesting technical challenges. Ask Viktor; he has lots of ideas on this.
Shogun Missionary & Shogun in Education
Mentors: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS), Soeren Sonnenburg (email: sonne@debian.org, irc: sonney2k)
Difficulty: Medium
Requirements: clear written English, verbal creativity, ability to explain technical topics, basic ML
Useful: broad knowledge of ML algorithms, IPython (notebook) & LaTeX, Shogun's examples & web demo framework

You will be surprised how many things Shogun can do, how many ML algorithms are implemented under the hood, and how fast they are. The problem is: nobody knows that they exist, or how to use them. This project is about telling the world how great Shogun is.
This project would massively boost Shogun's acceptance in the world. First of all, a large number of algorithms in Shogun (such as MKL and most SVMs) are not covered by any demo that goes beyond a simple API demonstration. Making them visible will attract people. Second, these examples are extremely useful for promoting Shogun to people at conferences, workshops, etc., for example to get funding from industry.