
GSoC 2014 Ideas


Ideas for Google Summer of Code 2014 Projects

Shogun is a software toolbox for machine learning algorithms, with a special focus on large-scale applications. During Google Summer of Code 2014 we are looking forward to extending our library with both state-of-the-art and fundamental machine learning algorithms, as well as with infrastructure improvements.

Below is a list of suggested projects. If you have your own concrete idea, or want to tackle a task proposed in previous years that has not been addressed yet - talk to us! We might be interested.

Last but not least, read thoroughly the FAQ for GSoC applicants.

GSoC 2011 ideas
GSoC 2012 ideas
GSoC 2013 ideas


Table of Contents

Machine learning tasks

Infrastructure improvements

List of Ideas

Machine learning tasks

  • Essential Deep Learning Modules

    Mentor: Theofanis Karaletsos (website: http://www.mskcc.org/research/lab/gunnar-ratsch/member/theofanis-karaletsos)
    Shogun co-mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
    Difficulty: Medium
    Requirements: C++, Python, machine learning

    Deep Learning has recently attracted a lot of attention in the Machine Learning community for its ability to learn features that allow high performance in a variety of tasks ranging from Computer Vision to Speech Recognition and Bioinformatics. The main goal of this task is to integrate essential building blocks of deep learning algorithms into Shogun. This includes Restricted Boltzmann Machines (RBMs) and their training algorithms, Deep Belief Networks (DBNs), feed-forward networks and convolutional networks. The architecture and software design for flexible usage and adaptation of models should be created to build a foundation for integrating many more models, evaluation methods and training algorithms in the future. In more detail, this idea involves implementing a software foundation for deep learning and the first few algorithms (RBMs and their training, stacking of RBMs, wake-sleep for DBNs, discriminative fine-tuning with backprop, FFNs); a minimal sketch of a single RBM training step is given after the references below. It is also worth investigating current implementations (like Caffe [3]) and possibly even implementing some wrapping code to use them from Shogun.

    This idea is a great chance to learn the deep learning approach and the essential principles of implementing deep learning algorithms. With so much attention being drawn to deep learning, this is an important skill for any researcher or engineer working with data.

    References:
    [1] Comprehensive Python toolboxes and tutorials with code examples are available in Theano and in the code releases of Alex Krizhevsky and Nitish Srivastava.
    [2] Deep learning reading list.
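
    To make the RBM building block concrete, here is a minimal numpy sketch (not Shogun code, and only one of many possible designs) of a single contrastive-divergence (CD-1) update for a binary-binary RBM; an actual Shogun implementation would wrap such an update in proper feature, model and trainer classes:

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def cd1_update(W, b_vis, b_hid, v0, lr=0.01, rng=np.random):
          """One contrastive-divergence (CD-1) step for a binary-binary RBM.
          W: (n_visible, n_hidden), b_vis: (n_visible,), b_hid: (n_hidden,),
          v0: (batch, n_visible) binary data batch."""
          # positive phase: hidden probabilities and samples given the data
          ph0 = sigmoid(v0 @ W + b_hid)
          h0 = (rng.rand(*ph0.shape) < ph0).astype(v0.dtype)

          # negative phase: one step of Gibbs sampling (reconstruction)
          pv1 = sigmoid(h0 @ W.T + b_vis)
          v1 = (rng.rand(*pv1.shape) < pv1).astype(v0.dtype)
          ph1 = sigmoid(v1 @ W + b_hid)

          # gradient approximation: <v h>_data - <v h>_model
          batch = v0.shape[0]
          W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
          b_vis += lr * (v0 - v1).mean(axis=0)
          b_hid += lr * (ph0 - ph1).mean(axis=0)
          return W, b_vis, b_hid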

  • Variational Learning for Recommendations with Big Data

    Mentor: Mohammad Emtiyaz Khan (website: http://www.cs.ubc.ca/~emtiyaz/)
    Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
    Difficulty: Medium to Difficult
    Requirements: C++, familiarity with optimization methods; preferred (but not required) that you have a basic understanding of Bayesian models (or just GPs) and variational approximations
    Useful: Matlab (for reference code), Eigen3, Shogun’s GP framework

    The era of big data calls for methods that scale well. For example, websites such as Netflix and Amazon collect huge amounts of data about users’ preferences. Efficient use of such data can help improve recommendations and user experience through personalization. In this project, our goal will be to develop scalable code for learning from such data.

    We will focus on models that use matrix factorization and Gaussian processes. Matrix factorization is a very popular model for recommendation systems - in fact, in the Netflix challenge, this model had the best performance obtained by any single model! We will use variational approximations since they have the potential to scale well to big data. The objective function in Bayesian models is usually intractable, and variational methods instead optimize a tractable lower bound to learn the model.

    Our main tool in this project will be the implementation of many convex optimization methods for fast variational inference. Previously, in other GSoCs, Jacob Walker and Roman Votjakov implemented a flexible framework for Gaussian process regression and classification models. We will use their initial work and extend the existing infrastructure to allow variational inference for GPs.

    The project timeline, along with estimated time and difficulty level, is given below. (Note that this is a preliminary list, which might prove to be too much.)
    • (2 weeks, difficult) Implement the KL-method of Nickisch and Rasmussen 2008 for GPs.
    • (1 week, easy) Implement the KL-method of Challis and Barber 2011 for GPs.
    • (1 week, easy) Implement the dual variational inference of Khan et al. 2013 for GPs.
    • (2 weeks, difficult) Implement a stochastic method for computing inverses approximately (from Hannes’ toolbox) for GPs.
    • (1 week, easy) Generalize these methods to a general latent Gaussian model.
    • (1 week, easy) Implement probabilistic PCA of Tipping and Bishop 1999.
    • (2 weeks, difficult) Incorporate the KL-method into PCA to learn from non-Gaussian data, similar to Matchbox 2009.

    You will gain experience working with optimization algorithms, and get a little peek into convex optimization theory as well. You will learn about Bayesian models such as matrix factorization and Gaussian processes, and about variational inference. The most fun part is that you will get to play with real-world data, such as recommendation data. You will be able to understand why these datasets are difficult, and what kind of models are useful for them. The project is quite flexible in terms of how much can be done and offers a lot of room for sophisticated extensions while the basic tasks are not too hard. You can play a lot with these things (and we expect you to). Should be quite fun! (A minimal sketch of the probabilistic PCA item from the list above follows below.)
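
    As a small illustration of the probabilistic PCA item above (plain numpy, independent of Shogun's classes and of any variational machinery), the closed-form maximum-likelihood solution of Tipping and Bishop (1999) can be sketched as:

      import numpy as np

      def ppca_ml(X, q):
          """Closed-form maximum-likelihood PPCA (Tipping & Bishop, 1999).
          X: (n_samples, d) data matrix, q: number of latent dimensions (q < d).
          Returns the loading matrix W (d, q) and the noise variance sigma2."""
          Xc = X - X.mean(axis=0)                    # center the data
          S = np.cov(Xc, rowvar=False)               # sample covariance (d, d)
          evals, evecs = np.linalg.eigh(S)           # ascending eigenvalues
          evals, evecs = evals[::-1], evecs[:, ::-1] # sort descending
          sigma2 = evals[q:].mean()                  # average discarded variance
          W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
          return W, sigma2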

  • Fundamental Machine Learning algorithms

    Mentor: Fernando Iglesias (email: fernando.iglesiasg at gmail.com, irc: iglesiasg)
    Difficulty: Easy to Medium
    Requirements: experience with C++, Python, Octave/Matlab, understanding of basic machine learning algorithms
    Useful: experience with Eigen3

    The machine learning research community is fast-paced, and international conferences produce new methods every year. Nevertheless, well-established, fundamental algorithms are an indispensable tool for most practitioners, as they are well understood and their performance has been proven in a wide range of applications. The goal of this project is to implement fundamental ML algorithms that Shogun currently lacks, to polish and improve the existing ones, and to illustrate their use through (multi-language) examples and IPython notebooks on real-life datasets.

    A first part of the project involves the implementation of decision trees in Shogun. The C4.5 algorithm [1] is a starting point -- we are also open to suggestions. Once a method to generate a decision tree is ready, use it to provide a random forests implementation [2].

    A second part is related to the well-known k-nearest neighbours (kNN) algorithm. For fast kNN search, we use a very efficient cover tree implementation already integrated into Shogun. kd-trees are another data structure useful for kNN search. This part of the project consists of implementing a kd-tree class whose performance is comparable to that of the cover tree currently available.

    Yet another new fundamental method we want to include in Shogun is kernel density estimation [4] (also known as Parzen windows in some domains). Such an implementation must be able to leverage the many kernels available in Shogun.

    We would also like to have a more flexible Principal Component Analysis (PCA) and Kernel PCA (KPCA) interface. For instance, we would like to give the user the possibility to choose between using Singular Value Decomposition (SVD) or eigendecomposition depending on whether the number of feature vectors is larger than the number of dimensions, or vice versa (a minimal sketch of this choice is given after the references below). The current implementations could be polished as well, introducing the use of Eigen3.

    Least-Angle Regression (LARS) and Lasso are two fundamental statistical tools for regression already implemented in Shogun. We would like to have a slightly larger example, in the form of a notebook, that compares these regression techniques. This comparison would then be broadened to include other methods implemented in Shogun, such as Support Vector Regression and Gaussian Processes [5].

    A very important part of this project is to illustrate how these methods should be used, via IPython notebooks and examples (at least covering C++, Python, and Octave/Matlab). The student should be able to come up with (or find in textbooks, articles, etc.) good-looking examples -- i.e. with nice figures -- that illustrate the concepts underlying the implemented algorithms. A large-scale classification application with kNN is another idea.

    Suggested entrance tasks:

    References:
    [1] C4.5.
    [2] Random forests.
    [3] KD-trees.
    [4] Kernel Density Estimation.
    [5] GPs notebook.
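
    As a hedged illustration of the PCA interface item above (plain numpy, independent of Shogun's classes), the SVD-versus-eigendecomposition choice could look like this:

      import numpy as np

      def pca(X, n_components):
          """PCA that switches between an eigendecomposition of the covariance
          (cheap when n_features <= n_samples) and a thin SVD of the centered
          data matrix (cheap when n_samples < n_features)."""
          Xc = X - X.mean(axis=0)
          n, d = Xc.shape
          if d <= n:
              # eigendecomposition of the (d, d) covariance matrix
              C = Xc.T @ Xc / (n - 1)
              evals, evecs = np.linalg.eigh(C)
              order = np.argsort(evals)[::-1][:n_components]
              components = evecs[:, order].T
          else:
              # thin SVD of the (n, d) data avoids forming a huge (d, d) matrix
              U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
              components = Vt[:n_components]
          return Xc @ components.T      # projected data, shape (n, n_components)
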
  • Implementation of Recent One-class SVM Extensions

    Mentor: Nico Goernitz (website: http://www.user.tu-berlin.de/nico.goernitz/)
    Difficulty: Medium
    Requirements: C++, Python, basic machine learning and optimization skills

    One-class learning, a.k.a. density level-set estimation, anomaly detection, fraud detection, concept learning, etc., has been around for a while. The most famous algorithms are the one-class support vector machine (Schoelkopf et al., 1999) and the support vector data description (Tax and Duin, 1999). Both share common ideas and, in fact, under fairly general circumstances, are interchangeable. Due to their simplicity, they have been applied in many different areas and in various, even surprising, settings (e.g. multi-class classification, just to name one that contradicts the name).

    Semi-supervised anomaly detection (2013), latent variable support vector data description (2014) and multitask one-class support vector machines (2010) are some quite recent extensions based on the one-class SVM and SVDD. Some of the above methods have multiple formulations (convex and convex approximations), while others are intrinsically non-convex.

    Basic optimization knowledge is necessary, as well as an understanding of the underlying machine learning techniques (SVMs, kernels, regularization, etc.). Implementation languages will be C++ and Python. Standalone Python implementations of the above-mentioned methods will be provided. (A minimal example of the baseline one-class SVM that these extensions build on is given after the references below.)

    References:
    N. Görnitz, M. Kloft, K. Rieck, U. Brefeld "Toward Supervised Anomaly Detection" Journal of Artificial Intelligence Research (JAIR), 2013.
    Haiqin Yang, Irwin King, Michael R. Lyu, "Multi-task Learning for One-class Classification" The 2010 International Joint Conference on Neural Networks (IJCNN)
    N. Görnitz, A. Porbadnigk, A. Binder, C. Sannelli, M. Braun, K.-R. Mueller, M. Kloft "Learning and Evaluation in Presence of Non-i.i.d. Label Noise" (accepted at) AISTATS, 2014
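
    To illustrate the baseline these extensions start from, here is a minimal, hedged example of training a standard one-class SVM on toy data; scikit-learn is used here purely for compactness, the project itself will use the provided reference implementations and Shogun's own classes:

      import numpy as np
      from sklearn.svm import OneClassSVM

      rng = np.random.RandomState(0)
      X_train = 0.3 * rng.randn(200, 2)               # nominal data around the origin
      X_test = np.vstack([0.3 * rng.randn(20, 2),     # more nominal points
                          rng.uniform(-4, 4, (20, 2))])  # outliers

      # nu upper-bounds the fraction of training errors and
      # lower-bounds the fraction of support vectors
      clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)
      pred = clf.predict(X_test)                      # +1 = inlier, -1 = anomaly
      print("flagged anomalies:", np.sum(pred == -1))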

  • Large Scale Learning: approximate kernel expansions in loglinear time (aka Fast Food)

    Mentor: Andreas Ziehe (website: http://www.ml.tu-berlin.de/menue/mitglieder/andreas_ziehe/)
    Shogun co-mentor: Soeren Sonnenburg (email: sonne@debian.org, irc: sonney2k)
    Difficulty: Medium to Difficult
    Requirements: C++, machine learning, linear algebra

    Large scale learning with kernels is hindered by the fact that kernel classifiers require computing kernel expansions of the form \( f(x)=\sum_{i=1}^N \alpha_i K(x,x_i) \). Obviously, the more non-zero coefficients \( \alpha_i \) there are, the slower the kernel machine. Recently, progress has been made in drastically speeding up kernel machines by approximating the kernel feature space [1,2,3]. The "Fast Food" approach [3], which can be seen as a modification of the "random kitchen sinks" method, offers O(n log d) computation and O(n) storage for n basis functions in d dimensions. The key to achieving these speedups is the use of the fast Hadamard transform for fast matrix-vector multiplication [6]. (A minimal sketch of the random kitchen sinks idea is given after the references below.)

    Suggested road map:

    • Familiarize yourself with [1,2,3] and Shogun (in particular CKernelMachine and CDotFeatures).
    • Familiarize yourself with random fourier features [2] via RandomFourierDotFeatures in shogun/features/ and RandomFourierGaussPreproc.cpp in shogun/preprocessor.
    • Familiarize yourself with the Fast Hadamard transform and its implementation [6].
    • Familiarize yourself with [3,4,5] and implement the FastFood method. Implement a speed comparison of [1,2,3].
    References:
    [1] Efficient Additive Kernels via Explicit Feature Maps (Vedaldi, 2011)
    [2] Random kitchen sinks.
    [3] Fastfood - Computing Hilbert Space Expansions in loglinear time, Quoc Le, Tamas Sarlos, Alexander Smola; JMLR W&CP 28 (3): 244–252, 2013
    [4] Fast Food video lecture.
    [5] Fast Food techtalk
    [6] SPIRAL WHT Package.
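
    For orientation, here is a minimal, hedged numpy sketch of the plain random kitchen sinks feature map [2] that Fast Food accelerates; Fast Food replaces the dense Gaussian matrix W below by a product of Hadamard, permutation and diagonal matrices to reach O(n log d) cost:

      import numpy as np

      def random_fourier_features(X, n_features, gamma, rng=np.random):
          """Random kitchen sinks approximation of the Gaussian kernel
          k(x, y) = exp(-gamma * ||x - y||^2), so that z(X) z(Y)^T ~ K(X, Y)."""
          d = X.shape[1]
          W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
          b = rng.uniform(0, 2 * np.pi, size=n_features)
          return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

      # sanity check: the linear kernel of the features approximates the RBF kernel
      X = np.random.randn(5, 10)
      Z = random_fourier_features(X, 2000, gamma=0.5)
      approx_gram = Z @ Z.T
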
  • Generic Framework for Markov Chain Monte Carlo Algorithms and Stan Interface

    Mentors: Theodore Papamarkou (website: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/papamarkou/), Dino Sejdinovic (website: http://www.gatsby.ucl.ac.uk/~dino/)
    Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
    Difficulty: Medium to Difficult
    Requirements: C++, basic understanding of Monte Carlo methods
    Useful: Stan, Shogun’s GP framework, experience with Hamiltonian Monte Carlo

    Monte Carlo methods are used for Bayesian statistical analysis in various disciplines, including machine learning. For this reason, several statistical methods in Shogun’s existing toolboxes require interacting with Monte Carlo samplers. An example of this is fully Bayesian GP classification, which requires sampling from the posterior of the GP hyperparameters. The aim of this project is to provide a coding framework for Monte Carlo sampling in Shogun. Other parts of Shogun will be able to use this framework either to call Monte Carlo samplers already available in the framework or to code new samplers complying with the framework’s unified API. The fully modular framework would allow both adaptive and non-adaptive MCMC methods, and easy addition of novel adaptation procedures, such as Kameleon MCMC [3]. In addition, pseudo-marginal MCMC (based on [1]) will be included. The main application would be the use of MCMC methods for Gaussian process classification, building on two previous GSoC projects on GPs.

    The project entails the following sequential steps:

    • The major goal of the project would be to organize the infrastructure of the MCMC framework in a way that
      • compartmentalizes its components so as to enhance future code usage and development,
      • organizes the existing map of Monte Carlo samplers in a natural way,
      • facilitates interaction with external Monte Carlo libraries.
    • The designed framework will then be coded as a standalone Shogun toolbox, serving
      • as a Monte Carlo API for Shogun developers,
      • as a toolbox with its own MCMC samplers.
    • The final stage will involve using the coded API to build an interface to Stan in Shogun. This last step is provisional and its progress will depend on the timeline of the previous two steps and on the technicalities arising from Stan’s coding infrastructure. Ideally, Stan’s interface in Shogun will be developed; if technical issues hinder timely development, then at minimum a plan will be provided outlining how future development will tackle the task.
    The framework has already been partially conceptualized and documented using a UML graph, see [2]. A Python prototype has also been coded to implement the UML graph, see [2]. This preparatory work will ease the completion of step 1, shifting the focus to step 2. You will play a key role in the initial step of a long-term effort to extend the functionality offered by Shogun, and to build a bridge between the toolsets used by statistics and ML practitioners. This could have a large impact in terms of usage, as MCMC methods are a ubiquitous tool in a variety of scientific disciplines. A solid understanding of MCMC is crucial; however, it is not required that you understand all sophisticated methods in detail yet. We will provide a detailed description of all involved algorithms along with working implementations and help you on the way. (A minimal Metropolis-Hastings sketch is given after the references below.)

    References:
    [1] CInferenceMethod in Shogun's Doxygen documentation.
    [2] Shogun MCMC prototype, in particular the UML diagram.
    [3] D. Sejdinovic, M. Lomeli Garcia, H. Strathmann, C. Andrieu and A. Gretton, Kernel adaptive Metropolis-Hastings, working implementation at GitHub.
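
    For orientation, here is a minimal, hedged sketch of the most basic sampler such a framework would contain, random-walk Metropolis-Hastings, written in plain numpy rather than against any Shogun API:

      import numpy as np

      def metropolis_hastings(log_target, x0, n_samples, step=0.5, rng=np.random):
          """Random-walk Metropolis-Hastings with a Gaussian proposal.
          log_target: function returning the unnormalized log-density at x."""
          x = np.asarray(x0, dtype=float)
          logp = log_target(x)
          samples = np.empty((n_samples, x.size))
          for i in range(n_samples):
              proposal = x + step * rng.randn(x.size)     # symmetric proposal
              logp_prop = log_target(proposal)
              if np.log(rng.rand()) < logp_prop - logp:   # accept/reject
                  x, logp = proposal, logp_prop
              samples[i] = x
          return samples

      # example: sample from a standard 2D Gaussian
      chain = metropolis_hastings(lambda x: -0.5 * np.dot(x, x), np.zeros(2), 5000)
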
  • Output Kernel Learning

    Mentor: Cheng Soon Ong (website: http://www.ong-home.my/)
    Shogun co-mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
    Difficulty: Medium
    Requirements: C++, linear algebra

    In this task, the student would work on implementations of a few algorithms for output kernel learning. Output kernel learning is a kernel-based technique for solving learning problems with multiple outputs, such as multi-class and multi-label classification or vectorial regression, while automatically learning the relationships between the different output components. This idea is mainly about implementing the algorithm with proper testing routines and gives the student a great opportunity to work on state-of-the-art machine learning algorithms and problem formulations.

    References:
    [1] F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, 118:119-126, 2013.
    [2] F. Dinuzzo, and K. Fukumizu. Learning low-rank output kernels. JMLR: Workshop and Conference Proceedings. Proceedings of the 3rd Asian Conference on Machine Learning. 20:181–196, Taoyuan, Taiwan, 2011.
    [3] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In International Conference on Machine Learning, Bellevue WA (USA), 2011.

  • Testing and Measuring Variable Interactions With Kernels

    Mentor: Dino Sejdinovic (website: http://www.gatsby.ucl.ac.uk/~dino/)
    Shogun co-mentor: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
    Difficulty: Difficult
    Requirements: Strong C++ skills, basic knowledge of kernel methods and hypothesis testing
    Useful: Matlab, Python, Eigen3

    Testing and measuring dependence between paired data observations is a fundamental scientific task, and is often an important part of the feature selection procedure needed for a more sophisticated ML problem. In recent years, significant progress was made in both the machine learning and the statistics communities to capture non-linear associations and to extend the formalism, via the kernel trick, to datasets that belong to more complex, structured domains, like strings or graphs. The aim of this project is to extend Shogun’s modular implementation of these novel dependence measures and make it applicable to generic data domains. In addition, the corresponding feature selection procedures will be implemented and integrated into Shogun’s data preprocessing framework. The project builds on the GSoC 2012 project on kernel methods for two-sample testing [8].

    This will be an exciting opportunity to acquire intimate knowledge of cutting-edge kernel techniques, with a large number of potential users from different scientific fields. While *basic* understanding of the involved concepts is crucial, it is not required that you understand all methods in detail. We will provide a detailed description of all involved algorithms along with working implementations and help you on the way.

    Suggested road map:

    • Review and enhance the Hilbert-Schmidt Independence Criterion (HSIC) [1] and distance correlation/covariance [2] (two dependence measures belonging to the same kernel-based framework), allowing the user to specify either the kernel or the metric needed to compute the quantities in a modular fashion
    • Implement a measure of multivariate (higher-order) interaction [3]
    • Implement normalized HSIC/NOCCO [4] and the copula-based kernel-dependence measure [5]
    • Implement conditional variants of each of the above quantities [4] (optional)
    • Review and enhance the Shogun class for statistical significance tests that use the corresponding dependence measures as test statistics, using several additional approaches to approximate the statistic's distribution under the null hypothesis, for example [9]
    • Implement automated selection of the parameters of each dependence measure that result in the most powerful test, following the approach in [6] (this would build on the existing framework for two-sample testing)
    • Implement a Shogun class for feature selection procedures (e.g., [7]) based on the corresponding dependence measures, to be added to the Shogun preprocessing framework
    • Detailed IPython worked examples assessing testing performance and feature selection performance -- these will include a big data problem with millions of data points
    • API illustration via mini-examples in both C++ and at least two modular language bindings.
    (A minimal HSIC computation on toy data is sketched after the references below.)
    References:
    [1] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Scholkopf and A. Smola, A kernel statistical test of independence. In Advances in Neural Information Processing Systems (NIPS), 2008.
    [2] G. Szekely and M. Rizzo, Brownian distance covariance. Ann. Appl. Stat. 3 1236– 1265, 2009
    [3] D. Sejdinovic, A. Gretton and W. Bergsma, A kernel test for three-variable interactions, in Advances in Neural Information Processing Systems (NIPS), 2013.
    [4] K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf, Kernel Measures of Conditional Dependence. Advances in Neural Information Processing Systems, 2008.
    [5] B. Poczos, Z. Ghahramani, and J. Schneider, Copula-based Kernel Dependency Measures, International Conference on Machine Learning (ICML), 2012.
    [6] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil and K. Fukumizu, Optimal kernel choice for large-scale two-sample tests, in Advances in Neural Information Processing Systems (NIPS), 2012.
    [7] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt, Feature Selection via Dependence Maximization, J. Mach. Learn. Res. 13:1393−1434, 2012.
    [8] GSoC 2012 follow-up and CTestStatistic in Shogun's documentation.
    [9] W. Zaremba, A. Gretton, and M. Blaschko, B-test: A Non-parametric, Low Variance Kernel Two-sample Test. In Advances in Neural Information Processing Systems (NIPS), 2013.
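
    As promised above, here is a minimal, hedged numpy sketch of the biased empirical HSIC statistic [1] on toy data; the project will of course build on Shogun's kernel and statistics classes instead:

      import numpy as np
      from scipy.spatial.distance import cdist

      def rbf_gram(X, sigma):
          """Gaussian kernel Gram matrix with bandwidth sigma."""
          return np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))

      def hsic(K, L):
          """Biased empirical HSIC: (n-1)^{-2} tr(K H L H) with H = I - 11^T/n.
          Large values indicate dependence between the two paired samples."""
          n = K.shape[0]
          H = np.eye(n) - np.ones((n, n)) / n
          return np.trace(K @ H @ L @ H) / (n - 1) ** 2

      rng = np.random.RandomState(0)
      X = rng.randn(200, 1)
      Y = X ** 2 + 0.1 * rng.randn(200, 1)     # nonlinearly dependent on X
      print(hsic(rbf_gram(X, 1.0), rbf_gram(Y, 1.0)))
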
  • Dictionary Learning

    Mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
    Difficulty: Medium
    Requirements: C++, Python

    Dictionary learning is an optimization approach to avoid manual feature engineering in machine learning: a dictionary of atoms and a sparse code over those atoms are learned jointly from the data. This task is a great opportunity to learn the basics of dictionary learning and gradually reach state-of-the-art techniques relevant to various machine learning applications.
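
    For orientation, a toy, hedged numpy sketch of the classic alternating formulation (l1-penalized sparse coding, then a dictionary update); real implementations use more refined schemes such as online or block-coordinate updates:

      import numpy as np

      def soft_threshold(A, t):
          return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

      def dictionary_learning(X, n_atoms, lam=0.1, n_iter=30, rng=np.random):
          """Toy alternating minimization of ||X - D A||_F^2 + lam * ||A||_1.
          X: (d, n_samples). Returns dictionary D (d, n_atoms) and codes A."""
          d, n = X.shape
          D = rng.randn(d, n_atoms)
          D /= np.linalg.norm(D, axis=0)
          A = np.zeros((n_atoms, n))
          for _ in range(n_iter):
              # sparse coding: a few ISTA steps with the dictionary fixed
              step = 1.0 / np.linalg.norm(D.T @ D, 2)
              for _ in range(20):
                  A = soft_threshold(A - step * D.T @ (D @ A - X), step * lam)
              # dictionary update: least squares with codes fixed, then renormalize
              D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(n_atoms))
              D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
          return D, A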

  • LP/QP Optimization Framework

    Mentor: Viktor Gal (email: vigsterk at gmail.com, irc: wiking)
    Difficulty: Difficult
    Requirements: optimization theory, good c++ skills
    Useful: opencl/cuda knowledge

    This task involves implementing a framework for LP/QP optimization within Shogun. It should be as modular as possible, since we also want a general KKT solver as part of this project. The framework would be a thin wrapper that defines mappings to several existing and well-known optimization libraries (libqp, MOSEK, GLPK, CPLEX, etc.). Using this framework, implement cone programming (both LP and QP).
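
    A minimal, hedged sketch of what such a thin wrapper could look like, here with a single backend mapping to cvxopt; the class names and interface are illustrative only, not the intended design:

      import numpy as np
      from cvxopt import matrix, solvers

      class QPSolver:
          """Hypothetical thin wrapper: min 0.5 x'Px + q'x  s.t.  Gx <= h, Ax = b.
          Different backends (libqp, MOSEK, GLPK, CPLEX, ...) would implement solve()."""
          def solve(self, P, q, G=None, h=None, A=None, b=None):
              raise NotImplementedError

      class CvxoptBackend(QPSolver):
          def solve(self, P, q, G=None, h=None, A=None, b=None):
              to = lambda M: None if M is None else matrix(np.asarray(M, dtype=float))
              sol = solvers.qp(to(P), to(q), to(G), to(h), to(A), to(b))
              return np.array(sol["x"]).ravel()

      # example: projection of the point (1, 2) onto the non-negative orthant
      P, q = np.eye(2), -np.array([1.0, 2.0])
      G, h = -np.eye(2), np.zeros(2)
      x = CvxoptBackend().solve(P, q, G, h)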

  • Large-Scale (hierarchical) Multi-Label Classification

    Mentor: Thoralf Klein (email: thoralf at fischlustig.de, irc: thoralf)
    Difficulty: Medium
    Requirements: multi-label theory, structured-output learning, good c++ skills

    Formally speaking, multi-label classification is about predicting sets of labels for given inputs, i.e. every input can be assigned none, one, or many labels. It is commonly used in text mining and is related to multi-class classification, where we have a set of labels as well, but each input is assigned to *exactly* one label.

    While in multi-class classification the domains of the target labels are not allowed to overlap, multi-label classification can handle these cases as well. This opens a broad range of new problem settings, but since we are predicting *sets of integers* instead of integers, this yields many interesting computational problems.

    The first task in this project: implement a simple but efficient approach to predict multi-labels for given inputs. The multi-label alphabet is considered to have 1000 or more labels -- enough to show the limitations of naive approaches. Solutions to this task may include parallelized training runs, exploiting the sparseness of the multi-labels and using dimension-reduction techniques like feature hashing. (Label classes for this already exist and are just waiting for integration.)

    The second task is to implement some well-known metrics for evaluating multi-label predictions. Applying the developed model on some provided training data should be straightforward then.

    The last (and most challenging) task is to extend the approach to hierarchical multi-labels, often called taxonomies. Exploiting this structure allows us to implement more efficient algorithms. If one, for example, requires that a label can only be set if the parent label is set as well, then this allows us to lazily predict the structure. This would save a lot of computing power, but requires implementing efficient decoding (dynamic programming, adapting the forward-backward algorithm, also known as Viterbi decoding or belief propagation).

    Finally, the focus of this project is less about researching multi-label classification and more about implementing memory- and time-efficient algorithms that scale well with growing problems. There is already a lot of research done in this field. This project allows the participant to adjust the focus between algorithms, sophisticated models, optimizing infrastructure, or efficient data structures. (A toy example of the binary-relevance-plus-feature-hashing baseline is sketched below.)
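
    As a small, hedged illustration of the naive baseline (binary relevance with feature hashing) that the project would go well beyond; scikit-learn is used only for compactness, the corpus is a toy stand-in:

      from sklearn.feature_extraction import FeatureHasher
      from sklearn.linear_model import SGDClassifier
      from sklearn.multiclass import OneVsRestClassifier
      from sklearn.preprocessing import MultiLabelBinarizer

      # toy corpus: token lists and multi-label targets
      docs = [["cheap", "viagra", "offer"], ["meeting", "tomorrow"],
              ["cheap", "flights", "offer"], ["project", "meeting", "notes"]]
      labels = [["spam"], ["work"], ["spam", "ads"], ["work"]]

      # feature hashing keeps memory bounded even for huge vocabularies
      X = FeatureHasher(n_features=2 ** 18, input_type="string").transform(docs)
      Y = MultiLabelBinarizer().fit_transform(labels)

      # binary relevance: one independent linear classifier per label
      clf = OneVsRestClassifier(SGDClassifier(), n_jobs=-1).fit(X, Y)
      print(clf.predict(X))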

  • Implement dual coordinate solver for Structured Output SVMs

    Mentor: Vojtech Franc (website: http://cmp.felk.cvut.cz/~xfrancv/)
    Shogun mentor: Soeren Sonnenburg (email: sonne at gmail.com, irc: sonney2k)
    Difficulty: Medium
    Requirements: optimization theory, C++, basic knowledge of structured output learning

    The structured output Support Vector Machine (SO-SVM) [1] is a supervised method for learning the parameters of a linear classifier with a possibly exponentially large number of classes. Learning is translated into a convex optimization program whose size scales with the number of classes, making the problem intractable for off-the-shelf solvers. Dual coordinate ascent (DCA) based methods, well known e.g. from two-class SVMs, have recently been proposed for solving the SO-SVM [2][3]. The DCA algorithms combine the advantages of approximate online solvers and precise cutting plane methods used so far. In particular, the DCA algorithms process training examples in an online fashion via a simple update rule and at the same time provide a certificate of optimality. (The two-class analogue of such an update is sketched after the references below.)

    The goal of this project will be to i) implement several variants of the DCA algorithms ([2][3]; others will be provided by the mentors), ii) integrate them smoothly into Shogun and iii) compare them with the SO-SVM solvers already included in Shogun on standard benchmarks.

    References:
    [1] I. Tsochantaridis, T.Joachims, T.Hoffman, Y.Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
    [2] P. Balamurugan, S.K. Shevade, S. Sundararajan, and S.S. Keerthi. A sequential dual method for structural SVMs. In SIAM Conference on Data Mining, 2011.
    [3] S.Lacoste-Julien, M.Jaggi, M.Schmidt, and P.Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In proc. of ICML, 2013.
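
    For orientation only, here is a hedged numpy sketch of dual coordinate ascent for the plain two-class L1-loss linear SVM (in the spirit of Hsieh et al., 2008); the SO-SVM variants [2][3] generalize this per-example update to exponentially many outputs:

      import numpy as np

      def dca_linear_svm(X, y, C=1.0, n_epochs=20):
          """Dual coordinate ascent for the two-class L1-loss linear SVM.
          X: (n, d) data, y: labels in {-1, +1}. Returns the primal weights w."""
          n, d = X.shape
          alpha = np.zeros(n)
          w = np.zeros(d)
          for _ in range(n_epochs):
              for i in np.random.permutation(n):
                  g = y[i] * X[i].dot(w) - 1.0            # gradient of dual coordinate i
                  pg = min(g, 0.0) if alpha[i] == 0 else (max(g, 0.0) if alpha[i] == C else g)
                  if abs(pg) > 1e-12:
                      old = alpha[i]
                      alpha[i] = min(max(old - g / X[i].dot(X[i]), 0.0), C)  # box-constrained step
                      w += (alpha[i] - old) * y[i] * X[i]   # maintain w = sum_i alpha_i y_i x_i
          return w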

  • Structured Output Learning with Approximate Inferences

    Mentor: Shell Hu (email: dom343 AT gmail DOT com, irc: hushell)
    Shogun mentor: Thoralf Klein (email:thoralf AT fischlustig DOT de, irc: thoralf)
    Difficulty: Medium to Difficult
    Requirements: C++, Python, probabilistic graphical models, structured output learning

    A major challenge of every structured output problem is to implement an efficient "inference" method. This method is then used in training and prediction to determine the highest-scoring ("best") output for a given input. Since the output space is usually exponential in size, we cannot evaluate all possible outputs to choose the "best" one. (Examples of such NP-hard inference problems are graphs where one wants to predict "outputs" for each node, while each node depends on its neighbours.)

    Therefore, structured output (SO) problems need to define "efficient" inference, or even approximations to the real solution. Previous research [1,2] shows that a bad approximation of inference will probably result in a poor predictor.

    In the first part of the project, the student will implement two well-known approximation methods: (a) linear programming (LP) and (b) graph-cut relaxations. Both methods are known to behave well in the SO learning setting. The task can be seen as a good entrance to machine learning, graph theory and graphical models.

    The second part of the project is to implement the dual loss formulation of structured output learning using LP relaxation [3]. Instead of solving computationally intensive inference per input example, the dual loss formulation turns the entire problem into a joint minimization over parameters and dual variables, which is shown to be a more efficient method for large-scale structured output learning problems.

    The most exciting part of the project is that we will show some cool computer vision demos, such as image denoising and image segmentation. (A toy illustration of approximate MAP inference on a grid model is sketched after the references below.)

    References:
    [1] Alex Kulesza and Fernando Pereira, Structured Learning with Approximate Inference, NIPS 2007.
    [2] Thomas Finley and Thorsten Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008.
    [3] Ofer Meshi, David Sontag, Tommi Jaakkola and Amir Globerson, Learning Efficiently with Approximate Inference via Dual Losses, ICML 2010.
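
    To give a feeling for what approximate MAP inference computes, here is a hedged toy sketch of iterated conditional modes (ICM) for binary image denoising on a grid MRF; note that this greedy scheme is only an illustration of approximate inference, not the LP or graph-cut relaxations the project will actually implement:

      import numpy as np

      def icm_denoise(noisy, unary_weight=2.0, pairwise_weight=1.0, n_sweeps=5):
          """Iterated conditional modes on a grid MRF with labels in {-1, +1}:
          minimize  -unary_weight * sum_i x_i y_i - pairwise_weight * sum_{i~j} x_i x_j.
          Greedy but cheap; LP relaxations / graph cuts target the same MAP problem."""
          x = noisy.copy()
          H, W = x.shape
          for _ in range(n_sweeps):
              for i in range(H):
                  for j in range(W):
                      nb = 0.0
                      if i > 0: nb += x[i - 1, j]
                      if i < H - 1: nb += x[i + 1, j]
                      if j > 0: nb += x[i, j - 1]
                      if j < W - 1: nb += x[i, j + 1]
                      # pick the label with the lower local energy
                      score = unary_weight * noisy[i, j] + pairwise_weight * nb
                      x[i, j] = 1 if score >= 0 else -1
          return x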

Infrastructure improvements

  • SVMbright - SVM^light as MIT license rewrite

    Mentor: Björn Esser (email: bjoern.esser at gmail.com, irc besser82)
    Difficulty: Difficult
    Requirements: C++(11) programming, support vector machines, optimization theory

    SVMlight is one of the first and most advanced SVM solvers. While efficient, it is not open source. This idea is about implementing the algorithms described in Joachims '98 as free, MIT-style licensed software.

    Goals:
    Create an easy-to-use and easy-to-integrate modular SVM framework that is flexible and simple to extend by writing new plugins for different solvers, kernels, algorithms/implementations, and IO drivers for different data/models, with a nice and stable callback API, and replace the existing inlined SVMlight in Shogun with it.

  • OpenCV Integration and Computer Vision Applications

    Mentor: Kevin Hughes (email: kevinhughes27 at gmail.com, irc: pickle27)
    Difficulty: Medium
    Requirements: understanding of computer vision and machine learning, good C++ and python skills

    Build OpenCV integration into Shogun using appropriate design patterns (likely a factory) and update any of the surrounding Shogun infrastructure as required. This Gist might be useful. After integrating with OpenCV create several involved applications using Shogun and OpenCV. Some possibilities include: Optical Character Recognition (OCR), License Plate Identification, Facial Recognition, Fingerprint Identification or suggest your own application that interests you!
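
    As a small, hedged sketch of the kind of data handoff such an integration must make seamless (the file name and patch size are arbitrary placeholders; the factory/wrapper design itself is the actual project):

      import cv2
      import numpy as np

      # load a grayscale image and turn 8x8 patches into a dense feature matrix,
      # ready to be handed to an ML library (Shogun typically expects column vectors)
      img = cv2.imread("digits.png", cv2.IMREAD_GRAYSCALE)
      patches = [img[r:r + 8, c:c + 8].reshape(-1)
                 for r in range(0, img.shape[0] - 8, 8)
                 for c in range(0, img.shape[1] - 8, 8)]
      features = np.asarray(patches, dtype=np.float64).T   # shape: (64, n_patches)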

  • Implement a framework for plugin-based architecture

    Mentors: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn), Viktor Gal (email: vigsterk at gmail.com, irc: wiking)
    Difficulty: Difficult
    Requirements: advanced C++ skills (rather deep understanding of shared libraries, linking, etc)

    Currently, Shogun consists of a monolithic structure of classes, which is somewhat cumbersome to extend and maintain. We consider a plugin architecture as a possible way to solve these problems. Such an architecture would support dynamic behaviour of plugins: a user could download a new classifier and run it instantly without any rebuilds. In this task, the student has the chance to gain a deep understanding of important low-level details of dynamic libraries and ABIs.

  • Implement an easy to use model selection API

    Mentor: Sergey Lisitsyn (email: lisitsyn.s.o at gmail.com, irc: lisitsyn)
    Difficulty: Medium
    Requirements: medium C++ skills, any kind of a sense of syntax beauty

    Shogun’s model selection framework is a powerful tool for selecting parameters. However, it lacks a user-friendly syntax and is overly verbose. This idea involves re-designing the model selection framework to be much easier to use. The student assigned to this project would learn how to design cross-language APIs.
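
    As a hedged point of reference for the level of terseness we would like to reach (an existing scikit-learn idiom, not a proposal for the final Shogun syntax):

      from sklearn.model_selection import GridSearchCV
      from sklearn.svm import SVC
      from sklearn.datasets import load_iris

      X, y = load_iris(return_X_y=True)
      # the kind of brevity we are aiming for: declare the grid, call fit, done
      grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
      print(grid.fit(X, y).best_params_)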

  • Independent jobs Framework

    Mentors: Viktor (email: vigsterkr at gmail.com, irc wiking)
    Difficulty: Difficult
    Requirements: distributed computing, openmp, c++, hadoop, spark

    Although Shogun has a huge selection of ML algorithms, it currently lacks unified support for parallel processing, i.e. running these algorithms using multiple cores or multiple nodes in a cluster.

    Last year, as part of a GSoC project, an Independent Computing Framework was introduced to Shogun, but it has only been applied to a handful of algorithms. Hence one objective of this task is to introduce the framework to other algorithms in Shogun that could benefit from parallel processing.

    The main objective of this project is to unify the way parallel processing is done within the library, as there are currently algorithms in Shogun that use OpenMP and others that use the native POSIX pthreads API. The way to do this is to define a unified parallel processing API which supports both multi-core and multi-node distributed computing. A good example of this is Spark, where an environment variable (MASTER) defines whether tasks are run on a cluster or on different cores of a local machine. Apart from this, the API should also define basic operations such as a parallel loop.

    Ideally, the implemented API should be accessible from the modular interfaces as well.

    It is important that the framework can be used on top of well-known distributed computing frameworks such as Hadoop and Spark. (A toy sketch of the environment-variable-based dispatch idea follows.)
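
    A toy, hedged sketch of the dispatch idea; the SHOGUN_PARALLEL variable and the helper are hypothetical, and a real framework would add a cluster backend (Spark/Hadoop) behind the same call:

      import os
      from concurrent.futures import ThreadPoolExecutor

      def parallel_map(fn, items):
          """Toy version of a unified 'parallel loop': an environment variable
          decides whether work runs serially, on local cores, or (in a real
          framework) on a cluster backend."""
          backend = os.environ.get("SHOGUN_PARALLEL", "serial")  # hypothetical variable
          if backend == "local":
              with ThreadPoolExecutor() as pool:
                  return list(pool.map(fn, items))
          # a "cluster" branch would hand the work to Spark/Hadoop here
          return [fn(x) for x in items]

      squares = parallel_map(lambda x: x * x, range(8))
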
  • Shogun cloud extensions

    Mentors: Viktor (email: vigsterkr at gmail.com, irc wiking)
    Difficulty: Medium
    Requirements: distributed computing, python, docker

    The Shogun cloud service was introduced with release 3.0. It is basically a simple Flask application that serves Docker containers to users, each containing Shogun with the Python modular interface and an IPython server. The code is available here.

    There are several issues (extensions) one would need to solve during this task:
    • auto-update docker containers of the users to the latest shogun docker image
    • tunnel the IPython notebook through the HTTP server using session keys, instead of exposing the service directly.
    • create a frontend-backend setup, where the frontend node is only responsible for authenticating and registering users and for optimally allocating resources among the backend servers, which are responsible for running the docker containers.
  • Native MS Windows port

    Mentors: Viktor (email: vigsterkr at gmail.com, irc wiking)
    Difficulty: Medium
    Requirements: visual c++

    Shogun is missing a native MS Windows port. Because of this we are missing a big user base, as there are still a lot of researchers and developers who use MS Windows primarily.

    Currently the only way to compile Shogun in an MS Windows environment is to use Cygwin. Although with the help of CMake one can generate an MS Visual Studio solution file, the code in its current state cannot be compiled with Visual C++. I have started to work on this task, but a lot of macros are still required to be able to compile Shogun with VC++.

    Although it does not sound like a major task, it is, because one has to take care of all the little things about MS VC++ that make it a not-so-standard C++ compiler.

  • A Meta-Language for Shogun examples

    Mentors: Soeren Sonnenburg (email: soeren.sonnenburg at shogun-toolbox.org, irc sonney2k) Heiko Strathmann (email: heiko.strathmann at gmail.com, irc HeikoS)
    Difficulty: Medium
    Requirements: basic knowledge of the programming languages that Shogun interfaces with (C++, Python, Octave, Java, etc), basic computer-science and formal languages (compilers)
    Useful: SWIG, Shogun's examples, basic ML

    Shogun uses a unique way of automagically generating access to Shogun’s C++ classes from a large number of target languages, based on SWIG. Code in those interfaces is very similar in its syntactic structure, see for example the different modular API examples. In our experience, this is one of the most impressive features of Shogun, which none of the other toolboxes out there have. While this is great in principle, creating examples also creates a lot of overhead, since they have to be written explicitly in every target language. This is boring work and, as a result, we only have proper examples in Python.

    This project is about unifying the example generating process. We want to build a simple abstract language to write Shogun examples that can be automatically mapped to source code in the target languages. Here is an example of what we have in mind.

    Part 1: A Shogun example language
    The first step is to come up with an abstract language that is able to represent Shogun objects, basic data types, and method calls on Shogun objects. A purely sequential structure is sufficient: programs will just be ordered lists of statements. This makes it relatively easy to map these commands to the modular target languages. There are many issues to deal with here: syntactic structure, representation of programs, operators, etc., so this is a more theoretical part. A good start will be to just write programs in the language as we would like it to look, and then see whether it is possible to generalize from there. (A hypothetical sketch of such a program and one possible Python translation follows.)
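
    Purely as a hedged illustration, not a design decision (both the meta-syntax and the exact constructor signatures below are illustrative), a meta-program and the Python code a generator could emit might look like this:

      # Hypothetical meta-program (syntax invented for illustration only):
      #
      #   features    = RealFeatures(file="train.dat")
      #   labels      = BinaryLabels(file="train_labels.dat")
      #   kernel      = GaussianKernel(features, features, width=2.0)
      #   svm         = LibSVM(C=1.0, kernel=kernel, labels=labels)
      #   svm.train()
      #   predictions = svm.apply(features)
      #
      # One possible Python translation the generator could emit:
      from modshogun import RealFeatures, BinaryLabels, GaussianKernel, LibSVM, CSVFile

      features = RealFeatures(CSVFile("train.dat"))
      labels = BinaryLabels(CSVFile("train_labels.dat"))
      kernel = GaussianKernel(features, features, 2.0)
      svm = LibSVM(1.0, kernel, labels)
      svm.train()
      predictions = svm.apply(features)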

    Part 2: A Shogun example language “compiler”
    Once the language is designed and a proof of concept is working, the second part of the project is to create a generator for all of Shogun’s target languages. It will take a program in the Shogun language from part 1, and create the corresponding code in C++/Python/Java. We aim for a single generator which just uses a different dictionary for each target language. There also needs to be a check-mechanism that tests whether a given program is syntactically correct in the abstract Shogun language, which is probably the best way to start part 2.

    Part 3: Translation of API-examples
    Once we are able to translate any Shogun language program to an actual program that interfaces Shogun from a modular target language (this will probably take some iterations of parts 1 and 2), the goal is to go through all modular examples that Shogun offers and translate them to the created Shogun language. There are quite a few examples that will not be expressible in the new Shogun language. However, we aim to only keep API-illustration examples and to outsource everything that includes plots or more interesting use-cases to IPython notebooks. So some time will be spent on translating cool Python examples to IPython notebooks, which should be fun. C++ examples that cannot be expressed might be moved into unit tests. When this is done, we can remove all existing examples that are explicitly coded in the modular interfaces, and have them automatically generated in a unified way from the abstract Shogun example language.

    Part 4: Integration testing in all modular interfaces
    The final (and very important) step of the project is to extend the current way of integration testing. Currently, every python_modular example returns some Shogun objects that are then serialised, and compared to a file. We use this to monitor whether there are any changes in the objects/results (induced by “fixing” bugs) that might slip through our fingers otherwise. Note that those reference files can be de-serialised from *any* modular target language. Consequently, we would like to do integration tests for *all* modular target languages. This part is to write a framework that automatically does exactly this, which is straightforward once part 1-3 are done.

    Why is this cool?
    This is an extremely useful project for us: we currently do not have proper examples for most target languages - even though this is one of Shogun’s coolest features. This project solves this issue in an extremely elegant way. Second, detecting differences in algorithm results among the different target languages is very tough. With part 4 implemented, this will be done automatically for any new example added.

    You will play a role in making Shogun more popular and more stable - everyone will love you! :) In addition, since you will be touching all of Shogun’s examples, you will get in contact with loads of different ML algorithms and get a very nice overview of what is possible with Shogun. Finally, you will learn about core computer science from both a theoretical (designing a language) and a practical (implementing parsers) point of view.

  • Lobbying Shogun in MLPACK’s automatic benchmarking system (joint project)

    Mentors: Ryan Curtin and Marcus Edel (MLPACK authors), Soeren Sonnenburg (email: soeren.sonnenburg at shogun-toolbox.org, irc: sonney2k), Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS)
    Difficulty: Medium
    Requirements: basics in Python, ML, Shogun, MLPACK, web design
    Useful: buildbot, advanced ML, Shogun internal, d3js

    Last year, at the GSoC 2013 mentor summit, we met a cool guy named Ryan, who made a great idea happen with his student Marcus: A cross-toolbox benchmarking system. This is a website where multiple different ML toolboxes are compared to each other on various fixed datasets. Currently, the focus is on runtime and memory footprint of algorithms. The resulting summaries are very informative already. The MLPACK guys will try to push this further this year, see their description. We would like to participate in development of this project to make it even more useful. Our (Shogun-biased) vision is to:

    • Make the framework run live. Results should be updated nightly in an automated way that ensures the project stays green (i.e. doesn’t break due to API changes or similar). This is very important in our view, as it makes the framework maintainable for the future. We could for example add a slave in Shogun’s buildbot farm.
    • Extend the framework to compare the quality of the results. The first step here would be to simply compare the outputs of the algorithms. The second step would be to do a toolbox independent, fair evaluation of these results (i.e. accuracy on a held out test set, or even a full cross-validation).
    • Have a more intuitive visualisation of the results. It would be cool to use d3js here and have some eye candy. Web-designer skills are useful here, as it would be an option to detach the current result website from the MLPACK main page.
    • We want you to know a bit about ML since you will also need to write descriptions of parameters of the algorithms. Links to the implementation references for all toolboxes should be provided.
    • Make sure that Shogun is represented well in the results. This means that we want to make sure that a) our algorithms are used properly (sadly, it is possible to do things wrong with Shogun), and b) find out bottlenecks in cases where Shogun cannot produce results or is very slow (given similar results), and document those for us to fix.
    • A part of the project would also be to fix potential bottlenecks and/or crashes in Shogun to rank better (with the help of other GSoC students, if interested).
    • Add more different algorithms to the comparison (in particular those that are in Shogun).
    Why this is cool:
    Runtime is an important quality criterion, given that the results are comparable. We therefore think that it would be very useful to be able to compare results of various implementations of the same algorithm. This increases the quality of the diverse ecosystem of ML toolboxes out there and helps everyone: users will get a useful tool for choosing from lots of packages, and developers will be pointed to potential problems in their code. In the optimal case, this gives rise to a friendly competition between different toolboxes on producing better results faster. Having daily results makes sure this neat project survives over the years and might attract other projects to participate. We could even think about integrating it with MLOSS and using MLOSS datasets.

    Collaboration with MLPACK:
    This project would be under the main lead of the MLPACK guys, so you would talk to them a lot, trying to identify what is needed for the above extensions. We have the idea that you will give them a helping hand in pushing the infrastructure of their project (buildbot!) about 30% of the time, spend another 30% on polishing the visualisation and presentation of the results, and spend the remaining 30% on Shogun specific things, such as polishing the used Shogun code, fix possible bugs and add new algorithms from our toolbox. Have a look at the existing code.

     

  • Shogun Binary Packages

    Mentors: Viktor Gal (email: vigsterkr at gmail.com, irc wiking), Soeren Sonnenburg (email: soeren.sonnenburg AT shogun-toolbox.org)
    Difficulty: Medium
    Requirements: knowledge of OSX, win, linux, experience with packaging software
    Useful: compiling software on Windows/Mac (which is more complicated than on Linux), proper hacking skills :)

    Sometimes the attention Shogun gets evaporates because people really struggle to install it. Compiling code, changing cmake settings, library paths, etc. is not everyone’s thing. We therefore aim to offer an easy way to install Shogun on mainstream operating systems via binary packaging. An easy 3-minute installation has massive potential to increase the number of our users. In particular, we aim to automate the packaging process such that nightly binary packages are built by a buildslave. One big question here is how dependencies are handled: are our binaries self-contained (and on which systems), or do they dynamically link against others? This will obviously depend on the target system.

    There are various ideas:

    • Have a self-contained windows installer, similar to other open-source libraries available for windows. This one should easily play with existing python/java/octave/etc distributions. The windows version first needs to compile properly.
    • Have a self-contained mac installer. Same as above.
    • Both of the above packages should be as easy to install as the debian/fedora packages.
    • Have a fully-self-contained linux installation that does not depend on the installed system libraries (at least not on all). This is meant to complement our existing binary packages that have certain dependencies and are meant for users that have no control over installing software on their machine but still want to use Shogun locally.
    • It would also be great if Shogun was part of python installation tools such as easy_install or pip.
    • Based on our docker image, we would like to have an iso-like Shogun installation that one can boot up for example on an amazon cluster so that Shogun can be used from there without further hassle.
    • All major modular languages should be supported effortlessly. Changing path variables etc should also be done automated.
    • Write proper documentation and examples on how to compile and install Shogun, including things like dependencies, and provide commands to install all of them.
    There is already some preliminary work done for Debian, Fedora, and Docker. We also have a nightly dmg build for mac. Your input on this project is very appreciated, let us know your ideas for spreading the Shogun word!

    Why this is cool:
    The impact of this project on Shogun’s world dominance is massive - even without any machine learning! The optimal student will be able to use his existing packaging skills and knowledge of multiple operating systems to push the number of Shogun users significantly. Apart from everyone loving you, this project also offers a wide variety of interesting technical challenges. Ask Viktor, he has lots of ideas on this.

     

  • Shogun Missionary & Shogun in Education

    Mentors: Heiko Strathmann (email: heiko.strathmann at gmail.com, irc: HeikoS), Soeren Sonnenburg (email: sonne@debian.org, irc: sonney2k)
    Difficulty: Medium
    Requirements: Clear written English, verbal creativity, ability to explain technical topics, basic ML
    Useful: a broad knowledge of ML algorithms, IPython (notebook) & LaTeX, Shogun’s examples & web-demo framework

    You will be surprised how many things Shogun can do, how many ML algorithms are implemented under the hood, and how fast they are. Problem is: nobody knows that they exist, or how to use them. This project is about telling the world how great Shogun is.

    We have already made some progress in communicating Shogun’s abilities to the outside world. Since last year’s GSoC, students are required to write detailed notebooks on their projects, and a student developed a quite cool framework for web-based demos. This year, in this rather unusual project, we are looking for an extremely motivated student with an affinity for writing about ML and the ability to code up nice eye-candy visualisations. If you are very good at explaining ML to others, both the intuition and the technical aspects, and keen on demonstrating ML algorithms on cool datasets, then this project might be for you. There are two main deliverables, notebooks and web-demos:

    • A handful of IPython notebooks about the most fundamental/popular ML algorithms in Shogun. We are aiming for high quality here: a clear narrative thread including introduction, maths (LaTeX), visualisation, and real-life data; see the Gaussian processes notebook to get an idea.
    • An extension of the existing web-demo examples. These should be more explanatory, visually appealing, and allow for more flexibility. For example, the GP-demo should allow for different likelihoods (regression+classification), plot the predictive distribution as a heatmap, allow to learn the optimal parameters via ML2, and allow to change the GP covariance/kernel.
    • Both of those parts should blow people’s minds! We are looking for someone with a talent for producing impressive things, both in terms of language and visualisation.
    Why is this cool?
    This project will massively boost Shogun’s acceptance in the world. First of all, a large number of algorithms in Shogun (such as MKL and most SVMs) are not covered at all by a demo that goes beyond simple API demonstration. Making them visible will attract people. Second, these examples are extremely useful for promoting Shogun to people at conferences, workshops, etc., for example to get funding from industry.
