The summer came finally to an end and (yes in Berlin we still had 20 C end of October), unfortunately, so did GSoC with it. This has been the second time for SHOGUN to be in GSoC. For those unfamiliar with SHOGUN - it is a very versatile machine learning toolbox that enables unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. I again played the role of an org admin and co-mentor this year and would like to take the opportunity to summarize enhancements to the toolbox and my GSoC experience: In contrast to last year, we required code-contributions in the application phase of GSoC already, i.e., a (small) patch was mandatory for your application to be considered. This reduced the number of applications we received: 48 proposals from 38 students instead of 70 proposals from about 60 students last year but also increased the overall quality of the applications.
In the end we were very happy to get 8 very talented students and have the opportunity of boosting the project thanks to their hard and awesome work. Thanks to google for sponsoring three more students compared to last GSoC. Still we gave one slot back to the pool for good to the octave project (They used it very wisely and octave will have a just-in-time compiler now, which will benefit us all!).
SHOGUN 2.0.0 is the new release of the toolbox including of course all the new features that the students have implemented in their projects. On the one hand, modules that were already in SHOGUN have been extended or improved. For example, Jacob Walker has implemented Gaussian Processes (GPs) improving the usability of SHOGUN for regression problems. A framework for multiclass learning by Chiyuan Zhang including state-of-the-art methods in this area such as Error-Correcting Output Coding (ECOC) and ShareBoost, among others. In addition, Evgeniy Andreev has made very important improvements w.r.t. the accessibility of SHOGUN. Thanks to his work with SWIG director classes, now it is possible to use python for prototyping and make use of that code with the same flexibility as if it had been written in the C++ core of the project. On the other hand, completely new frameworks and other functionalities have been added to the project as well. This is the case of multitask learning and domain adaptation algorithms written by Sergey Lisitsyn and the kernel two-sample or dependence test by Heiko Strathmann. Viktor Gal has introduced latent SVMs to SHOGUN and, finally, two students have worked in the new structured output learning framework. Fernando Iglesias made the design of this framework introducing the structured output machines into SHOGUN while Michal Uricar has implemented several bundle methods to solve the optimization problem of the structured output SVM.
It has been very fun and interesting how the work done in different projects has been put together very early, even during the GSoC period. Only to show an example of this dealing with the generic structured output framework and the improvements in the accessibility. It is possible to make use of the SWIG directors to implement the application specific mechanisms of a structured learning problem instance in python and then use the rest of the framework (written in C++) to solve this new problem.
Students! You all did a great job and I am more than amazed what you all have achieved. Thank you very much and I hope some of you will stick around.
Besides all these improvements it has been particularly challenging for me as org admin to scale the project. While I could still be deeply involved in each and every part of the project last GSoC, this was no longer possible this year. Learning to trust that your mentors are doing the job is something that didn't come easy to me. Having had about monthly all-hands meetings did help and so did monitoring the happiness of the students. I am glad that it all worked out nicely this year too. Again, I would like to mention that SHOGUN improved a lot code-base/code-quality wise. Students gave very constructive feedback about our (lack) of proper Vector/Matrix/String/Sparse Matrix types. We now have all these implemented doing automagic memory garbage collection behind scenes. We have started to transition to use Eigen3 as our matrix library of choice, which made quite a number of algorithms much easier to implement. We generalized the Label framework (CLabels) to be tractable for not just classification and regression but multitask and structured output learning.
Finally, we have had quite a number of infrastructure improvements. Thanks to GSoC money we have a dedicated server for running the buildbot/buildslaves and website. The ML Group at TU Berlin does sponsor virtual machines for building SHOGUN on Debian and Cygwin. Viktor Gal stepped up providing buildslaves for Ubuntu and FreeBSD. Gunnar Raetschs group is supporting redhat based build tests. We have Travis CI running testing pull requests for breakage even before merges. Code quality is now monitored utilizing LLVMs scan-build. Bernard Hernandez appeared and wrote a fancy new website for SHOGUN.
A more detailed description of the achievements of each of the students follows:
Heiko Strathmann, mentored by Arthur Gretton, worked on a framework for kernel-based statistical hypothesis testing. Statistical tests to determine whether two random variables are are equal/different or are statistically independent are an important tool in data-analysis. However, when data are high-dimensional or in non-numerical form (documents, graphs), classical methods fail. Heiko implemented recently developed kernel-based generalisations of classical tests which overcome these issues by representing distributions in high dimensional so-called reproducing kernel Hilbert spaces. By doing so, theoretically any pair samples can be distinguished.
Implemented methods include two-sample testing with the Maximum Mean Discrepancy (MMD) and independence testing using the Hilbert Schmidt Independence Criterion (HSIC). Both methods come in different flavours regarding computational costs and test constructions. For two-sample testing with the MMD, a linear time streaming version is included that can handle arbitrary amounts of data. All methods are integrated into a newly written flexible framework for statistical testing which will be extended in the future. A book-style tutorial with descriptions of algorithms and instructions how to use them is also included.
Like Heiko, Sergey Lisitsyn did participate in the GSoC programme for the second time. This year he focused on implementing multitask learning algorithms. Multitask learning is a modern approach to machine learning that learns a problem together with other related problems at the same time using a shared representation. This approach often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks. During the summer Sergey has ported a few algorithms from the SLEP and MALSAR libraries with further extensions and improvements. Namely, L12 group tree and L1q group multitask logistic regression and least squares regression, trace-norm multitask logistic regression, clustered multitask logistic regression, basic group and group tree lasso logistic regression. All the implemented algorithms use COFFIN framework for flexible and efficient learning and some of the algorithms were implemented efficiently utilizing the Eigen3 library.
A generic latent SVM and additionally a latent structured output SVM has been implemented. This machine learning algorithm is widely used in computer vision, namely in object detection. Other useful application fields are: motif finding in DNA sequences, noun phrase coreference, i.e. provide a clustering of the nouns such that each cluster refers to a single object.
It is based on defining a general latent feature Psi(x,y,h) depending on input variable x, output variable y and latent variable h. Deriving a class from the base class allows the user to implement additional structural knowledge for efficient maximization of the latent variable or alternative ways of computation or on-the-fly loading of latent features Psi as a function of the input, output and latent variables.
We have implemented two generic solvers for supervised learning of structured output (SO) classifiers. First, we implemented the current state-of-the-art Bundle Method for Regularized Risk Minimization (BMRM) [Teo et al. 2010]. Second, we implemented a novel variant of the classical Bundle Method (BM) [Lamarechal 1978] which achieves the same precise solution as the BMRM but in time up to an order of magnitude shorter [Uricar et al. 2012].
Among the main practical benefits of the implemented solvers belong their modularity and proven convergence guarantees. For training a particular SO classifier it suffices to provide the solver with a routine evaluating the application specific risk function. This feature is invaluable for designers who can concentrate on tuning the classification model instead of spending time on developing new optimization solvers. The convergence guarantees remove the uncertainty inherent in use of on-line approximate solvers.
The implemented solvers have been integrated to the structured output framework of the SHOGUN toolbox and they have been tested on real-life data.
During GSoC 2012 Fernando implemented a generic framework for structured output (SO) problems. Structured output learning deals with problems where the prediction made is represented by an object with complex structure, e.g. a graph, a tree or a sequence. SHOGUN's SO framework is flexible and easy to extend . Fernando implemented a naïve cutting plane algorithm for the SVM approach to SO . In addition, he coded a case of use of the framework for labelled sequence learning, the so-called Hidden Markov SVM. The HM-SVM can be applied to solve problems in different fields such as gene prediction in bioinformatics or speech to text in pattern recognition.
During the latest google summer of code Evgeniy has improved the Python modular interface. He has added new SWIG-based feature - director classes, enabling users to extend SHOGUN classes with python code and made SHOGUN python 3 ready. Evgeniy has also added python's protocols for most usable arrays (like vectors, matrices, features) which makes possible to work with Shogun data structures just like with numpy arrays with no copy at all. For example one can now modify SHOGUN's RealFeatures in place or use.
Jacob implemented Gaussian Process Regression in the Shogun Machine Learning Toolbox. He wrote a complete implementation of basic GPR as well as approximation methods such as the FITC (First Independent Training Conditional) method and the Laplacian approximation method. Users can utilize this feature to analyze large scale data in a variety of fields. Scientists can also build on this implementation to conduct research extending the capability and applicability of Gaussian Process Regression.
Chiyuan defined a new multiclass learning framework within SHOGUN. He re-organized previous multiclass learning components and refactored the CMachine hierarchy in SHOGUN. Then he generalized the existing one-vs-one and one-vs-rest multiclass learning scheme to general ECOC encoding and decoding strategies. Beside this, several specific multiclass learning algorithms are added, including ShareBoost with feature selection ability, Conditional Probability Tree with online learning ability, and Relaxed Tree.