# April 18, 2014 12:10 AM

#### Opaque Pointers Revisited

An opaque pointer (aka d-pointer or pimpl) is a great C++ design pattern, useful for prolonged binary interface compatibility, a properly hidden implementation, and faster compilation. However, it has an inherent performance drawback, which can become critical if you care about efficiency. In this post I propose an approach that makes d-pointers less binary compatible but sweeps away their inefficiency.
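For readers new to the pattern, here is a minimal pimpl sketch (the class and member names are invented for illustration, not taken from the post):

```cpp
// widget.h -- public header: no implementation details leak out
#include <memory>

class Widget {
public:
    Widget();
    ~Widget();                 // must be defined where Impl is complete
    int value() const;
private:
    struct Impl;               // forward-declared, hidden in the .cpp
    std::unique_ptr<Impl> d;   // the d-pointer: one extra indirection per access
};

// widget.cpp -- private implementation, free to change without breaking ABI
struct Widget::Impl {
    int value = 42;
};

Widget::Widget() : d(new Impl) {}
Widget::~Widget() = default;   // Impl is complete here, so unique_ptr can delete it
int Widget::value() const { return d->value; }  // the indirection the post refers to
```

The extra pointer dereference (and heap allocation) on every access is exactly the performance cost the post discusses.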

# April 08, 2014 06:53 PM

#### Actions With RAII

You may know RAII as a cool idiom that makes it pretty easy to handle resources finalization automatically. When used properly, it reduces LoC, helps to avoid bugs and gives more safety for free. This makes RAII an important part of modern C++.
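As a minimal illustration of the idiom, here is a hypothetical file wrapper (a sketch, not code from the post):

```cpp
#include <cstdio>
#include <stdexcept>

// RAII: the constructor acquires the resource, the destructor releases it --
// automatically, even if an exception is thrown mid-function.
class File {
public:
    explicit File(const char* path, const char* mode = "r")
        : f_(std::fopen(path, mode)) {
        if (!f_) throw std::runtime_error("cannot open file");
    }
    ~File() { if (f_) std::fclose(f_); }  // finalization happens automatically
    File(const File&) = delete;           // non-copyable: exactly one owner
    File& operator=(const File&) = delete;
    std::FILE* get() const { return f_; }
private:
    std::FILE* f_;
};

void use() {
    File f("data.txt");   // acquired here...
    // ... work with f.get(), return early, or throw ...
}                         // ...released here, no explicit cleanup needed
```

Every early return or exception path through `use()` closes the file, which is the "safety for free" mentioned above.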

# March 30, 2014 11:16 AM

#### Reproducibility is not simple

There has been a flurry of articles recently outlining 10 simple rules for X, where X has something to do with data science, computational research and reproducibility. Some examples are:

## Best practices

These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:

##### Use version control

Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems, git or mercurial currently being the most popular, and get an account on github or bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. Github is the most popular for open source projects, but bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you get the premium features for free too.

Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically typing in commands. There are numerous tutorials out there; two that I personally found entertaining are git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model, and using the gitflow convention.

##### Open source your code and scripts

Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as ipython notebooks and knitr are examples of easy to implement literate programming frameworks that allow you to make your supplement a live document.

It is often useful to try to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed), and code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low memory logistic regression training and testing code. An example of the latter is code to generate your plots. Make both types of code open, document and test them well.

##### Make your data a resource

Your results are also data. When open data is mentioned, most people immediately conjure images of the inputs to prediction machines, but the intermediate stages of your workflow are often left out when making things available. For example, if in addition to providing the two lines of code for plotting you also provided the multidimensional array containing your results, your paper becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people could easily try out new kernel methods without having to go through the effort of computing the kernels.

Efforts such as mldata.org and mlcomp.org provide useful resources to host machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.

## Challenges to open science

While the articles call these rules "simple", they are by no means easy to implement. Though easy to state, there are many practical hurdles to making every step of your research reproducible.

##### Social coding

Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors that can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.

While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.

##### The secret branch

When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies for data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. Github allows forks of repositories, which you may be able to make private.

Once a researcher gets fully involved in an application area, it is inevitable that he starts working on the latest data generated by his collaborators. This could be the real time stream from Twitter or the latest double blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts to allow bona fide researchers access to data, such as dbGaP. It seamlessly provides a resource for public and private data. Instead of a hurdle, a convenient mechanism to facilitate the transition from private to open science would encourage many new participants.

What is the right access control model for open science?

##### Data is valuable

It is a natural human tendency to protect a scarce resource that gives you a competitive advantage. For researchers, these resources include source code and data. While it is understandable that the authors of software or the architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.

# March 28, 2014 06:59 AM

#### Any Struggles

In C++ you can’t just forget about types. Each variable has its own type, which cannot be changed by any means. But what if you really need something heterogeneous? There is a well-known idiom called Any that enables you to erase the type and recall it later.
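A stripped-down sketch of the idea (far less complete than boost::any; copying is omitted for brevity):

```cpp
#include <typeinfo>
#include <utility>

// Minimal type erasure: a non-template handle owning a templated holder.
class Any {
    struct Base {
        virtual ~Base() {}
        virtual const std::type_info& type() const = 0;
    };
    template <typename T>
    struct Holder : Base {
        explicit Holder(T v) : value(std::move(v)) {}
        const std::type_info& type() const override { return typeid(T); }
        T value;
    };
    Base* p_;
public:
    template <typename T>
    Any(T v) : p_(new Holder<T>(std::move(v))) {}   // erase the type here
    ~Any() { delete p_; }
    Any(const Any&) = delete;
    Any& operator=(const Any&) = delete;
    template <typename T>
    T& cast() {                                     // ...and recall it later
        if (p_->type() != typeid(T)) throw std::bad_cast();
        return static_cast<Holder<T>*>(p_)->value;
    }
};
```

Usage: `Any a(5); a.cast<int>()` returns the stored value, while `a.cast<double>()` throws, since the recorded `type_info` does not match.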

# March 25, 2014 11:33 PM

#### I Like Intractable Likelihoods

Last week, I went to the i-like workshop at Oxford University. Pretty cool! All of Britain's statisticians were there and I met many of them for the first time. Check out my two posters (Russian Roulette, Kernel Adaptive Metropolis Hastings). The talks were amazing - as at the last NIPS, the main trend is estimating likelihoods (well, that's the name of the program), either using some other random process such as importance sampling a latent model's marginal likelihood (aka Pseudo-Marginal MCMC), or directly sub-sampling likelihoods or gradients.

These things are important in Machine Learning too, and it is very nice to see the fields growing together (even though there was a talk by a statistician who spent lots of time re-inventing belief propagation and junction tree ideas - always such a pity if this happens simply because communities do not talk to each other enough). Three talks that I found really interesting:

Remi Bardenet talked about sub-sampling approaches to speed up MCMC. This is quite related to the Austerity in MCMC land paper by Welling & Co, with the difference that his tests do not suffer from a small number of points in the hypothesis test to decide accept/reject.

Chris Sherlock talked about optimal rates and scaling for Pseudo-Marginal MCMC. There are finally some nice heuristics for scaling PM estimates in a way that the number of iid samples per computation time is optimal. Interestingly, the acceptance rate and the variance of the likelihood estimate can be tweaked separately.

Jim Griffin gave a very interesting talk on adaptive MCMC on discrete, in particular binary, state-spaces - he used them for feature selection (in ML language). His algorithm automatically learns global mutation rates for each of the positions. However, it doesn't take any correlations between the features into account. This might be a very interesting application for our fancy Kameleon sampler (arxiv, code) - I'm thinking about this!

Finally, I presented two posters: the one on Playing Russian Roulette with Intractable Likelihoods that I already presented in Reykjavik, and (with Dino) a new poster (link) on the Kernel Adaptive Metropolis Hastings sampler, Kameleon, that I mentioned above. The corresponding paper will hopefully be published very soon. Talking to other scientists about my own work is just great!

# March 14, 2014 09:46 PM

#### GSoC’13 – Implement estimators of large-scale sparse Gaussian densities

I have submitted the proposal (it can be found here). I’m having a much clearer view of the tasks that have to be done. I’ll be working on two other entrance issues before the official coding period begins, namely the graph coloring issue and the elliptic curve functions issue.

In my understanding, the whole work can be summarized as follows:

1. use greedy graph coloring of the precision matrix (power of the precision matrix?) to find a set of probing vectors, $\{v_{j}\}$ (a task for the project includes integrating an existing library. I checked out a few libraries and it seems like Joe Culberson’s Graph Coloring code is a really good candidate. It's written in C and provides two greedy methods for vertex coloring. More tests for our specific needs have to be done)
2. for each probing vector, we need to compute $v^{T}\log(Q)v$, $Q$ being the precision matrix and $\log(Q)$ the matrix logarithm
3. for computing $\log(Q)v$ in the above expression, Cauchy’s integral formula (coming from complex analysis) for matrix functions is used, which can be discretized, giving a rational approximation of $\log(Q)v$ that involves solving $N$ different systems of linear equations. Solving these systems involves invoking a preconditioned CG solver (in the project, we’ll be integrating this from an existing library)
4. the systems have complex coefficients (coming from the conformal mapping needed for the quadrature rule of the above integration), which are given by an existing algorithm in Hale et al. The precision of this approximation is bounded by a theorem, which depends on the extremal eigenvalues. We can calculate the number $N$ for a desired accuracy using this theorem. The algorithm (for finding the complex integration weights and shifts) needs Jacobi elliptic functions (in Driscoll’s SC-toolbox, ellipkkp and ellipjc) and the extremal eigenvalues. (task for the project is to integrate Krylstat’s implementation of this)
5. combining everything for the expression of the log-determinant involves writing a class which will combine all the subtasks. I had a discussion with Heiko on this. While initially we’ll use this to compute everything on one computer, later plans include replacing this so that an OpenMPI program can run the subtasks on a cluster with low communication cost and sped-up execution. Will discuss more details later on
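If I read the steps above correctly, they combine into roughly this estimator (a sketch; the exact weights $w_i$ and shifts $\sigma_i$ come from the conformal map in Hale et al., and the symbols here are my own notation):

$$\log\det(Q) = \operatorname{tr}\,\log(Q) \approx \sum_{j} v_j^{T}\,\log(Q)\,v_j, \qquad \log(Q)\,v_j \approx \sum_{i=1}^{N} w_i\,(Q - \sigma_i I)^{-1} v_j,$$

where each shifted system $(Q - \sigma_i I)\,x = v_j$ is one of the $N$ complex linear systems solved with the preconditioned CG solver from step 3, and $N$ is chosen from the extremal eigenvalues via the theorem in step 4.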

The main paper describes several other techniques for handling special cases (like, when the matrix is ill-conditioned, etc). I’ll read up about these in more detail and update this page later.

# March 02, 2014 11:46 PM

#### Weekend Project – Install SteamOS

This weekend I championed my way through installing SteamOS (the Debian distro by Valve that will be installed on the upcoming Steam boxes). I had to do some pretty crazy stuff to get it working, including dropping out of the automated install to manually inject grub-pc and then compiling the drivers for my wireless card. All in all it was a triumph!

and then finally:

This was an early beta release, but they made some weird choices – like handicapping the basic Debian installer by fully automating it and only supporting EFI. I was actually a bit disappointed when I finally finished, because the end result is not really different from simply installing Ubuntu and setting Steam big picture mode to auto start; I am not sure what exactly I was expecting, though. SteamOS is much more for OEMs than the DIY crowd at the moment, but I can see that Valve is super invested in Linux at this point, with a ton of additions to their own repositories. Good things are going to come of this, I can feel it!

* Edit *

Almost all the hacking I had to do has been wrapped in Ye Olde SteamOSe

* Edit 2 *

Wow, Valve released an updated version of the beta addressing a lot of the problems Ye Olde SteamOSe fixed, and they allegedly collaborated to get this done! This is why Valve is going to win the next generation – by working with the community. Full story here

# March 02, 2014 11:19 PM

I haven’t had time for much machine learning on the side, as my new job and learning more about web development have been keeping me pretty busy, so I decided to help out the Shogun team by doing some clean-up on the website. The whole thing actually came about because I was trying to find the current status of the Shogun build to send to someone on the mailing list and I couldn’t find it on the old site; needless to say, it was tough to navigate.

My main goal with the re-design was to “flatten” the navigation into a navbar rather than the old system of nested navigation links (as an aside, the way this worked was pretty poor: we actually queried a large portion of the database on every page load just to auto-generate this nested navigation). The project ended up being mostly front-end, which isn’t really my thing, but it's cool because I got to learn some new stuff. I also got to write some Django “migrations” and maintenance tasks, so there was a bit of backend too. I think the re-design was a success and the site is definitely easier to navigate around (as a bonus, we no longer have to query the whole db on each page, wooo! – the solution: put the navbar content in the db too!).

You can see the new site live where it's always been: http://www.shogun-toolbox.org/

# March 01, 2014 08:29 PM

#### GSoC Interview with Sergey and me

Sergey and I gave an interview on Shogun and Google Summer of Code. Here it is:

The internet. More specifically #shogun on irc.freenode.net. Wasn’t IRC that thing that our big brothers used as a socialising substitute when they were teenagers back in the 90s? Anyways. We are talking to two of the hottest upcoming figures in machine learning open-source software, the Russian software entrepreneur Sergey Lisitsyn, and the big German machine Heiko Strathmann.

Hi guys, glad to meet you. Would you mind introducing yourself?

Sergey (S): Hey, I am Sergey. If you ask me what I do apart from Shogun - I am currently working as a software engineer and finishing my Master’s studies at Samara State Aerospace University. I joined Shogun in 2011 as a student and now I am doing my best to help the guys from the Shogun team keep up with GSoC 2014.

Heiko (H): Hej, my name is Heiko. I do a Phd in Neuroscience & Machine Learning at the Gatsby Institute in London and joined Shogun three years ago during GSoC. I love open-source since my days in school.

Your project, Shogun, is about Machine Learning. That sounds scary and sexy, but what is it really?

H: My grandmother recently sent me an email asking about this ‘maschinelles Lernen’. I replied it is the art of finding structure in data in an automated way. She replied: Since when are you an artist? And what is this “data”? I showed her the movie PI by Darren Aronofsky where the main character at some point is able to predict stock prices after realising “the pattern”, and said that’s what we want to do with a computer. Since then, she is worried about me because the guy puts a drill into his head in the end….. Another cool application is for example to model brain patterns to allow people to learn how to use a prosthesis faster.

S: Or have you seen your iPhone detects faces? That’s just a Support Vector Machine (SVM). It employs kernels which are inner products of non-linear mappings of Haar features into a reproducing kernel Hilbert Space so that it minimizes ….

Yeah, okok... What is the history of Shogun in the GSoC?

S: The project got started by Sören in his student days around 15 years ago. It was a research only tool for a couple of years before being made public. Over the years, more and more people joined, but the biggest boost came from GSoC...

H: We just got accepted into our 4th year in that program. We had 5+8+8 students so far, who all successfully did the program with us. Wow, I guess that’s a few million dollars. (EDITOR: actually 105,000$.) GSoC students forced Shogun to grow up in many ways: github, a farm of buildbots, proper unit-testing, a cloud-service, web-demos, etc. were all set up by students. Also, the diversity of algorithms from the latest research increased a lot. From the GSoC money, we were able to fund our first Shogun workshop in Berlin last summer.

How did you two get into Shogun and GSoC? Did the money play a role?

H: I was doing my undergraduate project back in 2010, which actually involved kernel SVMs, and used Shogun. I thought it would be a nice idea to put my ideas into it -- also I was lonely coding just on my own. In 2010, Shogun was rejected from GSoC, but I eventually implemented my ideas in 2011. The money was very useful to me as I was planning to move to London soon. Being totally broke in that city one year later, I actually paid my rent from my second participation’s stipend - which I got for implementing ideas from my Master’s project at uni. Since 2013, I mentor other students and help organising the project. I think I would have stayed around without the money, but it would have been a bit tougher.

S: We were having a really hard winter in Russia. While I was walking my bear and clearing the roof of the snow, I realised I forgot to turn off my nuclear missile system…..

H: Tales!

S: Okay, so on another cold night I noticed a message about GSoC somewhere and I just glanced over the list of accepted organizations. Shogun’s description was quite interesting, so I joined a chat and started talking to people - the whole thing was breathtaking for me. As for the money - well, I was a student and was about to start my first part-time job as a developer - it was like a present for me, but it didn’t play the main role!
H: To make it short: Sergey suddenly appeared and rocked the house, coding at lightspeed, drinking vodka.

But now you are not paid anymore, while still spending a lot of time on the project. What motivates you to do this?

S: It just involves you, and you feel like you participate in something useful. That kind of appreciation is important!

H: Mentoring students is very rewarding indeed! Some of those guys are insanely motivated and talented. It is very nice to interact with the community, with people from all over the world sharing the same interest. Trying to be a scientist, GSoC is also very useful in producing tools that I or my colleagues need, but that nobody has the time to build properly. You see, there are all sorts of synergetic effects between GSoC and my day-job at university, such as meeting new people or getting a job since you know how to code in a team.

How does this work? Did you ever publish papers based on GSoC work?

S: Yeah, I actually published a paper based on my GSoC 2011 work. It is called ‘Tapkee: An Efficient Dimension Reduction Library’ and was recently published in the Journal of Machine Learning Research. We started writing it up with my mentor Christian (Widmer) and later Fernando (Iglesias) joined our efforts. It took an enormous amount of time but we did it! Tapkee, by the way, is a Russian word for slippers.

H: I worked on a project on statistical simulation of global ozone data last year. The code is mainly based on the project of one of my last year’s students - a very clever and productive guy from Mumbai who I would never have met without the program, see http://www.ucl.ac.uk/roulette/ozoneexample

So you came all the way from being a student with GSoC up to being an organisation admin. How does the perspective change along this path?

H: I first had too much time, so I coded open-source; then too little money, so I coded open-source; then too much work, so I mentor people coding it open-source.
At some point I realised I like this stuff so much that I would like to help organise Shogun and bring together the students and scientists involved. It is great to give back to the community, which played a major role for me in my studies. It is also sometimes quite amusing to get those emails from students applying, worried about the same unimportant things that I worried about back then.

S: It seems to be quite natural actually. You could even miss the point when things change and you become a mentor. Once you are in the game, things go pretty fast. Especially if you have a full-time job and studies!

Are there any (forbidden) substances that you exploit to keep up with the workload?

S: It may sound strange, but I am not addicted to vodka. Although I bet Heiko is addicted to beer and sausages.

H: Coffeecoffeecoffeee…… Well, to be honest, GSoC definitely reduces your sleep, no matter whether you are a student, mentor, or admin. By the way, our 3.0 release was labelled: Powered by Vodka, Mate, and beer.

Do you crazy nerds actually ever go away from your computers?

H: No.

S: Once we all met at our workshop in Berlin - but we weren’t really away from our computers. Why on earth would we do that?

Any tips for upcoming members of the open-source community? For students? Mentors? Admins?

H: Students: Do GSoC! You will learn a lot. Mentors: Do GSoC! You will get a lot. Admins/Mentors: Don’t do GSoC, it ruins your health. Rather collect stamps!

S: He is kidding. (whispers: “we need this … come on … just be nice to them”)

H: Okay, to be honest: just have fun with what you are doing!

Due to the missing interest in the community, Sergey and Heiko interviewed themselves on their own.

GSoC 2013 blog: http://herrstrathmann.de/shogun-blog/110-shogun-3-0.html
GSoC 2014 ideas: http://www.shogun-toolbox.org/page/Events/gsoc2014_ideas
Sergey: http://cv.lisitsyn.me/

# March 01, 2014 01:42 AM

Yeah!
Shogun this week got accepted as an organisation participating in the 10th Google Summer of Code. This year, besides mentoring a few projects, I am one of the three project administrators. I am curious how this will be. One of the first things to do was to write the application for Shogun - I'm glad it worked! I will also spend a little more time organising things. Apart from trying to find mentors (which requires a lot of talking people into it), I also want to make sure Shogun (and the students) get more out of the program. Last year, I pushed the team to ask all students:

- to write a project report in the form of IPython notebooks (link). These are absolutely great for talking about the GSoC work, impressing people, and giving the students a final piece of work to show.
- to fully unit-test every module of their algorithm/framework. This is absolutely essential in order not to lose the student's work a few years later, when a re-factoring change breaks their code and nobody knows how to fix it. Those tests have already saved lots of lives since last year.
- to peer-review each other in pairs of students. This improved documentation here and there and solved some bugs. I want to emphasise this more this year, as I think it is a great way of enabling synergistic effects between students.

In addition, we will again screen all the applicants via a set of entrance tasks on our github page (link). I just wrote a large number of such smaller or larger tasks that get students started on a particular project, fix bugs in Shogun, or prepare some larger change. In order to get the students started a bit more easily (contributing to Shogun these days is a non-trivial task), I wrote a little how-to (link) that is supposed to point out our expectations and the first steps towards participating in GSoC. Finally, I wrote descriptions for quite a few possible projects, some of them with a number of interesting co-mentors. The full list is here (link).
If you are a talented student interested in any of those topics, consider working with us during the summer. It's usually very fun!

- Variational Learning for Recommendation with Big Data. With Emtiyaz Khan, who I met at last year's workshop for latent Gaussian models. Matrix factorisation and Gaussian Processes, ultra-cool project.
- Generic Framework for Markov Chain Monte Carlo Algorithms and Stan Interface. With Theo Papamarkou, who I know from my time at UCL Statistics. It's about a modular representation of MCMC within Shogun and a possible interface to STAN for the actual sampling. This would be a major step for Shogun towards probabilistic models.
- Testing and Measuring Variable Interactions With Kernels. With Dino, who is a post-doc at Gatsby and co-author of our optimal kernel for MMD paper. This project is to implement all kernel-based interaction measures in Shogun in a unified way. We'll probably use this for research later.
- A Meta-Language for Shogun examples. With Sören. Write an example once, press a button to generate it in any modular language binding. This would be so useful to have in Shogun!
- Lobbying Shogun in MLPACK’s automatic benchmarking system. Joint project with Ryan from MLPACK. He already can compare the speed of different toolboxes. Now let's compare results.
- Shogun Missionary & Shogun in Education. With Sören. Write high-quality notebooks and eye-candy examples. A very different project, as this is about creative technical writing and illustrating methods on cool data rather than hacking new algorithms. I would be very excited if this happened!

Some of the other projects involve cool buzzwords such as Deep Learning, Structured Output, Kernels, Dual solvers, Cluster backends, etc. Join us!
:)

# February 25, 2014 02:43 AM

#### An Excellent Computer Vision Book: Mastering OpenCV with Practical Computer Vision Projects

I’ve talked about this book before – one of the chapters inspired my vision system for last year's FRC competition (you can read it here). I think it’s an excellent book and Packt just sent me an ebook version to do a proper review. Of all the book reviews I’ve done, this is the most genuine because I actually found this book useful for work I was doing. They found the right level of being technically interesting, robust and substantial, all the while without being too daunting. The source code that accompanies this book is great and I still check back to it when starting new projects (https://github.com/MasteringOpenCV/code). It's well written and won’t take long to read through – I think it is a worthwhile read for anyone doing computer vision work. Program design isn’t something a lot of computer vision researchers / developers think or talk about much (at least in my experience), so seeing how others lay out the problem can really help with your own work.

You can grab a copy here: http://bit.ly/1jJN5uL

# February 17, 2014 10:30 PM

#### Gmail_ToDo!!

Do you use your inbox as a running ToDo list? Do you send yourself emails so you remember to do certain things? Do you spend most of your day in a terminal? If you answered yes to all 3 of those questions, then you might be interested in this nifty little ruby gem I just released. It's called gmail_todo and it's made for quickly emailing yourself a ToDo note from the command line – think of the precious seconds you’ll save! Now when you remember something you need to do and you are in a terminal, rather than alt-tabbing (or god forbid reaching for the mouse) you can quickly type ‘todo "get milk"’ and voila, an email will appear in your inbox. As an added bonus my gem prepends “[ToDo]” to your subject for easy filtering!
Check it out on Github and RubyGems:
https://github.com/pickle27/gmail_todo
http://rubygems.org/gems/gmail_todo

# January 28, 2014 10:51 PM

#### Book Review: Android Application Programming with OpenCV

I was asked to do another book review for Packt Publishing, which I announced a little while ago. I’ve been pretty busy with my new job (which is awesome! I’ll probably make a post about it and all the stuff I’ve learned soon!) but I eventually finished giving the book a quick read through. It was a good book! I wish I’d had the time to follow along and make my own app, but unfortunately I didn’t. The book has lots of detail and screenshots to guide you through making your own app, which is helpful when working with more graphical-type things like Eclipse. I think this book would save a developer looking to get started with Android and OpenCV a lot of time, and remember, people: time is money (money you could spend on this book). Reading the book really made me want to start an Android OpenCV project; I think the mobility of the platform makes it so much more fun! To think all my computer vision projects have been tied to a desktop or at best a laptop until now is such a shame.

I think I like this model of offering premium lessons and documentation at a price, and as long as it stays clearly on the premium side and not a "pay to even get docs" situation, I think it’s a good way to get some money back into Open Source and the people who make it happen. Packt does it right too, paying some of the highest royalties of any publisher of IT books.

tl;dr If you’re planning on doing some OpenCV Android development, buy this book – it will undoubtedly pay for itself in time savings.

You can get the book here: bit.ly/1fjbs0t

# January 28, 2014 10:49 PM

This one is called “Android Application Programming with OpenCV” – it sounds interesting, and I’ll be posting a full review once I’ve had time to read through it!
In the meantime you can get the book here: bit.ly/1fjbs0t

# January 28, 2014 10:49 PM

This is the first time I’ve felt compelled to write a real video game review/critique. Why is this? Because I’ve been a fan of RE since I was a kid and it's one of my favourite franchises. While I will say I liked RE6, the game left me very frustrated and I need to talk about it.

RE is increasingly suffering from an identity crisis: is it a survival game? Is it an action game? It’s getting more cinematic, but at the same time someone is pushing to add more gamification. These goals are kind of at odds; you can’t just smash them all together and expect it to work.

##### Action vs. Survival

I think going for a bit more action than the early games is fine, and in my opinion you had the right balance in RE4; since then things have gotten weird. Melee seems to have gotten way more important, to the extent that I’m using it when it doesn’t even make sense (this is detracting from the cinematic feel too). The mutations have also got a bit out of hand, and I feel like I’m being penalized for headshots – shooting zombies in the face is one of the reasons I play RE. Most importantly, the ammo conservation system is now broken: in RE4 I felt in control of my ammo supply and I could conserve in easy situations; in RE6 I felt like I was always scrambling, but not in a good way. Finally, when you gave enemies real guns it kind of tossed the ammo conservation aspect out the window – some guy is shooting at me but I have no ammo, so I’m gonna run and punch him in the face? What? (Not very cinematic either..)

##### Cinematic vs. Gamification

The gamification is at odds with the cinematic grandeur of the game – I should have my gun raised, looking intently at the next door or checking my back, but instead I’m looking down at the guy whose brain I just stomped, hoping to god that he drops some ammo so I’m not totally screwed in the next room – and then guess what, he drops experience. Fucking experience.
My memories of RE6 should include the epic boss battles or being startled by a surprise monster, but instead I remember running around after killing everything, picking up swag and kicking in boxes looking for crap. Tell whoever is pushing the gamification to shove it and make the game epic!

tl;dr

• Reduce/remove the item drops; it makes the game less cinematic. Same goes for skill points – just calculate it somehow.
• Melee is sick, don’t get me wrong, but it should be used when it makes sense (think cinematic).
• The mutations and enemy variants made sense in RE4; no need to one-up it.
• I like shooting zombies in the face!
• If enemies have guns, we need more ammo.

P.S. I just bought RE4 HD for PC and I am pumped!

# January 28, 2014 10:49 PM

I extracted some of the useful code and nifty examples from the background of my Thesis as a python library for your enjoyment. PCA or Principal Component Analysis is a pretty common data analysis technique; incremental PCA lets you perform the same type of analysis but feeds in the input data one sample at a time rather than all at once. The code fully conforms to the scikit-learn api and you should be able to easily use it anywhere you are currently using one of the sklearn.decomposition classes. In fact this library is sort of on the waiting list for sklearn: https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

IPCA on a 2D point cloud shaped like an ellipse

Check it out if you’re interested, and holla at sklearn if you want this feature! https://github.com/pickle27/pyIPCA

# January 17, 2014 02:17 PM

#### Threading using C++11: Part 1

Recently I bumped into a great introductory video series on concurrent programming using C++11 by Bo Qian on YouTube. Here I am just noting down the important parts and playing with things. Some of the examples may be used directly as they are in the video series.

##### Basic threading environment in C++11

The following example introduces the basic constructs.
```cpp
/**
 * filename: test1.cpp
 * a simple example of using threads in c++11
 * to compile, use
 *   g++ -lpthread -std=c++0x test1.cpp
 * on a linux environment
 */
#include <iostream>
#include <thread>

using namespace std;

// a stupid thread function
void thread_function(int n)
{
    for (int i=0; i<n; ++i)
        // each thread has a unique id
        cout << "thread with id " << this_thread::get_id()
             << " says hello " << i << endl;
    cout << "address of n in t1 thread is " << &n << endl;
}

// a stupid thread function that takes a reference
void thread_function_ref(int& n)
{
    for (int i=0; i<n; ++i)
        cout << "thread with id " << this_thread::get_id()
             << " says hello " << i << endl;
    cout << "address of n in t2 (reference) thread is " << &n << endl;
}

// a stupid functor
class functor
{
    int state;
public:
    functor():state(0){};
    void operator()(int n)
    {
        for (int i=0; i<n; ++i)
            cout << "thread with id " << this_thread::get_id()
                 << " says hello " << i << endl;
    }
};

// a stupid main function
int main()
{
    int n = 3;

    // thread creation -
    // -----------------
    // thread using a normal function - takes a callable object as the first
    // argument and a number of other arguments which are passed as
    // arguments to the callable object (function, lambda expression, functor).
    // note that the arguments are always passed by value!
    thread t1(thread_function, n);

    // using std::ref we can pass the argument (n in this case) as a
    // reference to a function taking a reference parameter -
    // just passing n won't do the trick!
    thread t2(thread_function_ref, std::ref(n));

    // to verify, we print the address of n in the main thread and in the
    // child threads of the thread functions. let's see!
    cout << "address of n in main thread is " << &n << endl;
    // this is cool! using this we can share memory between threads. we can
    // also completely hand over a memory to another thread using std::move
    // but more on that later

    // joining and stuff -
    // -------------------
    // we may decide to join a thread, in which case the creator thread waits
    // until the execution of the child thread is finished. here, however, we
    // decide not to do that right away
    //
    //t1.join();

    // more funky ways of thread creation -
    // ------------------------------------
    // thread using a lambda function - lambda functions are cool! check more
    // about these on cppreference.com
    thread t3([n]() {
        // inherits variable n from the current scope and copies it to the
        // thread stack at a different address
        for (int i=0; i<n; ++i)
            cout << "thread with id " << this_thread::get_id()
                 << " says hello " << i << endl;
        cout << "address of n in t3 thread is " << &n << endl;
    });

    thread t4([&n]() {
        // inherits variable n from the current scope and uses the same address
        for (int i=0; i<n; ++i)
            cout << "thread with id " << this_thread::get_id()
                 << " says hello " << i << endl;
        cout << "address of n in t4 thread is " << &n << endl;
    });

    thread t5([](int n) {
        // doesn't inherit anything - we pass n by value - same as t1
        for (int i=0; i<n; ++i)
            cout << "thread with id " << this_thread::get_id()
                 << " says hello " << i << endl;
        cout << "address of n in t5 thread is " << &n << endl;
    }, n);
    // similarly we can pass it by reference as in t2 - but it's getting boring

    // thread using a functor -
    // ------------------------
    // creating an object first
    functor f1;
    thread t6(f1, n);
    // using an anonymous object (note the extra parentheses)
    thread t7((functor()), n);

    // some more boring stuff -
    // ------------------------
    // something to do for the main thread
    for (int i=0; i<n; ++i)
        cout << "thread with id " << this_thread::get_id()
             << " says hello " << i << endl;

    // join/detach all threads
    t1.detach();
    if (t1.joinable())
        t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    t7.join();

    return 0;
}
```

This code works.
When I compiled and ran it on my system (Fedora-19-x86_64), it worked, except it printed out something that’s pretty tough to read!

```
[lambday@lambday.iitb.ac.in thread_test]$ ./a.out
0
thread with id 0139845777069824 says hello
thread with id 139845768677120 says hello 1
thread with id 139845777069824 says hello 2
thread with id  says hello 0
thread with id 139845760284416 says hello 1
thread with id 139845760284416 says hello 2

thread with id 139845751891712 says hello 1
thread with id 139845751891712 says hello 2
0
thread with id 139845768677120 says hello 1
thread with id 139845768677120 says hello 2
139845785462528 says hello 1
139845743499008 says hello 0
thread with id 139845743499008 says hello 1
thread with id 139845743499008 says hello 2
thread with id 139845785470848 says hello 0
thread with id 139845785470848 says hello 1
thread with id 139845785470848 says hello 2
thread with id 139845735106304 says hello 0
thread with id 139845735106304 says hello 1
thread with id 139845735106304 says hello 2
```
This is because all the threads are racing for the same resource, stdout, causing a data race. This must be handled using a mutex, which provides a lock/unlock mechanism for shared resources. The following example shows how.

##### Handling data races using a mutex

```cpp
/**
 * filename: test2.cpp
 * a simple example of using a shared resource, such as cout,
 * among multiple threads securely using mutex
 */

#include <iostream>
#include <string>
#include <thread>
#include <mutex>

using namespace std;

class printer
{
    ostream& os;
    mutex& mu;
public:
    // making default constructor, copy constructor and
    // assignment operator implicitly disabled for apparently
    // no reason
    printer()=delete;
    printer(const printer&)=delete;
    const printer& operator=(const printer&)=delete;

    // the constructor that we will be using
    printer(mutex& m):os(cout), mu(m){}

    // the function that we will be using
    // I tried using string& s as a parameter but it didn't work!
    void shared_print(string s, int i)
    {
        // before using the resource, that is os, we first need to
        // lock the mutex, which could normally be done using the
        // mutex::lock and mutex::unlock methods. however, if in
        // between lock and unlock some exception happens, then
        // the mutex is never going to be unlocked again and the
        // other threads are kept waiting till death
        //
        // lock_guard provides a Resource Acquisition Is Initialization
        // (RAII) sort of thing, which calls unlock in its destructor.
        // therefore, whenever the control goes outside of this block and
        // guard gets killed, the mutex will be unlocked!
        // sweet!
        lock_guard<mutex> guard(mu);
        os << s << i << endl;
    }
};

// the thread function that prints via the shared printer
// (signature reconstructed - the original header was lost)
void thread_function(printer& pr, int n)
{
    for (int i=0; i<n; ++i)
        // tried using std::ref which gave weird errors
        // that it has been deleted!! any clue??
        pr.shared_print(string("t1():"),i);
}

int main()
{
    // mutex is the main tool for employing mutual exclusion on
    // shared resources
    mutex mu;

    printer pr(mu);
    int n = 4;

    // creating a thread to print stuff. in fact everyone who wishes
    // to print stuff using cout should use this! otherwise cout will
    // be exposed to vulnerabilities and somebody can do nasty stuff.
    //
    // it's really important that we never ever give access
    // to the ostream inside printer to anybody!
    thread t1(thread_function, std::ref(pr), n);

    // it's a good practice to surround the following within
    // a try-catch block, so that if the following part of the
    // code throws an exception, the child thread still gets joined
    //
    try {
        for (int i=0; i<n; ++i)
            pr.shared_print(string("main():"),i);
    } catch(...) {
        t1.join();
        throw;
    }

    t1.join();

    return 0;
}
```


This gives a pretty, readable output:

```
[lambday@lambday.iitb.ac.in thread_test]$ ./a.out
main():0
t1():0
t1():1
t1():2
t1():3
main():1
main():2
main():3
```

The number of threads to create is often a good question. It makes sense to create as many threads as there are CPUs available; otherwise a lot of time is wasted in context switching, which is bad for program performance. The std::thread::hardware_concurrency() function provides a way to find that number.

In the next example, a different kind of lock guard is shown in action, which provides a deferred locking mechanism.

```cpp
/**
 * filename: test3.cpp
 * this shows usage of unique_lock
 */

#include <iostream>
#include <thread>
#include <mutex>

using namespace std;

// similar to what we have seen before, locks the data
// and then writes to it
void thread_function(int& data, mutex& mu, void (*f)(int&))
{
    // we shouldn't create a lock_guard here because there is
    // no point in locking the data between the iterations of
    // the loop - the other thread would then have to wait until
    // this function returns in one thread
    unique_lock<mutex> guard(mu, defer_lock);
    for (int i=0; i<10; ++i)
    {
        // here we could create the lock as a lock_guard, and before the
        // next iteration the mutex would be unlocked. but to avoid
        // the overhead of creating and destroying a lock_guard object
        // each time, we can use unique_lock's lock/unlock mechanism
        // with a deferred lock
        //lock_guard<mutex> guard(mu);
        guard.lock();
        f(data);
        cout << this_thread::get_id() << ": says data is " << data << endl;
        guard.unlock();
    }
}

int main()
{
    mutex mu;

    // data that the threads try to write into -
    // must be synchronized using the mutex to
    // avoid a data race
    int data = 0;

    // creating a thread that tries to increase the data
    thread t1(thread_function, std::ref(data), std::ref(mu),
        [](int& n){n++;});
    // another thread which tries to decrease the data
    thread t2(thread_function, std::ref(data), std::ref(mu),
        [](int& n){n--;});

    t1.join();
    t2.join();

    return 0;
}
```

Output is shown below.
After the execution is over, the value of data is the same as before, as one might expect.

```
[lambday@lambday.iitb.ac.in thread_test]$ ./a.out
140657387300608: says data is 1
140657387300608: says data is 2
140657387300608: says data is 3
140657387300608: says data is 4
140657387300608: says data is 5
140657387300608: says data is 6
140657378907904: says data is 5
140657378907904: says data is 4
140657378907904: says data is 3
140657378907904: says data is 2
140657378907904: says data is 1
140657378907904: says data is 0
140657378907904: says data is -1
140657378907904: says data is -2
140657378907904: says data is -3
140657378907904: says data is -4
140657387300608: says data is -3
140657387300608: says data is -2
140657387300608: says data is -1
140657387300608: says data is 0
```
Some more pointers are coming up next! Wrapping up this one because it's getting too big.

# January 17, 2014 08:39 AM

Let me start this with a happy note. I am really glad that my proposal for “Estimators of large-scale sparse Gaussian” has been accepted for Google Summer of Code, 2013. I'm really looking forward to an awesome summer. I promised myself to write more often about this, but I've been really busy with a task that was needed for this project. I hope I can include some of that experience in this post too.

Some theoretical background about the project -

The aim of this project is to estimate the log-determinant (up to an arbitrary precision) of a very large sparse precision matrix (inverse of covariance matrix) that arises in the log-likelihood expression of a multivariate Gaussian distribution. A direct method (like the one that I already added to Shogun, CStatistics::log_det()) relies on Cholesky factorization of the matrix, whose factor, in practice, may not be that sparse and often cannot even be stored in memory, and hence is not feasible in most practical scenarios. The idea of this project borrows a concept from complex analysis (Cauchy's integral formula for matrix functions), which represents a matrix function as a contour integral in the complex plane. Rational approximation of this integral for a matrix function (logarithm of a matrix in this case) times a vector leads to a shifted family of linear systems with complex shifts, weights and a constant, which can be estimated up to an arbitrary precision.

The task of estimating the log-determinant then becomes sampling the trace of the log of the matrix using a set of vectors (called probing vectors, generated using greedy graph coloring), and, in that expression, fitting the log-matrix times vector using the rational approximation formula.
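In symbols, the two steps can be sketched as follows (standard notation for this family of estimators; the specific symbols are mine, not from the original post):

$$\log\det(A) \;=\; \operatorname{tr}(\log A) \;\approx\; \sum_{j=1}^{m} s_j^{\top} (\log A)\, s_j,$$

where the $s_j$ are the probing vectors, and each matrix-logarithm-times-vector product is replaced by the rational approximation

$$(\log A)\, s_j \;\approx\; \sum_{l=1}^{N} \alpha_l \,(A - \sigma_l I)^{-1} s_j,$$

with complex shifts $\sigma_l$ and weights $\alpha_l$ coming from the contour quadrature, so each term is just a sparse shifted linear solve.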

There have been quite a few changes in the design and structure of the framework compared to what I initially had in mind. Heiko suggested some really cool ideas regarding the way it should work.

# December 13, 2013 04:45 PM

#### MLOSS workshop at NIPS 2013

Last week, I went to the Advances in Neural Information Processing Systems (NIPS) conference for the first time. That was a very nice experience due to the incredible density of people whose names I know from research papers. In fact, it was too much to take in, so I had to pick things that sounded interesting - still loads.

The three main buzzwords of the conference for me were: Deep Learning (even Mark Zuckerberg is interested in that these days), Mini-batch, and stochastic gradient descent (aka on-line whatever).

One very interesting workshop I attended on Tuesday was on Machine Learning Open-Source Software (MLOSS), organised by Cheng Soon Ong (who could not be there unfortunately) and Antti Honkela. I presented a short spotlight for Shogun (slide) and had a one hour demo, showing off with our cool IPython notebooks (link) and the cloud Shogun server (link). I got some very encouraging feedback for this, including from Fernando Perez.
I also met a few nice fellow open-source ML coders from scikit-learn.

During the workshop, there was a quite lively discussion about licensing issues, in particular whether to choose GPL or BSD. The python universe for example seems to gain a lot from being BSD-style licensed.

Finally, NIPS was held close to Lake Tahoe, which is surrounded by incredibly beautiful mountains to hike in. One evening, I met the guy who left those traces ... very exciting, slightly scary...

# November 14, 2013 04:18 PM

#### Keynotes at ACML 2013

We were very lucky this year to have an amazing set of keynote speakers at ACML 2013, who have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website.

We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind the scenes tour of TrueSkill, the player matching algorithm used on XBox Live. Source code in F# is available here and the version generalised to track skill over time is available here.

Thanks to Geoff, Chih-Jen and Ralf for sharing their enthusiasm!

# October 29, 2013 11:38 PM

#### GSoC 2013 brings Shogun 3.0

Shogun’s third Google Summer of Code just ended with our participation in the mentor summit at Google’s headquarters in Mountain View and the release of Shogun 3.0 (link). What a great summer! But let’s start at the beginning…

Shogun is a toolbox that offers a unified framework for data-analysis, or in buzz words: machine learning, for a broad range of data types and analysis problems. Those not only include standard tools such as regression, classification, clustering, etc, but also cutting edge techniques from recent developments in research. One of Shogun’s most unique features is its interfaces to a wide range of mainstream computing languages.

In our third GSoC, we continued most of the directions taken in previous years, such as asking students to contribute code during the application process in order to be considered. For that, we created a list of smaller introductory tasks for each of the GSoC projects that would become useful later in the project. While allowing students to get used to our development process and increasing the quality of the applications, this also pushed the projects forward a bit before GSoC even started. The number of applications did not suffer from that (57 proposals from 52 students) but even increased compared to the previous year (48 proposals from 38 students) -- this seems to be a trend.

This summer, we also had former GSoC students mentoring for the first time: Sergey Lisitsyn and me (mentoring two projects). Both of us joined in 2011. In addition, the former student Fernando Iglesias participated again and former student Viktor Gal stayed around to work on Shogun during GSoC (and did some massive infrastructure improvements). These are very nice long term effects of continuous GSoC participation. Thanks to GSoC, Shogun is growing constantly both in terms of code and developers.

As in 2012, we eventually could give away 8 slots to some very talented students. All of them did an awesome job on some highly involved projects covering a large number of topics. Two projects were extensions of previous ones:

Roman Votjakov extended last year’s project on the popular Gaussian Processes for handling classification problems and Shell Hu implemented a collection of algorithms within last year’s structured output framework (for example for OCR)

Fernando Iglesias implemented metric learning with large margin nearest neighbours, a new algorithm which plays well together with existing methods in Shogun.

Another new algorithm came from Soumyajit De, who implemented an estimation method for log-determinants of large sparse matrices (needed, for example, for large-scale Gaussian distributions), along with a framework for linear operators and solvers, and the fundamentals of an upcoming framework for distributed computing (which his algorithm uses) on the fly.

Evangelos Anagnostopoulos worked on feature hashing and random kitchen sinks, two very cool tricks to speed up linear and kernel-based learning methods in Shogun. Kevin Hughes implemented methods for independent component analysis, which can be used to separate mixtures of signals (for example audio, heart-beats, or images) and are well known in the community.

Last but not least, Liu Zhengyang created a pretty web-framework for running Shogun demos from the web browser and added support for directly loading data from the mldata website. Evgeniy Andreev improved Shogun’s usability by integrating native support for various popular file formats such as CSV and protobuf.

You might have noticed the links in the above text (and images). Most of them are the final reports of the students in the form of IPython notebooks, an awesome new open-source tool that we started using for documentation. We are very proud of these.  See http://shogun-toolbox.org/page/documentation/notebook/ for a list of all notebooks. Also check out the web-demo framework at http://www.shogun-toolbox.org/page/documentation/demo/ if you haven't yet.

IPython also features Shogun in the cloud: Former student Viktor Gal did setup http://cloud.shogun-toolbox.org which is an IPython notebook server ran by us. It allows you to play with Shogun-python from any web-browser without having to install it. You can try the existing notebooks or write your own. Give it a shot and let us know what you think!

This year’s GSoC was also the most productive one for us ever. We got more than 2000 commits changing almost 400000 lines in more than 7000 files since our last release before GSoC.

Students! You all did a great job and we are more than amazed what you all have achieved. Thank you very much and we hope some of you will stick around.

Besides all the above individual projects, we encouraged students to work together a bit more to enable synergistic effects. One way we tried to implement this was through a peer review where we paired students to check each other's interface documentation and final notebooks. We held the usual meetings with both mentors and students every few weeks to monitor progress and happiness, as well as asking students to write weekly reports. Keeping our IRC channel active every day also helped a lot in keeping things going.

My personal experience with mentoring was very positive. It is very nice to give back to the community. I tried to give them the same useful guidance that I received back then, and probably learned as much as my students did on the way. Having participated in GSoC 2011 and 2012, the change of perspective as a mentor was interesting, in particular regarding the selection process. Time wise, I think Google’s official statement of 5 hours per student per week is underestimating things quite a bit (if you want to get things done), and of course there is no upper bound on time you can spend.

Our plan of pairing external mentors with internal developers worked smoothly. As most of our mentors are scientists who tend to be very busy, it is sometimes hard for them to review all code on their own. Combining big-picture guidance with the in-depth framework knowledge of the paired core developers allowed for more flexibility when allocating mentors for projects. Keep in mind that Shogun is still being organised by only five people (4 former students) plus a handful of occasional developers, which makes it challenging to supervise 8 projects.

Another change this year was that writing unit-tests was mandatory to get code merged, which made the number of unit tests grow from 50 to more than 600. In the past years, we had seen how difficult it is to write tests at the end of projects, or to maintain untested code. Making students do this on the fly drastically increased the stability of their code. A challenging side-effect of this was that many bugs within Shogun were discovered (and eventually fixed), which kept students and developers busy.

As for Shogun itself, GSoC also boosts our community of users, which became so active this year that we decided to organise the first Shogun workshop in Berlin this summer. We had somewhat over 30 participants from all over the world. The Shogun core team also met for the first time in real life, which was nice! We had a collection of talks, discussions, and hands-on sessions. Click here and here for videos and slides.

October brought the mentor summit, which I attended for the first time. This was such a cool event! There was a hotel with a hot-tub, lots of goodies on the Google campus such as an on-site barista (!), a GSoC mentor with a robot-dog, and loads and loads of interesting people from interesting open-source projects. Some of these were new to me; some of them are projects that I have been checking out for more than 10 years now. I attended a few fruitful sessions, for example on open-source software for science. Sören hung out with the people he knew from previous years and the cool Debian guys (for which he is a developer too).

After the summit, the Shogun mentor team went hiking in the south Californian desert - I even climbed a rock.

What a great summer!

# October 29, 2013 09:15 PM

#### Shogun Toolbox Version 3.0 released!

Dear all,

we are proud to announce the 3.0 release of the Shogun Machine-Learning Toolbox. This release features the incredible projects of our 8 hard-working Google Summer of Code students. In addition, you get other cool new features as well as lots of internal improvements, bugfixes, and documentation improvements. To speak in numbers, we got more than 2000 commits changing almost 400000 lines in more than 7000 files and increased the number of unit tests from 50 to 600. This is the largest release that Shogun ever had! Please visit http://shogun-toolbox.org/ to obtain Shogun.

### News

Here is a brief description of what is new, starting with the GSoC projects, which deserve most fame:

• Gaussian Process classification by Roman Votjakov
• Structured Output Learning of graph models by Shell Hu
• Estimators for log-determinants of large sparse matrices by Soumyajit De
• Feature Hashing and random kitchen sinks by Evangelos Anagnostopoulos
• Independent Component Analysis by Kevin Hughes
• A web-based demo framework by Liu Zhengyang
• Metric learning with large margin nearest neighbours by Fernando Iglesias
• Native support for various popular file formats by Evgeniy Andreev

### Screenshots

Everyone likes screenshots. Well, we have got something better! All of the above projects (and more) are now documented in the form of IPython notebooks, combining machine learning fundamentals, code, and plots. Those are a great looking way that we chose to document our framework from now on. Have a look at them and feel free to submit your use case as a notebook!

The web-demo framework has been integrated into our website, go check them out.

### Other changes

We finally moved the Shogun build process to CMake. Through GSoC, we added general clone and equals methods to all Shogun objects, and added automagic unit-testing for serialisation and clone/equals for all classes. Other new features include multiclass LDA and probability outputs for multiclass SVMs. For the full list, see the NEWS.

### Workshop Videos and slides

In case you missed the first Shogun workshop that we organised in Berlin last July, all of the talks have been put online.

### Shogun in the Cloud

As setting up the right environment for Shogun and installing it was always one of the biggest problems for users (hence the switch to CMake), we have created a sandbox where you can try out Shogun without installing it on your system! Basically it's a web-service which gives you access to your own IPython notebook server with all the Shogun notebooks. Of course you are more than welcome to create and share your own notebooks using this service! *NOTE*: This is a courtesy service created by Shogun Toolbox developers, so if you like it please consider some form of donation to the project so that we can keep this service running for you. Try Shogun in the cloud.

#### Thanks

The release has been made possible by the hard work of all of our GSoC students, see list above. Thanks also to Thoralf Klein and Björn Esser for the load of great contributions. Last but not least, thanks to all the people who use Shogun and provide feedback.

Sören Sonnenburg on behalf of the Shogun team (+ Viktor Gal, Sergey Lisitsyn, Heiko Strathmann and Fernando Iglesias)

# October 29, 2013 08:50 PM

Shogun goes cloud. Try out http://cloud.shogun-toolbox.org to interactively play with machine learning algorithms or try any of the interactive demos.

The SHOGUN Machine Learning Toolbox is a collection of algorithms designed for unified learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. For more information visit http://www.shogun-toolbox.org.

# September 14, 2013 12:28 AM

#### Latent Gaussian Models in Reykjavik

This weekend, I visited the beautiful city Reykjavik in Iceland for the first time. I participated in this year's workshop on Latent Gaussian models (link) (in fact mostly spatial statistics) and also presented a poster (link), which is about our recent work on Russian Roulette for intractable likelihoods (arXiv, blog). Met a lot of nice people doing interesting things there.

# September 03, 2013 06:38 PM

#### Google Summer of Code 2013

Google Summer of Code (GSoC) 2013 has been an absolute blast! The majority of my heavy coding is over, so I wanted to post a bit about the experience. It has been fantastic; I’ve learned so much and done so many different and unexpected things. Before I go any further I want to give a big thanks to all the Shogun devs who helped me out and made this program so great, and also thanks to Google for running such a kick ass program.

My project was to code several Independent Component Analysis algorithms, specifically those based on Approximate Joint Diagonalization of matrices. It was pretty cool and similar enough to my thesis work that I was able to jump right in fairly quickly. It was an interesting change of pace for me; as I like to put it: it’s called Google Summer of Code, not Google Summer of Research, and having the code be the number one priority was a welcome change. As the focus was code, I spent a lot of time translating research papers and authors’ source code into production code. I like to think I am quite the whiz at porting between numerical libraries now (matlab -> python, python -> c, etc.). Also I’m now so familiar with NumPy, Octave and Eigen3 (and almost R) that I can pretty much work fluently in each and switch between them almost without noticing. Have a look at my recent post One Example Three Languages!

One of the other things I got into this summer was playing with the Shogun Modular interfaces, which are created using SWIG. I once tried to play with SWIG for one of my own projects but unfortunately never got far. This summer, though, I updated a few of the typemaps to add support for NDArray, which some of my classes needed. I also played with updating the ruby modular interface to use the newer and more active NMatrix numerical library (not included in Shogun as of yet, though). Anyway, playing with typemaps was an interesting experience and I definitely learned more than a few things.

One of the other things I learned was the softer side of class and framework design. I realized that even though I’ve been doing OOP for years, one thing I still need more experience with is laying out a new class from scratch. The first time I had to do this I had to stop for a second and think; I actually wrote a basic foo bar style class example to double check that what I wanted to do would work. In the end I am quite happy with the class structure I came up with, and I look forward to being involved in this type of design more often in the future.

That’s all I can think of for now! If you’re a student I highly recommend doing GSoC!

Also here is a link to my final project:

# September 01, 2013 10:33 AM

#### What does the “OSS” in MLOSS mean?

I was recently asked to become an Action Editor for the Machine Learning and Open Source Software (MLOSS) track of Journal of Machine Learning Research. Of course, I gladly accepted since the aim of the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage the creation and use of open source software within machine learning -- is well aligned with my own interests and attitude towards scientific software.

Shortly after I joined, one of the other editors raised a question about how we should interpret an item in the review criteria stating that reviewers should consider the "freedom of the code (lack of dependence on proprietary software)" when assessing submissions. What followed was an engaging email discussion amongst the Action Editors about how to clarify our position.

After some discussion (summarised below), we settled on the following guideline which tries to ensure MLOSS projects are as open as possible while recognising the fact that MATLAB, although "closed", is nonetheless widely used within the machine learning community and has an open "work-alike" in the form of GNU Octave:

Dependency on Closed Source Software

We strongly encourage submissions that do not depend on closed source and proprietary software. Exceptions can be made for software that is widely used in a relevant part of the machine learning community and accessible to most active researchers; this should be clearly justified in the submission.

The most common case here is the question of whether we will accept software written for MATLAB. Given its wide use in the community, there is no strict reject policy for MATLAB submissions, but we strongly encourage submissions to strive for compatibility with Octave unless absolutely impossible.

## The Discussion

There were a number of interesting arguments raised during the discussion, so I offered to write them up in this post for posterity and to solicit feedback from the machine learning community at large.

### Reviewing and decision making

A couple of arguments were put forward in favour of a strict "no proprietary dependencies" policy.

Firstly, allowing proprietary dependencies may limit our ability to find reviewers for submissions -- an already difficult job. Secondly, stricter policies have the benefit of being unambiguous, which would avoid repeated discussions about the acceptability of individual submissions.

### Promoting open ports

An argument made in favour of accepting projects with proprietary dependencies was that doing so may actually increase the chances of its code being forked to produce a version with no such dependencies.

Mikio Braun explored this idea further along with some broader concerns in a blog post about the role of curation and how it potentially limits collaboration.

### Where do we draw the line?

Some of us had concerns about what exactly constitutes a proprietary dependency and came up with a number of examples that possibly fall into a grey area.

For example, how do operating systems fit into the picture? What if the software in question only compiles on Windows or OS X? These are both widely used but proprietary. Should we ensure MLOSS projects also work on Linux?

Taking a step up the development chain, what if the code base is most easily built using proprietary development tools such as Visual Studio or XCode? What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL library are needed for performance reasons?

Things get even more subtle when we note that certain data formats (e.g., for medical imaging) are proprietary. Should such software be excluded even though the algorithms might work on other data?

These sorts of considerations suggested that a very strict policy may be difficult to enforce in practice.

### What is our focus?

It is pretty clear what position Richard Stallman or other fierce free software advocates would take on the above questions: reject all of them! It is not clear that such an extreme position would necessarily suit the goals of the MLOSS track of JMLR.

Put another way, is the focus of MLOSS the "ML" or the "OSS"? The consensus seemed to be that we want to promote open source software to benefit machine learning, not the other way around.

## Looking At The Data

Towards the end of the discussion, I made the argument that if we cannot be coherent we should at least be consistent and presented some data on all the accepted MLOSS submissions. The list below shows the breakdown of languages used by the 50 projects that have been accepted to the JMLR track to date. I'll note that some projects use and/or target multiple languages and that, because I only spent half an hour surveying the projects, I may have inadvertently misrepresented some (if I've done so, let me know).

C++: 15; Java: 13; MATLAB: 11; Octave: 10; Python: 9; C: 5; R: 4.

From this we can see that MATLAB is fairly well represented amongst the accepted MLOSS projects. I took a closer look and found that of the 11 projects that are written in (or provide bindings for) MATLAB, all but one also support GNU Octave.

## Closing Thoughts

I think the position we've adopted is realistic, consistent, and suitably aspirational. We want to encourage and promote projects that strive for openness and the positive effects it enables (e.g., reproducibility and reuse) but do not want to strictly rule out submissions that require a widely used, proprietary platform such as MATLAB.

Of course, a project like MLOSS is only as strong as the community it serves, so we are keen to get feedback about this decision from people who use and create machine learning software. Feel free to leave a comment or contact one of us by email.

Note: This is a cross-post from Mark's blog at Inductio ex Machina.

# August 17, 2013 06:47 PM

#### Becoming a Web Developer

Now that Grad School is over I'm moving on to the next exciting chapter of my career: I'm joining a great company called Shopify in the fall and I am going to be working as a Web Developer! I'm quite excited; in fact, so excited that I took a course from Udacity to get up to speed on some web dev basics.

CS 253 Web Development – Building a Blog with Steve Huffman (for those who don't know, this is the guy who started Reddit, so he might know a thing or two about building websites) was a really great course and I would definitely recommend it to anyone who wants to learn a thing or two about the web. I also really liked Steve's teaching style: while he did a great job explaining things simply, he also wasn't afraid to show how he really works, i.e. in the terminal, using Linux/Unix commands. The course could easily have hidden all of this away, but I think it was important to show; using Windows and GUIs just isn't how people work in this industry, so why should the course be taught that way? Good job Steve for keeping it real!

Taking the class was really worthwhile, as it helped tie together a bunch of knowledge I had accumulated randomly over the years, and it helped make sense of some of the Django hacking I did once upon a time (I say hacking because I got stuff to work but didn't totally understand everything).

Here is my hard-earned certificate! I completed the course with High Distinction, meaning I did all the homework, the final, and the bonus question!

I also pushed all my code to my GitHub account:
https://github.com/pickle27/cs253_blog
https://github.com/pickle27/cs253_wiki

# August 14, 2013 12:00 AM

#### Code review for science

How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.

In an interesting experiment between the Mozilla Science Lab and PLoS Computational Biology, a selected number of papers with snippets of code from the latter will be reviewed by engineers from the former.

For more details see the blog post by Kaitlin Thaney.

# August 09, 2013 06:58 PM

#### Thesis is published!

I posted a link from my publications page:
http://kevinhughes.ca/publications/

# August 05, 2013 06:10 PM

#### One Example Three Languages

I wanted to post this example of my Google Summer of Code work because I think it's neat. One of the cool things about Shogun is our great SWIG wrapper and our static interfaces, which let us use Shogun natively in a bunch of different languages. So here is an example program doing Blind Source Separation using the Jade algorithm from Python, Octave and R:

"""
Blind Source Separation using the Jade Algorithm with Shogun
Based on the example from scikit-learn

http://scikit-learn.org/

Kevin Hughes 2013
"""

import numpy as np
import pylab as pl

from shogun.Features import RealFeatures

# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 10, n_samples)

# Source Signals
s1 = np.sin(2 * time)  # sin wave
s2 = np.sign(np.sin(3 * time))  # square wave
S = np.c_[s1, s2]
S += 0.2 * np.random.normal(size=S.shape)  # add noise

# Standardize data
S /= S.std(axis=0)
S = S.T

# Mixing Matrix
A = np.array([[1, 0.5], [0.5, 1]])

# Mix Signals
X = np.dot(A,S)
mixed_signals = RealFeatures(X)

# Separating
# Apply Shogun's Jade converter to unmix the signals
from shogun.Converter import Jade
jade = Jade()
signals = jade.apply(mixed_signals)
S_ = signals.get_feature_matrix()

# Plot results
pl.figure()
pl.subplot(3, 1, 1)
pl.plot(S.T)
pl.title('True Sources')
pl.subplot(3, 1, 2)
pl.plot(X.T)
pl.title('Mixed Sources')
pl.subplot(3, 1, 3)
pl.plot(S_.T)
pl.title('Estimated Sources')
pl.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.36)
pl.show()

% Blind Source Separation using the Jade Algorithm with Shogun
%
% Based on the example from scikit-learn
% http://scikit-learn.org/
%
% Kevin Hughes 2013

% Generate sample data
n_samples = 2000;
time = linspace(0,10,n_samples);

% Source Signals
S = zeros(2, length(time));
S(1,:) = sin(2*time);
S(2,:) = sign(sin(3*time));
S += 0.2*rand(size(S));

% Standardize data
S = S ./ std(S,0,2);

% Mixing Matrix
A = [1 0.5; 0.5 1];

% Mix Signals
X = A*S;
mixed_signals = X;

% Separating
sg('set_features', 'TRAIN', mixed_signals);
S_ = sg('apply_converter');

% Plot
figure();
subplot(311);
plot(time, S(1,:), 'b');
hold on;
plot(time, S(2,:), 'g');
set(gca, 'xtick', [])
title("True Sources");

subplot(312);
plot(time, X(1,:), 'b');
hold on;
plot(time, X(2,:), 'g');
set(gca, 'xtick', [])
title("Mixed Sources");

subplot(313);
plot(time, S_(1,:), 'b');
hold on;
plot(time, S_(2,:), 'g');
title("Estimated Sources");

# Blind Source Separation using the Jade Algorithm with Shogun
#
# Based on the example from scikit-learn
# http://scikit-learn.org/
#
# Kevin Hughes 2013

library('sg')

# Generate sample data
n_samples <- 2000
time <- seq(0,10,length=n_samples)

# Source Signals
S <- matrix(0,2,n_samples)
S[1,] <- sin(2*time)
S[2,] <- sign(sin(3*time))
S <- S + 0.2*matrix(runif(2*n_samples),2,n_samples)

# Standardize data
S <- S * (1/apply(S,1,sd))

# Mixing Matrix
A <- rbind(c(1,0.5),c(0.5,1))

# Mix Signals
X <- A %*% S
mixed_signals <- matrix(X,2,n_samples)

# Separating
sg('set_features', 'TRAIN', mixed_signals)
S_ <- sg('apply_converter')

# Plot
par(mfcol=c(3,1));

plot(time, S[1,], type="l", col='blue', main="True Sources", ylab="", xlab="")
lines(time, S[2,], type="l", col='green')

plot(time, X[1,], type="l", col='blue', main="Mixed Sources", ylab="", xlab="")
lines(time, X[2,], type="l", col='green')

plot(time, S_[1,], type="l", col='blue', main="Estimated Sources", ylab="", xlab="")
lines(time, S_[2,], type="l", col='green')


# August 02, 2013 10:02 PM

#### Successfully Defended my Thesis!

Yesterday I defended my thesis titled “Subspace Bootstrapping and Learning for Background Subtraction”. Grad School has been a blast but I’m definitely looking forward to employed life!

I’ll post a link to my thesis under publications as soon as Queen’s uploads it to their system.

# July 17, 2013 11:05 AM

#### Shogun Workshop 2013

Last weekend, our Shogun workshop finally took place in Berlin. It was really cool to meet all those guys in person; we have been working together for quite some time now. The core team and Shogun's supporters are absolutely awesome. It is great to be part of that.

We had a nice afternoon at c-base (who were so friendly to host us) with some talks by all of our developers, followed by two days of hands-on workshop at the TU-Berlin.

I gave a little talk on two random things you can do with kernels (that are otherwise completely unrelated): Gaussian Processes and the kernel MMD. Slides are available (download). I also wrote some IPython notebooks for GP regression (link), GP probit classification (link), and two-sample testing with the kernel MMD (link).
One of the results of our discussions was that we will start using those notebooks for Shogun's documentation, as they allow combining code, plots, and maths in a web-based viewer.

Finally, here are some pictures of us (pretty nerdy).

# July 15, 2013 03:02 PM

#### 8 Reasons Why Better Nutrition Makes You a Better Developer

Software developers are not known for having the best nutrition. When it comes to development work, the stereotypical late night Red Bull-fueled coding binge is often not too far from the truth. It's hard to imagine a hackathon without a stack of pizza boxes and a mountain of empty soda bottles. In addition, no good tech firm lets their kitchen run out of chips or Vitamin Water.

As a fellow primal/paleo software developer my experiences have been similar!

# July 09, 2013 11:05 PM

#### Shogun Toolbox Days 2013 Program and Updates

Dear all,

we are excited that the first Shogun workshop, July 12-14 in Berlin, is getting closer. Thanks to all the people who signed up -- we are sure it will be a packed and inspiring weekend!

We have finalized the schedule for Friday, July 12, taking place at the C-Base (see description below [1]). After an intro where everyone gets to know each other and where we introduce ourselves, Shogun, and machine learning in general, there will be some tutorials by Shogun developers. In addition, we will have discussions on various topics, with coffee breaks and lunch in between. Finally, we will enjoy a summer's evening in Berlin.

On Saturday and Sunday, July 13-14, there will be hands-on sessions at the Technical University Berlin [2], where developers will be around for closer discussions and practical guidance. Bring your laptop if you want to try things out.

See the final schedule for more details [1]. We plan to do video recordings of all lectures and will have a live stream [3].

See you there! The Shogun-Team

# July 07, 2013 04:00 PM

#### GSoC Weekly Report 3

This week I added an HDF5 data importer to all demos in shogun-demo. To support it, I made the coordinate system scalable and changed all the demos' input domains to fit the toy data sets. For now it can only load the features and labels in australian.libsvm.h5 from shogun-data; I'll make it accept more next week.

Next week, I'll finish the web-based OCR demo (like the one at http://shogun-toolbox.org/static/media/ocr.swf) and a dimensionality reduction demo, and add more functionality to the toy data generator/importer.

# June 30, 2013 04:00 PM

#### GSoC Weekly Report 2

This week, I did a lot of refactoring on shogun-demo. The demo framework can now auto-generate the code (JS, CSS, HTML) for the coordinate system, JSON interaction, mouse click input, argument handling, and heatmap drawing. A demo creator only needs to specify which arguments the demo needs, what shapes will be plotted, some details of the coordinate system, and a Python backend telling the server how the data will be processed; then a new demo is created! I also rewrote the toy data generator; it is now fully modular, so a demo creator can add a new toy data generator with only one line of code. The toy data importer is half finished and will be done tomorrow. I merged dvalcarce's binary classification and binary perceptron demos into this repo. They're now available at http://nn.7nn.de:8000/classification/binary/entrance and http://nn.7nn.de:8000/classification/perceptron/entrance .

Next week, I'll prepare some data sets from mldata.org for the toy data importer to use, make as many demos as possible, and add input validation to the existing framework (to defend against DoS attacks).

# June 28, 2013 05:19 PM

#### Book Review: Instant OpenCV Starter

I was recently contacted by Packt Publishing to do a quick review of their new OpenCV book, Instant OpenCV Starter. I was quite flattered that they contacted me (my internet presence must be improving), and they were going to provide me a free copy to review, so I agreed!

I just finished reading the book and I think it is quite good! It's on the shorter side, but it's also not that expensive, and I would definitely recommend it to people who are just starting with OpenCV (and maybe aren't the most rockstar programmers) and also to professors who are going to be teaching a computer vision course.

Perhaps the most helpful part for newcomers to OpenCV is the introduction and the installation tutorial. The book does a very nice job of describing how to install OpenCV on both Linux and Windows, the latter being quite similar to my own tutorial. I also liked how they added some details about the various dependencies required for building on Linux.

The book wraps up with some fairly simple OpenCV examples and one really cool example of Image Steganography. And as always with Packt the source code is available from their site and is of good quality.

I'd definitely recommend this book to newcomers, especially for the install tutorial, which covers an area the greater OpenCV community should address better. If you are a more advanced OpenCV user, or if you finish this book and are looking for more, I would recommend the OpenCV 2 Cookbook and definitely Mastering OpenCV with Practical Computer Vision Projects.

# June 26, 2013 04:10 PM

#### New Project ArduinoDAQ!

Check it out -> ArduinoDAQ

# June 26, 2013 12:35 PM

#### Russian Roulette for intractable Likelihoods

$\def\mm#1{\boldsymbol{#1}} \DeclareMathOperator{\tr}{tr}$

While I was working at UCL's Statistics Department in winter, I got involved in a very exciting project in the group of Mark Girolami. It is based around the Pseudo-Marginal Metropolis-Hastings algorithm. In 2003, a Genetics paper [1] described an approach for sampling from a distribution using the standard Metropolis-Hastings algorithm when the density function is not available, by simply replacing it with an unbiased estimate.

For a standard Bayesian inference problem with likelihood $\pi(y|\theta)$, prior $\pi(\theta)$, and a proposal $Q$, rather than using the standard M-H ratio $\frac{\pi(y|\theta^{\text{new}})}{\pi(y|\theta)}\times\frac{\pi(\theta^{\text{new}})}{\pi(\theta)}\times \frac{Q(\theta|\theta^{\text{new}})}{Q(\theta^{\text{new}}|\theta)},$ the likelihood is replaced by an unbiased estimator as

$\frac{\hat{\pi}(y|\theta^{\text{new}})}{\hat{\pi}(y|\theta)}\times\frac{\pi(\theta^{\text{new}})}{\pi(\theta)}\times \frac{Q(\theta|\theta^{\text{new}})}{Q(\theta^{\text{new}}|\theta)}.$ Remarkably, the resulting Markov chain converges to the same posterior distribution as the exact algorithm. The approach was later formalised and popularised in [2].
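To make the idea concrete, here is a toy pseudo-marginal sketch in Python (a minimal model of our own for illustration, not the one from [1] or [2]): the exact Gaussian likelihood is multiplied by positive noise with mean one, and the chain still targets the correct posterior. Note the crucial detail that the estimate at the current state is recycled rather than recomputed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: one observation y ~ N(theta, 1), prior theta ~ N(0, 10^2)
y = 1.5

def prior(theta):
    return np.exp(-theta**2 / (2.0 * 10.0**2))

def lik_hat(theta):
    """Unbiased, strictly positive likelihood estimate: the exact
    likelihood times log-normal noise with mean one."""
    noise = rng.lognormal(mean=-0.125, sigma=0.5)  # E[noise] = 1
    return np.exp(-(y - theta)**2 / 2.0) * noise

theta = 0.0
L_cur = lik_hat(theta)  # the estimate at the current state is kept fixed
chain = []
for _ in range(20_000):
    theta_new = theta + rng.normal()     # symmetric random-walk proposal
    L_new = lik_hat(theta_new)
    ratio = (L_new * prior(theta_new)) / (L_cur * prior(theta))
    if rng.random() < ratio:
        theta, L_cur = theta_new, L_new  # accept: adopt state AND estimate
    chain.append(theta)

# the chain mean should approach the exact posterior mean, y * 100/101
print(np.mean(chain))
```

Recomputing the estimate at the current state on every iteration would break the exactness of the method; keeping it fixed is what makes the chain marginally target the true posterior.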

In our project, we exploited this idea to perform inference over models whose likelihood functions are intractable. Examples of such intractable likelihoods are Ising models or, even simpler, very large Gaussian models; in both cases the normalising constant is very hard to compute. We came up with a way of producing unbiased estimators for the likelihoods, based on writing each likelihood as an infinite sum and then truncating it stochastically.

Producing unbiased estimators for the Pseudo-Marginal approach is a very challenging task: the estimates have to be strictly positive. This can be achieved by pulling the sign of the estimates out into the final Monte Carlo integral estimate and adding a correction term (which increases the variance of the estimator); this issue is studied under the term sign problem. The next step is to write the likelihood function as an infinite sum. In our paper, we do this for a geometrically tilted correction of a biased estimator obtained from an approximation such as importance sampling estimates, upper bounds, or deterministic approximations, and for likelihoods based on the exponential function.

I in particular worked on the exponential function estimate. We took a very nice example from spatial statistics: a worldwide grid of ozone measurements from a satellite, consisting of about 173,405 measurements. We fitted a simple Gaussian model whose covariance matrices are massive (and sparse). In such models, of the form $\log (\mathcal{N}_x(\mu,\Sigma))=-\log(\det(\Sigma)) - (\mu-x)^T \Sigma^{-1}(\mu-x) + C,$ the normalising constant involves the log-determinant of such a large matrix. Computing this with classical methods such as the Cholesky factorisation $\Sigma=LL^T \Rightarrow \log(\det(\Sigma))=2\sum_i\log(L_{ii})$ is impossible due to memory constraints: the Cholesky factor $L$ cannot be stored, since it is in general not sparse. We therefore constructed an unbiased estimator using a very neat method based on graph colourings and Krylov methods from [3].
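For matrices small enough that the Cholesky factor fits in memory, the identity above is a one-liner; a quick NumPy illustration (the matrix here is a stand-in for the, in reality, huge covariance):

```python
import numpy as np

# a small symmetric positive definite matrix standing in for Sigma
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])

# Cholesky factorisation: Sigma = L L^T with lower-triangular L
L = np.linalg.cholesky(Sigma)

# log(det(Sigma)) = 2 * sum_i log(L_ii)
logdet_chol = 2.0 * np.log(np.diag(L)).sum()

# sanity check against NumPy's direct (sign, log|det|) routine
sign, logdet_direct = np.linalg.slogdet(Sigma)
print(logdet_chol, logdet_direct)
```

It is exactly this route that becomes infeasible at 173,405 dimensions, since the factor $L$ fills in even when $\Sigma$ is sparse.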

This unbiased estimator of the log-likelihood is then turned into a (positive) unbiased estimator of the likelihood itself via writing the exponential function as an infinite series $\exp(\log(\det(\Sigma)))=1+\sum_{i=1}^\infty \frac{\log(\det(\Sigma))^i}{i!}.$

We then construct an unbiased estimator of this series by playing Russian roulette: we evaluate the terms of the series, plugging in a fresh estimator of $\log(\det(\Sigma))$ for every $i$; once those values are small, we start flipping a coin at each term to decide whether we continue the series or not. If we do continue, we add weights that ensure unbiasedness. We also make it less likely to continue at every iteration, so that the procedure eventually stops. This basic idea (borrowed from physics papers from some 20 years ago), together with some technical details and computational tricks, gives an unbiased estimator of the likelihood of our Gaussian model, which can therefore be plugged into Pseudo-Marginal M-H. This makes it possible to perform Bayesian inference over models of sizes where it has been impossible before.
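The roulette itself fits in a few lines. Below is a minimal sketch for a known value $x$ (in the paper, a fresh log-determinant estimate replaces $x$ in each factor and the continuation probability decays over iterations; here we keep a constant probability for clarity, which is enough to see the unbiasedness):

```python
import numpy as np

def roulette_exp(x, q=0.8, rng=None):
    """Unbiased Russian-roulette estimate of exp(x): truncate the
    series exp(x) = sum_i x^i / i! at a random point, reweighting
    each surviving term by the inverse probability of reaching it."""
    if rng is None:
        rng = np.random.default_rng()
    total = 1.0    # i = 0 term, always included
    term = 1.0     # running x^i / i!
    weight = 1.0   # running 1 / q^i, keeps the estimate unbiased
    i = 0
    while rng.random() < q:  # flip a coin: continue the series?
        i += 1
        term *= x / i
        weight /= q
        total += term * weight
    return total

rng = np.random.default_rng(0)
estimates = [roulette_exp(0.5, rng=rng) for _ in range(100_000)]
print(np.mean(estimates), np.exp(0.5))
```

Term $i$ survives with probability $q^i$ and is reweighted by $q^{-i}$, so the expectation of the truncated sum is exactly $\exp(x)$.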

More details can be found on our project page (link, see ozone link) and in our paper draft on arXiv (link). One of my Google Summer of Code projects this year for the Shogun machine learning toolbox is about producing a sophisticated implementation of log-determinant estimators (link). Pretty exciting!

[1]: Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164 1139–1160.
[2]: Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2), 697–725.
[3]: Aune, E., Simpson, D., & Eidsvik, J. (2012). Parameter Estimation in High Dimensional Gaussian Distributions.

# June 24, 2013 02:30 PM

#### GSoC weekly report – week 1

As planned, to keep the focus initially on a working framework, I started with the direct logarithm of dense matrices that Eigen3's unsupported module provides (available for Eigen 3.1.0 and later) instead of the rational approximation. This week I implemented the framework for dense matrix linear operators using Gaussian trace samples ($\mathcal{N}(0,I)$), which shows that for a small $2\times 2$ matrix, say,

$C=\begin{bmatrix} 2&1 \\ 1&3 \end{bmatrix}$

we can approximate $\log(\det(C))$ by estimating $E(s^{T}\log(C)s)$ up to a certain precision with a large number of estimates. For example, $\log(\det(C))=1.609438$, and for $1,000,000$ log-det estimates with Gaussian samples $s \sim\mathcal{N}(0,I)$, we get the estimate $E(s^{T}\log(C)s)\approx 1.609546$.
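This estimator is easy to check numerically. Here is a small NumPy sketch of the same computation (using an eigendecomposition for the matrix log rather than Eigen3 or the Shogun framework):

```python
import numpy as np

# the 2x2 matrix from above, det(C) = 5
C = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# log(C) for a symmetric positive definite matrix via eigendecomposition:
# log(C) = V diag(log w) V^T
w, V = np.linalg.eigh(C)
logC = V @ np.diag(np.log(w)) @ V.T

# exact log-determinant: the sum of log-eigenvalues (= log(5) here)
exact = np.sum(np.log(w))

# Monte Carlo: average s^T log(C) s over Gaussian samples s ~ N(0, I)
rng = np.random.default_rng(0)
s = rng.standard_normal((200_000, 2))
estimate = np.einsum('ij,jk,ik->i', s, logC, s).mean()

print(exact, estimate)
```

The estimator works because $E[s^T A s] = \tr(A)$ for $s \sim \mathcal{N}(0,I)$, and $\tr(\log(C)) = \log(\det(C))$.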

The framework has been developed in a modular way, implementing all the independent base classes first and then the dependent ones. I added a unit test for (almost) every component to ensure they behave as we expect. Currently the framework has -

• CLinearOperator<T> base class, CDenseMatrixOperator<T> implementation of this, a unit-test.
• CJobResult base class, CScalarResult<T> and CVectorResult<T> implementation of this, a unit-test for scalar.
• CJobResultAggregator base class, CStoreScalarAggregator<T> implementation of this and a unit-test.
• CIndependentJob base class, CDenseExactLogJob implementation of this and a unit-test.
• CIndependentComputationEngine base class, CSerialComputationEngine implementation, and a unit-test
• COperatorFunction<T> base class, CDenseMatrixExactLog implementation of this, a unit-test
• CTraceSampler base, CNormalSampler implementation, a unit-test
• CLogDetEstimator class and a unit-test

I have a small program ready (very similar to the CLogDetEstimator unit test) which I'll add to the libshogun examples. It shows how the framework works -

• The sample method of CLogDetEstimator computes a number of log-det estimates of a linear operator function
• In sample, it first computes what is required for the rest of the computation job (calling init on COperatorFunction and CTraceSampler). For example, the CDenseMatrixExactLog implementation computes the log of the dense matrix in its init and sets that as the linear operator in the operator function; CLogRationalApproximation will compute complex shifts, weights, etc. using Jacobi elliptic functions in its init; CNormalSampler initializes the number of samples it should generate per log-det estimate (which is 1 in this case); the CProbingSampler implementation will use a graph coloring of the sparse linear operator and set the number of colors as the number of estimates, all in its init.
• Then, for the required number of log-det estimates, sample uses the submit_job method of COperatorFunction to create a number of jobs per sample and keeps the JobResultAggregators with itself. submit_job internally creates a number of jobs (based on the implementation) with one aggregator, submits the jobs to the computation engine (which may or may not start computing them immediately, based on the implementation) and passes the aggregator back to sample.
• In the serial implementation of the computation engine, jobs are computed as soon as they are submitted (calling the compute method on the job), blocking until the computation is done. For CDenseExactLogJob, the task is to compute $log(C)s$ first (applying the linear operator to a vector, which gives a vector), and then $s^{T}log(C)s$ (a vector-vector dot product). The vector is then safely discarded.
• sample then waits for all computation jobs to complete using the engine's wait_for_all method. The serial implementation returns immediately in this case, since all jobs have already been computed.
• sample calls finalize on all the job result aggregators, which shapes the final aggregation of the job results into CScalarResults. It then computes the average of the estimates and returns it.

Plan for next week

Next, I plan to implement the CLogRationalApproximation class for dense matrices, which computes weights and shifts in init (this requires the eigenvalues; I'm currently thinking of using Eigen3 for that), and then CLogRationalApproximationIndividual, which creates num_shifts jobs for each trace sample and moves the shifts into the operator. A CLogRationalApproximationIndividualJob will use Eigen3's complex solver for direct solving. I plan to keep CLinearSolver as a base class for all the solvers, which I'll try to implement next week along with its DirectLinearSolver implementation. CStoreVectorAggregator has to be implemented as well.

Ah, it's been too long a report! I just realized it!

That’s all folks! See you next week!

# June 24, 2013 05:10 AM

This is my first post since being selected for GSoC this year; I've been really, really occupied so far. So I'll just quickly list the things I've done so far and the things I'm planning to do next.

We have designed the framework for implementing the log-determinant project. The background of this project can be found here (I'm really happy to see my name listed there, thanks to Heiko!). For this project we need to work with complex numbers, which Shogun did not support yet. The first task was to incorporate std::complex<T> into Shogun, which required

• adding the datatype as a primitive type
• checking switch over all ptypes
• adding support for the template classes that will use this
• adding support for the parameter framework
• and finally check serialization with this.

I added a new type for std::complex<double> as complex64_t and all required support.

We also needed the Jacobi elliptic functions required for the rational approximation of the matrix logarithm. This part is also completed.

The next step was to come up with the basic structure of the framework. So far, the class diagram looks like the following -

Not all the classes are shown, and there are a few mistakes (please pardon me). I'm going to implement the base classes one by one and, for the sake of simplicity, start with a direct computation of the matrix log (thanks to Eigen3) instead of the approximation. So my main focus for this week is to code up the basics, concentrating on framework development rather than numerical issues: fixing errors, unit-testing, documenting and leak-checking all the newly added classes thoroughly. Once the basic framework seems promising, the rest can be added iteratively, one piece at a time.

# June 23, 2013 04:00 PM

#### GSoC Weekly Report 1

Sorry, I've been busy with my finals last week; I'll catch up on progress this week. Last week I did some code refactoring on the existing demos and made the Django site look like a little framework. I plan to do the following things this week:

1. integrate the data into the existing demos (regression, clustering).
2. merge the existing binary/multiclass classification demos into the framework.
3. refactor the JavaScript.
4. find a better way to draw heatmaps with contours.

# June 06, 2013 08:44 AM

#### 10th Anniversary of DIMVA!

I am happy to announce the 10th anniversary of the security conference DIMVA! The conference takes place 17-19 July 2013 in Berlin, Germany. For 10 years now, DIMVA has brought together experts from academia, industry, and government to discuss research on the detection of intrusions and malware. In celebration of the 10th anniversary, the conference program is packed with highlights.
The conference venue is the Mövenpick Hotel Berlin, located in the centre of Berlin. Registration for the conference is now open. Do not miss the early bird deadline: June 12! You can find more details at the DIMVA 2013 website.

See you in Berlin!

# May 31, 2013 09:33 PM

#### Google Summer of Code 2013: Acceptance

Yahoo! I have been accepted to Google Summer of Code 2013! This summer I'll be working on the Gaussian Processes for Classification project for the SHOGUN Machine Learning Toolbox. More information about the ideas for the project can be found on the GSoC 2013 ideas page on the SHOGUN Machine Learning Toolbox website. My proposal for the project can be found here.

I’m very excited to be a part of the team of SHOGUN developers and to have two great mentors: Heiko Strathmann and Oliver Stegle, who will be mentoring me during GSoC 2013. I think this summer should be quite fun :)

Weekly reports and other important and interesting information about my progress on the project will be posted here.

# May 29, 2013 04:00 PM

#### Google Summer of Code 2013 Accepted

After about a month's preparation, I got accepted into GSoC 2013 by shogun-toolbox. Thanks to all the members of Shogun, especially Soeren Sonnenburg, thanks to Google, and thanks to all the people who supported me.

In this summer, I’ll devote myself to implementing graphical interactive demos for the ML algorithms provided by shogun-toolbox. Progress will be reported weekly in this blog.

Next I'm going to do some code refactoring on the previous demos to maximize code reuse. During the time before the official GSoC start date (June 17), I'll spend about 3-4 hours a day on the project.

It’s my first time participate in developing open source software, so amazing.

# May 27, 2013 08:42 PM

#### Google Summer of Code 2013

I’ve been accepted to Google Summer of Code 2013! I’ll be working with the great group behind the Shogun Machine Learning Toolbox http://www.shogun-toolbox.org/. I am going to be working on Approximate Joint Diagonalization (AJD) of matrices for Independent Component Analysis (ICA) and Blind Source Separation (BSS) – think the cocktail party problem.

I’ll be posting about my progress here regularly so stay tuned!

# May 16, 2013 07:28 PM

#### Open Sourced a bunch of Robot Code today!

I cleaned up and released all our code for the 2013 FRC season. It’s not heavily documented or anything, but this was the code that ran on our robot, so maybe it will help somebody.

Also check out our scouting database built with python, tkinter, sqlite3, numpy and matplotlib. It worked great for our scouting needs this year and, more importantly, the students who worked on it became pretty proficient with some important tools!

https://github.com/KBotics

# May 13, 2013 06:27 PM

#### libBGS released!

Today I released some of the code I’ve been using for my masters as libBGS. I created it by modifying the original libBGS by Donovan Parks. The major changes include updating it to use OpenCV 2.xx and making better use of the STL where appropriate. I also added several algorithms, including a Gaussian Mixture Model (GMM) variant and simple frame differencing, plus some upgrades to the Eigenbackground implementation.

The code is usable but still a WIP; I’d like to add serialization so that the background models can be saved and loaded nicely.

check out the project page:
libBGS

# May 08, 2013 09:04 PM

#### ARPool::getInstance()

Finally got a chance to blog about the most recent architectural improvements to ARPool: Singletons! Singletons make so much sense for ARPool it’s insane, I can’t believe we didn’t have it like this before.

Backing up for those who aren’t familiar with what I’m talking about: a singleton is an object for which we enforce the constraint that there will only ever be a single instance of that object. How do we do this? Easy: make the constructor private. Err, okay, but then how do we instantiate the singleton in the first place? This one isn’t hard, but I wouldn’t say easy: we make a public static method called getInstance which returns a pointer to the single instance of the class if it has already been initialized, or creates it if it has not. The class also has two static members: one to hold the single instance, and a flag indicating whether it has been instantiated yet. The code looks like this:

// header
class Singleton
{
private:
    Singleton();                 // private constructor - nobody outside can instantiate
    static bool instanceFlag;
    static Singleton* single;

public:
    static Singleton* getInstance();

    // the rest of the methods...

};

// cpp file
bool Singleton::instanceFlag = false;
Singleton* Singleton::single = NULL;

Singleton* Singleton::getInstance()
{
    if(!instanceFlag)
    {
        single = new Singleton();
        instanceFlag = true;
    }
    return single;
}


So why would we want to do this? When is it useful? The best example is whenever you are dealing with hardware or an actual physical element. For example, in ARPool there is only one camera, so it makes sense to make the camera class a singleton.

The major advantage of using singletons is that you can include the class wherever you like and simply ask for the instance. This can greatly simplify your code because you no longer have to pass all of these objects around to access them. A really basic way to think about singletons is to treat them as safe global objects.

And that’s really all there is to it! I’m just going to sit here and smile thinking about how much cleaner ARPool’s code is now mmmmmmmmmmmmm!

# May 06, 2013 12:05 AM

#### New ARPool Website is Live!

In prep for our big show of ARPool at the Augmented World Expo 2013 (AWE) in Santa Clara, California, I got the okay to re-do the website a bit. Now I’m a far cry from a front-end engineer (just have never had enough time to learn!) but I am pretty stoked with what I put together in a couple of hours.

Thanks to these great weekly events called Queen’s Hacks, where a bunch of us get together and work on cool side projects, I’d been exposed to Twitter Bootstrap, which is the quintessential CSS and JS library for building a modern website / web app. Big shout out to Twitter for this awesome tool and an even bigger shout out for making it open source! Go Twitter!

Anyways, I am pretty hack-and-slash when it comes to CSS and JS, so I don’t have anything neat to report other than to say: hey, check out this flashy new site I made!

www.arpool.ca

# May 03, 2013 10:57 PM

#### Shogun Student Applications Statistics for Google Summer of Code 2013

Almost a month has passed since SHOGUN was accepted for Google Summer of Code 2013. The student application deadline was today (May 6), and Shogun received 57 proposals from 52 students. This is quite an increase compared to 2012 (48 applications from 38 students). What is interesting though is that it didn't look that good in the very beginning (see the figure below):

Compared to 2012, this curve is much flatter in the beginning but increases sharply towards the end. Why is that? We didn't change the way we engaged with students (even though we tried to improve the instructions and added lots of entrance-tagged tasks to github issues). We still require patches to be submitted to even be considered. So it was similarly tough to get into GSoC 2013 with us as it was in the previous year.

What is interesting though is that various organizations complained about a slow uptake in the beginning. And it turns out that Google did limit the number of applications per student from 20 (last year) to 5 (in 2013). This might explain the shape of the curve: students are more cautious to apply, but once the deadline is near they apply to the maximum of 5 to improve their chances. This goes hand-in-hand with the observation that the quality of newly submitted student applications tends to decrease towards the deadline.

So did this new limit hurt? To the contrary! In the end the quality of proposals increased a lot, and we were able to start scoring/ranking students well before the application deadline. We are happy to have many very strong candidates this year again. Let's hope we get enough slots to accommodate all of the excellent students, and then let the fun start :)

# May 03, 2013 03:18 PM

#### Getting a minimal shogun java_modular interface program running

I used this code just to see whether shogun works or not. I configured shogun with the --interfaces=java_modular option and did a make install in /usr/local. It also needs jblas, installed at /usr/share/java/jblas.jar.

1. First I checked if jblas works fine. I tried this example -

import org.jblas.*;

public class jblas_test {
    public static void main(String[] args) {
        double[][] data = new double[][]
            {{ 1,  2,  3,  4,  5},
             { 6,  7,  8,  9, 10},
             {11, 12, 13, 14, 15}};
        DoubleMatrix matrix = new DoubleMatrix(data);

        DoubleMatrix vector = new DoubleMatrix(new double[]{3, 3, 3, 3, 3});
        DoubleMatrix result = matrix.mmul(vector);
        System.out.println(result.rows + "x" + result.columns + ": " + result);
        System.out.println("Jblas working fine");
    }
}


2. Then I compiled and ran it with -


[rahul@cfdvs4-2 jblas]$ javac -cp ".:/usr/share/java/jblas.jar" jblas_test.java
[rahul@cfdvs4-2 jblas]$ java -cp ".:/usr/share/java/jblas.jar" jblas_test

3x1: [45.000000; 120.000000; 195.000000]
Jblas working fine



3. Next step was to get a minimal shogun example run. I wrote a simple code -

import org.shogun.*;

public class helloworld {
    static {
        try {
            System.loadLibrary("modshogun");
        } catch (UnsatisfiedLinkError e) {
            System.out.println(e.getMessage());
        }
    }

    public static void main(String[] args) {
        modshogun.init_shogun_with_defaults();
        System.out.println("shogun works");
        modshogun.exit_shogun();
    }
}


4. Then I compiled and ran it with -

[rahul@cfdvs4-2 test]$ javac -cp ".:/usr/local/share/java/shogun.jar" helloworld.java
[rahul@cfdvs4-2 test]$ java -cp ".:/usr/local/share/java/shogun.jar" -Djava.library.path=".:/usr/local/lib/jni/libmodshogun.so" helloworld
shogun works



5. Now it was time to run some actual shogun code. I need shogun for string kernel classification, so I tried out classifier_domainadaptationsvm_modular.java. I just changed System.loadLibrary("modshogun"); to System.load("/usr/local/lib/jni/libmodshogun.so"); at line #9, and also added this line after line 51.

System.out.println(out.rows+"x"+out.columns+": "+out);


Then compiled and ran -

[rahul@cfdvs4-2 test]$ javac -cp ".:/usr/share/java/jblas.jar:/usr/local/share/java/shogun.jar" classifier_domainadaptationsvm_modular.java
[rahul@cfdvs4-2 test]$ java -cp ".:/usr/share/java/jblas.jar:/usr/local/share/java/shogun.jar" -Djava.library.path=".:/usr/local/lib/jni/libmodshogun.so" classifier_domainadaptationsvm_modular
1x10: [-1.000000, 1.000000, -1.000000, -1.000000, -1.000000, -1.000000, 1.000000, 1.000000, 1.000000, -1.000000]


Feels terrific

# April 17, 2013 08:27 PM

### CALL FOR PARTICIPATION: Shogun Machine Learning Workshop, Berlin, Germany, July 12-14, 2013

Data science and big data are omnipresent terms, documenting the need for automated tools to analyze the ever-growing wealth of data. To this end, we invite practitioners, researchers and students to participate in the first Shogun machine learning workshop. While the workshop is centered around the development and use of the shogun machine learning toolbox, it will also feature general machine learning subjects.

### General Information

The workshop will include:
• A general introduction to machine learning held by Gunnar Raetsch.
• Introductory talks about, e.g., dimension reduction techniques, kernel statistical testing, Gaussian processes, and structured output learning.
• Contributed talks, a poster session, and a poster spotlight.
• A discussion panel.
• A hands-on session on July 13-14.

Do not miss the chance to familiarize yourself with the shogun machine learning toolbox for solving various data analysis tasks and to talk to its authors and contributors. The program of the workshop will cover basic to advanced topics in machine learning and how to approach them using Shogun, which makes it suitable for anyone, whether you are a senior researcher or practitioner with many years of experience, or a junior student eager to discover much more. Interested?

A tentative schedule is available at http://shogun-toolbox.org/page/Events/workshop2013_program.

### Call for contributions

The organizing committee is seeking workshop contributions. The committee will select several submitted contributions for 15-minute talks and poster presentations. The accepted contributions will also be published on the workshop web site.

Amongst other topics, we encourage submissions that

• are applications / publications utilizing Shogun
• are highly relevant to practitioners in the field
• are of broad general interest
• are extensions to Shogun

### Submission Guidelines

Send an abstract of your talk/contribution to shogun-workshop2013@shogun-toolbox.org before June 1. Notifications will be given on June 7.

### Registration

Workshop registration is free of charge. However, only a limited number of seats is available. First-come, first-served! Register by filling out the registration form.

### Location and Timeline

The main workshop will take place at c-base Berlin (http://c-base.org/, https://en.wikipedia.org/wiki/C-base) on July 12. It is followed by additional 2-day hands-on sessions held at TU Berlin on July 13-14.

### About the Shogun machine learning toolbox

Shogun is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. Further information is available at http://www.shogun-toolbox.org.

# April 09, 2013 10:28 AM

#### Talk at the EBI in Cambridge

I gave a talk at the EMBL-European Bioinformatics Institute in Cambridge, where I visited the group of Oliver Stegle.

The topic was "Adaptive Large-Scale Kernel Two-Sample Testing". Slides can be found behind this link.

# April 09, 2013 10:14 AM

Shogun got accepted in the Google Summer of Code 2013!

Check out our ideas page. This year, I will be a mentor rather than a student, and I am very excited about this.

I'll be offering two projects:

• Implement Gaussian process classification (joint with Oliver Stegle). This is an extension of last year's GSoC project and should be quite interesting while not being too complicated (link)
• Implement unbiased estimators of likelihoods of very large, sparse Gaussian distributions (joint with Erlend Aune and Daniel Simpson). This one is quite challenging since it involves many different topics. However, it should also be very interesting (link)

# April 09, 2013 12:01 AM

#### GSoC 2013

GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.

Some statistics:

• 177 of 417 projects were accepted, which is a success rate of 42%.
• 40 of the 177 projects were accepted for the first time, which is a 23% proportion of new blood.

These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.

To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!

# April 08, 2013 10:50 PM

#### Shogun is participating in GSoC 2013

It is difficult to come up with a better way to start off the week than with news as good as this. As usual, we have two main types of project ideas in Shogun:

• Accessibility improvements.
• Core machine learning tasks or framework improvements.

Check out the full ideas list for more detailed descriptions.

Given that Shogun has plenty of useful and interesting machine learning methods but, unfortunately, some of them are not so accessible to users who are not familiar with Shogun’s code base, the accessibility improvement projects seem particularly important this year. After this summer, we expect to have more interactive demos showing off Shogun’s capabilities.

Nonetheless, there are also some interesting ideas concerning the implementation of new machine learning algorithms. For example, extensions in the Gaussian Processes to support classification, more dimension reduction techniques (are you an expert in ball trees? then we want you!) and some really challenging projects such as large-scale estimation of sparse Gaussian densities. Of course, last but not least, there is a very nice idea about my favourite topic: structured learning! This idea aims at providing some tools to target general structured output models with SO-SVMs, large-scale solvers and kernels.

# April 08, 2013 10:08 PM

#### Shogun got accepted at Google Summer of Code 2013

We are happy to announce that the shogun machine learning toolbox will participate in this year's Google Summer of Code :-)

SHOGUN is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or exploratory data analysis.

In case you are a talented student interested in implementing and learning about machine learning algorithms - we are looking for you!

We have collected a number of ideas [1] that we consider worthwhile implementing. And don't forget to apply [2]!

# April 04, 2013 07:22 PM

#### OpenCV 2.4.4 on Crunchbang 11

I recently decided that crunchbang is the distro I want to use on my laptop (yeah, I know, I switch distros a lot); it’s pretty sweet. Anyways, I finally got around to setting up OpenCV on my laptop and I thought I would write a little bit about the process here.

It was actually pretty straightforward, and I’m pretty sure that anyone who decides to use a slightly less mainstream distro like crunchbang would easily figure this out. But hey, everyone tries a hail-mary Google of their exact task at hand now and then, so lazy Googler, this one’s for you.

I followed the steps from OzBotz’s great guide: http://www.ozbotz.org/opencv-installation/ with a few simple modifications.

# add the following line to /etc/apt/sources.list
deb ftp://ftp.deb-multimedia.org wheezy main non-free


then run

sudo apt-get update


At this point I got an error about unauthenticated packages. The fix for this is to install the repo key ring:

sudo apt-get install deb-multimedia-keyring


Say yes to unauthenticated packages. Now this repository is authenticated and won’t give you any more lip.

From here the install was pretty much smooth sailing. My install is 64-bit, so I made sure to use the extra configure command for 64-bit systems. Actually, come to think of it, I think I had to use these extra commands the last time I built OpenCV on a 32-bit Ubuntu machine, so you might want to just use them regardless. Also be careful about using a too-recent x264 stable, because it might require yasm 1.2, which isn’t in the Debian repositories yet. So unless you want to needlessly make more work for yourself (an older stable will work just fine), stick to an older stable. I did use the latest ffmpeg though (1.2 magic) with no issues.

And that’s it! Sure felt nice to be writing a how to of sorts for Linux again!

# April 03, 2013 08:24 PM

#### ECE Banquet Awards

Unfortunately I could not attend but apparently I was an honourable mention for favourite TA in a 4th year course!  Big shout out to my buddies Cenk and Jeet who took home the honours!!!

# April 03, 2013 04:21 PM

Today I got the opportunity to give a guest lecture in ELEC 278 Data Structures and Algorithms. I put a data structures spin on my talk and went into some detail about the nearest neighbour problem and k-d trees.

I think it went well! – here are the slides
278GuestLecture

# March 19, 2013 05:20 PM

#### Shogun 2.1 is out!

We released SHOGUN 2.1. See the announcement (link).

The release features my recent work on kernel selection for MMD-based kernel two-sample testing and a streaming-based implementation for it. See the blog entry. We also added a new unit-testing framework, about which I am very excited, since we finally have a mechanism to detect code errors. We also got yet another interface language (Perl). Very cool stuff and lots of work/blood/sweat/fun with the other guys. Check it out!

Next thing to come here is a workshop on machine learning with SHOGUN on July 12 in the C-Base in Berlin. Stay tuned!

# March 18, 2013 12:14 AM

#### Scientist vs Inventor

Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.

“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”.

The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.

This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.

The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.

The confusion is evident when one look at calls for proposals for papers in journal, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.

Most conference, journals, and even funding agencies have confused their role of supporting the understanding the world around us, and have become surrogates for the Patent Office.

In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.

# March 17, 2013 02:21 PM

#### Shogun Toolbox Version 2.1.0 Released!

We have just released shogun 2.1.0. This release contains over 800 commits since 2.0.0 with a load of bugfixes, new features and improvements (see the changelog for details) that make Shogun more efficient, robust and versatile. In particular, Christian Montanari developed a first alpha version of a Perl modular interface, Heiko Strathmann added Linear Time MMD on Streaming Data, Viktor Gal wrote a new structured output solver, and Sergey Lisitsyn added support for Tapkee, a dimension reduction framework. Read more at http://www.shogun-toolbox.org

# February 20, 2013 06:26 AM

#### Getting started with SWIG with Java

I bumped into an error while trying out one simple example using the java_modular interface of shogun. It seems that the swig file is not updated. Anyways, more about that later, but I started to get to know swig a bit. Here is how I got it working under GNU/Linux 3.6.10-2.fc16.x86_64 with C. This tutorial might be helpful.

Step 1 – Creating the source file

I created an example.c file as shown -

/* File : example.c */

#include <time.h>

double My_variable = 3.0;

int fact(int n) {
    if (n <= 1) return 1;
    else return n * fact(n - 1);
}

int my_mod(int x, int y) {
    return (x % y);
}

char *get_time()
{
    time_t ltime;
    time(&ltime);
    return ctime(&ltime);
}


Step 2 – Creating the interface file

I created an interface file which is needed by swig (actually copied and pasted from the tutorial, will have to find out how it works) as below -

/* example.i */
%module example
%{
/* Put header files here or function declarations like below */
extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();
%}

extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();


Step 3 – Run swig on the interface file.

swig -java example.i

This creates three files: example.java, exampleJNI.java, and example_wrap.c. More details about these files later.

Step 4 – Compile the source file with the wrapper, jni.h and jni_md.h

We need to locate jni.h and jni_md.h first. In my system the path was

/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/jni.h
/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/linux/jni_md.h


Then I compiled with the -fpic option in order to create a shared library. (See here)

gcc -fpic -c example.c example_wrap.c -I/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/ -I/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/linux/

This creates the object files example.o, example_wrap.o.

Step 5 – Creating the shared link library

I created an example.so with -

ld -shared example.o example_wrap.o -o example.so

[Optional Step - Create a jar with the classes]

To be written

Step 6 – Create the java file

I created the java file with main class containing a System.load call

// filename main.java
public class main {
    static {
        System.load("/path/to/example.so");  // load the shared library from step 5
    }

    public static void main(String argv[]) {
        System.out.println(example.getMy_variable());
        System.out.println(example.fact(5));
        System.out.println(example.get_time());
    }
}


Step 7 – Compile the java file with classpath

The main.java file should be compiled with classpath option

javac -cp '.:/path/to/swig/generated/class/files/' main.java

Step 8 – Run the example with added library path

While running, the path of example.so should be provided with the -Djava.library.path=".." option.

java -cp '.:/path/to/swig/generated/class/files/' -Djava.library.path=".:/path/to/example.so" main

Feels so good when you finally see the output

3.0
120
Wed Feb 20 11:52:57 2013



# February 06, 2013 12:00 AM

#### Software Licensing

One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.

It recommends the following guidelines:

• For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
• For assuring that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
• To those on the fence, there are hybrid or multi-licensing approaches which can achieve the benefits of both open source and proprietary software licenses.
• For protecting the confidentiality of your code, there is the proprietary license.

Naturally, being an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.

# January 23, 2013 09:34 AM

### Dear all, save the date - July 12, 2013!

We just received confirmation from the c-base crew [1], [2]: the first shogun machine learning toolbox workshop will take place at c-base this summer, July 12, in Berlin! Anyone interested in participating in this event, or in helping to organize it, please talk to us on IRC or post on the mailing list...

# January 02, 2013 01:25 PM

#### Chemical compound and drug name recognition task.

CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.

TASK GOAL AND MOTIVATION Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.

The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.

As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.

CHEMDNER DESCRIPTION CHEMDNER is one of the tracks of the BioCreative IV community challenge (http://www.biocreative.org).

We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:

a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]

b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].

For these two subtasks the organizers will release training and test data collections. The task organizers will provide details on the annotation guidelines used, and define a list of criteria for relevant chemical compound entity types as well as for the selection of documents for annotation.

REGISTRATION Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can register additionally for other tracks too. To register your team go to the following page that provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/

MAILING LIST AND CONTACT INFORMATION You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html You can also directly send questions to the organizers through e-mail: mkrallinger@cnioes

WORKSHOP CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop (http://www.biocreative.org/events/biocreative-iv/CFP/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013

CHEMDNER TASK ORGANIZERS Martin Krallinger, Spanish National Cancer Research Center (CNIO) Obdulia Rabal, University of Navarra, Spain Julen Oyarzabal, University of Navarra, Spain Alfonso Valencia, Spanish National Cancer Research Center (CNIO)

REFERENCES
• Vazquez, M., Krallinger, M., Leitner, F., & Valencia, A. (2011). Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30(6-7), 506-519.
• Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing, 57-64.
• Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics, 24(13), i268-i276.
• Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M. V., ... & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
• Yeh, A., Morgan, A., Colosimo, M., & Hirschman, L. (2005). BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), S2.
• Smith, L., Tanabe, L. K., Ando, R. J., Kuo, C. J., Chung, I. F., Hsu, C. N., ... & Wilbur, W. J. (2008). Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2), S2.

# December 26, 2012 12:06 AM

#### NIPS paper: Optimal kernel choice for large-scale two-sample tests

NIPS 2012 is already over. Unfortunately, I could not go due to the lack of travel funding. However, as mentioned a few weeks ago, I contributed to one paper, which is closely related to my Master's project with Arthur Gretton and Massi Pontil: Optimal kernel choice for large-scale two-sample tests. We recently set up a page for the paper where you can download my Matlab implementation of the paper's methods. Feel free to play around with it. I am currently finishing implementing most methods in the SHOGUN toolbox. We also have a poster which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN, it will be included in the next release. See the base class interface here. See an example to use it: single kernel (link) and combined kernels (link). All methods that are mentioned in the paper are included. I also updated the shogun tutorial (link).

At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel $k$ and terms
$h_k((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i}),$
the linear time MMD is their empirical mean,

$\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h_k((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),$
which is a linear time estimate of the squared distance between the mean embeddings of the distributions that the $x_i, y_i$ come from. This quantity allows one to perform a two-sample test, i.e. to tell whether the underlying distributions are different.

Given a finite family of kernels $\mathcal{K}$, we select the optimal kernel by maximising the ratio of the MMD statistic to a linear time estimate of the standard deviation of the $h_k$ terms,
$k_*=\arg \sup_{k\in\mathcal{K}}\frac{ \hat\eta_k}{\hat \sigma_k},$
where $\hat\sigma_k^2$ is an estimate of the variance $\sigma_k^2=\mathbb{E}[h_k^2] - (\mathbb{E}[h_k])^2$; both numerator and denominator can be computed in linear time and constant space. We establish consistency of this empirical ratio as
$\left| \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right|=O_P(m^{-\frac{1}{3}}).$
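As a rough illustration of this criterion (a sketch in NumPy, not the paper's or SHOGUN's actual implementation; `select_kernel` and the Gaussian kernel family are assumptions made up for this example):

```python
import numpy as np

def h_stats(x, y, kernel):
    """h-statistics of the linear time MMD over pairs
    (x_{2i-1}, y_{2i-1}), (x_{2i}, y_{2i})."""
    x1, x2 = x[0::2], x[1::2]
    y1, y2 = y[0::2], y[1::2]
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)

def gaussian(sigma):
    """Gaussian kernel evaluated on row-wise pairs of two sample blocks."""
    return lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma ** 2))

def select_kernel(x, y, sigmas):
    """Pick the bandwidth maximising eta_hat / sigma_hat over the family."""
    best, best_ratio = None, -np.inf
    for s in sigmas:
        h = h_stats(x, y, gaussian(s))
        # empirical mean and std of the h-statistics; the small constant
        # guards against a zero variance estimate
        eta, sd = h.mean(), h.std() + 1e-8
        if eta / sd > best_ratio:
            best, best_ratio = s, eta / sd
    return best, best_ratio
```

The whole selection runs in a single pass over the paired samples, in line with the linear time, constant space claim above.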

In addition, we describe an MKL-style generalisation to selecting the weights of a convex combination of a finite number of baseline kernels,
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\},$
where the optimal weights are obtained by solving the quadratic program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\},$
where $\hat{Q}$ is the positive definite empirical covariance matrix of the $h$ terms of all pairs of kernels.
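A minimal sketch of that quadratic program, solved here with SciPy's generic SLSQP routine rather than a dedicated QP solver; `mkl_weights` is a made-up name for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def mkl_weights(Q, eta):
    """Solve min beta^T Q beta  s.t.  beta^T eta = 1, beta >= 0."""
    d = len(eta)
    cons = ({"type": "eq", "fun": lambda b: b @ eta - 1.0},)
    bounds = [(0.0, None)] * d          # beta >= 0 componentwise
    # crude feasible-ish starting point on the constraint plane
    b0 = np.maximum(eta, 1e-6)
    b0 = b0 / (b0 @ eta)
    res = minimize(lambda b: b @ Q @ b, b0,
                   method="SLSQP", bounds=bounds, constraints=cons)
    return res.x
```

For a single baseline kernel this degenerates to the ratio criterion above; with several kernels the covariance matrix $\hat{Q}$ penalises redundant kernels.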

We then describe three experiments to illustrate

• That our criterion outperforms existing methods on synthetic and real-life datasets which correspond to hard two-sample problems
• Why multiple kernels can be an advantage in two-sample testing

Supplementary page

# December 11, 2012 12:00 PM

#### Paper "Ten Simple Rules for the Open Development of Scientific Software" by Prlic and Procter

PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:

1. Don't Reinvent the Wheel
2. Code Well
3. Be Your Own User
4. Be Transparent
5. Be Simple
6. Don't Be a Perfectionist
7. Nurture and Grow Your Community
8. Promote Your Project
9. Find Sponsors
10. Science Counts

# November 28, 2012 10:52 AM

#### Best Practices for Scientific Computing

I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?

They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).

### 1. Write programs for people, not computers.

1.1 a program should not require its readers to hold more than a handful of facts in memory at once

1.2 names should be consistent, distinctive, and meaningful

1.3 code style and formatting should be consistent

1.4 all aspects of software development should be broken down into tasks roughly an hour long

### 2. Automate repetitive tasks.

2.1 rely on the computer to repeat tasks

2.2 save recent commands in a file for re-use

2.3 use a build tool to automate scientific workflows

### 3. Use the computer to record history.

3.1 software tools should be used to track computational work automatically

### 4. Make incremental changes.

4.1 work in small steps with frequent feedback and course correction

### 5. Use version control.

5.1 use a version control system

5.2 everything that has been created manually should be put in version control

### 6. Don’t repeat yourself (or others).

6.1 every piece of data must have a single authoritative representation in the system

6.2 code should be modularized rather than copied and pasted

6.3 re-use code instead of rewriting it

### 7. Plan for mistakes.

7.1 add assertions to programs to check their operation

7.2 use an off-the-shelf unit testing library

7.3 turn bugs into test cases

7.4 use a symbolic debugger
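The sub-rules above can be illustrated with a tiny, hypothetical Python example that combines an assertion (7.1), an off-the-shelf unit testing library (7.2) and a bug turned into a test case (7.3):

```python
import unittest

def mean(values):
    # Rule 7.1: an assertion documents and checks an assumption.
    assert len(values) > 0, "mean() of an empty sequence is undefined"
    return sum(values) / len(values)

class TestMean(unittest.TestCase):
    # Rule 7.3: imagine this input once triggered a bug (e.g. integer
    # division returning 0 instead of 0.5); keep it as a regression test.
    def test_regression_half(self):
        self.assertEqual(mean([0, 1]), 0.5)

    def test_empty_rejected(self):
        with self.assertRaises(AssertionError):
            mean([])
```

Run with `python -m unittest` in the module's directory; each fixed bug grows the suite by one case, so the same mistake cannot silently return.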

### 8. Optimize software only after it works correctly.

8.1 use a profiler to identify bottlenecks

8.2 write code in the highest-level language possible

### 9. Document the design and purpose of code rather than its mechanics.

9.1 document interfaces and reasons, not implementations

9.2 refactor code instead of explaining how it works

9.3 embed the documentation for a piece of software in that software

### 10. Conduct code reviews.

10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems

10.2 use an issue tracking tool

# November 13, 2012 11:30 PM

#### Shogun at PyData NYC 2012

Christian Widmer gave a talk about the shogun machine learning toolbox version 2.0 at PyData NYC 2012. His talk is finally available as a video online.

# November 01, 2012 11:58 PM

#### 0x5ECB1A5ED: Reversing Malware Protocols With Machine Learning

Post by Hugo Gascon: Last week the 5th ACM Workshop on Artificial Intelligence and Security took place, co-located with the 19th Conference on Computer and Communications Security in Raleigh, NC. This workshop is one of the few focused exclusively on the application of machine learning and AI-based methods to privacy and computer security problems, so it was a great event to introduce our latest research, which I have done together with Tammo Krueger, Nicole Krämer and Konrad Rieck. We had the opportunity to present our method to automatically build stateful models of network protocols using machine learning.

Reverse engineering network protocols has been a popular strategy for people wanting to develop open implementations of proprietary protocols. For security researchers it is an effective way to understand how malware that is used to control infected machines in a botnet communicates with its command and control server. Sometimes, these messages are exchanged using communication channels built on top of known protocols like IRC, HTTP or P2P (some of them very quirky, such as encrypted blog posts or steganographic image uploads). If the infected machine is a mobile phone, it is not uncommon for malware to receive and send instructions over SMS messages.

Since reversing is a complex and time-consuming task, there have been many efforts to automate the process. Some have focused on extracting the complete protocol specification, which is very effective if done through taint analysis but not possible relying only on network traces. Others have focused on understanding protocols just enough to simulate vulnerable services in honeypots (e.g. ScriptGen). Such honeypots can automatically infer parts of a protocol from network traffic, but they have not been designed to track more evolved attacks that require a longer sequence of stateful communication. Close to this line of research, we have designed a probabilistic approach to model both the message content and the state machine of a protocol relying only on network traffic. Our method, called PRISMA (Protocol Inference and State Machine Analysis), is able to learn a stateful model that can be used to simulate valid communication of an unknown protocol. To construct this model, it infers the message format, the state machine, and rules for propagating information between states using machine learning techniques.

PRISMA builds on specially tailored embedding and clustering techniques that allow for large-scale applications with millions of network messages. After preprocessing the network traces and embedding the messages, we define a similarity measure in order to find common structures in the data. We have explored part-based and position-based clustering through non-negative matrix factorization (NMF) to group individual messages into events, but other clustering algorithms could be chosen, as long as the procedure assigns a cluster label to each message.

Messages which occur at specific events in the flow of communication often exhibit similar structural features. Thus, to extract event information we exploit this structural dependency. Each extracted sequence of events can be seen as a path through the protocol’s state machine. To infer an approximation of this state machine, we use Markov models, where transitions are linked to probabilities. Every state in the model is linked to a set of automatically generated templates of the messages associated with that state. The information flow between the different states during a communication is also automatically inferred and characterized as a set of rules. Finally, we have developed a network level module called LENS, which is able to load the inferred model and simulate both sides of a real communication session.
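As a toy illustration of the state-machine inference step (this is not PRISMA's actual code, and the protocol events below are invented), a first-order Markov model can be estimated from the cluster-label sequences simply by counting transitions:

```python
from collections import Counter, defaultdict

def markov_model(sessions):
    """Estimate transition probabilities between events.

    `sessions` is a list of event-label sequences, one per observed
    communication session (labels come from the message clustering step).
    """
    counts = defaultdict(Counter)
    for events in sessions:
        for a, b in zip(events, events[1:]):
            counts[a][b] += 1
    # normalise each row of counts into a probability distribution
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

# A fictitious login/command/quit protocol observed in three sessions.
sessions = [
    ["HELLO", "AUTH", "CMD", "CMD", "BYE"],
    ["HELLO", "AUTH", "BYE"],
    ["HELLO", "AUTH", "CMD", "BYE"],
]
model = markov_model(sessions)
```

Here `model["AUTH"]` would assign probability 2/3 to `CMD` and 1/3 to `BYE`; attaching message templates and rules to each state, as PRISMA does, turns such a chain into a simulatable protocol model.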

After evaluating the system with real protocols, we have used our method on network traffic collected from malicious software. The following figure shows an example of the state machine inferred from traffic of the Koobface worm:

This method is especially interesting for honeypots, as simulating malware communication can help to deceive an attacker and obtain more information about its behavior. By inspecting the extracted state machine and the associated templates and rules, a malware analyst can also gain insights into the inner workings of a sample from the collected network traces alone. This also makes PRISMA a valuable method for dynamic analysis of malware beyond honeypot applications.

From the paper's abstract: "Attacks like call fraud and identity theft often involve sophisticated, stateful attack patterns which, on top of normal communication, try to harm systems on a higher semantic level than usual attack scenarios. To detect these kinds of threats via specially deployed honeypots, at least a minimal understanding of the inherent state machine of a specific service is needed to lure potential attackers and to keep a communication going for a sufficiently large number of steps. To this end we propose PRISMA, a method for protocol inspection and state machine analysis, which infers a functional state machine and message format of a protocol from network traffic alone. We apply our method to three real-life network traces ranging from 10,000 up to 2 million messages of both binary and textual protocols. We show that PRISMA is capable of simulating complete and correct sessions based on the learned models. A use case on malware traffic reveals the different states of the execution, rendering PRISMA a valuable tool for malware analysis."

Tammo Krueger, Hugo Gascon, Nicole Krämer and Konrad Rieck
ACM Workshop on Security and Artificial Intelligence (AISEC) October 2012

# November 01, 2012 12:51 PM

Here comes some good news: DIMVA – the Conference on Detection of Intrusions and Malware & Vulnerability Assessment – is celebrating its 10th anniversary in Berlin, Germany! Over the years DIMVA has served as a premier forum for advancing the state of the art in intrusion detection and malware analysis. If you are working in this area of computer security, consider submitting a paper:
• Paper Submission Due: February 10, 2013
• Acceptance Notification: March 27, 2013
• Conference: July 18-19, 2013
DIMVA solicits submission of high-quality, original scientific papers presenting novel research on malware analysis, intrusion detection, and related systems security topics. DIMVA encourages submissions from the following broad areas:
• Intrusion Detection
(Novel approaches and domains; Prevention and response; Data leakage and exfiltration; Evasion and other attacks; Result correlation; Potentials and limitations)

• Malware Detection
(Automated analyses; Behavioral models; Prevention and containment; Infiltration; Acquisition and monitoring; Forensics and recovery; Underground economy)

• Vulnerability Assessment
(Vulnerability detection; Vulnerability prevention; Fuzzing techniques; Classification and evaluation; Situational awareness)

# October 28, 2012 09:49 PM

#### Nice blog entry about SHOGUN's GSoC 2012

Sören wrote a nice summarising blog post on the GSoC 2012. See here.

# October 28, 2012 01:55 AM

#### Shogun at Google Summer of Code 2012

The summer finally came to an end (yes, in Berlin we still had 20 °C at the end of October) and, unfortunately, so did GSoC. This has been the second time for SHOGUN to be in GSoC. For those unfamiliar with SHOGUN: it is a very versatile machine learning toolbox that enables unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. I again played the role of an org admin and co-mentor this year and would like to take the opportunity to summarize the enhancements to the toolbox and my GSoC experience. In contrast to last year, we required code contributions already in the application phase of GSoC, i.e., a (small) patch was mandatory for an application to be considered. This reduced the number of applications we received: 48 proposals from 38 students instead of 70 proposals from about 60 students last year. At the same time, it increased the overall quality of the applications.

In the end we were very happy to get 8 very talented students and had the opportunity to boost the project thanks to their hard and awesome work. Thanks to Google for sponsoring three more students than in the last GSoC. Still, we gave one slot back to the pool in favour of the Octave project (they used it very wisely: Octave will have a just-in-time compiler now, which will benefit us all!).

SHOGUN 2.0.0 is the new release of the toolbox, including of course all the new features that the students have implemented in their projects. On the one hand, modules that were already in SHOGUN have been extended or improved. For example, Jacob Walker has implemented Gaussian Processes (GPs), improving the usability of SHOGUN for regression problems. Chiyuan Zhang contributed a framework for multiclass learning, including state-of-the-art methods in this area such as Error-Correcting Output Coding (ECOC) and ShareBoost, among others. In addition, Evgeniy Andreev has made very important improvements w.r.t. the accessibility of SHOGUN. Thanks to his work with SWIG director classes, it is now possible to prototype in Python and use that code with the same flexibility as if it had been written in the C++ core of the project. On the other hand, completely new frameworks and other functionality have been added to the project as well. This is the case for the multitask learning and domain adaptation algorithms written by Sergey Lisitsyn and the kernel two-sample and dependence tests by Heiko Strathmann. Viktor Gal has introduced latent SVMs to SHOGUN and, finally, two students have worked on the new structured output learning framework: Fernando Iglesias designed the framework, introducing structured output machines into SHOGUN, while Michal Uricar implemented several bundle methods to solve the optimization problem of the structured output SVM.

It has been very fun and interesting to see how the work done in different projects was put together very early, even during the GSoC period. To give just one example involving the generic structured output framework and the accessibility improvements: it is possible to use the SWIG directors to implement the application-specific parts of a structured learning problem in Python and then use the rest of the framework (written in C++) to solve the new problem.

Students! You all did a great job and I am more than amazed what you all have achieved. Thank you very much and I hope some of you will stick around.

Besides all these improvements, it has been particularly challenging for me as org admin to scale the project. While I could still be deeply involved in each and every part of the project last GSoC, this was no longer possible this year. Learning to trust that your mentors are doing the job is something that didn't come easy to me. Having roughly monthly all-hands meetings did help, and so did monitoring the happiness of the students. I am glad that it all worked out nicely this year too. Again, I would like to mention that SHOGUN improved a lot code-base and code-quality wise. Students gave very constructive feedback about our lack of proper Vector/Matrix/String/Sparse Matrix types. We now have all of these implemented, doing automagic memory garbage collection behind the scenes. We have started to transition to Eigen3 as our matrix library of choice, which made quite a number of algorithms much easier to implement. We generalized the label framework (CLabels) to be tractable not just for classification and regression but also for multitask and structured output learning.

Finally, we have had quite a number of infrastructure improvements. Thanks to GSoC money we have a dedicated server for running the buildbot/buildslaves and the website. The ML Group at TU Berlin sponsors virtual machines for building SHOGUN on Debian and Cygwin. Viktor Gal stepped up to provide buildslaves for Ubuntu and FreeBSD. Gunnar Raetsch's group supports Red Hat based build tests. We have Travis CI running, testing pull requests for breakage even before they are merged. Code quality is now monitored using LLVM's scan-build. Bernard Hernandez appeared and wrote a fancy new website for SHOGUN.

A more detailed description of the achievements of each of the students follows:
• Kernel Two-sample/Dependence test

Heiko Strathmann, mentored by Arthur Gretton, worked on a framework for kernel-based statistical hypothesis testing. Statistical tests to determine whether two random variables are equal/different or statistically independent are an important tool in data analysis. However, when data are high-dimensional or in non-numerical form (documents, graphs), classical methods fail. Heiko implemented recently developed kernel-based generalisations of classical tests which overcome these issues by representing distributions in high-dimensional so-called reproducing kernel Hilbert spaces. By doing so, theoretically any pair of distributions can be distinguished.

Implemented methods include two-sample testing with the Maximum Mean Discrepancy (MMD) and independence testing using the Hilbert Schmidt Independence Criterion (HSIC). Both methods come in different flavours regarding computational costs and test constructions. For two-sample testing with the MMD, a linear time streaming version is included that can handle arbitrary amounts of data. All methods are integrated into a newly written flexible framework for statistical testing which will be extended in the future. A book-style tutorial with descriptions of algorithms and instructions how to use them is also included.

Like Heiko, Sergey Lisitsyn participated in the GSoC programme for the second time. This year he focused on implementing multitask learning algorithms. Multitask learning is a modern approach to machine learning that learns a problem together with other related problems at the same time using a shared representation. This approach often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks. During the summer Sergey ported a few algorithms from the SLEP and MALSAR libraries, with further extensions and improvements: L12 group tree and L1q group multitask logistic regression and least squares regression, trace-norm multitask logistic regression, clustered multitask logistic regression, and basic group and group tree lasso logistic regression. All the implemented algorithms use the COFFIN framework for flexible and efficient learning, and some were implemented efficiently utilizing the Eigen3 library.

• Implementation of / Integration via existing GPL code of latent SVMs.

A generic latent SVM and additionally a latent structured output SVM have been implemented. This machine learning algorithm is widely used in computer vision, namely in object detection. Other useful application fields are motif finding in DNA sequences and noun phrase coreference, i.e. providing a clustering of the nouns such that each cluster refers to a single object.

It is based on defining a general latent feature map Psi(x,y,h) depending on the input variable x, the output variable y and the latent variable h. Deriving a class from the base class allows the user to implement additional structural knowledge for efficient maximization over the latent variable, alternative ways of computation, or on-the-fly loading of latent features Psi as a function of the input, output and latent variables.

• Bundle method solver for structured output learning

We have implemented two generic solvers for supervised learning of structured output (SO) classifiers. First, we implemented the current state-of-the-art Bundle Method for Regularized Risk Minimization (BMRM) [Teo et al. 2010]. Second, we implemented a novel variant of the classical Bundle Method (BM) [Lemarechal 1978] which achieves the same precise solution as the BMRM but in time up to an order of magnitude shorter [Uricar et al. 2012].

The main practical benefits of the implemented solvers are their modularity and proven convergence guarantees. To train a particular SO classifier, it suffices to provide the solver with a routine evaluating the application-specific risk function. This feature is invaluable for designers, who can concentrate on tuning the classification model instead of spending time developing new optimization solvers. The convergence guarantees remove the uncertainty inherent in the use of online approximate solvers.

The implemented solvers have been integrated into the structured output framework of the SHOGUN toolbox and have been tested on real-life data.

• Built generic structured output learning framework

During GSoC 2012 Fernando implemented a generic framework for structured output (SO) problems. Structured output learning deals with problems where the prediction made is represented by an object with complex structure, e.g. a graph, a tree or a sequence. SHOGUN's SO framework is flexible and easy to extend [1]. Fernando implemented a naïve cutting plane algorithm for the SVM approach to SO [2]. In addition, he coded a case of use of the framework for labelled sequence learning, the so-called Hidden Markov SVM. The HM-SVM can be applied to solve problems in different fields such as gene prediction in bioinformatics or speech to text in pattern recognition.

• Improving accessibility to shogun

During the latest Google Summer of Code, Evgeniy improved the Python modular interface. He added a new SWIG-based feature, director classes, enabling users to extend SHOGUN classes with Python code, and made SHOGUN Python 3 ready. Evgeniy also added Python protocols to the most used containers (like vectors, matrices, features), which makes it possible to work with SHOGUN data structures just like with numpy arrays, with no copying at all. For example, one can now modify SHOGUN's RealFeatures in place.

• Implement Gaussian Processes and regression techniques

Jacob implemented Gaussian Process Regression in the Shogun Machine Learning Toolbox. He wrote a complete implementation of basic GPR as well as approximation methods such as the FITC (Fully Independent Training Conditional) method and the Laplace approximation method. Users can utilize this feature to analyze large scale data in a variety of fields. Scientists can also build on this implementation to conduct research extending the capability and applicability of Gaussian Process Regression.

• Build generic multiclass learning framework

Chiyuan defined a new multiclass learning framework within SHOGUN. He re-organized the previous multiclass learning components and refactored the CMachine hierarchy in SHOGUN. Then he generalized the existing one-vs-one and one-vs-rest multiclass learning schemes to general ECOC encoding and decoding strategies. Besides this, several specific multiclass learning algorithms were added, including ShareBoost with feature selection, the Conditional Probability Tree with online learning, and the Relaxed Tree.

# October 16, 2012 01:08 PM

#### Streaming Features for Linear Time MMD

I finally finished an important and very cool extension to my GSoC 2012 project - making the linear time MMD statistic work with streaming data. In particular, SHOGUN's streaming framework is now used.

By design, the linear time MMD statistic, given as
$\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i})$
is very well suited for streaming data, since only four examples have to be held in memory at once. Once the sum in the h-statistic is computed, the data used can be "forgotten". As I described in my M.Sc. thesis (link), this allows processing infinite amounts of data and therefore possibly more accurate two-sample tests. This holds in particular in cases where the amount of data needed to solve a problem is larger than computer memory.

During the GSoC, I implemented the linear time MMD on the basis of SHOGUN's standard features interface, which made it necessary to hold data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference) now accepts streaming features (class reference) only. This allows processing arbitrarily large amounts of data in a very comfortable way. To avoid the overhead of streaming examples one by one, a block size may be specified: this number of examples is processed at once and should be chosen as large as fits into memory.

Recall that the linear time MMD statistic is asymptotically normally distributed, and its variance can easily be estimated via the empirical variance of the individual h-statistics (of which the MMD is the mean) when the number of samples is large enough. The new implementation in SHOGUN does this on the fly using D. Knuth's online variance algorithm [1] (implementation link). Therefore, a complete two-sample test is now possible in linear time and constant space.
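For reference, the online mean/variance update mentioned above (the Knuth/Welford algorithm) can be sketched as follows; this is a generic textbook version, not SHOGUN's actual implementation:

```python
class OnlineVariance:
    """Knuth/Welford online mean and variance: one pass, constant memory."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # uses the *updated* mean; m2 accumulates sum of squared deviations
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # unbiased sample variance
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Feeding the individual h-statistics into such an accumulator yields the MMD estimate (the running mean) and its variance estimate in a single pass, which is exactly what the streaming setting requires.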

A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A data generator for artificial data which implements SHOGUN's streaming interface is passed to the MMD class. It produces data from the underlying distribution on the fly.

[1] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.

# October 12, 2012 04:28 PM

In a paper with the rather self-deprecating title "I Wanted to Predict Elections with Twitter and All I Got Was This Lousy Paper", Daniel Gayo-Avello takes us on a tour of how hard it is to do reproducible research, and how often authors take shortcuts. From the abstract:

"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods."

It is an interesting survey of papers that use Twitter data.

He lists some flaws in current research on electoral predictions, but they are generally applicable to any machine learning paper (my comments in brackets):

1. It's not prediction at all! I have not found a single paper predicting a future result. (Neither bootstrap nor cross validation is prediction.)
2. Chance is not a valid baseline...
3. There is no commonly accepted way of "counting votes" in Twitter.
4. There is no commonly accepted way of interpreting reality! (In supervised learning, we tend to ignore the fact that there is no ground truth in reality.)
5. Sentiment analysis is applied as a black box... (As machine learning algorithms get more complex, more people will tend to use machine learning software as a black box.)
6. All the tweets are assumed to be trustworthy. (I don't know whether anybody is doing adversarial election prediction.)
7. Demographics are neglected. (The biased sample problem.)
8. Self-selection bias.

The window is closing on those who want to predict the upcoming US elections from X.

# September 11, 2012 01:31 PM

#### GSoC 2012 is over

GSoC 2012 has been over for a few weeks now. It was a pretty cool summer for me. As last year, I learned lots of things. This year, though, my project was a bit more research oriented, which is nice since it allowed me to connect my work on SHOGUN with the stuff I do at uni. I even mentioned it in my Master's dissertation (link), which was also about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on GSoC was sometimes exhausting. It eventually worked out fine since both things were closely related. I would only suggest taking on other major commitments if they are connected to the GSoC project. However, if this condition is met, the rewards multiply due to synergistic effects.

The other students working on SHOGUN also did very cool projects. All of these are included in the SHOGUN 2.0 release (link). The project now also has a new website, so it's worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year, which once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey, who organised most things and made this summer so rewarding.

In the near future I will try to add some extensions to the statistical testing framework that I thought of during the summer but did not have time to implement: on-line features for the linear time MMD, a framework for kernel selection which includes all investigated methods from my Master's dissertation, and finally unit tests using SHOGUN's new framework for that. I will update the SHOGUN project page of my website (link). I might as well send some tweets via SHOGUN's new twitter account (link).

# August 30, 2012 01:42 PM

#### John Hunter - the author of matplotlib - has died.

John Hunter, the main author of matplotlib, has died of cancer. For those interested, his close friend Fernando gives more details here. John was a long-term developer of matplotlib (continuing even while working in industry) and a father of three. You might consider donating to the John Hunter Memorial Fund.

We had John as an invited speaker at one of our NIPS machine learning open source software workshops. He gave quite an entertaining talk featuring a live demo. I recall that he started with a command prompt, typing everything (including fetching some live stock-exchange data) in python at some insane speed. Videolectures recorded his lecture. I don't know about others, but I have basically plotted all my scientific results using matplotlib and python for the last several years.

Rest in peace John - your contributions will be remembered.

# August 15, 2012 04:00 PM

#### Starting a new blog on blogger/blogspot, with a comparison to tumblr

New blogger blog: foulwall's diary

• blogger and tumblr are both products of the web 2.0 era. Both promote sharing, but tumblr is far more open than blogger and closer to the web 2.0 spirit.
• blogger's blog themes are fairly restrained, while the vast majority of tumblr's themes are highly individualistic.
• blogger's editor uses the google docs interface, so writing a post feels like writing a Word document. tumblr's editor is much simpler than blogger's, but it supports markdown, so it is no less capable and arguably more programmer-friendly.
• According to grassroots statistics, blogger hosts more than 19 million blogs worldwide, so it has no shortage of users, but they skew noticeably old (livejournal even more so). tumblr's users are mostly young and the content is more entertainment-oriented. To see the difference between the two, just follow a few random blogs on each.
• For custom domains, google does it well: binding a subdomain only requires adding a CNAME record in the DNS manager, and it takes effect immediately. On tumblr, subdomains seem harder to bind, and after binding it takes more than ten minutes before they go live.
• blogger's admin backend is somewhat complex; tumblr's is much simpler.
• tumblr offers seven post types (text, quote, chat, photo, link, video, audio), each with its own css style and tight integration with related services; audio, for example, hooks neatly into soundcloud. This is a weak spot for blogger.
• Both let users edit the page html themselves; tumblr ships with code highlighting, blogger with nothing.
• One complaint about tumblr: it uses far too much Ajax, so on a slow connection it is basically unusable.
• blogger's dashboard entry page shows site traffic statistics and the like, whereas tumblr's shows the social feed.

P.S. The link above is not meant for readers behind the Great Firewall.

# August 14, 2012 02:27 PM

#### 11th GSoC weekly report: Done!

This will be my last weekly report for this year's summer of code! Last week I did not write a report since I was very busy with experiments for a rebuttal of the NIPS submission (see 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs and made a few things more efficient.

I also created graphical examples for linear and quadratic time MMD and HSIC based tests. These serve the purpose of illustrating how the methods work on simple datasets. They sample the underlying statistic's null and alternative distributions using all the different methods I implemented, and plot the distributions with test thresholds (as well as the data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC test uses data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.

These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All methods I implemented are now contained within the tutorial. Another documentation related thing I did was to update the doxygen-based source code documentation. In particular, I cleaned up the horrible mess in the CStatistics class -- and replaced all ascii-art by LaTeX. Although there are still things to do, my project is now "done" in terms of GSoC :) It was a nice summer! I guess I will be extending it with some ideas that came up while working with kernel two-sample tests recently.

For the last week, I intend to get some unit-testing done and start to focus on things that are needed for our upcoming 2.0 release (bug hunting, fixing warnings, implementing things that people request). I will also write an overall summary of the GSoC next month or so. Next month will be busy since I also have to finish my Master's project.

# August 11, 2012 04:00 PM

#### Incompatible WLAN/WiFi channels keep a MacBook Air from finding the wireless SSID

Graphical representation of Wireless LAN channels in 2.4 GHz band
P.S. Channels 1, 6, 11 and 14 are the non-overlapping channels; they have the smallest chance of interfering with surrounding WLAN signals and therefore the best performance (channel 14 is not licensed in most countries). Below is an article with channel-strength experiments.

http://openplatform.com.hk/Knowledge/channel_b.htm

# July 30, 2012 04:50 PM

#### 10th GSoC weekly report: Slowly getting ready

Step by step, my project is entering its final state :)
Last week, I added new data generation methods, which are used by a new example for independence tests with HSIC. It demonstrates that the HSIC based test is able to capture dependence which is induced by rotating data that has zero correlation -- one of the problems from the paper [1]. Here is a picture; the question is: are the two dimensions dependent? And more to the point, can a test capture that? (correlation is almost zero; dependence is induced via rotation)

I also realised that my current class structure had problems doing bootstrapping for HSIC, so I re-factored a bit. Bootstrapping is now also available for HSIC, using the same code that does it for two-sample-tests. I also removed some redundancy -- both independence and two-sample tests are very similar problems and implementations should share code where possible.

Another thing that was missing so far was computing test thresholds; until now, only p-values could be computed. Since different people have different tastes about this, I added both methods. Checking a test statistic against a threshold is straightforward and gives a binary answer; computing a p-value gives the position of the test statistic in the null-distribution, which contains more information. To compute thresholds, one needs the inverse CDF of the null-distribution. In the bootstrapping case this is easy, since one simply reports the sample that corresponds to the desired quantile. For the cases where a normal or gamma distribution was fitted, I imported some more routines from the nice ALGLIB toolbox.
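
In plain numpy, both branches of the bootstrapping case reduce to a few lines. A minimal sketch with hypothetical helper names (SHOGUN wraps this in its test classes; this is not its API):

```python
import numpy as np

def bootstrap_threshold(null_samples, alpha=0.05):
    # Test threshold at level alpha: the (1 - alpha)-quantile of the
    # null-distribution samples obtained by bootstrapping.
    return np.quantile(np.asarray(null_samples), 1.0 - alpha)

def bootstrap_p_value(null_samples, statistic):
    # p-value: position of the observed statistic in the sampled
    # null-distribution, i.e. the fraction of null samples at least as large.
    return np.mean(np.asarray(null_samples) >= statistic)
```

Rejecting when the statistic exceeds `bootstrap_threshold(...)` and rejecting when `bootstrap_p_value(...) < alpha` are then equivalent decisions.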

For this week, I plan to continue with finishing touches, documentation, examples/tests, etc. Another idea I had is to make the linear time MMD test work with SHOGUN's streaming features, since the infinite or streaming data case is the main area for its usage.

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence. Advances in Neural Information Processing Systems

# July 23, 2012 03:17 PM

#### 9th GSoC weekly report: Bugs again! and documentation

I spent quite some fraction of last week on something not really related to my project: trying to make cross-validation possible for multi-class MKL (multiple kernel learning) machines using my framework from last year's GSoC. To this end, I added subset support to SHOGUN's combined features class, and then went after a bunch of bugs that prevented it from working. But it works now! So cross-validation should be possible in a lot more situations. Thanks to Eric, who reported all the problems.

Apart from that, I worked on documentation for the new statistical testing framework. I added doxygen class descriptions, see for example CQuadraticTimeMMD. More importantly, I started writing a section for the SHOGUN tutorial, a book-like description of all algorithms. We hope that it will grow in the future. You can find the $\LaTeX$ sources at github. We should/will add a live pdf download soon.

Another minor thing I implemented is a data generator class. I think it is nicer to illustrate new algorithms with data that is generated on the fly rather than loaded from a file. The nice thing about this is that it is available for examples from all interfaces -- so far I had implemented this separately for c++ and python; this is more elegant now. I bet some of the other projects will need similar methods for their demos too, so please extend the class!

This week, I will add more data generation methods to the generator, in particular data that can be used to illustrate the recently implemented HSIC test for independence. Reference datasets are quite complicated, so this might take a while. We also recently added a new framework for unit-tests, so I will write tests for all the new methods I created recently.

# July 16, 2012 07:58 PM

#### 8th GSoC weekly report: Examples, Bugs, and Kernel Choice

Last week was a mixed one. Next to new examples, tests, bugfixes, and helper methods, the biggest piece of implementation is an automatic kernel selection algorithm for the linear time MMD. This is one of the things I worked on during my Master's project at UCL.
It selects optimal kernel weights for kernels of the family
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\}$
by solving the convex program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where $\hat{Q}$ is a linear time estimate of the covariance of the MMD estimates and $\hat{\eta}$ is a linear time estimate of the MMD.

I already described this a few weeks ago, when the method was developed. It is now integrated into SHOGUN. Efficient kernel selection, yeah :) It uses a convex solver called libqp, written by Vojtech Franc, one of the mentors of this year's GSoC. I still need to think of a nice way of embedding it into SHOGUN's model selection framework, which isn't as straightforward as it first seems.
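
To make the program concrete, here is a sketch that solves it with a generic solver (scipy's SLSQP, purely illustrative; SHOGUN uses libqp, and the function name here is hypothetical). $\hat{Q}$ is assumed symmetric, as a covariance estimate is:

```python
import numpy as np
from scipy.optimize import minimize

def select_kernel_weights(Q_hat, eta_hat):
    # Solve  min beta^T Q_hat beta  s.t.  beta^T eta_hat = 1, beta >= 0.
    # Q_hat: symmetric covariance estimate; eta_hat: vector of MMD estimates.
    d = len(eta_hat)
    result = minimize(
        fun=lambda b: b @ Q_hat @ b,
        jac=lambda b: 2.0 * (Q_hat @ b),  # gradient, valid for symmetric Q_hat
        x0=np.full(d, 1.0 / d),
        method="SLSQP",
        bounds=[(0.0, None)] * d,  # beta >= 0
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
    )
    return result.x
```

For a diagonal $\hat{Q}$ the solution puts more weight on kernels with a small variance-to-MMD ratio, which matches the intuition behind the criterion.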

This week, bug-hunting continues with a bug that gives wrong results during cross-validation on multi-class machines. Afterwards, I will try to polish my code a bit, especially documentation (and the tutorial), and continue with more examples/demos for the new framework for statistical testing.

# July 14, 2012 12:10 AM

#### A memory-leak-killer combination

Once you have finished coding a program, tested it, debugged it and checked that its behaviour is the expected one, it is quite normal to be invaded by a superhero-like feeling. You just feel smart. The simple thought that your program may not actually be correct hurts your pride, a lot. Well, C/C++ programmers (along with developers in other languages that do not use garbage collection) know that this can be the case. Your program may be leaking memory; i.e. it has reserved and made use of some memory but not freed it properly.

During these last two years I have been using Valgrind [1] to help me find the source of memory leaks. Valgrind has a tool suited to this purpose called memcheck [2]. Even though I have successfully used this tool several times, I must admit that there were some details in its output that I did not properly understand. For example, what is the difference between a memory leak caused by a directly lost block and one caused by an indirectly lost block? Are blocks that are still reachable caused by errors in my code? I am sure that the answers to these questions are in the manual. But, to be honest, I have never read a manual from the first to the last page before starting to use a tool. In my honest opinion, it is just too time consuming and full of unintelligible details for an inexperienced user. Today, I found this document [3] extremely useful for this task. The concepts are presented in a simple and concise way, together with pictures and some examples that make the topic clear. A MUST read for anyone getting in touch with Valgrind memcheck.

Apart from this, in SHOGUN we use our own reference counting system to track the number of objects that hold a pointer to another object. This way, when an object is deleted, the reference counts of all the objects it was holding are decremented by one, effectively deleting those objects if their counts reach zero. This system turns out to be very comfortable to use. And why am I talking about reference counting here? Well, because using it correctly is indispensable if you want your SHOGUN program to be free of memory leaks. There is a very nice option which enables traces to let you know when a new SHOGUN object is created, when a reference count is increased or decreased, and when an object is deleted. This option, together with the use of Valgrind, has enabled me to detect a bunch of memory leaks in my code in a matter of minutes, FTW! I am completely sure that this task would have taken me at least a couple of hours some time ago.

This definitely made my day and I must thank Sergey Lisitsyn for the suggestion of enabling this debug output and Aleksander for his wonderful document about Valgrind memory leaks reports.

PS. Take a look at the code profiling tool provided by Valgrind, cachegrind [4], and its KDE graphical interface kcachegrind! They are simply awesome.

References

[1] Valgrind. http://valgrind.org/
[2] Valgrind manual page about the tool memcheck.
http://valgrind.org/docs/manual/mc-manual.html
[3] Understanding Valgrind memory leaks reports.
https://sigquit.wordpress.com/2010/02/04/understanding-valgrind-memory-leak-reports/
[4] Valgrind manual page about the tool cachegrind.
http://valgrind.org/docs/manual/cg-manual.html
[5] An interesting thread on Stack Overflow about the different types of memory leaks detected by Valgrind memcheck.
http://stackoverflow.com/questions/3840582/still-reachable-leak-detected-by-valgrind

# July 14, 2012 12:10 AM

I tried to write a tutorial focusing on the type of memory leaks as detected internally by Valgrind and the generated output leak reports.

The PDF is available in the following URL:
http://es.gnu.org/~aleksander/valgrind/valgrind-memcheck.pdf

And the simple C tester to generate each type of memory leak (cases 1 to 9 in the Valgrind Manual) is available here:
http://es.gnu.org/~aleksander/valgrind/valgrind-memcheck.c

# July 09, 2012 03:25 PM

#### 7th GSoC weekly report: Hilbert Schmidt Independence Criterion

Finally, I started on kernel based (in)dependence tests last week. These are tests that try to find out whether two random variables $\textbf{x},\textbf{y}$ are independent, i.e. whether their joint distribution factorises into the product of the individual ones. The null hypothesis (which may be rejected) is $H_0:P_\textbf{x}P_\textbf{y}=P_{\textbf{x}\textbf{y}}$

These kind of tests basically work like two-sample tests: Given one set of samples from each random variable
$Z=(X,Y)=\{(x_1,y_1),\ldots,(x_m,y_m)\}$
a test statistic is computed and then compared against the distribution of the statistic under the null-hypothesis. If its position is in the upper tail of that distribution, the null-hypothesis is rejected since it is unlikely that the current value was generated under it.

The class of independence tests I will implement for my project is based on the Hilbert Schmidt independence criterion (HSIC), which carries out the above procedure in a reproducing kernel Hilbert space (RKHS). The (biased version of the) HSIC statistic itself is simply given by
$\text{HSIC}_b(Z)=\frac{1}{m^2}\text{trace}(KHLH)$
where $K,L$ are kernel matrices of the input samples $X,Y$ in some RKHS and $H=I-\frac{1}{m}\textbf{1}\textbf{1}^T$ is a centring matrix.
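
A direct numpy transcription of this statistic, with a Gaussian kernel chosen for concreteness (hypothetical helper names, not SHOGUN's CHSIC interface):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # Gram matrix K_ij = k(x_i, x_j) for the Gaussian RBF kernel
    sq = np.sum(X * X, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-dists / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma=1.0):
    # HSIC_b(Z) = trace(K H L H) / m^2  with  H = I - (1/m) 1 1^T
    m = X.shape[0]
    K = gaussian_kernel_matrix(X, sigma)
    L = gaussian_kernel_matrix(Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2
```

A constant second sample yields (numerically) zero, since centring annihilates a constant Gram matrix; identical samples give a strictly positive value.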

I integrated a general modular framework for independence tests into SHOGUN. The HSIC class is the first kernel-independence test that works. Interfaces are very similar to the two-sample test, however, they are not quite the same for various reasons. That's why there is another class for independence testing next to the one for two-sample testing.

As for the two-sample tests, the null-distribution may simply be approximated by bootstrapping, i.e. by permuting the samples so that the pairing between $X$ and $Y$ is destroyed and computing the statistic many times. This is now possible for any independence test. Another method to approximate the null-distribution for HSIC is fitting a Gamma distribution [1] as

$m\text{HSIC}_b(Z)\sim\frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)}$ where
$\alpha=\frac{(\textbf{E}(\text{HSIC}_b(Z)))^2}{\text{var}(\text{HSIC}_b(Z))} \quad \text{and}\quad\beta=\frac{m\text{var}(\text{HSIC}_b(Z))}{\textbf{E}(\text{HSIC}_b(Z))}$
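
The moment matching itself is a two-liner; a minimal sketch with a hypothetical function name (the p-value would then come from the upper tail of the fitted gamma at $m\cdot\text{HSIC}_b$, e.g. via scipy.stats.gamma.sf):

```python
def gamma_null_parameters(mean_hsic, var_hsic, m):
    # Match the first two moments of m * HSIC_b:
    #   alpha = E[HSIC_b]^2 / var(HSIC_b),  beta = m * var(HSIC_b) / E[HSIC_b]
    # The fitted Gamma(alpha, scale=beta) then has mean m * E[HSIC_b]
    # and variance m^2 * var(HSIC_b).
    alpha = mean_hsic ** 2 / var_hsic
    beta = m * var_hsic / mean_hsic
    return alpha, beta
```
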

It's also already implemented! There are already modular interfaces for the new classes and some simple tests. I will extend these during this week. Time passes fast: the mid-term evaluation is already this week. I pretty much enjoyed the first half :)

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence.

# July 06, 2012 12:15 AM

#### Hello world!

The title seems kinda apt so I am gonna use it shamelessly. Anyway, functional programming is one of the paradigms that most programmers don't get much chance to be exposed to. Most of us start learning programming in C, C++ or Java, which is pretty different from what functional programmers try to achieve. I happened to do a coursework on functional programming during my masters at IIT Bombay (CS613, instructed by Prof. Amitabha Sanyal) and caught a glimpse of it. He used Haskell for the purpose and I think it's an excellent choice for teaching this course.

It’s been a year that I have been out of touch, and I am full of guilt and regret for not giving it more time. Coding in Haskell can be so much fun and at the same time so horrifying, I can’t even begin to explain. I guess there were some parts I didn’t get quite well during the course, especially the monads and a few portions of lambda calculus. So, before I jump into that complicated stuff, I’ll be brushing up on the basics. Here is what I am gonna do: I’ll be studying Learn You A Haskell (LYAH – available online – the best place to start with Haskell) and keep posting interesting problems that we were given in assignments and exams. I’ll also be posting interesting blogs/problem sites and discussing the approach. One is Learn Me A Haskell.

Then I am planning to get started with Real World Haskell, another great book (also available online). But that seems pretty far ahead. The ultimate goal would be to get involved with a Haskell open source project, and actually write some real code that matters.

I’ll keep you updated. Happy Learning.

# July 04, 2012 10:01 AM

#### Finally, a new project update

It is about time to say something again about how the project is going. First of all, as a matter of excuse (probably a bad one, though), I want to say why the blog has not been updated lately. There have been a couple of weeks in which I was debugging the code for the primal formulation of the SO-SVM based on MOSEK, using a simple multiclass classification example. Those two weeks were basically just that: debugging, testing, debugging again, testing, … not interesting stuff, to tell the truth, but very necessary in any case! Last week, however, I started to work on the – probably – most interesting part of the project, the Hidden Markov Support Vector Machine [1].

I like to think of HM-SVMs as a particular instance of the general Structured Output (SO) problem I have been talking about in the previous posts. In SO learning we are trying to find a function $f$ that takes an object from the space $\mathcal{X}$ and outputs an object from the space $\mathcal{Y}$. In HM-SVMs the objects of $\mathcal{Y}$ are sequences of discrete data (e.g. integers). In order to deal with the structure of the sequences properly, we extract features of the data in the space $\mathcal{Y}$ together with the data in the space $\mathcal{X}$. This is done via a function denoted $\Psi(x, y)$, whose definition appears in the tutorial about the HM-SVM that Nico, my GSoC mentor, wrote [2, pages 4 and 5].

Another important piece of the SO-SVM is the argmax function. In HM-SVMs this function is computed using the well-known Viterbi algorithm. I had never studied this algorithm before, but I was happy to discover its similarity to the forward-backward algorithm used to train an HMM from data, which I had already studied and implemented [3]. I found this tutorial [4] on the Viterbi algorithm very useful and easy to understand. My implementation of this algorithm in SHOGUN is based on the one in the HMSVM toolbox [5] (in particular, in this file [6]).

One more piece of code I have implemented for the HM-SVM is the loss function, commonly denoted $\Delta$. In this case, we use the Hamming distance defined for strings. The inputs of the loss function are sequences of the same length, so the Hamming loss is simply the count of the elements that differ between the two sequences (think of it as the minimum number of modifications needed to make one sequence equal to the other; provided the sequences have the same length, the only allowed modification is changing the value of an element).
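
As a sketch, the loss on equal-length sequences is a one-liner over paired elements:

```python
def hamming_loss(y_true, y_pred):
    # Delta(y, y'): number of positions at which two equal-length sequences differ
    assert len(y_true) == len(y_pred), "sequences must have equal length"
    return sum(a != b for a, b in zip(y_true, y_pred))
```
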

These three functions, together with some data structures to hold the features and labels used by the HM-SVM, constitute what I have implemented so far in SHOGUN for this interesting discriminative model. The plan now is to go over the possible mistakes I have introduced into the code with Nico, and to generate some training and test data so the debugging phase can start! My idea is to port this function [7] from the HM-SVM toolbox, which generates random two-state data.

Comments and suggestions are very welcome!

You can find the code I have been talking about in github:
https://github.com/iglesias/shogun/tree/so/src/shogun/
(mainly in the files structure/HMSVMModel.[cpp|h], features/MatrixFeatures.[cpp|h], structure/HMSVMLabels.[cpp|h]).

References

[1] Altun, Y., Tsochantaridis, I., Hofmann, T. Hidden Markov Support Vector Machines.
http://cs.brown.edu/~th/papers/AltTsoHof-ICML2003.pdf
[2] SHOGUN Hidden Markov SO-SVM written by Nico Görnitz.
https://dl.dropbox.com/u/11020840/shogun/hmsvm.pdf
[3] The code of my HMM implementation for one of the homeworks of the Artificial Intelligence course that I took at KTH. https://github.com/iglesias/hmm-ai-ducks
[4] Viterbi algorithm tutorial.
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/
html_dev/viterbi_algorithm/s1_pg2.html
[5] Entry in mloss.org for the HM-SVM toolbox written by Gunnar Raetsch and Georg Zeller.
http://mloss.org/software/tags/hmsvm/
[6] Viterbi algorithm implementation in the HM-SVM toolbox.
https://dl.dropbox.com/u/11020840/shogun/best_path.cpp
[7] Data generation for a two-state model in the HM-SVM toolbox.
https://dl.dropbox.com/u/11020840/shogun/simulate_data.m

# July 02, 2012 03:36 PM

#### 6th GSoC weekly report: First modular examples and other stuff

Last week's changes were all rather subtle:

• I created some first modular examples in python,
• fixed this big bug in the model selection trees I talked about last week (nasty!),
• added some convenience methods for the two-sample-test constructors (there is now a new method in CFeatures to append feature objects)
• and corrected a bunch of bugs on the fly.

This week, I will do some more work on the examples and then start working on independence testing.

# June 25, 2012 05:26 PM

#### 5th GSoC weekly report: Almost finished MMD tests

Last week, I worked silently (no Internet connection) on finishing all MMD-related tests. New are a distinction between biased/unbiased test statistics for the quadratic MMD, the Gaussian approximation of the null-distribution for the linear time MMD -- and more tests.

Since the linear time MMD, defined as
$\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i})$
is normally distributed under both the null and the alternative distribution, one can easily compute test thresholds and p-values using the empirical variance of the h-values in the above term as a proxy for the true variance (the null-distribution has zero mean). For large sample sizes, this is an accurate and very cheap way to perform the test: the statistic has to be computed only once, whereas bootstrapping would need a few hundred iterations. Another reason why the linear time MMD is well suited for large scale problems.
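
A compact numpy sketch of this test under the stated Gaussian approximation (Gaussian kernel chosen for concreteness; hypothetical function names, not SHOGUN's API):

```python
import numpy as np
from math import erf, sqrt

def gaussian_kernel(A, B, sigma=1.0):
    # Elementwise k(a_i, b_i) for row-paired samples
    return np.exp(-np.sum((A - B) ** 2, axis=1) / (2.0 * sigma ** 2))

def linear_time_mmd(X, Y, sigma=1.0):
    # MMD_l^2 = (1/m) sum_i h((x_{2i-1}, y_{2i-1}), (x_{2i}, y_{2i}))
    # computed over disjoint sample pairs, so each point is touched once.
    m = min(X.shape[0], Y.shape[0]) // 2
    x1, x2 = X[0:2 * m:2], X[1:2 * m:2]
    y1, y2 = Y[0:2 * m:2], Y[1:2 * m:2]
    h = (gaussian_kernel(x1, x2, sigma) + gaussian_kernel(y1, y2, sigma)
         - gaussian_kernel(x1, y2, sigma) - gaussian_kernel(x2, y1, sigma))
    statistic = np.mean(h)
    # Gaussian approximation: the variance of the h-values serves as a proxy
    # for the variance of the statistic; the null-distribution has zero mean.
    std = sqrt(np.var(h) / m)
    p_value = 0.5 * (1.0 - erf(statistic / (std * sqrt(2.0)))) if std > 0 else 0.5
    return statistic, p_value
```

For identical samples the h-values cancel exactly and the statistic is zero; for well-separated distributions the statistic is positive and the p-value essentially vanishes.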

I also started on integrating my code into the modular interfaces of SHOGUN and will produce some first python examples next week.

Jacob (Gaussian Process Regression project), who uses and extends parts of my model selection code from last year's GSoC, has found a serious problem in SHOGUN's parameter trees for model selection. I hope to fix it this week -- it's complicated.

When all mentioned things are done, I might start with dependence testing next week.

# June 20, 2012 02:28 PM

"Much of machine learning (ML) research has lost its connection to problems of import to the larger world of science and society." So begins Kiri Wagstaff's position paper that will have a special plenary session on June 29 at ICML 2012. The paper then goes on to lament about the poor state of affairs in machine learning research. The paper is an interesting read, and it addresses an important question that any adolescent field faces: "How do I justify my existence?"

I'd like to take the glass-half-full view. Machine learning already matters!

Kiri herself uses examples that show that machine learning already has impact. In her introduction, she mentions the CALO project, which forms the basis of Siri on the iPhone 4S, which has revolutionised the way the general public perceives human computer interactions. She also mentions spam detection, which Gmail has generalized to sorting all email with Priority Inbox.

A quick look around the web reveals other success stories:

• The recent technology quarterly section of the Economist 2 June 2012 edition discusses the use of robots and how we would need to start legislating them. Ironically, in our human desire to appropriate blame in case of failure, we may have to block learning. Quoting the article: "This has implications for system design: it may, for instance, rule out the use of artificial neural networks, decision-making systems that learn from example rather than obeying predefined rules."

• Searching for the phrase "machine learning" in PLoS Computational Biology returns 250 hits, showing how machine learning has revolutionised biological research in the high throughput age.

• In high energy physics, particle accelerators use anomaly detection algorithms to only save data which may be interesting. The ultimate learning with data streams application.

At NIPS 2008, at the last talk of the Machine Learning in Computational Biology mini-symposium, I had the pleasure of being inspired by Thomas Lengauer's activities proposing anti-HIV therapy. I'd say that this "solves" challenge number 5 in Kiri's list. Remarkably (unfortunately?), their recommendation site remains just that, a recommendation site, and has yet to navigate the legislative nightmare of getting a website to prescribe drugs. In an answer to a question, he said that Germany was one of the few places in the world where the legislation even allows doctors to use such drug recommendation sites. A scan of the titles cited by the review article reveals keywords which would fit comfortably in a machine learning venue:

• multiple linear regression
• simple linear model
• prediction-based classification
• artificial neural networks
• self organising feature maps
• non-parametric methods
• sparse models
• convex optimization

But doom and gloom persists. Why? My personal opinion is that like most successful technologies, machine learning fades into the background once it has impact. In that vein of thought, we can measure the impact of machine learning by the decline of ICML, JMLR and friends. Meanwhile, I'm going to go back to making machine learning disappear...

Please join in the discussion at http://mlimpact.com/.

# June 19, 2012 01:26 PM

#### Second Weekly Report GSoC 2012

Let’s have some fun, fun with optimization!

Basically, during the last days my work on the project has been strictly focused on implementing the optimization algorithm for the SO-SVM (Structured Output Support Vector Machine). This algorithm is presented and fully analysed in [1]. The paper explores the dual formulation of the optimization problem, whereas at this point in the project we are concerned with the primal version. The primal was selected for its complexity and efficiency properties: the dual requires an explicit expansion of the weight vector, and its memory requirements are quadratic rather than linear as in the primal formulation. The primal formulation of the algorithm appears in [2], even though the main topic of that article is beyond our current approach to SO learning.

The algorithm belongs to a class called cutting plane algorithms. These optimization methods are based on the idea of refining the feasible set of solutions iteration by iteration using the most violated constraints. In [3] there is a video lecture; around instant 6:20 the intuition of how cutting plane methods work is nicely introduced, which I really like.

As commented in the previous post, a quadratic program (QP) appears in the algorithm. In particular, at every iteration of the algorithm we need to solve a QP that looks like

$\text{minimize}_{\vec{w}, \xi_i} \frac{1}{2} \vec{w}^TC\vec{w} + \sum_{i=1}^N \xi_i$
$s.t.~\forall i~\forall y \in \mathcal{Y} \setminus y_i : \vec{w} \cdot\delta\Psi_i(y) \ge \Delta(y_i, y) - \xi_i$

We are using MOSEK to solve this QP [4]. The QP shown above needs to be translated into a more general formulation such as the one provided by MOSEK, though. In order to do this, I have created an interface to MOSEK from Shogun, inspired by the CPLEX one that is already in the toolbox.

Right now the algorithm is close to finished. However, nothing has been properly tested so far, which makes me feel uncomfortable. Once the algorithm is finally ready, I will start preparing the first and simplest use case of the framework, multiclass classification. Debugging and testing time will come then!

References
[1] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y. Support vector machine learning for interdependent and structured output spaces.
[2] Finley, T., Joachims, T. Training Structural SVMs when Exact Inference is Intractable.
[3] Video lecture by Thomas Finley on Training Structural SVMs when Exact Inference is Intractable. http://videolectures.net/icml08_finley_tssvm/
[4] MOSEK C API Quadratic Optimization. http://docs.mosek.com/6.0/capi/node007.html#329747704
[5] Nowozin, S., Lampert, C. H. Structural Learning and Prediction in Computer Vision. Section 6 is the part mainly related to this part of the project.

# June 18, 2012 04:42 PM

#### 4th GSoC weekly report: Tests, tests, tests...

Since all threshold methods that I implemented so far are quite unintuitive and hard to verify, I spent last week writing tests for most of their parts. I created test-cases based on fixed and random datasets (mainly taken from the papers where the methods were introduced), in which results are compared against MATLAB implementations of the same problem. For most cases, the difference is asserted to be smaller than 10E-15.

Although it takes a lot of time and effort, I believe such automatic tests should come with every non-trivial method in SHOGUN. I remember quite a few cases where I spent an hour finding an error that would have been easily caught by a test.

I hope to finish all statistical tests based on the quadratic MMD this week. I still need to implement the biased version (which can be used along with the threshold methods) and a possibility for choosing the type of statistic to use. I also plan to implement the linear time MMD and the corresponding tests (Gaussian approximation of the null-distribution). I won't have Internet access this week since I am in the middle of nowhere in Poland -- so expect a larger patch at the end of the week.

# June 18, 2012 07:36 AM

#### Third Weekly Report GSoC 2012 – Clash of Assumptions

After a weekend of debugging I can proudly say that I am able to run toy examples using my implementation of the SO-SVM for the multiclass classification use case \o/

Just for the record, and for the fun of its discovery, I am going to write about the last bug I fixed in my code. As commented in the previous posts, part of the project deals with solving an optimization problem, a QP in particular. The fact is that I could run the code just fine: no more failing ASSERTs popping up, no segmentation faults. However, something weird was going on with the solution. The weight vector $\vec{w}$ was zero for every problem instance. After double-checking with MATLAB’s function quadprog that the correct solution to the problem was indeed non-trivial, I added some lines here and there to ensure that the parameters given to MOSEK were the correct ones. All the matrices and vectors were ok. After some more lines printing out the bounds of the problem, I found it. The variables of the optimization problem were all fixed to zero!

At the beginning I just assumed that if no bound is given for a variable, then MOSEK would consider it free; i.e. it may take values within $(-\infty, \infty)$. However, this is false: if no bound is given, MOSEK fixes the value of the variable to zero. I still don’t see the point of including variables in the optimization vector whose values are fixed… but that’s another story.

Next time I will try to remember to check that the assumptions I make are correct or, even better, to RTFM.

My next objective is to try out the application with bigger examples. Unfortunately, my MOSEK license file only allows me to solve problems with up to 300 variables or constraints, which currently prevents me from running tests with training data bigger than 17 two-dimensional training examples!

# June 11, 2012 02:58 PM

#### 3rd GSoC weekly report: New threshold methods for quadratic MMD

I finally finished my exams last week and can now work full-time on GSoC and my Master project. Phew! :)

Last week, I implemented two methods to compute a threshold for the quadratic time MMD test:

1. A test based on the eigenspectrum of the kernel matrix of the joint samples. This is a nice and computationally efficient idea from [1]. The basic idea is that in the case $P=Q$, the biased MMD estimate converges in distribution:
$m\text{MMD}^2_b \rightarrow \sum_{l=1}^\infty \lambda_l z_l^2$
where $z_l \sim \mathcal{N}(0,2)$ i.i.d. and $\lambda_l$ are eigenvalues whose empirical estimates $\hat{\lambda}_l$ are given by the eigenvalues of the centred kernel matrix $\tilde{K}=HKH$, where $K_{ij}=k(x_i,x_j)$. It is possible to sample the null-distribution using these estimates and to compute a p-value or threshold from the resulting samples.

2. A heuristic method, also from [1], that approximates the null-distribution with a gamma-distribution whose first two moments are matched. I.e.
$m\text{MMD}_b(Z) \sim \frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)}$ where $\alpha=\frac{(\textbf{E}(\text{MMD}_b(Z)))^2}{\text{var}(\text{MMD}_b(Z))}\quad \text{and}\quad \beta=\frac{m\,\text{var}(\text{MMD}_b(Z))}{\textbf{E}(\text{MMD}_b(Z))}$
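The moment matching itself is a two-line computation; here is a stdlib-Python sketch of it, with a Monte Carlo quantile standing in for the gamma CDF that a real implementation (e.g. via ALGLIB) would use. The input numbers in the test are illustrative, not real MMD moments:

```python
import random

def match_moments(stat_mean, stat_var, m):
    """Match a Gamma(alpha, beta) to the first two moments of m * MMD_b:
    the gamma mean alpha*beta equals m*E(MMD_b) and its variance
    alpha*beta^2 equals m^2*var(MMD_b)."""
    alpha = stat_mean ** 2 / stat_var   # shape
    beta = m * stat_var / stat_mean     # scale
    return alpha, beta

def gamma_threshold(stat_mean, stat_var, m, level=0.05, seed=1):
    """Approximate the (1 - level) test threshold by sampling from the
    moment-matched gamma (avoids needing the gamma CDF)."""
    alpha, beta = match_moments(stat_mean, stat_var, m)
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(alpha, beta) for _ in range(10000))
    return samples[int((1.0 - level) * len(samples))]
```

One can check the matching quickly: with the scale $\beta = m\,\text{var}/\textbf{E}$ and shape $\alpha = \textbf{E}^2/\text{var}$, the gamma mean $\alpha\beta$ is exactly $m\,\textbf{E}(\text{MMD}_b)$ and the gamma variance $\alpha\beta^2$ is exactly $m^2\,\text{var}(\text{MMD}_b)$, as required.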

Both methods need some distribution functions ($\Gamma$, etc.), which I integrated from ALGLIB. I added tests to ensure that the results match those obtained with MATLAB. Along with that come some SGVector-based wrappers for SHOGUN functions (sorting, eigenvalues, etc.).

Next week, I will do some fine-tuning on the implemented methods and then create tests which illustrate all methods.

[1]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). A fast, consistent kernel two-sample test.

# June 04, 2012 01:29 PM

#### 2nd GSoC weekly report

Last week, I was again very busy with exams and doing experiments for a NIPS submission.

The latter is somewhat related to my GSoC project, and I will implement it once the other tasks are done:
We developed a method for selecting optimal coefficients of a non-negative combination of kernels for the linear time (=large scale) MMD-two-sample test. The criterion that is optimised for is the ratio of the linear MMD $\eta_k$ by its standard deviation $\sigma_k$, i.e.
$k_*=\arg \sup_{k\in\mathcal{K}} \eta_k \sigma_k^{-1}$. That is equivalent to solving the quadratic program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where the combination of kernels is given by
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\}$
$\hat{Q}$ is a linear time estimate of the covariance of the MMD estimates and $\hat{\eta}$ is a linear time estimate of the MMD using the above kernel combinations.
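The building block underneath all of this is the linear-time MMD estimate $\hat{\eta}_k$ itself, which averages a simple h-statistic over non-overlapping sample pairs so each point is touched only once. A stdlib-Python sketch for a single kernel (the Gaussian kernel and one-dimensional data are chosen purely for illustration):

```python
import math

def gauss_k(x, y, sigma=1.0):
    """Gaussian kernel on scalars, for illustration only."""
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def linear_time_mmd(xs, ys, k=gauss_k):
    """Linear-time MMD estimate: average of
    h(i) = k(x_i, x_{i+1}) + k(y_i, y_{i+1}) - k(x_i, y_{i+1}) - k(x_{i+1}, y_i)
    over non-overlapping index pairs (i, i+1)."""
    n = min(len(xs), len(ys)) // 2 * 2  # even number of usable samples
    terms = [k(xs[i], xs[i + 1]) + k(ys[i], ys[i + 1])
             - k(xs[i], ys[i + 1]) - k(xs[i + 1], ys[i])
             for i in range(0, n, 2)]
    return sum(terms) / len(terms)
```

For identical samples the estimate is exactly zero, and for well-separated samples it approaches 2 (the within-sample kernel values dominate the cross terms), which is what the kernel-selection criterion above then trades off against the estimator's variance.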

Apart from that, I implemented a method to approximate the null-distribution of the quadratic time MMD using the eigenspectrum of the kernel matrix of the merged samples from the two distributions, following [1]. It still needs to be compared against the MATLAB implementation. It comes with some minor helper functions around matrix algebra.
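Once the eigenvalue estimates of the centred kernel matrix are in hand (hard-coded below; a real implementation obtains them from an eigendecomposition of $HKH$), sampling the approximate null distribution is a few lines of stdlib Python:

```python
import random

def sample_null(eigvals, num_samples, seed=0):
    """Draw samples of sum_l lambda_l * z_l^2 with z_l ~ N(0, 2) i.i.d.,
    the limiting null distribution of m * MMD^2_b under P = Q."""
    rng = random.Random(seed)
    sigma = 2.0 ** 0.5  # z_l has variance 2
    return [sum(lam * rng.gauss(0.0, sigma) ** 2 for lam in eigvals)
            for _ in range(num_samples)]

# stand-in eigenvalue estimates of the centred kernel matrix (illustrative)
null_samples = sample_null([0.5, 0.3, 0.2], 10000)
# a p-value is then the fraction of null samples exceeding the observed statistic
```

Since $\mathbf{E}[z_l^2] = 2$, the sample mean of the null draws should sit near $2\sum_l \lambda_l$, which gives a cheap sanity check on any implementation.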

This week, I will finally have my last exam and then continue on the advanced methods for computing test thresholds.

[1]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). A fast, consistent kernel two-sample test.

# June 04, 2012 11:10 AM

#### First Weekly Report GSoC 2012

Let’s do the weekly reports kick-off of this summer!

Although GSoC started officially a couple of days ago, on May 21st, I have been working on the project for about two weeks. Next, I am going to summarize what progress has been made during this time.

First of all, based on the code skeletons [1] that Nico wrote, I started with the design of the SO framework in Shogun. The design decisions taken during this phase are summarized in the attached class diagram [2]. In the diagram, the classes in light green existed in Shogun prior to this project, whereas the classes filled with light red are brand new. Among the new classes, CStructuredModel seeks to offer the functionality to put together all the application-dependent parts of an SO problem instance. The CLossFunction class turned out to be very handy, since I just needed to extend it with a few methods to support the functionality required by SO. The idea of this class is to provide a generic interface for well-defined loss functions (e.g. the hinge loss). Needless to say, the design shown in the diagram is very likely to evolve. For example, CStructuredModel currently uses function pointers for some of its members, and this will change to a more understandable class-based interface.

Initial SO class diagram.

In addition, classes (labels/CStructuredLabels and lib/CStructuredData) to provide labels with structure (e.g. sequences, graphs) have been added. This is probably the feature that distinguishes the most SO learning from the other strategies already present in Shogun.

Finally, I have been working on the optimization algorithm presented in [3]. This is still work in progress and the code is in CPrimalMosekSOSVM. The main difficulty I have found here is that, in order to solve the quadratic program (QP) that arises, we need to use a non-open-source tool, since libqp does not support all the required constraints (in particular, inequality constraints of the type $A \cdot x \leq b$ in addition to box constraints). I have started to write some code in CPrimalMosekSOSVM that makes use of MOSEK to solve the QP. This piece of code is still rather rough and only exists in my local repository.

The current working plan is, in this order: finish the code in CPrimalMosekSOSVM mentioned above (I have set a deadline for this on Friday, June 1st), prepare the first use case with multiclass SVMs, and extend the design with a class for the $\arg \max$ computation and another one for the structured loss function $\Delta(y_{pred}, y_{true})$.
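For sequence labels, a common choice of $\Delta(y_{pred}, y_{true})$ is the Hamming loss. A minimal Python illustration of the concept (this is not SHOGUN's eventual interface, just the idea the planned class will encapsulate):

```python
def hamming_loss(y_pred, y_true):
    """Delta(y_pred, y_true): number of positions where the predicted
    structured label (here a sequence) disagrees with the ground truth."""
    assert len(y_pred) == len(y_true)
    return sum(p != t for p, t in zip(y_pred, y_true))
```

The point of putting $\Delta$ behind its own class is that the SO-SVM machinery only needs a loss it can evaluate on pairs of structured labels; swapping the Hamming loss for, say, a graph-edit loss should not touch the learner.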

References
[1] Gist with main concepts of the framework written by Nico Görnitz. https://gist.github.com/2634487.
[2] Structured Output framework – Class Diagram.
[3] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y. Support vector machine learning for interdependent and structured output spaces.
[4] SO learning branch in git:  https://github.com/iglesias/shogun/tree/so.

# May 28, 2012 08:54 AM

#### First GSoC weekly report

I am currently quite busy with my exams; however, the last three will be done soon. I still managed to produce initial sketches for the statistical testing framework, along with helping to solve problems caused by the massive changes that are currently happening to SHOGUN's label and multi-class system.

Here you can find a UML diagram of the class structure so far. I implemented the first simple kernel two-sample tests -- the ones based on the linear and the quadratic time MMD metric. For computing a p-value, these two may approximate their null-distribution using a (brute-force) bootstrapping approach: shuffle the merged data of the two underlying distributions and compute the statistic multiple times. The bootstrapping code will work for any two-sample test.
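The bootstrapping idea is generic enough to sketch independently of the MMD. Here a plain difference-of-means statistic stands in for it (stdlib Python, all data illustrative):

```python
import random

def bootstrap_null(xs, ys, statistic, num_iters=200, seed=0):
    """Approximate the null distribution of a two-sample statistic by
    repeatedly shuffling the merged sample and splitting it back into
    two groups of the original sizes."""
    rng = random.Random(seed)
    merged = list(xs) + list(ys)
    null = []
    for _ in range(num_iters):
        rng.shuffle(merged)
        null.append(statistic(merged[:len(xs)], merged[len(xs):]))
    return null

def p_value(observed, null_samples):
    """Fraction of null samples at least as extreme as the observation."""
    return sum(s >= observed for s in null_samples) / len(null_samples)
```

Because the statistic is only called through the `statistic` argument, the same code serves the linear and quadratic MMD alike, which is exactly why one bootstrap routine can back every two-sample test in the framework.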

Next steps are: Advanced methods for estimating Null-distributions for the MMD tests.

I also worked with Arthur (my mentor) on a version of the MMD that is related to my Master project: a convex combination of (arbitrary) kernels for the linear time MMD, where the optimal weights are learned by solving a quadratic program. I might implement that in SHOGUN as well. (Can anyone help me interface SHOGUN's QP solver?)

# May 22, 2012 09:57 PM

#### GSoC 2012 is already here!

Last month I was selected to develop a project for Shogun as part of GSoC. The project’s name is Build a Generic Structured Output Learning Framework and it is mentored by Nico Görnitz. Here follows a short description of the project:

The aim is to implement tools for structured output (SO) problems. The data in these problems have complex structure (e.g. graphs, sequences), and traditional learning algorithms fail to find solutions efficiently. Structured output support vector machines and conditional random fields are methods for SO learning. They will be implemented to form Shogun’s first module for SO learning. Finally, these methods will be applied to hidden Markov model-type problems such as gene prediction.

Feel free to visit my project proposal where, among some personal information, you will be able to find a thorough description of the project together with a tentative schedule and useful references on the topic.

This is going to be a fun summer of coding!

# May 22, 2012 02:25 PM

Welcome to iglesiashogun.wordpress.com! In this blog you will be able to find information about my wonderful experience as a software developer for the Open Source Machine Learning toolbox Shogun.

Happy coding!

# May 15, 2012 05:31 PM

#### AISEC 2012 - Call for Papers.

I am happy to spread the word about the 5th ACM Workshop on Artificial Intelligence and Security (AISEC). The workshop invites original research papers describing the use of machine learning in security and privacy problems, as well as position papers discussing the role of learning in security. The workshop will be held in conjunction with the renowned ACM Conference on Computer and Communications Security (CCS) in Chicago and accepted papers are published by ACM press.
• Paper Submission Due: July 16, 2012
• Acceptance Notification: August 13, 2012
• Workshop: October 19, 2012
AISEC is one of the few events that specifically targets the challenging niche of machine learning and computer security. As a member of the PC, I am really looking forward to interesting contributions and innovative applications. More information is available at the workshop's website.

#### What's New

• Feb. 17, 2014: SHOGUN 3.2.0
• Jan. 6, 2014: SHOGUN 3.1.1
• Jan. 5, 2014: SHOGUN 3.1.0
• Oct. 28, 2013: SHOGUN 3.0.0
• March 17, 2013: SHOGUN 2.1.0
• Sept. 1, 2012: SHOGUN 2.0.0
• Dec. 1, 2011: SHOGUN 1.1.0