IRC logs of #shogun for Sunday, 2011-07-31

--- Log opened Sun Jul 31 00:00:59 2011
00:01 <blackburn> sonney2k: no leaks I'd say
00:02 <@sonney2k> ok then good...
00:06 <blackburn> sonney2k: tried with init and loaded features - no leak, jigsaw memory usage :D
00:07 <blackburn> 1.3 1.5 1.9 2.3 1.6 1.9 2.2 1.0 ...
00:07 <blackburn> 2M kernels
00:08 <blackburn> sonney2k: so the only thing to get to work - examples..
00:09 <@sonney2k> yeah looks like
00:10 <@sonney2k> blackburn, I have a new suggestion for SGVector btw:
00:10 <@sonney2k>         static SGVector get_vector(SGVector &src, bool own=true)
00:10 <@sonney2k>         {
00:10 <@sonney2k>             if (!own)
00:10 <@sonney2k>                 return src;
00:10 <@sonney2k>             src.do_free=false;
00:10 <@sonney2k>             return SGVector(src.vector, src.vlen);
00:10 <@sonney2k>         }
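(A minimal usage sketch for the helper above, assuming the SGVector fields referenced in the snippet - vector, vlen, do_free - and a float64_t instantiation; nothing here beyond the pasted snippet is confirmed shogun API:)

    // src owns its buffer after allocation
    SGVector<float64_t> src(10);
    // own=false: hand back the same object, src keeps ownership
    SGVector<float64_t> view = SGVector<float64_t>::get_vector(src, false);
    // own=true: src gives up ownership (do_free=false); the returned
    // wrapper around the same buffer becomes responsible for freeing it
    SGVector<float64_t> moved = SGVector<float64_t>::get_vector(src, true);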
00:10 <blackburn> better
00:11 <@sonney2k> I think I will go with this one
00:12 <blackburn> sonney2k: a week ago I made some tester for java
00:12 <blackburn> but can't decide if I should continue..
00:12 <@sonney2k> blackburn, don't for now
00:12 <@sonney2k> this needs discussion
00:13 <@sonney2k> the question really is - how do we want to compare if things are the same in our test suite
00:14 <blackburn> sonney2k: I think we should split examples into 'unit tests' and complex examples
00:14 <@sonney2k> not a good idea
00:14 <blackburn> sonney2k: I think we shouldn't compare if things are the same
00:15 <@sonney2k> no one will maintain tests
00:15 <blackburn> sonney2k: unit tests should be autogenerated
00:15 <@sonney2k> ?
00:16 <blackburn> well all the kernels are the same..
00:16 <@sonney2k> blackburn, so?
00:16 <@sonney2k> we had that before
00:16 <@sonney2k> the test suite now is 100% useless because no one updates it
00:16 <@sonney2k> it is enough work to 'just' update examples
00:16 <blackburn> I mean we would only have to write templates
00:17 <@sonney2k> doesn't help
00:17 <blackburn> why?
00:17 <@sonney2k> too much work
00:17 <@sonney2k> you need to do that for everything
00:17 <@sonney2k> and there is not that much that you can generalize
00:17 <@sonney2k> there are always exceptions
00:18 <blackburn> but I agree, we can't maintain tests for java, python, ...
00:19 <@sonney2k> it really is much easier to write examples for everything (which we have to have anyway)
00:19 <@sonney2k> and then return some reasonable number or so
00:19 -!- in3xes [~in3xes@180.149.49.227] has quit [Quit: Leaving]
00:19 <@sonney2k> that we use to compare results
00:22 <blackburn> I don't like the idea of comparing results like that
00:23 <@sonney2k> blackburn, because?
00:23 <blackburn> sonney2k: well it looks strange to me..
00:23 <@sonney2k> blackburn, yeah but why?
00:23 <blackburn> I think we should only test that there are no errors
00:24 <@sonney2k> what does no errors mean?
00:24 <blackburn> no compile-time errors, no runtime errors like segfaults..
00:25 <@sonney2k> blackburn, but then that might mean we return just crap
00:25 <blackburn> return to?
00:25 <@sonney2k> I could replace train() with random()
00:25 <@sonney2k> and no one would notice
00:25 <blackburn> can we notice it now?
00:25 <@sonney2k> yes
00:26 <blackburn> how?
00:26 <@sonney2k> blackburn, because we know that at time point T1 everything was correct
00:26 <@sonney2k> now we develop sth else
00:26 <@sonney2k> and just compare whether the result at T2 is the same as at T1
00:27 <blackburn> we don't use it at all..
00:28 <@sonney2k> blackburn, yeah because no one runs it
00:28 <@sonney2k> and we have no build bot that does it automagically
00:28 <@sonney2k> but we also have a problem
00:29 <@sonney2k> because results are GaussianKernel etc
00:29 <@sonney2k> and we pickle.dump
00:29 <@sonney2k> and internally we changed formats so serialization results are different and we can no longer load results
00:30 <blackburn> bad bad
00:30 <@sonney2k> blackburn, yes.
00:31 <@sonney2k> the issue here is how we can keep the format constant or at least compatible
00:47 <CIA-87> shogun: Soeren Sonnenburg master * r42595fa / src/interfaces/java_modular/swig_typemaps.i : use simple swig enums - https://github.com/shogun-toolbox/shogun/commit/42595fa7a75037216c096aeb9879f265c37fdbfe
00:47 <CIA-87> shogun: Soeren Sonnenburg master * r95f11a0 / (5 files in 3 dirs): Hopefully fix compiler errors in GMM/Gaussian. Utilize destroy_*. - https://github.com/shogun-toolbox/shogun/commit/95f11a02929a4db4404c4bfdd0670d43cebf0610
00:52 <blackburn> sonney2k: what do you think, is it better to use Gaussian things in GaussianNaiveBayes?
00:52 <@sonney2k> blackburn, not our biggest problem now - rather think about what we do to keep the serialization format compatible
00:53 <blackburn> it simply fits a gaussian for each class
00:53 <@sonney2k> maybe we need some kind of 'variable x is now y'
00:53 <blackburn> I'm just talking about it because there is a little bug in GNB :D
00:53 <@sonney2k> or things like this
00:53 <blackburn> sonney2k: how does it look now?
00:53 <@sonney2k> blackburn, talk to alesis-novik about this :)
00:54 <@sonney2k> blackburn, well we basically register all member variables
00:54 <@sonney2k> the problem is that we introduced new ones now
00:54 <@sonney2k> and we renamed old ones or even changed types...
00:55 <blackburn> any automagic way?
00:56 <blackburn> btw in java we would have a way hehe
00:56 <@sonney2k> how do you do that in java?
00:56 <@sonney2k> if you change an object and rename variables?
00:57 <blackburn> well in java it is possible to get variable types and names
00:57 <@sonney2k> blackburn, how does this help our problem?
00:57 <blackburn> 'nohow' - we aren't using java :)
00:58 <@sonney2k> if you rename a variable - the serialized file would be different and so you couldn't load the serialized object
00:58 <@sonney2k> the old one I mean
00:58 <blackburn> ah I see
00:58 <@sonney2k> that is our problem atm
00:58 <@sonney2k> we need to extend the serialized format to store some transitioning information
00:59 <@sonney2k> like the version when this variable appeared
00:59 <@sonney2k> and when this thing vanished or whatever
00:59 <@sonney2k> and was renamed
00:59 <blackburn> sonney2k: didn't you say that it doesn't really matter now because we changed so many things already?
00:59 <@sonney2k> and functions to transition from version x to version y
01:00 <@sonney2k> blackburn, no
01:00 <@sonney2k> it is the only way to ensure testing...
01:00 <blackburn> do you want to keep it compatible?
01:00 <@sonney2k> the only other alternative is to check each method individually
01:00 <@sonney2k> by hand that is
01:15 <blackburn> okay time for some sleep
01:15 <blackburn> sonney2k: see you
01:15 <@sonney2k> cu
01:15 -!- blackburn [~blackburn@188.122.224.26] has quit [Quit: Leaving.]
07:29 -!- f-x [~user@117.192.192.42] has joined #shogun
08:15 -!- f-x` [~user@117.192.222.125] has joined #shogun
08:17 -!- f-x [~user@117.192.192.42] has quit [Ping timeout: 260 seconds]
08:49 -!- f-x [~user@117.192.222.125] has joined #shogun
10:00 <f-x> sonney2k: you here?
10:00 -!- f-x` [~user@117.192.222.125] has left #shogun ["ERC Version 5.3 (IRC client for Emacs)"]
10:00 -!- f-x [~user@117.192.222.125] has quit [Quit: ERC Version 5.3 (IRC client for Emacs)]
10:01 -!- f-x [~user@117.192.222.125] has joined #shogun
10:01 -!- f-x is now known as Guest33029
10:01 -!- Guest33029 is now known as f-x`
10:02 <f-x`> sonney2k: should I use enums to identify the loss function for SVMSGD?
11:29 <@sonney2k> f-x`, no just use a loss member
11:29 <@sonney2k> then you can add some flag or so for this < comparison
11:35 <f-x`> sonney2k: I don't know.. what kind of flag do you suggest?
11:37 -!- f-x` is now known as f-x
11:41 -!- heiko [~heiko@134.91.52.15] has joined #shogun
12:21 <@sonney2k> f-x, for SGD/QN you don't really need that flag if you treat the learning algorithms differently
12:22 <@sonney2k> but if you don't want to do that then you could introduce some flag needs_extra_update or so that is set true for some...
12:22 <@sonney2k> heiko, hi...
12:22 <heiko> sonney2k, hi!
12:23 <@sonney2k> heiko, I was trying to stabilize things and came across one issue...
12:23 <heiko> which one?
12:23 <@sonney2k> that is, our test suite no longer works due to all the variable additions (like subset), renames and the SGVector stuff
12:24 <@sonney2k> so heiko I was wondering if you would have time to work on this a bit ...
12:24 <heiko> yes, I can do this
12:25 <heiko> I think I will finish the KMeans stuff today
12:25 <heiko> but what exactly is the problem?
12:25 <@sonney2k> heiko, you did subset for string features right, but sparse is still missing?
12:25 <@sonney2k> and kernel / distance machines will work too - at least soon
12:26 <heiko> no, sparse features already have subset
12:26 <heiko> an example is there too
12:27 <heiko> in c++
12:27 <heiko> yes, kernel machines work
12:27 <heiko> but only with simple/string features
12:27 <heiko> no model selection for sparse features currently
12:27 <@sonney2k> heiko, ok so then the todo was only some nicer python syntax typemaps and different sampling techniques
12:27 <heiko> yes
12:27 <@sonney2k> heiko, why not - how is it different?
12:27 <@sonney2k> ^sparse & ms
12:28 <heiko> cross-validation needs a method of CFeatures that is not implemented for sparse
12:28 <heiko> copy_subset
12:28 <@sonney2k> heiko, btw when you use SGVector in a class, could you please use vector.destroy_vector() in the destructor?
12:29 <@sonney2k> heiko, I just don't understand the difference to any other feature object ...
12:29 <heiko> sonney2k, yes, you changed this, I forgot
12:29 <@sonney2k> ahh ok
12:29 <@sonney2k> (I was just reading your pull request/patch)
12:30 <heiko> sonney2k, have a look at void CKernelMachine::store_model_features
12:30 <heiko> there, this method is called
12:30 <@sonney2k> heiko, yeah I understand
12:31 <heiko> sonney2k, pull request updated
12:31 <@sonney2k> heiko, but that copy* function should be trivial for sparse
12:31 <@sonney2k> it is the same as for strings...
12:31 <heiko> yes, should be simple
12:31 <@sonney2k> (more or less :)
12:31 <heiko> but it has to be implemented :)
12:31 <@sonney2k> yes yes
12:32 <@sonney2k> ok then I would say finish distance, the sparse copy, and then it would be very very good if you could help getting serialization more cross-version compatible
12:32 <@sonney2k> so here is the problem we have:
12:33 <@sonney2k> all this m_parameters->add() stuff registers the variables to be serialized
12:33 <@sonney2k> now that is all good and works
12:33 <@sonney2k> however in shogun version+1 we might add a new variable, like in this case subset :D
12:34 <heiko> yes
12:34 <@sonney2k> suddenly older objects cannot really be loaded (well, they could, but would issue a warning)
12:34 <@sonney2k> so the plan would be to store a version in addition
12:35 <@sonney2k> so add an int version to the m_parameters->add() call
12:36 <heiko> and then check the version upon deserialization
12:36 <@sonney2k> and each object then has a separate version number defined that is then just passed
12:36 <@sonney2k> heiko, yes, so it means we load only things of that specific version even if we are in a newer-version object
12:36 <@sonney2k> so that would solve the problem of *additions*
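(A minimal sketch of this versioned registration idea; the extra version argument, the per-class version constant and the member names are assumptions for illustration, not the actual shogun API:)

    // Hypothetical: each class defines the current version of its layout.
    static const int32_t GNB_PARAM_VERSION = 2;

    void CGaussianNaiveBayes::register_parameters()
    {
        // member present since version 1
        m_parameters->add(&m_dim, "dim", "Feature dimension", 1);
        // member introduced in version 2 - tagged so that files written by
        // version-1 code can still be loaded (the field is simply absent)
        m_parameters->add(&m_subset, "subset", "Active subset", 2);
    }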
12:37 <@sonney2k> now we have the problem of type changes and renames too
12:37 <heiko> yes
12:37 <@sonney2k> for example we have lots of changes that do double* vector, int len -> SGVector vec
12:38 <heiko> yes
12:38 <heiko> do you already have an idea for that?
12:38 <@sonney2k> I am not 100% sure yet how to properly fix this but I think there is no other way than providing some transition table
12:38 <heiko> ok
12:38 <heiko> tricky
12:38 <@sonney2k> i.e. in that table the old variable names would be registered along with a transition function that returns the new one
12:39 <heiko> and this table has to be updated whenever someone changes a variable
12:39 <heiko> or should this go automatically?
12:39 <@sonney2k> so e.g. old_names = {vector, len}, new_name = vector, transition function = transform_double_len_to_sgvector()
12:39 <@sonney2k> heiko, this cannot go automatically
12:40 <heiko> ok
12:40 <@sonney2k> we would have to update that for all classes until the whole test suite runs through again
12:40 <heiko> but these functions are only for serialization
12:40 <heiko> or deserialization of old data
12:40 <@sonney2k> (the test suite currently is the tester.py in testsuite)
12:40 <@sonney2k> yes
12:40 <heiko> I will have a look
12:40 <@sonney2k> deserialization only
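(A sketch of what one entry of such a transition table could look like, built around the old_names/new_name example above; the struct, the table and the function body are hypothetical, as shogun had no such mechanism at the time:)

    // Maps one or more old serialized fields onto one new member.
    struct ParameterTransition
    {
        const char* old_names[2];   // e.g. {"vector", "len"}
        const char* new_name;       // e.g. "vector"
        // converts the already-deserialized old fields into the new member
        void (*transition)(void** old_values, void* new_member);
    };

    // the example from the discussion: {double* vector, int32_t len} -> SGVector
    void transform_double_len_to_sgvector(void** old_values, void* new_member)
    {
        double* vec = (double*)old_values[0];
        int32_t len = *(int32_t*)old_values[1];
        *(SGVector<float64_t>*)new_member = SGVector<float64_t>(vec, len);
    }

    static const ParameterTransition transitions[] = {
        { {"vector", "len"}, "vector", transform_double_len_to_sgvector },
    };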
12:41 <@sonney2k> in serialization we write things out only in the newest format I would say... not sure if this is good - very much M$ Word style...
12:41 <heiko> puh, that is a lot of stuff
12:41 <@sonney2k> heiko, let's start with the low-hanging fruit, that is, additions
12:42 <heiko> ok
12:42 <heiko> and the version id
12:42 <@sonney2k> yeah I think this can be solved via the version id
12:43 <heiko> ok
12:43 <heiko> I will probably start on this tomorrow and then bother you with my problems :)
12:44 <@sonney2k> heiko, btw in line 146 you can use SGVector<float64_t>(k)
12:44 <@sonney2k> this will alloc a vector of len k
12:44 <@sonney2k> heiko, yeah
12:45 <heiko> sonney2k, another question:
12:45 <heiko> there are many feature classes that do not support subset or model selection
12:45 <heiko> this is because of the inheritance structure
12:45 <heiko> basically there are only three base classes, and things work for these
12:45 <heiko> but for all the specializations, the methods are not implemented
12:46 <heiko> for example all dot features
12:46 <heiko> because in the class DotFeatures itself, it is not possible to implement the missing methods
12:47 <f-x> sonney2k: sorry for persisting, but the 'if (z < 1)' between the #if-#endif should function something like 'if (z < 1 || loss->is_log_loss())' right? I'm not understanding where this needs_extra_update flag would go
12:49 <@sonney2k> heiko, it should be doable in dotfeatures too - the problem is that this needs another change in the features beneath, like dot() would call compute_dot() with the right subset
12:50 <@sonney2k> heiko, so let's postpone that for now.
12:50 <heiko> ok
12:50 <heiko> this will automatically be detected when people encounter the "class XYZ is not ready for model selection yet" :)
12:50 <@bettyboo> heiko, haha
12:51 <heiko> (SG_ERRORs)
12:51 <@sonney2k> f-x, it is not just for LOGLOSS but also for LOGLOSSMARGIN
12:52 <f-x> sonney2k: right.. so these two loss functions should have some common property we should be able to check for
12:52 <f-x> or we could have enum types for all loss functions and check for those enums
12:52 <@sonney2k> f-x, that is why I was suggesting a needs_extra_update or so flag
12:53 <f-x> where would this be? in the SGD class?
12:53 <f-x> I didn't understand properly
12:53 <@sonney2k> f-x, in the losses
12:53 <@sonney2k> f-x, in the end you either create all losses in one file or in multiple like you do now (up to you)
12:53 <f-x> sonney2k: and they will be used only in SGD/SGD-QN?
12:53 <@sonney2k> if in one file they would be in mathematics/*
12:53 <f-x> (the flag)
12:54 <@sonney2k> (BTW there is already one loss thingy in there which should be modified/removed)
12:54 <@sonney2k> f-x, yes
12:55 <@sonney2k> if you do it in one file then you will have to use enums for selecting the loss
12:55 <@sonney2k> otherwise classes - which is what you do now.
12:56 <f-x> sonney2k: but it wouldn't be good to modify loss functions for the sake of learning algorithms, right?
12:56 <f-x> or will this flag be of use generally as well?
12:57 <@sonney2k> f-x, how else would you solve that problem?
12:57 <@sonney2k> the only other option I see is to change the learning algorithm completely depending on the loss
12:57 <f-x> define a global list of enums for all loss functions in some header file
12:57 <@sonney2k> and then?
12:57 <f-x> and each loss function returns that enum
12:57 <f-x> check for that enum from SGD
12:57 <f-x> whether the enum is LOGLOSS or LOGLOSSMARGIN or whatever
12:57 <@sonney2k> yes sure that is also fine
12:58 <@sonney2k> this is what features / preprocessors do
12:58 <@sonney2k> kernels / distances too btw
12:58 <f-x> sonney2k: so where should I add the loss function enums?
12:58 <@sonney2k> they all have an enum
12:58 <@sonney2k> in Loss.h
12:58 <f-x> hmm right
12:59 <f-x> sonney2k: ok. sounds good, I'll do that.
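(A sketch of the enum route just agreed on; the exact enum values, class names and method names in Loss.h are assumptions here, not the code f-x ended up writing:)

    // in Loss.h - one enum value per loss
    enum ELossType
    {
        L_HINGELOSS,
        L_SQUAREDHINGELOSS,
        L_LOGLOSS,
        L_LOGLOSSMARGIN
    };

    class CLossFunction
    {
    public:
        virtual ~CLossFunction() {}
        // each concrete loss reports its type
        virtual ELossType get_loss_type()=0;
    };

    class CLogLoss : public CLossFunction
    {
    public:
        virtual ELossType get_loss_type() { return L_LOGLOSS; }
    };

    // inside SVMSGD's update step, replacing the compile-time #if-#endif
    // around 'if (z < 1)':
    //   ELossType t = loss->get_loss_type();
    //   if (z < 1 || t == L_LOGLOSS || t == L_LOGLOSSMARGIN)
    //       ...perform the extra update...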
12:59 <f-x> sonney2k: btw VW also adds a couple of methods to the loss functions it uses
12:59 <f-x> like get_update() and get_square_grad()
12:59 <f-x> (which are basically used mainly for VW)
13:00 <f-x> so I shouldn't put these into the loss function classes, right? (because they'll probably only be used by VW)
13:00 <@sonney2k> f-x, put them in the losses
13:00 <f-x> and also I don't know how they'd look for any general loss function - I know them only for those loss functions used in VW
13:00 <@sonney2k> they belong there because they do some extra stuff
13:01 <@sonney2k> then return SG_NOTIMPLEMENTED for the other losses there
13:01 <f-x> sonney2k: okay.. that's nice.. I'll add them too.
13:01 <@sonney2k> it is totally fine if not all losses support such functions
13:01 <@sonney2k> (or leave them not implemented, as in this case)
13:02 <f-x> great.. we could implement them later.. by solving a recurrence relation in John's paper.. but I don't think I'll do it now.
13:02 <f-x> SG_NOTIMPLEMENTED is the way to go
13:02 <f-x> sonney2k: is the compilation fixed? or was it compiling for you already?
13:02 -!- in3xes [~in3xes@180.149.49.227] has joined #shogun
13:03 <@sonney2k> f-x, I have an older gcc version so it always compiled here
13:03 <@sonney2k> but I hope I fixed it, yes
13:03 <f-x> ok, thanks! mine's 4.6.1.. and I'll report if it doesn't work here
13:06 <CIA-87> shogun: Soeren Sonnenburg master * rfd09670 / (2 files):
13:06 <CIA-87> shogun: Merge pull request #253 from karlnapf/master
13:06 <CIA-87> shogun: made KMeans serializable and SGVector replacement (+8 more commits...) - https://github.com/shogun-toolbox/shogun/commit/fd0967097e615f9f234f1a18c6269e89d57a2ab4
13:07 <@sonney2k> alesis-novik, so can you avoid the memcpy stuff?
13:21 -!- blackburn [~blackburn@188.122.224.26] has joined #shogun
13:28 <heiko> sonney2k, I think it makes sense to implement apply for any DistanceMachine, i.e. move the implementation from KMeans to DistanceMachine
13:29 <heiko> but then one has to ensure that every distance machine stores its cluster centers in the lhs of the underlying distance variable
13:29 <heiko> what do you think about this?
13:29 <heiko> then any distance machine would implement the apply method
13:30 <heiko> well, every distance machine that builds cluster centers in training
13:30 <heiko> KNN would then override apply with its own method
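(A rough sketch of the generic apply heiko describes, assuming the trained centers sit on the lhs of the distance object; the member and method names are illustrative, not the eventual shogun implementation:)

    // Generic classification for any distance machine that stored its
    // cluster centers as the lhs of 'distance' during training.
    CLabels* CDistanceMachine::apply(CFeatures* data)
    {
        distance->replace_rhs(data);   // centers remain on the lhs
        int32_t num_vec = data->get_num_vectors();
        int32_t num_centers = distance->get_num_vec_lhs();
        CLabels* labels = new CLabels(num_vec);
        for (int32_t i = 0; i < num_vec; i++)
        {
            // assign each point to the nearest stored center
            int32_t best = 0;
            float64_t best_d = distance->distance(0, i);
            for (int32_t c = 1; c < num_centers; c++)
            {
                float64_t d = distance->distance(c, i);
                if (d < best_d) { best_d = d; best = c; }
            }
            labels->set_label(i, best);
        }
        return labels;
    }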
13:54 <blackburn> sonney2k: openmp? ;)
14:14 <@sonney2k> heiko, ok
14:15 <@sonney2k> blackburn, slow
14:15 <blackburn> sonney2k: why slow?
14:16 <blackburn> many things could be easily adapted for openmp because of the #pragma notation..
14:16 <blackburn> is it really slow?
14:18 <@sonney2k> blackburn, for simple things it is fast, yes
14:20 <CIA-87> shogun: Heiko Strathmann master * r3ac5c53 / (2 files): another SGVector replacement and usage of CMath::sqrt instead of std::sqrt - https://github.com/shogun-toolbox/shogun/commit/3ac5c53a62eec98dd1d0b68a2dd80453755f9a1d
14:20 <CIA-87> shogun: Soeren Sonnenburg master * ra6586d5 / (2 files):
14:20 <CIA-87> shogun: Merge pull request #254 from karlnapf/master
14:20 <CIA-87> shogun: SGVector replacement - https://github.com/shogun-toolbox/shogun/commit/a6586d545c32c38ee414efd277d49f41bc8352a0
14:25 <blackburn> sonney2k: and when is it slow?
14:25 -!- heiko [~heiko@134.91.52.15] has quit [Ping timeout: 258 seconds]
14:34 <@sonney2k> blackburn, in my attempts, when I called functions from inside the parallelized pragma stuff
14:34 <@sonney2k> blackburn, so plain for loops without function calls should become faster...
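(An illustration of the distinction sonney2k draws - a plain loop body parallelizes cheaply with OpenMP, while calling into other functions from inside the pragma is where he reports seeing slowdowns:)

    #include <omp.h>

    void scale_inplace(double* v, int n, double a)
    {
        // plain loop, no function calls inside: the cheap case
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            v[i] *= a;
    }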
14:35 <blackburn> so what might we use for multithreading?
14:39 <@sonney2k> pthreads
14:50 -!- mrsrikanth [~mrsrikant@59.92.22.26] has joined #shogun
14:54 -!- in3xes_ [~in3xes@210.212.58.111] has joined #shogun
14:57 -!- in3xes [~in3xes@180.149.49.227] has quit [Ping timeout: 240 seconds]
15:07 -!- mrsrikanth [~mrsrikant@59.92.22.26] has quit [Read error: Connection reset by peer]
15:31 -!- f-x [~user@117.192.222.125] has quit [Ping timeout: 260 seconds]
15:34 -!- in3xes_ is now known as in3xes
16:21 -!- srikanth [~mrsrikant@59.92.22.26] has joined #shogun
17:04 -!- srikanth [~mrsrikant@59.92.22.26] has quit [Quit: Leaving]
18:03 <alesis-novik> sonney2k, around?
18:46 <alesis-novik> Well, I did what you asked, and valgrind doesn't seem to complain, so that's good.
18:58 -!- f-x [~user@117.192.207.49] has joined #shogun
19:04 -!- in3xes_ [~in3xes@180.149.49.227] has joined #shogun
19:04 <f-x> sonney2k: hey! what kind of objects, as a rule, do you think should inherit from CSGObject?
19:07 -!- in3xes [~in3xes@210.212.58.111] has quit [Ping timeout: 240 seconds]
19:58 <@sonney2k> f-x, all except those for which this would be too much overhead
19:59 <CIA-87> shogun: Alesis Novik master * rf8fc62c / src/shogun/clustering/GMM.cpp : Removed copying - https://github.com/shogun-toolbox/shogun/commit/f8fc62c7b365df87859be00ef74bde3c6d2b7cdd
19:59 <CIA-87> shogun: Soeren Sonnenburg master * r380af5a / (3 files in 2 dirs):
19:59 <CIA-87> shogun: Merge pull request #252 from alesis/gmm
19:59 <CIA-87> shogun: Memory problem fixes. - https://github.com/shogun-toolbox/shogun/commit/380af5acf4bba4d7cb226fd9dc90ab625b2ac149
20:01 <@sonney2k> alesis-novik, well you are the master of your algorithm... as long as you don't destroy the vector under your feet you should be fine.
20:05 <alesis-novik> sonney2k, well, I think in this case we might still have a few variables floating around because the object isn't deleted. Nothing major though.
20:05 -!- in3xes_ is now known as in3xes
20:26 <@sonney2k> alesis-novik, ok...
20:30 <alesis-novik> sonney2k, found another potential memory problem, committing.
20:35 <blackburn> alesis-novik: do you know what gaussian naive bayes is?
20:35 <blackburn> I've been thinking about GNB+Gaussian integration
20:37 <@sonney2k> alesis-novik, I suggest you compile with --trace-memory-allocs and check for leaks too :)
20:37 <alesis-novik> sonney2k, will do
20:37 <alesis-novik> what did you have in mind, blackburn?
20:38 <blackburn> alesis-novik: well, right now it uses the gaussian pdf
20:38 <blackburn> maybe it is even possible to fit gaussians for every class, not only with diagonal covariance
20:39 <blackburn> one issue with GNB now is that it sometimes leads to underflow or so
20:39 <blackburn> I mean for every class the probability becomes so small that the decision is not correct
20:48 <alesis-novik> but how do you want to integrate it with Gaussian?
20:48 <blackburn> I'm not sure if CGaussian is exactly what I mean :)
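(The underflow blackburn mentions is the textbook argument for evaluating GNB in log space - summing log densities plus the log prior instead of multiplying densities. A minimal sketch with illustrative names, not the actual CGaussianNaiveBayes code:)

    #include <cmath>
    #include <cfloat>
    #include <cstdint>

    // returns argmax_c [ log p(c) + sum_d log N(x_d | mean[c][d], var[c][d]) ]
    int32_t gnb_predict(const double* x, int32_t dim, int32_t num_classes,
            double** mean, double** var, const double* log_prior)
    {
        int32_t best = 0;
        double best_score = -DBL_MAX;
        for (int32_t c = 0; c < num_classes; c++)
        {
            double score = log_prior[c];
            for (int32_t d = 0; d < dim; d++)
            {
                double diff = x[d] - mean[c][d];
                // log of the univariate gaussian density - never underflows
                score += -0.5*log(2*M_PI*var[c][d]) - diff*diff/(2*var[c][d]);
            }
            if (score > best_score)
            {
                best_score = score;
                best = c;
            }
        }
        return best;
    }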
21:05 -!- in3xes [~in3xes@180.149.49.227] has quit [Quit: Leaving]
21:09 <CIA-87> shogun: Alesis Novik master * r2d2fbf8 / src/shogun/clustering/GMM.cpp : added SG_UNREF where needed - https://github.com/shogun-toolbox/shogun/commit/2d2fbf8433c9bfb0621a046dc7408aa49f15c2d8
21:09 <CIA-87> shogun: Soeren Sonnenburg master * ra514bca / src/shogun/clustering/GMM.cpp :
21:09 <CIA-87> shogun: Merge pull request #255 from alesis/gmm
21:09 <CIA-87> shogun: added SG_UNREF where needed - https://github.com/shogun-toolbox/shogun/commit/a514bca4f7200429d1696953ca9d3cadacb80a5f
21:09 <@sonney2k> alesis-novik, btw would it be possible to set the matrix of means etc. for the gaussians in one go?
21:10 <@sonney2k> now it seems one has to set multiple vectors
21:10 <@sonney2k> alesis-novik, I mean you could just add a set_* function which takes an SGMatrix as argument and then calls the respective SGVector functions multiple times...
21:10 <alesis-novik> sonney2k, that's because it's just calling the underlying CGaussian::set_mean(...)
21:11 <@sonney2k> alesis-novik, yes but you can emulate that, right?
21:11 <@sonney2k> I mean split up the mean matrix etc.
21:11 <alesis-novik> but what about the covariance one then?
21:12 <@sonney2k> alesis-novik, covariance is for every gaussian, right?
21:12 <@sonney2k> I mean you have 1 per gaussian?
21:12 <alesis-novik> yes
21:13 <@sonney2k> so in your GMM you would have multiple cov matrices?
21:13 <@sonney2k> then it doesn't make sense indeed
21:14 <alesis-novik> Well, every Gaussian in the mixture model has a mean and cov. While a bulk set_means makes sense, I don't really think that a bulk set_covs using SG* would make sense
21:28 <@sonney2k> alesis-novik, yes I agree, so we keep it like it is then
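(For reference, the bulk mean setter that both agreed makes sense could be emulated exactly as sonney2k suggests - split the SGMatrix column-wise and forward to the per-Gaussian setter. The m_components member and one-mean-per-column layout are assumptions for illustration:)

    #include <cstring>

    // one mean per column; forwards to CGaussian::set_mean per component
    void CGMM::set_means(SGMatrix<float64_t> means)
    {
        for (int32_t i = 0; i < means.num_cols; i++)
        {
            SGVector<float64_t> mean(means.num_rows);
            memcpy(mean.vector, means.matrix + i*means.num_rows,
                    means.num_rows*sizeof(float64_t));
            m_components[i]->set_mean(mean);
        }
    }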
21:54 -!- f-x [~user@117.192.207.49] has quit [Remote host closed the connection]
22:04 -!- serialhex [~quassel@99-101-148-183.lightspeed.wepbfl.sbcglobal.net] has quit [Ping timeout: 250 seconds]
22:04 -!- serialhex [~quassel@99.101.148.183] has joined #shogun
--- Log closed Mon Aug 01 00:00:11 2011
