--- Log opened Fri Aug 04 00:00:24 2017
@wiking | Trixis, the current version of libsvm does not | 03:19 |
@wiking | but | 03:19 |
@wiking | our kernels are multithreaded | 03:20 |
@wiking | mikeling, sorry yesterday i've crashed | 03:20 |
@wiking | i'll have time today | 03:20 |
@wiking | to check onto things | 03:20 |
@wiking | in 1-1.5 i'll have some fixes for you | 03:20 |
-!- http_GK1wmSU [~deep-book@129.232.221.173] has joined #shogun | 05:49 | |
-!- http_GK1wmSU [~deep-book@129.232.221.173] has left #shogun [] | 05:52 | |
-shogun-buildbot:#shogun- Build nightly_all #23 is complete: Success [build successful] - http://buildbot.shogun-toolbox.org:8080/#builders/22/builds/23 | 07:00 | |
mikeling | wiking: thanks a lot! | 07:27 |
@wiking | HeikoS, ping? | 08:34 |
@wiking | HeikoS, i wonder why some base interfaces got into gpl | 08:34 |
@wiking | like | 08:34 |
@wiking | src/gpl/shogun/multiclass/MulticlassLogisticRegression.h | 08:35 |
@wiking | i know that it uses SLEP | 08:35 |
@wiking | but basically what we should do is to make MulticlassLogisticRegression.h an interface/abstract class | 08:35 |
@wiking | and have an implementation like SlepMulticlassLogisticRegression.h | 08:35 |
@wiking | that is using slep to do it | 08:35 |
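The split wiking proposes could be outlined like this. The real classes are C++ headers under src/gpl/shogun, so this Python sketch only mirrors the shape; `SlepMulticlassLogisticRegression` is the hypothetical name from the discussion, not an existing class:

```python
from abc import ABC, abstractmethod

class MulticlassLogisticRegression(ABC):
    """Solver-agnostic interface that could live outside the GPL tree."""

    @abstractmethod
    def train(self, features, labels):
        """Fit the model; concrete solvers decide how."""

class SlepMulticlassLogisticRegression(MulticlassLogisticRegression):
    """GPL-licensed implementation that would delegate to SLEP."""

    def train(self, features, labels):
        # a real version would call into the SLEP solver here;
        # the sketch just reports success
        return True
```

The point of the split is licensing: the abstract interface carries no SLEP code, so only the concrete subclass needs to stay in the GPL directory.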
@wiking | HeikoS, ping? | 09:39 |
@wiking | mikeling, i've just rebased again and fixed some of the errors | 09:50 |
@wiking | can you check now? | 09:50 |
@wiking | or i mean when you have time | 09:50 |
-!- travis-ci [~travis-ci@ec2-54-204-223-168.compute-1.amazonaws.com] has joined #shogun | 10:00 | |
travis-ci | it's Viktor Gal's turn to pay the next round of drinks for the massacre he caused in shogun-toolbox/shogun: https://travis-ci.org/shogun-toolbox/shogun/builds/260912556 | 10:00 |
-!- travis-ci [~travis-ci@ec2-54-204-223-168.compute-1.amazonaws.com] has left #shogun [] | 10:00 | |
-!- HeikoS [~heiko@host-92-0-169-11.as43234.net] has quit [Quit: Leaving.] | 10:10 | |
Trixis | wiking: i was mostly wondering at what point should my own multithreading kick in | 10:15 |
@wiking | :) | 10:32 |
Trixis | (cant have it kick in from the start, otherwise i end up with libgomp errors all over the place) | 10:33 |
mikeling | wiking: hi, sorry for the late reply. I think those errors have been fixed now :) | 12:24 |
mikeling | thank you | 12:24 |
-!- HeikoS [~heiko@host-92-0-169-11.as43234.net] has joined #shogun | 12:58 | |
-!- mode/#shogun [+o HeikoS] by ChanServ | 12:58 | |
-!- HeikoS [~heiko@host-92-0-169-11.as43234.net] has quit [Ping timeout: 240 seconds] | 13:33 | |
-!- travis-ci [~travis-ci@ec2-54-83-103-165.compute-1.amazonaws.com] has joined #shogun | 14:18 | |
travis-ci | it's Olivier's turn to pay the next round of drinks for the massacre he caused in shogun-toolbox/shogun: https://travis-ci.org/shogun-toolbox/shogun/builds/260971672 | 14:18 |
-!- travis-ci [~travis-ci@ec2-54-83-103-165.compute-1.amazonaws.com] has left #shogun [] | 14:18 | |
-shogun-buildbot:#shogun- Build deb3 - interfaces #78 is complete: Success [build successful] - http://buildbot.shogun-toolbox.org:8080/#builders/37/builds/78 | 14:24 | |
Trixis | wiking: btw i was wondering, whats the rationale behind features/labels not supporting multiple independent subsets at once? | 15:01 |
@wiking | hi | 15:01 |
@wiking | just a sec | 15:01 |
Trixis | hi | 15:04 |
@wiking | Trixis, well it's not rationale | 16:11 |
@wiking | it's a bottleneck atm | 16:12 |
@wiking | :( | 16:12 |
Trixis | right, makes sense | 16:12 |
@wiking | currently micmn is working on fixing that in Features | 16:12 |
Trixis | gotcha | 16:26 |
Trixis | wiking: yeah i found i had to create a wrapper to make that operation stateless (via copying the whole object ofc) so that i could use it in map reduce | 16:27 |
@wiking | :) | 16:33 |
Trixis | wiking: also the shogun_num_threads limit is global for the entire program, right? | 16:38 |
Trixis | wiking: i.e. suppose i set it to 8, then dispatch 32 threads, each training a single classifier, then only at most 8 classifiers will be trained in parallel? | 16:39 |
-!- StarmanDeluxe [~StarmanDe@unaffiliated/starmandeluxe] has joined #shogun | 16:40 | |
-!- StarmanDeluxe [~StarmanDe@unaffiliated/starmandeluxe] has left #shogun ["WeeChat 1.9-dev"] | 16:41 | |
@wiking | yes | 16:49 |
@wiking | shogun_num_threads is global | 16:49 |
@wiking | we dont have yet the concept of openmp teams | 16:50 |
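The semantics just confirmed (set `shogun_num_threads` to 8, dispatch 32 threads, at most 8 train in parallel) can be mimicked with a plain semaphore. This is a stdlib Python analogy, not Shogun's actual OpenMP machinery:

```python
import threading

limit = threading.Semaphore(8)  # plays the role of shogun_num_threads = 8
lock = threading.Lock()
active = 0
peak = 0

def train_one_classifier():
    global active, peak
    with limit:  # at most 8 "trainings" hold the semaphore at once
        with lock:
            active += 1
            peak = max(peak, active)
        # ... actual training work would go here ...
        with lock:
            active -= 1

threads = [threading.Thread(target=train_one_classifier) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds the global limit of 8
```

The missing "OpenMP teams" concept would correspond to giving each dispatched thread its own sub-limit instead of one global cap.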
Trixis | kk | 17:05 |
Trixis | well, fuck cant set customkernel matrix because i cant cast the kernel returned by combinedkernel to customkernel :| apparently this is correct behavior as per swig http://www.swig.org/Doc2.0/Java.html | 17:06 |
Trixis | i guess a dirty hack to get around it atm is to insert a new custom kernel at the position and delete the old one? | 17:08 |
-!- HeikoS [~heiko@eduroam-int-pat-8-18.ucl.ac.uk] has joined #shogun | 17:18 | |
-!- mode/#shogun [+o HeikoS] by ChanServ | 17:18 | |
-!- HeikoS [~heiko@eduroam-int-pat-8-18.ucl.ac.uk] has quit [Ping timeout: 240 seconds] | 17:22 | |
-!- HeikoS [~heiko@untrust-out.swc.ucl.ac.uk] has joined #shogun | 17:37 | |
-!- mode/#shogun [+o HeikoS] by ChanServ | 17:37 | |
olinguyen | HeikoS: sorry for the spam commits, I had a little trouble with the style checker | 17:39 |
@HeikoS | olinguyen: no worries | 17:39 |
@HeikoS | github recently added the "squash option" | 17:39 |
@HeikoS | so I can just turn them all into a single one | 17:40 |
@HeikoS | and rewrite the message ;) | 17:40 |
@HeikoS | btw dont use git commit -a | 17:40 |
@HeikoS | as this always adds data | 17:40 |
@HeikoS | olinguyen: | 17:45 |
@HeikoS | can you explain this test to me: https://github.com/shogun-toolbox/shogun/pull/3954/files#diff-cd4f542522819a81dd14ed78e4dbefd7R287 | 17:45 |
@HeikoS | olinguyen: just sent a review for the PR as well | 17:46 |
olinguyen | sure & thanks | 17:46 |
@HeikoS | I am here for another hour or so | 17:46 |
@HeikoS | so let's discuss | 17:46 |
olinguyen | i generated toy data where the output is 1 when features 1 and 2 are < 5 | 17:46 |
olinguyen | so and tested on data where the probability was 1 or 0 | 17:46 |
@HeikoS | I get the toy data generation | 17:47 |
@HeikoS | I dont get the second | 17:47 |
@HeikoS | "tested on data where the prob was 1 or 0" | 17:47 |
@HeikoS | you mean the RF assigns 100% confidence in its prediction | 17:47 |
@HeikoS | ? | 17:47 |
@HeikoS | I.e. all trees give the same result | 17:47 |
olinguyen | correct and that is when features 1 and 2 are < 5 | 17:47 |
@HeikoS | I see | 17:48 |
@HeikoS | ok so then | 17:48 |
@HeikoS | no seed necessary, and no single thread | 17:48 |
@HeikoS | since that doesnt change the fact that all trees will agree | 17:48 |
olinguyen | yea, you're right | 17:48 |
@HeikoS | test name should also reflect that | 17:49 |
@HeikoS | score_consistent_with_binary_trivial_data | 17:49 |
@HeikoS | or something nicer | 17:49 |
olinguyen | kk | 17:49 |
@HeikoS | but something that kind of explains what happens | 17:49 |
@HeikoS | and what the rationale is | 17:49 |
@HeikoS | then next thing | 17:49 |
@HeikoS | can we have a test where the trees don't all agree? | 17:49 |
@HeikoS | like a dataset where the class labels are random | 17:50 |
olinguyen | sure, i'd use like EXPECT_NEQ? | 17:50 |
@HeikoS | I am more thinking | 17:50 |
@HeikoS | say you have a dataset where you just assign random class labels | 17:50 |
@HeikoS | say all features are gaussian | 17:50 |
@HeikoS | and then you just randomly give them +1, -1 labels | 17:50 |
@HeikoS | or rather 0,1,2 | 17:51 |
@HeikoS | then on prediction | 17:51 |
@HeikoS | the confidences should be spread more or less evenly | 17:51 |
@HeikoS | see what I mean? | 17:51 |
@HeikoS | so you can add a rough check for that in the test, calibrate it so that it passes almost all of the time | 17:51 |
olinguyen | yea, i'm just unsure what value assertion i'd make in that case? | 17:51 |
olinguyen | like the probability outputs are likely fluctuating | 17:52 |
@HeikoS | EXPECT_NEAR(score, 0.3, 0.1) | 17:52 |
@HeikoS | something like this | 17:52 |
olinguyen | ok got it | 17:52 |
@HeikoS | run it a few times, observe | 17:52 |
@HeikoS | and then give it some headroom | 17:52 |
@HeikoS | so that it doesnt fail | 17:52 |
olinguyen | kk | 17:52 |
@HeikoS | but it catches some problems if somebody would screw up the scores | 17:52 |
@HeikoS | and then name this like "scores_random_labels" | 17:53 |
@HeikoS | or so | 17:53 |
@HeikoS | (method name should be in there somehow) | 17:53 |
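The random-labels test HeikoS sketches (confidences near chance, checked with generous headroom) can be simulated without Shogun. The "forest" below is a crude stand-in where each tree echoes the label of a random training point; everything here is illustrative, not Shogun's RandomForest:

```python
import random

random.seed(0)

n_points, n_trees = 200, 500
# the features would be gaussian noise; only the random labels matter here
labels = [random.choice((0, 1)) for _ in range(n_points)]

def tree_predict():
    # stand-in for one fully grown tree: it echoes the label of a random
    # training point, so its vote follows the label distribution
    return random.choice(labels)

score = sum(tree_predict() for _ in range(n_trees)) / n_trees
# with random labels the forest's confidence should hover around 0.5;
# the wide band is the "headroom" so the check almost never fails
assert 0.3 < score < 0.7
```

The final assert is the Python analogue of `EXPECT_NEAR(score, 0.5, 0.2)` in the actual gtest.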
@HeikoS | I think we can merge once that is done | 17:53 |
olinguyen | In that case | 17:54 |
olinguyen | do you think it's a good idea to follow sklearn's test here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/tests/test_voting_classifier.py#L143 | 17:54 |
olinguyen | i believe the probabilities are quite similar so with headroom the test will be similar | 17:54 |
@HeikoS | what exactly does it do? | 17:54 |
@HeikoS | compare against logistic regression scores? | 17:54 |
olinguyen | no, the randomforestclassifier | 17:54 |
olinguyen | So with the same input X, i'd compare with np.array([[0.8, 0.2], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]]) | 17:55 |
olinguyen | on shogun's RF | 17:55 |
@HeikoS | ah | 17:55 |
@HeikoS | I see | 17:55 |
@HeikoS | so you are saying | 17:55 |
@HeikoS | that if we have random number comparison with headroom | 17:56 |
@HeikoS | why not compare against sklearn test directly? | 17:56 |
olinguyen | exactly | 17:56 |
@HeikoS | sure | 17:56 |
@HeikoS | if the algorithm implementation is the same, then that is a very sensible thing to do | 17:56 |
@HeikoS | maybe add two more tests then | 17:56 |
@HeikoS | the one with the random labels | 17:56 |
@HeikoS | and the one for sklearn | 17:56 |
olinguyen | ok, sure | 17:56 |
@HeikoS | the more different ones the better | 17:57 |
olinguyen | can we chat a little bit about the data project? | 17:57 |
olinguyen | unless you had a few things to add | 17:57 |
olinguyen | on the RF stuff | 17:57 |
@HeikoS | no thats all | 18:01 |
@HeikoS | I am curious how the RF behaves | 18:01 |
@HeikoS | especially when we add the lag features | 18:01 |
olinguyen | So I'm a little uncertain about the incorporation of the lagged features | 18:01 |
olinguyen | What i'm trying to do right now is to extract time series data (first 24 hours) for each feature of dying patients and normal patients (that's a lot of data!) | 18:02 |
olinguyen | From my understanding, using the time-lagged features would help me predict what a time series would look like, given past data. But that doesn't help predict the mortality outcome. How did you see the use of time-lagged features in this case? | 18:02 |
olinguyen | e.g. i have points (t-2, t-1) and i'm trying to predict (t+1). that's what adding lagged features is for, or am i seeing it wrong? | 18:03 |
@HeikoS | yes | 18:03 |
@HeikoS | thats exactly it | 18:03 |
@HeikoS | but it depends all on the target, i.e. what you are trying to predict | 18:03 |
@HeikoS | currently, you are predicting mortality at a fixed time point t right? | 18:04 |
@HeikoS | and you use a snapshot of data at an earlier time point for that | 18:04 |
olinguyen | right now: using the first 24 hours aggregated, i'm seeing if the patient will die in the hospital | 18:04 |
@HeikoS | yes exactly | 18:04 |
@HeikoS | that is "ever" | 18:04 |
@HeikoS | but that is not the most interesting question (and also hard) | 18:05 |
@HeikoS | but what about the question "will the patient die tomorrow/next week / next month | 18:05 |
@HeikoS | " | 18:05 |
@HeikoS | that is a bit more interesting for the hospital | 18:06 |
@HeikoS | as they can react | 18:06 |
@HeikoS | if you just tell them a patient is going to die while here, that is not going to help immensely | 18:06 |
@HeikoS | but if you tell them; I am sure he won't die next week .... and then suddenly : I am certain she dies next week | 18:06 |
@HeikoS | that is more useful | 18:07 |
olinguyen | yea i see your point | 18:07 |
olinguyen | i'm having trouble visualizing it in a time series binary classification setting | 18:07 |
olinguyen | e.g. i have a N time-series of patients heartrate | 18:07 |
olinguyen | and i have binary outcomes at different points in time (next day, next week, next month, next year) | 18:07 |
@HeikoS | you have to think that you will have a pair of (X,y) for every point in time that you predict | 18:08 |
@HeikoS | where X is patient data | 18:08 |
@HeikoS | and y is "dies next ____" | 18:08 |
@HeikoS | and then you have to generate those (x,y) pairs for all time points in the time series | 18:08 |
@HeikoS | for example for every week | 18:08 |
@HeikoS | or every day | 18:08 |
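A toy version of the (X, y) pair generation just described, with made-up numbers: one heart-rate reading per day, a hypothetical day of death, and y meaning "dies within the next `horizon` days":

```python
# toy record: one heart-rate reading per day for one patient
heart_rate = [72, 75, 80, 78, 85, 90, 95, 99, 104, 110]
death_day = 9   # hypothetical: the patient dies on day 9
horizon = 3     # the question asked at each day t: "dies within 3 days?"

pairs = []
for t in range(1, len(heart_rate)):
    X = heart_rate[:t + 1]                  # everything observed up to day t
    y = int(t < death_day <= t + horizon)   # 1 iff death falls in (t, t+horizon]
    pairs.append((X, y))

# the label flips to 1 once death enters the prediction window
assert [y for _, y in pairs] == [0, 0, 0, 0, 0, 1, 1, 1, 0]
```

In the real data X would be summaries extracted from the window rather than the raw readings, and one such (X, y) pair is emitted per patient per day.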
@HeikoS | olinguyen: if it helps, I can write you a little example notebook | 18:10 |
@HeikoS | using toy data in 1d | 18:10 |
olinguyen | yea i think that would be helpful | 18:10 |
olinguyen | would the X be a time series up till outcome y? | 18:10 |
olinguyen | Like if i have 20 measurements of heartrate in the first 24 hours for a patient (at different time intervals), and i have the death outcome at 1 day, 1 week, 1 month | 18:13 |
@HeikoS | olinguyen: well the X is the features you use from the time series up till outcome y | 18:14 |
@HeikoS | so you can decide what to use there | 18:14 |
@HeikoS | -raw value | 18:14 |
@HeikoS | -lagged average | 18:14 |
@HeikoS | -linear fit slope | 18:14 |
@HeikoS | etc | 18:14 |
@HeikoS | all this adds auto-regressive structure | 18:15 |
@HeikoS | your RNN will be just another set of features | 18:15 |
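The summaries HeikoS lists (raw value, lagged average, linear-fit slope) are cheap to compute from the window up to time t. A stdlib sketch, with the window size and function name chosen purely for illustration:

```python
def lag_features(series, t, window=3):
    """Summaries of series[:t+1] usable as X at time t (illustrative names)."""
    recent = series[max(0, t - window + 1):t + 1]
    raw = series[t]
    lagged_avg = sum(recent) / len(recent)
    # least-squares slope over the recent window (closed form for y = a*x + b)
    n = len(recent)
    xs = range(n)
    mean_x = sum(xs) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = (sum((x - mean_x) * (y - lagged_avg) for x, y in zip(xs, recent)) / denom
             if denom else 0.0)
    return [raw, lagged_avg, slope]

# e.g. a steadily rising window [2, 3, 4] has slope 1.0
assert lag_features([1, 2, 3, 4], 3) == [4, 3.0, 1.0]
```

Each choice of summary adds the auto-regressive structure mentioned above, and further learned features (e.g. from an RNN) would just be extra entries in the returned vector.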
@HeikoS | olinguyen: a good way to think about this is if you were to use this as a system in real life | 18:15 |
@HeikoS | olinguyen: and imagine you are doing the decision yourself | 18:16 |
@HeikoS | so you are confronted with the patient record up to time t | 18:16 |
@HeikoS | and then you are asked the question whether the patient will die next week | 18:16 |
@HeikoS | so you can use all information you have of the patient up to time t | 18:16 |
@HeikoS | and you can extract certain summaries from that | 18:16 |
@HeikoS | then you predict a chance of dying | 18:16 |
@HeikoS | then, the next day (t+1), you are asked again whether the patient will die next week | 18:17 |
@HeikoS | so you give another answer | 18:17 |
@HeikoS | and so on | 18:17 |
@HeikoS | now replacing your manual answer with what actually happened will be your training data | 18:17 |
olinguyen | yea thats a nice way to view it | 18:17 |
olinguyen | i see it better | 18:18 |
@HeikoS | I suggest a daily time resolution for the training data | 18:19 |
@HeikoS | and I suggest 1day ahead, 1week ahead, 1month ahead | 18:19 |
@HeikoS | in terms of predicting mortality | 18:19 |
@wiking | mikeling, do the unit tests pass in that PR? | 18:20 |
@wiking | i.e. in https://github.com/shogun-toolbox/shogun/pull/3960 | 18:20 |
olinguyen | ok, i'll give that a shot | 18:20 |
@wiking | or it's a WIP | 18:20 |
olinguyen | HeikoS: i'll finish the RF PR but i'll tackle that next | 18:20 |
olinguyen | thanks! | 18:20 |
@wiking | https://explosion.ai/blog/prodigy-annotation-tool-active-learning | 18:20 |
@HeikoS | olinguyen: yeah one thing at a time | 18:20 |
@wiking | :) | 18:20 |
@HeikoS | olinguyen: for the time series stuff | 18:21 |
@HeikoS | can you prototype a very quick and dirty example of what we just discussed | 18:21 |
@HeikoS | using only heart rate or so | 18:21 |
@HeikoS | so that we make sure we are on the same page | 18:21 |
olinguyen | sure, will do! | 18:21 |
mikeling | wiking: no, but I think we do it commit by commit | 18:21 |
@wiking | mikeling, okey! | 18:22 |
mikeling | Otherwise you'd need to review more than 2000 lines of code at once | 18:22 |
@wiking | https://prodi.gy/ | 18:22 |
@wiking | pip install | 18:22 |
@wiking | man | 18:22 |
@wiking | sometimes i'm like wtf is happening in this world | 18:22 |
@wiking | 'radically efficient' | 18:22 |
@wiking | :) | 18:22 |
@HeikoS | olinguyen: cool! something really quick | 18:23 |
@HeikoS | but where you show that you get the concept of generating the training data, and the lagged features | 18:23 |
Trixis | i should learn to do debugging and testing on small datasets :| | 18:28 |
olinguyen | HeikoS: I'll send a draft by tonight | 18:29 |
@HeikoS | olinguyen: cool! doesn't need to be perfect | 18:29 |
Trixis | wiking: what exactly is that prodi.gy thing. kinda reads like a startup pitch, lol | 18:34 |
@wiking | :) | 18:34 |
-!- HeikoS [~heiko@untrust-out.swc.ucl.ac.uk] has quit [Ping timeout: 240 seconds] | 18:57 | |
Trixis | wiking: this is probably a completely dumb question | 19:07 |
Trixis | but when im setting a custom kernel matrix for classification | 19:07 |
Trixis | how do i get around the dimension check | 19:07 |
Trixis | i mean its obvious its not going to be a square matrix like the matrix the classifier was trained on? or do i create a square matrix, and keep all entries but the ones in the lhs x rhs block 0? | 19:08 |
Trixis | wiking: right its probably because im deleting / inserting the kernel, unfortunately cant get around that b/c cant cast | 19:27 |
Trixis | wiking: i guess the only alternative is to retain a java reference to all customkernels | 19:35 |
Trixis | instead of accessing it through combinedkernel | 19:35 |
Trixis | yep works im an idiot, shouldve thought of it earlier | 19:40 |
-!- mikeling [uid89706@gateway/web/irccloud.com/x-mkoruuyhyohpgnhz] has quit [Quit: Connection closed for inactivity] | 21:37 | |
-!- HeikoS [~heiko@host-92-0-169-11.as43234.net] has joined #shogun | 23:35 | |
-!- mode/#shogun [+o HeikoS] by ChanServ | 23:35 | |
@HeikoS | olinguyen: hi | 23:35 |
olinguyen | hey | 23:35 |
@HeikoS | Ill go to bed soon | 23:35 |
@HeikoS | just saying, Ill be away over the next 2 days, back on Monday | 23:35 |
@HeikoS | I can still review things a bit | 23:35 |
@HeikoS | but have to use my phone ;) | 23:35 |
olinguyen | ok, np | 23:36 |
olinguyen | enjoy your weekend :) | 23:36 |
@HeikoS | you too! | 23:36 |
-!- HeikoS [~heiko@host-92-0-169-11.as43234.net] has quit [Remote host closed the connection] | 23:42 | |
--- Log closed Sat Aug 05 00:00:25 2017 |
Generated by irclog2html.py 2.10.0 by Marius Gedminas - find it at mg.pov.lt!