--- Log opened Sat Dec 17 00:00:19 2011 | ||
-!- puneetgoyal [~puneetgoy@117.197.177.239] has joined #shogun | 05:56 | |
-!- Ram108 [~amma@14.96.21.81] has joined #shogun | 06:38 | |
-!- puneetgoyal [~puneetgoy@117.197.177.239] has quit [Quit: Leaving] | 10:30 | |
-!- puneetgoyal [~puneetgoy@117.197.177.239] has joined #shogun | 10:30 | |
-!- blackburn [~blackburn@31.28.51.215] has joined #shogun | 11:07 | |
@sonney2k | blackburn, I think we should try to get this fast k-means code into shogun: http://web.engr.oregonstate.edu/~shindler/kMeansCode/ | 11:36 |
---|---|---|
@sonney2k | shouldn't be too difficult so if someone asks for what to do... | 11:36 |
@sonney2k | also, porting birch (java) to c++ would make sense | 11:36 |
blackburn | sonney2k: hmm sure | 11:37 |
@sonney2k | and I would propose some new branch of algorithms for sampling from data, i.e. getting a uniform subset | 11:37 |
@sonney2k | or some other more complicated ways | 11:37 |
blackburn | you are nipsted | 11:38 |
blackburn | :D | 11:38 |
@sonney2k | there is more | 11:38 |
@sonney2k | someone here had a fast cross-validation scheme | 11:38 |
@sonney2k | that would be very cool stuff for heiko | 11:38 |
blackburn | sorry but I'm really out of time till new year | 11:38 |
blackburn | fast CV scheme? how can it be fast? | 11:39 |
@sonney2k | btw, we have now shogun in debian unstable again... | 11:39 |
blackburn | did you package it? | 11:39 |
@sonney2k | it does CV on small subsets of the data first | 11:39 |
@sonney2k | and this way can throw away lots of combinations to be tested. then it increases these subsets slowly - speedups of 2 orders of magnitude without loss are possible | 11:40 |
blackburn | sonney2k: I now need some binary tree classifier that makes it possible to use, say, LDA for multiclass | 11:42 |
blackburn | what do you think, is it useful? | 11:42 |
@sonney2k | I am currently messing with python -builtin | 11:42 |
blackburn | what is builtin? | 11:42 |
@sonney2k | no more .py files but direct py objects | 11:43 |
@sonney2k | an option for swig | 11:43 |
@sonney2k | blackburn, you are talking about a certain error correcting codes scheme right? | 11:43 |
blackburn | sonney2k: yes, it is a variant of binary tree classifier | 11:43 |
@sonney2k | to convert any binary classifier into a multiclass classifier? | 11:44 |
blackburn | yes | 11:44 |
@sonney2k | there was another very nice algorithm at nips for that | 11:44 |
@sonney2k | some boosting with minimal set of features - massively many classes - but very fast and accurate | 11:44 |
@sonney2k | ahh and a binary tree thingy too - that sounded very reasonable work - let me check if I find it | 11:45 |
blackburn | I need something like that for my road signs work | 11:47 |
@sonney2k | how many classes do you have? | 11:47 |
blackburn | now 43, but will have much more | 11:47 |
blackburn | some of them should be grouped for sure | 11:48 |
blackburn | like red signs, blue signs, etc | 11:48 |
@sonney2k | one is ShareBoost: Efficient multiclass learning with feature sharing | 11:48 |
@sonney2k | S. Shalev-Shwartz, Y. Wexler, A. Shashua | 11:48 |
blackburn | got it thanks | 11:49 |
@sonney2k | the other is http://nips.cc/Conferences/2011/Program/event.php?ID=2647 | 11:49 |
@sonney2k | this is probably more what you could directly use | 11:50 |
@sonney2k | it infers the tree and learns the SVMs / LDA / whatever | 11:50 |
@sonney2k | seemed very reasonable | 11:50 |
blackburn | yes, I had a similar idea before I found out it was already done :D | 11:51 |
@sonney2k | blackburn, ohh yes please add pointers to all of that in the bts :-) | 11:51 |
blackburn | bts? | 11:51 |
@sonney2k | these are all very worthwhile ideas for gsoc | 11:51 |
@sonney2k | bug tracking system | 11:51 |
@sonney2k | aka github issues | 11:51 |
blackburn | ah yes I'll add it | 11:51 |
@sonney2k | blackburn, that is the thing with the sampling http://nips.cc/Conferences/2011/Program/event.php?ID=3000 | 11:52 |
@sonney2k | they use it to do fast GMM estimation | 11:52 |
@sonney2k | i.e. they sample from the data and then on that sample estimate the GMM | 11:52 |
@sonney2k | but the sampling is very clever... | 11:52 |
blackburn | pretty simple idea heh | 11:53 |
blackburn | well and my point is that everything should be as simple as it can be | 11:54 |
@sonney2k | yeah - and easy to implement... we just need some new class of 'sampler' algorithms | 11:58 |
@sonney2k | they get as input some data and then return an index with a subset :) | 11:58 |
blackburn | sonney2k: TreeMulticlassMachine? | 12:00 |
@sonney2k | no I think this should be under classifier | 12:01 |
@sonney2k | MulticlassTreeClassifier ? | 12:01 |
blackburn | why classifier? | 12:05 |
blackburn | ah yes | 12:05 |
15SAAI18M | shogun: Sergey Lisitsyn master * rce5c547 / src/shogun/converter/DiffusionMaps.h : Mentioned paper on DiffusionMasp - http://git.io/jmqxHg | 12:11 |
blackburn | sonney2k: is your TU mail available for you still? | 12:12 |
@sonney2k | better use my debian.org address | 12:12 |
@sonney2k | it is but not for long I think | 12:12 |
blackburn | sonney2k: I'm asking cause Ori Cohen wrote to both of us, have you seen? | 12:13 |
blackburn | it is a guy who was working on C# examples too | 12:13 |
@sonney2k | yes I have seen - I don't know what to do about it | 12:14 |
@sonney2k | I thought you added him to NEWS? | 12:14 |
blackburn | sonney2k: yes, he said it is ok for him | 12:15 |
@sonney2k | btw we have to update the website to point to github issues, right? | 12:15 |
@sonney2k | what is the url btw? | 12:15 |
blackburn | ehm | 12:16 |
blackburn | https://github.com/shogun-toolbox/shogun/issues | 12:16 |
blackburn | ? | 12:16 |
@sonney2k | yes that one, let me do the update | 12:16 |
@sonney2k | then I can also include Ori in the NEWS on the site | 12:16 |
blackburn | yes please do then | 12:17 |
@sonney2k | done | 12:19 |
@sonney2k | btw, we got our windows7 buildbot | 12:19 |
@sonney2k | I just didn't have time to administer it | 12:19 |
blackburn | heh nice | 12:20 |
-!- puneetgoyal [~puneetgoy@117.197.177.239] has quit [Ping timeout: 240 seconds] | 12:36 | |
-!- puneetgoyal [~puneetgoy@117.197.166.8] has joined #shogun | 12:48 | |
@sonney2k | blackburn, just one thought - should we add some print / string function to show a compact output of shogun objects? | 12:49 |
@sonney2k | e.g. they could show their name and list parameters this way? | 12:49 |
blackburn | sonney2k: well why not | 12:50 |
blackburn | sonney2k: not the crucial thing though, no idea how to use it | 12:50 |
@sonney2k | well you could just do | 12:51 |
@sonney2k | x=GaussianKernel() | 12:51 |
@sonney2k | print x | 12:51 |
@sonney2k | and then it will not say | 12:51 |
@sonney2k | <Swig Object of type 'shogun::CGaussianKernel *' at 0x7fe290d0db90> | 12:51 |
@sonney2k | but | 12:51 |
@sonney2k | GaussianKernel - Parameters width=1 | 12:52 |
blackburn | sure, I understand | 12:52 |
@sonney2k | for features it could show the same kind of summary we have for numpy arrays | 12:52 |
@sonney2k | rather useful I would say | 12:53 |
puneetgoyal | hey :), why do we generally use this gaussian kernel ? | 13:04 |
puneetgoyal | I mean..in most examples, I had seen this kernel only | 13:05 |
blackburn | puneetgoyal: it has some nice features | 13:07 |
blackburn | like 'virtual' infinite-dimensional Hilbert space mapping hah :) | 13:07 |
puneetgoyal | ok | 13:11 |
-!- naywhayare [~ryan@128.61.149.136] has joined #shogun | 13:36 | |
-!- Ram108 [~amma@14.96.21.81] has quit [Remote host closed the connection] | 17:29 | |
puneetgoyal | hello, I was trying to tokenize emails and now a bit close to it...I wanted to know if there is a way to parse all the files containing the email data to store them in a matrix ? | 18:27 |
blackburn | hmm how? | 18:28 |
puneetgoyal | how I made tokens from an email..or How I wanna parse all files? | 18:30 |
blackburn | how can you store email data in matrix?.. | 18:31 |
puneetgoyal | to tokenize emails...I used the email package and its various modules | 18:31 |
puneetgoyal | I can extract various information from an email that will be used to calculate the probability of an email being a spam or a ham using that email package of python | 18:32 |
puneetgoyal | and store them in a matrix | 18:32 |
blackburn | is it a matrix of probabilities? | 18:34 |
puneetgoyal | no, I guess probabilities will be calculated after I train my system using some emails ? | 18:35 |
blackburn | so it is a token matrix? | 18:36 |
puneetgoyal | yes | 18:36 |
blackburn | how do you plan to use it? | 18:37 |
puneetgoyal | should I use some other method for training? | 18:37 |
blackburn | you may feel free to use any but I have doubts | 18:38 |
blackburn | cause stringfeatures in shogun support just a list of strings | 18:38 |
blackburn | but not a list of list of strings | 18:38 |
puneetgoyal | ok, so from where should I proceed? | 18:41 |
blackburn | puneetgoyal: I would suggest you compute a similarity measure with some technique you write yourself | 18:42 |
blackburn | and then form a similarity matrix to train an SVM or so | 18:42 |
blackburn | puneetgoyal: for example you can count identical tokens | 18:53 |
puneetgoyal | blackburn: sry, forgot to reply...I was reading about similarity measures | 18:54 |
blackburn | np | 18:54 |
puneetgoyal | identical tokens? | 18:54 |
blackburn | puneetgoyal: ['this','is','spam'] is 1.0 to ['this','is','spam'], but 0.6667 to ['this','is','sparta'] | 18:55 |
puneetgoyal | blackburn: yes, but while testing right? I mean the list I will compare the mail with...would be made after training | 18:57 |
blackburn | not sure I understood you | 18:57 |
puneetgoyal | I mean suppose the first list you gave is a mail you want to check, and second is the list you already have...that you know is a spam or a ham | 18:58 |
blackburn | so? | 18:58 |
puneetgoyal | but to get the second list, you will first have to get some training data | 18:58 |
blackburn | yes | 18:59 |
blackburn | well just get some training mails, determine their status | 18:59 |
blackburn | and form a matrix containing the similarity between the i-th and j-th mails | 18:59 |
puneetgoyal | ok, and would have to write the respective weights against each of the keywords | 19:00 |
puneetgoyal | ok, I will make a module to construct this matrix asap | 19:02 |
blackburn | puneetgoyal: hey why do you hurry? | 19:09 |
puneetgoyal | blackburn: I don't have anything else to do now...so can spend the whole time over this :D | 19:10 |
-!- ishaanmlhtr [~ishaan@115.241.187.65] has joined #shogun | 19:14 | |
-!- ishaanmlhtr [~ishaan@115.241.187.65] has quit [Ping timeout: 240 seconds] | 19:31 | |
-!- ishaanmlhtr [~ishaan@115.241.221.69] has joined #shogun | 20:06 | |
-!- ishaanmlhtr [~ishaan@115.241.221.69] has quit [Ping timeout: 240 seconds] | 22:29 | |
-!- ishaanmlhtr [~ishaan@115.241.221.69] has joined #shogun | 22:31 | |
-!- ishaanmlhtr [~ishaan@115.241.221.69] has quit [Ping timeout: 244 seconds] | 22:42 | |
-!- ishaanmlhtr [~ishaan@115.241.221.69] has joined #shogun | 22:44 | |
-!- puneetgoyal [~puneetgoy@117.197.166.8] has quit [Quit: Leaving] | 23:16 | |
--- Log closed Sun Dec 18 00:00:19 2011 |
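
The token-overlap idea from 18:55 to 18:59 (count identical tokens, then form a matrix of similarities between the i-th and j-th mails) can be sketched roughly as follows. The names `token_similarity` and `similarity_matrix` are hypothetical and not part of shogun. The measure used here, shared distinct tokens divided by the size of the larger token set, is an assumption chosen because it reproduces the 1.0 and 0.6667 values quoted in the log; the log does not say which formula was meant, and nothing here checks whether the resulting matrix is a valid kernel for an SVM.

```python
def token_similarity(a, b):
    """Shared distinct tokens divided by the size of the larger token set.

    Assumed measure (not from shogun): it gives 1.0 for identical lists and
    2/3 for ['this','is','spam'] vs ['this','is','sparta'].
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty mails are treated as identical
    return len(set_a & set_b) / max(len(set_a), len(set_b))


def similarity_matrix(mails):
    """Build the symmetric matrix whose (i, j) entry compares mail i with mail j."""
    n = len(mails)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = token_similarity(mails[i], mails[j])
            matrix[i][j] = s
            matrix[j][i] = s
    return matrix


# Tiny illustrative "training set" of already tokenized mails.
mails = [
    ["this", "is", "spam"],
    ["this", "is", "sparta"],
    ["hello", "world"],
]
K = similarity_matrix(mails)
```

Each row of `K` pairs with the known spam/ham status of the corresponding training mail; a new mail would be compared against the same training mails to get its row of similarities.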
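
The 'sampler' interface mentioned at 11:58 (take some data as input, return an index with a subset) might look roughly like this for the simplest case from 11:37, a uniform subset. The function name `uniform_sampler` and its parameters are hypothetical and not part of shogun, and the cleverer sampling scheme from the linked NIPS paper is not reproduced here.

```python
import random


def uniform_sampler(data, subset_size, seed=None):
    """Return sorted indices of a uniform random subset of `data`.

    Sampling is without replacement; `seed` makes the choice reproducible.
    """
    if not 0 <= subset_size <= len(data):
        raise ValueError("subset_size must be between 0 and len(data)")
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(data)), subset_size))


# Illustrative data: five 2-dimensional points.
data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
indices = uniform_sampler(data, 3, seed=0)
subset = [data[i] for i in indices]
```

Returning indices rather than copied rows keeps the sampler independent of how the data is stored, which matches the "return an index with a subset" wording in the log.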