GETTING STARTED Shogun is split up into libshogun which contains all the machine learning algorithms and 'static interfaces' helpers, the static interfaces python_static, octave_static, matlab_static, r_static and the modular interfaces python_modular, octave_modular and r_modular (all found in the src/interfaces/ subdirectory with corresponding name). See src/INSTALL on how to install shogun. In case one wants to extend shogun the best way is to start using its library. This can be easily done as a number of examples in examples/libshogun document. The simplest libshogun based program would be #include <shogun/base/init.h> int main(int argc, char** argv) { init_shogun(); exit_shogun(); return 0; } which could be compiled with g++ -lshogun minimal.cpp -o minimal and obviously does nothing (apart form initializing and destroying a couple of global shogun objects internally). In case one wants to redirect shoguns output functions SG_DEBUG, SG_INFO, SG_WARN, SG_ERROR, SG_PRINT etc, one has to pass them to init_shogun() as parameters like this void print_message(FILE* target, const char* str) { fprintf(target, "%s", str); } void print_warning(FILE* target, const char* str) { fprintf(target, "%s", str); } void print_error(FILE* target, const char* str) { fprintf(target, "%s", str); } init_shogun(&print_message, &print_warning, &print_error); To finally see some action one has to include the appropriate header files, e.g. we create some features and a gaussian kernel #include <shogun/features/Labels.h> #include <shogun/features/RealFeatures.h> #include <shogun/kernel/GaussianKernel.h> #include <shogun/classifier/svm/LibSVM.h> #include <shogun/base/init.h> #include <shogun/lib/common.h> #include <shogun/io/SGIO.h> #include <stdio.h> void print_message(FILE* target, const char* str) { fprintf(target, "%s", str); } int main(int argc, char** argv) { init_shogun(&print_message); // create some data float64_t* matrix = new float64_t[6]; for (int32_t i=0; i<6; i++) matrix[i]=i; // create three 2-dimensional vectors // shogun will now own the matrix created CRealFeatures* features= new CRealFeatures(); features->set_feature_matrix(matrix, 2, 3); // create three labels CLabels* labels=new CLabels(3); labels->set_label(0, -1); labels->set_label(1, +1); labels->set_label(2, -1); // create gaussian kernel with cache 10MB, width 0.5 CGaussianKernel* kernel = new CGaussianKernel(10, 0.5); kernel->init(features, features); // create libsvm with C=10 and train CLibSVM* svm = new CLibSVM(10, kernel, labels); svm->train(); // apply to training examples for (int32_t i=0; i<3; i++) SG_SPRINT("output[%d]=%f\n", i, svm->apply(i)); // free up memory SG_UNREF(svm); exit_shogun(); return 0; } Now you probably wonder why this example does not leak memory. First of all, supplying pointers to arrays allocated with new[] will make shogun objects own these objects and will make them take care of cleaning them up on object destruction. Then, when creating shogun objects they keep a reference counter internally. Whenever a shogun object is returned or supplied as an argument to some function its reference counter is increased, for example in the example above CLibSVM* svm = new CLibSVM(10, kernel, labels); increases the reference count of kernel and labels. On destruction the reference counter is decreased and the object is freed if the counter is <= 0. It is therefore your duty to prevent objects from destruction if you keep a handle to them globally *which you still intend to use later*. In the example above accessing labels after the call to SG_UNREF(svm) will cause a segmentation fault as the Label object was already destroyed in the SVM destructor. You can do this by SG_REF(obj). To decrement the reference count of an object, call SG_UNREF(obj) which will also automagically destroy it if the counter is <= 0 and set obj=NULL only in this case. Generally, all shogun C++ Objects are prefixed with C, e.g. CSVM and derived from CSGObject. Since variables in the upper class hierarchy, need to be initialized upon construction of the object, the constructor of base class needs to be called in the constructor, e.g. CSVM calls CKernelMachine, CKernelMachine calls CClassifier which finally calls CSGObject. For example if you implement your own SVM called MySVM you would in the constructor do class MySVM : public CSVM { MySVM( ) : CSVM() { ... } }; In case you got your object working we will happily integrate it into shogun provided you follow a number of basic coding conventions detailed below (see FORMATTING for formatting instructions, MACROS on how to use and name macros, TYPES on which types to use, FUNCTIONS on how functions should look like and NAMING CONVENTIONS for the naming scheme. CODING CONVENTIONS: FORMATTING: - indenting uses stroustrup style with tabsize 4, i.e. for emacs use in your ~/.emacs (add-hook 'c-mode-common-hook (lambda () (show-paren-mode 1) (setq indent-tabs-mode t) (c-set-style "stroustrup") (setq tab-width 4))) for vim in ~/.vimrc set cindent " C style indenting set ts=4 " tabstop set sw=4 " shiftwidth - avoid whitespaces at end of lines and never use them for indentation; only ever use tabs for indentations - semicolons and commas ;, should be placed directly after a variable/statement x+=1; set_cache_size(0); for (uint32_t i=0; i<10; i++) ... - brackets () and (greater/lower) equal sign ><= should should not contain unecessary spaces, e.g: int32_t a=1; int32_t b=kernel->compute(); if (a==1) { } exceptions are logical subunits if ( (a==1) && (b==1) ) { } - avoid the use of inline functions where possible (little to zero performance impact). nowadays compilers automagically inline code when beneficial and within the same linking process - breaking long lines and strings limit yourselves to 80 columns for (int32_t vec=params->start; vec<params->end && !CSignal::cancel_computations(); vec++) { //foo } however exceptions are OK if readability is increased (as in function definitions) - don't put multiple assignments on a single line - functions look like int32_t* fun(int32_t* foo) { return foo; } and are separated by a newline, e.g: int32_t* fun1(int32_t* foo1) { return foo; } int32_t* fun2(int32_t* foo2) { return foo2; } - same for if () else clauses, while/for loops if (foo) do_stuff(); if (foo) { do_stuff(); do_more(); } - one empty line between { } block, e.g. for (int32_t i=0; i<17; i++) { // sth } x=1; MACROS & IFDEFS: - use macros sparingly - avoid defining constants using macros (bye bye typechecking), use const int32_t FOO=5; or enums (when defining several realted constants) instead - use ifdefs sparingly (really limit yourself to the ones necessary) as their extreme usage makes the code completely unreadable. to achieve that it may be necessary to wrap a function of (e.g. for pthread_create()/CreateThread()/thread_create() a wrapper function to create a thread and inside of it the ifdefs to do it the solaris/win32/posix way) - if you need to use ifdefs always comment the corresponding #else / #endif in the following way: #ifdef HAVE_LAPACK ... #else //HAVE_LAPACK ... #endif //HAVE_LAPACK TYPES: - types (use only these!): char (8bit char(maybe signed or unsigned)) uint8_t (8bit unsigned char) uint16_t (16bit unsigned short) uint32_t (32bit unsinged int) int32_t (32bit int) int64_t (64bit int) float32_t (32bit float) float64_t (64bit float) floatmax_t (96bit or 128bit float depending on arch) exceptions: file IO / matlab interface - classes must be (directly or indirectly) derived from CSGObject - don't use fprintf/printf, but SG_DEBUG/SG_INFO/SG_WARNING/SG_ERROR/SG_PRINT (if in a from CSGObject derived object) or the static SG_SDEBUG/... functions elsewise FUNCTIONS: - Functions should be short and sweet, and do just one thing. They should fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24, as we all know), and do one thing and do that well. - Another measure of the function is the number of local variables. They shouldn't exceed 5-10, or you're doing something wrong. Re-think the function, and split it into smaller pieces. A human brain can generally easily keep track of about 7 different things, anything more and it gets confused. You know you're brilliant, but maybe you'd like to understand what you did 2 weeks from now. GETTING / SETTING OBJECTS If a class stores a pointer to an object it should call SG_REF(obj) to increase the objects reference count and SG_UNREF(obj) on class desctruction (which will decrease the objects reference count and call the objects destructor if ref_count()==0. Note that the caller (from within C++) of any get_* function returning an object should also call SG_UNREF(obj) when done with the object. This makes the swig wrapped interfaces automagically take care of object destruction. If a class function returns a new object this has to be stated in the corresponding swig .i file for cleanup to work, e.g. if apply() returns a new CLabels then the .i file should contain %newobject CClassifier::apply(); NAMING CONVENTIONS: - naming variables: - in classes are member variables are named like m_feature_vector (to avoid shadowing and the often hard to find bugs shadowing causes) - parameters (in functions) shall be named e.g. feature_vector - don't use meaningless variable names, it is however fine to use short names like i,j,k etc in loops - class names start with 'C', each syllable/subword starts with a capital letter, e.g. CStringFeatures - constants/defined objects are UPPERCASE, i.e. REALVALUED - function are named like get_feature_vector() and should be limited to as few arguments as possible (no monster functions with > 5 arguments please) - objects which can deal with features of type DREAL and class SIMPLE don't need to contain Real/Simple in class name -others are required to clarify class/type they can handle, e.g. CSparseByteLinearKernel, CSparseGaussianKernel -variable and function names are all lowercase (except for class Con/Destructors) syllables/subwords are separated by '_', e.g. compute_kernel_value(), my_local_variable -class member variables all start with m_, e.g. m_member (this is to avoid shadowing) - features -featureclass Simple/Sparse -featuretype Real/Byte/... - preprocessors -featureclass Simple/Sparse -featuretype Real/Byte/... - kernel -featureclass Simple/Sparse -featuretype Real/Byte/... -kerneltype Linear/Gaussian/... VERSIONING SCHEME: - an automagic version will be created from the date of the last svn update. if that is not enough make releases: e.g.: svn cp trunk releases/shogun_0.1.0