SHOGUN
4.0.0

Abstract base class that provides an interface for performing kernel two-sample tests on streaming data, using the Maximum Mean Discrepancy (MMD) as the test statistic. The MMD is the distance between two probability distributions \(p\) and \(q\) in an RKHS (see [1] for a formal description).
\[ \text{MMD}[\mathcal{F},p,q]^2=\textbf{E}_{x,x'}\left[ k(x,x')\right] - 2\textbf{E}_{x,y}\left[ k(x,y)\right] +\textbf{E}_{y,y'}\left[ k(y,y')\right]=\left\| \mu_p - \mu_q \right\|^2_\mathcal{F} \]
where \(x,x'\sim p\) and \(y,y'\sim q\). The data has to be provided as streaming features, which are processed in blocks of a given blocksize. The blocksize determines how many examples are processed at once. A method for getting a specified number of blocks of data is provided; it can optionally merge and permute the data within the current burst. The exact computation of the kernel functions for the MMD is abstract and has to be defined by subclasses, which should return a vector of function values. Please note that for streaming MMD, the number of data points from both distributions has to be equal.
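As an illustration of block-wise processing, the following is a minimal, self-contained sketch of one linear-time estimator of the squared MMD described in [1], applied to in-memory scalar samples. It does not use the Shogun API: the names `gaussian_kernel` and `linear_time_mmd`, the scalar data type, and the kernel width `sigma` are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian kernel on scalars; the width sigma is an illustrative choice.
double gaussian_kernel(double a, double b, double sigma)
{
    const double d = a - b;
    return std::exp(-d * d / (2.0 * sigma * sigma));
}

// Linear-time estimate of the squared MMD from samples x ~ p and y ~ q,
// consumed in bursts of at most `blocksize` samples per distribution.
double linear_time_mmd(const std::vector<double>& x,
                       const std::vector<double>& y,
                       std::size_t blocksize, double sigma)
{
    assert(x.size() == y.size());                  // equal sample counts
    assert(blocksize >= 2 && blocksize % 2 == 0);  // bursts hold whole pairs

    const std::size_t m = x.size();
    double sum = 0.0;
    std::size_t num_terms = 0;

    for (std::size_t start = 0; start + 1 < m; start += blocksize)
    {
        // The last burst may be shorter; an odd trailing sample is dropped.
        const std::size_t end = std::min(start + blocksize, m);
        for (std::size_t i = start; i + 1 < end; i += 2)
        {
            // One term per pair of consecutive samples:
            // k(x, x') + k(y, y') - k(x, y') - k(x', y)
            sum += gaussian_kernel(x[i], x[i + 1], sigma)
                 + gaussian_kernel(y[i], y[i + 1], sigma)
                 - gaussian_kernel(x[i], y[i + 1], sigma)
                 - gaussian_kernel(x[i + 1], y[i], sigma);
            ++num_terms;
        }
    }
    return num_terms > 0 ? sum / static_cast<double>(num_terms) : 0.0;
}
```

Because each term only touches two consecutive samples from each distribution, the result in this sketch does not depend on the blocksize; the blocksize only controls how much data is held in memory at once.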
Along with the statistic comes a method to compute a p-value based on a Gaussian approximation of the null distribution, which is possible in linear time and constant space. Sampling from the null distribution is also possible (no permutations are performed; new examples are used instead). If unsure which one to use, sampling with 250 iterations is always correct (but slow). When the sample size is large (more than 1000), the Gaussian approximation is an accurate and much faster choice.
To choose, call set_null_approximation_method() with one of:
MMD1_GAUSSIAN: Approximates the null distribution with a Gaussian. Only use with at least 1000 samples. If using it, check that the type I error equals the desired value.
PERMUTATION: Samples the null distribution by permuting the available samples.
For kernel selection see CMMDKernelSelection.
[1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671-721.
Definition at line 88 of file StreamingMMD.h.
Public Attributes  
SGIO *  io 
Parallel *  parallel 
Version *  version 
Parameter *  m_parameters 
Parameter *  m_model_selection_parameters 
Parameter *  m_gradient_parameters 
ParameterMap *  m_parameter_map 
uint32_t  m_hash 
Protected Member Functions  
virtual SGVector< float64_t >  compute_squared_mmd (CKernel *kernel, CList *data, index_t num_this_run)=0 
virtual TParameter *  migrate (DynArray< TParameter * > *param_base, const SGParamInfo *target) 
virtual void  one_to_one_migration_prepare (DynArray< TParameter * > *param_base, const SGParamInfo *target, TParameter *&replacement, TParameter *&to_migrate, char *old_name=NULL) 
virtual void  load_serializable_pre () throw (ShogunException) 
virtual void  load_serializable_post () throw (ShogunException) 
virtual void  save_serializable_pre () throw (ShogunException) 
virtual void  save_serializable_post () throw (ShogunException) 
CStreamingMMD  (  ) 
default constructor
Definition at line 40 of file StreamingMMD.cpp.
CStreamingMMD  (  CKernel *  kernel, 
CStreamingFeatures *  p,  
CStreamingFeatures *  q,  
index_t  m,  
index_t  blocksize = 10000 

) 
constructor.
kernel  kernel to use 
p  streaming features p to use 
q  streaming features q to use 
m  number of samples from each distribution 
blocksize  number of examples that are processed at once when computing the statistic/threshold. 
Definition at line 45 of file StreamingMMD.cpp.

virtual 
destructor
Definition at line 60 of file StreamingMMD.cpp.

inherited 
Builds a dictionary of all parameters in SGObject as well as those of SGObjects that are parameters of this object. The dictionary maps parameters to the objects that own them.
dict  dictionary of parameters to be built. 
Definition at line 1243 of file SGObject.cpp.

virtualinherited 
Creates a clone of the current object. This is done via recursively traversing all parameters, which corresponds to a deep copy. Calling equals on the cloned object always returns true although none of the memory of both objects overlaps.
Definition at line 1360 of file SGObject.cpp.
Computes a p-value based on the current method for approximating the null distribution. The p-value is the 1-p quantile of the null distribution in which the given statistic lies.
The method for computing the p-value can be set via set_null_approximation_method(). Since the null distribution is normal, a Gaussian approximation is available.
statistic  statistic value to compute the p-value for 
Reimplemented from CTwoSampleTest.
Definition at line 119 of file StreamingMMD.cpp.
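The Gaussian case can be sketched as follows. This is a standalone illustration, not the Shogun implementation: the function name `gaussian_p_value` is hypothetical, and the null distribution is assumed to be a zero-mean Gaussian whose standard deviation is supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

// p-value of `statistic` under a zero-mean Gaussian null distribution with
// standard deviation std_dev: P(Z >= statistic) = 1 - Phi(statistic / std_dev).
double gaussian_p_value(double statistic, double std_dev)
{
    assert(std_dev > 0.0);
    // 1 - Phi(z) equals 0.5 * erfc(z / sqrt(2)); using erfc avoids the
    // cancellation that 1 - cdf suffers from for large statistics.
    return 0.5 * std::erfc(statistic / (std_dev * std::sqrt(2.0)));
}
```

A statistic of zero yields a p-value of 0.5 under this assumption, and the p-value shrinks toward zero as the statistic grows relative to the standard deviation.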

protectedpure virtual 
abstract method that computes the squared MMD
kernel  the kernel to be used for computing MMD. This will be useful when multiple kernels are used 
data  the list of data on which kernels are computed. The order of data in the list is \(x,x',\cdots\sim p\) followed by \(y,y',\cdots\sim q\). It is assumed that the delete_data flag is set inside the list 
num_this_run  number of data points in current blocks 
Implemented in CLinearTimeMMD.

virtual 
Computes the squared MMD for the current data. This is an unbiased estimate. This method relies on compute_statistic_and_variance, which has to be defined in the subclasses.
Note that the underlying streaming feature parser has to be started before this is called. Otherwise, a deadlock occurs.
Implements CKernelTwoSampleTest.
Definition at line 85 of file StreamingMMD.cpp.
Same as compute_statistic(), but with the option to evaluate multiple kernels at once.
multiple_kernels  if true, and underlying kernel is K_COMBINED, method will be executed on all subkernels on the same data 
Implements CKernelTwoSampleTest.
Definition at line 95 of file StreamingMMD.cpp.

pure virtual 
Same as compute_statistic_and_variance, but computes a linear time estimate of the covariance of the multiple-kernel MMD. See [1] for details.
Implemented in CLinearTimeMMD.

pure virtual 
Abstract method that computes MMD and a linear time variance estimate. If multiple_kernels is set to true, each subkernel is evaluated on the same data.
statistic  return parameter for statistic, vector with entry for each kernel. May be allocated before but does not have to be 
variance  return parameter for variance, vector with entry for each kernel. May be allocated before but does not have to be 
multiple_kernels  optional flag; if set to true, it is assumed that the underlying kernel is of type K_COMBINED. The MMD is then computed on each subkernel separately rather than on the combination. This is used by kernel selection strategies that need to evaluate multiple kernels on the same data. Since the linear time MMD works on streaming data, one cannot simply compute the MMD, change the kernel, and compute it again, because the data would be different for every kernel. 
Implemented in CLinearTimeMMD.
Computes a threshold based on the current method for approximating the null distribution. The threshold is the value that a statistic has to have in order to reject the null hypothesis.
The method for computing the threshold can be set via set_null_approximation_method(). Since the null distribution is normal, a Gaussian approximation is available.
alpha  test level to reject the null hypothesis 
Reimplemented from CTwoSampleTest.
Definition at line 142 of file StreamingMMD.cpp.
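For the Gaussian case, a threshold can be sketched as the point where the upper-tail probability of the null distribution drops to alpha. The code below is an illustration under the assumption of a zero-mean Gaussian null distribution; the names `upper_tail` and `gaussian_threshold` are hypothetical, and bisection is used only because the C++ standard library offers no inverse normal CDF.

```cpp
#include <cassert>
#include <cmath>

// Upper-tail probability of a zero-mean Gaussian with standard deviation std_dev.
double upper_tail(double value, double std_dev)
{
    return 0.5 * std::erfc(value / (std_dev * std::sqrt(2.0)));
}

// Smallest statistic value whose upper-tail probability is at most alpha,
// located by bisection on the monotonically decreasing tail function.
double gaussian_threshold(double alpha, double std_dev)
{
    assert(alpha > 0.0 && alpha < 1.0 && std_dev > 0.0);
    double lo = -40.0 * std_dev;  // tail probability is numerically 1 here
    double hi = 40.0 * std_dev;   // tail probability is numerically 0 here
    for (int iter = 0; iter < 200; ++iter)
    {
        const double mid = 0.5 * (lo + hi);
        if (upper_tail(mid, std_dev) > alpha)
            lo = mid;  // still too likely under the null: move right
        else
            hi = mid;  // tail is small enough: tighten from the right
    }
    return hi;
}
```

A statistic above the returned value would be rejected at level alpha in this sketch, which matches comparing the corresponding Gaussian p-value against alpha.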

virtual 
Computes a linear time estimate of the variance of the squared MMD, which may be used for an approximation of the null distribution. The value is the variance of the vector of which the MMD is the mean.
Definition at line 109 of file StreamingMMD.cpp.
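To make the relation between the statistic and its variance estimate concrete, here is a small sketch that summarizes a vector of per-pair terms. The type `TermSummary` and the function `summarize_terms` are hypothetical names. Dividing the sample variance by the number of terms to obtain the variance of the mean is the standard argument for independent terms; whether and where Shogun applies that scaling is not stated here and is treated as an assumption.

```cpp
#include <cassert>
#include <vector>

struct TermSummary
{
    double mean;              // the statistic: average of the terms
    double variance;          // unbiased sample variance of the terms
    double variance_of_mean;  // variance / n, assuming independent terms
};

// Computes the mean of the terms and two variance figures in two passes.
TermSummary summarize_terms(const std::vector<double>& terms)
{
    assert(terms.size() >= 2);  // sample variance needs at least two terms
    const double n = static_cast<double>(terms.size());

    double sum = 0.0;
    for (double t : terms) sum += t;
    const double mean = sum / n;

    double squared_dev = 0.0;
    for (double t : terms) squared_dev += (t - mean) * (t - mean);
    const double variance = squared_dev / (n - 1.0);

    return TermSummary{mean, variance, variance / n};
}
```

In a streaming setting the two passes would be replaced by running sums, so that no term has to be stored; the two-pass form is used here only for readability.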

virtualinherited 
A deep copy. All the instance variables will also be copied.
Definition at line 200 of file SGObject.cpp.
Recursively compares the current SGObject to another one. Compares all registered numerical parameters and recurses into complex (SGObject) parameters. Does not compare pointers!
May be overridden, but please do so with care! This should not be necessary in most cases.
other  object to compare with 
accuracy  accuracy to use for comparison (optional) 
tolerant  allows lenient check on float equality (within accuracy) 
Definition at line 1264 of file SGObject.cpp.

virtualinherited 
Definition at line 86 of file KernelTwoSampleTest.h.

inherited 
Definition at line 127 of file TwoSampleTest.h.

inherited 
Definition at line 1135 of file SGObject.cpp.

inherited 
Returns the description of a given parameter string, if it exists. Raises SG_ERROR otherwise.
param_name  name of the parameter 
Definition at line 1159 of file SGObject.cpp.

inherited 
Returns the index of the model selection parameter with the provided name
param_name  name of model selection parameter 
Definition at line 1172 of file SGObject.cpp.

virtual 
Implements CKernelTwoSampleTest.
Reimplemented in CLinearTimeMMD.
Definition at line 269 of file StreamingMMD.h.

virtual 
Not implemented for streaming MMD since it uses streaming features
Reimplemented from CTwoSampleTest.
Definition at line 307 of file StreamingMMD.cpp.

pure virtualinherited 
returns the statistic type of this test statistic
Implemented in CQuadraticTimeMMD, CNOCCO, CHSIC, and CLinearTimeMMD.

virtual 
Getter for streaming features of p distribution.
Definition at line 314 of file StreamingMMD.cpp.

virtual 
Getter for streaming features of q distribution.
Definition at line 320 of file StreamingMMD.cpp.

virtualinherited 
If the SGSerializable is a class template then TRUE will be returned and GENERIC is set to the type of the generic.
generic  set to the type of the generic if returning TRUE 
Definition at line 297 of file SGObject.cpp.

inherited 
maps all parameters of this instance to the provided file version and loads all parameter data from the file into an array, which is sorted (basically calls load_file_parameter(...) for all parameters and puts all results into a sorted array)
file_version  parameter version of the file 
current_version  version from which mapping begins (you want to use Version::get_version_parameter() for this in most cases) 
file  file to load from 
prefix  prefix for members 
Definition at line 704 of file SGObject.cpp.

inherited 
Loads some specified parameters from a file with a specified version. The provided parameter info has a version which is recursively mapped until the file parameter version is reached. Note that the mapping may contain multiple parameters; therefore, a set of TParameter instances is returned
param_info  information of parameter 
file_version  parameter version of the file, must be <= provided parameter version 
file  file to load from 
prefix  prefix for members 
Definition at line 545 of file SGObject.cpp.

virtualinherited 
Load this object from file. If loading fails (returning FALSE), this object will contain inconsistent data and should not be used!
file  where to load from 
prefix  prefix for members 
param_version  (optional) a parameter version different to (this is mainly for testing, better do not use) 
Definition at line 374 of file SGObject.cpp.

protectedvirtualinherited 
Can (optionally) be overridden to post-initialize some member variables which are not PARAMETER::ADD'ed. Make sure that at first the overridden method BASE_CLASS::LOAD_SERIALIZABLE_POST is called.
ShogunException  will be thrown if an error occurs. 
Reimplemented in CKernel, CWeightedDegreePositionStringKernel, CList, CAlphabet, CLinearHMM, CGaussianKernel, CInverseMultiQuadricKernel, CCircularKernel, and CExponentialKernel.
Definition at line 1062 of file SGObject.cpp.

protectedvirtualinherited 
Can (optionally) be overridden to pre-initialize some member variables which are not PARAMETER::ADD'ed. Make sure that at first the overridden method BASE_CLASS::LOAD_SERIALIZABLE_PRE is called.
ShogunException  will be thrown if an error occurs. 
Reimplemented in CDynamicArray< T >, CDynamicArray< float64_t >, CDynamicArray< float32_t >, CDynamicArray< int32_t >, CDynamicArray< char >, CDynamicArray< bool >, and CDynamicObjectArray.
Definition at line 1057 of file SGObject.cpp.

inherited 
Takes a set of TParameter instances (base) with a certain version and a set of target parameter infos and recursively maps the base, level by level, to the current version using CSGObject::migrate(...). The base is replaced. After this call, the parameters in the base should have the same version/type as the initial target parameter infos. Note that for this to work, the migrate methods and all the internal parameter mappings have to match.
param_base  set of TParameter instances that are mapped to the provided target parameter infos 
base_version  version of the parameter base 
target_param_infos  set of SGParamInfo instances that specify the target parameter base 
Definition at line 742 of file SGObject.cpp.

protectedvirtualinherited 
Creates a new TParameter instance, which contains migrated data from the version that is provided. The provided parameter data base is used for migration; this base is a collection of all parameter data of the previous version. Migration is done FROM the data in param_base TO the provided param info. Migration is always one version step. The method has to be implemented in subclasses; if no match is found, the base method has to be called.
If there is an element in the param_base which equals the target, a copy of the element is returned. This represents the case when nothing has changed and therefore, the migrate method is not overloaded in a subclass
param_base  set of TParameter instances to use for migration 
target  parameter info for the resulting TParameter 
Definition at line 949 of file SGObject.cpp.

protectedvirtualinherited 
This method prepares everything for a one-to-one parameter migration. One-to-one here means that only ONE element of the parameter base is needed for the migration (the one with the same name as the target). Data is allocated for the target (in the type as provided in the target SGParamInfo), and a corresponding new TParameter instance is written to replacement. The to_migrate pointer points to the single TParameter instance needed for migration. If a name change happened, the old name may be specified by old_name. In addition, the m_delete_data flag of to_migrate is set to true. So if you want to migrate data, the only thing to do after this call is converting the data in the m_parameter fields. If unsure how to use it, have a look at an example (base_migration_type_conversion.cpp, for example).
param_base  set of TParameter instances to use for migration 
target  parameter info for the resulting TParameter 
replacement  (used as output) here the TParameter instance which is returned by migration is created into 
to_migrate  the only source that is used for migration 
old_name  with this parameter, a name change may be specified 
Definition at line 889 of file SGObject.cpp.

virtualinherited 
Definition at line 263 of file SGObject.cpp.

inherited 
Performs the complete two-sample test on current data and returns a binary answer whether the null hypothesis is rejected or not.
This is just a wrapper for the above perform_test() method that returns a p-value. If this p-value lies below the test level alpha, the null hypothesis is rejected.
Should not be overwritten in subclasses. (Therefore not virtual)
alpha  test level alpha. 
Definition at line 121 of file HypothesisTest.cpp.

virtual 
Performs the complete two-sample test on current data and returns a p-value.
If the null distribution is estimated with MMD1_GAUSSIAN, the statistic and the p-value are computed in the same loop, which is more efficient than first computing the statistic and then computing the p-value.
In case of sampling null, superclass method is called.
The method for computing the p-value can be set via set_null_approximation_method().
Reimplemented from CHypothesisTest.
Definition at line 165 of file StreamingMMD.cpp.

inherited 
prints all parameters registered for model selection and their type
Definition at line 1111 of file SGObject.cpp.

virtualinherited 
prints registered parameters out
prefix  prefix for members 
Definition at line 309 of file SGObject.cpp.
Mimics sampling from the null distribution for MMD. However, samples are not permuted but constantly streamed and then merged. Usually, this is not necessary since there is the Gaussian approximation for the null distribution. However, in certain cases this may fail, and sampling the null distribution might be numerically more stable. Overrides the superclass method that merges samples.
Reimplemented from CKernelTwoSampleTest.
Definition at line 194 of file StreamingMMD.cpp.

virtualinherited 
Save this object to file.
file  where to save the object; will be closed during returning if PREFIX is an empty string. 
prefix  prefix for members 
param_version  (optional) a parameter version different to (this is mainly for testing, better do not use) 
Definition at line 315 of file SGObject.cpp.

protectedvirtualinherited 
Can (optionally) be overridden to post-initialize some member variables which are not PARAMETER::ADD'ed. Make sure that at first the overridden method BASE_CLASS::SAVE_SERIALIZABLE_POST is called.
ShogunException  will be thrown if an error occurs. 
Reimplemented in CKernel.
Definition at line 1072 of file SGObject.cpp.

protectedvirtualinherited 
Can (optionally) be overridden to pre-initialize some member variables which are not PARAMETER::ADD'ed. Make sure that at first the overridden method BASE_CLASS::SAVE_SERIALIZABLE_PRE is called.
ShogunException  will be thrown if an error occurs. 
Reimplemented in CKernel, CDynamicArray< T >, CDynamicArray< float64_t >, CDynamicArray< float32_t >, CDynamicArray< int32_t >, CDynamicArray< char >, CDynamicArray< bool >, and CDynamicObjectArray.
Definition at line 1067 of file SGObject.cpp.
void set_blocksize  (  index_t  blocksize  ) 
Setter for the blocksize of examples to be processed at once
blocksize  new blocksize to use 
Definition at line 226 of file StreamingMMD.h.

inherited 
set generic type to T
Definition at line 42 of file SGObject.cpp.

inherited 
set the parallel object
parallel  parallel object to use 
Definition at line 243 of file SGObject.cpp.

inherited 
set the version object
version  version object to use 
Definition at line 284 of file SGObject.cpp.

virtualinherited 
Setter for the underlying kernel
kernel  new kernel to use 
Definition at line 77 of file KernelTwoSampleTest.h.

inherited 
m  number of samples from first distribution p 
Definition at line 162 of file TwoSampleTest.cpp.

virtualinherited 
sets the method used to approximate the null distribution
null_approximation_method  method to use 
Definition at line 61 of file HypothesisTest.cpp.

virtualinherited 
sets the number of permutation iterations for sample_null()
num_null_samples  how often permutation shall be done 
Definition at line 67 of file HypothesisTest.cpp.

virtual 
Not implemented for streaming MMD since it uses streaming features
Reimplemented from CTwoSampleTest.
Definition at line 301 of file StreamingMMD.cpp.
void set_simulate_h0  (  bool  simulate_h0  ) 
simulate_h0  if true, samples from p and q will be mixed and permuted 
Definition at line 263 of file StreamingMMD.h.

virtualinherited 
A shallow copy. All the SGObject instance variables will be simply assigned and SG_REFed.
Reimplemented in CGaussianKernel.
Definition at line 194 of file SGObject.cpp.
Streams num_blocks blocks of data from each distribution, each block holding num_this_run data points. If m_simulate_h0 is set, the blocks are merged, shuffled, and redistributed between the blocks.
num_blocks  number of blocks to be streamed from each distribution 
num_this_run  number of data points to be streamed for one block 
Definition at line 220 of file StreamingMMD.cpp.
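The merge, shuffle, and redistribute step can be illustrated on in-memory data. The sketch below is not the Shogun implementation: the function name `merge_and_permute`, the use of `std::vector<double>` for a block, and the caller-supplied `std::mt19937` generator are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Merges one block from p with one block from q, shuffles the merged data,
// and splits it back into two blocks of the original sizes. After this call
// both blocks contain a mixture of samples from p and q.
void merge_and_permute(std::vector<double>& block_p,
                       std::vector<double>& block_q,
                       std::mt19937& rng)
{
    assert(!block_p.empty() && !block_q.empty());
    const std::ptrdiff_t n_p = static_cast<std::ptrdiff_t>(block_p.size());

    std::vector<double> merged(block_p);
    merged.insert(merged.end(), block_q.begin(), block_q.end());
    std::shuffle(merged.begin(), merged.end(), rng);

    block_p.assign(merged.begin(), merged.begin() + n_p);
    block_q.assign(merged.begin() + n_p, merged.end());
}
```

Computing the statistic on blocks treated this way yields a draw under the null hypothesis, since both blocks are then samples from the same mixture of p and q.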

inherited 
unset generic type
this has to be called in classes specializing a template class
Definition at line 304 of file SGObject.cpp.

virtualinherited 
Updates the hash of current parameter combination
Definition at line 250 of file SGObject.cpp.

inherited 
io
Definition at line 496 of file SGObject.h.

protected 
Number of examples processed at once, i.e. in one burst
Definition at line 296 of file StreamingMMD.h.

inherited 
parameters wrt which we can compute gradients
Definition at line 511 of file SGObject.h.

inherited 
Hash of parameter values
Definition at line 517 of file SGObject.h.

protectedinherited 
underlying kernel
Definition at line 121 of file KernelTwoSampleTest.h.

protectedinherited 
defines the first index of samples of q
Definition at line 139 of file TwoSampleTest.h.

inherited 
model selection parameters
Definition at line 508 of file SGObject.h.

protectedinherited 
Defines how the null distribution is approximated
Definition at line 177 of file HypothesisTest.h.

protectedinherited 
number of iterations for sampling from the null distribution
Definition at line 174 of file HypothesisTest.h.

protectedinherited 
concatenated samples of the two distributions (two blocks)
Definition at line 136 of file TwoSampleTest.h.

inherited 
map for different parameter versions
Definition at line 514 of file SGObject.h.

inherited 
parameters
Definition at line 505 of file SGObject.h.

protected 
If this is true, samples will be mixed between p and q in any method that computes the statistic
Definition at line 300 of file StreamingMMD.h.

protected 
Streaming feature objects that are used instead of merged samples
Definition at line 290 of file StreamingMMD.h.

protected 
Streaming feature objects that are used instead of merged samples
Definition at line 293 of file StreamingMMD.h.

inherited 
parallel
Definition at line 499 of file SGObject.h.

inherited 
version
Definition at line 502 of file SGObject.h.