Kernel Density Estimation

by Parijat Mazumdar (GitHub ID: mazumdarparijat)

This notebook is on using the Shogun Machine Learning Toolbox for kernel density estimation (KDE). We start with a brief overview of KDE. Then we demonstrate the use of Shogun's $KernelDensity$ class on a toy example. Finally, we apply KDE to a real world example, thus demonstrating the its prowess as a non-parametric statistical method.

Brief overview of Kernel Density Estimation

Kernel Density Estimation (KDE) is a non-parametric way of estimating the probability density function (pdf) of ANY distribution given a finite number of its samples. The pdf of a random variable X given finite samples ($x_i$s), as per KDE formula, is given by:

$$pdf(x)=\frac{1}{nh} \Sigma_{i=1}^n K(\frac{||x-x_i||}{h})$$

In the above equation, K() is called the kernel - a symmetric function that integrates to 1. h is called the kernel bandwidth which controls how smooth (or spread-out) the kernel is. The most commonly used kernel is the normal distribution function.

KDE is a computationally expensive method. Given $N_1$ query points (i.e. the points where we want to compute the pdf) and $N_2$ samples, computational complexity of KDE is $\mathcal{O}(N_1.N_2.D)$ where D is the dimension of the data. This computational load can be reduced by spatially segregating data points using data structures like KD-Tree and Ball-Tree. In single tree methods, only the sample points are structured in a tree whereas in dual tree methods both sample points and query points are structured in respective trees. Using these tree structures enables us to compute the density estimate for a bunch of points together at once thus reducing the number of required computations. This speed-up, however, results in reduced accuracy. Greater the speed-up, lower the accuracy. Therefore, in practice, the maximum amount of speed-up that can be afforded is usually controlled by error tolerance values.

KDE on toy data

Let us learn about KDE in Shogun by estimating a mixture of 2 one-dimensional gaussian distributions. $$pdf(x) = \frac{1}{2} [\mathcal{N}(\mu_1,\sigma_1) + \mathcal{N}(\mu_2,\sigma_2)]$$ We start by plotting the actual distribution and generating the required samples (i.e. $x_i$s).

In []:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# generates samples from the distribution
def generate_samples(n_samples,mu1,sigma1,mu2,sigma2):
    samples1 = np.random.normal(mu1,sigma1,(1,n_samples/2.0))
    samples2 = np.random.normal(mu2,sigma2,(1,n_samples/2.0))
    samples = np.concatenate((samples1,samples2),1)
    return samples

# parameters of the distribution
mu1=4
sigma1=1
mu2=8
sigma2=2

# number of samples
n_samples = 200
samples=generate_samples(n_samples,mu1,sigma1,mu2,sigma2)

# pdf function for plotting
x = np.linspace(0,15,500)
y = 0.5*(stats.norm(mu1,sigma1).pdf(x)+stats.norm(mu2,sigma2).pdf(x))

# plot samples
plt.plot(samples[0,:],np.zeros(n_samples),'rx',label="Samples")
# plot actual pdf
plt.plot(x,y,'b--',label="Actual pdf")
plt.legend(numpoints=1)
plt.show()

Now, we will apply KDE to estimate the actual pdf using the samples. Using KDE in Shogun is a 3 stage process : setting the model parameters, supplying sample data points for training and supplying query points for getting log of pdf estimates.

In []:
from modshogun import KernelDensity, RealFeatures, K_GAUSSIAN, D_EUCLIDEAN, EM_KDTREE_SINGLE

def get_kde_result(bandwidth,samples):
    # set model parameters
    kernel_type = K_GAUSSIAN
    dist_metric = D_EUCLIDEAN # other choice is D_MANHATTAN
    eval_mode = EM_KDTREE_SINGLE # other choices are EM_BALLTREE_SINGLE, EM_KDTREE_DUAL and EM_BALLTREE_DUAL
    leaf_size = 1 # min number of samples to be present in leaves of the spatial tree
    abs_tol = 0 # absolute tolerance
    rel_tol = 0 # relative tolerance i.e. accepted error as fraction of true density
    k=KernelDensity(bandwidth,kernel_type,dist_metric,eval_mode,leaf_size,abs_tol,rel_tol)

    # form Shogun features and train
    train_feats=RealFeatures(samples)
    k.train(train_feats)

    # get log density
    query_points = np.array([np.linspace(0,15,500)])
    query_feats = RealFeatures(query_points)
    log_pdf = k.get_log_density(query_feats)
    return query_points,log_pdf

query_points,log_pdf=get_kde_result(0.5,samples)

We have calculated log of pdf. Let us see how accurate it is by comparing it with the actual pdf.

In []:
import matplotlib.pyplot as plt
% matplotlib inline

def plot_pdf(samples,query_points,log_pdf,title):
    plt.plot(samples,np.zeros((1,samples.size)),'rx')
    plt.plot(query_points[0,:],np.exp(log_pdf),'r',label="Estimated pdf")
    plt.plot(x,y,'b--',label="Actual pdf")
    plt.title(title)
    plt.legend()
    plt.show()
    
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=0.5')

We see that the estimated pdf resembles the actual pdf with reasonable accuracy. This is a small demonstration of the fact that KDE can be used to estimate any arbitrary distribution given a finite number of it's samples.

Effect of bandwidth

Kernel bandwidth is a very important controlling parameter of the kernel density estimate. We have already seen that for bandwidth of 0.5, the estimated pdf almost coincides with the actual pdf. Let us see what happens when we decrease or increase the value of the kernel bandwidth keeping number of samples constant at 200.

In []:
query_points,log_pdf=get_kde_result(0.1,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=0.1')

query_points,log_pdf=get_kde_result(0.2,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=0.2')

query_points,log_pdf=get_kde_result(0.5,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=0.5')

query_points,log_pdf=get_kde_result(1.1,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=1.1')

query_points,log_pdf=get_kde_result(1.5,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=1.5')

From the above plots, it can be inferred that the kernel bandwidth controls the extent of smoothness of the pdf function. Low value of bandwidth parameter causes under-smoothing (which is the case with the first 2 plots from top) and high value causes over-smoothing (as it is the case with the bottom 2 plots). The perfect value of the kernel bandwidth should be estimated using model-selection techniques which is presently not supported by Shogun (to be updated soon).

Effect of number of samples

Here, we see the effect of the number of samples on the estimated pdf, fine-tuning bandwidth in each case such that we get the most accurate pdf.

In []:
samples=generate_samples(20,mu1,sigma1,mu2,sigma2)
query_points,log_pdf=get_kde_result(0.7,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=20, bandwidth=0.7')

samples=generate_samples(200,mu1,sigma1,mu2,sigma2)
query_points,log_pdf=get_kde_result(0.5,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=200, bandwidth=0.5')

samples=generate_samples(2000,mu1,sigma1,mu2,sigma2)
query_points,log_pdf=get_kde_result(0.4,samples)
plot_pdf(samples,query_points,log_pdf,'num_samples=2000, bandwidth=0.4')

Firstly, We see that the estimated pdf becomes more accurate with increasing number of samples. By running the above snippent multiple times, we also notice that the variation in the shape of estimated pdf, between 2 different runs of the above code snippet, is highest when the number of samples is 20 and lowest when the number of samples is 2000. Therefore, we can say that with increase in the number of samples, the stability of the estimated pdf increases. Both the results can be explained using the intuitive fact that a larger number of samples gives a better picture of the entire distribution. A formal proof of the same has been presented by L. Devroye in his book "Nonparametric Density Estimation: The $L_1$ View" [3]. It is theoretically proven that as the number of samples tends to $\infty$, the estimated pdf converges to the real pdf.

Classification using KDE

In this section we see how KDE can be used for classification using a generative approach. Here, we try to classify the different varieties of Iris plant making use of Fisher's Iris dataset borrowed from the UCI Machine Learning Repository. There are 3 varieties of Iris plants:

  • Iris Sensosa
  • Iris Versicolour
  • Iris Virginica

The Iris dataset enlists 4 features that can be used to segregate these varieties, but for ease of analysis and visualization, we only use two of the most important features (ie. features with very high class correlations)[refer to summary statistics] namely

  • petal length
  • petal width

As a first step, we plot the data.

In []:
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline

f = open('../../../data/uci/iris/iris.data')
features = []
# read data from file
for line in f:
	words = line.rstrip().split(',')
	features.append([float(i) for i in words[0:4]])

f.close()

# create observation matrix
obsmatrix = np.array(features).T

# Just keep 2 most important features
obsmatrix = obsmatrix[2:4,:]

# plot the data
def plot_samples(marker='o',plot_show=True):
    # First 50 data belong to Iris Sentosa, plotted in green
    plt.plot(obsmatrix[0,0:50], obsmatrix[1,0:50], marker, color='green', markersize=5,label='Iris Sentosa')
    # Next 50 data belong to Iris Versicolour, plotted in red
    plt.plot(obsmatrix[0,50:100], obsmatrix[1,50:100], marker, color='red', markersize=5,label='Iris Versicolour')
    # Last 50 data belong to Iris Virginica, plotted in blue
    plt.plot(obsmatrix[0,100:150], obsmatrix[1,100:150], marker, color='blue', markersize=5,label='Iris Virginica')

    if plot_show:
        plt.xlim(0,8)
        plt.ylim(-1,3)
        plt.title('3 varieties of Iris plants')
        plt.xlabel('petal length')
        plt.ylabel('petal width')
        plt.legend(numpoints=1,bbox_to_anchor=(0.97,0.35))
        plt.show()
    
plot_samples()

Next, let us use the samples to estimate the probability density functions of each category of plant.

In []:
from modshogun import KernelDensity, RealFeatures, K_GAUSSIAN, D_EUCLIDEAN, EM_BALLTREE_DUAL
import scipy.interpolate as interpolate

def get_kde(samples):
    # set model parameters
    bandwidth = 0.4
    kernel_type = K_GAUSSIAN
    dist_metric = D_EUCLIDEAN
    eval_mode = EM_BALLTREE_DUAL
    leaf_size = 1
    abs_tol = 0
    rel_tol = 0
    k=KernelDensity(bandwidth,kernel_type,dist_metric,eval_mode,leaf_size,abs_tol,rel_tol)
    
    # form Shogun features and train
    train_feats=RealFeatures(samples)
    k.train(train_feats)
    
    return k

def density_estimate_grid(kdestimator):
    xmin,xmax,ymin,ymax=[0,8,-1,3]

    # Set up a regular grid of interpolation points
    x, y = np.linspace(xmin, xmax, 100), np.linspace(ymin, ymax, 100)
    x, y = np.meshgrid(x, y)

    # compute density estimate at each of the grid points
    query_feats=RealFeatures(np.array([x[0,:],y[0,:]]))
    z=np.array([kdestimator.get_log_density(query_feats)])
    z=np.exp(z)
    for i in xrange(1,x.shape[0]):
        query_feats=RealFeatures(np.array([x[i,:],y[i,:]]))
        zi=np.exp(kdestimator.get_log_density(query_feats))
        z=np.vstack((z,zi))

    return (x,y,z)

def plot_pdf(kdestimator,title):

    # compute interpolation points and corresponding kde values
    x,y,z=density_estimate_grid(kdestimator)
    
    # plot pdf
    plt.imshow(z, vmin=z.min(), vmax=z.max(), origin='lower',extent=[x.min(), x.max(), y.min(), y.max()])
    plt.title(title)
    plt.colorbar(shrink=0.5)
    plt.xlabel('petal length')
    plt.ylabel('petal width')
    plt.show()

    
kde1=get_kde(obsmatrix[:,0:50])
plot_pdf(kde1,'pdf for Iris Sentosa')

kde2=get_kde(obsmatrix[:,50:100])
plot_pdf(kde2,'pdf for Iris Versicolour')

kde3=get_kde(obsmatrix[:,100:150])
plot_pdf(kde3,'pdf for Iris Virginica')

kde=get_kde(obsmatrix[:,0:150])
plot_pdf(kde,'Combined pdf')

The above contour plots depict the pdf of respective categories of iris plant. These probability density functions can be used as generative models to estimate the likelihood of any test sample belonging to a particular category. We use these likelihoods for classification by forming a simple decision rule: a test sample is assigned the class for which it's likelihood is maximum. With this in mind, let us try to segregate the entire 2-D space into 3 regions :

  • Iris Sentosa (green)
  • Iris Versicolour (red)
  • Iris Virginica (blue)
In []:
# get 3 likelihoods for each test point in grid
x,y,z1=density_estimate_grid(kde1)
x,y,z2=density_estimate_grid(kde2)
x,y,z3=density_estimate_grid(kde3)

# classify using our decision rule
z=[]
for i in xrange(0,x.shape[0]):
    zj=[]
    for j in xrange(0,x.shape[1]):
        if ((z1[i,j]>z2[i,j]) and (z1[i,j]>z3[i,j])):
            zj.append(1)
        elif (z2[i,j]>z3[i,j]):
            zj.append(2)
        else:
            zj.append(0)
            
    z.append(zj)

z=np.array(z)

# plot results
plt.imshow(z, vmin=z.min(), vmax=z.max(), origin='lower',extent=[x.min(), x.max(), y.min(), y.max()])
plt.title("Classified regions")
plt.xlabel('petal length')
plt.ylabel('petal width')
plot_samples(marker='x',plot_show=False)
plt.show()

The classified regions that we get almost match our intuitive expectation. In the plot, the exact position of the 2 decision boundaries can be controlled by varying the kernel bandwidth parameter.

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science

[2] Wikipedia page on Kernel Density Estimation. http://en.wikipedia.org/wiki/Kernel_density_estimation

[3] L. Devroye and L. Gyorfi.Nonparametric Density Estimation: The $L_1$ View. Wiley, 1985