========================= Kernel Density Estimation ========================= The *Kernel Density Estimation* (KDE) is a non-parametric way to estimate the `probability density function (PDF) `_ of a **continuous random variable**, as opposed to the `probability mass function (PMF) `_ which is used in the context of discrete random variables (random variables that take values on a countable set). Theory ====== The probability density function of a continuous random variable is the probabilities of occurrence of different possible outcomes of that random variable. The shape of that probability density function across the domain for a random variable is referred to as the **probability distribution**. Common probability distributions have names: uniform, normal, exponential. One can use an histogram to identify whether the density looks like a common probability distribution or not. If it's the case, we can use predefined functions to summarize the relashionship between the observations and their probability. In that case the PDF can be controlled with parameters (parametric density estimation). For example the normal distribution has two parameters: the mean and the standard deviation. Given these two parameters we know the PDF. Sometimes the distribution may not resemble common probability distribution. This is often the case when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution). In this case the parametric density estimation is not feasible. Instead, an algorithm is used to approximate the the probability distribution of the data without a pre-defined distribution, referred to as a non-parametric method. Example of KDE for a bimodal data sample ======================================== .. code:: python import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm from scipy import stats from scipy.stats.kde import gaussian_kde from sklearn.neighbors import KernelDensity from sklearn.model_selection import GridSearchCV sample1 = np.random.normal(loc=20, scale=5, size=300) sample2 = np.random.normal(loc=40, scale=5, size=700) obs_dist = np.hstack((sample1, sample2)) X_plot = np.linspace(obs_dist.min(), obs_dist.max(), 1000) sm_kde_u = sm.nonparametric.KDEUnivariate(obs_dist) sm_kde_u.fit() sp_kde = gaussian_kde(obs_dist) sk_kde = KernelDensity(kernel='gaussian', bandwidth=1.7).fit(np.array(obs_dist)[:, np.newaxis]) log_dens = sk_kde.score_samples(X_plot[:, np.newaxis]) # --------------------------------------------------------------------------------------------------- fig = plt.figure(figsize=(12, 5)) ax = fig.add_subplot(111) # Plot the histrogram ax.hist(obs_dist, bins=20, density=True, label='Histogram from samples', zorder=5, edgecolor='k', alpha=0.5) # Plot the true distribution true_values = (stats.norm.pdf(sm_kde_u.support, obs_dist.mean(), obs_dist.std())) ax.plot(sm_kde_u.support, true_values, lw=3, label='Normal distribution', zorder=15) # Plot the KDEs ax.plot(sm_kde_u.support, sm_kde_u.density, lw=3, label='KDE (Uni) from statsmodels', zorder=10) ax.plot(X_plot, sp_kde(X_plot), 'b', label="KDE from scipy") ax.plot(X_plot, np.exp(log_dens), 'r', lw=2, linestyle='--', label="KDE from sklearn") ax.legend(loc='best') ax.grid(True, zorder=-5) .. figure:: img/2020-09-18-14-08-42.png :width: 600px We can use the cross-validation method with ``GridSearchCV`` from ``sklearn`` to find to best bandwidth: .. code:: python grid = GridSearchCV(KernelDensity(), {'bandwidth': np.linspace(1, 5, 100)}, cv=5) # 5-fold cross-validation grid.fit(np.array(obs_dist)[:, None]) print(grid.best_params_) # {'bandwidth': 2.0101010101010104} KDEs function from ``scipy`` and ``sklearn`` have direct method to generate data from the estimated probability: .. code:: python plt.hist(sp_kde.resample(size=100)[0]); plt.hist(sk_kde.sample(100)); .. figure:: img/2020-09-18-14-10-11.png :width: 300px References ========== * https://machinelearningmastery.com/probability-density-estimation * https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation * https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html * https://www.homeworkhelponline.net/blog/programming/predict-pdf-generate-data * https://abaqus-docs.mit.edu/2017/English/SIMACAEMODRefMap/simamod-c-probdensityfunc.htm :tag:`Machine Learning` :tag:`Python` :tag:`KDE`