Unconditional case

KDE

class kdelearn.kde.KDE(kernel_name: str = 'gaussian')[source]

Kernel density estimator with product kernel:

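\[\hat{f}(x) = \sum_{i=1}^m w_{i} \prod_{j=1}^n \frac{1}{h_j} K \left( \frac{x_{j} - x_{i, j}}{h_j} \right), \quad x \in \mathbb{R}^n\]

Here \(m\) is the number of training points, \(w_i\) are their weights, \(h_j\) is the bandwidth for coordinate \(j\), and \(K\) is the one-dimensional kernel function.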

Read more here.
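The product-kernel formula above can be written out directly. The following is a minimal plain-Python sketch for a Gaussian kernel, not the library's implementation; the names gaussian_kernel and product_kernel_pdf are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the one-dimensional kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel_pdf(x, x_train, weights, bandwidth):
    """Evaluate the product-kernel estimate at a single point x.

    x         -- sequence of n coordinates
    x_train   -- m training points, each a sequence of n coordinates
    weights   -- m non-negative weights w_i that sum to 1
    bandwidth -- n positive smoothing parameters h_j
    (Hypothetical helper, written for illustration only.)
    """
    total = 0.0
    for w_i, x_i in zip(weights, x_train):
        # Product over coordinates j = 1..n of K((x_j - x_ij) / h_j) / h_j.
        prod = 1.0
        for x_j, x_ij, h_j in zip(x, x_i, bandwidth):
            prod *= gaussian_kernel((x_j - x_ij) / h_j) / h_j
        total += w_i * prod
    return total


x_train = [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]]
weights = [1.0 / len(x_train)] * len(x_train)  # equal weights
bandwidth = [0.8, 1.2]
density = product_kernel_pdf([0.0, 1.0], x_train, weights, bandwidth)
```

With a single training point, unit weight and unit bandwidth, the estimate reduces to the kernel itself, which is a quick sanity check for any implementation of the formula.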

Parameters:

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde import KDE
>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> # Fit
>>> kde = KDE("gaussian").fit(x_train)

References

[1] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[2] Wand, M. P., Jones M.C. Kernel Smoothing. Chapman and Hall, 1995.

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the estimator.

pdf(x_test)

Compute probability density.

sample()

fit(x_train: ndarray, weights_train: ndarray | None = None, bandwidth: ndarray | None = None, bandwidth_method: str = 'direct_plugin', **kwargs) → KDE[source]

Fit the estimator.

Parameters:
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the estimator.

  • weights_train (ndarray of shape (m_train,), optional) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='direct_plugin') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.
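To give a feel for what a 'normal_reference' style method computes, here is a sketch of the classical one-dimensional rule of thumb for a Gaussian kernel, \(h = (4/3)^{1/5} \sigma m^{-1/5} \approx 1.06\, \sigma m^{-1/5}\), applied to each coordinate separately. Whether kdelearn uses exactly this constant, this estimate of \(\sigma\), or this per-coordinate treatment is an assumption; normal_reference_bandwidth is a hypothetical helper, not a library function.

```python
import math
import statistics


def normal_reference_bandwidth(x_train):
    """Per-coordinate rule-of-thumb bandwidth for a Gaussian kernel.

    Uses h_j = (4 / 3) ** (1 / 5) * sigma_j * m ** (-1 / 5), the
    one-dimensional normal-reference rule, for every coordinate j.
    (Hypothetical helper; the library's own rule may differ.)
    """
    m = len(x_train)
    n = len(x_train[0])
    constant = (4.0 / 3.0) ** 0.2  # about 1.06
    bandwidth = []
    for j in range(n):
        column = [row[j] for row in x_train]
        sigma_j = statistics.stdev(column)  # sample standard deviation
        bandwidth.append(constant * sigma_j * m ** (-0.2))
    return bandwidth


x_train = [[-1.0, 10.0], [0.0, 20.0], [1.0, 30.0]]
bandwidth = normal_reference_bandwidth(x_train)  # one value per coordinate
```

The result has shape (n,), so it can play the role of the bandwidth argument described above; the second coordinate gets a tenfold larger value because its spread is ten times larger.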

Returns:

self – Fitted self instance of KDE.

Return type:

object

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> bandwidth = np.full((n,), 1.0)
>>> # Fit the estimator
>>> kde = KDE().fit(x_train, weights_train, bandwidth)

pdf(x_test: ndarray) → ndarray[source]

Compute probability density.

Parameters:

x_test (ndarray of shape (m_test, n)) – Argument of the estimator - array containing data points with float type.

Returns:

scores – Estimated probability density at each test point in x_test.

Return type:

ndarray of shape (m_test,)

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train = np.random.normal(0, 1, (m_train, n))
>>> x_test = np.linspace(-3, 3, m_test).reshape(-1, 1)
>>> # Fit the estimator
>>> kde = KDE().fit(x_train)
>>> # Compute pdf
>>> scores = kde.pdf(x_test)  # shape of scores: (10,)

KDEClassification

class kdelearn.kde_tasks.KDEClassification(kernel_name: str = 'gaussian')[source]

Bayes’ classifier based on kernel density estimation.

Probability that \(x\) belongs to class \(c\):

\[P(C=c|X=x) \propto \pi_c \hat{f}_c(X=x)\]

The predicted class label for \(x\) is the class \(c\) with the highest posterior probability:

\[\underset{c}{\mathrm{argmax}} \quad P(C=c|X=x)\]

Read more here.
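The decision rule above amounts to scoring each class by its prior times its class-conditional density estimate and taking the argmax. The sketch below shows this in one dimension with a Gaussian kernel and a single shared bandwidth; class_pdf, predict_label and the dictionary layout are hypothetical simplifications, not the KDEClassification API.

```python
import math


def gaussian_kernel(u):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def class_pdf(x, points, h):
    """One-dimensional KDE with equal weights for the points of one class."""
    return sum(gaussian_kernel((x - p) / h) / h for p in points) / len(points)


def predict_label(x, points_by_class, prior_prob, h):
    """Return the label c that maximises prior_prob[c] * f_c(x)."""
    scores = {
        label: prior_prob[label] * class_pdf(x, points, h)
        for label, points in points_by_class.items()
    }
    return max(scores, key=scores.get)


# Two well-separated classes with equal priors (illustrative data).
points_by_class = {1: [-0.5, 0.0, 0.4], 2: [2.6, 3.0, 3.3]}
prior_prob = {1: 0.5, 2: 0.5}
labels_pred = [
    predict_label(x, points_by_class, prior_prob, h=0.5) for x in (0.1, 2.9)
]
```

Because only the argmax matters, the scores do not need to be normalised into proper posterior probabilities.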

Parameters:

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEClassification
>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit
>>> classifier = KDEClassification("gaussian").fit(x_train, labels_train)

References

[1] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

Methods

fit(x_train, labels_train[, weights_train, ...])

Fit the classifier.

pdfs(x_test)

Compute pdf of each class.

predict(x_test)

Predict class labels.

fit(x_train: ndarray, labels_train: ndarray, weights_train: ndarray | None = None, bandwidths: ndarray | None = None, bandwidth_method: str = 'direct_plugin', share_bandwidth: bool = False, prior_prob: ndarray | None = None, **kwargs)[source]

Fit the classifier.

Parameters:
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the classifier.

  • labels_train (ndarray of shape (m_train,)) – Class labels of x_train containing data with int type.

  • weights_train (ndarray of shape (m_train,), default=None) – Weights for data points. If None, all data points are equally weighted.

  • bandwidths (ndarray of shape (n_classes, n), optional) – Smoothing parameters for scaling the estimators of each class. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='direct_plugin') – Name of bandwidth selection method used to compute bandwidths when they are not given explicitly.

  • share_bandwidth (bool, default=False) – Determines whether all classes should have common bandwidth. If False, estimator of each class gets its own bandwidth.

  • prior_prob (ndarray of shape (n_classes,), default=None) – Prior probabilities of each class. If None, all classes are equally probable.

Returns:

self – Fitted self instance of KDEClassification.

Return type:

object

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full((m_train // 2,), 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full((m_train // 2,), 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit
>>> prior_prob = np.array([0.3, 0.7])
>>> params = (x_train, labels_train, weights_train)
>>> classifier = KDEClassification().fit(*params, prior_prob=prior_prob)

predict(x_test: ndarray) → ndarray[source]

Predict class labels.

Parameters:

x_test (ndarray of shape (m_test, n)) – Data points to classify - array containing data points with float type.

Returns:

labels_pred – Predicted class labels containing data with int type.

Return type:

ndarray of shape (m_test,)

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit the classifier
>>> x_test = np.linspace(-3, 6, m_test).reshape(-1, 1)
>>> classifier = KDEClassification().fit(x_train, labels_train)
>>> # Predict labels
>>> labels_pred = classifier.predict(x_test)  # shape: (10,)

pdfs(x_test: ndarray) → ndarray[source]

Compute pdf of each class.

Parameters:

x_test (ndarray of shape (m_test, n)) – Argument of each class estimator - array containing data points with float type.

Returns:

scores – Predicted scores as an array containing data with float type.

Return type:

ndarray of shape (m_test, n_classes)

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit the classifier
>>> x_test = np.linspace(-3, 6, m_test).reshape(-1, 1)
>>> classifier = KDEClassification().fit(x_train, labels_train)
>>> # Compute pdf of each class
>>> scores = classifier.pdfs(x_test)  # shape: (10, 2)

KDEOutliersDetection

class kdelearn.kde_tasks.KDEOutliersDetection(kernel_name: str = 'gaussian')[source]

Outlier detection based on kernel density estimation.

Read more here.

Parameters:

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEOutliersDetection
>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> # Fit the outliers detector
>>> outliers_detector = KDEOutliersDetection("gaussian").fit(x_train)

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the outliers detector.

predict(x_test)

Predict labels.

fit(x_train: ndarray, weights_train: ndarray | None = None, bandwidth: ndarray | None = None, bandwidth_method: str = 'direct_plugin', r: float = 0.05, **kwargs)[source]

Fit the outliers detector.

Parameters:
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the detector.

  • weights_train (ndarray of shape (m_train,), default=None) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='direct_plugin') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.

  • r (float, default=0.05) – Threshold separating outliers and inliers.
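One plausible reading of the threshold r is as a quantile order: the density is evaluated at every training point, the empirical quantile of order r of those values becomes the cut-off, and test points whose density does not exceed it are labelled as outliers. This interpretation is an assumption that this page does not confirm, and fit_threshold and predict_outliers below are hypothetical helpers in one dimension.

```python
import math


def gaussian_kernel(u):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_pdf(x, points, h):
    """One-dimensional KDE with equal weights."""
    return sum(gaussian_kernel((x - p) / h) / h for p in points) / len(points)


def fit_threshold(points, h, r):
    """Empirical quantile of order r of the densities at the training points.

    Assumes r is a quantile order in (0, 1); this is an illustrative choice.
    """
    densities = sorted(kde_pdf(p, points, h) for p in points)
    k = max(0, math.ceil(r * len(points)) - 1)
    return densities[k]


def predict_outliers(x_test, points, h, threshold):
    """Label 1 (outlier) where the density does not exceed the threshold, else 0."""
    return [1 if kde_pdf(x, points, h) <= threshold else 0 for x in x_test]


# Nine points near zero and one isolated point (illustrative data).
points = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.8, 1.1, 6.0]
threshold = fit_threshold(points, h=0.5, r=0.1)
labels_pred = predict_outliers([0.1, 5.0], points, h=0.5, threshold=threshold)
```

Under this reading, a larger r raises the cut-off and marks a larger share of low-density points as outliers.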

Returns:

self – Fitted self instance of KDEOutliersDetection.

Return type:

object

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit the outliers detector
>>> params = (x_train, weights_train)
>>> outliers_detector = KDEOutliersDetection().fit(*params, r=0.05)

predict(x_test: ndarray) → ndarray[source]

Predict labels.

Parameters:

x_test (ndarray of shape (m_test, n)) – Argument of the detector - array containing data points with float type.

Returns:

labels_pred – Predicted labels (0 - inlier, 1 - outlier) containing data with int type.

Return type:

ndarray of shape (m_test,)

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> x_test = np.linspace(-3, 3, m_test).reshape(-1, 1)
>>> # Fit the outliers detector
>>> outliers_detector = KDEOutliersDetection().fit(x_train, r=0.1)
>>> # Predict the labels
>>> labels_pred = outliers_detector.predict(x_test)  # shape: (10,)

KDEClustering

class kdelearn.kde_tasks.KDEClustering[source]

Clustering based on kernel density estimation.

Read more here.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEClustering
>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> # Fit
>>> clustering = KDEClustering().fit(x_train)

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the model.

predict(x_test[, algorithm, epsilon, delta])

Predict cluster labels.

fit(x_train: ndarray, weights_train: ndarray | None = None, bandwidth: ndarray | None = None, bandwidth_method: str = 'direct_plugin', **kwargs)[source]

Fit the model.

Parameters:
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the model.

  • weights_train (ndarray of shape (m_train,), optional) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='direct_plugin') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.

Returns:

self – Fitted self instance of KDEClustering.

Return type:

object

Examples

>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit
>>> clustering = KDEClustering().fit(x_train, weights_train)

predict(x_test: ndarray, algorithm: str = 'mean_shift', epsilon: float = 1e-08, delta: float = 0.001)[source]

Predict cluster labels.

Parameters:
  • x_test (ndarray of shape (m_test, n)) – Data points to be grouped - array containing data points with float type.

  • algorithm ({'gradient_ascent', 'mean_shift'}, default='mean_shift') – Name of clustering algorithm.

  • epsilon (float, default=1e-8) – Threshold for difference (euclidean distance) of data point position while shifting. When the difference is less than epsilon, data point is no longer shifted.

  • delta (float, default=1e-3) – Acceptance error (euclidean distance) between shifted data point and representative of cluster. If the error is less than delta, data point is assigned to cluster represented by cluster representative.
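The roles of epsilon and delta can be illustrated with a small one-dimensional mean-shift sketch: each point is shifted towards a mode of the Gaussian KDE until a step is shorter than epsilon, then assigned to the first cluster representative within delta, or made a new representative. How kdelearn actually chooses representatives is not stated here, so that part is an assumption; shift_to_mode, cluster and max_iter are hypothetical names.

```python
import math


def shift_to_mode(x, points, h, epsilon, max_iter=10_000):
    """Mean-shift iteration for a Gaussian kernel with equal weights.

    Stops when the position changes by less than epsilon. max_iter is a
    hypothetical safety cap, not a parameter of the documented method.
    """
    for _ in range(max_iter):
        kernel_weights = [math.exp(-0.5 * ((x - p) / h) ** 2) for p in points]
        x_new = sum(k * p for k, p in zip(kernel_weights, points)) / sum(kernel_weights)
        if abs(x_new - x) < epsilon:
            return x_new
        x = x_new
    return x


def cluster(x_test, points, h, epsilon=1e-8, delta=1e-3):
    """Assign a cluster label to every point in x_test."""
    representatives = []  # one shifted position per discovered cluster
    labels = []
    for x in x_test:
        mode = shift_to_mode(x, points, h, epsilon)
        for label, rep in enumerate(representatives):
            if abs(mode - rep) < delta:
                labels.append(label)
                break
        else:
            # No representative within delta: start a new cluster.
            representatives.append(mode)
            labels.append(len(representatives) - 1)
    return labels


# Two well-separated groups (illustrative data).
points = [-0.3, -0.1, 0.0, 0.2, 2.8, 3.0, 3.1, 3.3]
labels_pred = cluster(points, points, h=0.5)
```

Keeping epsilon much smaller than delta matters: points that converge to the same mode stop at slightly different positions, and delta must be large enough to absorb that difference.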

Returns:

labels_pred – Predicted cluster labels containing data with int type.

Return type:

ndarray of shape (m_test,)

Examples

>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> # Fit the model
>>> clustering = KDEClustering().fit(x_train)
>>> # Predict cluster labels
>>> labels_pred = clustering.predict(x_train)