Unconditional case

KDE

class kdelearn.kde.KDE(kernel_name: str = 'gaussian')[source]

Kernel density estimator with product kernel:

\[\hat{f}(x) = \sum_{i=1}^m w_{i} \prod_{j=1}^n \frac{1}{h_j} K \left( \frac{x_{j} - x_{i, j}}{h_j} \right), \quad x \in \mathbb{R}^n\]

Read more here.
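
The estimator formula above can be written almost literally in NumPy. The following is only an illustrative sketch, assuming a Gaussian kernel; `gaussian_kernel` and `product_kde_pdf` are hypothetical helper names and are not part of kdelearn.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, applied elementwise."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def product_kde_pdf(x_test, x_train, weights, bandwidth):
    """Evaluate a weighted product-kernel density estimate.

    Shapes: x_test (m_test, n), x_train (m_train, n),
    weights (m_train,), bandwidth (n,).
    """
    # Scaled differences (x_j - x_{i,j}) / h_j, shape (m_test, m_train, n)
    u = (x_test[:, None, :] - x_train[None, :, :]) / bandwidth
    # Product over the n coordinates of K(u) / h_j, shape (m_test, m_train)
    kernel_values = np.prod(gaussian_kernel(u) / bandwidth, axis=2)
    # Weighted sum over the training points, shape (m_test,)
    return kernel_values @ weights


rng = np.random.default_rng(0)
x_train = rng.normal(0, 1, size=(200, 2))
weights = np.full(200, 1 / 200)
bandwidth = np.array([0.5, 0.5])
x_test = np.array([[0.0, 0.0], [5.0, 5.0]])
scores = product_kde_pdf(x_test, x_train, weights, bandwidth)
```

Broadcasting builds all pairwise differences at once, which is simple to read but uses memory proportional to m_test * m_train * n.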

Parameters

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde import KDE
>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> # Fit
>>> kde = KDE("gaussian").fit(x_train)

References

[1] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[2] Wand, M. P., Jones, M. C. Kernel Smoothing. Chapman and Hall, 1995.

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the estimator.

pdf(x_test)

Compute probability density.

sample()

fit(x_train: ndarray, weights_train: Optional[ndarray] = None, bandwidth: Optional[ndarray] = None, bandwidth_method: str = 'normal_reference', **kwargs)[source]

Fit the estimator.

Parameters
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the estimator.

  • weights_train (ndarray of shape (m_train,), optional) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='normal_reference') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.
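
As an illustration of what a normal reference bandwidth looks like, here is a minimal sketch. It assumes the common multivariate rule of thumb for a Gaussian kernel, h_j = sigma_j * (4 / ((n + 2) * m))^(1 / (n + 4)); the exact formula behind 'normal_reference' in kdelearn is not stated here, and `normal_reference_bandwidth` is a hypothetical helper name.

```python
import numpy as np


def normal_reference_bandwidth(x_train):
    """Rule-of-thumb bandwidth per dimension (assumed formula, Gaussian kernel)."""
    m, n = x_train.shape
    # Sample standard deviation of each coordinate, shape (n,)
    sigma = np.std(x_train, axis=0, ddof=1)
    # Factor shrinks slowly as the sample size m grows
    factor = (4.0 / ((n + 2) * m)) ** (1.0 / (n + 4))
    return sigma * factor


rng = np.random.default_rng(0)
x_train = rng.normal(0, 1, size=(100, 3))
bandwidth = normal_reference_bandwidth(x_train)  # shape (3,)
```

An array computed this way has the shape (n,) expected by the bandwidth parameter, so it could be passed explicitly instead of relying on bandwidth_method.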

Returns

self – Fitted self instance of KDE.

Return type

object

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> bandwidth = np.full((n,), 1.0)
>>> # Fit the estimator
>>> kde = KDE().fit(x_train, weights_train, bandwidth)

pdf(x_test: ndarray) → ndarray[source]

Compute probability density.

Parameters

x_test (ndarray of shape (m_test, n)) – Data points at which the density is evaluated, as an array of floats.

Returns

scores – Computed estimation of probability densities for testing data points x_test.

Return type

ndarray of shape (m_test,)

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> x_test = np.linspace(-3, 3, m_test).reshape(-1, 1)
>>> # Fit the estimator
>>> kde = KDE().fit(x_train)
>>> # Compute pdf
>>> scores = kde.pdf(x_test)  # shape of scores: (10,)

KDEClassification

class kdelearn.kde_tasks.KDEClassification(kernel_name: str = 'gaussian')[source]

Bayes’ classifier based on kernel density estimation.

Probability that \(x\) belongs to class \(c\):

\[P(C=c|X=x) \propto \pi_c \hat{f}_c(X=x)\]

The predicted class label for \(x\) is the class \(c\) with the highest probability:

\[\underset{c}{\mathrm{argmax}} \quad P(C=c|X=x)\]

Read more here.
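
The decision rule above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the library's implementation: it uses a one-dimensional, equally weighted Gaussian KDE per class with a fixed bandwidth of 0.5, and `kde_pdf_1d` is a hypothetical helper.

```python
import numpy as np


def kde_pdf_1d(x_test, x_train, h):
    """Equal-weight Gaussian KDE in one dimension."""
    u = (x_test[:, None] - x_train[None, :]) / h
    norm = len(x_train) * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm


rng = np.random.default_rng(0)
x_class1 = rng.normal(0, 1, size=50)
x_class2 = rng.normal(3, 1, size=50)
class_labels = np.array([1, 2])
prior = np.array([0.5, 0.5])  # pi_c, assumed equal for both classes

x_test = np.array([-1.0, 0.0, 3.0, 4.0])
# Column c holds the class density f_c(x), shape (m_test, n_classes)
densities = np.column_stack(
    [kde_pdf_1d(x_test, x_class1, 0.5), kde_pdf_1d(x_test, x_class2, 0.5)]
)
# Unnormalized posterior pi_c * f_c(x); argmax selects the predicted class
labels_pred = class_labels[np.argmax(prior * densities, axis=1)]
```

Because only the argmax matters, the posterior does not need to be normalized to make a prediction.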

Parameters

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEClassification
>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit
>>> classifier = KDEClassification("gaussian").fit(x_train, labels_train)

References

[1] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

Methods

fit(x_train, labels_train[, weights_train, ...])

Fit the classifier.

pdfs(x_test)

Compute pdf of each class.

predict(x_test)

Predict class labels.

fit(x_train: ndarray, labels_train: ndarray, weights_train: Optional[ndarray] = None, bandwidths: Optional[ndarray] = None, bandwidth_method: str = 'normal_reference', share_bandwidth: bool = False, prior_prob: Optional[ndarray] = None, **kwargs)[source]

Fit the classifier.

Parameters
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the classifier.

  • labels_train (ndarray of shape (m_train,)) – Class labels of x_train containing data with int type.

  • weights_train (ndarray of shape (m_train,), default=None) – Weights for data points. If None, all data points are equally weighted.

  • bandwidths (ndarray of shape (n_classes, n), optional) – Smoothing parameters for scaling the estimators of each class. If None, bandwidth_method is used to compute the bandwidths.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='normal_reference') – Name of bandwidth selection method used to compute bandwidths when they are not given explicitly.

  • share_bandwidth (bool, default=False) – Determines whether all classes share a common bandwidth. If False, the estimator of each class gets its own bandwidth.

  • prior_prob (ndarray of shape (n_classes,), default=None) – Prior probabilities of each class. If None, all classes are equally probable.

Returns

self – Fitted self instance of KDEClassification.

Return type

object

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full((m_train // 2,), 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full((m_train // 2,), 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit
>>> prior_prob = np.array([0.3, 0.7])
>>> params = (x_train, labels_train, weights_train)
>>> classifier = KDEClassification().fit(*params, prior_prob=prior_prob)

predict(x_test: ndarray) → ndarray[source]

Predict class labels.

Parameters

x_test (ndarray of shape (m_test, n)) – Data points to classify - array containing data points with float type.

Returns

labels_pred – Predicted class labels containing data with int type.

Return type

ndarray of shape (m_test,)

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit the classifier
>>> x_test = np.linspace(-3, 6, m_test).reshape(-1, 1)
>>> classifier = KDEClassification().fit(x_train, labels_train)
>>> # Predict labels
>>> labels_pred = classifier.predict(x_test)  # shape: (10,)

pdfs(x_test: ndarray) → ndarray[source]

Compute pdf of each class.

Parameters

x_test (ndarray of shape (m_test, n)) – Data points at which the density of each class is evaluated, as an array of floats.

Returns

scores – Predicted scores as an array containing data with float type.

Return type

ndarray of shape (m_test, n_classes)
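
An array of this shape can be combined with prior probabilities to obtain normalized posteriors. The sketch below assumes that the scores are plain class densities not yet multiplied by the priors, which is not confirmed here; the numbers are hypothetical stand-ins for the output of pdfs.

```python
import numpy as np

# Hypothetical class densities for 3 test points and 2 classes
scores = np.array([[0.30, 0.01], [0.10, 0.10], [0.02, 0.25]])
prior_prob = np.array([0.3, 0.7])

# pi_c * f_c(x) for every test point and class
joint = scores * prior_prob
# Divide each row by its sum so the class probabilities add up to 1
posterior = joint / joint.sum(axis=1, keepdims=True)
```

For the middle row, where both densities are equal, the posterior reduces to the prior.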

Examples

>>> # Prepare data for two classes
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> labels_train1 = np.full(m_train // 2, 1)
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> labels_train2 = np.full(m_train // 2, 2)
>>> x_train = np.concatenate((x_train1, x_train2))
>>> labels_train = np.concatenate((labels_train1, labels_train2))
>>> # Fit the classifier
>>> x_test = np.linspace(-3, 6, m_test).reshape(-1, 1)
>>> classifier = KDEClassification().fit(x_train, labels_train)
>>> # Compute pdf of each class
>>> scores = classifier.pdfs(x_test)  # shape: (10, 2)

KDEOutliersDetection

class kdelearn.kde_tasks.KDEOutliersDetection(kernel_name: str = 'gaussian')[source]

Outlier detection based on kernel density estimation.

Read more here.

Parameters

kernel_name ({'gaussian', 'uniform', 'epanechnikov', 'cauchy'}, default='gaussian') – Name of kernel function.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEOutliersDetection
>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> # Fit the outliers detector
>>> outliers_detector = KDEOutliersDetection("gaussian").fit(x_train)

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the outliers detector.

predict(x_test)

Predict labels.

fit(x_train: ndarray, weights_train: Optional[ndarray] = None, bandwidth: Optional[ndarray] = None, bandwidth_method: str = 'normal_reference', r: float = 0.1, **kwargs)[source]

Fit the outliers detector.

Parameters
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the detector.

  • weights_train (ndarray of shape (m_train,), default=None) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='normal_reference') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.

  • r (float, default=0.1) – Threshold separating outliers and inliers.
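
One common way to turn a threshold like r into a decision rule is sketched below. It assumes that the density threshold is the quantile of order r of the densities at the training points, so that roughly a fraction r of the training data is flagged; this rule is an assumption for illustration and is not confirmed as the library's behavior. `kde_pdf_1d` is a hypothetical helper with a fixed bandwidth.

```python
import numpy as np


def kde_pdf_1d(x_test, x_train, h):
    """Equal-weight Gaussian KDE in one dimension."""
    u = (x_test[:, None] - x_train[None, :]) / h
    norm = len(x_train) * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm


rng = np.random.default_rng(0)
x_train = rng.normal(0, 1, size=200)
h, r = 0.4, 0.1

# Assumed rule: the threshold is the r-quantile of the training densities
train_scores = kde_pdf_1d(x_train, x_train, h)
threshold = np.quantile(train_scores, r)

# Points whose density does not exceed the threshold are labeled 1 (outlier)
x_test = np.array([0.0, 10.0])
labels_pred = (kde_pdf_1d(x_test, x_train, h) <= threshold).astype(int)
```

Under this assumption, a larger r raises the density threshold and labels more points as outliers.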

Returns

self – Fitted self instance of KDEOutliersDetection.

Return type

object

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit the outliers detector
>>> params = (x_train, weights_train)
>>> outliers_detector = KDEOutliersDetection().fit(*params, r=0.1)

predict(x_test: ndarray) → ndarray[source]

Predict labels.

Parameters

x_test (ndarray of shape (m_test, n)) – Data points to check for outliers, as an array of floats.

Returns

labels_pred – Predicted labels (0 - inlier, 1 - outlier) containing data with int type.

Return type

ndarray of shape (m_test,)

Examples

>>> # Prepare data
>>> m_train, n = 100, 1
>>> m_test = 10
>>> x_train = np.random.normal(0, 1, size=(m_train, n))
>>> x_test = np.linspace(-3, 3, m_test).reshape(-1, 1)
>>> # Fit the outliers detector
>>> outliers_detector = KDEOutliersDetection().fit(x_train, r=0.1)
>>> # Predict the labels
>>> labels_pred = outliers_detector.predict(x_test)  # shape: (10,)

KDEClustering

class kdelearn.kde_tasks.KDEClustering[source]

Clustering based on kernel density estimation.

Read more here.

Examples

>>> import numpy as np
>>> from kdelearn.kde_tasks import KDEClustering
>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> # Fit
>>> clustering = KDEClustering().fit(x_train)

Methods

fit(x_train[, weights_train, bandwidth, ...])

Fit the model.

predict(x_test[, algorithm, epsilon, delta])

Predict cluster labels.

fit(x_train: ndarray, weights_train: Optional[ndarray] = None, bandwidth: Optional[ndarray] = None, bandwidth_method: str = 'normal_reference', **kwargs)[source]

Fit the model.

Parameters
  • x_train (ndarray of shape (m_train, n)) – Array containing data points with float type for constructing the model.

  • weights_train (ndarray of shape (m_train,), optional) – Weights of data points. If None, all data points are equally weighted.

  • bandwidth (ndarray of shape (n,), optional) – Smoothing parameter for scaling the estimator. If None, bandwidth_method is used to compute the bandwidth.

  • bandwidth_method ({'normal_reference', 'direct_plugin'}, default='normal_reference') – Name of bandwidth selection method used to compute bandwidth when it is not given explicitly.

Returns

self – Fitted self instance of KDEClustering.

Return type

object

Examples

>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> weights_train = np.full((m_train,), 1 / m_train)
>>> # Fit
>>> clustering = KDEClustering().fit(x_train, weights_train)

predict(x_test: ndarray, algorithm: str = 'mean_shift', epsilon: float = 1e-08, delta: float = 0.001)[source]

Predict cluster labels.

Parameters
  • x_test (ndarray of shape (m_test, n)) – Data points to be grouped - array containing data points with float type.

  • algorithm ({'gradient_ascent', 'mean_shift'}, default='mean_shift') – Name of clustering algorithm.

  • epsilon (float, default=1e-8) – Threshold for the change in position (Euclidean distance) of a data point between shifts. When the change is less than epsilon, the data point is no longer shifted.

  • delta (float, default=1e-3) – Acceptance error (Euclidean distance) between a shifted data point and a cluster representative. If the error is less than delta, the data point is assigned to the cluster of that representative.
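
The roles of epsilon and delta can be illustrated with a small mean shift sketch. It assumes a Gaussian product kernel and equal weights, and it simplifies the stopping rule by halting all points once none of them moved by epsilon or more; `mean_shift_labels` and its `max_iter` argument are hypothetical and not part of kdelearn.

```python
import numpy as np


def mean_shift_labels(x, x_train, bandwidth, epsilon=1e-8, delta=1e-3, max_iter=500):
    """Label points by the density mode that mean shift moves them to."""
    shifted = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        # Gaussian product-kernel weights, shape (m, m_train)
        u = (shifted[:, None, :] - x_train[None, :, :]) / bandwidth
        w = np.exp(-0.5 * (u**2).sum(axis=2))
        # Move every point to the weighted mean of the training data
        new = (w @ x_train) / w.sum(axis=1, keepdims=True)
        step = np.linalg.norm(new - shifted, axis=1)
        shifted = new
        # Simplification: stop once no point moved by epsilon or more
        if step.max() < epsilon:
            break
    # Greedy grouping: join the first representative closer than delta
    representatives = []
    labels = np.empty(len(shifted), dtype=int)
    for i, point in enumerate(shifted):
        for k, rep in enumerate(representatives):
            if np.linalg.norm(point - rep) < delta:
                labels[i] = k
                break
        else:
            # No representative is close enough, so start a new cluster
            representatives.append(point)
            labels[i] = len(representatives) - 1
    return labels


rng = np.random.default_rng(0)
x_train = np.concatenate(
    (rng.normal(0, 0.3, size=(30, 1)), rng.normal(5, 0.3, size=(30, 1)))
)
labels_pred = mean_shift_labels(x_train, x_train, bandwidth=np.array([1.0]))
```

In this sketch epsilon controls how closely each point approaches its mode, while delta decides when two end positions count as the same cluster, so delta should be noticeably larger than epsilon.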

Returns

labels_pred – Predicted cluster labels containing data with int type.

Return type

ndarray of shape (m_test,)

Examples

>>> # Prepare data for two clusters
>>> m_train, n = 100, 1
>>> x_train1 = np.random.normal(0, 1, size=(m_train // 2, n))
>>> x_train2 = np.random.normal(3, 1, size=(m_train // 2, n))
>>> x_train = np.concatenate((x_train1, x_train2))
>>> # Fit
>>> clustering = KDEClustering().fit(x_train)
>>> labels_pred = clustering.predict(x_train)