EM Algorithm and its usage (Part 2)
The EM algorithm was discussed in the previous post in the context of coin tossing. The same algorithm is quite helpful in many other Machine Learning (ML) applications: it can be used for missing value imputation, topic modelling (variational EM), anomaly detection, clustering, etc. In this post, I am going to discuss how the EM algorithm can be used in clustering.
EM Clustering (Gaussian Mixture Model)
We all know that clustering is an unsupervised machine learning method. Hard clustering methods such as KMeans, hierarchical clustering and DBSCAN assign each data point to exactly one cluster. EM clustering, on the other hand, associates each data point with every cluster to a varying degree. That is why it is also called a soft clustering technique. EM clustering is essentially a Gaussian Mixture Model, which assumes that each cluster is generated from some Gaussian. Thus, EM clustering actually produces a model with likelihood scores. To understand the details of this process, we need to understand the math behind it.
Gaussian Mixture
As mentioned above, EM clustering assumes that each cluster is essentially a sample drawn from some Gaussian and, if there are K clusters, the distribution of the data is a mixture of K Gaussians. This is where the concept of the multivariate normal distribution comes into play. The general pdf of a d-dimensional normal distribution is: $$f_i(x|\mu_i, \Sigma_i)=\frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left[-\frac{(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}{2}\right]$$ where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix respectively. As per the mixture model, the probability density function (pdf) of $x$ is the weighted sum of the pdfs of all K normals. That is, $$f(x) = \sum_{i=1}^Kf_i(x)P(C_i)=\sum_{i=1}^Kf_i(x|\mu_i, \Sigma_i)P(C_i)$$ $P(C_i)$ is called the mixture parameter; it is a prior probability (the probability of occurrence of the $i^{th}$ cluster itself) and it must satisfy the condition $$\sum_{i=1}^KP(C_i)=1$$ Thus, the set of parameters to estimate is $$\theta = \{\mu_1, \Sigma_1, P(C_1), \ldots, \mu_K, \Sigma_K, P(C_K)\}$$ It is easy to see that the number of parameters to estimate is quite large compared to, say, the KMeans clustering algorithm.
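To make the mixture density concrete, here is a minimal sketch (not part of the analysis below) that evaluates $f(x)$ for a toy two-component mixture, assuming SciPy's multivariate_normal is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, means, covs, priors):
    """Evaluate f(x) = sum_i P(C_i) * N(x | mu_i, Sigma_i) for one point x."""
    return sum(p * multivariate_normal.pdf(x, mean=mu, cov=S)
               for mu, S, p in zip(means, covs, priors))

# Toy example with K = 2 clusters in d = 2 dimensions (illustrative values)
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.6, 0.4]                      # must sum to 1
print(mixture_pdf(np.array([1.0, 1.0]), means, covs, priors))
```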
Maximum Likelihood Estimation
If the dataset is denoted by $D$ and has $n$ data points, the likelihood of $\theta$ is given by $$L = \prod_{j=1}^nf(x_j)$$ It should be kept in mind that the i.i.d. assumption on the $x_j$ is important for this expression to be valid; for clustering tasks, we usually assume that the data points are i.i.d. For mathematical convenience, we work with the log-likelihood function rather than the likelihood function. Since the logarithm is monotonic, the parameters that optimize the log-likelihood also optimize the likelihood itself. The log-likelihood function is given by $$\ln(L)=\sum_{j=1}^n\ln(f(x_j))=\sum_{j=1}^n\ln\left(\sum_{i=1}^Kf_i(x_j|\mu_i, \Sigma_i)P(C_i)\right)$$
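In practice, the inner sum is usually evaluated in log space to avoid numerical underflow. A minimal sketch of this computation, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, means, covs, priors):
    """ln L = sum_j ln( sum_i P(C_i) * f_i(x_j) ), computed stably in log space."""
    n, K = X.shape[0], len(priors)
    log_terms = np.empty((n, K))
    for i in range(K):
        # ln P(C_i) + ln f_i(x_j | mu_i, Sigma_i) for every point x_j
        log_terms[:, i] = (np.log(priors[i])
                           + multivariate_normal.logpdf(X, mean=means[i], cov=covs[i]))
    return logsumexp(log_terms, axis=1).sum()
```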
Expectation Maximization to the rescue
The above equation shows that optimizing the parameters directly using MLE is a prohibitively difficult task. In fact, MLE becomes computationally intractable when both the number of clusters and the best set of parameters have to be optimized. Hence, to solve this problem, the number of clusters is fixed and the parameters are estimated iteratively. The main objective of the clustering process then becomes finding the conditional probability of cluster membership given the data. Mathematically, $P(C_i|x_j)$ needs to be found. Since this is a conditional probability, Bayes' theorem can be applied: $$P(C_i|x_j)=\frac{P(x_j|C_i)P(C_i)}{\sum_{l=1}^KP(x_j|C_l)P(C_l)}$$ Since the pdf is continuous, the probability of observing a value is the area under the curve over an interval. If a small interval of width $2\epsilon$ ($\epsilon>0$) centered at $x_j$ is considered, the conditional probability $P(x_j|C_i)$ can be approximated as $$P(x_j|C_i) \approx 2\epsilon \, f_i(x_j|\mu_i, \Sigma_i)$$ The $2\epsilon$ factor cancels between the numerator and the denominator, so the posterior probability $P(C_i|x_j)$ becomes $$P(C_i|x_j)=\frac{f_i(x_j|\mu_i, \Sigma_i)P(C_i)}{\sum_{l=1}^Kf_l(x_j|\mu_l, \Sigma_l)P(C_l)}=w_{ij}$$ $P(C_i|x_j)$ is interpreted as the weight or contribution of the point $x_j$ to cluster $C_i$. With the $w_{ij}$ computed, the Expectation step of the EM algorithm is over. In the Maximization step, the same $w_{ij}$ values are used to re-estimate the parameters as shown below: $$\mu_i=\frac{\sum_{j=1}^nw_{ij}x_j}{\sum_{j=1}^nw_{ij}}$$ $$\Sigma_i=\frac{\sum_{j=1}^nw_{ij}(x_j-\mu_i)(x_j-\mu_i)^T}{\sum_{j=1}^nw_{ij}}$$ $$P(C_i)=\frac{\sum_{j=1}^nw_{ij}}{n}$$
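The two steps can be written compactly in NumPy. The sketch below is an illustrative, unoptimized implementation of one E-step followed by one M-step for given current parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    """One E-step followed by one M-step for a Gaussian mixture (illustrative sketch)."""
    n, d = X.shape
    K = len(priors)

    # E-step: responsibilities w_ij = P(C_i | x_j)
    w = np.empty((n, K))
    for i in range(K):
        w[:, i] = priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: re-estimate mu_i, Sigma_i and P(C_i) from the weights
    new_means, new_covs, new_priors = [], [], []
    for i in range(K):
        wi = w[:, i]
        mu = (wi[:, None] * X).sum(axis=0) / wi.sum()
        diff = X - mu
        Sigma = (wi[:, None] * diff).T @ diff / wi.sum()   # weighted outer products
        new_means.append(mu)
        new_covs.append(Sigma)
        new_priors.append(wi.sum() / n)
    return new_means, new_covs, new_priors, w
```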
The parameters estimated in the Maximization step are fed back to the Expectation step and the process continues until convergence. With more features, the number of parameters to estimate grows and the process becomes slower. However, the EM clustering process gives more insight into the clusters than KMeans clustering. One such piece of information is the uncertainty measure. Since each data point is associated with every cluster to a varying degree (given by the probabilities), uncertainty is defined as $$uncertainty = 1 - \max(\text{probability})$$ This uncertainty is quite helpful for identifying data points lying near the boundary between two clusters. The higher the uncertainty, the higher the chance that, with a small push, the data point could move into a neighbouring cluster. This aspect can be quite helpful in customer segmentation analysis with clustering.
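For example, given a matrix of membership probabilities (such as the $w_{ij}$ from the E-step sketch above), the uncertainty is one line of NumPy:

```python
import numpy as np

# w: n x K matrix of membership probabilities (toy values for illustration)
w = np.array([[0.95, 0.03, 0.02],
              [0.40, 0.35, 0.25]])
uncertainty = 1.0 - w.max(axis=1)   # 1 - max probability per point
print(uncertainty)                  # [0.05 0.6]: the second point is a boundary case
```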
EM Clustering with Python
This is the practical part of EM clustering using Python. For this task, the "Wholesale customers data.csv" file is downloaded from this Kaggle webpage. It is a small dataset (440 rows) with 8 features, described below.
- Channel: Channel used to purchase the goods (categorical)
- Region: The region of the buyer (categorical)
- Fresh: Amount spent to buy fresh items
- Milk: Amount spent to buy milk items
- Grocery: Amount spent to buy grocery items
- Frozen: Amount spent to buy frozen foods
- Detergents_Paper: Amount spent on detergents and paper products
- Delicassen: Amount spent on delicatessen.
The first two features are categorical (nominal) and are not used in this analysis. The remaining six features are used for clustering after being standardized with a scaler. The Python code is given below:
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
import numpy as np

data = pd.read_csv('Wholesale customers data.csv')
data.head()
```
```python
data_scaled = StandardScaler().fit_transform(data.drop(['Channel', 'Region'], axis=1))

# Optimize best cluster solution with BIC scores
bic = []
for i in range(1, 21):
    gmm = GaussianMixture(n_components=i, n_init=5, random_state=42)
    gmm.fit(data_scaled)
    bic.append(gmm.bic(data_scaled))

plt.plot(range(1, 21), bic)
plt.xlabel('Number of components')
plt.ylabel('BIC')
plt.show()
```
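As a quick follow-up, assuming the bic list computed in the loop above, the component count with the lowest BIC can also be picked programmatically (the analysis below uses 5 components, chosen from the plot):

```python
# Assumes the `bic` list and numpy import from the previous blocks
best_k = int(np.argmin(bic)) + 1   # +1 because the component range starts at 1
print('Number of components with the lowest BIC:', best_k)
```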
```python
from sklearn.manifold import TSNE   # needed for the 2-D embedding below

best_gmm = GaussianMixture(n_components=5, n_init=5, random_state=42)
best_gmm.fit(data_scaled)
clusters = best_gmm.predict(data_scaled)
cluster_prob = best_gmm.predict_proba(data_scaled)
uncertainty = 1 - np.max(cluster_prob, axis=1)

# Project the six spending features to two dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_data = tsne.fit_transform(data.loc[:, 'Fresh':'Delicassen'])

# Contour plot creation
xx, yy = np.meshgrid(np.linspace(-20, 30, 100), np.linspace(-20, 30, 100))
grid_data = np.array([xx.ravel(), yy.ravel()]).T
gmm_tsne = GaussianMixture(n_components=5, n_init=5, random_state=42)
gmm_tsne.fit(tsne_data)
zz = gmm_tsne.score_samples(grid_data).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3, cmap='tab20', levels=100)

# Scatter plot on top of the contour plot; bubble size encodes uncertainty
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=clusters, cmap='tab20', s=uncertainty*100)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
```
In the above plot, the larger the bubble, the larger the uncertainty. Data points with higher uncertainty need to be analyzed separately.
Conclusion
EM clustering is essentially a Gaussian Mixture Model which models each cluster as a sample from a specific Gaussian with its own set of parameters. KMeans clustering is, in fact, a special case of EM clustering. EM clustering gives many important insights about the clusters which KMeans clustering does not give directly. However, EM clustering is slower than KMeans clustering and does not scale as well. But a careful marketing analyst would probably want to see various aspects of the clusters formed before deciding on the next marketing moves, and in such situations EM clustering can be quite helpful.