Thursday, 26 December 2024

EM Algorithm and its usage (Part 2)

The EM algorithm was discussed in the previous post in the context of coin tossing. The same algorithm is quite helpful in many other Machine Learning (ML) applications: it can be used in missing value imputation, topic modelling (variational EM), anomaly detection, clustering, etc. In this post, I am going to discuss how the EM algorithm can be used in clustering.

EM Clustering (Gaussian Mixture Model)

Clustering is an unsupervised machine learning method. Hard clustering methods such as KMeans, Hierarchical clustering and DBSCAN assign each data point to a single cluster only. EM clustering, on the other hand, associates each data point with every cluster with varying degrees of membership, which is why it is also called a soft clustering technique. EM clustering is essentially a Gaussian Mixture Model, which assumes that each cluster is generated from its own Gaussian distribution. Thus, EM clustering actually produces a probabilistic model with likelihood scores. To understand the details of this process, we need to understand the math behind it.

Gaussian Mixture

As mentioned above, EM Clustering assumes that each cluster is essentially a sample drawn from some Gaussian and, if there are K clusters, the distribution of the data is a mixture of K Gaussians. This is where the concept of the multivariate normal distribution comes into play. The density of a d-dimensional normal distribution is: $$f_i(x|\mu_i, \Sigma_i)=\frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left[-\frac{(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}{2}\right]$$ where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix respectively. As per the mixture model, the probability density function (pdf) of $x$ is the weighted sum of the pdfs of all K normals. That is, $$f(x) = \sum_{i=1}^Kf_i(x)P(C_i)=\sum_{i=1}^Kf_i(x|\mu_i, \Sigma_i)P(C_i)$$ $P(C_i)$ is called the mixing proportion; it is a prior probability (the probability of occurrence of the $i^{th}$ cluster itself) and it must satisfy the condition $$\sum_{i=1}^KP(C_i)=1$$ Thus, the set of parameters to estimate is $$\theta = \{\mu_1, \Sigma_1, P(C_1), \ldots, \mu_K, \Sigma_K, P(C_K)\}$$ It is easy to see that the number of parameters to estimate is quite large compared to, say, the KMeans clustering algorithm.
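To make the notation concrete, the mixture density $f(x)$ can be evaluated directly. Below is a minimal sketch using scipy.stats.multivariate_normal; the two-component means, covariances and mixing proportions are made-up numbers for illustration only.

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in d = 2 dimensions (illustrative values)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]
priors = [0.6, 0.4]   # P(C_1), P(C_2); they must sum to 1

def mixture_pdf(x, means, covs, priors):
    # f(x) = sum_i f_i(x | mu_i, Sigma_i) * P(C_i)
    return sum(p * multivariate_normal(mean=m, cov=s).pdf(x)
               for m, s, p in zip(means, covs, priors))

print(mixture_pdf(np.array([1.0, 1.0]), means, covs, priors))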

Maximum Likelihood Estimation

If the dataset is denoted by $D$ and has $n$ data points, the likelihood of $\theta$ is given by $$L = \prod_{j=1}^nf(x_j)$$ It should be kept in mind that the i.i.d. assumption on the $x_j$ is important for this expression to be valid; for clustering tasks, we usually assume that the data points are i.i.d. For mathematical convenience, we work with the log-likelihood function instead of the likelihood function. Since the logarithm is monotonic, the optimum of the log-transformed function is attained at the same parameter values as the optimum of the original function, so optimizing the log-likelihood is equivalent to optimizing the likelihood. The log-likelihood function is given by $$\ln(L)=\sum_{j=1}^n\ln f(x_j)=\sum_{j=1}^n\ln\left(\sum_{i=1}^Kf_i(x_j|\mu_i, \Sigma_i)P(C_i)\right)$$
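Because the logarithm of a sum does not simplify further, the inner sum is usually computed in log-space with logsumexp to avoid numerical underflow when the densities are tiny. A small sketch below, reusing the hypothetical mixture parameters (means, covs, priors) from the previous snippet:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, means, covs, priors):
    # ln L = sum_j ln( sum_i f_i(x_j | mu_i, Sigma_i) * P(C_i) ), computed stably in log-space
    log_dens = np.column_stack([
        multivariate_normal(mean=m, cov=s).logpdf(X) + np.log(p)
        for m, s, p in zip(means, covs, priors)
    ])
    return logsumexp(log_dens, axis=1).sum()

X = np.random.default_rng(0).normal(size=(100, 2))   # toy data just to exercise the function
print(log_likelihood(X, means, covs, priors))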

Expectation Maximization to the rescue

The above equation shows that optimizing the parameters with MLE directly is prohibitively difficult; in fact, MLE becomes computationally intractable if one tries to optimize the number of clusters and the parameters simultaneously. Hence, to solve this problem, the number of clusters is fixed and the parameters are estimated iteratively. The main objective of the clustering process then becomes finding the conditional probability of cluster membership given the data, i.e., $P(C_i|x_j)$. Since this is a conditional probability, Bayes' Theorem can be applied: $$P(C_i|x_j)=\frac{P(x_j|C_i)P(C_i)}{\sum_{l=1}^KP(x_j|C_l)P(C_l)}$$ When the pdf is known, a probability is obtained by calculating the area under the curve. If a small interval of width $2\epsilon$ ($\epsilon>0$) centered at $x_j$ is considered, the conditional probability $P(x_j|C_i)$ can be approximated as $$P(x_j|C_i) \approx 2\epsilon \, f_i(x_j|\mu_i, \Sigma_i)$$ Since the factor $2\epsilon$ appears in both the numerator and the denominator, it cancels out, and the posterior probability $P(C_i|x_j)$ becomes $$P(C_i|x_j)=\frac{f_i(x_j|\mu_i, \Sigma_i)P(C_i)}{\sum_{l=1}^Kf_l(x_j|\mu_l, \Sigma_l)P(C_l)}=w_{ij}$$ $P(C_i|x_j)$ is interpreted as the weight or contribution of the point $x_j$ to cluster $C_i$. Once the $w_{ij}$ are computed, the Expectation step of the EM algorithm is over. In the Maximization step, the same $w_{ij}$ values are used to re-estimate the parameters as shown below. $$\mu_i=\frac{\sum_{j=1}^nw_{ij}x_j}{\sum_{j=1}^nw_{ij}}$$ $$\Sigma_i=\frac{\sum_{j=1}^nw_{ij}(x_j-\mu_i)(x_j-\mu_i)^T}{\sum_{j=1}^nw_{ij}}$$ $$P(C_i)=\frac{\sum_{j=1}^nw_{ij}}{n}$$
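The two steps translate almost line by line into NumPy. Below is a minimal, illustrative sketch of one full EM pass for a Gaussian mixture (no regularization of the covariances and no convergence check); in practice, the library implementation sklearn.mixture.GaussianMixture used later in this post is preferable.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    # One E-step + M-step for a Gaussian mixture (minimal sketch, no safeguards)
    n, K = X.shape[0], len(priors)

    # E-step: w_ij = f_i(x_j) P(C_i) / sum_l f_l(x_j) P(C_l)
    dens = np.column_stack([
        priors[i] * multivariate_normal(mean=means[i], cov=covs[i]).pdf(X)
        for i in range(K)
    ])                                        # shape (n, K)
    w = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted re-estimation of the parameters
    Nk = w.sum(axis=0)                        # effective number of points per cluster
    new_means = [w[:, i] @ X / Nk[i] for i in range(K)]
    new_covs = [(w[:, i, None] * (X - new_means[i])).T @ (X - new_means[i]) / Nk[i]
                for i in range(K)]
    new_priors = Nk / n
    return new_means, new_covs, new_priors

Starting from an initial guess, em_step would be called repeatedly until the change in log-likelihood (or in the parameters) falls below a threshold.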

The parameters estimated in the Maximization step are fed back into the Expectation step, and the process continues till convergence. With more features, the number of parameters to estimate grows and the process becomes slower. However, EM clustering gives more insight into the clusters than KMeans clustering. One such piece of information is the uncertainty measure. Since each data point is associated with every cluster with a varying degree of membership (given by the probabilities), uncertainty is defined as $$uncertainty = 1 - \max(\text{probability})$$ This uncertainty is quite helpful for identifying data points lying near the boundary between two clusters. The higher the uncertainty, the higher the chance that a small push would move the data point into a neighbouring cluster. This aspect can be quite helpful in customer segmentation analysis with clustering.

EM Clustering with Python

This is the practical part of EM clustering using Python. For this task, the file "Wholesale customers data.csv" is downloaded from this Kaggle webpage. It is a small dataset (440 rows) with 8 features, whose description is given below.

  • Channel: Channel used to purchase the goods (categorical) 
  • Region: The region of the buyer (categorical) 
  • Fresh: Amount spent to buy fresh items 
  • Milk: Amount spent to buy milk items 
  • Grocery: Amount spent to buy grocery items 
  • Frozen: Amount spent to buy frozen foods 
  • Detergents_Paper: Amount spent on detergents and paper products 
  • Delicassen: Amount spent on delicatessen.

The first two features are not used in this analysis as they are categorical (nominal) in nature. The remaining six features are used for clustering. The features are standardized with StandardScaler to make them ready for clustering. The Python code is given below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Load the wholesale customers dataset
data = pd.read_csv('Wholesale customers data.csv')
data.head()


Since a Gaussian Mixture Model (GMM) builds a probabilistic model with probabilities associated with each data point, the likelihood score of every fitted model can be extracted, and hence the Bayesian Information Criterion (BIC) score can be calculated for each model. The BIC score can be used effectively to arrive at the optimum number of clusters: the model with the lowest BIC score is the best model, and its number of components is the optimum number of clusters. The Python code and the associated plot show that the optimum number of clusters is 5.

# Drop the two categorical columns and standardize the six spend features
data_scaled = StandardScaler().fit_transform(data.drop(['Channel', 'Region'], axis=1))

# Optimize best cluster solution with BIC Scores
bic = []
for i in range(1, 21):
    gmm = GaussianMixture(n_components=i, n_init=5, random_state=42)
    gmm.fit(data_scaled)
    bic.append(gmm.bic(data_scaled))

plt.plot(range(1, 21), bic)
plt.xlabel('Number of components')
plt.ylabel('BIC')
plt.show()
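As a small follow-up, the optimum number of components can also be read off programmatically as the position of the smallest BIC value (with this data and random_state, the plot reported above points to 5):

best_k = int(np.argmin(bic)) + 1   # +1 because the loop starts at n_components = 1
print(f"Optimum number of clusters by BIC: {best_k}")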


For visualization purposes, contour plots of the clusters are created along with the uncertainty measure of each data point. As mentioned above, uncertainty measures how uncertain the assignment of a data point is when it is attached to the cluster with the highest membership probability. The higher the uncertainty, the higher the possibility of the data point being mis-clustered, which matters for customer segmentation. The code and plot below show this. t-SNE is used to project the high-dimensional data into 2 dimensions for plotting.

best_gmm = GaussianMixture(n_components=5, n_init=5, random_state=42)
best_gmm.fit(data_scaled)

clusters = best_gmm.predict(data_scaled)

cluster_prob = best_gmm.predict_proba(data_scaled)
uncertainty = 1 - np.max(cluster_prob, axis=1)

tsne = TSNE(n_components=2, random_state=42)
tsne_data = tsne.fit_transform(data.loc[:,'Fresh':'Delicassen'])

# Contour plot creation: a separate GMM is fitted on the 2-D t-SNE coordinates just to draw the contours
xx, yy = np.meshgrid(np.linspace(-20, 30, 100), np.linspace(-20, 30, 100))
grid_data = np.array([xx.ravel(), yy.ravel()]).T
gmm_tsne = GaussianMixture(n_components=5, n_init=5, random_state=42)
gmm_tsne.fit(tsne_data)
zz = gmm_tsne.score_samples(grid_data).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap='tab20', levels=100)

# Plot scatter plot on the contour plot
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=clusters, cmap='tab20', 
            s=uncertainty*100)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()


In the above plot, the larger the size of the bubble, the larger the uncertainty. Data points with higher uncertainty need to be analyzed separately.
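As a small follow-up sketch (the 0.3 cut-off is an arbitrary, illustrative threshold), the high-uncertainty customers can be pulled out of the original dataframe for separate inspection:

# Attach cluster labels and uncertainty to the original data for inspection
result = data.copy()
result['cluster'] = clusters
result['uncertainty'] = uncertainty

# Customers lying near cluster boundaries (illustrative threshold of 0.3)
boundary_customers = result[result['uncertainty'] > 0.3]
print(boundary_customers.sort_values('uncertainty', ascending=False).head())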

Conclusion

EM clustering is essentially a Gaussian Mixture Model which models each cluster as a sample from a specific Gaussian with its own set of parameters. KMeans clustering is, in fact, a special case of EM clustering: if every cluster is constrained to the same spherical covariance and the soft assignments are replaced by hard ones, EM essentially reduces to KMeans. EM clustering gives many important insights about the clusters which KMeans clustering does not give directly. However, EM clustering is slower than KMeans clustering and not as scalable. But a careful marketing analyst would probably want to see various aspects of the clusters formed before deciding on the next marketing moves, and in such situations EM clustering can be quite helpful.

Thursday, 19 December 2024

EM Algorithm and its usage (Part 1)

Maximum likelihood estimation is a very powerful and convenient method of estimating the parameters of a model; the details are given in an earlier article. This method is used in several statistical models to estimate the model parameters in a statistically consistent way. However, a key requirement of this method is to have complete information about the data, e.g., which distribution it follows and other aspects related to that distribution. But in real-life cases it is not unusual to have datasets for which some of the information is hidden, and yet the parameters of a model have to be estimated to extract the pattern. A simple scenario is described below.

Two coins problem:

Assume that there are two coins, each of which is biased either towards heads or tails. Let us assume that coin A is biased towards heads and coin B towards tails. A coin is chosen at random and tossed 10 times, and this experiment is repeated 5 times. The image below shows how easy it is to calculate the bias when all the information is known, that is, which coin was chosen and how many heads and tails were observed.


The estimates are nothing but Maximum Likelihood Estimates based on the available data and information. The same problem becomes quite complicated if one piece of information is hidden, namely which coin was chosen. In such a situation MLE is not advisable, as it handles hidden information poorly. However, with the Expectation Maximization (EM) algorithm it is possible to estimate the parameters iteratively. As the name suggests, the EM algorithm has an expectation step and a maximization step. The primary objective of the expectation step is to calculate the probabilities associated with the data points so that, using these probabilities, an expression for the likelihood can be written down. This likelihood function is then maximized by optimizing the parameters. The updated parameters are then used to recalculate the probabilities for each data point, and the likelihood is again maximized by optimizing the parameters. Thus, in an iterative manner, the parameters keep updating until a stopping criterion (or threshold) is reached.


The two-coin problem can be solved with the EM algorithm when the critical information is missing, i.e., which coin was chosen for tossing. To apply the algorithm, an initialization is needed, which is usually done randomly. Afterwards, the iterative process commences. The process is shown in the diagram below.


The above diagram does not show explicitly how the probabilities are calculated. This is shown in the diagram below for the first data point, where there were 5 heads and 5 tails, with $\theta_A^{(0)}=0.6$ and $\theta_B^{(0)}=0.5$.
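Since the diagram only states the result, the calculation for this first experiment can be written out explicitly. Assuming a binomial likelihood for the number of heads, the responsibility of coin A for the first sequence is $$P(A|x_1)=\frac{\binom{10}{5}(0.6)^5(0.4)^5}{\binom{10}{5}(0.6)^5(0.4)^5+\binom{10}{5}(0.5)^{10}}\approx\frac{0.201}{0.201+0.246}\approx 0.45$$ so that $P(B|x_1)\approx 0.55$. In the Maximization step these weights split the observed counts, crediting about $0.45\times5\approx2.2$ heads and $2.2$ tails to coin A and about $2.8$ heads and $2.8$ tails to coin B.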

After 10 iterations, the values converge to about 0.8 for coin A and 0.52 for coin B. Obviously, the number of experiments was small; hence the iterative algorithm shows both coins as biased towards heads. The same experiment can be replicated with a Python routine. The code is given below:

import numpy as np
import scipy.stats

# Generate Data
'''
    Each coin is tossed 100 times and the experiment is done 120 times, 60 times for each coin
'''
coinA = np.random.binomial(n=1, p=0.8, size=(60,100)) # probability of head = 0.8
coinB = np.random.binomial(n=1, p=0.35, size=(60,100)) # probability of head = 0.35
data = np.vstack([coinA, coinB])

index = np.array(range(data.shape[0]))
np.random.shuffle(index)
data = data[index]

# Create frequency table for heads and tails counts
n_heads = data.sum(axis=1)
n_tails = data.shape[1] - n_heads
freq_table = np.hstack((n_heads.reshape(-1,1), n_tails.reshape(-1,1)))

# EM Algorithm starts
# --------------------

# Initialization
thetaA = 0.6
thetaB = 0.3
n_iter = 20

for it in range(n_iter):

    # E-Step starts:
    p_headsA = np.array([scipy.stats.binom.pmf(r, data.shape[1], thetaA) for r in freq_table[:,0]])
    p_headsB = np.array([scipy.stats.binom.pmf(r, data.shape[1], thetaB) for r in freq_table[:,0]])

    P_A = p_headsA/(p_headsA + p_headsB)
    P_B = p_headsB/(p_headsA + p_headsB)

    expected_count_A = P_A.reshape(-1,1) * freq_table
    expected_count_B = P_B.reshape(-1,1) * freq_table

    # E-Step ends

    # M-Step starts
    expected_total_A = expected_count_A.sum(axis=0)
    expected_total_B = expected_count_B.sum(axis=0)

    thetaA = expected_total_A[0]/ expected_total_A.sum()
    thetaB = expected_total_B[0]/ expected_total_B.sum()

    # M-Step ends

print(f"Probability of head in coin A: {thetaA}")
print(f"Probability of head in coin A: {thetaB}")

It will be seen that the estimated values come close to the parameters used to create the dataset. This is the EM algorithm in Python for solving the two-coin problem.

I hope that the explanations given are understandable.

Comments are welcome.
