This post is aimed at exploring the ideas of unsupervised learning (clustering) and dimensionality reduction. The exploration is being done by experimenting with the following algorithms on two publicly available datasets —

- Principal Components Analysis
- Independent Components Analysis
- Random Projections
- Feature Selection using Information Gain
- k-Means Clustering
- Clustering using Gaussian Mixture Models

The idea is, firstly, to observe how these algorithms on their own help unravel the underlying structure in the data (if any) and, secondly, to see how the two different concepts of clustering and dimensionality reduction interact with each other. Let’s start with a slightly detailed overview before diving into the analysis.

### Overview

#### Datasets

The two datasets used for this analysis are going to be referred to as *Abalone* and *Wine*. Both correspond to classification problems taken from the UCI ML repository.

*Abalone* — Given a set of physical measurements, the problem is to predict the age of a group of snails commonly referred to as abalone. The data contains ~4000 instances and 8 features, plus a custom-defined label feature: the label takes the value 1 if the age is greater than 10 and 0 otherwise. All the features are numerical except one, which was one-hot encoded appropriately, so the dataset eventually ends up with 9 features. Since this dataset was also used in an earlier assignment to train a Neural Network classifier, it’s also going to be used for the Neural Network part of this assignment.

*Wine* — The problem described by this dataset is that of predicting the quality of a wine, given a bunch of features. The features here are largely chemical properties of the wine instance, e.g., levels of alcohol, magnesium, malic acid, alkalinity, etc. There are ~6500 instances, and 12 such characteristics associated with each instance. The label feature takes 7 unique values from the set {3, 4, 5, 6, 7, 8, 9}, each value reflecting the quality of the wine. It’s a multi-class classification problem.

Since we’re going to be looking at clustering results, I thought it might be interesting to pick a dataset with multi-class labels rather than binary ones, hence the Wine problem. This is because one of the experiments I’m going to run looks at how the original labels of the data align with the generated clusters. A thing to note here is that neither dataset really suffers from the curse of dimensionality per se; both have a fairly manageable number of features. While running the experiments, I had also explored using a dataset with a significantly larger number of features to better elucidate the concepts of dimensionality reduction. But since those features were very sparse (a result of one-hot encoding), the feature matrix became really troublesome for techniques like ICA and PCA, which rely on matrix inversions; the one-hot encoding was perhaps creating linearly dependent features. So I decided to stick with simpler datasets with dense feature matrices for the scope of this analysis.

#### Analysis Outline

There are three parts to the analysis. The first part discusses the dimensionality reduction experiments on both datasets. After identifying the best parameters (essentially the best lower dimension), four reduced datasets are created for each of the two problems (Abalone and Wine), one for each of the four dimensionality reduction algorithms (the first four in the list above). The second part discusses the clustering experiments. Both clustering algorithms are run on 5 datasets for each problem — the first with the original set of features, the other four with the reduced sets of features obtained in the first part. This is an interesting way to see what kind of structure each dimensionality reduction algorithm preserves. Finally, in the third part, all the different kinds of datasets generated so far (clustered, dimensionally reduced, etc.) are used to train a Neural Network to predict the true labels. Measuring performance in this third part is, in my opinion, the real deal, because this is ultimately the problem we’re trying to solve; all the reduction and clustering algorithms are just trying to simplify that problem solving. The third part tries to quantify this by looking at several performance metrics.

### Dimensionality Reduction

#### Principal Components Analysis (PCA)

The idea behind PCA is to identify the directions that capture the maximum variance in the data. We rotate the data so that the new axes correspond to these variance-maximising directions. Usually most of the variance is explained by just a handful of directions, as opposed to the full original dimension space. We then pick the top *k* principal components that explain a significant amount of variance in the data. The following charts plot the explained variance along with the corresponding reconstruction errors —

As can be seen, the first three principal components explain more than 95% of the variance in the data, so an ideal strategy for dimensionality reduction would be to pick these first three principal components. There is also a clear elbow at x = 3 in the reconstruction charts, so picking the first 3 principal components is a valid strategy for both datasets.
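As a rough sketch of this selection procedure (not the exact code behind the charts), here is how the explained-variance curve and the reconstruction error can be computed with plain NumPy; the data matrix below is a synthetic, low-rank stand-in for the real datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 500 samples, 8 features,
# with rank-3 structure so a few components capture most of the variance.
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 8))
Xc = X - X.mean(axis=0)                      # centre before PCA

# PCA via SVD: rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)              # fraction of variance per component

# Smallest k whose cumulative explained variance exceeds 95%.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

# Project onto the top-k components and reconstruct to measure the error.
Z = Xc @ Vt[:k].T
X_rec = Z @ Vt[:k]
rec_err = np.mean((Xc - X_rec) ** 2)
```

With real data one would plot `np.cumsum(explained)` and `rec_err` against *k*, as in the charts above.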

#### Independent Components Analysis (ICA)

ICA, on the other hand, identifies features that are statistically independent from each other. The given features are assumed to be a linear combination of some unknown latent features (the number of these latent features is the lower dimension here). ICA tries to describe the given data in terms of these latent features, which are assumed to be non-Gaussian and mutually independent. In fact, non-Gaussianity is the key to estimating the independent components (by the central limit theorem). The classical measure of non-Gaussianity is the *kurtosis*. Since the excess kurtosis of a Gaussian random variable is zero, we pick the independent components with higher kurtosis values in order to *maximize* non-Gaussianity. Since ICA is a local algorithm relying on random starting points, it was necessary to run the experiments several times to capture a trend. As there is no ordering among independent components, a value of *k* on the x-axis corresponds to the *top k* components (by kurtosis), as compared to the *first k* components in PCA. So the correct way to interpret the x-axis values is to read them as the *top kth* independent component, rather than just the *kth* independent component. The charts below show the average kurtosis values for the various independent components.

The following charts capture the corresponding reconstruction errors when we choose to represent the data by the top *k* independent components —

From the above two charts, it is clear that we should select the top 3 independent components for Abalone and the top 2 for Wine. Using more components in the model seems to deteriorate the reconstruction error, which validates our argument that we should only use the non-Gaussian components, i.e., the components with high kurtosis values.
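A minimal Python sketch of this ranking, using scikit-learn's `FastICA` on synthetic non-Gaussian sources (the data and settings here are stand-ins, and components are ranked by the magnitude of their excess kurtosis, i.e., distance from Gaussianity in either direction):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Two non-Gaussian sources (uniform and Laplace), linearly mixed into
# 4 observed features, plus a little noise to keep the mixture full-rank.
S = np.column_stack([rng.uniform(-1, 1, 2000), rng.laplace(size=2000)])
X = S @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(2000, 4))

ica = FastICA(n_components=4, random_state=0, max_iter=1000)
S_est = ica.fit_transform(X)        # estimated independent components

# Excess kurtosis: exactly 0 for a Gaussian, so values far from 0
# indicate non-Gaussianity.
def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return float(np.mean(z ** 4) - 3.0)

kurt = np.array([excess_kurtosis(S_est[:, j]) for j in range(4)])
order = np.argsort(-np.abs(kurt))   # "top k" components, as on the x-axis
```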

#### Random Projections (RP)

Random projection involves projecting vectors lying in a higher-dimensional space onto a *randomly chosen* lower-dimensional subspace. Note that this projection need not be Euclidean, i.e., the subspace need not be aligned with the basis vectors of the original space. A vector is projected by multiplying it by a suitable random matrix. For this analysis, a uniformly random orthonormal matrix, as described in [1], is chosen as the projection matrix. Such a random n x k matrix can be constructed as follows. First, construct a matrix R1 with each entry drawn independently from the distribution *N(0, 1)*. Then, orthonormalize the columns of R1 using the Gram-Schmidt process (QR decomposition) and form the required matrix R using the resulting vectors as columns. This special construction of the random projection matrix allows the Johnson-Lindenstrauss lemma to hold. The lower dimension in this case is chosen by looking at the reconstruction errors.

Each coloured line corresponds to an independent random trial, and the solid black line to the average; since it’s a randomised procedure, such variation is expected in the results. Calculating the reconstruction error for a randomly projected matrix is a slightly tricky thing, and in fact an ongoing research topic. For the scope of this analysis, I limited myself to the simplest way of calculating it, i.e., inverting the random projection by multiplying by the pseudo-inverse of the projection matrix. The way the random matrix was constructed here means its pseudo-inverse is just its transpose. A thing to note is that when *k* equals the original dimension, the error is exactly zero, which makes sense: when the dimensions are the same, the random projection matrix is just randomly *rotating* the data, so no information is lost. Since there were no hard elbow cuts, the optimal lower dimension for these datasets was chosen to be the midpoint: *k = 5* for Abalone and *k = 6* for Wine.
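The construction and reconstruction above can be sketched in a few lines of NumPy (the dimensions match the Abalone setting, but the data matrix is a synthetic stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 9, 5                       # original and reduced dimensions (9 -> 5)

# Step 1: entries drawn i.i.d. from N(0, 1).
R1 = rng.normal(size=(n, k))
# Step 2: orthonormalise the columns via QR (Gram-Schmidt).
R, _ = np.linalg.qr(R1)           # n x k matrix with orthonormal columns

X = rng.normal(size=(200, n))     # stand-in data matrix
X_proj = X @ R                    # project down to k dimensions
X_rec = X_proj @ R.T              # pseudo-inverse of R is just R.T here
rec_err = np.mean((X - X_rec) ** 2)
```

When `k == n`, the same construction yields a square orthogonal matrix, a pure rotation, and `rec_err` drops to zero.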

#### Feature Selection Using Information Gain (DT)

The feature selection algorithm for this part has been designed using Decision Trees, because decision trees naturally separate out features according to their respective information gains. Unlike the other three dimensionality reduction algorithms, this one sees the true labels. The first step is to exhaustively train a Decision Tree on our dataset, even if the tree overfits (which it quite obviously will). This is to ensure that most of the features end up showing up in some tree rule or another. After training, Gini importance values are calculated for each of the features. These values represent the relative importance of each feature for classifying the data according to the given labels. We can then select the top *k* most important features to reduce the dimensionality of our data. The following charts plot the feature importance according to the Gini index for both problems.

Clearly there are a few features which seem to be a lot more important than the rest. This skewness will help us pick the relevant features and reduce the dimension of the problem. Notice that Feature 12 for the Wine problem shows a zero value in the chart; this might be because the feature does not contain enough information to show up in any of the rules of the trained tree. Using the above charts, we select the top 3 features for Abalone and the top 6 features for Wine. Interestingly, unlike the other dimensionality reduction algorithms, where the new features were generated by transforming the original features, the new features in this case are simply a subset of the original features.
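A sketch of this selection step with scikit-learn, on synthetic data where only the first two of six features carry label information (the feature counts are illustrative, not the actual datasets):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Only features 0 and 1 determine the label; the rest are noise.
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Grow the tree fully (no depth limit) so most features get a chance
# to appear in some rule, then read off Gini-based importances.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_

k = 2
top_k = np.argsort(-importances)[:k]       # k most important features
X_reduced = X[:, np.sort(top_k)]           # selection = column subset
```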

We have now learnt the optimal lower dimensions for each of the dimensionality reduction algorithms for both the problems. We’ll use datasets generated using these optimal settings for the remaining two parts of the analysis.

### Clustering

The objective of this part is to explore two clustering methodologies — the k-means algorithm and an expectation-maximisation (EM) clustering algorithm based on Gaussian mixture models. The idea is also to look at how the results vary when the algorithms are run on dimensionally reduced datasets. Three sub-experiments have been undertaken for each clustering algorithm. The first is to identify the optimal number of clusters, i.e., to find the right *k*. Once we have the right *k*, the second experiment is to look at how the clusters vary when the algorithms are run on the various dimensionally reduced datasets. The third experiment is to see how these different clusters (found from the original and reduced datasets) compare with the true labels of the datasets. The distance metric used throughout is the *Euclidean distance*.

*Choosing the right number of clusters — *The metric used to pick the right number of clusters is commonly referred to as the *silhouette*. For each datum *i*, let *a(i)* be the average dissimilarity of *i* with all other data within the same cluster, and let *b(i)* be the lowest average dissimilarity of *i* to any other cluster of which *i* is not a member. The silhouette is then defined as

*s(i) = (b(i) − a(i)) / max{a(i), b(i)}*

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. Both algorithms use the same metric to fine-tune the choice of the optimal *k*.
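A minimal sketch of the k-selection loop, using scikit-learn's `silhouette_score` on synthetic blobs (the data is a stand-in; with three well-separated blobs the score should peak at k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs of 100 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # Euclidean by default

best_k = max(scores, key=scores.get)          # k with the highest silhouette
```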

*Comparing different sets of clusters* — To compare the similarity between two data clusterings, a measure known as the *Rand index* is used. Given a set *S* of *n* elements and two partitions (clusterings) *X* and *Y*, the Rand index *R* is defined as

*R = (a + b) / (a + b + c + d)*

where *a* is the number of pairs of elements in the same set in *X* and the same set in *Y*, *b* the number of pairs in different sets in *X* and different sets in *Y*, *c* the number of pairs in the same set in *X* but different sets in *Y*, and *d* the number of pairs in different sets in *X* but the same set in *Y*. Conceptually, it’s analogous to looking at a confusion matrix. For two identical clusterings, *c* and *d* are zero and the Rand index is 1; for two completely different clusterings, *a* and *b* are zero and the index is 0. The higher the index, the more similar the clusterings.
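The definition translates directly into code. Here is a small self-contained implementation by pair counting (O(n²) in the number of points, fine for illustration):

```python
from itertools import combinations

def rand_index(x, y):
    """Rand index between two clusterings given as label sequences."""
    a = b = c = d = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        same_x, same_y = xi == xj, yi == yj
        if same_x and same_y:
            a += 1    # pair together in both clusterings
        elif not same_x and not same_y:
            b += 1    # pair apart in both clusterings
        elif same_x:
            c += 1    # together in X, apart in Y
        else:
            d += 1    # apart in X, together in Y
    return (a + b) / (a + b + c + d)
```

Note that the index only depends on the partition, not on the label names: relabelling the clusters of one partition leaves it unchanged.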

#### k-Means Clustering

The aim with the k-means clustering algorithm is to partition *n *data points into *k *clusters, in which each data point belongs to the cluster with the nearest mean. The distance metric being used here is the standard Euclidean distance. To obtain an optimal k-means clustering, we need to optimise for *k*, i.e., the number of clusters. Let’s look at the following charts.

From the above charts, it’s apparent that for Abalone, the optimal number of clusters is 3 for the original and the PCA-reduced datasets, and 2 for the ICA-reduced and decision-tree-reduced datasets; for Wine, the optimal number of clusters is 2 for all the datasets.

The above charts show how these identified clusters compare with the true labels of the dataset. Note that for this experiment, the number of clusters was set equal to the number of distinct labels in the original problem; for any other *k*, the Rand index always came out lower. Overall, it seems that for Abalone, ~50% of the points are clustered according to their labels. The percentage for Wine is ~60%, and a higher number here makes intuitive sense because it’s a multi-class dataset, so a data point can be misplaced into a different cluster in a larger number of ways; in terms of the Rand index definition, the value of *b* is higher in this case. Another thing to note is that even though the index is much the same for all the reduced datasets, this does not imply that the algorithms generate exactly the same clusters when run on each of these datasets. This point is validated by the following charts.

The above charts plot the Rand index comparing the k-means clusters on the reduced datasets to the clusters on the dataset with the original features. Even though they all had a similar Rand index when compared with the true labels, they clearly don’t generate the same clusters. It is not surprising that the datasets reduced by PCA and Random Projections end up with clusters very similar to those of the original dataset: both of these transformations, by design, preserve Euclidean distances in their own way.

#### Clustering using Gaussian Mixture Models (Expectation Max. — EM)

The aim of the EM clustering algorithm is to assign each data point a probability of lying in each cluster. It is often termed *soft* clustering, as points are not definitively allotted to clusters, but only assigned probabilistically.

This experiment was done using MATLAB’s GMM clustering module, where Gaussian mixture models are used for data clustering. The first step is to fit the data using *k *components of the GMM, and the second step is to assign data points to these components while maximising the component posterior probability given the data. The above observations in the charts are used to estimate *k*. The following two charts compare the various clusters obtained with the true labels and the clusters on the original feature space respectively.
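The experiments here were run in MATLAB, but as an illustrative equivalent, the same two-step procedure (fit a *k*-component GMM, then assign each point to the component with the highest posterior) looks like this in Python with scikit-learn, on synthetic stand-in data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters of 150 points each.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2))
               for c in ([0, 0], [4, 4])])

# Step 1: fit a 2-component Gaussian mixture to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Step 2: soft assignments (one posterior per component), then the
# hard assignment is the component with the maximum posterior.
probs = gmm.predict_proba(X)
labels = probs.argmax(axis=1)
```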

We see similar trends to those observed with k-means clustering: the PCA- and Random-Projections-reduced datasets produce clusters closest to those produced on the data with the original features. One thing to notice, though, is that the silhouette values for the EM algorithm are somewhat lower than those for k-means, so in a Euclidean sense, the quality of the clusters produced by EM looks slightly worse than that of the k-means ones. The reason could be that EM performs soft clustering, which means that points near the probability thresholds can go multiple ways; perhaps these boundary points are causing the slight dip in the silhouette values.

### Learning using Neural Networks

This is where it all culminates: we look at how classification performance is affected when clustering and/or dimensionality reduction algorithms are applied to the dataset at the pre-processing stage. Only the Abalone dataset was used for this part. A neural network is used as the base classifier to compare performances. The optimal configuration for the neural network was obtained using grid search and cross-validation. The final network is a feed-forward neural network with a hidden layer of 15 sigmoid neurons and an output layer of 2 softmax neurons (it’s a binary classification problem).
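As a sketch of this configuration in Python, here is scikit-learn's `MLPClassifier` (not the toolbox actually used) with the matching 15-unit sigmoid hidden layer; the output layer differs slightly, as `MLPClassifier` uses a single logistic output for binary problems, and the data below is a synthetic stand-in:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 9))                 # 9 features, as in Abalone
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in binary labels

# Feed-forward net: one hidden layer of 15 logistic (sigmoid) neurons,
# trained by minimising cross-entropy loss.
net = MLPClassifier(hidden_layer_sizes=(15,), activation="logistic",
                    max_iter=500, random_state=0).fit(X, y)
train_acc = net.score(X, y)
```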

For the first part of the experiment, a neural network is trained on five datasets, each with the same number of observations but a different set of features — first with the original set of features, and the remaining four with the features coming from the dimensionality reduction step. Here are the observations –

Note that the error in the above table is measured in terms of cross-entropy loss. A few observations can be drawn. Firstly, only the DT features come close to matching the original features in terms of loss, doing so with fewer epochs and in less time. So it seems that the features our decision tree filtered out were indeed the most informative ones. The other reduced feature sets don’t do as well. The performance of the independent components is especially bad, perhaps because the underlying assumptions of ICA don’t hold here, i.e., there is no way of separating the original features into independent ones. That the decision tree gives us better features can also be explained by noting that it was the only dimensionality reduction method that used information about the true labels; the other three methods were truly unsupervised, but the decision tree filtering wasn’t.

The second part was to learn the labels using clusters as features. Both clustering algorithms, k-means and EM, were used to cluster the five datasets (original and reduced dimensions), with the same numbers of clusters as determined in the clustering step above. Since the cluster labels are not ordinal, these new features had to be one-hot encoded. The neural network was then trained on three datasets — one with the original features, one with only the cluster features, and one with all the features combined. The following results were obtained.
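The cluster-feature construction can be sketched as follows (synthetic stand-in data, k = 3 clusters as an example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))      # stand-in for the Abalone feature matrix

# Cluster, then one-hot encode the (non-ordinal) cluster labels.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
onehot = np.eye(k)[labels]                 # one column per cluster

X_clusters_only = onehot                   # cluster features alone
X_combined = np.hstack([X, onehot])        # original + cluster features
```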

So clearly, using only the cluster features is a bad idea performance-wise: both the training and test loss increase, though the NN trains faster and in fewer epochs. This might be because the feature matrix, when we only consider the cluster features, is highly sparse, as all the features are eventually one-hot encoded. Having said that, when we add the cluster features to the original ones and then train the NN, only the training error improves; the test error deteriorates, implying overfitting. We’re creating clusters out of the training data and then using those same clusters as features to train another model on the same data. Perhaps we’re *trusting* our training data a bit too much, which is what’s causing the model to overfit. And that wraps up the analysis.

### References

[1] S. Vempala, *The Random Projection Method*, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 65. AMS, 2004.