 Unsupervised learning occurs when an algorithm learns from plain examples without any associated response, leaving to the algorithm to determine the data patterns on its own. This type of algorithms tends to restructure the data into something else, such as new features that may represent a class or a new series of uncorrelated values. They are quite useful in providing humans with insights into the meaning of data and new useful inputs to supervised machine learning algorithms.

“AT THE END OF THE ARTICLE, YOU WILL BE COMfortable talking about UNSUPERVISED LEARNING”

This way of learning is very similar to the way humans learn to figure out patterns in objects or events, e.g. a human isn’t expected to find camels in Canada. It comes from observing the degree of similarity between objects.

Some recommendation systems used in automated marketing are based on unsupervised learning algorithms. The marketing automation algorithm derives its suggestions from what you’ve bought in the past. The recommendations are based on an estimation of what group of customers you resemble the most and then inferring your likely preferences based on that group.

Unsupervised learning may be used for different use cases. In this article, we present two algorithms used in two different use cases : K-means for clustering and t-SNE for dimensionality reduction. Although PCA can also be used to reduce the dimensionality of the data and enable us to get more sense of it, it is weaker than t-SNE because it assumes linearity between the variables.

## K-means

K-means is the most popular iterative descent clustering algorithm requiring quantitative variables and using the Euclidian distance as a dissimilarity measure.

$d(x_i, x_j) = \sum_{k = 1}^{p} (x_{ik} - x_{jk})^2 = \vert \vert x_i - x_j \vert \vert ^2$

It enables to cluster the data points into K clusters where K is a hyper-parameter of the algorithm. The choice of K is often motivated by the interest of the user, e.g. the number of clusters by default in Scikit-learn is 8.

The idea behind it is quite simple. At first, data points are assigned clusters randomly, then each cluster centroid is computed and data points are assigned to the cluster corresponding to the nearest centroid. This will be iterated until the assignments do not change.

When applying K-means clustering on MNIST dataset (digits from 0 to 9) after performing PCA to enable 2D visualization, we get the following results:

## t-SNE

t-SNE stands for t-distribution Stochastic Neighborhood Embedding. It is a dimensionality reduction algorithm projecting the data points from a space of high dimension into a space of lower dimension (typically 2 or 3 dimensions). Here we will see on a medium level how t-SNE works.

Let us suppose we have 100-dimensional dataset and we want to visualize it and get sense of it. We need therefore to incorporate all the knowledge in a 2-dimensional space.

First we work on the high-dimensional space. We compute the similarity between each pair of data points, similar data points will have a high value and different ones will have a low value. Then we compute the joint probabilities using these similarities according to the normal distribution. We will call these probabilities P.

$p_{j|i} = \frac{exp(-\vert \vert x_i - x_j \vert \vert ^2 / 2 \sigma_i^2)}{\sum_{k \ != \ l}exp(-\vert \vert x_k - x_l\vert \vert ^2 / 2 \sigma_i^2)}$

Now that the joint probabilities are computed on the high-dimensional space, we will do the same on the low-dimensional space and as you maybe understood the idea is to reduce the difference (or divergence) between these two sets probabilities.

Now, t-SNE arranges all data points randomly on the specified lower dimensional space (2-dimensional here). Then the joint probabilities between each pair of data points are computed using the similarities and according to the t-distribution therefore generating the probability distribution Q.

$q_{ij} = \frac{(1 + \vert \vert y_i - y_j \vert \vert ^2)^{-1}}{\sum_{k \ != \ l} (1 + \vert \vert y_k- y_l \vert \vert ^2)^{-1} }$

At present the objective is to reduce the difference between P and Q. For this task, we use the Kullback-Leibler divergence and the low-dimensional space data points’ coordinates will be updated using this divergence gradient.

$KL(P\vert \vert Q) = \sum_i \sum_j p_{ij} \ log \ \frac{p_{ij}}{q_{ij}} \ where \ p_{ij} = (p_{i|j} + p_{j | i}) / 2$

I hope that you understood more the difference between supervised and unsupervised learning and that you feel more comfortable talking about K-means and t-SNE.