What is supervised learning?

As explained in the article Did you say Machine Learning?, supervised learning is one of the three major types of machine learning. It relies on a ground truth: the training samples are labeled, and the purpose is to learn a function that best approximates the relationship between input and output observed in the data.

“At the end of this article, you will be able to distinguish between classification and regression algorithms.”

The question you may now ask yourself is: what types of problems is it suited to? Supervised learning is mainly recommended in two contexts: classification and regression.

Classification is when you want to sort items into categories. The ground truth is hence discrete, e.g. success/fail of an experiment, survival of passengers on the Titanic.

Regression is when the purpose is to predict real values (dollars, weights, etc.). The ground truth is hence continuous, e.g. store sales or the number of tourists next year.

Before digging deeper, let us take a look at how a machine learning problem is approached.

First, the data is extracted from one or more sources (API call, CSV file, SQL query, etc.). It is then cleaned to remove redundancies and inaccuracies, and explored to identify the features, the target, the type of problem (classification or regression), the number of missing values, etc.

Next, we transform the predictors with feature engineering techniques such as one-hot encoding and filling in the missing values. This is an important step because it makes the datasets compliant with the models' requirements.
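To make these two techniques concrete, here is a minimal sketch in plain Python. The column names ("age", "port") and the values are made up for illustration, and filling with the mean is only one possible strategy among others.

```python
# Toy dataset: each row is one sample; None marks a missing value.
# Column names and values are hypothetical.
rows = [
    {"age": 22.0, "port": "S"},
    {"age": None, "port": "C"},
    {"age": 38.0, "port": "S"},
    {"age": 30.0, "port": "Q"},
]

def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column by the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

def one_hot_encode(rows, column):
    """Replace a categorical column by one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded

prepared = one_hot_encode(fill_missing_with_mean(rows, "age"), "port")
for row in prepared:
    print(row)
```

After this step every column is numeric and has no missing value, which is what most models expect as input.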

Finally, we fit our model on the training data and test it on validation data it has never seen to evaluate its performance. For a more robust performance estimate, we use K-fold cross-validation, where the data is split into K parts and each part serves in turn as the validation set.
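The K-fold idea can be sketched in a few lines of plain Python. The helper names below (k_fold_indices, cross_validate, fit_mean) are hypothetical, the "model" is a deliberately trivial one that predicts the training mean, and in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Average the validation score of a model refitted on each fold."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in validation], [ys[i] for i in validation]))
    return sum(scores) / len(scores)

# Hypothetical "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_absolute_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_score = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_absolute_error)
print(cv_score)
```

Every sample is used exactly once for validation, so the averaged score depends less on one particular split than a single train/validation cut would.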

Image credits : https://mapr.com/blog/fast-data-processing-pipeline-predicting-flight-delays-using-apache-apis-pt-1

Regression algorithms – Examples

Linear Regression

Simple linear regression is probably the oldest machine learning technique. It was not designed explicitly for prediction; its initial purpose was to study the relationship between two continuous numerical variables, for example the earnings of a basketball player versus the points scored the year before. It summarizes the relationship between the two variables by fitting a trend line.

However, complex datasets have more than two variables to analyze. This is where multivariate linear regression comes in: the idea is to estimate, for each predictor, a coefficient that describes its linear relationship with the response variable.

Let’s briefly discuss a simple example. Assume that we have three predictors x_1, x_2, x_3 and a target variable “y” to explain. Our objective is to find the coefficients \beta_i such that: y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 . This can be done by solving the maximum likelihood problem which, under the usual assumption of Gaussian noise, amounts to ordinary least squares.
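As an illustration, the sketch below estimates the three coefficients by least squares, solving the normal equations (X^T X) \beta = X^T y with a small Gaussian elimination. The data is synthetic and noise-free, generated from coefficients chosen arbitrarily for this example, and there is no intercept, as in the formula above.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear_regression(X, y):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Synthetic data generated from y = 2*x1 - 1*x2 + 0.5*x3 (arbitrary coefficients).
X = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [0.0, 1.0, 4.0], [3.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
true_beta = [2.0, -1.0, 0.5]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in X]

beta = fit_linear_regression(X, y)
print([round(b, 6) for b in beta])
```

Because the toy data contains no noise, the estimated coefficients match the ones used to generate it up to rounding error; on real data they would only approximate the underlying relationship.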

Neural Networks

A neural network is designed to model the functioning of the human brain. It is a network of interconnected functional elements, each with several inputs and an output.

To keep things simple, let’s see how a 1-hidden layer neural network works. It is composed of an input layer, a hidden layer and an output layer.

One-layer neural network

In the figure above, we have samples described by three predictors. Once the hidden layer receives the input data, it applies the activation function on the scalar product between the weights and the predictors and generates the output:

y(x_1,\ldots,x_n) = f(w_1x_1 + w_2x_2 + \ldots + w_nx_n)

where the w_i are the parameters (weights), the x_i the predictors and f the activation function.
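The formula above translates almost directly into code. The following sketch runs a forward pass through a network with one hidden layer and a single output unit; the weights are arbitrary illustrative values, the choice of tanh and sigmoid is an assumption, and bias terms are omitted to stay close to the formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, activation):
    """One unit: activation applied to the scalar product of weights and inputs."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))

def forward(x, hidden_weights, output_weights, hidden_activation, output_activation):
    """Forward pass of a network with one hidden layer and a single output unit."""
    hidden = [neuron(w, x, hidden_activation) for w in hidden_weights]
    return neuron(output_weights, hidden, output_activation)

# Three predictors and two hidden units; all weights are made-up values.
x = [1.0, 2.0, 3.0]
hidden_weights = [[0.5, -0.25, 0.1], [0.2, 0.1, -0.3]]
output_weights = [1.0, -1.0]

output = forward(x, hidden_weights, output_weights, math.tanh, sigmoid)
print(output)
```

Training the network would consist in adjusting these weights to reduce the prediction error, which this sketch does not cover.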

In a neural network architecture, the last layer is the one that generates the output, and its form depends on the problem to be solved: classification or regression. Several activation functions are in use; the best known are sigmoid, tanh, ReLU and Leaky ReLU.

Activation functions

In the case of classification, sigmoid is used to get probabilities between 0 and 1 (tanh has a similar shape but ranges from -1 to 1), while for regression we mostly use ReLU or Leaky ReLU, whose outputs are not confined to a bounded interval.
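For reference, the four activation functions can be written as follows. The slope of 0.01 used for Leaky ReLU is a commonly seen value, taken here as an assumption rather than a fixed part of the definition.

```python
import math

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into the interval (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Keeps positive values and sets negative values to zero."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but lets a small fraction of negative values through."""
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```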

Classification algorithms – Examples

Logistic Regression

Logistic regression arose from the desire to model the posterior probabilities of the different classes using linear functions, as in linear regression. We would like to model the probabilities as follows:

P(Y_i=1\vert X_i) = \beta_0 + \beta^tX_i

However, probabilities must lie in the interval [0,1], which the expression above does not guarantee. To satisfy this constraint, we use the logit function:

logit(p) = log(\frac{p}{1-p})

We therefore look for \beta_0 and \beta in the following equation by solving the maximum likelihood optimization problem:

log(\frac{P(Y_i=1\vert X_i)}{1 - P(Y_i=1\vert X_i)}) = \beta_0 + \beta^tX_i
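Inverting the logit gives P(Y_i=1\vert X_i) = sigmoid(\beta_0 + \beta^tX_i), which always lies between 0 and 1. The sketch below fits \beta_0 and \beta by plain gradient ascent on the log-likelihood. The toy data, the learning rate and the number of iterations are assumptions made for this example; real libraries typically rely on more efficient optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def predict_proba(beta0, beta, x):
    """P(Y=1 | x): the sigmoid is the inverse of the logit."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))

def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=2000):
    """Maximize the log-likelihood by plain gradient ascent."""
    beta0, beta = 0.0, [0.0] * len(X[0])
    for _ in range(n_iterations):
        # Gradient of the log-likelihood: sum over samples of (y_i - p_i) * x_i.
        grad0, grad = 0.0, [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = predict_proba(beta0, beta, xi)
            grad0 += yi - p
            for j, v in enumerate(xi):
                grad[j] += (yi - p) * v
        beta0 += learning_rate * grad0 / len(X)
        beta = [b + learning_rate * g / len(X) for b, g in zip(beta, grad)]
    return beta0, beta

# Toy data with one predictor and overlapping classes (hypothetical values).
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta0, beta = fit_logistic_regression(X, y)
print(beta0, beta)
print(predict_proba(beta0, beta, [1.0]), predict_proba(beta0, beta, [4.0]))
```

The classes overlap on purpose: with perfectly separable data the likelihood has no finite maximum and the coefficients would keep growing.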

Random Forest Classifier

Random Forest belongs to the family of tree-based algorithms. It is based on the principle of bagging (Bootstrap Aggregating), where the idea is to build several trees, each on a bootstrap sample of the data (drawn with replacement), and to aggregate them in order to reduce the variance. The main difference between plain bagging and Random Forest is that the former implicitly treats the trees as independent and identically distributed, when in fact they are correlated.
Indeed, when searching for the variables on which to split the branches, the same ones are very likely to be selected from one tree to the next. Random Forest reduces this correlation by randomly drawing, at each split, a subset of the variables among which the best split is searched.
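The two sources of randomness and the final aggregation can be sketched as follows. This is not a full implementation: the trees themselves are left out, the helper names are hypothetical, and the subset size of 3 out of 9 variables follows the common square-root heuristic for classification.

```python
import random
from collections import Counter

def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, subset_size, rng):
    """Draw the candidate variables for one split, without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(predictions):
    """Aggregate the class predicted by each tree into a single class."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples, n_features, subset_size = 10, 9, 3

for tree in range(3):
    rows = bootstrap_sample(n_samples, rng)
    # In an actual forest this draw is repeated at every split of the tree;
    # a single draw per tree is shown here to keep the output short.
    features = random_feature_subset(n_features, subset_size, rng)
    print(f"tree {tree}: rows={rows} candidate variables={features}")

print(majority_vote(["survived", "died", "survived"]))
```

Because each tree sees different rows and different candidate variables, the trees disagree more often, and averaging their votes reduces the variance further than bagging alone.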

Image credits: Eureka

SVM Classifier

Support Vector Machines (SVMs) are a class of learning algorithms initially defined for binary classification, that is, predicting a binary qualitative variable. They were then generalized to regression tasks. For a binary target, they are based on the search for the optimal-margin hyperplane which, when the data is linearly separable, separates the data correctly while being as far as possible from all observations. The principle is therefore to find a classifier whose generalization ability (quality of prediction) is as high as possible.

It is important to know that the margin is determined by the support vectors only, as shown in the pictures for a linearly separable case. This means that, as long as the support vectors themselves are reliable, the model is robust to noise: points that fall outside the margin do not affect it.
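The role of the support vectors can be checked on a small two-dimensional example. The sketch below does not train an SVM: the hyperplane (w, b) is hand-picked for this toy data, where it happens to be the maximum-margin solution, and it is written in the usual canonical form where the closest points satisfy label * (w . x + b) = 1.

```python
import math

def functional_margin(w, b, x, label):
    """label * (w . x + b); equal to 1 for points lying exactly on the margin."""
    return label * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def support_vectors(w, b, points, labels, tol=1e-9):
    """Points lying exactly on the margin."""
    return [x for x, l in zip(points, labels)
            if abs(functional_margin(w, b, x, l) - 1.0) <= tol]

def margin_width(w):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Toy data: negatives on the left, positives on the right (made-up values).
points = [(0.0, 0.0), (0.0, 1.0), (-1.0, 0.5), (2.0, 0.0), (2.0, 1.0), (3.0, 0.5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = (1.0, 0.0), -1.0  # hand-picked hyperplane: the vertical line x1 = 1

print(margin_width(w))
print(support_vectors(w, b, points, labels))

# A new, correctly classified point far from the boundary is not a support
# vector, so the same hyperplane and margin remain valid.
points.append((10.0, 5.0))
labels.append(1)
print(support_vectors(w, b, points, labels))
```

Only the four points closest to the boundary constrain the solution; moving or adding points elsewhere, as long as they stay outside the margin, leaves it unchanged.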

Image credits : Datacamp

