8 Machine Learning Algorithms Explained in Human Language

by | Nov 6, 2017 | À la une, Machine Learning, Smart Data, Statistics

What we call Machine Learning is nothing other than the meeting of statistics with the computing power available today (memory, processors, graphics cards).

This field has taken on its full importance as a result of the digital revolution in business, which has led to the production of massive data of various forms and types, at ever-increasing rates: Big Data.

On a purely mathematical level, most of the algorithms used are already several decades old. In this article, I'll explain how 8 machine learning algorithms work, in the simplest possible terms.

1. A few key concepts before moving on to algorithms

A. Classification or prediction?

Classification

Assigning a class/category to each observation in a dataset is called classification. This is done a posteriori, once the data has been retrieved.

Example: classifying consumers' reasons for visiting a store in order to create a hyper-personalized marketing campaign.

Prediction

A prediction is based on a new observation. In the case of a numerical (continuous) variable, this is called regression.

Example: stroke prediction based on electrocardiogram data.

B. Supervised and unsupervised learning

Supervised

You already have labels on historical data and are looking to classify new data. The number of classes is known.

Example: in botany, you've taken measurements (stem length, petal length, etc.) on 100 plants of 3 different species. Each measurement is labelled with the plant's species. You want to build a model that will automatically tell you which species a new plant belongs to, based on the same type of measurements.

Unsupervised

In unsupervised learning, on the other hand, you have no labels, no predefined classes. You want to identify common patterns in order to create homogeneous groups from your observations.

Example: You'd like to classify your customers according to their browsing history on your website, but you haven't set up any groups a priori, and are in the process of exploring which groups of customers would be similar (behavioral marketing). In these cases, a clustering (partitioning) algorithm is appropriate.

2. Machine Learning algorithms

We will describe 8 algorithms used in Machine Learning. The aim here is not to go into the details of the models, but rather to give the reader some insight into each of them.

A. "The decision tree

A decision tree is used to classify future observations given a corpus of observations already labelled. This is the case in our botanical example, where we already have 100 observations classified as species A, B and C.

The tree starts with a root (where we have all our observations), then a series of branches whose intersections are called nodes, and ends with leaves, each corresponding to one of the classes to be predicted. The depth of the tree is defined as the maximum number of nodes before reaching a leaf.

Each node in the tree represents a rule (e.g. petal length greater than 2.5 cm). Going through the tree means checking a series of rules.

The tree is constructed in such a way that each node corresponds to the rule (measurement type and threshold) that best divides the initial set of observations.

Example:

Decsion tree data

Decision Tree

The tree has a depth of 2 (one node plus the root). Petal length is the first measure used, as it best separates the 4 observations according to class membership (in this case, class B).

B. "Random Forests

As its name suggests, the random forest algorithm is based on decision trees.

To better understand how this algorithm works, let's start with an example:

You're looking for a good travel destination for your next vacation. You ask your best friend for his opinion. He asks about your previous trips and makes a recommendation.

You decide to ask a group of friends who are randomly asking you questions. They each give you a recommendation. The destination selected is the one most recommended by your friends.

Recommendations from your best friend and the group will both seem like good destination choices. But when the first method of recommendation works very well for you, the second will be more reliable for others.

This is because your best friend, who is building a decision tree to give you a destination recommendation, knows you very well, so the decision tree has overlearned you.

Your group of friends represents the random forest of multiple decision trees, and it's a model that, when used properly, avoids the pitfall of overfitting.

So how is this forest built?

Here are the main steps:
  • We take a number X of observations from the initial data set (with discount).
  • We take a number K of the M available variables (features),
    for example: only temperature and population density
  • A decision tree is trained on this dataset.
  • Repeat steps 1. to 4. N times to obtain N trees.

For a new observation whose class we're looking for, we go down the N trees. Each tree proposes a different class. The class selected is the one that is most represented among all the trees in the forest (majority vote / 'Ensemble' in English).

C. Gradient Boosting / XG Boost

The gradient boosting method is used to strengthen a model that produces weak predictions, such as a decision tree (see how to judge the quality of a model).

We'll explain the principle of gradient boosting with the decision tree, but it could be with another model.

You have a database of individuals with demographic information and past activities. For 50% of the individuals, you have their age, but the other half is unknown.

You want to know a person's age based on their activities: food shopping, television, gardening, video games, etc.

In this case, it's a regression tree, because the value to be predicted is numerical.

Your first regression tree is satisfactory, but there's plenty of room for improvement: for example, it predicts that an individual is 19 years old when in reality he's 13, and for another 55 years old instead of 68.

The principle of gradient boosting is that you will remodel the difference between the predicted value and the true value to be predicted.

Age Prediction Tree 1 Difference Prediction Tree 2
13 19 -6 15
68 55 +13 63

This step is repeated N times, where N is determined by successively minimizing the error between the prediction and the true value.

The method for optimizing is the gradient descent method, which we won't explain here.

The XG Boost (eXtreme Gradient Boosting) model is one of the gradient boosting implementations founded by Tianqi Chen, and has won over the Kaggle data scientist community with its efficiency and performance. The publication explaining the algorithm can be found here.

D. "Genetic Algorithms

As their name suggests, genetic algorithms are based on the process of genetic evolution that made us who we are...

More prosaically, they are mainly used when we don't have any initial observations, but rather want a machine to learn as it goes along.

These algorithms are not the most efficient for a specific problem, but rather for a set of sub-problems (e.g. learning to balance and walk in robotics).

Let's take a simple example:

We want to find the code of a safe made up of 15 letters:

"MACHINELEARNING

The genetic algorithm approach will be as follows:

We start with a population of 10,000 "chromosomes" of 15 letters each. We say that the code is a word or a set of similar words.

"DEEP-LEARNING

"STATISTICAL-INFERENCE

"HUMAN-MACHINE INTERFACE

etc.

We're going to define a method of reproduction: for example, combining the start of one chromosome with the end of another.

Example: "DEEP-LEARNING" + "STATISTICAL-INFERENCE" = "DEEP-INFERENCE".

Next, we're going to define a mutation method that allows us to evolve an offspring that is blocked. In our case, this could be to vary one of the letters randomly.

Finally, we define a score that will reward one or other chromosome descendant. In our case, where the code is hidden, we can imagine a sound made by the chest when 80% of the letters are similar, becoming louder and louder as we get closer to the right code.

Our genetic algorithm will start from the initial population and form chromosomes until the solution has been found.

E. Support Vector Machines

Also known as "SVM" (Support Vector Machine), this algorithm is mainly used for classification problems, although it has been extended to regression problems (Drucker et al., 96).

Let's return to our example of ideal vacation destinations. For the sake of simplicity, let's consider just 2 variables to describe each city: temperature and population density. We can therefore represent cities in 2 dimensions.

We use circles to represent cities you've really enjoyed visiting, and squares to represent those you've enjoyed less. When you're considering new cities, you'll want to know which group this city is closest to.

As we can see from the graph on the right, there are many planes (straight lines when we only have 2 dimensions) separating the two groups.

SVM Optimal Plane

We'll choose the straight line that lies at the maximum distance between the two groups. To build it, we can already see that we don't need all the points, we just need to take the points that are at the boundary of their group - we call these points or vectors, the support vectors. The planes passing through these support vectors are called support planes. The dividing plane is the one equidistant from the two support planes.

What if the groups aren't as easily separable, e.g. on one dimension the circles are found among the squares and vice versa?

We're going to transform these points using a function to separate them. As in the example below:

SVM transformation example
The SVM algorithm will therefore seek both the optimal hyperplane and minimize classification errors.

F. The "K nearest neighbors

Pause. After 5 relatively technical models, the K-nearest neighbor algorithm will seem like a formality. Here's how it works:

K nearest neighbours

An observation is assigned the class of its K nearest neighbors. "That's all, you may ask. Yes, that's all, but as the following example shows, the choice of K can change a lot of things.

We will therefore try different values of K to obtain the most satisfactory separation.

G. Logistic regression

Let's start by recalling the principle of linear regression. Linear regression is used to predict a numerical variable, such as the price of cotton, as a function of other numerical or binary variables: the number of hectares under cultivation, the demand for cotton by various industries, etc.

The aim is to find the coefficients a1, a2, ... to obtain the best estimate:

Cotton price = a1 * Number of hectares + a2 * Cotton demand + ...

Logistic regression is used in classification in the same way as the algorithms described so far. We'll take the example of travel again, considering only two classes: good destination (Y=1) and bad destination (Y=0).

P(1): probability of the city being a good destination.

P(0 ): probability that the city is a bad destination.

The city is represented by a number of variables, of which we will consider just two: temperature and population density.

X = (X1: temperature, X2: population density)

We are therefore interested in constructing a function that gives us for a city X :

P(1|X): probability that the destination is good knowing X, i.e. probability that the city verifying X is a good destination.

We'd like to link this probability to a linear combination such as linear regression. Only the probability P(1|X) varies between 0 and 1, and we want a function that spans the entire real number domain (from -infinity to +infinity).

To do this, we'll start by considering P(1|X) / (1 - P(1|X)), which is the ratio between the probability that the destination is good and the probability that the destination is bad. For high probabilities this ratio approaches + infinity (for example, a probability of 0.99 gives 0.99 / 0.01 = 99) and for low probabilities it approaches 0: (a probability of 0.01 gives 0.01 / 0.99 = 0.0101 ).

We've gone from [0,1] to [0,+infinity[. To extend the scope of possible values to ]-infinity,0], we take the natural logarithm of this ratio.

It follows that we are looking for b0,b1,b2, ... such that :

ln (P(1|X)/(1 - P(1|X)) = b0 + b1X1 + b2X2

The right-hand side represents the regression and the natural logarithm denotes the logistic part.

The logistic regression algorithm will therefore find the best coefficients to minimize the error between the prediction made for destinations visited and the true label (good, bad) given.

H. Clustering

Supervised vs. unsupervised learning. Do you remember it?

Up to now, we've been talking about supervised learning algorithms. The classes are known and we want to classify or predict a new observation. But how do you do this when there is no predefined group? When you want to find patterns shared by groups of individuals?

Through their unsupervised learning, clustering algorithms fulfill this role.

Let's take the example of a company that has begun its digital transformation. It has new sales and communication channels through its website and one or more associated mobile applications. In the past, it addressed its customers on the basis of demographic data and purchase history. But how do you make the most of your customers' browsing data? Does online behavior correspond to traditional customer segments?

These questions can motivate you to use clustering to see if any major trends emerge. This helps to confirm or refute any business intuitions you may have.

There are many different clustering algorithms (hierarchical clustering, k-means, DBSCAN, ...). One of the most widely used is the k-means algorithm. Here's a simple explanation of how it works:

Even if we don't know how the clusters will be formed, the k-means algorithm requires us to give the expected number of clusters. Techniques exist to find the optimal number of clusters.

Let's take the example of cities. Our dataset has 2 variables, so we're working in 2 dimensions. After an initial study, we expect to have 2 clusters. We begin by randomly placing two points; these represent our initial means. We associate the observations closest to these means with the same clusters. We then calculate the mean of the observations in each cluster and move the means to the calculated position. We then reassign the observations to the nearest means, and so on.

Clustering K means

To ensure the stability of the groups found, it is advisable to repeat the drawing of the initial 'means' several times, as some initial drawings may give a configuration that differs from the vast majority of cases.

3. Relevance and quality factors for Machine Learning algorithms

Machine learning algorithms are evaluated on the basis of their ability to classify or predict correctly both on the observations used to train the model (training and test set) and, above all, on observations whose label or value is known and which were not used to build the model (validation set).

Good classification implies both placing observations in the right group and not placing any in the wrong groups.

The metric chosen may vary according to the algorithm's purpose and business use.

A number of data-related factors can have a major impact on algorithm quality. Here are the main ones:

A. The number of observations

The fewer the observations, the more difficult the analysis. But the more there are, the greater the need for computer memory and the longer the analysis).

B. The number and quality of attributes describing these observations

  • The distance between two numerical variables (price, height, weight, light intensity, noise intensity, etc.) is easy to establish.
  • The relationship between two categorical attributes (color, beauty, utility...) is more delicate.

C. Percentage of complete and missing data

D. Noise

The number and "location" of doubtful values (potential errors, outliers...), or values that naturally do not conform to the general distribution pattern of the "examples" in their distribution space, will have an impact on the quality of the analysis.

Conclusion

We have seen that machine learning algorithms serve two purposes: to classify and to predict, and that they can be divided into supervised and unsupervised algorithms. There are many possible algorithms, and we have looked at 8 of them, including logistic regression and random forests to classify an observation, and clustering to generate homogeneous groups from the data. We also saw that the value of an algorithm depended on the associated cost or loss function, but that its predictive power depended on several factors linked to the quality and volume of the data.

I hope this article has given you some insight into what Machine Learning is all about. Please feel free to use the comment section to give me feedback on any aspects you'd like to clarify or expand on.

Author

Gaël Bonnardot

Cofounder and CTO at Datakeen

At Datakeen we aim to simplify the use and understanding of new algorithmic techniques by business functions across all industries.

Read also

  • 3 Deep Learning Architectures explained in Human Language
  • Key Successes of Deep Learning and Machine Learning in production

Sources

confetti

Datakeen

Discover all our AI solutions