This field has taken on its full importance as a result of the digital revolution in business, which has led to the production of massive data of various forms and types, at ever-increasing rates: Big Data.
On a purely mathematical level, most of the algorithms used are already several decades old. In this article, I'll explain how 8 machine learning algorithms work, in the simplest possible terms.
1. A few key concepts before moving on to algorithms
A. Classification or prediction?
Example: classifying consumers' reasons for visiting a store in order to create a hyper-personalized marketing campaign.
Example: stroke prediction based on electrocardiogram data.
B. Supervised and unsupervised learning
Example: in botany, you've taken measurements (stem length, petal length, etc.) on 100 plants of 3 different species. Each measurement is labelled with the plant's species. You want to build a model that will automatically tell you which species a new plant belongs to, based on the same type of measurements.
Example: You'd like to classify your customers according to their browsing history on your website, but you haven't set up any groups a priori, and are in the process of exploring which groups of customers would be similar (behavioral marketing). In these cases, a clustering (partitioning) algorithm is appropriate.
2. Machine Learning algorithms
A. "The decision tree
The tree starts with a root (where we have all our observations), then a series of branches whose intersections are called nodes, and ends with leaves, each corresponding to one of the classes to be predicted. The depth of the tree is defined as the maximum number of nodes before reaching a leaf.
Each node in the tree represents a rule (e.g. petal length greater than 2.5 cm). Going through the tree means checking a series of rules.
The tree is constructed in such a way that each node corresponds to the rule (measurement type and threshold) that best divides the initial set of observations.
The tree has a depth of 2 (one node plus the root). Petal length is the first measure used, as it best separates the 4 observations according to class membership (in this case, class B).
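Translated into code, such a tree is just nested if/else rules. Here is a minimal Python sketch of a depth-2 tree for the botany example; the thresholds and class names are hypothetical, chosen only for illustration:

```python
def classify_plant(petal_length, petal_width):
    """A hand-built decision tree of depth 2.
    Each `if` is a node: a rule on one measurement and a threshold."""
    # Root rule: petal length best separates class B from the others
    if petal_length <= 2.5:
        return "B"
    # Node at depth 1: a rule on a second measurement splits the rest
    if petal_width <= 1.0:
        return "A"
    return "C"
```

In practice, a learning algorithm chooses these rules and thresholds automatically, picking at each node the split that best divides the training observations.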
B. "Random Forests
To better understand how this algorithm works, let's start with an example:
You're looking for a good travel destination for your next vacation. You ask your best friend for his opinion. He asks about your previous trips and makes a recommendation.
You decide instead to ask a group of friends, each of whom asks you a few questions chosen at random. They each give you a recommendation. The destination selected is the one most recommended by your friends.
Both recommendations will seem like good destination choices. But while the first method works very well for you, the second will generalize better to other people. This is because your best friend, who is effectively building a decision tree to recommend a destination, knows you very well: his decision tree has overfitted to you.
Your group of friends represents the random forest of multiple decision trees, and it's a model that, when used properly, avoids the pitfall of overfitting.
So how is this forest built?
- We draw a number X of observations from the initial data set, at random and with replacement (a bootstrap sample).
- We randomly select a number K of the M available variables (features), for example: only temperature and population density.
- A decision tree is trained on this dataset.
- Repeat steps 1 to 3 N times to obtain N trees.
For a new observation whose class we're looking for, we go down each of the N trees. Each tree proposes a class. The class selected is the one most represented among all the trees in the forest (majority vote, a form of 'ensemble' learning).
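The steps above can be sketched in a few lines of Python. This toy version uses depth-1 trees ("stumps") as the base learners and, for brevity, skips the random feature-subsampling step; the data in the usage example is hypothetical:

```python
import random
from collections import Counter

def train_stump(rows):
    """Train a depth-1 tree: pick the (feature, threshold) pair that best
    splits the labels. Naive search: every observed value of every feature
    is tried as a threshold, scored by the majority-class count on each side."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            score = (max(Counter(left).values(), default=0) +
                     max(Counter(right).values(), default=0))
            if best is None or score > best[0]:
                l = Counter(left).most_common(1)[0][0] if left else None
                r = Counter(right).most_common(1)[0][0] if right else None
                best = (score, f, t, l, r)
    _, f, t, l, r = best
    return lambda x: l if x[f] <= t else r

def random_forest(data, n_trees=25):
    """Train n_trees stumps, each on a bootstrap sample (drawn with
    replacement), and predict by majority vote over the trees."""
    trees = [train_stump([random.choice(data) for _ in data])
             for _ in range(n_trees)]
    def predict(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]
    return predict
```

Because each tree only sees a random resample of the data, no single tree can overfit the whole training set, and the vote averages their errors out.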
C. Gradient Boosting / XGBoost
We'll explain the principle of gradient boosting using decision trees, but it works with other base models too.
You have a database of individuals with demographic information and past activities. For 50% of the individuals, you have their age, but the other half is unknown.
You want to know a person's age based on their activities: food shopping, television, gardening, video games, etc.
In this case, it's a regression tree, because the value to be predicted is numerical.
Your first regression tree is satisfactory, but there's plenty of room for improvement: for example, it predicts that an individual is 19 years old when in reality he's 13, and for another 55 years old instead of 68.
The principle of gradient boosting is that you then fit a new model to the difference between the predicted value and the true value: the residual.
This step is repeated N times, each new model further reducing the error between the prediction and the true value.
The method for optimizing is the gradient descent method, which we won't explain here.
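Here is a minimal sketch of that residual-fitting loop in Python. The "tree" is reduced to a single split at the median (a stand-in for a real regression tree), and the activity scores and ages in the usage example are hypothetical:

```python
def fit_stump(xs, ys):
    """A tiny regression 'tree': one split at the median x,
    predicting the mean of each side."""
    split = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < split] or [0.0]
    right = [y for x, y in zip(xs, ys) if x >= split] or [0.0]
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lm if x < split else rm

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    """Each new stump is fitted to the residuals (true value minus the
    current ensemble prediction), i.e. the gradient of the squared error."""
    models = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        models.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * m(x) for m in models)
```

Each round, the remaining error shrinks; the learning rate `lr` controls how aggressively each new model corrects its predecessors.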
The XGBoost (eXtreme Gradient Boosting) model is one of the best-known gradient boosting implementations, created by Tianqi Chen, and has won over the Kaggle data scientist community with its efficiency and performance. The original publication explains the algorithm in detail.
D. "Genetic Algorithms
Genetic algorithms take their inspiration from natural evolution: selection, reproduction, mutation. More prosaically, they are mainly used when we don't have any initial observations, but rather want a machine to learn as it goes along.
These algorithms are not the most efficient for a specific problem, but rather for a set of sub-problems (e.g. learning to balance and walk in robotics).
Let's take a simple example:
We want to find the code of a safe, made up of 15 letters:
The genetic algorithm approach will be as follows:
We start with a population of 10,000 "chromosomes" of 15 letters each. We assume that the code is a word or a combination of words.
We're going to define a method of reproduction: for example, combining the start of one chromosome with the end of another.
Example: "DEEP-LEARNING" + "STATISTICAL-INFERENCE" = "DEEP-INFERENCE".
Next, we define a mutation method that introduces variation when the population gets stuck. In our case, this could be randomly changing one of the letters.
Finally, we define a score that rewards the best chromosome descendants. In our case, where the code is hidden, we can imagine a sound made by the safe when 80% of the letters are correct, growing louder as we get closer to the right code.
Our genetic algorithm starts from the initial population and generates new chromosomes, by reproduction and mutation, until the solution has been found.
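Here is what that loop can look like in Python for the safe example. The hidden code, the (smaller) population size, and the mutation rate are arbitrary choices for illustration; the score simply counts matching letters, playing the role of the safe's sound:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
TARGET = "MACHINELEARNING"   # the hidden 15-letter code (hypothetical)

def score(chrom):
    """Fitness: number of correct letters -- the 'sound' made by the safe."""
    return sum(a == b for a, b in zip(chrom, TARGET))

def reproduce(a, b):
    """Crossover: the start of one chromosome glued to the end of another."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):
    """Randomly vary some letters of an offspring."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in chrom)

random.seed(42)
population = ["".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
              for _ in range(200)]

for generation in range(1000):
    population.sort(key=score, reverse=True)
    if population[0] == TARGET:          # the safe opens
        break
    parents = population[:50]            # selection: keep the fittest
    children = [mutate(reproduce(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(150)]
    population = parents + children

best = max(population, key=score)
```

Note that keeping the fittest parents unchanged (elitism) guarantees the best score never decreases from one generation to the next.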
E. Support Vector Machines
Let's return to our example of ideal vacation destinations. For the sake of simplicity, let's consider just 2 variables to describe each city: temperature and population density. We can therefore represent cities in 2 dimensions.
We use circles to represent cities you really enjoyed visiting, and squares for those you enjoyed less. When you consider a new city, you'll want to know which group it is closest to.
As we can see from the graph on the right, there are many planes (straight lines when we only have 2 dimensions) separating the two groups.
We'll choose the straight line that lies at the maximum distance from the two groups. To build it, we can already see that we don't need all the points: we only need the points at the boundary of their group. These points, or vectors, are called the support vectors. The planes passing through them are called support planes. The dividing plane is the one equidistant from the two support planes.
What if the groups aren't as easily separable, e.g. on one dimension the circles are found among the squares and vice versa?
We're going to transform these points using a function to separate them. As in the example below:
The SVM algorithm therefore seeks the hyperplane that both maximizes the margin and minimizes classification errors.
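A minimal linear SVM can be trained by sub-gradient descent on the hinge loss, where the constant C sets the trade-off between a wide margin and classification errors. A Python sketch, with hypothetical temperature/density data rescaled to [0, 1]:

```python
def train_linear_svm(data, epochs=500, lr=0.01, C=10.0):
    """Linear SVM trained by sub-gradient descent on the hinge loss:
        minimize ||w||^2 / 2 + C * sum(max(0, 1 - y * (w.x + b)))
    Labels y are +1 (liked / circle) or -1 (disliked / square)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            g0, g1, gb = w[0], w[1], 0.0     # gradient of ||w||^2 / 2
            if margin < 1:                   # point inside the margin: hinge active
                g0 -= C * y * x1
                g1 -= C * y * x2
                gb -= C * y
            w[0] -= lr * g0
            w[1] -= lr * g1
            b -= lr * gb
    return lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
```

Only the points with an active hinge (those near the boundary, the support vectors) ever push on w and b; points far inside their group contribute nothing, which matches the intuition above.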
F. The "K nearest neighbors
An observation is assigned the majority class among its K nearest neighbors. "Is that all?", you may ask. Yes, that's all, but as the following example shows, the choice of K can change a lot of things.
We will therefore try different values of K to obtain the most satisfactory separation.
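The whole algorithm fits in a few lines of Python. With the hypothetical points in the usage example below, the same new observation is classified as a square with K=1 but as a circle with K=5, which is exactly why the choice of K matters:

```python
from collections import Counter
from math import dist

def knn_predict(training, new_point, k):
    """Assign new_point the majority class among its k nearest neighbours.
    training is a list of (point, label) pairs."""
    neighbours = sorted(training,
                        key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Usage: with circles at (1,1), (2,1), (1,2) and squares at (8,8), (9,8), the point (5,5) has a square as its single nearest neighbour, but circles win the vote once all five points are consulted.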
G. Logistic regression
Let's first recall linear regression: the aim is to find the coefficients a1, a2, ... that give the best estimate:
Cotton price = a1 * Number of hectares + a2 * Cotton demand + ...
Logistic regression is used in classification in the same way as the algorithms described so far. We'll take the example of travel again, considering only two classes: good destination (Y=1) and bad destination (Y=0).
P(1): probability of the city being a good destination.
P(0): probability that the city is a bad destination.
The city is represented by a number of variables, of which we will consider just two: temperature and population density.
X = (X1: temperature, X2: population density)
We are therefore interested in constructing a function that gives us, for a city X:
P(1|X): probability that the destination is good knowing X, i.e. probability that the city verifying X is a good destination.
We'd like to link this probability to a linear combination, as in linear regression. However, the probability P(1|X) varies between 0 and 1, whereas a linear combination can take any real value (from -infinity to +infinity).
To do this, we'll start by considering P(1|X) / (1 - P(1|X)), which is the ratio between the probability that the destination is good and the probability that the destination is bad. For high probabilities this ratio approaches + infinity (for example, a probability of 0.99 gives 0.99 / 0.01 = 99) and for low probabilities it approaches 0: (a probability of 0.01 gives 0.01 / 0.99 = 0.0101 ).
We've gone from [0,1] to [0,+infinity[. To extend the range of possible values to ]-infinity,+infinity[, we take the natural logarithm of this ratio.
It follows that we are looking for b0,b1,b2, ... such that :
ln(P(1|X) / (1 - P(1|X))) = b0 + b1X1 + b2X2
The right-hand side is the regression part; the left-hand side, the natural logarithm of the odds (the logit), is the logistic part.
The logistic regression algorithm will therefore find the best coefficients to minimize the error between the prediction made for destinations visited and the true label (good, bad) given.
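That optimization can be written out with plain gradient descent on the log loss. In this Python sketch, the destination data (temperature and density rescaled to [0, 1]) is hypothetical:

```python
from math import exp

def sigmoid(z):
    """Inverse of the logit: maps ]-inf, +inf[ back to a probability in ]0, 1[."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(data, epochs=2000, lr=0.5):
    """data: list of ((x1, x2), y) with y = 1 (good destination) or 0 (bad).
    Finds b0, b1, b2 in ln(p / (1 - p)) = b0 + b1*x1 + b2*x2
    by gradient descent on the log loss."""
    b0 = b1 = b2 = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            p = sigmoid(b0 + b1 * x1 + b2 * x2)
            err = p - y          # gradient of the log loss w.r.t. the logit
            b0 -= lr * err
            b1 -= lr * err * x1
            b2 -= lr * err * x2
    return lambda x1, x2: sigmoid(b0 + b1 * x1 + b2 * x2)
```

The returned function gives P(1|X) directly, so classifying a new city is just a matter of checking whether that probability exceeds 0.5.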
Up to now, we've been talking about supervised learning algorithms. The classes are known and we want to classify or predict a new observation. But how do you do this when there is no predefined group? When you want to find patterns shared by groups of individuals?
Clustering algorithms, through unsupervised learning, fulfill this role.
Let's take the example of a company that has begun its digital transformation. It has new sales and communication channels through its website and one or more associated mobile applications. In the past, it addressed its customers on the basis of demographic data and purchase history. But how do you make the most of your customers' browsing data? Does online behavior correspond to traditional customer segments?
These questions can motivate you to use clustering to see if any major trends emerge. This helps to confirm or refute any business intuitions you may have.
There are many different clustering algorithms (hierarchical clustering, k-means, DBSCAN, ...). One of the most widely used is the k-means algorithm. Here's a simple explanation of how it works:
Even though we don't know in advance how the clusters will be formed, the k-means algorithm requires us to specify the expected number of clusters. Techniques exist (such as the elbow method) to find the optimal number of clusters.
Let's take the example of cities. Our dataset has 2 variables, so we're working in 2 dimensions. After an initial study, we expect to have 2 clusters. We begin by randomly placing two points; these represent our initial means. We associate the observations closest to these means with the same clusters. We then calculate the mean of the observations in each cluster and move the means to the calculated position. We then reassign the observations to the nearest means, and so on.
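The two alternating steps described above (assign each point to its nearest mean, then move each mean to the centre of its cluster) can be sketched as follows; the 2-D points in the usage example are hypothetical:

```python
import random

def kmeans(points, k=2, iterations=20, seed=0):
    """Toy k-means on 2-D points: alternate between an assignment step
    and an update step for a fixed number of iterations."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # random initial 'means'
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            i = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2 +
                                  (p[1] - means[j][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):          # update step
            if c:
                means[i] = (sum(p[0] for p in c) / len(c),
                            sum(p[1] for p in c) / len(c))
    return means, clusters
```

Running it on two well-separated blobs of points recovers the two groups regardless of where the initial means land.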
To ensure the stability of the groups found, it is advisable to repeat the drawing of the initial 'means' several times, as some initial drawings may give a configuration that differs from the vast majority of cases.
3. Relevance and quality factors for Machine Learning algorithms
Good classification implies both placing observations in the right group and not placing any in the wrong groups.
The metric chosen may vary according to the algorithm's purpose and business use.
A number of data-related factors can have a major impact on algorithm quality. Here are the main ones:
A. The number of observations
B. The number and quality of attributes describing these observations
- A distance between two values of a numerical variable (price, height, weight, light intensity, noise intensity, etc.) is easy to establish.
- A distance between two values of a categorical attribute (color, beauty, utility...) is more delicate to define.
C. Percentage of complete and missing data
I hope this article has given you some insight into what Machine Learning is all about. Please feel free to use the comment section to give me feedback on any aspects you'd like to clarify or expand on.
Cofounder and CTO at Datakeen
At Datakeen we aim to simplify the use and understanding of new algorithmic techniques by business functions across all industries.
- 3 Deep Learning Architectures explained in Human Language
- Key Successes of Deep Learning and Machine Learning in production