
Most machine learning algorithms were inspired by the way the human mind thinks. They are broadly divided into two types - Supervised Learning and Unsupervised Learning - which roughly translate to two classes of algorithms: Regression and Clustering. This blog looks into some of the major algorithms of each kind. Let us start with a core understanding of what Supervised and Unsupervised Learning mean, and then dive into the details of some of the important algorithms.
Supervised Learning
As the name suggests, Supervised Learning is learning under supervision. Consider, for example, the problem of predicting real estate prices based on various parameters like the number of rooms, the floor, the locality, etc. We do not have a precise equation for this, but we know these parameters and the price are related. We also have a good amount of data about recent real estate transactions. We can use this data to identify the relation between the variable parameters and the prices - the input and the output.
Essentially it involves five steps:
Guess an equation that relates the input and output values.
Use the equation to calculate the output for the given input.
Compare this calculated output with the real values available with us and identify the "error".
Alter the equation in a way that we expect will reduce the error.
Continue doing this till we reach a point where the error is within limits or any change increases the error again.
This is called Supervised Learning. For this to work, we need:
Good amount of data that can help us map the input and output.
A good set of parameters that meaningfully determine the price.
In technical terms, the training data is used to train a model in a way that we reach a point of minimum cost.
There are two major types of problems in supervised learning: Regression and Classification. Regression deals with continuous output values, while classification (as the name suggests) deals with classifying data into discrete output values.
Important Concepts
Before we move ahead, it is important to understand some of the core concepts.
Hypothesis
The Hypothesis is the assumed relation between the input and output. This hypothesis is verified and improved on each iteration. At the implementation level, this could be the individual coefficients in a polynomial expression, the nodes of a neural network, etc. Essentially it is a relation that we propose and then correct and refine through the process of learning.
Weights
The hypothesis relation is identified by parameters that can be altered to make it fit the given data set. For example, in a polynomial expression, the coefficients are the weights. Typically these are denoted by θ0, θ1 ...
Cost Function
The cost of a hypothesis is a measure of the overall gap between the actual expected output and the output calculated by the hypothesis. This cost is naturally related to the weights that we use to define the hypothesis. The cost function is the formal definition of this cost - as a function of these weights. Given the cost function, a good hypothesis is one with minimal cost.
Regression
Typically, the regression process of supervised learning begins with proposing a relation between the input and output. This relation could be as simple as a linear equation, or it could be a polynomial or even a massive neural network. Each such relation has parameters. For example, a linear equation y = ax + b has parameters a and b - which define the equation. The kind of proposal depends upon the analysis of the situation, the availability of data and, of course, the experience of the developer. But once a good model is defined, the next step is to identify the values of these parameters.
Thus, the process of regression consists of multiple iterations of forward propagation and backward propagation. These two steps consist of first finding the error based on the current weights (forward propagation) and then updating the weights based on the last calculated error (backward propagation). Multiple iterations of forward and backward propagation can gradually refine the model. This is the essence of regression - the most fundamental concept in supervised learning.
Gradient Descent
Gradient descent is one of the most popular methods for identifying the parameters of a learning model. Essentially, gradient descent consists of identifying the error of the model on the available data, and then gradually updating the parameters of the model in a way that ensures the best error reduction on every step. This can be visualized as a ball rolling down a curved surface. At every point it moves in the direction that gives the best reduction in its height. In order to perform gradient descent, we need a good measure of the error function.
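To make this concrete, here is a minimal Python sketch of gradient descent on an arbitrary one-variable cost function, chosen purely for illustration:

def cost(w):
    return (w - 3) ** 2        # an arbitrary bowl-shaped cost, minimal at w = 3

def gradient(w):
    return 2 * (w - 3)         # derivative of the cost with respect to w

w = 10.0                       # initial guess for the weight
alpha = 0.1                    # learning rate
for step in range(100):
    w = w - alpha * gradient(w)    # move against the gradient

print(w)                       # ends up very close to 3, where the cost is minimal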
Stochastic Gradient Descent
In traditional gradient descent, we run the forward and backward propagation over the entire data set to compute the error function, and then iteratively optimize the model. If this is computationally expensive, why not run through one data point at a time? Stochastic gradient descent consists of training over one data sample at a time. Its learning curve is not as clean as that of the classical gradient descent algorithm, but it does converge very quickly if the hyperparameters are chosen well. On the other hand, stochastic gradient descent can be disastrous if we get stuck with the wrong hyperparameters.
A major advantage appears when we do not have all the data available right away - for example, data streaming into the system over the internet.
Mini Batch Gradient Descent
Stochastic gradient descent is one extreme, just as batch gradient descent is the other - and extremes have their limitations. Mini batch gradient descent tries to merge the benefits of both by going through the process in small batches.
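The three flavours differ only in how much data goes into each update. Here is a rough Python sketch where the batch size selects between them; the gradient assumes a simple linear model and the data is made up, purely for illustration:

import numpy as np

def gradient(w, X_batch, y_batch):
    # gradient of the mean square error for a linear model y = X.w
    predictions = X_batch.dot(w)
    return 2 * X_batch.T.dot(predictions - y_batch) / len(y_batch)

def train(X, y, batch_size, alpha=0.1, epochs=200):
    w = np.zeros(X.shape[1])
    n = len(y)
    for epoch in range(epochs):
        order = np.random.permutation(n)          # shuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w -= alpha * gradient(w, X[batch], y[batch])
    return w

# batch_size = n     -> classical (batch) gradient descent
# batch_size = 1     -> stochastic gradient descent
# anything between   -> mini batch gradient descent
X = np.c_[np.random.rand(100, 2), np.ones(100)]   # two features plus a bias column
y = X.dot(np.array([2.0, -1.0, 0.5]))             # synthetic targets
print(train(X, y, batch_size=16))                 # should approach [2, -1, 0.5]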
Error Function
This is an important part of the story. The efficiency of the gradient descent method depends heavily on the function that we use to represent the error for the proposed parameters.
Mean Square Error - used for continuous regression
Cross Entropy - used for classification
The cost function can be seen as a cumulative measure of the error function over the training data.
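As a quick illustration, here is how these two error measures could be computed in Python with NumPy (the sample arrays are made up):

import numpy as np

def mean_square_error(predicted, actual):
    # average of squared differences - used for continuous regression
    return np.mean((predicted - actual) ** 2)

def cross_entropy(predicted, actual):
    # binary cross entropy - used for classification
    # predicted holds probabilities in (0, 1); actual holds 0/1 labels
    eps = 1e-12                                   # avoid log(0)
    p = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(p) + (1 - actual) * np.log(1 - p))

print(mean_square_error(np.array([2.5, 0.0]), np.array([3.0, -0.5])))
print(cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))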
Underfitting
Any amount of effort on minimizing the cost function will not help if the hypothesis is not rich enough. For example, if you try to fit complex data with a simple linear expression, that just cannot work. This scenario is called underfitting.
Overfitting
This is the other extreme of underfitting. It is also possible that the hypothesis is excessively rich. In such a case, the hypothesis perfectly fits the available dataset, but it curves heavily between these points, making it very bad for any data that is not in the input set. Such a hypothesis will seem perfect while training, but will make absurd predictions when tested on another dataset. For example, we can have a high order polynomial fitting a set of points that lie along a straight line. While training, we may feel we have done a great job because the fit has zero error. But for any point outside the training set, the calculations go for a toss.
Unsupervised Learning
Unsupervised learning, as the name suggests, is about learning without supervision. What does it mean to learn without supervision? What can you learn without supervision? As we saw in the regression, the supervision comes from a part of the input data set - that is required to ascertain the correctness of your hypothesis. The input data has the questions as well as correct answers - the machine just needs to map them.
Unsupervised learning is quite different from this. Here, we do not have any answers. No questions either! Then what do we have? We have raw data and we have no idea about its structure. There is nothing to learn from. Unsupervised learning is just making sense of the data in hand. This is not just a theoretical fantasy. Unsupervised learning has great applications in analyzing data that we know nothing about.
When faced with such a situation - a huge chunk of unknown data - the natural tendency is to categorize it into different parts based on the parameters we already know. Then, we can identify the tendencies of each of these clusters so that we can get some meaningful mapping of the data. If the chosen parameters cover the data well enough, the resulting groupings will be meaningful too.
Clustering is one of the most important problems solved using unsupervised learning. It is used to discover the inherent groupings in the input data. For example: grouping companies based on their revenues or grouping countries based on their GDP, etc. Since it is a form of unsupervised learning, there is no feedback or verification while learning. The grouping is purely based on the values associated with the input samples.
In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Such clustering is a common technique for statistical data analysis used in many fields.
Clustering, and unsupervised learning in general, forms an important part of any real life AI development. Most of the data inputs for today's AI development come from the internet and mobile devices. This is raw data fetched out of the chaos called the internet. Not everything is useful for the particular problem. The available data has to be sorted to identify what could be useful. Clustering is one of the primary approaches for doing this.
Clustering Algorithms
There are many different algorithms used for clustering. Principally, there are three main types of algorithms for doing this.
Hierarchical Clustering
In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster. In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters. Agglomerative methods are used much more often than divisive methods. Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.
Linkage
There are different ways we can connect two points in the hierarchy. Each has its own nuances.
Single Linkage: In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters with the smallest single linkage distance.
Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest complete linkage distance.
Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest average linkage distance.
Centroid Method
In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process we combine the two clusters that have the smallest centroid distance.
Ward’s Method
This method does not directly define a measure of distance between two points or clusters. It is an ANOVA based approach. One-way univariate ANOVAs are done for each variable with groups defined by the clusters at that stage of the process. At each stage, two clusters merge that provide the smallest increase in the combined error sum of squares.
Measure of Association
Any clustering is meaningful in the context of a given measure of association or a measure of distance. There are several definitions of the association or distance between two points.
Euclidean Distance: In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm.
d(X1, X2) = (sum((X1(i) - X2(i))^2))^(1/2)
Minkowski Distance: The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance. Essentially, it generalizes the Euclidean distance to an arbitrary power n
d(X1, X2) = (sum(abs(X1(i) - X2(i))^n))^(1/n)
Canberra Metric: The Canberra Metric is intuitively quite different from the Euclidean distance. In a gross sense, one can think of it as a measure of angle in the Euclidean space.
d(X1,X2) = sum(abs(X1(i) - X2(i)) / (abs(X1(i)) + abs(X2(i))))
Czekanowski Coefficient: Similar to the Canberra metric, the Czekanowski coefficient also measures an angular distance, but sums over the values before taking the ratio.
d(X1,X2) = 1 - 2 * sum(abs(X1(i) - X2(i))) / sum(abs(X1(i)) + abs(X2(i)))
Custom Distance: Or we can create our own. Any distance measure should satisfy some properties:
1. Symmetry: Distance between point X1 and X2 should be same as distance between point X2 and X1.
d(X1, X2) = d(X2, X1)
2. Positivity: Distance between two distinct points X1 and X2 should be greater than 0
d(X1, X2) > 0 if X1 != X2
3. Identity: Distance of a point with itself should be 0
d(X, X) = 0
4. Triangle Inequality: The generic triangular inequality should also hold for the new measure of distance
d(X1, X2) + d(X2, X3) >= d(X1, X3)
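Here is a small NumPy sketch of these distance measures, implementing the formulas exactly as written above (the sample points are arbitrary, and the Canberra metric assumes no coordinate is zero in both points at once):

import numpy as np

def euclidean(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

def minkowski(x1, x2, n):
    # n = 1 gives the Manhattan distance, n = 2 the Euclidean distance
    return np.sum(np.abs(x1 - x2) ** n) ** (1.0 / n)

def canberra(x1, x2):
    return np.sum(np.abs(x1 - x2) / (np.abs(x1) + np.abs(x2)))

def czekanowski(x1, x2):
    return 1 - 2 * np.sum(np.abs(x1 - x2)) / np.sum(np.abs(x1) + np.abs(x2))

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(x1, x2), minkowski(x1, x2, 3), canberra(x1, x2), czekanowski(x1, x2))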
Non-Hierarchical
In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first "good" guess at seed points which form the initial centers of the clusters. Then data points are iteratively moved into different clusters until there is no sensible reassignment possible. The initial number of clusters (K) may be specified by the user or by the software algorithm. The most commonly used non-hierarchical method is MacQueen's K-means method.
This has some advantages and some disadvantages compared to the hierarchical algorithms. The main question here is guessing the right value for K. If there are K+1 distinct clusters and we start off with K, we will never have a good result. However, if the cluster chunks are not so distinct, non-hierarchical clustering can help in segregating the data rather quickly.
Model Based
A model based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function and possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function as defined by the mixture model. Examples are Affinity Propagation, MeanShift, DBSCAN.
Linear Regression

Linear regression is perhaps the most basic concept of machine learning. It is the most basic implementation of Regression. It is too simple to be of any real use in modern machine learning applications, but it is the best way to understand the concepts. This is how it works:
Consider a very simple world, where the salary of a person is related to the years of experience and performance. If someone wants to understand this relation, this is what one would do:
Essentially it involves five steps:
Collect information across the industry - about various samples of salary and experience and the performance measured in terms of successful deliveries.
Try to identify the relation in these.
Test this relation with some more data and make required changes
Repeat step 3 till you are satisfied with the results.
Conclude that the relation is established and use it going forward, for making predictions about expected salary.
Now let's check this process in more detail. Assume you have successfully gathered the required data. This is the first step. Linear Regression now helps you with steps 2/3/4. You start with assuming some 'Linear' relation.
Salary = a * experience + b * performance + c
Here, the Salary is the output; Experience and Performance are the inputs; a, b, c are the parameters. Now, the question boils down to identifying the values of a, b and c. We all know that there is some relation between the inputs and outputs. But identifying such a relation by logical deduction alone is almost impossible - it would require far too much analysis. Regression helps you identify this relation - not through logic, but by learning.
Learning begins with a hypothesis. Propose a particular solution to the problem - based on what you already know, or just a random guess. Naturally, a hypothesis based on previous knowledge helps speed things up. But when you know nothing about the topic, a random guess is not bad either. A better word for this educated (or random) guess is hypothesis. So, we start with some hypothesis:
Salary = a0 * experience + b0 * performance + c0
Here, a0, b0 and c0 are arbitrary random numbers that we start with. We choose random numbers instead of 0's because that gives us a better, more relevant initialization. Of course, this is not the correct solution. It is just a hypothesis - that we will verify and refine as we move on. In fact, any real life problem will never find a perfect solution in a numerical equation. We call it a good solution if the error is minimal or, perhaps, just less than a given threshold - good enough for our purpose. Now, we test this hypothesis. There are many ways to test a hypothesis. Brute force is the easiest to understand, so we will start with that. We validate each record in the 'Training Set' to identify the error. Then we find the 'Mean Square Error' - a measure of how good or bad our hypothesis is. The Cost of a hypothesis is a cumulative function of the errors. It could be a simple numeric mean. As we progress to more advanced algorithms, the derivation of the cost gets more and more complex.
Cost = Sum((a0 * e(i) + b0 * p(i) + c0 - s(i))**2) / (number of samples)
This is the most basic formula to calculate the error. As you go along, you will come across more meaningful and complicated ways of calculating the error.
This error tells us how bad our hypothesis is. If we are extremely lucky, it could be 0 right away. But Murphy says that is not the case. So we have to adjust our parameters - a, b, c - to a1, b1, c1. We go through this correction again and get a2, b2, c2 and so on... till we get values of a, b, c such that the error is within the threshold that we need - and the problem is solved! That sounds too simple. But how do we get a1, b1, c1 or a2, b2, c2 ...? The error that we get tells us our hypothesis is not correct. Not just that, it guides us with the direction and amount of the change required in each parameter. This is identified using partial derivatives. Remember high school calculus? Machine Learning is full of calculus and linear algebra. Check out this Calculus Tutorial if you would like a refresher.
The derivative shows us the direction in which we need to move and by how much. The higher the derivative, the further away we are from the ideal, and the larger the steps we need to take. The ideal point is where the derivatives are 0. That is the point where the error is at the minimum possible level. Obviously, the mean square error is extremely high if the values of a, b, c are very low, and also if these values are very high. So the optimum is between the two. Since our hypothesis is a linear equation, the mean square error is a quadratic function - having a single minimum. That simplifies our task of finding the minimum. We calculate the partial derivatives of the error expression with respect to a, b and c - call them da, db and dc. Based on these we pick the next values:
a1 = a0 - alpha * da
b1 = b0 - alpha * db
c1 = c0 - alpha * dc
Here alpha is the learning rate. It is some positive number that we choose based on our experience with machine learning. We will see many such Hyperparameters in the machine learning domain. Choosing the right hyperparameters is very important in machine learning; they differentiate successful and unsuccessful machine learning projects. The partial derivative of f(x, y, z) with respect to x can be approximated as
(f(x + d, y, z) - f(x - d, y, z)) / (2 * d) # for very small values of d
Applying this to our cost function (or differentiating it directly), we get
da = (2 / N) * sum(e(i) * (a0 * e(i) + b0 * p(i) + c0 - s(i)))
Thus,
a1 = a0 - alpha * (2 / N) * sum(e(i) * (a0 * e(i) + b0 * p(i) + c0 - s(i)))
The same kind of update, using the corresponding derivatives db and dc, gives b1 and c1; repeating the process on every iteration gives a2, b2, c2 and so on.
The process of predicting the salary set based on the current set of parameters is called forward propagation. And the process of correcting the parameters based on the error is called backward propagation. One complete cycle of forward and backward propagation is called an epoch. Each epoch should take us closer to the optimal if the hyperparameters are correctly chosen. This process is called Regression. Since we used a linear hypothesis, we call it Linear Regression. As we train the parameters through several epochs, the error level can go below the threshold. But this is only the beginning. We need to do a lot more to get things working. We can have problems like overfitting, underfitting or local optimum.
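To tie it all together, here is a small Python sketch of this entire process for the salary example. The data is synthetic - generated from a made-up relation - just so we can check that the regression recovers it:

import numpy as np

# synthetic training data for the hypothetical salary example:
# salary = 2 * experience + 3 * performance + 5, plus a little noise
np.random.seed(0)
experience = np.random.uniform(1, 20, 200)
performance = np.random.uniform(1, 10, 200)
salary = 2 * experience + 3 * performance + 5 + np.random.normal(0, 0.5, 200)

# start with an arbitrary hypothesis: salary = a * experience + b * performance + c
a, b, c = np.random.rand(3)
alpha = 0.001                       # learning rate

for epoch in range(20000):
    # forward propagation: predict with the current parameters
    predicted = a * experience + b * performance + c
    error = predicted - salary
    cost = np.mean(error ** 2)      # mean square error

    # backward propagation: partial derivatives of the cost
    da = 2 * np.mean(error * experience)
    db = 2 * np.mean(error * performance)
    dc = 2 * np.mean(error)

    # update the parameters against the gradient
    a -= alpha * da
    b -= alpha * db
    c -= alpha * dc

print(a, b, c, cost)    # a, b, c approach 2, 3 and 5 as the cost shrinks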
Obviously, very few real life applications are simple enough to fit in a straight line. Most of them need a lot more than this. For these, we have Polynomial Regression, Logistic Regression, Neural Networks and many other types of input-output relations that can be used as a model for the hypothesis. But the core concept of regression remains the same. Of course, the calculations get more and more complex along with the hypothesis.
Polynomial Regression

Linear Regression is good for understanding the concept. But one can easily guess that it is too simple for anything meaningful. Polynomial regression was the researchers' next step; as computational power developed, people started trying it out.
As the name suggests, Polynomial Regression is based on a polynomial hypothesis function rather than a linear hypothesis. The process of regression thus involves identification of multiple weights rather than just two.
Local Minimum
The process of regression requires identifying the minimum value of the cost function. This minimum is defined by the derivative of the function. When the derivative is 0 or near 0, we consider that point to be the minimum. But, we have one major problem in polynomial regression. As the polynomial order increases, the cost function gets more and more complicated.
As the complexity and order of the cost function increase, we can have multiple points where the derivative is 0. Only one of these is the real minimum. The others are called local minima - because the cost function there is less than the values around it, but not less than all the values. If the gradient descent lands on one of these local minima, we get into trouble!
There are different ways to implement the cost function, the error values and the gradient descent functionality in order to reduce the possibility of getting trapped in a local minimum. But we should always be careful about this problem.
Overfitting
This has been a major problem in supervised learning. If the model underfits the data, you can identify it rather quickly. But if it overfits the data, the situation can be really tricky.
In polynomial regression, it can be observed that the possibility of overfitting is much higher when any particular weight is too high or if the weights of the higher order coefficients are too high.
Regularization is one of the common approaches to avoid overfitting - by preventing any particular weight from growing too high. There are two main types of Regularization based on how the weights are penalized:
L1 Regularization
Here, the weights are penalized based on their absolute values (L1 because it uses the values and not their squares as in L2). L1 Regularization tends to produce a sparse model by discarding a lot of weights that are not so important. This may not result in a model as good as L2, but the computational performance is a lot better.
L2 Regularization
Here, the weights are penalized based on the sum of squares rather than the absolute values (L2 because it uses the squares rather than the values). L2 Regularization tends to produce a dense model by lowering the weights. Hence the model is a lot better, but the computational performance is not so good.
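Here is a rough sketch of how a regularization penalty is added to the cost; the data, the weights and the lambda value are arbitrary, only to show the shape of the computation:

import numpy as np

def cost_with_regularization(weights, X, y, lam, kind="l2"):
    # mean square error plus a penalty on the weights
    error = X.dot(weights) - y
    mse = np.mean(error ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(weights))    # L1: sum of absolute values
    else:
        penalty = lam * np.sum(weights ** 2)       # L2: sum of squares
    return mse + penalty

# large weights get penalized, nudging the regression towards simpler
# hypotheses and reducing the risk of overfitting
X = np.random.rand(50, 4)
y = X.dot(np.array([1.0, 0.0, 0.0, 2.0]))
w = np.array([1.0, 5.0, -5.0, 2.0])
print(cost_with_regularization(w, X, y, lam=0.1, kind="l1"))
print(cost_with_regularization(w, X, y, lam=0.1, kind="l2"))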
Logistic Regression

Logistic Regression is a very important concept in Supervised Machine Learning - because it lets you apply the powerful techniques of Regression based learning to Classification problems. Typically, Regression algorithms deal with data where the input as well as the output are continuous. Logistic Regression extends the same algorithms to binary classification.
This is done by mapping the continuous output of the regression (the matrix product) onto a binary output, using an activation function. An activation function is a simple function that compares the input value with some threshold and generates a near binary, differentiable output. Not just linear regression - an activation function can be used to map any regression model over to a logistic model.
There are several different types of activation functions - Sigmoid, Tanh and Relu are the most popular ones. Essentially, these are continuous functions that generate a "near binary" output. That is, the output is nearly flat for input values much less than 0 and, at a different level, for values much more than 0. And there is a strong gradient around 0. This approximates very well to a binary classifier - although mathematically, the function is continuous. Thus, we can use the powerful techniques of Regression.
Activation Function
The activation function forms a major component of logistic regression. There are different types of activation functions for different kinds of needs. Sigmoid, Tanh, Relu are some.
Sigmoid
The sigmoid function is very close to 0 for large negative numbers, and very close to 1 for large positive numbers. The gradient is steep around 0. That makes sigmoid a very good activation function.
sigmoid(x) = 1 / (1 + e**(-x))
The value of e**(-x) is very high for negative values of x and very low for positive values of x. Hence 1 / (1 + e**(-x)) is almost 0 for negative numbers and almost 1 for positive numbers. The gradient is very high around 0. At 0, its value is 1/2.
Thus, the sigmoid function can be used for classification - making it a good candidate for activation.
Tanh
In principle, Tanh is quite similar to the Sigmoid function. But its value ranges between -1 and 1.
tanh(x) = (e**x - e**(-x)) / (e**x + e**(-x))
For negative values of x, e**x is close to 0, so the value is close to -e**(-x) / e**(-x). That is -1. And for positive values of x, e**(-x) is close to 0. So the value is close to e**(x) / e**(x). That is 1. The gradient is steep around 0.
Relu
This is a bit different from Sigmoid and Tanh. Arithmetically it is a lot simpler than them.
relu(x) = max(x, 0) # value is x if x > 0 and 0 if x <= 0
Its application is not so simple for the final classification, but it is very good for the intermediate phases. We will check that when we look at the implementations.
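For reference, here is a small NumPy sketch of the three activation functions discussed above:

import numpy as np

def sigmoid(x):
    # close to 0 for large negative x, close to 1 for large positive x
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # ranges between -1 and 1, steep around 0
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # x for positive inputs, 0 otherwise
    return np.maximum(x, 0)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))    # roughly [0.007 0.269 0.5 0.731 0.993]
print(tanh(x))       # roughly [-0.9999 -0.762 0. 0.762 0.9999]
print(relu(x))       # [0. 0. 0. 1. 5.]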
Decision Tree

Decision Tree is an interesting concept that mimics a very common way our mind approaches a classification problem.
Suppose our training data set has n features. We can take up one feature at a time and classify the data elements based on that feature. The two nodes thus obtained can be classified further based on the remaining features. Thus, we get a tree structure where each node represents a given data feature. As we perform this classification, we expect to reach a state where each node has true on one side and false on the other - thus completing our model training.
Consider, for example, our old breast cancer data set. We can classify the data on one feature at a time. Say we start with the size. We can define a threshold; anything less than the threshold goes to the lower node and anything higher than the threshold goes to the upper node. Then we look at the weight. And so on. Over time, we have a tree of nodes - each classifying the data based on a threshold of a given feature. If our thresholds are wisely chosen, then we can expect that each leaf node of a well trained tree will be able to correctly classify a sample as positive or negative.
Of course, we have two hyperparameters going into our assumption - the choice of order of features and the "wisely chosen" thresholds. There are several algorithms for identifying these. But, once they are chosen, the classification is not a difficult task.
The essential concept of a decision tree is based on splitting the tree on the appropriate features at appropriate thresholds. But identifying these features and thresholds is very important. And overfitting, of course, is the single dominant evil in any machine learning scenario. Researchers have identified several ways around these problems. Below is an introduction to the most important ones.
Linear v/s Tree models
When we have different algorithms for the same task, a natural question in any mind is - how do they compare? Which one is better? Which one should I use?
Well, the fact is that both are good at their own set of problems. There are some problems that fit much better in a linear model and there are some others that fit much better in a tree model. Intuitively, we can say that if the correlation between the input features and the output is simple and linear (in the sense that one increases/decreases uniformly with the other), then a Linear model would work much better. But if the correlation is pretty complex and not linear, then a Tree model has a better chance of working out.
Also, compared to Linear models, a Tree model is a lot easier to grasp intuitively. So, if you need humans to understand the model, then a Tree model is far better.
Optimize the Split
The first step to preparing for the Decision Tree is to identify the right set of features and split. Below are some of the important techniques to help you with that.
Gini Index
This works with a categorical target variable and only binary splits. The Gini Index is based on the concept of Purity of Population. A population is pure if the probability of two random samples belonging to the same class is 1. You should identify the Gini index for the different possible splits. The one with the highest Gini Index should be the starting point.
Chi-Square
This is another way of identifying the more significant component. Chi-Square is computed from sum of squares of standardized differences between observed and expected frequencies of target variable. Specifically, the Chi-Square value is calculated as
Chi-Square = ((Actual Positive Count - Expected Positive Count)^2 / (Expected Positive Count)) ^ (1/2)
A low value of Chi-Square suggests that the Actual and Expected counts are similar. That means the component hardly impacts the decision - the output is close to what pure chance would produce. A high difference between the Actual and Expected counts means that the component has enough influence to drive the decision significantly away from its statistical value. This more significant component should be used to start the decision.
Information Gain
Entropy has always been the known distraction in any walk of life. Most processes are targeted towards minimizing the entropy. In terms of information theory, the higher the entropy, the lower the information. Most of our computational effort is an attempt to gain information - so we try to minimize the entropy. The entropy of any split is computed as
entropy = - p * log2(p) - q * log2(q)
Where p and q are the probabilities of success and failure on that node. If both are 0.5 - meaning the node is redundant - the entropy is 1. On the other hand, if one of them is close to 1 and the other close to 0 - meaning we have almost no false negatives or false positives - the entropy is very low and this is 'the' decision node. Surely, it should be the first choice.
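Here is a small sketch of the Gini purity and the entropy for a binary node, following the definitions above (p is the probability of success at that node):

import numpy as np

def gini(p):
    # purity of a node: probability that two random samples share a class
    return p ** 2 + (1 - p) ** 2

def entropy(p):
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0            # a perfectly pure node carries no uncertainty
    return -p * np.log2(p) - q * np.log2(q)

for p in (0.5, 0.9, 0.99):
    print(p, gini(p), entropy(p))
# p = 0.5 gives entropy 1 (a redundant node); p near 1 gives entropy near 0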
Reduce Tree Size
Next, we look at the techniques for avoiding overfitting. When training a decision tree, it is possible that the tree grows huge - to the extent that there is a node for each data element, or perhaps even more. In such a case, we have 100% accuracy for the training data and, of course, it fails miserably outside that set. Thus, the fundamental reason for overfitting is the tree growing too big. The techniques below avoid this situation in order to avoid overfitting. Logically, there are two ways to do this - don't let the tree grow too big, or cut it down after it has grown. That is what the techniques below do.
Constrain the Tree Size
The tree can be constrained by defining individual constraints on its growth. This is done by defining limits on the following:
Minimum samples for a node split
Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
Too high values can lead to under-fitting; hence, it should be tuned using CV.
Minimum samples for a terminal node (leaf)
Defines the minimum samples (or observations) required in a terminal node or leaf.
Used to control over-fitting similar to min_samples_split.
Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.
Maximum depth of tree (vertical depth)
The maximum depth of a tree.
Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
Should be tuned using CV.
Maximum number of terminal nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
Maximum features to consider for split
The number of features to consider while searching for a best split. These will be randomly selected.
As a thumb rule, the square root of the total number of features works great, but we should check up to 30-40% of the total number of features.
Higher values can lead to over-fitting but depends on case to case.
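As an illustration, here is how these constraints map onto a scikit-learn style decision tree. The dataset is synthetic and the specific limit values are arbitrary; only the parameter names matter here:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# a made-up dataset, just to show where the constraints plug in
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    min_samples_split=20,    # minimum samples for a node split
    min_samples_leaf=10,     # minimum samples for a terminal node (leaf)
    max_depth=5,             # maximum depth of the tree
    max_leaf_nodes=25,       # maximum number of terminal nodes
    max_features="sqrt",     # maximum features to consider for a split
    random_state=42,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))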
Tree Pruning
Constraining the tree works greedily - while training. That means the tree is constrained without checking the full picture. This may be quicker, but mathematically it is not the best way. Pruning, on the other hand, lets the tree grow really huge and then tries to reduce it by pruning low performing leaves - bottom up. Thus, the process is a lot more consistent and results in a much better model.
Bias & Variance
Imperfection manifests in any model as Bias and Variance. These are two metrics that show how good or bad your model is. In simple terms, Bias tells us if the predictions as a whole are offset from what they are meant to be. Variance, on the other hand, tells us how much the individual predictions scatter around. It is possible that the average over the entire set matches what it is supposed to be.
But, that is not enough if individual points do not match. This is defined by the variance. On the other hand, it is possible that none of the points match, but all of them are shifted by a constant value. This would be denoted by high bias and low variance. A small tree has high Bias and low Variance. On the other hand, a big tree tends to produce low Bias but high Variance. Naturally, we would like to find the minimum value in this U shaped error curve.
Bagging is one of the techniques used to reduce the variance of our predictions. Essentially it combines multiple classifier models trained on different sub-samples of the given data set. The training data set is sampled multiple times to get multiple subsets. Once this is available, each is used to train a model independently. The models are then merged to get the final model. Thus, the steps for bagging are:
Create Multiple DataSets:
Sampling is done with replacement on the original data and new datasets are formed.
The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model
Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting
Build Multiple Classifiers:
Classifiers are built on each data set.
Generally the same classifier is modeled on each data set and predictions are made.
Combine Classifiers:
The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
The combined values are generally more robust than a single model.
There are many different implementations of this concept of Bagging. Random Forest is one of them.
Random Forest

A forest is essentially a collection of trees. In machine learning too, the "Random Forest" is an ensemble of "Decision Trees". The Random forest algorithm fixes a lot of issues we noticed in the Decision Tree.
Essentially, Random Forest comes up with several Decision Tree CART models using different initial variables and data sets. This eliminates the effort of identifying initial variables and sorting the data sets. Each of these Decision Trees "votes" for an output - and the final result is a function of these votes. It could be just the mean; or we can use a weighted mean based on the initial variables, or any particular mapping that could evolve based on the algorithms used.
Tuning the Forest
The efficiency and performance of the Random Forest algorithm depends upon the right choice of some important parameters.
Max Features
The maximum number of features allowed for a given Decision Tree. At first sight, it may seem crazy that we limit the number of features used in a tree. One may feel that allowing each tree to access every feature will make it more robust and complete. But if we peep into the forest, we see another picture. We want to strengthen the forest as a whole, not the individual trees. If each tree uses all the features, we lose the diversity of the solution. On the other hand, having very few features is not good either. Having just one feature per tree is like having a single Decision Tree with each individual tree as a node. We want something between the two - so that each tree can meaningfully contribute to the final decision.
Number of Estimators
This is another parameter that sounds crazy at first sight. We do not wait for all the trees to be built before we start to consider the voting and averages. We can start off when a predefined number of trees is trained, and that should be enough to move further. The higher this value, the better the predictions of the forest - but also the heavier the computation. Typically this value should be as high as the processor can allow. But we should be ready to settle for less than 100%.
Minimum Sample Leaf Size
This parameter is inherited from the Decision Trees. The lower the value of the sample leaf size, the more noise is captured in training - which can lead to overfitting. A very high value will lead to underfitting. We have different thumb rules for this, but each comes with a disclaimer that one should try out multiple options to find the best.
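Putting these together, a scikit-learn style random forest might be configured like this; the dataset and the specific values are made up, purely for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of estimators (trees) in the forest
    max_features="sqrt",     # max features offered to each tree
    min_samples_leaf=5,      # minimum sample leaf size
    n_jobs=-1,               # train the trees in parallel
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))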
Clustering Algorithms

There are many different algorithms used for clustering. The three main families - hierarchical, non-hierarchical and model based - were described in the section on unsupervised learning above. The sections below look at some important clustering algorithms in more detail.
Affinity Propagation

This is one of the most important among the different clustering algorithms. It is also more elaborate than the others. Affinity Propagation takes the similarity between pairs of data points as input. It considers all data points as potential exemplars. It works by exchanging real valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. This is not as complicated as it sounds. Let's look at it in detail.
The algorithm requires us to provide two sets of data:
Similarities between data points, representing how well-suited a point is to be another one’s exemplar. If there’s no similarity between two points, as in they cannot belong to the same cluster, this similarity can be omitted or set to -Infinity depending on implementation.
Preferences, representing each data point’s suitability to be an exemplar. We may have some initial information which points could be favored for this role, and so we can represent it through preferences.
All this information is normally represented through a single matrix. The main diagonal of this matrix represents the preferences. Another approach is to keep the similarities between points in memory (the latter works better when the data is sparse). The 'exchanging of messages between points' is nothing more than matrix manipulation. The algorithm then runs through a number of iterations, until it converges.
Thus, each iteration has two steps:
Calculating responsibilities
Calculating availabilities.
Responsibility and Availability are the two basic components of Affinity Propagation. To understand them, consider the scenario of a group of employees working with an employer. For the team to work, you have to ensure that the employees find the employer better than other employers around. Additionally, you have to ensure that the employees get along well. The first is measured in terms of Responsibility while the second is measured in terms of Availability.
In mathematical terms, the Responsibility r(i, k) reflects the accumulated evidence of k's potential for being an exemplar of i - considering the various different possible exemplars for i. On the other hand, Availability - a(i, k) is the accumulated evidence for k's potential of being an exemplar of i - considering the various different points that consider k as an exemplar. Essentially, the "Responsibility" measures k's responsibility to take up i under its hood. On the other hand, "Availability" measures k's availability for the new point i. While working on the classification, the Responsibility message is passed from i to k and Availability message is passed from k to i. That is, i tells k if it thinks k is responsible for i and k replies if it feels that it is available for i.
In order to calculate responsibilities, the algorithm uses original similarities and availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point i and point k as its exemplar, minus the largest of the similarity and availability sum between point i and other candidate exemplars. The logic behind calculating how suitable a point is for an exemplar is that it is favored more if the initial apriori preference was higher, but the responsibility gets lower when there is a similar point that considers itself a good candidate, so there is a ‘competition’ between the two until one is decided in some iteration.
Calculating availabilities, then, uses calculated responsibilities as evidence whether each candidate would make a good exemplar. Availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points.
Finally, we can have different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or the maximum number of iterations is reached. At any point through Affinity Propagation procedure, summing Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with maximum r(i, k) + a(i, k) represents point i’s exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal. If r(i, i) + a(i, i) > 0, point i is an exemplar.
Hyper Parameters
There are two major hyper parameters that are required to make the Affinity Propagation effective.
Damping
Each new iteration of the Responsibility and Availability leads to recalculation of the clusters. Because of this, it is possible that the system gets into an oscillation without any convergence. In order to avoid this problem, the algorithm introduces Damping. This forces inertia on the changes by allocating some weight to the current position of the data point.
Affinity
The measure of affinity is perhaps the most important component in this exercise of clustering. Generally the Euclidean affinity is used. But it is possible to push in any other measure of affinity - so long as it is consistent.
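For a rough idea of how this looks in practice, here is a scikit-learn based sketch on made-up two-dimensional data; damping and affinity are the two hyperparameters discussed above:

import numpy as np
from sklearn.cluster import AffinityPropagation

# hypothetical 2-D data: three loose blobs
rng = np.random.RandomState(7)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

model = AffinityPropagation(
    damping=0.9,             # inertia on the responsibility/availability updates
    affinity="euclidean",    # the default; a precomputed similarity matrix also works
    random_state=7,
)
model.fit(X)
print(len(model.cluster_centers_indices_))    # number of exemplars found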
Mean Shift Clustering

Mean shift clustering uses a sliding window to find dense areas in the data points. It is a centroid-based algorithm. The goal of the algorithm is to locate the centre points of each group/class, which it does by updating candidates for centre points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centre points and their corresponding groups. The assumption is that the population is dense at the centre of each cluster.
The mean-shift algorithm starts by selecting a random point in the data space. Then it checks the data within a fixed radius around that point, to identify the direction in which the density increases most. The point keeps shifting through the data space, always moving in the direction where the density increases most. This continues till it reaches a point where the density decreases on every side. That is the centre of the group. This process is done with many sliding windows until all points lie within a window. When multiple windows overlap, they merge and the one with the maximum density is preserved. The data points are then clustered according to the sliding window in which they reside.
This is different from the K-means algorithm because we do not select the number of clusters. The mean shift algorithm identifies this for us. That adds a lot of value. Of course, it leaves room for configuration (and hence room for doubt) because we have to select the radius of the sliding window. The radius in turn impacts the count of clusters identified.
The limitation of the algorithm is that it identifies a cluster based on the density of points, and if an entire cluster is not dense enough, it is pushed into another dense cluster. Thus, it may miss some clusters. But the algorithm offers very good computational performance.
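A small scikit-learn based sketch (on made-up data) showing how the sliding window radius - the bandwidth - drives the clustering:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.RandomState(3)
X = np.vstack([rng.normal(loc, 0.6, size=(50, 2)) for loc in ((0, 0), (4, 4))])

bandwidth = estimate_bandwidth(X, quantile=0.2)    # radius of the sliding window
model = MeanShift(bandwidth=bandwidth)
model.fit(X)
print(model.cluster_centers_)                      # centres found by the algorithm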
K-Means Clustering

K-Means is an interesting way of identifying clusters in the given data. Conceptually, we can think of the process as follows:
If we want to identify K clusters in the data set; we start with picking K separate points in the data space. Assuming these K points are the centroids, identify the K clusters. Any point in the data space belongs to the cluster defined by the closest centroid. Now, based on these clusters identify the new centroids. And then identify the clusters again. If we do this recursively, we should ideally end up with a situation where the K centroids do not move much on each iteration. The clusters are thus identified using the K-Means algorithm.
Of course, there are several important factors here. How do you choose K? If K is very large, we might end up with a cluster for every data point. That defeats the purpose of clustering. On the other hand, if we use K = 1, we have one big cluster for the entire data set. This again defeats the purpose. Ideally, we should choose K such that the clusters identified are indeed different from each other - that is, the distance of all points from their respective centroid is significantly less than the distance between the centroids.
More formally, the sum of squared distances between the centroid and the data points within each cluster is used as a metric for optimizing the model. As the value of K increases, this sum of squares decreases at every step. But the decrease is very fast for the initial values of K and then it slows down. The point where it slows down - the 'elbow' - suggests the right value of K.
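A short sketch of this elbow idea, using scikit-learn's KMeans on made-up data with three obvious blobs:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(loc, 0.7, size=(60, 2)) for loc in ((0, 0), (6, 0), (3, 5))])

for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_ is the sum of squared distances of the points to their centroid
    print(k, round(model.inertia_, 1))
# the drop is steep up to K = 3 (the true number of blobs) and slow after that -
# that is the elbow which suggests the right K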
In real life, it is very difficult to know such details before we start. But we still have use cases for K-Means - in cases where we want to break the data into a predefined number of clusters. Else, if we have enough computational power, we can try various numbers of clusters - iteratively trying to reduce the mismatch. That is certainly not the best way out.
We have other more effective algorithms. But they are not as simple.
Nearest Neighbor Classification

Classification is an important subject in Supervised Learning. A major part of machine learning applications deal with binary outputs which require classification rather than regression. KNN Classifier is one of the many classification algorithms.
Like most Machine Learning algorithms, the KNN algorithm is also inspired by a tendency of the human mind - to go along with the crowd. Conceptually, KNN just looks at the known points around the query point and predicts that its outcome is similar to the points around it. More precisely, for any new point, it checks for the K points that are closest in terms of the defined distance metric.
Once these are identified, the outcome of each of those points is looked up in the training set. The outcome of the new point is then defined by the highest bidder in the neighborhood. For example, if we look at the 5 nearest neighbors of a given test point, and 3 of those points say positive while two say negative, the outcome is predicted as positive - since that is the highest bidder.
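Here is a minimal Python sketch of this voting idea; the training points and the distance metric (plain Euclidean) are chosen just for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # distance from the new point to every known point (Euclidean metric)
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    votes = Counter(y_train[nearest])         # outcomes of those neighbours
    return votes.most_common(1)[0][0]         # the "highest bidder" wins

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["positive", "positive", "positive",
                    "negative", "negative", "negative"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5))    # positive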
The KNN Classifier is a good tool for classification when the size of the data and the number of features are within control - else the computation can get very expensive. The classifier's accuracy rests on the assumption that similar points are geometrically close to each other - which may not always be the case. The distance metric is very important. What is near according to one metric may be far away by another.
Consider, for example, data sets in the form of two concentric circles - the inner circle being positive and the outer circle negative. In such a case, inventing a new distance metric may help to an extent. But the cost of computation increases very rapidly with the complexity of the distance metric. The cost also increases rapidly with the number of features.
But it is a very elegant and intuitive way of classifying when the data is good.
Support Vector Machine

Support Vector Machine (SVM) is a classification technique. It tries to geometrically divide the available data. If the input data has N features, the data is plotted as points in an N dimensional space. Then, it identifies an N-1 dimensional structure that can separate the groups. A good separation is one which has the maximum distance from both groups. The distance from a group can be defined in various ways - the distance from the closest points, the distance from the center, the mean of all distances, etc. That would depend upon the kind of data.
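A minimal scikit-learn based sketch of this idea, on made-up data with two well separated groups and a linear separating plane:

from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# two well separated groups of points in a 2-D feature space
X, y = make_blobs(n_samples=200, centers=2, random_state=5)

model = SVC(kernel="linear")    # look for a separating hyperplane
model.fit(X, y)
print(model.coef_, model.intercept_)    # the hyperplane found by the SVM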
But things may not always be so simple. If the classes are not well separated in the feature space, no plane will divide them cleanly, and the scores will remain poor in spite of any tuning. In such a case, we have to rework the features so that the data gets segregated properly. We may have to consider a new distance measure - a kernel function - that separates the positive and negative clusters well enough. Or, if nothing works, we might have to look for another algorithm.
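The concentric-circles data mentioned earlier can be sketched with scikit-learn: a linear SVM fails to separate the two circles, while an RBF kernel - effectively a different notion of distance - handles them well. The parameters here are plain defaults, not tuned values.

```python
# Linear SVM vs RBF-kernel SVM on two concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # close to 0.5
print("rbf kernel accuracy:", rbf_svm.score(X, y))        # close to 1.0
```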
Hence it is important to have a good idea of the data at hand. That helps us choose the right algorithm and saves time and computational resources.
In effect, this complements the Nearest Neighbor algorithm. Problems that are easier with Nearest Neighbor are difficult with the SVM and vice-versa.
Neural Networks

The concept of neural networks is not new. In fact, it dates back to 1943, when the first neural network model was published by McCulloch and Pitts. The concept was refined further by the development of the Perceptron in 1957. But it remained a subject of academic study for decades - mainly because of the extensive computational power required to implement it, and the lack of it in those days.
Now that we have immense computational power available, neural networks have become a hot topic of study and implementation. But the core concept of the Perceptron - the fundamental unit of the neural network - has not changed.
From a mathematical point of view, there is a limit to what a linear function can do, and the results of polynomial models were often not good enough to justify their computational expense. Hence the concept of neurons picked up momentum. A lot of machine learning is inspired by how the human mind works. The concept of neural networks goes one step further - it takes inspiration from the way neurons are laid out in the human brain.

As the image above shows, a neuron gets multiple inputs and, based on these, gives out one output that feeds into another set of neurons, and so on. The nervous system is built of many such neurons connected to each other. Each neuron contributes to the decision process by appropriately forwarding the input signal - based on the training it has gathered. Such a model has the potential to capture much of what a human brain does, even though each individual neuron has only minimal functionality.
Artificial neurons are implemented as a linear function with a non-linear topping - called the activation function. Each neuron is thus defined by a weight for each input and a bias. The result of this linear operation is fed into the activation function, and the final output becomes an input for the next set of neurons. Such an artificial neuron is called a Perceptron.
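A minimal sketch of a single perceptron as described above - a weighted sum plus a bias, passed through a non-linear activation (a sigmoid is assumed here); the input and weight values are arbitrary examples.

```python
# A single perceptron: linear combination of inputs, then a non-linear
# activation function on top.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(inputs, weights, bias):
    # Linear part: weighted sum of inputs plus bias
    z = np.dot(weights, inputs) + bias
    # Non-linear "topping": the activation function
    return sigmoid(z)

inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2
print(perceptron(inputs, weights, bias))
```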

Often the network has multiple layers of such Perceptrons. That is called an MLP (Multi Layer Perceptron). In an MLP, we have an input layer, an output layer and zero or more hidden layers.
Network of Perceptrons
Each Perceptron has an array of inputs and an array of weights that are multiplied with the inputs to generate a scalar. This processing is linear - by itself it cannot fit a non-linear curve, irrespective of the depth of the network. If the network has to fit non-linear curves, we need a non-linear element in each perceptron. Hence, perceptrons are topped with a non-linear activation function. This could be a sigmoid, tanh, ReLU ... researchers have offered several activation functions, each with specific advantages.
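To see why the non-linear activation matters, here is a small sketch using scikit-learn's MLPClassifier on the XOR pattern, which no purely linear model can separate; the hidden layer size, solver and other settings are illustrative choices, not recommendations.

```python
# A tiny MLP with one hidden layer and a non-linear activation can fit
# the XOR pattern, which is not linearly separable.
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: output is 1 only when exactly one of the two inputs is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))  # expected [0 1 1 0]
```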
With everything in place, a neural network looks like this:

The layout, width and depth of the network is one of the most interesting topics of research. Experts have developed different kinds of networks for different kinds of problems. The deeper and larger the network, the more its capacity. The human brain has around 100 billion neurons; artificial neural networks are nowhere near that - researchers quote experiments with millions of neurons. The concept of large neural networks, or Deep Learning, is not new either, but for long it was limited to mathematical curiosity and research papers. The recent boom in the availability of massive training data and computing power has made it a big success.
Building, training and tuning Neural Networks is a massive domain and deserves many blogs dedicated to each topic.
Activation Functions
The perceptron, better known today as the artificial neuron, is a processing unit composed of two components. The first component is the adder that computes a weighted sum of the input values. This is followed by the activation function that adds a component of non-linearity to the output.

The activation function we use is a major component of the neuron (perceptron) and hence of the neural network as a whole. The rest of the neuron is just the weights that we train. But the efficiency and trainability of the neuron is primarily defined by the activation function it uses.
We have different types of activation functions that can be used. Each has its own ups and downs in terms of complexity, processing power and trainability. We can choose the right one based on the problem at hand.
Threshold Activation Function
This is the simplest and the most primitive of all. It was used by the McCulloch Pitts neuron. Here, the output value is 1 if the input is greater than 0; else it is 0. The function is discontinuous at x = 0 and its gradient is zero everywhere else, so it cannot be used with gradient descent.
Sigmoid Activation Function
This is a frequently used activation function. It defines the output by the function g(x) = 1 / (1 + e^(-x)). The value of e^(-x) varies from 0 (as x approaches infinity) to infinity (as x approaches negative infinity).
So g(x) is a continuous and differentiable function whose value lies between 0 and 1. It is interesting to note that g(x) is almost 1 for large positive numbers and almost 0 for large negative numbers; most of the change occurs in a narrow range around 0. The importance of this property will be more evident when we look at feature normalization. This saturation at the extremes also causes the problem of vanishing gradients - reducing the trainability.
TanH Activation Function
The tanh - or hyperbolic tangent - activation function is based on similar concepts as the sigmoid. Tanh is defined as g(x) = (1 - e^(-2x)) / (1 + e^(-2x)). The output of this function varies between -1 and 1. And just like the sigmoid, tanh also has the problem of vanishing gradients.
ReLU Activation Function
Also known as the rectified linear unit, it is defined as g(x) = max(0, x) - that is, x if x > 0, else 0.
This may seem too trivial to be useful. But it is one of the most popular activation functions today. It adds non-linearity as well as sparsity to the model - because at any time, only a few of the nodes actually contribute to the output; the rest are zeroed out. It also has a problem: some nodes can get stuck inactive and always zeroed out - but there are ways to handle that.
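For reference, here is a sketch of the activation functions discussed so far, written out with numpy; the sample inputs are arbitrary and only meant to show the characteristic output ranges.

```python
# The activation functions discussed above, written as plain numpy functions.
import numpy as np

def threshold(x):
    return np.where(x > 0, 1.0, 0.0)   # step: 0 or 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # zero for negatives, x otherwise

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for name, fn in [("threshold", threshold), ("sigmoid", sigmoid),
                 ("tanh", tanh), ("relu", relu)]:
    print(name, np.round(fn(x), 3))
```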
Softmax Activation Function
This is quite different from the other activation functions we have seen so far. All the above functions depend on the inputs to the given neuron. But the softmax activation also accounts for all the neurons in the layer. It is a normalized exponential function.
Mathematically, the softmax activation function can be defined as y_i = e^(x_i) / Σ_j e^(x_j)
The normalization ensures that the gradient never blows up and the training does not stumble on exploding values. On the other hand, it can also defeat the purpose of having several nodes in a layer, so it is rarely used in the inner layers. But it works out to be the best choice for the final output layer.
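A small sketch of the softmax computation with numpy; subtracting the maximum before exponentiating is a common numerical-stability trick, and the input scores are arbitrary example values.

```python
# Softmax: exponentiate the scores of all neurons in the layer and
# normalize them so they sum to 1.
import numpy as np

def softmax(x):
    shifted = x - np.max(x)            # numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)         # normalize so the outputs sum to 1

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```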
The most common architectures have a Softmax for the final output layer and ReLU for the inner layers.
Deep Learning
Neural networks offered a major breakthrough in building non-linear models. But that alone was not enough. No amount of training and data helps if the model itself is not rich enough. After all, the amount of information a model can hold is limited by the number of weights in it. A simple network with a hundred weights cannot model complicated tasks like face recognition. We need a lot more.
It may seem simple - just increase the number of perceptrons in the network and the count of weights goes up. What is the big deal about it? But it is not that simple. As the network grows larger, other problems start creeping in. In general, the ability of the network does not grow linearly with the number of perceptrons. In fact, it can decrease beyond a point - unless we take care of some important aspects.
The capacity of a network is a lot better when the network is deeper rather than wider - that is, when it has more layers rather than too many perceptrons in the same layer. Such deep neural networks have enabled miraculous innovations in the past few years. Deep Learning is the branch of Machine Learning that deals with these deep neural networks and the techniques needed to train them.