Machine Learning Glossary

(This glossary is a work in progress. I will keep it updated as I cultivate my deep learning garden.)

(See this page for some books about machine learning that I recommend. See this page for deep learning glossary. See this page for some essential resources for Deep Learning. See this post for essential terms and techniques in NLP that I collected and organized.)

If you do not have any experience with machine learning or deep learning, check out this set of cheatsheets on the topics here (a website version is also available for better readability).

Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.

Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).

Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

The acquisition of labeled data for a learning problem often requires a skilled human agent or a physical experiment.

Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.


Boosting is a machine learning ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning. It is also a family of machine learning algorithms that convert weak learners into strong ones.

In order to evaluate our models, we must reserve a portion of the annotated data for the test set. If the test set is too small, then our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if only a limited amount of annotated data is available.

One solution to this problem is to perform multiple evaluations on different test sets, then to combine the scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds. For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable.

A second, and equally important, advantage of using cross-validation is that it allows us to examine how widely the performance varies across different training sets. If we get very similar scores for all N training sets, then we can be fairly confident that the score is accurate. On the other hand, if scores vary widely across the N training sets, then we should probably be skeptical about the accuracy of the evaluation score.

k-fold cross-validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.

For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly split the dataset into two sets d0 and d1 of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.

When k = n (the number of observations), k-fold cross-validation is exactly leave-one-out cross-validation.
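The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `k_fold_indices` and `cross_validate` are made up for this example, and `train_and_score` stands in for whatever training and evaluation routine you actually use.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k shuffled folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, train_and_score, seed=0):
    """Run k-fold cross-validation and return the list of k scores.

    train_and_score(train_idx, test_idx) is a caller-supplied (hypothetical)
    function that trains on train_idx and returns a score on test_idx.
    """
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return scores
```

Averaging the returned scores gives the single combined estimate, and their spread shows how much performance varies across training sets. Calling `k_fold_indices(n, n)` yields one observation per fold, which is the leave-one-out case.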

A confusion matrix provides a summary of all of the predictions made compared to the expected actual values.

The results are presented in a matrix with counts in each cell. The counts of actual class values are summarized horizontally, whereas the counts of predictions for each class value are presented vertically.

A perfect set of predictions is shown as a diagonal line from the top left to the bottom right of the matrix.

The value of a confusion matrix for classification problems is that you can clearly see which predictions were wrong and the type of mistake that was made.

When performing classification tasks with three or more labels, it can be informative to subdivide the errors made by the model based on which types of mistake it made. A confusion matrix helps with this.
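As a small illustration of how such a matrix is built, the sketch below counts (actual, predicted) pairs for a three-class problem. The function name, the toy labels, and the layout (one row per actual class, one column per predicted class) are assumptions of this example; some tools use the transposed layout.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a nested list: one row per actual class, one column per predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Toy data, invented for illustration.
actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, labels)
# Correct predictions sit on the diagonal; off-diagonal cells are errors.
correct = sum(matrix[i][i] for i in range(len(labels)))
```

Each off-diagonal cell names a specific kind of mistake. For instance, the cell in the "cat" row and "dog" column counts cats that were predicted as dogs.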

A good explanation of the confusion matrix can be found here (pdf).

Confusion matrix online calculator

What is a Confusion Matrix in Machine Learning

Simple guide to confusion matrix terminology (pdf)

  •  Scale Machine Learning Data

Many machine learning algorithms expect data to be scaled consistently.

Two popular methods to consider when scaling your data for machine learning are normalization and standardization.

The post above shows how you can rescale your data for machine learning. After reading it you will know:

  • How to normalize your data from scratch.
  • How to standardize your data from scratch.
  • When to normalize as opposed to standardize data.
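For a rough idea of what "from scratch" means here, the following sketch implements both rescaling methods for a single column of numbers. It assumes that normalization means min-max scaling to the range [0, 1] and that standardization uses the sample standard deviation (dividing by n − 1); other write-ups may use the population standard deviation instead.

```python
def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean and divide by the sample standard deviation.

    Assumes at least two values that are not all equal.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]
```

For example, `normalize([50, 30, 70])` maps the smallest value to 0.0 and the largest to 1.0, while `standardize([1, 2, 3])` centers the column on zero.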

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration,[1] via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Feature learning or representation learning is a set of techniques that learn a feature: a transformation of raw data input to a representation that can be effectively exploited in machine learning tasks.

Automatic feature extraction from raw data.

Feature Engineering (Feb 20, 2017 by HJ van Veen)

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons:

  • simplification of models to make them easier to interpret by researchers/users,
  • shorter training times,
  • enhanced generalization by reducing overfitting (formally, reduction of variance)


Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit, maximum entropy (MaxEnt) classifier, and conditional maximum entropy model.

Check whether the Maximum Entropy model used for NER in this paper is the same as the “maximum entropy classifier”:

OpenNLP is a Java-based library for various natural language processing tasks, such as tokenization, part-of-speech (POS) tagging, and named entity recognition. For named entity recognition, it trains a Maximum Entropy model using the information from the whole document to recognize entities in documents.


In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.
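A very small tokenizer can be sketched with a regular expression. This is only an illustration under simple assumptions (words are runs of letters, digits, or underscores, and every punctuation mark is its own token); real tokenizers handle contractions, URLs, and language-specific rules far more carefully.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Hello, world! It's 2024.")
```

Note that this simple rule splits "It's" into three tokens, which shows why practical tokenizers need extra rules for contractions.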

Softmax Classifiers Explained

Regularization is a technique used to reduce overfitting.

Generalization in machine learning refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

A good post:  Overfitting and Underfitting With Machine Learning Algorithms (March 21, 2016 by Jason Brownlee)

The cause of poor performance in machine learning is either overfitting or underfitting the data. Overfitting refers to a model that fits the training data too well but generalizes poorly; that is, it produces large errors on new data.

Overfitting: Good performance on the training data, poor generalization to other data.

A good post:  Overfitting and Underfitting With Machine Learning Algorithms (March 21, 2016 by Jason Brownlee)


  • Underfitting

The cause of poor performance in machine learning is either overfitting or underfitting the data. Underfitting refers to a model that can neither model the training data nor generalize to new data.

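An underfit machine learning model is not a suitable model, and this is obvious because it performs poorly on the training data. Underfitting normally indicates that the model is too simple to capture the structure of the data, so the usual remedy is a more flexible model or better features rather than more training examples.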

Underfitting: Poor performance on the training data and poor generalization to other data.

A good post:  Overfitting and Underfitting With Machine Learning Algorithms (March 21, 2016 by Jason Brownlee)

Decision trees are a powerful prediction method and extremely popular.

They are popular because the final model is so easy to understand by practitioners and domain experts alike. The final decision tree can explain exactly why a specific prediction was made, making it very attractive for operational use.

Decision trees also provide the foundation for more advanced ensemble methods such as bagging, random forests and gradient boosting.

Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

Decision trees can suffer from high variance, which makes their results fragile to the specific training data used.

Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees are highly correlated.

Random Forest is an extension of bagging that, in addition to building trees on multiple samples of your training data, also constrains the features that can be used to build the trees, forcing the trees to be different. This, in turn, can give a lift in performance.
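The two sources of randomness mentioned here, sampling the training rows and constraining the features, can be sketched as follows. The helper names are invented for this example and the tree-building step itself is left out. In the usual formulation of random forests the candidate features are redrawn at every split, which is how `candidate_features` is meant to be used.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Bagging step: draw n_rows row indices with replacement, once per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, max_features, rng):
    """Random forest step: pick the distinct features a split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(0)
# One tree would be trained on these rows (duplicates are expected) ...
rows_for_tree = bootstrap_sample(8, rng)
# ... and each split inside it would only look at a fresh subset like this one.
features_for_split = candidate_features(10, 3, rng)
```

Because every tree sees different rows and every split sees different features, the trees become less correlated than in plain bagging.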

Good post: Random Forests (Leo Breiman and Adele Cutler)

In statistics, Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.

Random walk Monte Carlo methods make up a large subclass of MCMC methods.
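One of the simplest random walk methods is the Metropolis algorithm with a Gaussian proposal, sketched below. The function name, the step size, and the choice of a standard normal target are assumptions made for this illustration; the target only needs to be known up to a constant factor.

```python
import math
import random


def random_walk_metropolis(log_density, start, n_steps, step_size=1.0, seed=0):
    """Sample from a 1-D distribution given its log density (up to a constant)."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        # Propose a move by taking a random Gaussian step from the current state.
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        # The current state is recorded whether or not the move was accepted.
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    """Log density of a standard normal, dropping the constant term."""
    return -0.5 * x * x
```

Early samples still depend on the starting point, so in practice an initial stretch of the chain (the burn-in) is often discarded before the samples are used.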


In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once.
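To make the contrast with batch learning concrete, here is a minimal sketch of an online learner: a linear regressor that is updated with one stochastic gradient step per incoming example. The class name, the squared-error loss, and the learning rate are choices made for this example, not a prescribed method.

```python
class OnlineLinearRegressor:
    """Linear model updated one example at a time with stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Take one gradient step on the squared error for the example (x, y)."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        return error
```

Each call to `update` uses only the newest example, so the model can follow a data stream without ever holding the whole training set in memory.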

In computer science, incremental learning is a method of machine learning in which input data is continuously used to extend the existing model’s knowledge (i.e. to further train the model). It represents a dynamic technique of supervised learning and unsupervised learning that can be applied when training data becomes available gradually over time or its size exceeds system memory limits. Algorithms that can facilitate incremental learning are known as incremental machine learning algorithms.

The aim of incremental learning is for the model to adapt to new data without forgetting its existing knowledge; it does not retrain the model.

Transfer learning or inductive transfer is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks. This area of research bears some relation to the long history of psychological literature on transfer of learning, although formal ties between the two fields are limited.

A good tutorial of transfer learning in TensorFlow can be found here (How to Retrain Inception’s Final Layer for New Categories).

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Simple but powerful ensemble techniques include:

  1. Max Voting
  2. Averaging
  3. Weighted Averaging
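The three techniques listed above each fit in a line or two of Python. The function names and the example inputs are invented for this sketch; the inputs are the predictions that several already-trained models made for a single example.

```python
from collections import Counter


def max_vote(predictions):
    """Max voting: return the class label predicted by the most models."""
    # On a tie, Counter returns the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Averaging: return the plain mean of the models' numeric predictions."""
    return sum(predictions) / len(predictions)


def weighted_average(predictions, weights):
    """Weighted averaging: like averaging, but models with larger weights count more."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)
```

For example, `max_vote(["spam", "ham", "spam"])` picks "spam", while `weighted_average([1.0, 3.0], [3, 1])` leans toward the first model's prediction because its weight is three times larger.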



References and Further Reading List: