Machine Learning Glossary

(This glossary is a work in progress. I will keep it updated while I cultivate my deep learning garden.)

(See this page for some books about machine learning that I recommend. See this page for a deep learning glossary. See this page for some essential resources for deep learning. See this post for essential terms and techniques in NLP that I have collected and organized.)

If you do not have any experience with machine learning or deep learning, check out this set of cheatsheets on the topics here (it also has a website version for better readability).

Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.

Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).

Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

The acquisition of labeled data for a learning problem often requires a skilled human agent or a physical experiment.

Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.

(source)

Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias (and also variance) in supervised learning, and a family of machine learning algorithms that convert weak learners into strong ones.

In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, then our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if a limited amount of annotated data is available.

One solution to this problem is to perform multiple evaluations on different test sets, then to combine the scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds. For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable.

A second, and equally important, advantage of using cross-validation is that it allows us to examine how widely the performance varies across different training sets. If we get very similar scores for all N training sets, then we can be fairly confident that the score is accurate. On the other hand, if scores vary widely across the N training sets, then we should probably be skeptical about the accuracy of the evaluation score.

k-fold cross-validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.

For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1 of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.

When k = n (the number of observations), the k-fold cross-validation is exactly the leave-one-out cross-validation.
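To make this concrete, here is a minimal k-fold cross-validation sketch using scikit-learn; the iris dataset, the logistic regression model, and k = 5 are placeholder choices for illustration only:

    # Minimal k-fold cross-validation sketch (placeholder dataset/model).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # Each of the k folds serves exactly once as the validation set.
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=kfold)

    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())  # the combined evaluation score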

A confusion matrix provides a summary of all of the predictions made compared to the expected actual values.

The results are presented in a matrix with counts in each cell. The counts of actual class values are summarized horizontally, whereas the counts of predictions for each class values are presented vertically.

A perfect set of predictions is shown as a diagonal line from the top left to the bottom right of the matrix.

The value of a confusion matrix for classification problems is that you can clearly see which predictions were wrong and the type of mistake that was made.

When performing classification tasks with three or more labels, it can be informative to subdivide the errors made by the model based on which types of mistakes it made. A confusion matrix helps with this.
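As a small illustration, here is a sketch of computing a confusion matrix with scikit-learn; the label values below are invented:

    # Confusion matrix sketch; rows are actual classes, columns are
    # predicted classes, and correct predictions lie on the diagonal.
    from sklearn.metrics import confusion_matrix

    y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
    y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

    labels = ["bird", "cat", "dog"]
    print(confusion_matrix(y_true, y_pred, labels=labels))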

A good explanation of the confusion matrix can be found here (pdf).

Confusion matrix online calculator

What is a Confusion Matrix in Machine Learning

Simple guide to confusion matrix terminology (pdf)

  •  Scale Machine Learning Data

Many machine learning algorithms expect data to be scaled consistently.

There are two popular methods that you should consider when scaling your data for machine learning.

In this post, you will discover how you can rescale your data for machine learning. After reading this tutorial you will know:

  • How to normalize your data from scratch.
  • How to standardize your data from scratch.
  • When to normalize as opposed to standardize data (see the sketch after this list).
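As a taste of what the post covers, here is a from-scratch sketch of both methods; the example values are arbitrary:

    import numpy as np

    x = np.array([50.0, 30.0, 70.0, 90.0, 10.0])

    # Normalization: rescale values to the range [0, 1].
    x_norm = (x - x.min()) / (x.max() - x.min())

    # Standardization: rescale to zero mean and unit standard deviation.
    x_std = (x - x.mean()) / x.std()

    print(x_norm)
    print(x_std)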

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration,[1] via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Feature learning or representation learning is a set of techniques that learn a feature: a transformation of raw data input to a representation that can be effectively exploited in machine learning tasks.

Automatic feature extraction from raw data.

Feature Engineering (Feb 20, 2017 by HJ van Veen)

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons:

  • simplification of models to make them easier to interpret by researchers/users,
  • shorter training times,
  • enhanced generalization by reducing overfitting (formally, reduction of variance)

 

Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit, maximum entropy (MaxEnt) classifier, and conditional maximum entropy model.

Check whether the Maximum Entropy model used for NER in this paper is the same as the "maximum entropy classifier":

OpenNLP is a Java-based library for various natural language processing tasks, such as tokenization, part-of-speech (POS) tagging, and named entity recognition. For named entity recognition, it trains a Maximum Entropy model using information from the whole document to recognize entities in documents.

http://opennlp.apache.org

 

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.

Softmax Classifiers Explained
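Since softmax regression appears under several of the names above, a minimal sketch of the softmax function itself may help; the logits are arbitrary example scores:

    import numpy as np

    def softmax(z):
        # Subtracting the max is for numerical stability; it does not
        # change the result because softmax is shift-invariant.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])
    probs = softmax(logits)
    print(probs, probs.sum())  # class probabilities; they sum to 1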

Regularization is a technique used to reduce overfitting.

Generalization in machine learning refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

A good post:  Overfitting and Underfitting With Machine Learning Algorithms (March 21, 2016 by Jason Brownlee)

The cause of poor performance in machine learning is either overfitting or underfitting the data. Overfitting refers to a model that models the training data too well but generalizes poorly; when new data comes in, it produces large errors.

Overfitting: Good performance on the training data, poor generalization to other data.


 

  • Underfitting

The cause of poor performance in machine learning is either overfitting or underfitting the data. Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model, and this will be obvious because it will have poor performance on the training data. Underfitting normally indicates that the model is too simple for the data; the usual remedy is a more expressive model or better features, rather than more training examples.

Underfitting: Poor performance on the training data and poor generalization to other data

A good post:  Overfitting and Underfitting With Machine Learning Algorithms (March 21, 2016 by Jason Brownlee)

Decision trees are a powerful prediction method and extremely popular.

They are popular because the final model is so easy to understand by practitioners and domain experts alike. The final decision tree can explain exactly why a specific prediction was made, making it very attractive for operational use.

Decision trees also provide the foundation for more advanced ensemble methods such as bagging, random forests and gradient boosting.

Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

Decision trees can suffer from high variance which makes their results fragile to the specific training data used.

Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees are highly correlated.

Random forest is an extension of bagging that, in addition to building trees from multiple samples of your training data, also constrains the features that can be used to build the trees, forcing the trees to be different. This, in turn, can give a lift in performance.
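A minimal random forest sketch with scikit-learn follows; the dataset and hyperparameters are placeholder choices:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_features constrains the features considered at each split,
    # which decorrelates the trees (the key difference from bagging).
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))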

Good post: Random Forests (Leo Breiman and Adele Cutler)

In statistics, Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.

Random walk Monte Carlo methods make up a large subclass of MCMC methods.

 

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once.

In computer science, incremental learning is a method of machine learning in which input data is continuously used to extend the existing model's knowledge (i.e., to further train the model). It represents a dynamic technique of supervised learning and unsupervised learning that can be applied when training data becomes available gradually over time or its size exceeds system memory limits. Algorithms that can facilitate incremental learning are known as incremental machine learning algorithms.

The aim of incremental learning is for the learning model to adapt to new data without forgetting its existing knowledge; it does not retrain the model from scratch.

Transfer learning or inductive transfer is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks. This area of research bears some relation to the long history of psychological literature on transfer of learning, although formal ties between the two fields are limited.

A good tutorial of transfer learning in TensorFlow can be found here (How to Retrain Inception’s Final Layer for New Categories).

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Simple but powerful ensemble techniques include (see the sketch after this list):

  1. Max Voting
  2. Averaging
  3. Weighted Averaging
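Here is a toy sketch of these three combination rules; the per-model predictions are invented for illustration:

    import numpy as np
    from collections import Counter

    # Max voting on class labels predicted by three models.
    votes = ["cat", "dog", "cat"]
    majority = Counter(votes).most_common(1)[0][0]  # -> "cat"

    # Averaging of predicted class probabilities.
    probs = np.array([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]])
    avg = probs.mean(axis=0)

    # Weighted averaging: better models get more influence.
    weights = np.array([0.5, 0.2, 0.3])
    weighted_avg = np.average(probs, axis=0, weights=weights)

    print(majority, avg, weighted_avg)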

 

 


 

Deep Learning Glossary

(This glossary is a work in progress. I will keep it updated while I cultivate my deep learning garden.)

(As deep learning is a branch of machine learning, it will be very helpful to look at some basics of machine learning while you start with your deep learning journey. See this page for a machine learning glossary. See this page for some essential resources for Deep Learning.)

If you do not have any experience with machine learning or deep learning, check out this set of cheatsheets on the topics here (it also has a website version for better readability).

Note: Those terms in bold are the most essential ones. Some of the terms are grouped by deep learning algorithm categories, not in alphabetical order.

Deep learning is one of many machine learning techniques.

Deep Learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large datasets by using the back-propagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

- NIPS

Good intro: What is deep learning? (By Jason Brownlee on August 16, 2016)

Papers:



See Chapter 9: Convolutional Networks in  Deep Learning (MIT Press) by Ian Goodfellow and Yoshua Bengio and Aaron Courville

See pages 43 -45 for a pretty good intro to CNN in Learning Deep Architectures for AI by Yoshua Bengio (pdf)

ReLU is the abbreviation of Rectified Linear Units. This is a layer of neurons that applies the non-saturating activation function f(x) = max(0, x). It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.

Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x), f(x) = |tanh(x)|, and the sigmoid function f(x) = (1 + e^{-x})^{-1}. Compared to other functions, the usage of ReLU is preferable because it results in the neural network training several times faster,[40] without making a significant difference to generalisation accuracy. [from https://en.wikipedia.org/wiki/Convolutional_neural_network#ReLU_layer]

Pronunciation of the abbreviation: often pronounced as "relu" – where "re" is as in "rest", and "lu" is as in "look".
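A NumPy sketch of the activation functions mentioned above:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)        # f(x) = max(0, x)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))  # f(x) = (1 + e^{-x})^{-1}

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x))
    print(np.tanh(x))  # saturating hyperbolic tangent
    print(sigmoid(x))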


Pooling is another important concept in CNNs; it is a form of non-linear downsampling. There are several non-linear functions (e.g., max, mean) to implement pooling, among which max pooling is the most common.
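A from-scratch sketch of 2x2 max pooling with stride 2 on a single-channel feature map; the input values are arbitrary:

    import numpy as np

    def max_pool_2x2(feature_map):
        h, w = feature_map.shape
        trimmed = feature_map[:h // 2 * 2, :w // 2 * 2]
        blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))  # max over each 2x2 block

    fm = np.array([[1, 3, 2, 4],
                   [5, 6, 1, 2],
                   [7, 2, 9, 1],
                   [3, 4, 1, 8]])
    print(max_pool_2x2(fm))
    # [[6 4]
    #  [7 9]]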

 



Good intro: 


A very special kind of recurrent neural network which works, for many tasks, much better than the traditional RNN. It is an improved version of the RNN that adds the ability to flush (forget) its memory.

In a traditional recurrent neural network, during the gradient backpropagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of weights in the transition matrix can have a strong impact on the learning process.

If the weights in this matrix are small (or, more formally, if the leading eigenvalue of the weight matrix is smaller than 1.0), it can lead to a situation called vanishing gradients, where the gradient signal gets so small that learning either becomes very slow or stops working altogether. It can also make the task of learning long-term dependencies in the data more difficult. Conversely, if the weights in this matrix are large (or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1.0), it can lead to a situation where the gradient signal is so large that it can cause learning to diverge. This is often referred to as exploding gradients.

These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell (see the diagram below for illustration of an LSTM memory cell). A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. The self-recurrent connection has a weight of 1.0 and ensures that, barring any outside interference, the state of a memory cell can remain constant from one timestep to another. The gates serve to modulate the interactions between the memory cell itself and its environment. The input gate can allow incoming signal to alter the state of the memory cell or block it. On the other hand, the output gate can allow the state of the memory cell to have an effect on other neurons or prevent it. Finally, the forget gate can modulate the memory cell’s self-recurrent connection, allowing the cell to remember or forget its previous state, as needed.

                     Illustration of an LSTM memory cell (source: here)

Good intro:

 


A GRU is a pared-down LSTM. GRUs rely on gating mechanisms to learn long-range dependencies while sidestepping the vanishing gradient problem. They include reset and update gates to decide when and how to update the GRU's memory at each time step.

In short, a GRU is a simplified LSTM.

Papers: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

 


MLPs are perhaps the oldest form of deep neural network. They consist of multiple, fully connected feedforward layers.


An Autoencoder is a Neural Network model whose goal is to predict the input itself, typically through a “bottleneck” somewhere in the network. By introducing a bottleneck, we force the network to learn a lower-dimensional representation of the input, effectively compressing the input into a good representation. Autoencoders are related to PCA and other dimensionality reduction techniques, but can learn more complex mappings due to their nonlinear nature. A wide range of autoencoder architectures exist, including Denoising Autoencoders, Variational Autoencoders, and Sequence Autoencoders.
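A minimal bottleneck autoencoder sketch, assuming the TensorFlow/Keras API; the layer sizes are illustrative:

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(784,))
    # The 32-unit bottleneck forces a compressed representation.
    encoded = tf.keras.layers.Dense(32, activation="relu")(inputs)
    decoded = tf.keras.layers.Dense(784, activation="sigmoid")(encoded)

    autoencoder = tf.keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    # Train to reconstruct the input: the input is also the target.
    # autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)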



Basic Terms

The concepts in this group are used and get involved in different learning algorithms.

To allow Neural Networks to learn complex decision boundaries, we apply a nonlinear activation function to some of its layers. Commonly used functions include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.

An activation, or activation function, for a neural network is defined as the mapping of the input to the output via a non-linear transform function at each “node”, which is simply a locus of computation within the net. Each layer in a neural net consists of many nodes, and the number of nodes in a layer is known as its width.

Activation algorithms are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and biases works on the input data from the previous layer to determine whether that signal surpasses a given threshold and is deemed significant. Those weights and biases are slowly updated as the neural net minimizes its error; i.e., the nodes' activation levels change in the course of learning. Deeplearning4j includes activation functions such as sigmoid, relu, tanh and ELU. These activation functions allow neural networks to make complex boundary decisions for features at various levels of abstraction.

Note: a decision boundary is defined not by the training set but by the hypothesis parameters; the training set is, of course, used to fit the hypothesis parameters.


  • Affine layer

A fully-connected layer in a Neural Network. Affine means that each neuron in the previous layer is connected to each neuron in the current layer. In many ways, this is the “standard” layer of a Neural Network. Affine layers are often added on top of the outputs of Convolutional Neural Networks or Recurrent Neural Networks before making a final prediction. An affine layer is typically of the form y = f(Wx + b), where x is the layer input, W the weight matrix, b a bias vector, and f a nonlinear activation function.
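A NumPy sketch of a single affine layer with ReLU as the nonlinearity f; the shapes are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4)        # layer input: 4 features
    W = rng.standard_normal((3, 4))   # weights: 3 output neurons
    b = np.zeros(3)                   # bias vector

    y = np.maximum(0.0, W @ x + b)    # y = f(Wx + b)
    print(y.shape)                    # (3,)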


Backpropagation is an algorithm used to efficiently calculate the gradients in a Neural Network, or more generally, a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward.

It calculates the gradient of a loss function with respect to all the weights in the network, so that the gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.

To calculate the gradient that relates weights to error, we use a technique known as backpropagation, which is also referred to as the backward pass of the network. Backpropagation is a repeated application of the chain rule of calculus for partial derivatives. The first step is to calculate the derivatives of the objective function with respect to the output units, then the derivatives of the output of the last hidden layer with respect to its input, then the derivatives of the last hidden layer's input with respect to the weights between it and the penultimate hidden layer, and so on. Here’s a derivation of backpropagation. And here’s Yann LeCun’s important paper on the subject.
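A tiny sketch of that repeated chain rule on a one-neuron network with a squared-error loss; all values are invented:

    import numpy as np

    x = np.array([1.0, 2.0])    # input
    w = np.array([0.5, -0.3])   # weights
    t = 1.0                     # target

    # Forward pass.
    y = np.tanh(w @ x)
    loss = 0.5 * (y - t) ** 2

    # Backward pass: chain rule from the loss back to the weights.
    dloss_dy = y - t                   # dL/dy
    dy_dz = 1.0 - y ** 2               # d tanh(z)/dz
    dloss_dw = dloss_dy * dy_dz * x    # dL/dw

    w -= 0.1 * dloss_dw                # one gradient-descent update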

Good intro:

A special form of backpropagation is called backpropagation through time, or BPTT, which is specifically useful for recurrent networks analyzing text and time series. With BPTT, each time step of the RNN is the equivalent of a layer in a feed-forward network. To backpropagate over many time steps, BPTT can be truncated for the purpose of efficiency. Truncated BPTT limits the time steps over which error is propagated.

Papers:

Backpropagation Through Time: What It Does and How to Do It


The gradient is a derivative, which you will know from differential calculus. That is, it's the rate of change of a neural net's error with respect to its parameters, as the net learns how to reconstruct a dataset or make guesses about labels. The process of minimizing error is called gradient descent. Descending a gradient has two aspects: choosing the direction to step in (momentum) and choosing the size of the step (learning rate; check out here (PDF) for a good intro about learning rate).

Since MLPs are, by construction, differentiable operators, they can be trained to minimise any differentiable objective function using gradient descent. The basic idea of gradient descent is to find the derivative of the objective function with respect to each of the network weights, then adjust the weights in the direction of the negative slope. -Graves

source: http://cs231n.github.io/neural-networks-3/
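A sketch of gradient descent with momentum on the toy objective f(w) = (w - 3)^2; the learning rate and momentum values are illustrative:

    learning_rate = 0.1   # size of each step
    momentum = 0.9        # fraction of the previous step to keep
    w, velocity = 0.0, 0.0

    for step in range(100):
        grad = 2 * (w - 3)  # df/dw
        velocity = momentum * velocity - learning_rate * grad
        w += velocity       # step along the negative slope

    print(w)  # approaches the minimum at w = 3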


A loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.

 

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs; that is, one that has good generalization. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.

Regularization is a central theme of machine learning and can significantly improve training results. Don’t forget to tune regularization in your models.

Good intro: 

Regularization in deep learning

Understanding regularization for image classification and machine learning

See Chapter 7: Regularization for Deep Learning in  Deep Learning (MIT Press) by Ian Goodfellow and Yoshua Bengio and Aaron Courville

 

Dropout is a regularization technique for neural networks (the dropout rate is a hyperparameter). Like all regularization techniques, its purpose is to prevent overfitting. Dropout randomly makes nodes in the neural network “drop out” by setting them to zero, which encourages the network to rely on other features that act as signals. That, in turn, creates more generalizable representations of data.
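A sketch of (inverted) dropout applied to a layer's activations during training; the drop probability is illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, drop_prob=0.5):
        keep_prob = 1.0 - drop_prob
        mask = rng.random(activations.shape) < keep_prob
        # Scale by 1/keep_prob so the expected activation is unchanged
        # and no rescaling is needed at test time.
        return activations * mask / keep_prob

    a = np.array([0.2, 1.5, 0.7, 2.0, 0.1])
    print(dropout(a))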

Papers: 

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Recurrent Neural Network Regularization

  • Embedding

An embedding is a representation of input, or an encoding. For example, a neural word embedding is a vector that represents that word. The word is said to be embedded in vector space. Word2vec and GloVe are two techniques used to train word embeddings to predict a word’s context. Because an embedding is a form of representation learning, we can “embed” any data type, including sounds, images and time series.

  • Epoch vs. Iteration

In machine learning and deep learning, an epoch is a complete pass through a given dataset. That is, by the end of one epoch, your neural network – be it a restricted Boltzmann machine, convolutional net or deep-belief network – will have been exposed to every example within the dataset once. Not to be confused with an iteration, which is simply one update of the neural net model’s parameters. Many iterations can occur before an epoch is over. Epoch and iteration are only synonymous if you update your parameters once for each pass through the whole dataset; if you update using mini-batches, they mean different things. Say your data has 2 minibatches: A and B. .numIterations(3) performs training like AAABBB, while 3 epochs looks like ABABAB.

See below for more detailed explanation:

One epoch = one forward pass and one backward pass of all the training examples.

The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

Batch size = the number of training examples in one forward/backward pass. The larger the batch size, the more memory it requires.

A good explanation of epoch and batch size can be found here (pdf).

Number of iterations = number of passes, where each pass uses [batch size] examples. One pass = one forward pass + one backward pass. Note that the forward pass and backward pass are not counted as two different passes.

For example: if we have 1000 training examples, and the batch size is defined as 500, then it will take 2 iterations to complete 1 epoch.
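The same arithmetic as a short sketch:

    import math

    n_examples, batch_size = 1000, 500
    iterations_per_epoch = math.ceil(n_examples / batch_size)
    print(iterations_per_epoch)  # 2 iterations complete 1 epoch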

 


 

  • f1 Score (also called F-score / F-measure)

The f1 score is a number between zero and one that explains how well the network performed during training. It is analogous to a percentage, with 1 being the best score and zero the worst. f1 is, roughly speaking, a combined measure of how correct your net’s guesses are.

F1 = 2 * ((precision * recall) / (precision + recall))

Accuracy measures how often you get the right answer, while the f1 score also accounts for how the errors are distributed across classes. For example, if you have 100 fruit – 99 apples and 1 orange – and your model predicts that all 100 items are apples, then it is 99% accurate. But that model failed to identify the difference between apples and oranges. f1 scores help you judge whether a model is actually doing well at classifying when you have an imbalance in the categories you’re trying to tag.

An f1 score is an average of both precision and recall. More specifically, it is a type of average called the harmonic mean, which tends to be less than the arithmetic or geometric mean. Recall answers: “Given a positive example, how likely is the classifier to detect it?” It is the ratio of true positives to the sum of true positives and false negatives.

Precision answers: “Given a positive prediction from the classifier, how likely is it to be correct ?” It is the ratio of true positives to the sum of true positives and false positives.

For f1 to be high, both recall and precision of the model have to be high.
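A from-scratch sketch of precision, recall, and f1 on invented binary labels (1 = positive class):

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp)  # correct / predicted positives
    recall = tp / (tp + fn)     # detected / actual positives
    f1 = 2 * (precision * recall) / (precision + recall)
    print(precision, recall, f1)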

  • Feedforward network

A neural network that takes the initial input and triggers the activation of each layer of the network successively, without circulating. Feed-forward nets contrast with recurrent and recursive nets in that feed-forward nets never let the output of one node circle back to the same or previous nodes.

 

  • Model

In neural networks, the model is the collection of weights and biases that transform input into output. A neural network is a set of algorithms that update models such that the models guess with less error as they learn. A model is a symbolic, logical or mathematical machine whose purpose is to deduce output from input. If a model’s assumptions are correct, then one must necessarily believe its conclusions. Neural networks produce trained models that can be deployed to process, classify, cluster and make predictions about data.

  • Normalization

The process of transforming the data to span a range from 0 to 1.

  • One-hot encoding

Used in classification and bag of words (BOW). The label for each example is all 0s, except for a 1 at the index of the actual class to which the example belongs. For BOW, the 1 represents the word encountered.
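A NumPy sketch of one-hot encoding class labels; the labels are invented:

    import numpy as np

    labels = np.array([0, 2, 1])  # class indices for three examples
    one_hot = np.eye(3)[labels]   # 3 = number of classes
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]]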

Reinforcement learning is a branch of machine learning that is goal oriented; that is, reinforcement learning algorithms have as their objective to maximize a reward, often over the course of many decisions. Unlike the loss functions used to train deep neural networks, the reward signal in reinforcement learning is generally not differentiable.

Representation learning is learning the best representation of input. A vector, for example, can “represent” an image. Training a neural network will adjust the vector’s elements to represent the image better, or lead to better guesses when a neural network is fed the image. The neural net might train to guess the image’s name, for instance. Deep learning means that several layers of representations are stacked atop one another, and those representations are increasingly abstract; i.e. the initial, low-level representations are granular, and may represent pixels, while the higher representations will stand for combinations of pixels, and then combinations of combinations, and so forth.

 



Neural Network architectures:

In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables (“hidden units”), with connections between the layers but not between units within each layer.[1]

When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors.[1] After this learning step, a DBN can be further trained with supervision to perform classification.

DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs)[1] or autoencoders,[3] where each sub-network’s hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a “visible” input layer and a hidden layer and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the “lowest” pair of layers (the lowest visible layer is a training set).

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.

RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986,[1] and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction,[2] classification,[3] collaborative filtering,[4] feature learning[5] and topic modelling.[6] They can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.[7]

Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by “stacking” RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.[8]

A neural machine translation (NMT) system uses Neural Networks to translate between languages, such as English and French. NMT systems can be trained end-to-end using bilingual corpora, which differs from traditional Machine Translation systems that require hand-crafted features and engineering. NMT systems are typically implemented using encoder and decoder recurrent neural networks that encode a source sentence and produce a target sentence, respectively.

Papers: 

Sequence to Sequence Learning with Neural Networks

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

 

Neural Turing machines (NTMs) are Neural Network architectures that can infer simple algorithms from examples. For example, an NTM may learn a sorting algorithm through example inputs and outputs. NTMs typically learn some form of memory and attention mechanism to deal with state during program execution.

Papers: Neural Turing Machines

Others 



Tomas Mikolov’s neural networks, known as Word2vec, have become widely used because they help produce state-of-the-art word embeddings. Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

Word2Vec and GloVe are two popular recent word embedding algorithms used to construct vector representations for words. These methods can be used to compute the semantic similarity between words from their mathematical vector representations.
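A minimal training sketch, assuming the gensim 4.x Word2Vec API; the toy corpus is invented:

    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    # sg=1 selects the skip-gram variant; the other values are toy settings.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    vector = model.wv["cat"]                  # the embedding for "cat"
    print(model.wv.similarity("cat", "dog"))  # cosine similarity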

Good intro:

 

  • Global Vectors for Word Representation (GloVe)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Project website:
http://nlp.stanford.edu/projects/glove/

Related Paper:
GloVe: Global Vectors for Word Representation

  • Data Science

Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference.



References and Further Reading List:

This tutorial introduced basic concepts involved in deep learning (i.e., deep neural nets, MLP, RNN, LSTM).

  • Cross-Validation  (pdf)
  • Seni, G., & Elder, J. F. (2010). Ensemble methods in data mining: Improving accuracy through combining predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), 1-126. (pdf)

Pages 26-28 have a pretty good and concise introduction to cross-validation.