(This glossary is work in progress. I will keep it updated while I cultivate in my deep learning garden.)
(As deep learning is a branch of machine learning, it will be very helpful to look at some basics of machine learning while you start with your deep learning journey. See this page for a machine learning glossary. See this page for some essential resources for Deep Learning.)
Note: Those terms in bold are the most essential ones. Some of the terms are grouped by deep learning algorithm categories, not in alphabetical order.

 Deep learning (also called deep neural nets)
It is one of the techniques of machine learning.
Deep Learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large datasets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. NIPS
Good intro: What is deep learning? (By Jason Brownlee on August 16, 2016)
Papers:
 LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep learning.” Nature 521, no. 7553 (2015): 436-444.
 Schmidhuber, Jürgen. “Deep learning in neural networks: An overview.” Neural networks 61 (2015): 85-117. (pdf)
 Convolutional neural network (CNN, or ConvNet)
See Chapter 9: Convolutional Networks in Deep Learning (MIT Press) by Ian Goodfellow and Yoshua Bengio and Aaron Courville
See pages 43-45 for a pretty good intro to CNN in Learning Deep Architectures for AI by Yoshua Bengio (pdf)
ReLU is the abbreviation of Rectified Linear Units. This is a layer of neurons that applies the non-saturating activation function f(x) = max(0, x). It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.
Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x), f(x) = |tanh(x)|, and the sigmoid function f(x) = 1 / (1 + e^(-x)). Compared to other functions the usage of ReLU is preferable, because it results in the neural network training several times faster, without making a significant difference to generalisation accuracy. [from https://en.wikipedia.org/wiki/Convolutional_neural_network#ReLU_layer]
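To make the difference between saturating and non-saturating activations concrete, here is a minimal sketch in plain Python. The function names are chosen for illustration only and do not come from any particular library.

```python
import math

def relu(x):
    """Non-saturating: the output keeps growing with positive x."""
    return max(0.0, x)

def sigmoid(x):
    """Saturating: the output is squashed into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# tanh and sigmoid flatten out for large |x|; ReLU does not.
for x in (-2.0, 0.0, 2.0, 10.0):
    print(x, relu(x), math.tanh(x), sigmoid(x))
```

For large inputs, tanh and sigmoid approach their upper limit of 1 while ReLU simply passes the value through, which is what "non-saturating" refers to.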
Pronunciation of the abbreviation: often pronounced as ‘relu’, where “re” is as in “rest” and “lu” is as in “look”
Pooling is another important concept of CNNs. It is a form of nonlinear downsampling. There are several nonlinear functions (e.g., max, mean) that can implement pooling, among which max pooling is the most common.
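As an illustration of max pooling, the sketch below downsamples a small feature map with a 2x2 window and a stride of 2. The window size, the stride and the sample values are assumptions made for the example.

```python
def max_pool_2x2(image):
    """Apply 2x2 max pooling with stride 2 to a 2D list with even dimensions."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            # Keep only the largest value in each 2x2 window.
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [3, 7]]
```

Replacing `max` with an average over the window would give mean pooling instead.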
Good intro: A Beginner’s Guide to Recurrent Networks and LSTMs
Long short-term memory (LSTM) is a special kind of recurrent neural network which, for many tasks, works much better than the traditional RNN. It is an improved version of the RNN that can control when its memory is kept and when it is flushed.
In a traditional recurrent neural network, during the gradient backpropagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of weights in the transition matrix can have a strong impact on the learning process.
If the weights in this matrix are small (or, more formally, if the leading eigenvalue of the weight matrix is smaller than 1.0), it can lead to a situation called vanishing gradients where the gradient signal gets so small that learning either becomes very slow or stops working altogether. It can also make the task of learning long-term dependencies in the data more difficult. Conversely, if the weights in this matrix are large (or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1.0), it can lead to a situation where the gradient signal is so large that it can cause learning to diverge. This is often referred to as exploding gradients.
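The effect of repeated multiplication can be seen in a one-dimensional sketch, where a single scalar weight stands in for the recurrent weight matrix. This is a simplifying assumption: in a real network the leading eigenvalue of the matrix plays the role of this scalar.

```python
def backpropagated_signal(weight, timesteps, initial=1.0):
    """Multiply a gradient signal by the same weight once per timestep."""
    signal = initial
    for _ in range(timesteps):
        signal *= weight
    return signal

# Below 1.0 the signal vanishes, above 1.0 it explodes.
for weight in (0.9, 1.0, 1.1):
    print(weight, backpropagated_signal(weight, timesteps=100))
```

After 100 timesteps a weight of 0.9 shrinks the signal to roughly 0.00003, while a weight of 1.1 grows it to more than 10,000.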
These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell (see the diagram below for illustration of an LSTM memory cell). A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. The self-recurrent connection has a weight of 1.0 and ensures that, barring any outside interference, the state of a memory cell can remain constant from one timestep to another. The gates serve to modulate the interactions between the memory cell itself and its environment. The input gate can allow incoming signal to alter the state of the memory cell or block it. On the other hand, the output gate can allow the state of the memory cell to have an effect on other neurons or prevent it. Finally, the forget gate can modulate the memory cell’s self-recurrent connection, allowing the cell to remember or forget its previous state, as needed.
Illustration of an LSTM memory cell (source: here)
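The gate interactions described above can be sketched for a single memory cell with scalar inputs and states. The sketch assumes the commonly used formulation with sigmoid gates and a tanh candidate value; the parameter layout (`params` mapping a gate name to `(w_x, w_h, bias)`) is hypothetical and chosen only for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    input_gate = sigmoid(pre_activation("input"))      # let new signal in?
    forget_gate = sigmoid(pre_activation("forget"))    # keep the old state?
    output_gate = sigmoid(pre_activation("output"))    # expose the state?
    candidate = math.tanh(pre_activation("candidate"))

    # The self-recurrent connection carries c_prev forward, scaled by the forget gate.
    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c

# Forget gate wide open, input gate shut: the cell state is carried over unchanged.
remember = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=0.5, params=remember)
print(h, c)
```

With these bias values the forget gate is close to 1 and the input gate is close to 0, so the cell state stays at approximately 0.5 from one timestep to the next.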
Good intro:
 Understanding LSTM Networks, see here (pdf if irretrievable) for an example of using LSTM for sentiment analysis
 A very good mini-course about LSTM can be found here (pdf).
A GRU (gated recurrent unit) is a pared-down, simplified LSTM. GRUs rely on gating mechanisms to learn long-range dependencies while sidestepping the vanishing gradient problem. They include reset and update gates to decide when to update the GRU’s memory at each time step.
Papers: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
MLPs are perhaps the oldest form of deep neural network. They consist of multiple, fully connected feedforward layers.
An Autoencoder is a Neural Network model whose goal is to predict the input itself, typically through a “bottleneck” somewhere in the network. By introducing a bottleneck, we force the network to learn a lower-dimensional representation of the input, effectively compressing the input into a good representation. Autoencoders are related to PCA and other dimensionality reduction techniques, but can learn more complex mappings due to their nonlinear nature. A wide range of autoencoder architectures exist, including Denoising Autoencoders, Variational Autoencoders, and Sequence Autoencoders.
Basic Terms
The concepts in this group are used across different learning algorithms.
To allow Neural Networks to learn complex decision boundaries, we apply a nonlinear activation function to some of its layers. Commonly used functions include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.
An activation, or activation function, for a neural network is defined as the mapping of the input to the output via a nonlinear transform function at each “node”, which is simply a locus of computation within the net. Each layer in a neural net consists of many nodes, and the number of nodes in a layer is known as its width.
Activation functions are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and biases works on the input data from the previous layer to determine whether that signal surpasses a given threshold and is deemed significant. Those weights and biases are slowly updated as the neural net minimizes its error; i.e., the level of the nodes’ activation changes in the course of learning. Deeplearning4j includes activation functions such as sigmoid, relu, tanh and ELU. These activation functions allow neural networks to make complex boundary decisions for features at various levels of abstraction.
Note: a decision boundary is defined not by the training set, but by the hypothesis parameters. The training set is, of course, used to fit those hypothesis parameters.
 Affine layer
A fully connected layer in a Neural Network. Affine means that each neuron in the previous layer is connected to each neuron in the current layer. In many ways, this is the “standard” layer of a Neural Network. Affine layers are often added on top of the outputs of Convolutional Neural Networks or Recurrent Neural Networks before making a final prediction. An affine layer is typically of the form y = f(Wx + b) where x are the layer inputs, W the parameters, b a bias vector, and f a nonlinear activation function.
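The form y = f(Wx + b) can be written out directly. The following sketch uses plain Python lists; the choice of tanh as the default f and the sample numbers are assumptions for illustration.

```python
import math

def affine_layer(x, W, b, f=math.tanh):
    """Compute y = f(Wx + b) for an input list x, a list-of-rows W and a bias list b."""
    return [
        f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

x = [1.0, 2.0]                               # two inputs
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three neurons, each connected to both inputs
b = [0.0, 1.0, -1.0]
print(affine_layer(x, W, b))
```

Every row of W holds the weights of one output neuron, which is what makes the layer fully connected.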
 Backpropagation (often called backprop)
Backpropagation is an algorithm used to efficiently calculate the gradients in a Neural Network, or more generally, a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward.
It calculates the gradient of a loss function with respect to all the weights in the network, so that the gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.
To calculate the gradient that relates weights to error, we use a technique known as backpropagation, which is also referred to as the backward pass of the network. Backpropagation is a repeated application of the chain rule of calculus for partial derivatives. The first step is to calculate the derivatives of the objective function with respect to the output units, then the derivatives of the output of the last hidden layer with respect to the input of the last hidden layer; then the derivatives of the input of the last hidden layer with respect to the weights between it and the penultimate hidden layer, etc. Here’s a derivation of backpropagation. And here’s Yann LeCun’s important paper on the subject.
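The repeated use of the chain rule can be followed step by step on a very small network. The sketch below assumes one hidden tanh unit, one linear output and a squared-error loss; the network shape and the names `w1` and `w2` are hypothetical.

```python
import math

def forward(w1, w2, x):
    """Tiny network: h = tanh(w1 * x), y = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backward(w1, w2, x, target):
    """Apply the chain rule from the output back to each weight."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target                 # derivative of the loss w.r.t. the output
    dL_dw2 = dL_dy * h                 # output w.r.t. the last weight
    dL_dh = dL_dy * w2                 # propagate back into the hidden unit
    dL_da = dL_dh * (1.0 - h ** 2)     # through tanh, where a = w1 * x
    dL_dw1 = dL_da * x                 # hidden input w.r.t. the first weight
    return dL_dw1, dL_dw2

print(backward(w1=0.5, w2=-1.5, x=2.0, target=1.0))
```

A common sanity check is to compare these analytic gradients with finite-difference estimates of the loss.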
Good intro:
 Calculus on Computational Graphs: Backpropagation

 Yes you should understand backprop (by Andrej Karpathy, Dec 19, 2016)
 BPTT (Backpropagation through time)
A special form of backpropagation is called backpropagation through time, or BPTT, which is specifically useful for recurrent networks analyzing text and time series. With BPTT, each time step of the RNN is the equivalent of a layer in a feedforward network. To backpropagate over many time steps, BPTT can be truncated for the purpose of efficiency. Truncated BPTT limits the time steps over which error is propagated.
Papers:
Backpropagation Through Time: What It Does and How to Do It
The gradient is a derivative, which you will know from differential calculus. That is, it describes how the error a neural net produces changes as its parameters change, while the net learns how to reconstruct a dataset or make guesses about labels. The process of minimizing error is called gradient descent. Descending a gradient has two aspects: choosing the direction to step in (the negative gradient, optionally adjusted by momentum) and choosing the size of the step (learning rate; check out here (PDF) for a good intro to learning rate).
Since MLPs are, by construction, differentiable operators, they can be trained to minimise any differentiable objective function using gradient descent. The basic idea of gradient descent is to find the derivative of the objective function with respect to each of the network weights, then adjust the weights in the direction of the negative slope. Graves
source: http://cs231n.github.io/neural-networks-3/
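The basic idea of gradient descent can be sketched on a one-parameter problem. The objective (w - 3)^2, the starting point and the learning rate below are assumptions picked so that the behaviour is easy to follow.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Objective: (w - 3)^2, whose derivative is 2 * (w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0,
                           learning_rate=0.1, steps=100)
print(w_final)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor; a much larger learning rate would overshoot and diverge.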
A loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.
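One widely used loss function is the mean squared error. It is shown here only as an example of mapping predictions and targets onto a single real-valued cost, not as the only possible choice.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A perfect prediction gives a cost of zero, and larger errors are penalised more heavily because of the square.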
Regularization is a central theme of machine learning and can significantly improve training results. Don’t forget to tune regularization in your models.
Good intro:
Regularization in deep learning
Understanding regularization for image classification and machine learning
See Chapter 7: Regularization for Deep Learning in Deep Learning (MIT Press) by Ian Goodfellow and Yoshua Bengio and Aaron Courville
Dropout is a regularization technique for neural networks, controlled by a hyperparameter (the dropout rate). Like all regularization techniques, its purpose is to prevent overfitting. Dropout randomly makes nodes in the neural network “drop out” by setting them to zero, which encourages the network to rely on other features that act as signals. That, in turn, creates more generalizable representations of data.
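A minimal sketch of dropout applied to a list of activations is given below. It assumes the "inverted dropout" variant, in which the surviving activations are scaled up during training so that their expected value is unchanged; the scaling is not described in the text above.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob; rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10, drop_prob=0.5, rng=rng)
print(dropped)
```

Dropout is applied only during training; at prediction time all nodes are kept.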
Papers:
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Recurrent Neural Network Regularization
 Embedding
An embedding is a representation of input, or an encoding. For example, a neural word embedding is a vector that represents that word. The word is said to be embedded in vector space. Word2vec and GloVe are two techniques used to train word embeddings to predict a word’s context. Because an embedding is a form of representation learning, we can “embed” any data type, including sounds, images and time series.
 Epoch vs. Iteration
In machine learning and deep learning, an epoch is a complete pass through a given dataset. That is, by the end of one epoch, your neural network – be it a restricted Boltzmann machine, convolutional net or deep-belief network – will have been exposed to every record or example within the dataset once. Not to be confused with an iteration, which is simply one update of the neural net model’s parameters. Many iterations can occur before an epoch is over. Epoch and iteration are only synonymous if you update your parameters once for each pass through the whole dataset; if you update using minibatches, they mean different things. Say your data has 2 minibatches: A and B. .numIterations(3) performs training like AAABBB, while 3 epochs looks like ABABAB.
See below for more detailed explanation:
One epoch = one forward pass and one backward pass of all the training examples.
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
Batch size = the number of training examples in one forward/backward pass. The larger the batch size, the more memory it requires.
A good explanation of epoch and batch size can be found here (pdf)
Number of iterations = number of passes. Each pass using [batch size] number of examples. One pass = one forward pass + one backward pass. Note that the forward pass and backward pass are not counted as two different passes.
For example: if we have 1000 training examples, and the batch size is defined as 500, then it will take 2 iterations to complete 1 epoch.
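The arithmetic in this example can be written as a one-line helper. Rounding up for a final, smaller batch is an assumption; some training loops drop the incomplete batch instead.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of parameter updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(1000, 500))  # 2
print(iterations_per_epoch(1000, 300))  # 4, the last batch holds only 100 examples
```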
In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, then our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if a limited amount of annotated data is available.
One solution to this problem is to perform multiple evaluations on different test sets, then to combine the scores from those evaluations, a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds. For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable.
A second, and equally important, advantage of using cross-validation is that it allows us to examine how widely the performance varies across different training sets. If we get very similar scores for all N training sets, then we can be fairly confident that the score is accurate. On the other hand, if scores vary widely across the N training sets, then we should probably be skeptical about the accuracy of the evaluation score.
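The fold construction described above can be sketched as follows. The helper name is hypothetical, and the data is split in order without shuffling, which is a simplification; in practice the examples are usually shuffled first.

```python
def k_fold_splits(num_examples, num_folds):
    """Yield (train_indices, test_indices) once for each fold."""
    indices = list(range(num_examples))
    fold_size, remainder = divmod(num_examples, num_folds)
    start = 0
    for fold in range(num_folds):
        # Spread any leftover examples over the first few folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

for train, test in k_fold_splits(num_examples=10, num_folds=3):
    print("train:", train, "test:", test)
```

Each example appears in exactly one test fold, so the combined score is based on the whole dataset.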
 f1 Score (also called F-score / F-measure)
The f1 score is a number between zero and one that summarizes how well the network classifies, with 1 being the best score and zero the worst. Although it reads like a percentage, f1 is not the probability that your net’s guesses are correct; it combines precision and recall, as defined below.
F1 = 2 * ((precision * recall) / (precision + recall))
Accuracy measures how often you get the right answer, while the f1 score also takes into account which kinds of mistakes are made. For example, if you have 100 fruit – 99 apples and 1 orange – and your model predicts that all 100 items are apples, then it is 99% accurate. But that model failed to identify the difference between apples and oranges. f1 scores help you judge whether a model is actually doing well at classifying when you have an imbalance in the categories you’re trying to tag.
An f1 score is an average of both precision and recall. More specifically, it is a type of average called the harmonic mean, which tends to be less than the arithmetic or geometric means. Recall answers: “Given a positive example, how likely is the classifier going to detect it?” It is the ratio of true positives to the sum of true positives and false negatives.
Precision answers: “Given a positive prediction from the classifier, how likely is it to be correct?” It is the ratio of true positives to the sum of true positives and false positives.
For f1 to be high, both recall and precision of the model have to be high.
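The fruit example can be worked through in code. The sketch treats "orange" as the positive class and, as an assumption, defines precision and recall as 0 when their denominator is 0.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# 99 apples, 1 orange, and a model that predicts "apple" for everything:
# no orange is ever predicted, so tp = 0, fp = 0, and the one orange is a false negative.
accuracy = 99 / 100
print(accuracy, precision_recall_f1(tp=0, fp=0, fn=1))
```

The accuracy is 0.99, yet the f1 score for the orange class is 0, which exposes the failure that accuracy hides.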
A feedforward network is a neural network that takes the initial input and triggers the activation of each layer of the network successively, without circulating. Feedforward nets contrast with recurrent and recursive nets in that feedforward nets never let the output of one node circle back to the same or previous nodes.
 Model
In neural networks, the model is the collection of weights and biases that transform input into output. A neural network is a set of algorithms that update models such that the models guess with less error as they learn. A model is a symbolic, logical or mathematical machine whose purpose is to deduce output from input. If a model’s assumptions are correct, then one must necessarily believe its conclusions. Neural networks produce trained models that can be deployed to process, classify, cluster and make predictions about data.
 Normalization
The process of transforming the data to span a range from 0 to 1.
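A minimal sketch of this rescaling, often called min-max normalization, is shown below. Returning zeros when all values are identical is an assumption made to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```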
 OneHot Encoding
Used in classification and bag of words. The label for each example is all 0s, except for a 1 at the index of the actual class to which the example belongs. For BOW, the one represents the word encountered.
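For a classification label, one-hot encoding can be sketched as follows. The class list is a made-up example.

```python
def one_hot(class_index, num_classes):
    """Return a vector of 0s with a single 1 at the position of the class."""
    vector = [0] * num_classes
    vector[class_index] = 1
    return vector

classes = ["apple", "banana", "orange", "pear"]  # hypothetical label set
print(one_hot(classes.index("orange"), len(classes)))  # [0, 0, 1, 0]
```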
Reinforcement learning is a branch of machine learning that is goal oriented; that is, reinforcement learning algorithms have as their objective to maximize a reward, often over the course of many decisions. Unlike the loss of a deep neural network, the reward in reinforcement learning is generally not a differentiable function of the model’s parameters.
Representation learning is learning the best representation of input. A vector, for example, can “represent” an image. Training a neural network will adjust the vector’s elements to represent the image better, or lead to better guesses when a neural network is fed the image. The neural net might train to guess the image’s name, for instance. Deep learning means that several layers of representations are stacked atop one another, and those representations are increasingly abstract; i.e., the initial, low-level representations are granular, and may represent pixels, while the higher representations will stand for combinations of pixels, and then combinations of combinations, and so forth.
Neural Network architectures:
In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables (“hidden units”), with connections between the layers but not between units within each layer.
When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can be further trained with supervision to perform classification.
DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each subnetwork’s hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a “visible” input layer and a hidden layer and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each subnetwork in turn, starting from the “lowest” pair of layers (the lowest visible layer is a training set).
A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.
RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning and topic modelling. They can be trained in either supervised or unsupervised ways, depending on the task.
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.
Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by “stacking” RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.
A neural machine translation (NMT) system uses Neural Networks to translate between languages, such as English and French. NMT systems can be trained end-to-end using bilingual corpora, which differs from traditional Machine Translation systems that require hand-crafted features and engineering. NMT systems are typically implemented using encoder and decoder recurrent neural networks that encode a source sentence and produce a target sentence, respectively.
Papers:
Sequence to sequence learning with neural networks
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Neural Turing machines (NTMs) are Neural Network architectures that can infer simple algorithms from examples. For example, an NTM may learn a sorting algorithm through example inputs and outputs. NTMs typically learn some form of memory and attention mechanism to deal with state during program execution.
Papers: Neural Turing Machines
Others
Tomas Mikolov’s neural networks, known as Word2vec, have become widely used because they help produce state-of-the-art word embeddings. Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.
Word2Vec and GloVe are two popular word embedding algorithms used to construct vector representations for words. These vector representations can then be used to compute the semantic similarity between words mathematically.
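One common way to compare two word vectors is cosine similarity. The sketch below uses tiny made-up vectors; they are purely illustrative and are not real Word2Vec or GloVe embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" for illustration only.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.0, 0.95],
}
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["banana"]))
```

In this toy setup "king" scores higher against "queen" than against "banana", which mirrors how similarity is read off trained embeddings.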
Good intro:
 Word2Vec word embedding tutorial in Python and TensorFlow (pdf) (July 21, 2017) provides a nice introduction to what Word2Vec is and why we need it, as well as a Python tutorial for Word2Vec in TensorFlow.
 The amazing power of word vectors (pdf)
 How is GloVe different from word2vec? (pdf)
 Global Vectors for Word Representation (GloVe)
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
Project website:
http://nlp.stanford.edu/projects/glove/
Related Paper:
GloVe: Global Vectors for Word Representation
 Data Science
Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference.
References and Reading List:
 Deep Learning Glossary (wildml.com)
 Dr. Andrew Ng’s Machine Learning course on Coursera
 Christopher M. Bishop. (2006). Pattern recognition and machine learning. Springer.
 Michael Nielsen’s book Neural networks and deep learning
 Deep Learning and Neural Network Glossary
 Deep Learning in a Nutshell: Core Concepts (NVIDIA)
 Deep Learning in a Nutshell: History and Training (NVIDIA)
 Deep Learning in a Nutshell: Sequence Learning (NVIDIA)
 Get Started with Deep Learning (NVIDIA) – this is the place to find deep learning related resources, including choosing NVIDIA GPUs etc.
 Neural Networks Tutorial – A Pathway to Deep Learning (pdf). This post describes essential concepts involved in deep learning. The good thing is that for most concepts it provides Python code snippets.
 Deep Learning Tutorial by LISA lab, University of Montreal
 Introduction to Evaluation metrics here.
This tutorial introduced basic concepts involved in deep learning (i.e., deep neural nets, MLP, RNN, LSTM).
 Cross-Validation (pdf)
 Seni, G., & Elder, J. F. (2010). Ensemble methods in data mining: improving accuracy through combining predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), 1-126. (pdf)
Pages 26-28 have a pretty good and concise introduction to cross-validation.
 machinelearningmastery.com
 Wikipedia
 Deep Learning Glossary
 25 Must Know Terms & concepts for Beginners in Deep Learning (MAY 21, 2017) — pdf
 Deep Learning Essential Terms from Ritchie Ng’s machine learning blog.
 Learning Deep Architectures for AI by Yoshua Bengio (pdf)