Top 10 Data Mining Algorithms, Explained

source: from kdnuggets.com (Good posts sometimes disappear, so I repost it here for my information and yours.)

Top 10 data mining algorithms, selected by top researchers, are explained here, including what they do, the intuition behind each algorithm, available implementations, why to use them, and interesting applications.

By Raymond Li.

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!


Here are the algorithms:

  • 1. C4.5
  • 2. k-means
  • 3. Support vector machines
  • 4. Apriori
  • 5. EM
  • 6. PageRank
  • 7. AdaBoost
  • 8. kNN
  • 9. Naive Bayes
  • 10. CART

We also provide interesting resources at the end.

1. C4.5

What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.

Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.

What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2max, family history, etc. These are called attributes.

Now:

Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.

And here’s the deal:

Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.

Cool, so what’s a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:

  • Patient has a history of cancer
  • Patient is expressing a gene highly correlated with cancer patients
  • Patient has tumors
  • Patient’s tumor size is greater than 5cm

The bottom line is:

At each point in the flowchart is a question about the value of some attribute, and depending on those values, the patient gets classified. You can find lots of examples of decision trees.
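To make the flowchart idea concrete, here is a tiny hand-written sketch in Python. It is not a tree that C4.5 learned: the field names (`family_history`, `expresses_gene`, `has_tumors`, `tumor_size_cm`) and the outcome at each leaf are assumptions made up for illustration, following the example path above.

```python
def classify_patient(patient):
    """Walk a hand-written decision tree; each `if` is one question in the flowchart."""
    if not patient["family_history"]:
        return "won't get cancer"
    if not patient["expresses_gene"]:
        return "won't get cancer"
    if not patient["has_tumors"]:
        return "won't get cancer"
    # Last question on the example path: is the tumor larger than 5 cm?
    if patient["tumor_size_cm"] > 5:
        return "will get cancer"
    return "won't get cancer"


# Hypothetical patient who follows the full example path.
example = {
    "family_history": True,
    "expresses_gene": True,
    "has_tumors": True,
    "tumor_size_cm": 6.0,
}
result = classify_patient(example)
```

A learned tree has the same shape. The difference is that the algorithm chooses the questions and their order from the training data.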

Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

You might be wondering how C4.5 is different than other decision tree systems?

  • First, C4.5 uses information gain (normalized into a gain ratio) when generating the decision tree.
  • Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning typically yields a smaller tree that generalizes better to new data.
  • Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data, thus turning continuous data into discrete data.
  • Finally, C4.5 has its own methods for dealing with incomplete (missing) data.
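Here is a rough sketch of how information gain can be computed for a candidate split. It shows plain information gain only (C4.5 goes a step further and normalizes it into a gain ratio, which is not shown), and the tiny dataset and attribute names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training data: 4 patients, 2 attributes, known classes.
rows = [
    {"family_history": "yes", "smoker": "yes"},
    {"family_history": "yes", "smoker": "no"},
    {"family_history": "no", "smoker": "yes"},
    {"family_history": "no", "smoker": "no"},
]
labels = ["cancer", "cancer", "healthy", "healthy"]

gain_history = information_gain(rows, labels, "family_history")
gain_smoker = information_gain(rows, labels, "smoker")
```

In this toy data, splitting on `family_history` separates the classes perfectly while `smoker` tells us nothing, so a tree builder would ask about family history first.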

Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Classifiers are great, but make sure to check out the next algorithm about clustering…

2. k-means

What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It’s a popular cluster analysis technique for exploring a dataset.

Hang on, what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such that the group members are more similar to each other than to non-group members. Clusters and groups are synonymous in the world of cluster analysis.

Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster analysis, these would be called observations. We know various things about each patient like age, pulse, blood pressure, VO2max, cholesterol, etc. This is a vector representing the patient.

Look:

You can basically think of a vector as a list of numbers we know about the patient. This list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood pressure another dimension and so forth.

You might be wondering:

Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc?

Want to know the best part?

You tell k-means how many clusters you want. K-means takes care of the rest.

How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.

At a high level, they all do something like this:

  1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
  2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
  3. What we have are k clusters, and each patient is now a member of a cluster.
  4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
  5. This center becomes the new centroid for the cluster.
  6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
  7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
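The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, picks the starting centroids at random from the data, and uses made-up (age, pulse) pairs as patient vectors.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k starting centroids
    assignments = None
    for _ in range(max_iterations):
        # steps 2-3: each point joins the cluster of its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # step 7: memberships are stable
            break
        assignments = new_assignments
        # steps 4-5: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical patients described by (age, pulse).
patients = [(25, 60), (27, 62), (24, 58), (70, 85), (72, 88), (68, 90)]
centroids, assignments = k_means(patients, k=2)
```

On this toy data the three younger patients end up in one cluster and the three older patients in the other. Which cluster is numbered 0 depends on the random starting centroids.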

Is this supervised or unsupervised? It depends, but most would classify k-means as unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to. k-means can be semi-supervised.

Why use k-means? I don’t think many will have an issue with this:

The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.

It gets better:

k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore whether there are overlooked patterns or relationships in the dataset.

It’s not all smooth sailing:

Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.

Where is it used? A ton of implementations for k-means clustering are available online:

If decision trees and clustering didn’t impress you, you’re going to love the next algorithm.

3. Support vector machines

What does it do? Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5’s, except SVM doesn’t use decision trees at all.

Whoa, a hyper-what? A hyperplane is a function like the equation for a line, y = mx + b. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.

As it turns out…

SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…

…SVM figures out the best hyperplane which separates your data into the 2 classes.

Do you have an example? Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.

You see:

When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.

What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.

And the coolest part?

SVM figures out the function for the hyperplane.

What if things get more complicated? Right, they frequently do. If the balls are mixed together, a straight stick won’t work.

Here’s the work-around:

Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air.

You might be wondering if this is cheating:

Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.

How does SVM do this? By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.
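Here is a small sketch of the "lift the table" idea. To keep it readable, it uses an explicit mapping from 2 to 3 dimensions with a hand-picked threshold. A real kernel SVM gets the same effect through a kernel function without computing the mapping, and it learns the plane from the data. The points and the `lift` function are assumptions for illustration.

```python
def lift(point):
    """Map a 2-D point (x, y) to 3-D, using squared distance from the origin as height."""
    x, y = point
    return (x, y, x * x + y * y)


# Red balls sit near the center of the table, blue balls surround them,
# so no straight stick on the table can separate the two colors.
red = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
blue = [(2.0, 0.0), (0.0, 2.5), (-1.5, -1.5)]


def predict(point, height_threshold=1.0):
    """The 'sheet of paper' is the horizontal plane z = height_threshold."""
    return "red" if lift(point)[2] < height_threshold else "blue"


red_predictions = [predict(p) for p in red]
blue_predictions = [predict(p) for p in blue]
```

Once the balls are lifted, every red ball sits below the sheet and every blue ball sits above it, even though no straight line separated them on the table.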

I found this visualization super helpful:

Reddit also has 2 great threads on this in the ELI5 and ML subreddits.

How do balls on a table or in the air map to real-life data? A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.

Here’s the deal:

If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.

The bottom line is:

SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.

Margins are often associated with SVM? What are they? The margin is the distance between the hyperplane and the 2 closest data points from each respective class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.

The key is:

SVM attempts to maximize the margin, so that the hyperplane is just as far away from the closest red ball as from the closest blue ball. In this way, it decreases the chance of misclassification.
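To see what a margin looks like in numbers, the sketch below compares 2 candidate sticks on a made-up table of balls. It does not train an SVM. It only measures the margin of lines chosen by hand, and the coordinates are hypothetical.

```python
from math import hypot


def signed_distance(point, w, b):
    """Signed distance from `point` to the line w[0]*x + w[1]*y + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / hypot(w[0], w[1])


def margin(points, w, b):
    """Distance from the line to the closest point on either side."""
    return min(abs(signed_distance(p, w, b)) for p in points)


red = [(1, 1), (2, 1), (1, 3)]
blue = [(5, 1), (6, 4), (7, 2)]

# Two vertical sticks that both separate the colors: x = 3.5 and x = 2.5.
centered = ((1.0, 0.0), -3.5)
off_center = ((1.0, 0.0), -2.5)

centered_margin = margin(red + blue, *centered)
off_center_margin = margin(red + blue, *off_center)
```

Both sticks classify every ball correctly, but the centered one keeps 1.5 units of breathing room on each side while the off-center one leaves only 0.5. SVM prefers the stick with the larger margin.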

Where does SVM get its name from? Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.

Is this supervised or unsupervised? This is supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.

Why use SVM? SVM along with C4.5 are generally the 2 classifiers to try first. No classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses.

Where is it used? There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.

The next algorithm is one of my favorites…

4. Apriori

What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.

What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database.

What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.

[Figure: shopping database]

Here’s the best part:

By applying the Apriori algorithm, we can learn the grocery items that are purchased together a.k.a association rules.

The power of this is:

You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.

For example:

You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!

You might be wondering how Apriori works? Before getting into the nitty-gritty of the algorithm, you’ll need to define 3 things:

  1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
  2. The second is your support or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support is called a frequent itemset.
  3. The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.
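Support and confidence are easy to compute by hand on a small example. The 5 transactions below are made up, chosen so that the chips and soda rule comes out at the 67% confidence mentioned above.

```python
# Hypothetical supermarket transactions, one set of items per customer.
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
    {"milk", "soda"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


chips_soda_support = support({"chips", "soda"}, transactions)
chips_to_soda = confidence({"chips"}, {"soda"}, transactions)
```

Here 2 of the 5 transactions contain both chips and soda (support 0.4), and 2 of the 3 transactions with chips also contain soda (confidence of about 0.67).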

The basic Apriori algorithm is a 3 step approach:

  1. Join. Scan the whole database for how frequent 1-itemsets are.
  2. Prune. Those itemsets that satisfy the minimum support move on to the next round for 2-itemsets. (Confidence comes into play afterwards, when rules are derived from the frequent itemsets.)
  3. Repeat. This is repeated for each itemset level until we reach our previously defined size.
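The level-by-level search can be sketched as follows. This is a simplified version for illustration: it filters candidates on minimum support only, and it skips the usual optimization of discarding candidates that have an infrequent subset. The function name, parameters and transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support, max_size):
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every individual item.
    items = {item for t in transactions for item in t}
    level = [frozenset([item]) for item in items]
    frequent = {}
    for size in range(1, max_size + 1):
        # Prune: keep only the candidates that meet the minimum support.
        survivors = {c: support(c) for c in level if support(c) >= min_support}
        frequent.update(survivors)
        # Join: combine surviving itemsets into candidates one item larger.
        level = list(
            {a | b for a, b in combinations(survivors, 2) if len(a | b) == size + 1}
        )
        if not level:
            break
    return frequent


# Hypothetical supermarket transactions.
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
    {"milk", "soda"},
]
frequent = apriori(transactions, min_support=0.4, max_size=2)
```

With a minimum support of 0.4, bread drops out in the first round, and only chips + dip and chips + soda survive as frequent 2-itemsets.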

Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.

But wait, there’s more…

Apriori can also be modified to do classification based on labelled data.

Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.

On the other hand…

The algorithm can be quite memory, space and time intensive when generating itemsets.

Where is it used? Plenty of implementations of Apriori are available. Some popular ones are the ARtool, Weka, and Orange.

The next algorithm was the most difficult for me to understand. Take a look…

5. EM

What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.

OK, hang on while I explain…

I’m not a statistician, so hopefully my simplification is both correct and helps with understanding.

Here are a few concepts that will make this way easier…

What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.

Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.

In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.

Cool, what are the parameters of a model? A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.

Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.

So, all you need to describe a normal distribution are 2 parameters:

  1. The mean
  2. The variance

And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample.

Here’s the deal:

We don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with estimated mean and variance results in that bunch of grades.

In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.

Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade.
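A quick numeric sketch may help here. It scores the same sample of grades under 2 different guesses for the mean, using the log of the likelihood to keep the numbers manageable. Strictly speaking, for a continuous bell curve the likelihood is built from densities rather than probabilities. The sample grades and the second guess (a mean of 60) are made up. The mean of 85 and variance of 100 come from the exam example.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mean, variance):
    """Density of the normal distribution at x."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)


def log_likelihood(grades, mean, variance):
    """Log of the likelihood of seeing `grades` under the given bell curve."""
    return sum(log(normal_pdf(g, mean, variance)) for g in grades)


# Hypothetical sample of exam grades.
sample = [78, 85, 92, 88, 81, 95, 70, 90]

good_fit = log_likelihood(sample, mean=85, variance=100)
poor_fit = log_likelihood(sample, mean=60, variance=100)
```

The bell curve centered at 85 gives the sample a much higher likelihood than the one centered at 60, which is why 85 is the better estimate of the mean.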

You’re probably wondering, what’s probability then?

Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.

In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us.

Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There are a number of reasons that the data could be missing (not recorded, ignored, etc.).

Here’s the kicker:

For data mining and clustering, what’s important to us is looking at the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.

Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.

The best part is…

By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me!

How does EM help with clustering? EM begins by making a guess at the model parameters.

Then it follows an iterative 3-step process:

  1. E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
  2. M-step: Update the model parameters based on the cluster assignments from the E-step.
  3. Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
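Here is a minimal sketch of that loop for the simplest case I could think of: 1-dimensional data modeled as a mixture of 2 bell curves. The choice of 2 clusters, the way the initial guess is made, the fixed number of iterations and the grades themselves are all assumptions for illustration.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, variance):
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)


def em_two_gaussians(data, iterations=50):
    # Initial guess: one cluster at the smallest value, one at the largest,
    # both as wide as the whole dataset.
    overall_mean = sum(data) / len(data)
    overall_variance = sum((x - overall_mean) ** 2 for x in data) / len(data)
    means = [min(data), max(data)]
    variances = [overall_variance, overall_variance]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each data point belongs to each cluster.
        responsibilities = []
        for x in data:
            scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(scores)
            responsibilities.append([s / total for s in scores])
        # M-step: re-estimate each cluster's parameters from those probabilities.
        for k in range(2):
            r_k = [r[k] for r in responsibilities]
            n_k = sum(r_k)
            means[k] = sum(r * x for r, x in zip(r_k, data)) / n_k
            spread = sum(r * (x - means[k]) ** 2 for r, x in zip(r_k, data)) / n_k
            variances[k] = max(spread, 1e-6)  # guard against a collapsed cluster
            weights[k] = n_k / len(data)
    return means, variances, weights, responsibilities


# Hypothetical grades from 2 groups of exam takers mixed together.
grades = [55, 58, 60, 62, 65, 85, 88, 90, 92, 95]
means, variances, weights, responsibilities = em_two_gaussians(grades)
```

Notice that the E-step produces soft assignments (probabilities) rather than hard ones. That is the main difference from the k-means loop.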

Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.

Why use EM? A key selling point of EM is it’s simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.

This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to.

EM is not without weaknesses though…

  • First, EM is fast in the early iterations, but slow in the later iterations.
  • Second, EM doesn’t always find the optimal parameters and gets stuck in local optima rather than global optima.

Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its mixture module (GaussianMixture; older releases exposed it as GMM).

What data mining does Google do? Take a look…

6. PageRank

What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.

Yikes.. what’s link analysis? It’s a type of network analysis looking to explore the associations (a.k.a. links) among objects.

Here’s an example: The most prevalent example of PageRank is Google’s search engine. Although their search engine doesn’t solely rely on PageRank, it’s one of the measures Google uses to determine a web page’s importance.

Let me explain:

Web pages on the World Wide Web link to each other. If rayli.net links to a web page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant.

And it doesn’t stop there…

rayli.net’s votes are in turn weighted by rayli.net’s importance and relevance. In other words, any web page that’s voted for rayli.net increases rayli.net’s relevance.

The bottom line?

This concept of voting and relevance is PageRank. rayli.net’s vote for CNN increases CNN’s PageRank, and the strength of rayli.net’s PageRank influences how much its vote affects CNN’s PageRank.
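The voting idea can be sketched as a simple loop that keeps passing votes around until the scores settle. The damping factor of 0.85 is the commonly cited default and is an assumption here, as are the iteration count and the tiny link graph (blog.example and news.example are made-up pages).

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


links = {
    "rayli.net": ["cnn.com"],
    "blog.example": ["cnn.com", "rayli.net"],
    "news.example": ["cnn.com"],
    "cnn.com": ["rayli.net"],
}
ranks = pagerank(links)
```

In this toy graph cnn.com collects the most votes, rayli.net comes second because it is voted for by the highly ranked cnn.com, and the 2 pages nobody links to come last.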

What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a PageRank number isn’t disclosed by Google, we can get a sense of its relative meaning.

And here’s how:

[Figure: PageRank table]

You see?

It’s a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.

What other applications are there of PageRank? PageRank was specifically designed for the World Wide Web.

Think about it:

At its core, PageRank is really just a super effective way to do link analysis. The objects being linked don’t have to be web pages.

Here are 3 innovative applications of PageRank:

  1. Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
  2. Twitter developed WTF (Who-to-Follow) which is a personalized PageRank recommendation engine about who to follow.
  3. Bin Jiang, from The Hong Kong Polytechnic University, used a variant of PageRank to predict human movement rates based on topographical metrics in London.

Is this supervised or unsupervised? PageRank is generally considered an unsupervised learning approach, since it’s often used to discover the importance or relevance of a web page.

Why use PageRank? Arguably, the main selling point of PageRank is its robustness due to the difficulty of getting a relevant incoming link.

Simply stated:

If you have a graph or network and want to understand relative importance, priority, ranking or relevance, give PageRank a try.

Where is it used? The PageRank trademark is owned by Google. However, the PageRank algorithm is actually patented by Stanford University.

You might be wondering if you can use PageRank:

I’m not a lawyer, so best to check with an actual lawyer, but you can probably use the algorithm as long as it doesn’t commercially compete against Google/Stanford.

Here are 3 implementations of PageRank:

  1. C++ OpenSource PageRank Implementation
  2. Python PageRank Implementation
  3. igraph – The network analysis package (R)

With our powers combined, we are…

 

7. AdaBoost

What does it do? AdaBoost is a boosting algorithm which constructs a classifier.

As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to.

But what’s boosting? Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

What’s the difference between a strong and weak learner? A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.

Alternatively…

A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.

What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing patient data. The dataset contains details about the patient’s medical records.

The question is…

How can we predict whether the patient will get cancer?

Here’s how AdaBoost answers the question…

In round 1: AdaBoost takes a sample of the training dataset and tests to see how accurate each learner is. The end result is we find the best learner.

In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.

One more thing, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).

In round 2: AdaBoost again attempts to look for the best learner.

And here’s the kicker:

The sample of patient training data is now influenced by the more heavily misclassified weights. In other words, previously misclassified patients have a higher chance of showing up in the sample.

Why?

It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.

Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let’s focus all the efforts on getting the misclassified patients.

The best learner is again weighted and incorporated into the ensemble, misclassified patients are weighted so they have a higher chance of being picked and we rinse and repeat.

At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
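Here is a compact sketch of the idea using decision stumps as the weak learners. One difference from the walk-through above: instead of drawing a weighted sample each round, this version feeds the sample weights straight into the error calculation, which is a common way AdaBoost is written. The single numeric feature, the +1/-1 labels and the choice of 3 rounds are assumptions for illustration.

```python
from math import exp, log


def stump_predict(stump, x):
    """A decision stump asks one question: is x greater than the threshold?"""
    threshold, sign = stump
    return sign if x > threshold else -sign


def best_stump(xs, ys, weights):
    """Pick the threshold/sign pair with the lowest weighted error."""
    best, best_error = None, float("inf")
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict((threshold, sign), x) != y
            )
            if error < best_error:
                best, best_error = (threshold, sign), error
    return best, best_error


def adaboost(xs, ys, rounds=10):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (learner weight, stump) pairs
    for _ in range(rounds):
        stump, error = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * log((1 - error) / error)  # more accurate stumps get a bigger say
        ensemble.append((alpha, stump))
        # Misclassified samples get heavier, correctly classified ones lighter.
        weights = [
            w * exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Hypothetical measurements; +1 and -1 stand for the 2 classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

No single stump can get this data right, since the -1 class sits in the middle. After 3 rounds, though, the weighted vote of 3 stumps classifies every training point correctly.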

Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weak learners with the labelled dataset.

Why use AdaBoost? AdaBoost is simple. The algorithm is relatively straight-forward to program.

In addition, it’s fast! Weak learners are generally simpler than strong learners. Being simpler means they’ll likely execute faster.

Another thing…

It’s a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.

Finally, it’s flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.

Where is it used? AdaBoost has a ton of implementations and variants. Here are a few:

If you like Mr. Rogers, you’ll like the next algorithm…

8. kNN

What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.

What’s a lazy learner? A lazy learner doesn’t do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify.

On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.

How do C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.

Here’s why:

  1. C4.5 builds a decision tree classification model during training.
  2. SVM builds a hyperplane classification model during training.
  3. AdaBoost builds an ensemble classification model during training.

So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data.

When new unlabeled data comes in, kNN operates in 2 basic steps:

  1. First, it looks at the k closest labeled training data points — in other words, the k-nearest neighbors.
  2. Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.

You might be wondering…

How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There’s tons more details and papers on kNN distance metrics.

For discrete data, the idea is to transform discrete data into continuous data. 2 examples of this are:

  1. Using Hamming distance as a metric for the “closeness” of 2 text strings.
  2. Transforming discrete data into binary features.
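Both tricks are short enough to sketch. The helper names and the blood type categories below are made up for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which 2 equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def one_hot(value, categories):
    """Turn a discrete value into binary features, one per category."""
    return [1 if value == c else 0 for c in categories]


string_distance = hamming_distance("karolin", "kathrin")
blood_type_features = one_hot("B", ["A", "B", "AB", "O"])
```

The 2 strings differ in 3 positions, so their Hamming distance is 3. The blood type "B" becomes the binary vector [0, 1, 0, 0], which an ordinary distance metric can work with.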

These 2 Stack Overflow threads have some more suggestions on dealing with discrete data:

How does kNN classify new data when neighbors disagree? kNN has an easy time when all neighbors are the same class. The intuition is if all the neighbors agree, then the new data point likely falls in the same class.

I’ll bet you can guess where things get hairy…

How does kNN decide the class when neighbors don’t have the same class?

2 common techniques for dealing with this are:

  1. Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
  2. Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly what we want!
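Both voting schemes fit in one small function. This sketch assumes Euclidean distance and Python 3.8+ (for `math.dist`). The training points are made up so that the 2 schemes disagree.

```python
from collections import defaultdict
from math import dist


def knn_classify(training, query, k=3, weighted=False):
    """training is a list of (point, label) pairs."""
    # Step 1: find the k labeled points closest to the query.
    neighbors = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    # Step 2: let the neighbors vote on the class.
    votes = defaultdict(float)
    for point, label in neighbors:
        distance = dist(point, query)
        # Reciprocal-distance weighting; an exact match gets a very large vote.
        votes[label] += 1.0 / max(distance, 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical labeled balls on a table.
training = [
    ((1, 0), "red"),
    ((3, 0), "blue"),
    ((0, 4), "blue"),
    ((10, 10), "red"),
    ((9, 9), "blue"),
]

majority = knn_classify(training, (0, 0), k=3)
by_distance = knn_classify(training, (0, 0), k=3, weighted=True)
```

With a simple majority, the 2 blue neighbors outvote the single red one. With reciprocal-distance weights, the red neighbor at distance 1 (weight 1) beats the blue neighbors at distances 3 and 4 (weights 1/3 + 1/4), so the answer flips to red.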

Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.

Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.

But that’s just part of the story…

Here are 5 things to watch out for:

  1. kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
  2. Noisy data can throw off kNN classifications.
  3. Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.
  4. Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers.
  5. Selecting a good distance metric is crucial to kNN’s accuracy.

Where is it used? A number of kNN implementations exist:

Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm…

9. Naive Bayes

What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:

Every feature of the data being classified is independent of all other features given the class.

What does independent mean? 2 features are independent when the value of one feature has no effect on the value of another feature.

For example:

Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if the values of the features have no effect on each other. For this dataset, it’s reasonable to assume that the patient’s height and zip code are independent, since a patient’s height has little to do with their zip code.

But let’s not stop there, are the other features independent?

Sadly, the answer is no. Here are 3 feature relationships which are not independent:

  • If height increases, weight likely increases.
  • If cholesterol level increases, weight likely increases.
  • If cholesterol level increases, pulse likely increases as well.

In my experience, the features of a dataset are generally not all independent.

And that ties in with the next question…

Why is it called naive? The assumption that all features of a dataset are independent is precisely why it’s called naive — it’s generally not the case that all features are independent.

What’s Bayes? Thomas Bayes was an English statistician for whom Bayes’ Theorem is named. You can click on the link to find out more about Bayes’ Theorem.

In a nutshell, the theorem allows us to predict the class given a set of features using probability.

The simplified equation for classification looks something like this:

P(\textit{Class A}|\textit{Feature 1}, \textit{Feature 2}) = \dfrac{P(\textit{Feature 1}|\textit{Class A}) \cdot P(\textit{Feature 2}|\textit{Class A}) \cdot P(\textit{Class A})}{P(\textit{Feature 1}) \cdot P(\textit{Feature 2})}

Let’s dig deeper into this…

What does the equation mean? The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.

The equation reads: The probability of Class A given Features 1 and 2 is a fraction.

  • The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
  • The fraction’s denominator is the probability of Feature 1 multiplied by the probability of Feature 2.
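As a rough sketch, the simplified equation can be written as a small Python function. The function name, the parameter names and the example probabilities below are made up for illustration only.

```python
from math import prod

def simplified_posterior(likelihoods, prior, feature_probs):
    """Evaluate the simplified Naive Bayes equation for one class.

    likelihoods   -- P(Feature i | Class A) for each feature
    prior         -- P(Class A)
    feature_probs -- P(Feature i) for each feature
    """
    numerator = prod(likelihoods) * prior
    denominator = prod(feature_probs)
    return numerator / denominator

# Made-up probabilities, purely for illustration.
value = simplified_posterior(
    likelihoods=[0.5, 0.4],
    prior=0.3,
    feature_probs=[0.4, 0.5],
)
print(round(value, 3))
```

The denominator does not depend on the class, so when the goal is only to pick the most likely class it is enough to compare the numerators.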

What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow thread.

Here’s the deal:

  • We have a training dataset of 1,000 fruits.
  • The fruit can be a Banana, Orange or Other (these are the classes).
  • The fruit can be Long, Sweet or Yellow (these are the features).

Fruit Probabilities

What do you see in this training dataset?

  • Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
  • Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
  • Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.

If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.

Suppose we are told the unknown fruit is long, sweet and yellow.

Here’s how we calculate all the probabilities in 4 steps:

Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or more succinctly:

P(Banana|Long, Sweet, Yellow)

This is exactly like the equation discussed earlier.

Step 2: Starting with the numerator, let’s plug everything in.

  • P(Long|Banana) = 400/500 = 0.8
  • P(Sweet|Banana) = 350/500 = 0.7
  • P(Yellow|Banana) = 450/500 = 0.9
  • P(Banana) = 500 / 1000 = 0.5

Multiplying everything together (as in the equation), we get:

0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252

Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.

Step 4: Do a similar calculation for the other classes. As with the banana, these values are the numerators only:

  • P(Orange|Long, Sweet, Yellow): 0 × 0.5 × 1 × 0.3 = 0
  • P(Other|Long, Sweet, Yellow): 0.5 × 0.75 × 0.25 × 0.2 = 0.01875

Since 0.252 is greater than both 0 and 0.01875, Naive Bayes would classify this long, sweet and yellow fruit as a banana.
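The four steps above can be sketched in a few lines of Python. The counts come straight from the training dataset described earlier, while the variable and function names are illustrative. The final normalization (dividing each score by the sum of all scores) is an extra step, added here as an assumption so that the results read as probabilities that sum to 1.

```python
from math import prod

# Counts from the 1,000-fruit training dataset in the example.
counts = {
    "Banana": {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"total": 300, "Long": 0, "Sweet": 150, "Yellow": 300},
    "Other": {"total": 200, "Long": 100, "Sweet": 150, "Yellow": 50},
}
TOTAL_FRUIT = sum(row["total"] for row in counts.values())

def class_score(fruit, observed_features):
    """Numerator of the equation: P(class) times each P(feature | class)."""
    row = counts[fruit]
    prior = row["total"] / TOTAL_FRUIT
    likelihoods = [row[feature] / row["total"] for feature in observed_features]
    return prior * prod(likelihoods)

observed = ["Long", "Sweet", "Yellow"]
scores = {fruit: class_score(fruit, observed) for fruit in counts}

# Step 3 ignores the denominator, so the largest numerator wins.
best = max(scores, key=scores.get)

# Optional extra: rescale the scores so they sum to 1.
score_sum = sum(scores.values())
posteriors = {fruit: score / score_sum for fruit, score in scores.items()}

for fruit in counts:
    print(fruit, round(scores[fruit], 5), round(posteriors[fruit], 3))
print("Predicted class:", best)
```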

Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.

Why use Naive Bayes? As you can see in the example above, Naive Bayes involves simple arithmetic. It’s just tallying up counts, multiplying and dividing.

Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.

Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to be effective for spam filtering.

Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.

Finally, check out the 10th algorithm…

10. CART

What does it do? CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.

Is a classification tree like a decision tree? A classification tree is a type of decision tree. The output of a classification tree is a class.

For example, given a patient dataset, you might attempt to predict whether the patient will get cancer. The class would either be “will get cancer” or “won’t get cancer.”

What’s a regression tree? Unlike a classification tree, which predicts a class, a regression tree predicts a numeric or continuous value, e.g. a patient’s length of stay or the price of a smartphone.

Here’s an easy way to remember…

Classification trees output classes, regression trees output numbers.
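As a rough sketch of that difference, here is how a single leaf might turn the training examples that land in it into a prediction. It assumes the common convention that a classification leaf predicts the majority class and a regression leaf predicts the mean. The function names and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Predict the most common class among the examples in the leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Predict the average of the numeric targets in the leaf."""
    return mean(values)

# Hypothetical examples that ended up in the same leaf.
leaf_classes = ["won't get cancer", "won't get cancer", "will get cancer"]
leaf_stays_in_days = [2, 3, 7]

print(classification_leaf(leaf_classes))
print(regression_leaf(leaf_stays_in_days))
```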

Since we’ve already covered how decision trees are used to classify data, let’s jump right into things…

How does this compare with C4.5?

  • Splitting criterion: C4.5 uses information gain to segment data during decision tree generation. CART uses Gini impurity (not to be confused with Gini coefficient). A good discussion of the differences between the impurity and coefficient is available on Stack Overflow.
  • Pruning: C4.5 uses a single-pass pruning process to mitigate over-fitting. CART uses the cost-complexity method of pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with the node vs. without the node. If the cost doesn’t meet a threshold, it is pruned away.
  • Branches: In C4.5, the decision nodes can have 2 or more branches. In CART, the decision nodes have exactly 2 branches.
  • Missing values: C4.5 probabilistically distributes missing values to children. CART uses surrogates to distribute the missing values to children.
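To give a feel for the splitting criteria in the comparison, here is a small Python sketch of the two impurity measures. It assumes the standard textbook definitions (Gini impurity is 1 minus the sum of squared class proportions, and information gain is the drop in entropy after a split) and uses a made-up set of labels. Real C4.5 and CART implementations add refinements that are not shown here.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Fraction of the examples that belong to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits, the basis of information gain."""
    return -sum(p * log2(p) for p in class_proportions(labels))

def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical binary split of 8 labeled examples.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3]

print("Gini decrease:", round(impurity_decrease(parent, children, gini_impurity), 4))
print("Information gain:", round(impurity_decrease(parent, children, entropy), 4))
```

Both measures are largest when the classes are evenly mixed and drop to 0 for a node containing a single class, so a good split is one that produces a large decrease.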

Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.

Why use CART? Many of the reasons you’d use C4.5 also apply to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation apply to CART as well.

Like C4.5, CART is also quite fast and quite popular, and its output is human readable.

Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.

Interesting Resources

Great talks about Machine Learning and Deep Learning

(Stay tuned in, the list is growing over time)

AI vs. Machine Learning vs. Deep Learning

(Stay tuned, I keep updating this post while I plow in my deep learning garden:))

in category: Machine Learning vs Deep Learning

*****The following slide is from slideshare.net: Transfer Learning and Fine-tuning Deep Neural Networks  (Sep 2, 2016 by  Anusua Trivedi, Data Scientist @ Microsoft)

*****The following slide is from  Prof. Andrew Ng’s talk  “Machine Learning and AI via Brain simulations” (PDF) at Stanford University. 

*****The following slide is from the lecture talk  “How Could Machines Learn as Efficiently as Animals and Humans?” (December 12, 2017) given by Prof. Yann LeCun, Director of Facebook AI Research and Silver Professor of Computer Science at New York University.

*****Below is an  excerpt from What is deep learning? (By Jason Brownlee on August 16, 2016)

The core of deep learning according to Andrew is that we now have fast enough computers and enough data to actually train large neural networks. When discussing why now is the time that deep learning is taking off at ExtractConf 2015 in a talk titled “What data scientists should know about deep learning”, he commented:

very large neural networks we can now have and … huge amounts of data that we have access to

He also commented on the important point that it is all about scale. That as we construct larger neural networks and train them with more and more data, their performance continues to increase. This is generally different to other machine learning techniques that reach a plateau in performance.

for most flavors of the old generations of learning algorithms … performance will plateau. … deep learning … is the first class of algorithms … that is scalable. … performance just keeps getting better as you feed them more data

Dr. Andrew Ng provides a nice plot  in his slides:

(Source: Ng, A. What Data Scientists Should Know about Deep Learning (see slide 30 of 34), 2015)

*****The relations between AI, Machine Learning, and Deep Learning

“Deep learning is a subset of machine learning, and machine learning is a subset of AI, which is an umbrella term for any computer program that does something smart. In other words, all machine learning is AI, but not all AI is machine learning, and so forth.” (check here for source.)

Below is a short excerpt from the source: The AI Revolution: Why Deep Learning Is Suddenly Changing Your Life (from fortune.com By Roger Parloff, Illustration by Justin Metz on SEPTEMBER 28, 2016)

Think of deep learning as a subset of a subset. “Artificial intelligence” encompasses a vast range of technologies—like traditional logic and rules-based systems—that enable computers and robots to solve problems in ways that at least superficially resemble thinking. Within that realm is a smaller category called machine learning, which is the name for a whole toolbox of arcane but important mathematical techniques that enable computers to improve at performing tasks with experience. Finally, within machine learning is the smaller subcategory called deep learning.

A detailed  explanation similar to the nested set diagram above can be found in this post Understanding the differences between AI, machine learning, and deep learning (By Hope Reese | February 23, 2017).

======Below are some main screenshots from this talk: Watch Now: Deep Learning Demystified

References and reading list:

Clear and concise intro to RNN

We motivate why recurrent neural networks are important for dealing with sequence data and review LSTM and GRU (gated recurrent unit) architectures. A GRU is a simplified LSTM. Note: BPTT stands for backpropagation through time.

The AI Revolution: Why Deep Learning Is Suddenly Changing Your Life

source: from fortune.com (Good posts sometimes disappear, so I repost it here for my and for your information. Some bold highlights are mine.)  

in category: TechNews_deep learning

Illustration by Justin Metz

Decades-old discoveries are now electrifying the computing industry and will soon transform corporate America.

Over the past four years, readers have doubtlessly noticed quantum leaps in the quality of a wide range of everyday technologies.

Most obviously, the speech-recognition functions on our smartphones work much better than they used to. When we use a voice command to call our spouses, we reach them now. We aren’t connected to Amtrak or an angry ex.

In fact, we are increasingly interacting with our computers by just talking to them, whether it’s Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, or the many voice-responsive features of Google. Chinese search giant Baidu says customers have tripled their use of its speech interfaces in the past 18 months.

Machine translation and other forms of language processing have also become far more convincing, with Google, Microsoft, Facebook, and Baidu unveiling new tricks every month. Google Translate now renders spoken sentences in one language into spoken sentences in another for 32 pairs of languages, while offering text translations for 103 tongues, including Cebuano, Igbo, and Zulu. Google’s Inbox app offers three ready-made replies for many incoming emails.

Then there are the advances in image recognition. The same four companies all have features that let you search or automatically organize collections of photos with no identifying tags. You can ask to be shown, say, all the ones that have dogs in them, or snow, or even something fairly abstract like hugs. The companies all have prototypes in the works that generate sentence-long descriptions for the photos in seconds.

Think about that. To gather up dog pictures, the app must identify anything from a Chihuahua to a German shepherd and not be tripped up if the pup is upside down or partially obscured, at the right of the frame or the left, in fog or snow, sun or shade. At the same time it needs to exclude wolves and cats. Using pixels alone. How is that possible?

The advances in image recognition extend far beyond cool social apps. Medical startups claim they’ll soon be able to use computers to read X-rays, MRIs, and CT scans more rapidly and accurately than radiologists, to diagnose cancer earlier and less invasively, and to accelerate the search for life-saving pharmaceuticals. Better image recognition is crucial to unleashing improvements in robotics, autonomous drones, and, of course, self-driving cars—a development so momentous that we made it a cover story in June. Ford, Tesla, Uber, Baidu, and Google parent Alphabet are all testing prototypes of self-piloting vehicles on public roads today.

But what most people don’t realize is that all these breakthroughs are, in essence, the same breakthrough. They’ve all been made possible by a family of artificial intelligence (AI) techniques popularly known as deep learning, though most scientists still prefer to call them by their original academic designation: deep neural networks.

The most remarkable thing about neural nets is that no human being has programmed a computer to perform any of the stunts described above. In fact, no human could. Programmers have, rather, fed the computer a learning algorithm, exposed it to terabytes of data—hundreds of thousands of images or years’ worth of speech samples—to train it, and have then allowed the computer to figure out for itself how to recognize the desired objects, words, or sentences.

In short, such computers can now teach themselves. “You essentially have software writing software,” says Jen-Hsun Huang, CEO of graphics processing leader Nvidia, which began placing a massive bet on deep learning about five years ago. (For more, read Fortune’s interview with Nvidia CEO Jen-Hsun Huang.)

Neural nets aren’t new. The concept dates back to the 1950s, and many of the key algorithmic breakthroughs occurred in the 1980s and 1990s. What’s changed is that today computer scientists have finally harnessed both the vast computational power and the enormous storehouses of data—images, video, audio, and text files strewn across the Internet—that, it turns out, are essential to making neural nets work well. “This is deep learning’s Cambrian explosion,” says Frank Chen, a partner at the Andreessen Horowitz venture capital firm, alluding to the geological era when most higher animal species suddenly burst onto the scene.

That dramatic progress has sparked a burst of activity. Equity funding of AI-focused startups reached an all-time high last quarter of more than $1 billion, according to the CB Insights research firm. There were 121 funding rounds for such startups in the second quarter of 2016, compared with 21 in the equivalent quarter of 2011, that group says. More than $7.5 billion in total investments have been made during that stretch—with more than $6 billion of that coming since 2014. (In late September, five corporate AI leaders—Amazon, Facebook, Google, IBM, and Microsoft—formed the nonprofit Partnership on AI to advance public understanding of the subject and conduct research on ethics and best practices.)

Google had two deep-learning projects underway in 2012. Today it is pursuing more than 1,000, according to a spokesperson, in all its major product sectors, including search, Android, Gmail, translation, maps, YouTube, and self-driving cars. IBM’s Watson system used AI, but not deep learning, when it beat two Jeopardy champions in 2011. Now, though, almost all of Watson’s 30 component services have been augmented by deep learning, according to Watson CTO Rob High.

Venture capitalists, who didn’t even know what deep learning was five years ago, today are wary of startups that don’t have it. “We’re now living in an age,” Chen observes, “where it’s going to be mandatory for people building sophisticated software applications.” People will soon demand, he says, “ ‘Where’s your natural-language processing version?’ ‘How do I talk to your app? Because I don’t want to have to click through menus.’ ”

Some companies are already integrating deep learning into their own day-to-day processes. Says Peter Lee, cohead of Microsoft Research: “Our sales teams are using neural nets to recommend which prospects to contact next or what kinds of product offerings to recommend.”

The hardware world is feeling the tremors. The increased computational power that is making all this possible derives not only from Moore’s law but also from the realization in the late 2000s that graphics processing units (GPUs) made by Nvidia—the powerful chips that were first designed to give gamers rich, 3D visual experiences—were 20 to 50 times more efficient than traditional central processing units (CPUs) for deep-learning computations. This past August, Nvidia announced that quarterly revenue for its data center segment had more than doubled year over year, to $151 million. Its chief financial officer told investors that “the vast majority of the growth comes from deep learning by far.” The term “deep learning” came up 81 times during the 83-minute earnings call.

Chip giant Intel isn’t standing still. In the past two months it has purchased Nervana Systems (for more than $400 million) and Movidius (price undisclosed), two startups that make technology tailored for different phases of deep-learning computations.

For its part, Google revealed in May that for over a year it had been secretly using its own tailor-made chips, called tensor processing units, or TPUs, to implement applications trained by deep learning. (Tensors are arrays of numbers, like matrices, which are often multiplied against one another in deep-learning computations.)

Indeed, corporations just may have reached another inflection point. “In the past,” says Andrew Ng, chief scientist at Baidu Research, “a lot of S&P 500 CEOs wished they had started thinking sooner than they did about their Internet strategy. I think five years from now there will be a number of S&P 500 CEOs that will wish they’d started thinking earlier about their AI strategy.”

Even the Internet metaphor doesn’t do justice to what AI with deep learning will mean, in Ng’s view. “AI is the new electricity,” he says. “Just as 100 years ago electricity transformed industry after industry, AI will now do the same.”

One way to think of what deep learning does is as “A to B mappings,” says Baidu’s Ng. “You can input an audio clip and output the transcript. That’s speech recognition.” As long as you have data to train the software, the possibilities are endless, he maintains. “You can input email, and the output could be: Is this spam or not?” Input loan applications, he says, and the output might be the likelihood a customer will repay it. Input usage patterns on a fleet of cars, and the output could advise where to send a car next.

Deep learning, in that vision, could transform almost any industry. “There are fundamental changes that will happen now that computer vision really works,” says Jeff Dean, who leads the Google Brain project. Or, as he unsettlingly rephrases his own sentence, “now that computers have opened their eyes.”

Does that mean it’s time to brace for “the singularity”—the hypothesized moment when superintelligent machines start improving themselves without human involvement, triggering a runaway cycle that leaves lowly humans ever further in the dust, with terrifying consequences?

Not just yet. Neural nets are good at recognizing patterns—sometimes as good as or better than we are at it. But they can’t reason.

The first sparks of the impending revolution began flickering in 2009. That summer Microsoft’s principal researcher Li Deng invited neural nets pioneer Geoffrey Hinton, of the University of Toronto, to visit. Impressed with his research, Deng’s group experimented with neural nets for speech recognition. “We were shocked by the results,” Lee says. “We were achieving more than 30% improvements in accuracy with the very first prototypes.”

In 2011, Microsoft introduced deep-learning technology into its commercial speech-recognition products, according to Lee. Google followed suit in August 2012.

But the real turning point came in October 2012. At a workshop in Florence, Italy, Fei-Fei Li, the head of the Stanford AI Lab and the founder of the prominent annual ImageNet computer-vision contest, announced that two of Hinton’s students had invented software that identified objects with almost twice the accuracy of the nearest competitor. “It was a spectacular result,” recounts Hinton, “and convinced lots and lots of people who had been very skeptical before.” (In last year’s contest a deep-learning entrant surpassed human performance.)

Cracking image recognition was the starting gun, and it kicked off a hiring race. Google landed Hinton and the two students who had won that contest. Facebook signed up French deep learning innovator Yann LeCun, who, in the 1980s and 1990s, had pioneered the type of algorithm that won the ImageNet contest. And Baidu snatched up Ng, a former head of the Stanford AI Lab, who had helped launch and lead the deep-learning-focused Google Brain project in 2010.

The hiring binge has only intensified since then. Today, says Microsoft’s Lee, there’s a “bloody war for talent in this space.” He says top-flight minds command offers “along the lines of NFL football players.”

Geoffrey Hinton, 68, first heard of neural networks in 1972 when he started his graduate work in artificial intelligence at the University of Edinburgh. Having studied experimental psychology as an undergraduate at Cambridge, Hinton was enthusiastic about neural nets, which were software constructs that took their inspiration from the way networks of neurons in the brain were thought to work. At the time, neural nets were out of favor. “Everybody thought they were crazy,” he recounts. But Hinton soldiered on.

Neural nets offered the prospect of computers’ learning the way children do—from experience—rather than through laborious instruction by programs tailor-made by humans. “Most of AI was inspired by logic back then,” he recalls. “But logic is something people do very late in life. Kids of 2 and 3 aren’t doing logic. So it seemed to me that neural nets were a much better paradigm for how intelligence would work than logic was.” (Logic, as it happens, is one of the Hinton family trades. He comes from a long line of eminent scientists and is the great-great-grandson of 19th-century mathematician George Boole, after whom Boolean searches, logic, and algebra are named.)

During the 1950s and ’60s, neural networks were in vogue among computer scientists. In 1958, Cornell research psychologist Frank Rosenblatt, in a Navy-backed project, built a prototype neural net, which he called the Perceptron, at a lab in Buffalo. It used a punch-card computer that filled an entire room. After 50 trials it learned to distinguish between cards marked on the left and cards marked on the right. Reporting on the event, the New York Times wrote, “The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

The Perceptron, whose software had only one layer of neuron-like nodes, proved limited. But researchers believed that more could be accomplished with multilayer—or deep—neural networks.

Hinton explains the basic idea this way. Suppose a neural net is interpreting photographic images, some of which show birds. “So the input would come in, say, pixels, and then the first layer of units would detect little edges. Dark one side, bright the other side.” The next level of neurons, analyzing data sent from the first layer, would learn to detect “things like corners, where two edges join at an angle,” he says. One of these neurons might respond strongly to the angle of a bird’s beak, for instance.

The next level “might find more complicated configurations, like a bunch of edges arranged in a circle.” A neuron at this level might respond to the head of the bird. At a still higher level a neuron might detect the recurring juxtaposition of beaklike angles near headlike circles. “And that’s a pretty good cue that it might be the head of a bird,” says Hinton. The neurons of each higher layer respond to concepts of greater complexity and abstraction, until one at the top level corresponds to our concept of “bird.”

To learn, however, a deep neural net needed to do more than just send messages up through the layers in this fashion. It also needed a way to see if it was getting the right results at the top layer and, if not, send messages back down so that all the lower neuron-like units could retune their activations to improve the results. That’s where the learning would occur.

In the early 1980s, Hinton was working on this problem. So was a French researcher named Yann LeCun, who was just starting his graduate work in Paris. LeCun stumbled on a 1983 paper by Hinton, which talked about multilayer neural nets. “It was not formulated in those terms,” LeCun recalls, “because it was very difficult at that time actually to publish a paper if you mentioned the word ‘neurons’ or ‘neural nets.’ So he wrote this paper in an obfuscated manner so it would pass the reviewers. But I thought the paper was super-interesting.” The two met two years later and hit it off.

In 1986, Hinton and two colleagues wrote a seminal paper offering an algorithmic solution to the error-correction problem. “His paper was basically the foundation of the second wave of neural nets,” says LeCun. It reignited interest in the field.

After a post-doc stint with Hinton, LeCun moved to AT&T’s Bell Labs in 1988, where during the next decade he did foundational work that is still being used today for most image-recognition tasks. In the 1990s, NCR, which was then a Bell Labs subsidiary, commercialized a neural-nets-powered device, widely used by banks, which could read handwritten digits on checks, according to LeCun. At the same time, two German researchers—Sepp Hochreiter, now at the University of Linz, and Jürgen Schmidhuber, codirector of a Swiss AI lab in Lugano—were independently pioneering a different type of algorithm that today, 20 years later, has become crucial for natural-language processing applications.

Despite all the strides, in the mid-1990s neural nets fell into disfavor again, eclipsed by what were, given the computational power of the times, more effective machine-learning tools. That situation persisted for almost a decade, until computing power increased another three to four orders of magnitude and researchers discovered GPU acceleration.

But one piece was still missing: data. Although the Internet was awash in it, most data—especially when it came to images—wasn’t labeled, and that’s what you needed to train neural nets. That’s where Fei-Fei Li, a Stanford AI professor, stepped in. “Our vision was that big data would change the way machine learning works,” she explains in an interview. “Data drives learning.”

In 2007 she launched ImageNet, assembling a free database of more than 14 million labeled images. It went live in 2009, and the next year she set up an annual contest to incentivize and publish computer-vision breakthroughs.

In October 2012, when two of Hinton’s students won that competition, it became clear to all that deep learning had arrived.

By then the general public had also heard about deep learning, though due to a different event. In June 2012, Google Brain published the results of a quirky project now known colloquially as the “cat experiment.” It struck a comic chord and went viral on social networks.

The project actually explored an important unsolved problem in deep learning called “unsupervised learning.” Almost every deep-learning product in commercial use today uses “supervised learning,” meaning that the neural net is trained with labeled data (like the images assembled by ImageNet). With “unsupervised learning,” by contrast, a neural net is shown unlabeled data and asked simply to look for recurring patterns. Researchers would love to master unsupervised learning one day because then machines could teach themselves about the world from vast stores of data that are unusable today—making sense of the world almost totally on their own, like infants.

In the cat experiment, researchers exposed a vast neural net—spread across 1,000 computers—to 10 million unlabeled images randomly taken from YouTube videos, and then just let the software do its thing. When the dust cleared, they checked the neurons of the highest layer and found, sure enough, that one of them responded powerfully to images of cats. “We also found a neuron that responded very strongly to human faces,” says Ng, who led the project while at Google Brain.

Yet the results were puzzling too. “We did not find a neuron that responded strongly to cars,” for instance, and “there were a lot of other neurons we couldn’t assign an English word to. So it’s difficult.”

The experiment created a sensation. But unsupervised learning remains uncracked—a challenge for the future.

Not surprisingly, most of the deep-learning applications that have been commercially deployed so far involve companies like Google, Microsoft, Facebook, Baidu, and Amazon—the companies with the vast stores of data needed for deep-learning computations. Many companies are trying to develop more realistic and helpful “chatbots”—automated customer-service representatives.

FOUR TECH GIANTS GET SERIOUS ABOUT DEEP LEARNING

  • GOOGLE: Google launched the deep-learning-focused Google Brain project in 2011, introduced neural nets into its speech-recognition products in mid-2012, and retained neural nets pioneer Geoffrey Hinton in March 2013. It now has more than 1,000 deep-learning projects underway, it says, extending across search, Android, Gmail, photo, maps, translate, YouTube, and self-driving cars. In 2014 it bought DeepMind, whose deep reinforcement learning project, AlphaGo, defeated the world’s go champion, Lee Sedol, in March, achieving an artificial intelligence landmark.
  • MICROSOFT: Microsoft introduced deep learning into its commercial speech-recognition products, including Bing voice search and X-Box voice commands, during the first half of 2011. The company now uses neural nets for its search rankings, photo search, translation systems, and more. “It’s hard to convey the pervasive impact this has had,” says Lee. Last year it won the key image-recognition contest, and in September it scored a record low error rate on a speech-recognition benchmark: 6.3%.
  • FACEBOOK: In December 2013, Facebook hired French neural nets innovator Yann LeCun to direct its new AI research lab. Facebook uses neural nets to translate about 2 billion user posts per day in more than 40 languages, and says its translations are seen by 800 million users a day. (About half its community does not speak English.) Facebook also uses neural nets for photo search and photo organization, and it’s working on a feature that would generate spoken captions for untagged photos that could be used by the visually impaired.
  • BAIDU: In May 2014, Baidu hired Andrew Ng, who had earlier helped launch and lead the Google Brain project, to lead its research lab. China’s leading search and web services site, Baidu uses neural nets for speech recognition, translation, photo search, and a self-driving car project, among others. Speech recognition is key in China, a mobile-first society whose main language, Mandarin, is difficult to type into a device. The number of customers interfacing by speech has tripled in the past 18 months, Baidu says.

Companies like IBM and Microsoft are also helping business customers adapt deep-learning-powered applications (like speech-recognition interfaces and translation services) for their own businesses, while cloud services like Amazon Web Services provide cheap, GPU-driven deep-learning computation services for those who want to develop their own software. Plentiful open-source software, like Caffe, Google’s TensorFlow, and Amazon’s DSSTNE, has greased the innovation process, as has an open-publication ethic, whereby many researchers publish their results immediately on one database without awaiting peer-review approval.

Many of the most exciting new attempts to apply deep learning are in the medical realm (see sidebar). We already know that neural nets work well for image recognition, observes Vijay Pande, a Stanford professor who heads Andreessen Horowitz’s biological investments unit, and “so much of what doctors do is image recognition, whether we’re talking about radiology, dermatology, ophthalmology, or so many other ‘-ologies.’ ”

 

DEEP LEARNING AND MEDICINE

Startup Enlitic uses deep learning to analyze radiographs and CT and MRI scans. CEO Igor Barani, formerly a professor of radiation oncology at the University of California, San Francisco, says Enlitic’s algorithms outperformed four radiologists in detecting and classifying lung nodules as benign or malignant. (The work has not been peer reviewed, and the technology has not yet obtained FDA approval.)

Merck is trying to use deep learning to accelerate drug discovery, as is a San Francisco startup called Atomwise. Neural networks examine 3D images—thousands of molecules that might serve as drug candidates—and predict their suitability for blocking the mechanism of a pathogen. Such companies are using neural nets to try to improve what humans already do; others are trying to do things humans can’t do at all. Gabriel Otte, 27, who has a Ph.D. in computational biology, started Freenome, which aims to diagnose cancer from blood samples. It examines DNA fragments in the bloodstream that are spewed out by cells as they die. Using deep learning, he asks computers to find correlations between cell-free DNA and some cancers. “We’re seeing novel signatures that haven’t even been characterized by cancer biologists yet,” says Otte.

When Andreessen Horowitz was mulling an investment in Freenome, AH’s Pande sent Otte five blind samples—two normal and three cancerous. Otte got all five right, says Pande, whose firm decided to invest.

 

While a radiologist might see thousands of images in his life, a computer can be shown millions. “It’s not crazy to imagine that this image problem could be solved better by computers,” Pande says, “just because they can plow through so much more data than a human could ever do.”

The potential advantages are not just greater accuracy and faster analysis, but democratization of services. As the technology becomes standard, eventually every patient will benefit.

The greatest impacts of deep learning may well be felt when it is integrated into the whole toolbox of other artificial intelligence techniques in ways that haven’t been thought of yet. Google’s DeepMind, for instance, has already been accomplishing startling things by combining deep learning with a related technique called reinforcement learning. Using the two, it created AlphaGo, the system that, this past March, defeated the champion player of the ancient Chinese game of go—widely considered a landmark AI achievement. Unlike IBM’s Deep Blue, which defeated chess champion Garry Kasparov in 1997, AlphaGo was not programmed with decision trees, or equations on how to evaluate board positions, or with if-then rules. “AlphaGo learned how to play go essentially from self-play and from observing big professional games,” says Demis Hassabis, DeepMind’s CEO. (During training, AlphaGo played a million go games against itself.)

A game might seem like an artificial setting. But Hassabis thinks the same techniques can be applied to real-world problems. In July, in fact, Google reported that, by using approaches similar to those used by AlphaGo, DeepMind was able to increase the energy efficiency of Google’s data centers by 15%. “In the data centers there are maybe 120 different variables,” says Hassabis. “You can change the fans, open the windows, alter the computer systems, where the power goes. You’ve got data from the sensors, the temperature gauges, and all that. It’s like the go board. Through trial and error, you learn what the right moves are.

“So it’s great,” he continues. “You could save, say, tens of millions of dollars a year, and it’s also great for the environment. Data centers use a lot of power around the world. We’d like to roll it out on a bigger scale now. Even the national grid level.”

Chatbots are all well and good. But that would be a cool app.
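The trial-and-error idea Hassabis describes can be illustrated with a toy sketch. This is not DeepMind's system (the article does not describe its implementation); it is a plain epsilon-greedy bandit, one of the simplest reinforcement learning setups. The three "cooling settings", their savings values, and the function name `epsilon_greedy` are all hypothetical, chosen only for illustration.

```python
import random

def epsilon_greedy(true_means, steps=2000, epsilon=0.1, noise=0.05, seed=0):
    """Learn by trial and error which action gives the highest average reward.

    true_means: hidden average reward of each action (hypothetical values).
    Returns the learned reward estimates and how often each action was tried.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # running average reward observed per action
    counts = [0] * n        # number of times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(n)
        else:
            # Exploit: otherwise pick the action that looks best so far.
            action = max(range(n), key=lambda a: estimates[a])
        # Observe a noisy reward, as a sensor reading would be.
        reward = rng.gauss(true_means[action], noise)
        counts[action] += 1
        # Incremental update of the running average for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical average energy savings for three made-up cooling settings.
savings = [0.2, 0.5, 0.9]
estimates, counts = epsilon_greedy(savings)
best = max(range(len(savings)), key=lambda a: estimates[a])
print("best setting:", best, "tried", counts[best], "times")
```

The learner is never told which setting is best. It discovers that by trying settings, observing noisy feedback, and gradually favoring the one with the highest observed average. A real data center has far more variables and interacting effects than this sketch captures.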

A version of this article appears in the October 1, 2016 issue of Fortune with the headline “The Deep-Learning Revolution.” This version contains updated figures from the CB Insights research firm.

Why is Deep Learning helpful? Or even a game-changer?

Source: from slideshare.net: “Deep Learning Cases: Text and Image Processing” (Apr 3, 2016 by Grigory Sapunov)

in category: Machine Learning vs Deep Learning

Note: slides 58 to 65 introduce various libraries and frameworks, as well as other resources, with links.