(see this post for a deeper intro to bias and variances and talk about how it interacts with and is affected by the regularization of your learning algorithm.)
If you run the learning algorithm and it doesn’t do as well as you are hoping, almost all the time it will be because you have either a high bias problem or a high variance problem. In other words they’re either an underfitting problem or an overfitting problem.
And in this case it’s very important to figure out which of these two problems is bias or variance or a bit of both that you actually have. Because knowing which of these two things is happening would give a very strong indicator for whether the useful and promising ways to try to improve your algorithm.
How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.
BY GIDEON LEWIS-KRAUS
Prologue: You Are What You Have Read
Late one Friday night in early November, Jun Rekimoto, a distinguished professor of human-computer interaction at the University of Tokyo, was online preparing for a lecture when he began to notice some peculiar posts rolling in on social media. Apparently Google Translate, the company’s popular machine-translation service, had suddenly and almost immeasurably improved. Rekimoto visited Translate himself and began to experiment with it. He was astonished. He had to go to sleep, but Translate refused to relax its grip on his imagination.
Rekimoto wrote up his initial findings in a blog post. First, he compared a few sentences from two published versions of “The Great Gatsby,” Takashi Nozaki’s 1957 translation and Haruki Murakami’s more recent iteration, with what this new Google Translate was able to produce. Murakami’s translation is written “in very polished Japanese,” Rekimoto explained to me later via email, but the prose is distinctively “Murakami-style.” By contrast, Google’s translation — despite some “small unnaturalness” — reads to him as “more transparent.”
The second half of Rekimoto’s post examined the service in the other direction, from Japanese to English. He dashed off his own Japanese interpretation of the opening to Hemingway’s “The Snows of Kilimanjaro,” then ran that passage back through Google into English. He published this version alongside Hemingway’s original, and proceeded to invite his readers to guess which was the work of a machine.
Kilimanjaro is a snow-covered mountain 19,710 feet high, and is said to be the highest mountain in Africa. Its western summit is called the Masai “Ngaje Ngai,” the House of God. Close to the western summit there is the dried and frozen carcass of a leopard. No one has explained what the leopard was seeking at that altitude.
Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” in Masai, the house of God. Near the top of the west there is a dry and frozen dead body of leopard. No one has ever explained what leopard wanted at that altitude.
Even to a native English speaker, the missing article on the leopard is the only real giveaway that No. 2 was the output of an automaton. Their closeness was a source of wonder to Rekimoto, who was well acquainted with the capabilities of the previous service. Only 24 hours earlier, Google would have translated the same Japanese passage as follows:
Kilimanjaro is 19,710 feet of the mountain covered with snow, and it is said that the highest mountain in Africa. Top of the west, “Ngaje Ngai” in the Maasai language, has been referred to as the house of God. The top close to the west, there is a dry, frozen carcass of a leopard. Whether the leopard had what the demand at that altitude, there is no that nobody explained.
Rekimoto promoted his discovery to his hundred thousand or so followers on Twitter, and over the next few hours thousands of people broadcast their own experiments with the machine-translation service. Some were successful, others meant mostly for comic effect. As dawn broke over Tokyo, Google Translate was the No. 1 trend on Japanese Twitter, just above some cult anime series and the long-awaited new single from a girl-idol supergroup. Everybody wondered: How had Google Translate become so uncannily artful?
Four days later, a couple of hundred journalists, entrepreneurs and advertisers from all over the world gathered in Google’s London engineering office for a special announcement. Guests were greeted with Translate-branded fortune cookies. Their paper slips had a foreign phrase on one side — mine was in Norwegian — and on the other, an invitation to download the Translate app. Tables were set with trays of doughnuts and smoothies, each labeled with a placard that advertised its flavor in German (zitrone), Portuguese (baunilha) or Spanish (manzana). After a while, everyone was ushered into a plush, dark theater.
Sadiq Khan, the mayor of London, stood to make a few opening remarks. A friend, he began, had recently told him he reminded him of Google. “Why, because I know all the answers?” the mayor asked. “No,” the friend replied, “because you’re always trying to finish my sentences.” The crowd tittered politely. Khan concluded by introducing Google’s chief executive, Sundar Pichai, who took the stage.
Pichai was in London in part to inaugurate Google’s new building there, the cornerstone of a new “knowledge quarter” under construction at King’s Cross, and in part to unveil the completion of the initial phase of a company transformation he announced last year. The Google of the future, Pichai had said on several occasions, was going to be “A.I. first.” What that meant in theory was complicated and had welcomed much speculation. What it meant in practice, with any luck, was that soon the company’s products would no longer represent the fruits of traditional computer programming, exactly, but “machine learning.”
A rarefied department within the company, Google Brain, was founded five years ago on this very principle: that artificial “neural networks” that acquaint themselves with the world via trial and error, as toddlers do, might in turn develop something like human flexibility. This notion is not new — a version of it dates to the earliest stages of modern computing, in the 1940s — but for much of its history most computer scientists saw it as vaguely disreputable, even mystical. Since 2011, though, Google Brain has demonstrated that this approach to artificial intelligence could solve many problems that confounded decades of conventional efforts. Speech recognition didn’t work very well until Brain undertook an effort to revamp it; the application of machine learning made its performance on Google’s mobile platform, Android, almost as good as human transcription. The same was true of image recognition. Less than a year ago, Brain for the first time commenced with the gut renovation of an entire consumer product, and its momentous results were being celebrated tonight.
Translate made its debut in 2006 and since then has become one of Google’s most reliable and popular assets; it serves more than 500 million monthly users in need of 140 billion words per day in a different language. It exists not only as its own stand-alone app but also as an integrated feature within Gmail, Chrome and many other Google offerings, where we take it as a push-button given — a frictionless, natural part of our digital commerce. It was only with the refugee crisis, Pichai explained from the lectern, that the company came to reckon with Translate’s geopolitical importance: On the screen behind him appeared a graph whose steep curve indicated a recent fivefold increase in translations between Arabic and German. (It was also close to Pichai’s own heart. He grew up in India, a land divided by dozens of languages.) The team had been steadily adding new languages and features, but gains in quality over the last four years had slowed considerably.
Until today. As of the previous weekend, Translate had been converted to an A.I.-based system for much of its traffic, not just in the United States but in Europe and Asia as well: The rollout included translations between English and Spanish, French, Portuguese, German, Chinese, Japanese, Korean and Turkish. The rest of Translate’s hundred-odd languages were to come, with the aim of eight per month, by the end of next year. The new incarnation, to the pleasant surprise of Google’s own engineers, had been completed in only nine months. The A.I. system had demonstrated overnight improvements roughly equal to the total gains the old one had accrued over its entire lifetime.
Pichai has an affection for the obscure literary reference; he told me a month earlier, in his office in Mountain View, Calif., that Translate in part exists because not everyone can be like the physicist Robert Oppenheimer, who learned Sanskrit to read the Bhagavad Gita in the original. In London, the slide on the monitors behind him flicked to a Borges quote: “Uno no es lo que es por lo que escribe, sino por lo que ha leído.”
Grinning, Pichai read aloud an awkward English version of the sentence that had been rendered by the old Translate system: “One is not what is for what he writes, but for what he has read.”
To the right of that was a new A.I.-rendered version: “You are not what you write, but what you have read.”
It was a fitting remark: The new Google Translate was run on the first machines that had, in a sense, ever learned to read anything at all.
Google’s decision to reorganize itself around A.I. was the first major manifestation of what has become an industrywide machine-learning delirium. Over the past four years, six companies in particular — Google, Facebook, Apple, Amazon, Microsoft and the Chinese firm Baidu — have touched off an arms race for A.I. talent, particularly within universities. Corporate promises of resources and freedom have thinned out top academic departments. It has become widely known in Silicon Valley that Mark Zuckerberg, chief executive of Facebook, personally oversees, with phone calls and video-chat blandishments, his company’s overtures to the most desirable graduate students. Starting salaries of seven figures are not unheard-of. Attendance at the field’s most important academic conference has nearly quadrupled. What is at stake is not just one more piecemeal innovation but control over what very well could represent an entirely new computational platform: pervasive, ambient artificial intelligence.
The phrase “artificial intelligence” is invoked as if its meaning were self-evident, but it has always been a source of confusion and controversy. Imagine if you went back to the 1970s, stopped someone on the street, pulled out a smartphone and showed her Google Maps. Once you managed to convince her you weren’t some oddly dressed wizard, and that what you withdrew from your pocket wasn’t a black-arts amulet but merely a tiny computer more powerful than that onboard the Apollo shuttle, Google Maps would almost certainly seem to her a persuasive example of “artificial intelligence.” In a very real sense, it is. It can do things any map-literate human can manage, like get you from your hotel to the airport — though it can do so much more quickly and reliably. It can also do things that humans simply and obviously cannot: It can evaluate the traffic, plan the best route and reorient itself when you take the wrong exit.
Practically nobody today, however, would bestow upon Google Maps the honorific “A.I.,” so sentimental and sparing are we in our use of the word “intelligence.” Artificial intelligence, we believe, must be something that distinguishes HAL from whatever it is a loom or wheelbarrow can do. The minute we can automate a task, we downgrade the relevant skill involved to one of mere mechanism. Today Google Maps seems, in the pejorative sense of the term, robotic: It simply accepts an explicit demand (the need to get from one place to another) and tries to satisfy that demand as efficiently as possible. The goal posts for “artificial intelligence” are thus constantly receding.
When he has an opportunity to make careful distinctions, Pichai differentiates between the current applications of A.I. and the ultimate goal of “artificial general intelligence.” Artificial general intelligence will not involve dutiful adherence to explicit instructions, but instead will demonstrate a facility with the implicit, the interpretive. It will be a general tool, designed for general purposes in a general context. Pichai believes his company’s future depends on something like this. Imagine if you could tell Google Maps, “I’d like to go to the airport, but I need to stop off on the way to buy a present for my nephew.” A more generally intelligent version of that service — a ubiquitous assistant, of the sort that Scarlett Johansson memorably disembodied three years ago in the Spike Jonze film “Her”— would know all sorts of things that, say, a close friend or an earnest intern might know: your nephew’s age, and how much you ordinarily like to spend on gifts for children, and where to find an open store. But a truly intelligent Maps could also conceivably know all sorts of things a close friend wouldn’t, like what has only recently come into fashion among preschoolers in your nephew’s school — or more important, what its users actually want. If an intelligent machine were able to discern some intricate if murky regularity in data about what we have done in the past, it might be able to extrapolate about our subsequent desires, even if we don’t entirely know them ourselves.
The new wave of A.I.-enhanced assistants — Apple’s Siri, Facebook’s M, Amazon’s Echo — are all creatures of machine learning, built with similar intentions. The corporate dreams for machine learning, however, aren’t exhausted by the goal of consumer clairvoyance. A medical-imaging subsidiary of Samsung announced this year that its new ultrasound devices could detect breast cancer. Management consultants are falling all over themselves to prep executives for the widening industrial applications of computers that program themselves. DeepMind, a 2014 Google acquisition, defeated the reigning human grandmaster of the ancient board game Go, despite predictions that such an achievement would take another 10 years.
In a famous 1950 essay, Alan Turing proposed a test for an artificial general intelligence: a computer that could, over the course of five minutes of text exchange, successfully deceive a real human interlocutor. Once a machine can translate fluently between two natural languages, the foundation has been laid for a machine that might one day “understand” human language well enough to engage in plausible conversation. Google Brain’s members, who pushed and helped oversee the Translate project, believe that such a machine would be on its way to serving as a generally intelligent all-encompassing personal digital assistant.
What follows here is the story of how a team of Google researchers and engineers — at first one or two, then three or four, and finally more than a hundred — made considerable progress in that direction. It’s an uncommon story in many ways, not least of all because it defies many of the Silicon Valley stereotypes we’ve grown accustomed to. It does not feature people who think that everything will be unrecognizably different tomorrow or the next day because of some restless tinkerer in his garage. It is neither a story about people who think technology will solve all our problems nor one about people who think technology is ineluctably bound to create apocalyptic new ones. It is not about disruption, at least not in the way that word tends to be used.
It is, in fact, three overlapping stories that converge in Google Translate’s successful metamorphosis to A.I. — a technical story, an institutional story and a story about the evolution of ideas. The technical story is about one team on one product at one company, and the process by which they refined, tested and introduced a brand-new version of an old product in only about a quarter of the time anyone, themselves included, might reasonably have expected. The institutional story is about the employees of a small but influential artificial-intelligence group within that company, and the process by which their intuitive faith in some old, unproven and broadly unpalatable notions about computing upended every other company within a large radius. The story of ideas is about the cognitive scientists, psychologists and wayward engineers who long toiled in obscurity, and the process by which their ostensibly irrational convictions ultimately inspired a paradigm shift in our understanding not only of technology but also, in theory, of consciousness itself.
The first story, the story of Google Translate, takes place in Mountain View over nine months, and it explains the transformation of machine translation. The second story, the story of Google Brain and its many competitors, takes place in Silicon Valley over five years, and it explains the transformation of that entire community. The third story, the story of deep learning, takes place in a variety of far-flung laboratories — in Scotland, Switzerland, Japan and most of all Canada — over seven decades, and it might very well contribute to the revision of our self-image as first and foremost beings who think.
All three are stories about artificial intelligence. The seven-decade story is about what we might conceivably expect or want from it. The five-year story is about what it might do in the near future. The nine-month story is about what it can do right this minute. These three stories are themselves just proof of concept. All of this is only the beginning.
Part I: Learning Machine
1. The Birth of Brain
Jeff Dean, though his title is senior fellow, is the de facto head of Google Brain. Dean is a sinewy, energy-efficient man with a long, narrow face, deep-set eyes and an earnest, soapbox-derby sort of enthusiasm. The son of a medical anthropologist and a public-health epidemiologist, Dean grew up all over the world — Minnesota, Hawaii, Boston, Arkansas, Geneva, Uganda, Somalia, Atlanta — and, while in high school and college, wrote software used by the World Health Organization. He has been with Google since 1999, as employee 25ish, and has had a hand in the core software systems beneath nearly every significant undertaking since then. A beloved artifact of company culture is Jeff Dean Facts, written in the style of the Chuck Norris Facts meme: “Jeff Dean’s PIN is the last four digits of pi.” “When Alexander Graham Bell invented the telephone, he saw a missed call from Jeff Dean.” “Jeff Dean got promoted to Level 11 in a system where the maximum level is 10.” (This last one is, in fact, true.)
One day in early 2011, Dean walked into one of the Google campus’s “microkitchens” — the “Googley” word for the shared break spaces on most floors of the Mountain View complex’s buildings — and ran into Andrew Ng, a young Stanford computer-science professor who was working for the company as a consultant. Ng told him about Project Marvin, an internal effort (named after the celebrated A.I. pioneer Marvin Minsky) he had recently helped establish to experiment with “neural networks,” pliant digital lattices based loosely on the architecture of the brain. Dean himself had worked on a primitive version of the technology as an undergraduate at the University of Minnesota in 1990, during one of the method’s brief windows of mainstream acceptability. Now, over the previous five years, the number of academics working on neural networks had begun to grow again, from a handful to a few dozen. Ng told Dean that Project Marvin, which was being underwritten by Google’s secretive X lab, had already achieved some promising results.
Dean was intrigued enough to lend his “20 percent” — the portion of work hours every Google employee is expected to contribute to programs outside his or her core job — to the project. Pretty soon, he suggested to Ng that they bring in another colleague with a neuroscience background, Greg Corrado. (In graduate school, Corrado was taught briefly about the technology, but strictly as a historical curiosity. “It was good I was paying attention in class that day,” he joked to me.) In late spring they brought in one of Ng’s best graduate students, Quoc Le, as the project’s first intern. By then, a number of the Google engineers had taken to referring to Project Marvin by another name: Google Brain.
Since the term “artificial intelligence” was first coined, at a kind of constitutional convention of the mind at Dartmouth in the summer of 1956, a majority of researchers have long thought the best approach to creating A.I. would be to write a very big, comprehensive program that laid out both the rules of logical reasoning and sufficient knowledge of the world. If you wanted to translate from English to Japanese, for example, you would program into the computer all of the grammatical rules of English, and then the entirety of definitions contained in the Oxford English Dictionary, and then all of the grammatical rules of Japanese, as well as all of the words in the Japanese dictionary, and only after all of that feed it a sentence in a source language and ask it to tabulate a corresponding sentence in the target language. You would give the machine a language map that was, as Borges would have had it, the size of the territory. This perspective is usually called “symbolic A.I.” — because its definition of cognition is based on symbolic logic — or, disparagingly, “good old-fashioned A.I.”
There are two main problems with the old-fashioned approach. The first is that it’s awfully time-consuming on the human end. The second is that it only really works in domains where rules and definitions are very clear: in mathematics, for example, or chess. Translation, however, is an example of a field where this approach fails horribly, because words cannot be reduced to their dictionary definitions, and because languages tend to have as many exceptions as they have rules. More often than not, a system like this is liable to translate “minister of agriculture” as “priest of farming.” Still, for math and chess it worked great, and the proponents of symbolic A.I. took it for granted that no activities signaled “general intelligence” better than math and chess.
There were, however, limits to what this system could do. In the 1980s, a robotics researcher at Carnegie Mellon pointed out that it was easy to get computers to do adult things but nearly impossible to get them to do things a 1-year-old could do, like hold a ball or identify a cat. By the 1990s, despite punishing advancements in computer chess, we still weren’t remotely close to artificial general intelligence.
There has always been another vision for A.I. — a dissenting view — in which the computers would learn from the ground up (from data) rather than from the top down (from rules). This notion dates to the early 1940s, when it occurred to researchers that the best model for flexible automated intelligence was the brain itself. A brain, after all, is just a bunch of widgets, called neurons, that either pass along an electrical charge to their neighbors or don’t. What’s important are less the individual neurons themselves than the manifold connections among them. This structure, in its simplicity, has afforded the brain a wealth of adaptive advantages. The brain can operate in circumstances in which information is poor or missing; it can withstand significant damage without total loss of control; it can store a huge amount of knowledge in a very efficient way; it can isolate distinct patterns but retain the messiness necessary to handle ambiguity.
There was no reason you couldn’t try to mimic this structure in electronic form, and in 1943 it was shown that arrangements of simple artificial neurons could carry out basic logical functions. They could also, at least in theory, learn the way we do. With life experience, depending on a particular person’s trials and errors, the synaptic connections among pairs of neurons get stronger or weaker. An artificial neural network could do something similar, by gradually altering, on a guided trial-and-error basis, the numerical relationships among artificial neurons. It wouldn’t need to be preprogrammed with fixed rules. It would, instead, rewire itself to reflect patterns in the data it absorbed.
This attitude toward artificial intelligence was evolutionary rather than creationist. If you wanted a flexible mechanism, you wanted one that could adapt to its environment. If you wanted something that could adapt, you didn’t want to begin with the indoctrination of the rules of chess. You wanted to begin with very basic abilities — sensory perception and motor control — in the hope that advanced skills would emerge organically. Humans don’t learn to understand language by memorizing dictionaries and grammar books, so why should we possibly expect our computers to do so?
Google Brain was the first major commercial institution to invest in the possibilities embodied by this way of thinking about A.I. Dean, Corrado and Ng began their work as a part-time, collaborative experiment, but they made immediate progress. They took architectural inspiration for their models from recent theoretical outlines — as well as ideas that had been on the shelf since the 1980s and 1990s — and drew upon both the company’s peerless reserves of data and its massive computing infrastructure. They instructed the networks on enormous banks of “labeled” data — speech files with correct transcriptions, for example — and the computers improved their responses to better match reality.
“The portion of evolution in which animals developed eyes was a big development,” Dean told me one day, with customary understatement. We were sitting, as usual, in a whiteboarded meeting room, on which he had drawn a crowded, snaking timeline of Google Brain and its relation to inflection points in the recent history of neural networks. “Now computers have eyes. We can build them around the capabilities that now exist to understand photos. Robots will be drastically transformed. They’ll be able to operate in an unknown environment, on much different problems.” These capacities they were building may have seemed primitive, but their implications were profound.
2. The Unlikely Intern
In its first year or so of existence, Brain’s experiments in the development of a machine with the talents of a 1-year-old had, as Dean said, worked to great effect. Its speech-recognition team swapped out part of their old system for a neural network and encountered, in pretty much one fell swoop, the best quality improvements anyone had seen in 20 years. Their system’s object-recognition abilities improved by an order of magnitude. This was not because Brain’s personnel had generated a sheaf of outrageous new ideas in just a year. It was because Google had finally devoted the resources — in computers and, increasingly, personnel — to fill in outlines that had been around for a long time.
A great preponderance of these extant and neglected notions had been proposed or refined by a peripatetic English polymath named Geoffrey Hinton. In the second year of Brain’s existence, Hinton was recruited to Brain as Andrew Ng left. (Ng now leads the 1,300-person A.I. team at Baidu.) Hinton wanted to leave his post at the University of Toronto for only three months, so for arcane contractual reasons he had to be hired as an intern. At intern training, the orientation leader would say something like, “Type in your LDAP” — a user login — and he would flag a helper to ask, “What’s an LDAP?” All the smart 25-year-olds in attendance, who had only ever known deep learning as the sine qua non of artificial intelligence, snickered: “Who is that old guy? Why doesn’t he get it?”
“At lunchtime,” Hinton said, “someone in the queue yelled: ‘Professor Hinton! I took your course! What are you doing here?’ After that, it was all right.”
A few months later, Hinton and two of his students demonstrated truly astonishing gains in a big image-recognition contest, run by an open-source collective called ImageNet, that asks computers not only to identify a monkey but also to distinguish between spider monkeys and howler monkeys, and among God knows how many different breeds of cat. Google soon approached Hinton and his students with an offer. They accepted. “I thought they were interested in our I.P.,” he said. “Turns out they were interested in us.”
Hinton comes from one of those old British families emblazoned like the Darwins at eccentric angles across the intellectual landscape, where regardless of titular preoccupation a person is expected to make sideline contributions to minor problems in astronomy or fluid dynamics. His great-great-grandfather was George Boole, whose foundational work in symbolic logic underpins the computer; another great-great-grandfather was a celebrated surgeon, his father a venturesome entomologist, his father’s cousin a Los Alamos researcher; the list goes on. He trained at Cambridge and Edinburgh, then taught at Carnegie Mellon before he ended up at Toronto, where he still spends half his time. (His work has long been supported by the largess of the Canadian government.) I visited him in his office at Google there. He has tousled yellowed-pewter hair combed forward in a mature Noel Gallagher style and wore a baggy striped dress shirt that persisted in coming untucked, and oval eyeglasses that slid down to the tip of a prominent nose. He speaks with a driving if shambolic wit, and says things like, “Computers will understand sarcasm before Americans do.”
Hinton had been working on neural networks since his undergraduate days at Cambridge in the late 1960s, and he is seen as the intellectual primogenitor of the contemporary field. For most of that time, whenever he spoke about machine learning, people looked at him as though he were talking about the Ptolemaic spheres or bloodletting by leeches. Neural networks were taken as a disproven folly, largely on the basis of one overhyped project: the Perceptron, an artificial neural network that Frank Rosenblatt, a Cornell psychologist, developed in the late 1950s. The New York Times reported that the machine’s sponsor, the United States Navy, expected it would “be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” It went on to do approximately none of those things. Marvin Minsky, the dean of artificial intelligence in America, had worked on neural networks for his 1954 Princeton thesis, but he’d since grown tired of the inflated claims that Rosenblatt — who was a contemporary at Bronx Science — made for the neural paradigm. (He was also competing for Defense Department funding.) Along with an M.I.T. colleague, Minsky published a book that proved that there were painfully simple problems the Perceptron could never solve.
Minsky’s criticism of the Perceptron extended only to networks of one “layer,” i.e., one layer of artificial neurons between what’s fed to the machine and what you expect from it — and later in life, he expounded ideas very similar to contemporary deep learning. But Hinton already knew at the time that complex tasks could be carried out if you had recourse to multiple layers. The simplest description of a neural network is that it’s a machine that makes classifications or predictions based on its ability to discover patterns in data. With one layer, you could find only simple patterns; with more than one, you could look for patterns of patterns. Take the case of image recognition, which tends to rely on a contraption called a “convolutional neural net.” (These were elaborated in a seminal 1998 paper whose lead author, a Frenchman named Yann LeCun, did his postdoctoral research in Toronto under Hinton and now directs a huge A.I. endeavor at Facebook.) The first layer of the network learns to identify the very basic visual trope of an “edge,” meaning a nothing (an off-pixel) followed by a something (an on-pixel) or vice versa. Each successive layer of the network looks for a pattern in the previous layer. A pattern of edges might be a circle or a rectangle. A pattern of circles or rectangles might be a face. And so on. This more or less parallels the way information is put together in increasingly abstract ways as it travels from the photoreceptors in the retina back and up through the visual cortex. At each conceptual step, detail that isn’t immediately relevant is thrown away. If several edges and circles come together to make a face, you don’t care exactly where the face is found in the visual field; you just care that it’s a face.
The issue with multilayered, “deep” neural networks was that the trial-and-error part got extraordinarily complicated. In a single layer, it’s easy. Imagine that you’re playing with a child. You tell the child, “Pick up the green ball and put it into Box A.” The child picks up a green ball and puts it into Box B. You say, “Try again to put the green ball in Box A.” The child tries Box A. Bravo.
Now imagine you tell the child, “Pick up a green ball, go through the door marked 3 and put the green ball into Box A.” The child takes a red ball, goes through the door marked 2 and puts the red ball into Box B. How do you begin to correct the child? You cannot just repeat your initial instructions, because the child does not know at which point he went wrong. In real life, you might start by holding up the red ball and the green ball and saying, “Red ball, green ball.” The whole point of machine learning, however, is to avoid that kind of explicit mentoring. Hinton and a few others went on to invent a solution (or rather, reinvent an older one) to this layered-error problem, over the halting course of the late 1970s and 1980s, and interest among computer scientists in neural networks was briefly revived. “People got very excited about it,” he said. “But we oversold it.” Computer scientists quickly went back to thinking that people like Hinton were weirdos and mystics.
These ideas remained popular, however, among philosophers and psychologists, who called it “connectionism” or “parallel distributed processing.” “This idea,” Hinton told me, “of a few people keeping a torch burning, it’s a nice myth. It was true within artificial intelligence. But within psychology lots of people believed in the approach but just couldn’t do it.” Neither could Hinton, despite the generosity of the Canadian government. “There just wasn’t enough computer power or enough data. People on our side kept saying, ‘Yeah, but if I had a really big one, it would work.’ It wasn’t a very persuasive argument.”
3. A Deep Explanation of Deep Learning
When Pichai said that Google would henceforth be “A.I. first,” he was not just making a claim about his company’s business strategy; he was throwing in his company’s lot with this long-unworkable idea. Pichai’s allocation of resources ensured that people like Dean could ensure that people like Hinton would have, at long last, enough computers and enough data to make a persuasive argument. An average brain has something on the order of 100 billion neurons. Each neuron is connected to up to 10,000 other neurons, which means that the number of synapses is between 100 trillion and 1,000 trillion. For a simple artificial neural network of the sort proposed in the 1940s, the attempt to even try to replicate this was unimaginable. We’re still far from the construction of a network of that size, but Google Brain’s investment allowed for the creation of artificial neural networks comparable to the brains of mice.
To understand why scale is so important, however, you have to start to understand some of the more technical details of what, exactly, machine intelligences are doing with the data they consume. A lot of our ambient fears about A.I. rest on the idea that they’re just vacuuming up knowledge like a sociopathic prodigy in a library, and that an artificial intelligence constructed to make paper clips might someday decide to treat humans like ants or lettuce. This just isn’t how they work. All they’re doing is shuffling information around in search of commonalities — basic patterns, at first, and then more complex ones — and for the moment, at least, the greatest danger is that the information we’re feeding them is biased in the first place.
If that brief explanation seems sufficiently reassuring, the reassured nontechnical reader is invited to skip forward to the next section, which is about cats. If not, then read on. (This section is also, luckily, about cats.)
Imagine you want to program a cat-recognizer on the old symbolic-A.I. model. You stay up for days preloading the machine with an exhaustive, explicit definition of “cat.” You tell it that a cat has four legs and pointy ears and whiskers and a tail, and so on. All this information is stored in a special place in memory called Cat. Now you show it a picture. First, the machine has to separate out the various distinct elements of the image. Then it has to take these elements and apply the rules stored in its memory. If(legs=4) and if(ears=pointy) and if(whiskers=yes) and if(tail=yes) and if(expression=supercilious), then(cat=yes). But what if you showed this cat-recognizer a Scottish Fold, a heart-rending breed with a prized genetic defect that leads to droopy doubled-over ears? Our symbolic A.I. gets to (ears=pointy) and shakes its head solemnly, “Not cat.” It is hyperliteral, or “brittle.” Even the thickest toddler shows much greater inferential acuity.
Now imagine that instead of hard-wiring the machine with a set of rules for classification stored in one location of the computer’s memory, you try the same thing on a neural network. There is no special place that can hold the definition of “cat.” There is just a giant blob of interconnected switches, like forks in a path. On one side of the blob, you present the inputs (the pictures); on the other side, you present the corresponding outputs (the labels). Then you just tell it to work out for itself, via the individual calibration of all of these interconnected switches, whatever path the data should take so that the inputs are mapped to the correct outputs. The training is the process by which a labyrinthine series of elaborate tunnels are excavated through the blob, tunnels that connect any given input to its proper output. The more training data you have, the greater the number and intricacy of the tunnels that can be dug. Once the training is complete, the middle of the blob has enough tunnels that it can make reliable predictions about how to handle data it has never seen before. This is called “supervised learning.”
The reason that the network requires so many neurons and so much data is that it functions, in a way, like a sort of giant machine democracy. Imagine you want to train a computer to differentiate among five different items. Your network is made up of millions and millions of neuronal “voters,” each of whom has been given five different cards: one for cat, one for dog, one for spider monkey, one for spoon and one for defibrillator. You show your electorate a photo and ask, “Is this a cat, a dog, a spider monkey, a spoon or a defibrillator?” All the neurons that voted the same way collect in groups, and the network foreman peers down from above and identifies the majority classification: “A dog?”
You say: “No, maestro, it’s a cat. Try again.”
Now the network foreman goes back to identify which voters threw their weight behind “cat” and which didn’t. The ones that got “cat” right get their votes counted double next time — at least when they’re voting for “cat.” They have to prove independently whether they’re also good at picking out dogs and defibrillators, but one thing that makes a neural network so flexible is that each individual unit can contribute differently to different desired outcomes. What’s important is not the individual vote, exactly, but the pattern of votes. If Joe, Frank and Mary all vote together, it’s a dog; but if Joe, Kate and Jessica vote together, it’s a cat; and if Kate, Jessica and Frank vote together, it’s a defibrillator. The neural network just needs to register enough of a regularly discernible signal somewhere to say, “Odds are, this particular arrangement of pixels represents something these humans keep calling ‘cats.’ ” The more “voters” you have, and the more times you make them vote, the more keenly the network can register even very weak signals. If you have only Joe, Frank and Mary, you can maybe use them only to differentiate among a cat, a dog and a defibrillator. If you have millions of different voters that can associate in billions of different ways, you can learn to classify data with incredible granularity. Your trained voter assembly will be able to look at an unlabeled picture and identify it more or less accurately.
Part of the reason there was so much resistance to these ideas in computer-science departments is that because the output is just a prediction based on patterns of patterns, it’s not going to be perfect, and the machine will never be able to define for you what, exactly, a cat is. It just knows them when it sees them. This wooliness, however, is the point. The neuronal “voters” will recognize a happy cat dozing in the sun and an angry cat glaring out from the shadows of an untidy litter box, as long as they have been exposed to millions of diverse cat scenes. You just need lots and lots of the voters — in order to make sure that some part of your network picks up on even very weak regularities, on Scottish Folds with droopy ears, for example — and enough labeled data to make sure your network has seen the widest possible variance in phenomena.
It is important to note, however, that the fact that neural networks are probabilistic in nature means that they’re not suitable for all tasks. It’s no great tragedy if they mislabel 1 percent of cats as dogs, or send you to the wrong movie on occasion, but in something like a self-driving car we all want greater assurances. This isn’t the only caveat. Supervised learning is a trial-and-error process based on labeled data. The machines might be doing the learning, but there remains a strong human element in the initial categorization of the inputs. If your data had a picture of a man and a woman in suits that someone had labeled “woman with her boss,” that relationship would be encoded into all future pattern recognition. Labeled data is thus fallible the way that human labelers are fallible. If a machine was asked to identify creditworthy candidates for loans, it might use data like felony convictions, but if felony convictions were unfair in the first place — if they were based on, say, discriminatory drug laws — then the loan recommendations would perforce also be fallible.
Image-recognition networks like our cat-identifier are only one of many varieties of deep learning, but they are disproportionately invoked as teaching examples because each layer does something at least vaguely recognizable to humans — picking out edges first, then circles, then faces. This means there’s a safeguard against error. For instance, an early oddity in Google’s image-recognition software meant that it could not always identify a barbell in isolation, even though the team had trained it on an image set that included a lot of exercise categories. A visualization tool showed them the machine had learned not the concept of “dumbbell” but the concept of “dumbbell+arm,” because all the dumbbells in the training set were attached to arms. They threw into the training mix some photos of solo barbells. The problem was solved. Not everything is so easy.
4. The Cat Paper
Over the course of its first year or two, Brain’s efforts to cultivate in machines the skills of a 1-year-old were auspicious enough that the team was graduated out of the X lab and into the broader research organization. (The head of Google X once noted that Brain had paid for the entirety of X’s costs.) They still had fewer than 10 people and only a vague sense for what might ultimately come of it all. But even then they were thinking ahead to what ought to happen next. First a human mind learns to recognize a ball and rests easily with the accomplishment for a moment, but sooner or later, it wants to ask for the ball. And then it wades into language.
The first step in that direction was the cat paper, which made Brain famous.
What the cat paper demonstrated was that a neural network with more than a billion “synaptic” connections — a hundred times larger than any publicized neural network to that point, yet still many orders of magnitude smaller than our brains — could observe raw, unlabeled data and pick out for itself a high-order human concept. The Brain researchers had shown the network millions of still frames from YouTube videos, and out of the welter of the pure sensorium the network had isolated a stable pattern any toddler or chipmunk would recognize without a moment’s hesitation as the face of a cat. The machine had not been programmed with the foreknowledge of a cat; it reached directly into the world and seized the idea for itself. (The researchers discovered this with the neural-network equivalent of something like an M.R.I., which showed them that a ghostly cat face caused the artificial neurons to “vote” with the greatest collective enthusiasm.) Most machine learning to that point had been limited by the quantities of labeled data. The cat paper showed that machines could also deal with raw unlabeled data, perhaps even data of which humans had no established foreknowledge. This seemed like a major advance not only in cat-recognition studies but also in overall artificial intelligence.
The lead author on the cat paper was Quoc Le. Le is short and willowy and soft-spoken, with a quick, enigmatic smile and shiny black penny loafers. He grew up outside Hue, Vietnam. His parents were rice farmers, and he did not have electricity at home. His mathematical abilities were obvious from an early age, and he was sent to study at a magnet school for science. In the late 1990s, while still in school, he tried to build a chatbot to talk to. He thought, How hard could this be?
“But actually,” he told me in a whispery deadpan, “it’s very hard.”
He left the rice paddies on a scholarship to a university in Canberra, Australia, where he worked on A.I. tasks like computer vision. The dominant method of the time, which involved feeding the machine definitions for things like edges, felt to him like cheating. Le didn’t know then, or knew only dimly, that there were at least a few dozen computer scientists elsewhere in the world who couldn’t help imagining, as he did, that machines could learn from scratch. In 2006, Le took a position at the Max Planck Institute for Biological Cybernetics in the medieval German university town of Tübingen. In a reading group there, he encountered two newpapers by Geoffrey Hinton. People who entered the discipline during the long diaspora all have conversion stories, and when Le read those papers, he felt the scales fall away from his eyes.
“There was a big debate,” he told me. “A very big debate.” We were in a small interior conference room, a narrow, high-ceilinged space outfitted with only a small table and two whiteboards. He looked to the curve he’d drawn on the whiteboard behind him and back again, then softly confided, “I’ve never seen such a big debate.”
He remembers standing up at the reading group and saying, “This is the future.” It was, he said, an “unpopular decision at the time.” A former adviser from Australia, with whom he had stayed close, couldn’t quite understand Le’s decision. “Why are you doing this?” he asked Le in an email.
“I didn’t have a good answer back then,” Le said. “I was just curious. There was a successful paradigm, but to be honest I was just curious about the new paradigm. In 2006, there was very little activity.” He went to join Ng at Stanford and began to pursue Hinton’s ideas. “By the end of 2010, I was pretty convinced something was going to happen.”
What happened, soon afterward, was that Le went to Brain as its first intern, where he carried on with his dissertation work — an extension of which ultimately became the cat paper. On a simple level, Le wanted to see if the computer could be trained to identify on its own the information that was absolutely essential to a given image. He fed the neural network a still he had taken from YouTube. He then told the neural network to throw away some of the information contained in the image, though he didn’t specify what it should or shouldn’t throw away. The machine threw away some of the information, initially at random. Then he said: “Just kidding! Now recreate the initial image you were shown based only on the information you retained.” It was as if he were asking the machine to find a way to “summarize” the image, and then expand back to the original from the summary. If the summary was based on irrelevant data — like the color of the sky rather than the presence of whiskers — the machine couldn’t perform a competent reconstruction. Its reaction would be akin to that of a distant ancestor whose takeaway from his brief exposure to saber-tooth tigers was that they made a restful swooshing sound when they moved. Le’s neural network, unlike that ancestor, got to try again, and again and again and again. Each time it mathematically “chose” to prioritize different pieces of information and performed incrementally better. A neural network, however, was a black box. It divined patterns, but the patterns it identified didn’t always make intuitive sense to a human observer. The same network that hit on our concept of cat also became enthusiastic about a pattern that looked like some sort of furniture-animal compound, like a cross between an ottoman and a goat.
Le didn’t see himself in those heady cat years as a language guy, but he felt an urge to connect the dots to his early chatbot. After the cat paper, he realized that if you could ask a network to summarize a photo, you could perhaps also ask it to summarize a sentence. This problem preoccupied Le, along with a Brain colleague named Tomas Mikolov, for the next two years.
In that time, the Brain team outgrew several offices around him. For a while they were on a floor they shared with executives. They got an email at one point from the administrator asking that they please stop allowing people to sleep on the couch in front of Larry Page and Sergey Brin’s suite. It unsettled incoming V.I.P.s. They were then allocated part of a research building across the street, where their exchanges in the microkitchen wouldn’t be squandered on polite chitchat with the suits. That interim also saw dedicated attempts on the part of Google’s competitors to catch up. (As Le told me about his close collaboration with Tomas Mikolov, he kept repeating Mikolov’s name over and over, in an incantatory way that sounded poignant. Le had never seemed so solemn. I finally couldn’t help myself and began to ask, “Is he … ?” Le nodded. “At Facebook,” he replied.)
They spent this period trying to come up with neural-network architectures that could accommodate not only simple photo classifications, which were static, but also complex structures that unfolded over time, like language or music. Many of these were first proposed in the 1990s, and Le and his colleagues went back to those long-ignored contributions to see what they could glean. They knew that once you established a facility with basic linguistic prediction, you could then go on to do all sorts of other intelligent things — like predict a suitable reply to an email, for example, or predict the flow of a sensible conversation. You could sidle up to the sort of prowess that would, from the outside at least, look a lot like thinking.
Part II: Language Machine
5. The Linguistic Turn
The hundred or so current members of Brain — it often feels less like a department within a colossal corporate hierarchy than it does a club or a scholastic society or an intergalactic cantina — came in the intervening years to count among the freest and most widely admired employees in the entire Google organization. They are now quartered in a tiered two-story eggshell building, with large windows tinted a menacing charcoal gray, on the leafy northwestern fringe of the company’s main Mountain View campus. Their microkitchen has a foosball table I never saw used; a Rock Band setup I never saw used; and a Go kit I saw used on a few occasions. (I did once see a young Brain research associate introducing his colleagues to ripe jackfruit, carving up the enormous spiky orb like a turkey.)
When I began spending time at Brain’s offices, in June, there were some rows of empty desks, but most of them were labeled with Post-it notes that said things like “Jesse, 6/27.” Now those are all occupied. When I first visited, parking was not an issue. The closest spaces were those reserved for expectant mothers or Teslas, but there was ample space in the rest of the lot. By October, if I showed up later than 9:30, I had to find a spot across the street.
Brain’s growth made Dean slightly nervous about how the company was going to handle the demand. He wanted to avoid what at Google is known as a “success disaster” — a situation in which the company’s capabilities in theory outpaced its ability to implement a product in practice. At a certain point he did some back-of-the-envelope calculations, which he presented to the executives one day in a two-slide presentation.
“If everyone in the future speaks to their Android phone for three minutes a day,” he told them, “this is how many machines we’ll need.” They would need to double or triple their global computational footprint.
“That,” he observed with a little theatrical gulp and widened eyes, “sounded scary. You’d have to” — he hesitated to imagine the consequences — “build new buildings.”
There was, however, another option: just design, mass-produce and install in dispersed data centers a new kind of chip to make everything faster. These chips would be called T.P.U.s, or “tensor processing units,” and their value proposition — counterintuitively — is that they are deliberately less precise than normal chips. Rather than compute 12.246 times 54.392, they will give you the perfunctory answer to 12 times 54. On a mathematical level, rather than a metaphorical one, a neural network is just a structured series of hundreds or thousands or tens of thousands of matrix multiplications carried out in succession, and it’s much more important that these processes be fast than that they be exact. “Normally,” Dean said, “special-purpose hardware is a bad idea. It usually works to speed up one thing. But because of the generality of neural networks, you can leverage this special-purpose hardware for a lot of other things.”
Just as the chip-design process was nearly complete, Le and two colleagues finally demonstrated that neural networks might be configured to handle the structure of language. He drew upon an idea, called “word embeddings,” that had been around for more than 10 years. When you summarize images, you can divine a picture of what each stage of the summary looks like — an edge, a circle, etc. When you summarize language in a similar way, you essentially produce multidimensional maps of the distances, based on common usage, between one word and every single other word in the language. The machine is not “analyzing” the data the way that we might, with linguistic rules that identify some of them as nouns and others as verbs. Instead, it is shifting and twisting and warping the words around in the map. In two dimensions, you cannot make this map useful. You want, for example, “cat” to be in the rough vicinity of “dog,” but you also want “cat” to be near “tail” and near “supercilious” and near “meme,” because you want to try to capture all of the different relationships — both strong and weak — that the word “cat” has to other words. It can be related to all these other words simultaneously only if it is related to each of them in a different dimension. You can’t easily make a 160,000-dimensional map, but it turns out you can represent a language pretty well in a mere thousand or so dimensions — in other words, a universe in which each word is designated by a list of a thousand numbers. Le gave me a good-natured hard time for my continual requests for a mental picture of these maps. “Gideon,” he would say, with the blunt regular demurral of Bartleby, “I do not generally like trying to visualize thousand-dimensional vectors in three-dimensional space.”
Still, certain dimensions in the space, it turned out, did seem to represent legible human categories, like gender or relative size. If you took the thousand numbers that meant “king” and literally just subtracted the thousand numbers that meant “queen,” you got the same numerical result as if you subtracted the numbers for “woman” from the numbers for “man.” And if you took the entire space of the English language and the entire space of French, you could, at least in theory, train a network to learn how to take a sentence in one space and propose an equivalent in the other. You just had to give it millions and millions of English sentences as inputs on one side and their desired French outputs on the other, and over time it would recognize the relevant patterns in words the way that an image classifier recognized the relevant patterns in pixels. You could then give it a sentence in English and ask it to predict the best French analogue.
The major difference between words and pixels, however, is that all of the pixels in an image are there at once, whereas words appear in a progression over time. You needed a way for the network to “hold in mind” the progression of a chronological sequence — the complete pathway from the first word to the last. In a period of about a week, in September 2014, three papers came out — one by Le and two others by academics in Canada and Germany — that at last provided all the theoretical tools necessary to do this sort of thing. That research allowed for open-ended projects like Brain’s Magenta, an investigation into how machines might generate art and music. It also cleared the way toward an instrumental task like machine translation. Hinton told me he thought at the time that this follow-up work would take at least five more years.
6. The Ambush
Le’s paper showed that neural translation was plausible, but he had used only a relatively small public data set. (Small for Google, that is — it was actually the biggest public data set in the world. A decade of the old Translate had gathered production data that was between a hundred and a thousand times bigger.) More important, Le’s model didn’t work very well for sentences longer than about seven words.
Mike Schuster, who then was a staff research scientist at Brain, picked up the baton. He knew that if Google didn’t find a way to scale these theoretical insights up to a production level, someone else would. The project took him the next two years. “You think,” Schuster says, “to translate something, you just get the data, run the experiments and you’re done, but it doesn’t work like that.”
Schuster is a taut, focused, ageless being with a tanned, piston-shaped head, narrow shoulders, long camo cargo shorts tied below the knee and neon-green Nike Flyknits. He looks as if he woke up in the lotus position, reached for his small, rimless, elliptical glasses, accepted calories in the form of a modest portion of preserved acorn and completed a relaxed desert decathlon on the way to the office; in reality, he told me, it’s only an 18-mile bike ride each way. Schuster grew up in Duisburg, in the former West Germany’s blast-furnace district, and studied electrical engineering before moving to Kyoto to work on early neural networks. In the 1990s, he ran experiments with a neural-networking machine as big as a conference room; it cost millions of dollars and had to be trained for weeks to do something you could now do on your desktop in less than an hour. He published a paper in 1997 that was barely cited for a decade and a half; this year it has been cited around 150 times. He is not humorless, but he does often wear an expression of some asperity, which I took as his signature combination of German restraint and Japanese restraint.
The issues Schuster had to deal with were tangled. For one thing, Le’s code was custom-written, and it wasn’t compatible with the new open-source machine-learning platform Google was then developing, TensorFlow. Dean directed to Schuster two other engineers, Yonghui Wu and Zhifeng Chen, in the fall of 2015. It took them two months just to replicate Le’s results on the new system. Le was around, but even he couldn’t always make heads or tails of what they had done.
As Schuster put it, “Some of the stuff was not done in full consciousness. They didn’t know themselves why they worked.”
This February, Google’s research organization — the loose division of the company, roughly a thousand employees in all, dedicated to the forward-looking and the unclassifiable — convened their leads at an offsite retreat at the Westin St. Francis, on Union Square, a luxury hotel slightly less splendid than Google’s own San Francisco shop a mile or so to the east. The morning was reserved for rounds of “lightning talks,” quick updates to cover the research waterfront, and the afternoon was idled away in cross-departmental “facilitated discussions.” The hope was that the retreat might provide an occasion for the unpredictable, oblique, Bell Labs-ish exchanges that kept a mature company prolific.
At lunchtime, Corrado and Dean paired up in search of Macduff Hughes, director of Google Translate. Hughes was eating alone, and the two Brain members took positions at either side. As Corrado put it, “We ambushed him.”
“O.K.,” Corrado said to the wary Hughes, holding his breath for effect. “We have something to tell you.”
They told Hughes that 2016 seemed like a good time to consider an overhaul of Google Translate — the code of hundreds of engineers over 10 years — with a neural network. The old system worked the way all machine translation has worked for about 30 years: It sequestered each successive sentence fragment, looked up those words in a large statistically derived vocabulary table, then applied a battery of post-processing rules to affix proper endings and rearrange it all to make sense. The approach is called “phrase-based statistical machine translation,” because by the time the system gets to the next phrase, it doesn’t know what the last one was. This is why Translate’s output sometimes looked like a shaken bag of fridge magnets. Brain’s replacement would, if it came together, read and render entire sentences at one draft. It would capture context — and something akin to meaning.
The stakes may have seemed low: Translate generates minimal revenue, and it probably always will. For most Anglophone users, even a radical upgrade in the service’s performance would hardly be hailed as anything more than an expected incremental bump. But there was a case to be made that human-quality machine translation is not only a short-term necessity but also a development very likely, in the long term, to prove transformational. In the immediate future, it’s vital to the company’s business strategy. Google estimates that 50 percent of the internet is in English, which perhaps 20 percent of the world’s population speaks. If Google was going to compete in China — where a majority of market share in search-engine traffic belonged to its competitor Baidu — or India, decent machine translation would be an indispensable part of the infrastructure. Baidu itself had published a pathbreaking paper about the possibility of neural machine translation in July 2015.
And in the more distant, speculative future, machine translation was perhaps the first step toward a general computational facility with human language. This would represent a major inflection point — perhaps the major inflection point — in the development of something that felt like true artificial intelligence.
Most people in Silicon Valley were aware of machine learning as a fast-approaching horizon, so Hughes had seen this ambush coming. He remained skeptical. A modest, sturdily built man of early middle age with mussed auburn hair graying at the temples, Hughes is a classic line engineer, the sort of craftsman who wouldn’t have been out of place at a drafting table at 1970s Boeing. His jeans pockets often look burdened with curious tools of ungainly dimension, as if he were porting around measuring tapes or thermocouples, and unlike many of the younger people who work for him, he has a wardrobe unreliant on company gear. He knew that various people in various places at Google and elsewhere had been trying to make neural translation work — not in a lab but at production scale — for years, to little avail.
Hughes listened to their case and, at the end, said cautiously that it sounded to him as if maybe they could pull it off in three years.
Dean thought otherwise. “We can do it by the end of the year, if we put our minds to it.” One reason people liked and admired Dean so much was that he had a long record of successfully putting his mind to it. Another was that he wasn’t at all embarrassed to say sincere things like “if we put our minds to it.”
Hughes was sure the conversion wasn’t going to happen any time soon, but he didn’t personally care to be the reason. “Let’s prepare for 2016,” he went back and told his team. “I’m not going to be the one to say Jeff Dean can’t deliver speed.”
A month later, they were finally able to run a side-by-side experiment to compare Schuster’s new system with Hughes’s old one. Schuster wanted to run it for English-French, but Hughes advised him to try something else. “English-French,” he said, “is so good that the improvement won’t be obvious.”
It was a challenge Schuster couldn’t resist. The benchmark metric to evaluate machine translation is called a BLEU score, which compares a machine translation with an average of many reliable human translations. At the time, the best BLEU scores for English-French were in the high 20s. An improvement of one point was considered very good; an improvement of two was considered outstanding.
The neural system, on the English-French language pair, showed an improvement over the old system of seven points.
Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years.
To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development.
In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.
7. Theory Becomes Product
Until then, the neural-translation team had been only three people — Schuster, Wu and Chen — but with Hughes’s support, the broader team began to coalesce. They met under Schuster’s command on Wednesdays at 2 p.m. in a corner room of the Brain building called Quartz Lake. The meeting was generally attended by a rotating cast of more than a dozen people. When Hughes or Corrado were there, they were usually the only native English speakers. The engineers spoke Chinese, Vietnamese, Polish, Russian, Arabic, German and Japanese, though they mostly spoke in their own efficient pidgin and in math. It is not always totally clear, at Google, who is running a meeting, but in Schuster’s case there was no ambiguity.
The steps they needed to take, even then, were not wholly clear. “This story is a lot about uncertainty — uncertainty throughout the whole process,” Schuster told me at one point. “The software, the data, the hardware, the people. It was like” — he extended his long, gracile arms, slightly bent at the elbows, from his narrow shoulders — “swimming in a big sea of mud, and you can only see this far.” He held out his hand eight inches in front of his chest. “There’s a goal somewhere, and maybe it’s there.”
Most of Google’s conference rooms have videochat monitors, which when idle display extremely high-resolution oversaturated public Google+ photos of a sylvan dreamscape or the northern lights or the Reichstag. Schuster gestured toward one of the panels, which showed a crystalline still of the Washington Monument at night.
“The view from outside is that everyone has binoculars and can see ahead so far.”
The theoretical work to get them to this point had already been painstaking and drawn-out, but the attempt to turn it into a viable product — the part that academic scientists might dismiss as “mere” engineering — was no less difficult. For one thing, they needed to make sure that they were training on good data. Google’s billions of words of training “reading” were mostly made up of complete sentences of moderate complexity, like the sort of thing you might find in Hemingway. Some of this is in the public domain: The original Rosetta Stone of statistical machine translation was millions of pages of the complete bilingual records of the Canadian Parliament. Much of it, however, was culled from 10 years of collected data, including human translations that were crowdsourced from enthusiastic respondents. The team had in their storehouse about 97 million unique English “words.” But once they removed the emoticons, and the misspellings, and the redundancies, they had a working vocabulary of only around 160,000.
Then you had to refocus on what users actually wanted to translate, which frequently had very little to do with reasonable language as it is employed. Many people, Google had found, don’t look to the service to translate full, complex sentences; they translate weird little shards of language. If you wanted the network to be able to handle the stream of user queries, you had to be sure to orient it in that direction. The network was very sensitive to the data it was trained on. As Hughes put it to me at one point: “The neural-translation system is learning everything it can. It’s like a toddler. ‘Oh, Daddy says that word when he’s mad!’ ” He laughed. “You have to be careful.”
More than anything, though, they needed to make sure that the whole thing was fast and reliable enough that their users wouldn’t notice. In February, the translation of a 10-word sentence took 10 seconds. They could never introduce anything that slow. The Translate team began to conduct latency experiments on a small percentage of users, in the form of faked delays, to identify tolerance. They found that a translation that took twice as long, or even five times as long, wouldn’t be registered. An eightfold slowdown would. They didn’t need to make sure this was true across all languages. In the case of a high-traffic language, like French or Chinese, they could countenance virtually no slowdown. For something more obscure, they knew that users wouldn’t be so scared off by a slight delay if they were getting better quality. They just wanted to prevent people from giving up and switching over to some competitor’s service.
Schuster, for his part, admitted he just didn’t know if they ever could make it fast enough. He remembers a conversation in the microkitchen during which he turned to Chen and said, “There must be something we don’t know to make it fast enough, but I don’t know what it could be.”
He did know, though, that they needed more computers — “G.P.U.s,” graphics processors reconfigured for neural networks — for training.
Hughes went to Schuster to ask what he thought. “Should we ask for a thousand G.P.U.s?”
Schuster said, “Why not 2,000?”
Ten days later, they had the additional 2,000 processors.
By April, the original lineup of three had become more than 30 people — some of them, like Le, on the Brain side, and many from Translate. In May, Hughes assigned a kind of provisional owner to each language pair, and they all checked their results into a big shared spreadsheet of performance evaluations. At any given time, at least 20 people were running their own independent weeklong experiments and dealing with whatever unexpected problems came up. One day a model, for no apparent reason, started taking all the numbers it came across in a sentence and discarding them. There were months when it was all touch and go. “People were almost yelling,” Schuster said.
By late spring, the various pieces were coming together. The team introduced something called a “word-piece model,” a “coverage penalty,” “length normalization.” Each part improved the results, Schuster says, by maybe a few percentage points, but in aggregate they had significant effects. Once the model was standardized, it would be only a single multilingual model that would improve over time, rather than the 150 different models that Translate currently used. Still, the paradox — that a tool built to further generalize, via learning machines, the process of automation required such an extraordinary amount of concerted human ingenuity and effort — was not lost on them. So much of what they did was just gut. How many neurons per layer did you use? 1,024 or 512? How many layers? How many sentences did you run through at a time? How long did you train for?
“We did hundreds of experiments,” Schuster told me, “until we knew that we could stop the training after one week. You’re always saying: When do we stop? How do I know I’m done? You never know you’re done. The machine-learning mechanism is never perfect. You need to train, and at some point you have to stop. That’s the very painful nature of this whole system. It’s hard for some people. It’s a little bit an art — where you put your brush to make it nice. It comes from just doing it. Some people are better, some worse.”
By May, the Brain team understood that the only way they were ever going to make the system fast enough to implement as a product was if they could run it on T.P.U.s, the special-purpose chips that Dean had called for. As Chen put it: “We did not even know if the code would work. But we did know that without T.P.U.s, it definitely wasn’t going to work.” He remembers going to Dean one on one to plead, “Please reserve something for us.” Dean had reserved them. The T.P.U.s, however, didn’t work right out of the box. Wu spent two months sitting next to someone from the hardware team in an attempt to figure out why. They weren’t just debugging the model; they were debugging the chip. The neural-translation project would be proof of concept for the whole infrastructural investment.
One Wednesday in June, the meeting in Quartz Lake began with murmurs about a Baidu paper that had recently appeared on the discipline’s chief online forum. Schuster brought the room to order. “Yes, Baidu came out with a paper. It feels like someone looking through our shoulder — similar architecture, similar results.” The company’s BLEU scores were essentially what Google achieved in its internal tests in February and March. Le didn’t seem ruffled; his conclusion seemed to be that it was a sign Google was on the right track. “It is very similar to our system,” he said with quiet approval.
The Google team knew that they could have published their results earlier and perhaps beaten their competitors, but as Schuster put it: “Launching is more important than publishing. People say, ‘Oh, I did something first,’ but who cares, in the end?”
This did, however, make it imperative that they get their own service out first and better. Hughes had a fantasy that they wouldn’t even inform their users of the switch. They would just wait and see if social media lit up with suspicions about the vast improvements.
“We don’t want to say it’s a new system yet,” he told me at 5:36 p.m. two days after Labor Day, one minute before they rolled out Chinese-to-English to 10 percent of their users, without telling anyone. “We want to make sure it works. The ideal is that it’s exploding on Twitter: ‘Have you seen how awesome Google Translate got?’ ”
8. A Celebration
The only two reliable measures of time in the seasonless Silicon Valley are the rotations of seasonal fruit in the microkitchens — from the pluots of midsummer to the Asian pears and Fuyu persimmons of early fall — and the zigzag of technological progress. On an almost uncomfortably warm Monday afternoon in late September, the team’s paper was at last released. It had an almost comical 31 authors. The next day, the members of Brain and Translate gathered to throw themselves a little celebratory reception in the Translate microkitchen. The rooms in the Brain building, perhaps in homage to the long winters of their diaspora, are named after Alaskan locales; the Translate building’s theme is Hawaiian.
The Hawaiian microkitchen has a slightly grainy beach photograph on one wall, a small lei-garlanded thatched-hut service counter with a stuffed parrot at the center and ceiling fixtures fitted to resemble paper lanterns. Two sparse histograms of bamboo poles line the sides, like the posts of an ill-defended tropical fort. Beyond the bamboo poles, glass walls and doors open onto rows of identical gray desks on either side. That morning had seen the arrival of new hooded sweatshirts to honor 10 years of Translate, and many team members went over to the party from their desks in their new gear. They were in part celebrating the fact that their decade of collective work was, as of that day, en route to retirement. At another institution, these new hoodies might thus have become a costume of bereavement, but the engineers and computer scientists from both teams all seemed pleased.
Google’s neural translation was at last working. By the time of the party, the company’s Chinese-English test had already processed 18 million queries. One engineer on the Translate team was running around with his phone out, trying to translate entire sentences from Chinese to English using Baidu’s alternative. He crowed with glee to anybody who would listen. “If you put in more than two characters at once, it times out!” (Baidu says this problem has never been reported by users.)
When word began to spread, over the following weeks, that Google had introduced neural translation for Chinese to English, some people speculated that it was because that was the only language pair for which the company had decent results. Everybody at the party knew that the reality of their achievement would be clear in November. By then, however, many of them would be on to other projects.
Hughes cleared his throat and stepped in front of the tiki bar. He wore a faded green polo with a rumpled collar, lightly patterned across the midsection with dark bands of drying sweat. There had been last-minute problems, and then last-last-minute problems, including a very big measurement error in the paper and a weird punctuation-related bug in the system. But everything was resolved — or at least sufficiently resolved for the moment. The guests quieted. Hughes ran efficient and productive meetings, with a low tolerance for maundering or side conversation, but he was given pause by the gravity of the occasion. He acknowledged that he was, perhaps, stretching a metaphor, but it was important to him to underline the fact, he began, that the neural translation project itself represented a “collaboration between groups that spoke different languages.”
Their neural-translation project, he continued, represented a “step function forward” — that is, a discontinuous advance, a vertical leap rather than a smooth curve. The relevant translation had been not just between the two teams but from theory into reality. He raised a plastic demi-flute of expensive-looking Champagne.
“To communication,” he said, “and cooperation!”
The engineers assembled looked around at one another and gave themselves over to little circumspect whoops and applause.
Jeff Dean stood near the center of the microkitchen, his hands in his pockets, shoulders hunched slightly inward, with Corrado and Schuster. Dean saw that there was some diffuse preference that he contribute to the observance of the occasion, and he did so in a characteristically understated manner, with a light, rapid, concise addendum.
What they had shown, Dean said, was that they could do two major things at once: “Do the research and get it in front of, I dunno, half a billion people.”
Everyone laughed, not because it was an exaggeration but because it wasn’t.
Epilogue: Machines Without Ghosts
Perhaps the most famous historic critique of artificial intelligence, or the claims made on its behalf, implicates the question of translation. The Chinese Room argument was proposed in 1980 by the Berkeley philosopher John Searle. In Searle’s thought experiment, a monolingual English speaker sits alone in a cell. An unseen jailer passes him, through a slot in the door, slips of paper marked with Chinese characters. The prisoner has been given a set of tables and rules in English for the composition of replies. He becomes so adept with these instructions that his answers are soon “absolutely indistinguishable from those of Chinese speakers.” Should the unlucky prisoner be said to “understand” Chinese? Searle thought the answer was obviously not. This metaphor for a computer, Searle later wrote, exploded the claim that “the appropriately programmed digital computer with the right inputs and outputs would thereby have a mind in exactly the sense that human beings have minds.”
For the Google Brain team, though, or for nearly everyone else who works in machine learning in Silicon Valley, that view is entirely beside the point. This doesn’t mean they’re just ignoring the philosophical question. It means they have a fundamentally different view of the mind. Unlike Searle, they don’t assume that “consciousness” is some special, numinously glowing mental attribute — what the philosopher Gilbert Ryle called the “ghost in the machine.” They just believe instead that the complex assortment of skills we call “consciousness” has randomly emerged from the coordinated activity of many different simple mechanisms. The implication is that our facility with what we consider the higher registers of thought are no different in kind from what we’re tempted to perceive as the lower registers. Logical reasoning, on this account, is seen as a lucky adaptation; so is the ability to throw and catch a ball. Artificial intelligence is not about building a mind; it’s about the improvement of tools to solve problems. As Corrado said to me on my very first day at Google, “It’s not about what a machine ‘knows’ or ‘understands’ but what it ‘does,’ and — more importantly — what it doesn’t do yet.”
Where you come down on “knowing” versus “doing” has real cultural and social implications. At the party, Schuster came over to me to express his frustration with the paper’s media reception. “Did you see the first press?” he asked me. He paraphrased a headline from that morning, blocking it word by word with his hand as he recited it: GOOGLE SAYS A.I. TRANSLATION IS INDISTINGUISHABLE FROM HUMANS’. Over the final weeks of the paper’s composition, the team had struggled with this; Schuster often repeated that the message of the paper was “It’s much better than it was before, but not as good as humans.” He had hoped it would be clear that their efforts weren’t about replacing people but helping them.
And yet the rise of machine learning makes it more difficult for us to carve out a special place for us. If you believe, with Searle, that there is something special about human “insight,” you can draw a clear line that separates the human from the automated. If you agree with Searle’s antagonists, you can’t. It is understandable why so many people cling fast to the former view. At a 2015 M.I.T. conference about the roots of artificial intelligence, Noam Chomsky was asked what he thought of machine learning. He pooh-poohed the whole enterprise as mere statistical prediction, a glorified weather forecast. Even if neural translation attained perfect functionality, it would reveal nothing profound about the underlying nature of language. It could never tell you if a pronoun took the dative or the accusative case. This kind of prediction makes for a good tool to accomplish our ends, but it doesn’t succeed by the standards of furthering our understanding of why things happen the way they do. A machine can already detect tumors in medical scans better than human radiologists, but the machine can’t tell you what’s causing the cancer.
Then again, can the radiologist?
Medical diagnosis is one field most immediately, and perhaps unpredictably, threatened by machine learning. Radiologists are extensively trained and extremely well paid, and we think of their skill as one of professional insight — the highest register of thought. In the past year alone, researchers have shown not only that neural networks can find tumors in medical images much earlier than their human counterparts but also that machines can even make such diagnoses from the texts of pathology reports. What radiologists do turns out to be something much closer to predictive pattern-matching than logical analysis. They’re not telling you what caused the cancer; they’re just telling you it’s there.
Once you’ve built a robust pattern-matching apparatus for one purpose, it can be tweaked in the service of others. One Translate engineer took a network he put together to judge artwork and used it to drive an autonomous radio-controlled car. A network built to recognize a cat can be turned around and trained on CT scans — and on infinitely more examples than even the best doctor could ever review. A neural network built to translate could work through millions of pages of documents of legal discovery in the tiniest fraction of the time it would take the most expensively credentialed lawyer. The kinds of jobs taken by automatons will no longer be just repetitive tasks that were once — unfairly, it ought to be emphasized — associated with the supposed lower intelligence of the uneducated classes. We’re not only talking about three and a half million truck drivers who may soon lack careers. We’re talking about inventory managers, economists, financial advisers, real estate agents. What Brain did over nine months is just one example of how quickly a small group at a large company can automate a task nobody ever would have associated with machines.
The most important thing happening in Silicon Valley right now is not disruption. Rather, it’s institution-building — and the consolidation of power — on a scale and at a pace that are both probably unprecedented in human history. Brain has interns; it has residents; it has “ninja” classes to train people in other departments. Everywhere there are bins of free bike helmets, and free green umbrellas for the two days a year it rains, and little fruit salads, and nap pods, and shared treadmill desks, and massage chairs, and random cartons of high-end pastries, and places for baby-clothes donations, and two-story climbing walls with scheduled instructors, and reading groups and policy talks and variegated support networks. The recipients of these major investments in human cultivation — for they’re far more than perks for proles in some digital salt mine — have at hand the power of complexly coordinated servers distributed across 13 data centers on four continents, data centers that draw enough electricity to light up large cities.
But even enormous institutions like Google will be subject to this wave of automation; once machines can learn from human speech, even the comfortable job of the programmer is threatened. As the party in the tiki bar was winding down, a Translate engineer brought over his laptop to show Hughes something. The screen swirled and pulsed with a vivid, kaleidoscopic animation of brightly colored spheres in long looping orbits that periodically collapsed into nebulae before dispersing once more.
Hughes recognized what it was right away, but I had to look closely before I saw all the names — of people and files. It was an animation of the history of 10 years of changes to the Translate code base, every single buzzing and blooming contribution by every last team member. Hughes reached over gently to skip forward, from 2006 to 2008 to 2015, stopping every once in a while to pause and remember some distant campaign, some ancient triumph or catastrophe that now hurried by to be absorbed elsewhere or to burst on its own. Hughes pointed out how often Jeff Dean’s name expanded here and there in glowing spheres.
Hughes called over Corrado, and they stood transfixed. To break the spell of melancholic nostalgia, Corrado, looking a little wounded, looked up and said, “So when do we get to delete it?”
“Don’t worry about it,” Hughes said. “The new code base is going to grow. Everything grows.”
Leap From Developer To
Machine Learning Practitioner
or, my answer to the question:
How Do I Get Started In Machine Learning?
I’m a developer. I have read a book or some posts on machine learning. I have watched some of the Coursera machine learning course. I still don’t know how to get started…
Does this sound familiar?
Frustrated with machine learning books and courses?
How do you get started in machine learning?
Photo by Peter Alfred Hess, some rights reserved
The most common question I’m asked by developers on my newsletter is:
How do I get started in machine learning?
I honestly cannot remember how many times I have answered it.
In this post, I lay out all of my very best thinking on this topic.
You will discover why the traditional approach to teaching machine learning does not work for you.
You will discover how to flip the entire model on its head.
And you will discover my simple but very effective antidote that you can use to get started.
Let’s get into it…
A Developer Interested in Machine Learning
You are a developer and you’re interested in getting into machine learning. And why not? It’s a hot topic at the moment, and it’s a fascinating and fast growing field.
You read some blog posts. You tried to go deeper but the books are dreadful. Math focused. Theory focused. Algorithm focused.
Sound familiar? Have you tried books, MOOCs, blog posts and still not know how to get started in machine learning?
You try some video courses. You sign-up and make an honest attempt at the oft-cited Coursera Stanford Machine Learning MOOC. It’s not much better than the books and detailed blog posts. You can’t see what all the fuss is about, why it is recommended to beginners.
You may have even attempted some small data sets, perhaps an entry level Kaggle competition.
The problem is you can’t connect the theory, algorithms and math from the books and courses to the problem. There’s a huge gap. A gulf. How ARE you supposed to get started in machine learning?
Machine Learning Engineer
When you think forward into the future, once you have captured this elusive understanding of machine learning, what does your job look like? How are you using your newfound machine learning skills in your day-to-day?
I think I can see it. You’re a machine learning engineer. You’re a developer that knows how to “do” machine learning.
Do you want to transition from developer to a developer that can do machine learning?
Scenario 1: The one-off model
Your boss walks over and says:
Hey, you know machine learning, right? Can you use the customer data from last year to predict which current customers in our sales pipeline are likely to convert? I want to use it in a presentation to the board next week…
I call this the one-off model.
The problem is well defined by your boss. She gives you the data, which is small enough to look at and understand in MS Excel if you had to. She wants accurate and reliable predictions.
You can deliver. And more importantly, you can explain all the relevant caveats on the results.
Scenario 2: The embedded model
You and your team are collecting requirements from stakeholders on a software project. There is a requirement for the user to be able to freehand draw shapes in the software, and for the software to figure out which shape it is and turn it into a crisp unambiguous version and label it appropriately.
You quickly see that the best (and only viable?) way to solve this problem is to devise and train a predictive model and embed it in your software product.
I call this the embedded model. There are variations (such as whether the model is static or updated, and whether it is local or called remotely via an API), but that’s just detail.
What’s key in this scenario is that you have the experience to notice a problem that is best solved with a predictive model and the skills to devise, train and deploy it.
Scenario 3: The deep model
You have started a new job and the system you are working on is made up of at least one predictive model. Maintenance and the addition of features to this system require an understanding of the model, its inputs and its outputs. The accuracy of the model is a feature of the software product and part of your job will be to improve it.
For example, as a part of regular pre-release system testing, you must demonstrate that the accuracy of the model (when validated on historical data) has the same or better skill than the previous version.
I call this the deep model. You will be expected to build a deep understanding of one specific predictive model and use your experience and skill to improve and verify its accuracy as part of your routine duties.
The Developer That “Does” Machine Learning
These scenarios give you a glimpse at what it’s like to be a developer that knows how to do machine learning. They’re realistic because they are all variations on scenarios I’ve been in or tasks that I have had to complete.
All three of these scenarios make one thing very clear. Although machine learning is a fascinating area, to a developer machine learning algorithms are just another bag of tricks, like multi-threading or 3d graphics programming. Nevertheless, they are a powerful group of methods that are absolutely required for a specific class of problem.
Traditional Answer To: “how do I get started?“
So how do you get started in machine learning?
If you crack a book on machine learning seeking an answer to this question, you’ll get a shock. They start with definitions and move on to mathematical descriptions of concepts and algorithms of ever increasing complexity.
The traditional answer to the question “how do I get started in machine learning” is bottom-up.
Definitions and mathematical descriptions are clear, succinct and often unambiguous. The thing is, they are dry, boring and require the requisite mathematical background to parse and interpret.
There is a reason why machine learning is often taught as a graduate level subject at university. It’s because this “first principles” way of teaching the subject requires years of prerequisites to understand.
For example, it is advisable that you have a good footing in:
This gets worse if you stray slightly into some of the more exotic and interesting algorithms.
This bottom-up and algorithm fixated approach to machine learning is pervasive.
Online courses, MOOCs and YouTube videos mimic the university approach to teaching machine learning. Again, this is great if you have the background or you’ve already put in your half-to-full-decade of studies to earn those higher degrees. It does not help your average developer.
If you skulk off to a question and answer forum like Quora, StackExchange or Reddit and meekly ask how to get started, you’re slapped with the same response. Often this response comes from fellow developers who are just as lost. It’s one big echo chamber of the same bad advice.
It’s no wonder that honest and hard working developers seeking to do the right thing think they have to go back to school and get a Masters or Ph.D. before they feel qualified to “do” machine learning.
The Traditional Approach is DEAD WRONG!
Think about this bottom-up approach to teaching machine learning for a second. It’s rigorous and systematic and sounds like the right idea on the surface. How could it be wrong?
Bottom-Up Programming (or, how to kill off budding programmers)
Imagine you’re a young developer. You’ve picked up some of this and that language and you’re starting to learn how to create standalone software.
You tell friends and family that you want to get into a career where you get to program every day. They tell you that you need to do a degree in computer science before you can get a job as a programmer.
You sign-up and start a computer science degree. Semester after semester you are exposed to more and more esoteric algebra, calculus and discrete math. You use antiquated programming languages. Your passion for programming and building software wavers.
The traditional approach to getting started in machine learning has a gap on the path to practitioner.
Perhaps you somehow make it to the other side. Looking back, you realize you were not taught one thing about modern software development practices, languages, tooling, or anything that you can use in your pursuit of creating and delivering software.
See the parallels to the teaching of machine learning?
Thankfully, programming has been around long enough, is popular enough and is important enough to the economy that we have found other ways to give budding young (or old) programmers the skills they need to actually do the thing they want to do – e.g. create software.
It does not make sense to load up a budding programmer’s head with theory on computability or computational complexity, or even deep details of algorithms and data structures. Some of this useful material (the latter on algorithmic complexity and data structures) can come later. Perhaps with focused material – but importantly in the context of an engineer that is already programming and delivering software, not in isolation.
Thankfully we have focused software engineering degrees. We also have resources like codecademy where you learn to program by… yep, actually programming.
If a developer wants to “do” machine learning, should they really have to go and spend a bunch of years and tens or hundreds of thousands of dollars to get the requisite math and higher degrees?
The answer is of course not! There is a better way.
A Better Approach
As with computer science, you can’t just invert the model and teach the same material top-down.
The reason is, like a computer science course never making it to the subjects that cover the practical concerns of developing and delivering software, machine learning courses and books fall well short. They stop at algorithms.
You need a top-down approach to machine learning. An approach where you focus on the actual result you want: working real machine learning problems from end-to-end using modern and “best of breed” tools and platforms.
A better approach to learning machine learning that starts with working machine learning problems end-to-end.
Here’s what I think your yellow brick road looks like.
1. Repeatable Results with a Systematic Process
Once you know some tooling, it is relatively easy to blast a problem with a machine learning algorithm and call it “done“.
This could be dangerous.
How do you know you’re done? How do you know the results are any good? How do you know the results are reliable on the dataset?
You need to be systematic when working a machine learning problem. It’s a project, like a software project, and good processes can make achieving a high-quality result repeatable from project to project.
Contemplating such a process you can think of some clear requirements, such as:
A process that guides you from end-to-end, from problem specification to presentation or deployment of results. Like a software project, you can think you’re done, but you’re probably not. Having the end deliverable in mind from the beginning sets an unambiguous project stop condition and focuses effort.
A process that is step-by-step so that you always know what to do next. Not knowing what to do next is a project killer.
A process that guarantees “good” results, e.g. better than average or good enough for the needs of the project. It is very common for projects to need good results delivered reliably with known confidence levels, not necessarily the very best accuracy possible.
A process that is invariant to the specific tools, programming languages and algorithm fads. Tools come and go and the process must be adaptive. Given the algorithm obsession in the field, there are always new and powerful algorithms coming out of academia.
Select a systematic and repeatable process that you can use to deliver results consistently.
There are many great processes out there, including some older processes that you can adapt to your needs.
Pick or adapt a process that works best for you and meets the requirements above.
2. Mapping of “Best of Breed” Tools onto Your Process
Machine learning tools and libraries come and go, but at any single point in time you have to use something that best maps onto your chosen process of delivering results.
You don’t want to evaluate and select any old algorithm or library, you want the so-called “best of breed” that is going to give you fast, reliable and high-quality results and automate as much of your process that you can afford.
Again, you are going to have to make these selections yourself. If you ask anyone, you’re going to hear their biases, often the latest tool they’re using.
I have my own biases, and I like to use different tools and platforms for different types of work.
For example, in the scenarios listed above, I would advise the following best of breed tools:
Embedded predictive model: Python with scikit-learn, because I can develop the model in the same language in which it is deployed. IPython is a great way to demonstrate your pipeline and results to the broader team. A MLaaS is also an option for bigger data.
Deep-dive model: R with the caret package, because I can quickly and automatically try a lot of state-of-the-art models and devise more and more elaborate feature selection, feature engineering and algorithm tuning experiments using the whole R platform.
In reality, these three tools bleed across the three scenarios depending on the specifics of a situation.
Map your preferred machine learning tools onto your chosen systematic process for working through problems.
Like development, you need to study your tools to get the most from them. You also need to keep your ear to the ground and jump to newer better tools if and when they are available, forever adapting them to your repeatable process.
3. Targeted Practice with Semi-Formal Work Product
You get good at development by practicing – by developing lots of software. Use this familiar approach to get good at machine learning. The more of the process you practice in each project, the better (ideally, work problems end-to-end).
Carefully Select Your Practice Datasets
You want to pick datasets that are realistic rather than contrived. There are hundreds of free datasets out there of ever increasing complexity.
I would advise starting with small in-memory datasets from places like the UCI Machine Learning Repository. They are well known, relatively clean and a good place to start as you feel out your new process and tooling.
From there, I would recommend larger in-memory datasets, like those from some Kaggle and KDD cup competitions. They are little more dirty and require you to flex work on more and different skills.
Stick with tabular data, this is what I advise all of my students.
Working with image and text data are new and different fields in their own right (computer vision and natural language processing respectively) that require you to learn specialized methods and tooling to those fields. If these are types of problems you want or need to work then it might be best to start there, and there are great resources available.
Write-Up Your Results and Build A Public Portfolio of Work
Create and retain semi-formal outcomes (I refer to outcomes as “work product”) from each project. By this I mean, write up what you did and what you learned into some kind of standalone document so that you can refer back and leverage the results on future and following projects.
This is akin to keeping a directory for each programming project as a developer and reusing code and ideas from previous projects. It speeds up the journey a lot, and I strongly recommend it.
Keep any scripts, code and generated images, but it is also important to write up your findings. Think of it as akin to comments in your code. A standalone write-up could be a simple PPT or text file, or as elaborate as a presentation at a meet-up or video on YouTube.
Work through and complete discrete projects, write up results and build a portfolio of projects.
Save each project in a public version control repository (like GitHub) so that other beginners can learn from you and extend your work. Link to the projects from your blog, LinkedIn or wherever and use the public portfolio to demonstrate your growing skills and capabilities.
A portfolio of public GitHub repositories is fast becoming the resume in the hiring process at companies that actually care about skills and delivering results.
Yes, This Approach Is Tailored For Developers
What we have laid out above is an approach that you can use as a developer to learn, get started and make progress in machine learning.
It’s natural you may have some doubts about whether this approach is really suited to you. Let me address some of your concerns.
You do not need to write code
You may be a web developer or similar where you do not write a lot of code. You can use this approach to get started and apply machine learning. Tools like Weka make it easy to devise machine learning experiments and build models without any programming at all.
Writing code can unlock more and different tools and capabilities, but it is not required, and it does not need to come first.
You do not need to be good at mathematics
Just like development where you don’t need to know a thing about computability or big-O notation to write code and ship useful and reliable software, you can work machine learning problems end-to-end without a background in statistics, probability and linear algebra.
It is important to note that we do not start with theory, but we do not ignore it. Dive in and pull out what you need on a method or algorithm, when you need it. In fact, you won’t be able to hold yourself back. The reason is, working machine learning problems is addictive and consuming. In the pursuit of getting better results and more accurate predictions, you will draw from any resources you can find, learning just enough to extract the nuggets of wisdom for you to apply on your problem.
If your goal is to master the theory, this approach is slower and less efficient. And this is why it so uncomfortable when seen through that lens. When viewed from the goal of being a developer that does machine learning, it makes a lot of sense.
You do not need a higher degree
There are no gatekeepers to this knowledge. It’s all available and you can study it yourself, today, now. You do not need to trade a lot of time and money for a degree before you can start working on machine learning problems.
If you heart is set on getting that higher degree, why not just start working on machine learning problems first and take a look at a degree in a few weeks or months after you have a small portfolio of completed projects built up. You will have a much clearer idea of the extent of the field and the parts you like.
I did go back and get those higher degrees. I love doing research, but I love working real problems and delivering results that clients actually care about a whole lot more. Also, I was working machine learning problems before I started the degree, I just didn’t realize I already had the resources and a path in front of me.
It’s one of the reasons I’m so passionate to convince developers like you that you have what you need to get started right now.
It is so very easy to come up with excuses to not get started in machine learning.
You do not need big data
Machine learning algorithms were developed and are best understood on small data. Data small enough for you to review in MS Excel, to load into memory and to work through on your desktop workstation.
Big data != machine learning. You can build predictive models using big data, but see this as a specialization of your skill set to a domain. I generally advise my students to start with small in memory datasets when starting in machine learning.
If big data machine learning is the area you want to work, then start there.
You do not need a desktop supercomputer
It is true that some of the state-of-the-art algorithms like deep learning require very powerful bazillion-core GPUs. They are powerful and exciting algorithms. They are also algorithms that work on smaller problems that you can compute with your desktop CPU.
You do not need to hold off getting started in machine learning until you have access to a big-fast computer.
Before you go off and buy a desktop supercomputer or rent very large EC2 instances, it might be worth spending some time learning how to get the most from these algorithms on smaller better-understood datasets.
You do not need a lot of time
We all have busy lives, but if you really want something you need to put in the time.
I’ve said it before, working machine learning problems is addictive. If you get caught up in machine learning competitions you will gladly sacrifice a month of evening television to squeeze a few more percent from your algorithm.
That being said, if you start small with a clear process and a best of breed tool, you can work a dataset from end-to-end in an hour or two, perhaps spread over one or two nights. A few of these and you have a beachhead on a portfolio of completed machine learning projects that you can begin to leverage on larger and more interesting problems.
Break it down into snack-size tasks on your Kanban board and make the time to get started.
Biggest Mistakes Developers Make and How To Avoid Them
I have been giving variations on this advice for close to two years now since I launched Machine Learning Mastery. Over that time I’ve seen five common pitfalls that I want you to avoid.
Not Taking Action: It’s all laid out and yet I see so many developers not take action. It is so much easier to watch TV or read news than to build a new and valuable skill in a fascinating field of study. You can lead a horse to water…
Picking Problems that are Too Big: I commonly see the first or second dataset a developer selects to work on being too difficult. It’s too large, too complex or too dirty and they’re not ready for the challenge. The awful thing is that the “failure” kills motivation and the developer flunks out of the process. Pick small problems that you can finish and write up in 60 minutes. Do that for a while before you take on something bigger.
Implementing Algorithms from Scratch: We have algorithm implementations. It’s done. At least done enough for you to do interesting things for the next few years. If your goal is to learn how to develop and deliver reliable and accurate predictive models, do not spend time implementing algorithms from scratch, use a library. On the other hand, if you want to focus on implementing algorithms, then clearly make this your objective and focus in on it.
Not Sticking to a Process: As with agile software development, if you deviate from the process, the wheels can come off your project pretty quickly and the result is often a big mess. Sticking to a process from start-to-finish that systematically works through a problem end-to-end is key. You can revisit “that interesting thing you found…” as a follow-up mini-project (an “ideas for follow-up work” section in your write-up), but finish the process and deliver.
Not Using Resources: There are many great papers, books and blog posts on machine learning. You can leverage these resources to improve your process, usage of tools and accuracy of your results. Use third party resources to get more from your algorithm and your dataset. Get ideas for algorithms and framings of the problem. A nugget of wisdom can change the course of your project. Remember, if you adopt a top-down process, the theory has to come in at the back-end. Take the time to understand your final models.
Don’t let any of these happen to you!
Your Next Step
We have covered a lot of ground and I hope I am starting to convince you that you can get started and make progress in machine learning. That a future in which you are a developer that can do machine learning is very real and very obtainable.
Machine Learning Performance Improvement Cheat Sheet
Photo by NASA, some rights reserved.
This cheat sheet is designed to give you ideas to lift performance on your machine learning problem.
All it takes is one good idea to get a breakthrough.
Find that one idea, then come back and find another.
I have divided the list into 4 sub-topics:
Improve Performance With Data.
Improve Performance With Algorithms.
Improve Performance With Algorithm Tuning.
Improve Performance With Ensembles.
The gains often get smaller the further you go down the list.
For example, a new framing of your problem or more data is often going to give you more payoff than tuning the parameters of your best performing algorithm. Not always, but in general.
1. Improve Performance With Data
You can get big wins with changes to your training data and problem definition. Perhaps even the biggest wins.
Strategy: Create new and different perspectives on your data in order to best expose the structure of the underlying problem to the learning algorithms.
Get More Data. Can you get more or better quality data? Modern nonlinear machine learning techniques like deep learning continue to improve in performance with more data.
Invent More Data. If you can’t get more data, can you generate new data? Perhaps you can augment or permute existing data or use a probabilistic model to generate new data.
Clean Your Data. Can you improve the signal in your data? Perhaps there are missing or corrupt observations that can be fixed or removed, or outlier values outside of reasonable ranges that can be fixed or removed in order to lift the quality of your data.
Resample Data. Can you resample data to change the size or distribution? Perhaps you can use a much smaller sample of data for your experiments to speed things up or over-sample or under-sample observations of a specific type to better represent them in your dataset.
Reframe Your Problem: Can you change the type of prediction problem you are solving? Reframe your data as a regression, binary or multiclass classification, time series, anomaly detection, rating, recommender, etc. type problem.
Rescale Your Data. Can you rescale numeric input variables? Normalization and standardization of input data can result in a lift in performance on algorithms that use weighted inputs or distance measures.
Transform Your Data. Can you reshape your data distribution? Making input data more Gaussian or passing it through an exponential function may better expose features in the data to a learning algorithm.
Project Your Data: Can you project your data into a lower dimensional space? You can use an unsupervised clustering or projection method to create an entirely new compressed representation of your dataset.
Feature Selection. Are all input variables equally important? Use feature selection and feature importance methods to create new views of your data to explore with modeling algorithms.
Feature Engineering. Can you create and add new data features? Perhaps there are attributes that can be decomposed into multiple new values (like categories, dates or strings) or attributes that can be aggregated to signify an event (like a count, binary flag or statistical summary).
Outcome: You should now have a suite of new views and versions of your dataset.
Next: You can evaluate the value of each with predictive modeling algorithms.
2. Improve Performance With Algorithms
Machine learning is all about algorithms.
Strategy: Identify the algorithms and data representations that perform above a baseline of performance and better than average. Remain skeptical of results and design experiments that make it hard to fool yourself.
Resampling Method. What resampling method is used to estimate skill on new data? Use a method and configuration that makes the best use of available data. The k-fold cross-validation method with a hold out validation dataset might be a best practice.
Evaluation Metric. What metric is used to evaluate the skill of predictions? Use a metric that best captures the requirements of the problem and the domain. It probably isn’t classification accuracy.
Baseline Performance. What is the baseline performance for comparing algorithms? Use a random algorithm or a zero rule algorithm (predict mean or mode) to establish a baseline by which to rank all evaluated algorithms.
Spot Check Linear Algorithms. What linear algorithms work well? Linear methods are often more biased, are easy to understand and are fast to train. They are preferred if you can achieve good results. Evaluate a diverse suite of linear methods.
Spot Check Nonlinear Algorithms. What nonlinear algorithms work well? Nonlinear algorithms often require more data, have greater complexity but can achieve better performance. Evaluate a diverse suite of nonlinear methods.
Steal from Literature. What algorithms are reported in the literature to work well on your problem? Perhaps you can get ideas of algorithm types or extensions of classical methods to explore on your problem.
Standard Configurations. What are the standard configurations for the algorithms being evaluated? Each algorithm needs an opportunity to do well on your problem. This does not mean tune the parameters (yet) but it does mean to investigate how to configure each algorithm well and give it a fighting chance in the algorithm bake-off.
Outcome: You should now have a short list of well-performing algorithms and data representations.
Next: The next step is to improve performance with algorithm tuning.
3. Improve Performance With Algorithm Tuning
Algorithm tuning might be where you spend the most of your time. It can be very time-consuming. You can often unearth one or two well-performing algorithms quickly from spot-checking. Getting the most from those algorithms can take, days, weeks or months.
Strategy: Get the most out of well-performing machine learning algorithms.
Diagnostics. What diagnostics and you review about your algorithm? Perhaps you can review learning curves to understand whether the method is over or underfitting the problem, and then correct. Different algorithms may offer different visualizations and diagnostics. Review what the algorithm is predicting right and wrong.
Try Intuition. What does your gut tell you? If you fiddle with parameters for long enough and the feedback cycle is short, you can develop an intuition for how to configure an algorithm on a problem. Try this out and see if you can come up with new parameter configurations to try on your larger test harness.
Steal from Literature. What parameters or parameter ranges are used in the literature? Evaluating the performance of standard parameters is a great place to start any tuning activity.
Random Search. What parameters can use random search? Perhaps you can use random search of algorithm hyperparameters to expose configurations that you would never think to try.
Grid Search. What parameters can use grid search? Perhaps there are grids of standard hyperparameter values that you can enumerate to find good configurations, then repeat the process with finer and finer grids.
Optimize. What parameters can you optimize? Perhaps there are parameters like structure or learning rate than can be tuned using a direct search procedure (like pattern search) or stochastic optimization (like a genetic algorithm).
Alternate Implementations. What other implementations of the algorithm are available? Perhaps an alternate implementation of the method can achieve better results on the same data. Each algorithm has a myriad of micro-decisions that must be made by the algorithm implementor. Some of these decisions may affect skill on your problem.
Algorithm Extensions. What are common extensions to the algorithm? Perhaps you can lift performance by evaluating common or standard extensions to the method. This may require implementation work.
Algorithm Customizations. What customizations can be made to the algorithm for your specific case? Perhaps there are modifications that you can make to the algorithm for your data, from loss function, internal optimization methods to algorithm specific decisions.
Contact Experts. What do algorithm experts recommend in your case? Write a short email summarizing your prediction problem and what you have tried to one or more expert academics on the algorithm. This may reveal leading edge work or academic work previously unknown to you with new or fresh ideas.
Outcome: You should now have a short list of highly tuned algorithms on your machine learning problem, maybe even just one.
Next:One or more models could be finalized at this point and used to make predictions or put into production. Further lifts in performance can be gained by combining the predictions from multiple models.
4. Improve Performance With Ensembles
You can combine the predictions from multiple models. After algorithm tuning, this is the next big area for improvement. In fact, you can often get good performance from combining the predictions from multiple “good enough” models rather than from multiple highly tuned (and fragile) models.
Strategy: Combine the predictions of multiple well-performing models.
Blend Model Predictions. Can you combine the predictions from multiple models directly? Perhaps you could use the same or different algorithms to make multiple models. Take the mean or mode from the predictions of multiple well-performing models.
Blend Data Representations. Can you combine predictions from models trained on different data representations? You may have many different projections of your problem which can be used to train well-performing algorithms, whose predictions can then be combined.
Blend Data Samples. Can you combine models trained on different views of your data? Perhaps you can create multiple subsamples of your training data and train a well-performing algorithm, then combine predictions. This is called bootstrap aggregation or bagging and works best when the predictions from each model are skillful but in different ways (uncorrelated).
Correct Predictions. Can you correct the predictions of well-performing models? Perhaps you can explicitly correct predictions or use a method like boosting to learn how to correct prediction errors.
Learn to Combine. Can you use a new model to learn how to best combine the predictions from multiple well-performing models? This is called stacked generalization or stacking and often works well when the submodels are skillful but in different ways and the aggregator model is a simple linear weighting of the predictions. This process can be repeated multiple layers deep.
Outcome: You should have one or more ensembles of well-performing models that outperform any single model.
Next: One or more ensembles could be finalized at this point and used to make predictions or put into production.
This cheat sheet is jam packed full of ideas to try to improve performance on your problem.
How To Get Started
You do not need to do everything. You just need one good idea to get a lift in performance.
Here’s how to handle the overwhelm:
Pick one group
Pick one method from the group.
Pick one thing to try of the chosen method.
Compare the results, keep if there was an improvement.
The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.
It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:
Free sulfur dioxide.
Total sulfur dioxide.
Quality (score between 0 and 10).
The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:
Number of times pregnant.
Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
Diastolic blood pressure (mm Hg).
Triceps skinfold thickness (mm).
2-Hour serum insulin (mu U/ml).
Body mass index (weight in kg/(height in m)^2).
Diabetes pedigree function.
Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
The Sonar Dataset involves the prediction of whether or not an object is a mine or a rock given the strength of sonar returns at different angles.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:
Sonar returns at different angles
Class (M for mine and R for rock)
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.
The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:
Variance of Wavelet Transformed image (continuous).
Skewness of Wavelet Transformed image (continuous).
Kurtosis of Wavelet Transformed image (continuous).
Entropy of image (continuous).
Class (0 for authentic, 1 for inauthentic).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.
The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.
It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:
Sepal length in cm.
Sepal width in cm.
Petal length in cm.
Petal width in cm.
Class (Iris Setosa, Iris Versicolour, Iris Virginica).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 26%.
The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.
It is a multi-class classification problem, but can also be framed as a regression. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:
Sex (M, F, I).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.
The Ionosphere Dataset requires the prediction of structure in the atmosphere given radar returns targeting free electrons in the ionosphere.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:
17 pairs of radar return data.
Class (g for good and b for bad).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.
The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.
It is a binary (2-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:
Length of kernel.
Width of kernel.
Length of kernel groove.
Class (1, 2, 3).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 28%.
Machine learning algorithms are a very large part of machine learning.
You have to understand how they work to make any progress in the field.
In this post you will discover a 14-part machine learning algorithms mini course that you can follow to finally understand machine learning algorithms.
We are going to cover a lot of ground in this course and you are going to have a great time. Let’s get started.
Machine Learning Algorithms Mini-Course
Photo by Jared Tarbell, some rights reserved.
Who is This Course For?
Before we get started, let’s make sure you are in the right place.
This course is for beginners curious about machine learning algorithms.
This course does not assume you know how to write code.
This course does not assume a background in mathematics.
This course does not assume a background in machine learning theory.
This mini-course will take you on a guided tour of machine learning algorithms from foundations and through 10 top techniques.
We will visit each algorithm to give you a sense of how it works, but not go into too much depth to keep things moving.
Let’s take a look at what we’re going to cover over the next 14 lessons.
You may need to come back to this post again and again, so you may want to bookmark it.
This mini-course is broken down int four parts: Algorithm Foundations, Linear Algorithms, Nonlinear Algorithms and Ensemble Algorithms.
Lesson 1: How To Talk About Data in Machine Learning
Lesson 2: Principle That Underpins All Algorithms
Lesson 3: Parametric and Nonparametric Algorithms
Lesson 4: Bias, Variance and the Trade-off
Lesson 5: Linear Regression
Lesson 6: Logistic Regression
Lesson 7: Linear Discriminant Analysis
Lesson 8: Classification and Regression Trees
Lesson 9: Naive Bayes
Lesson 10: k-Nearest Neighbors
Lesson 11: Learning Vector Quantization
Lesson 12: Support Vector Machines
Lesson 13: Bagging and Random Forest
Lesson 14: Boosting and AdaBoost
Lesson 1: How To Talk About Data in Machine Learning
Data plays a big part in machine learning.
It is important to understand and use the right terminology when talking about data.
How do you think about data? Think of a spreadsheet. You have columns, rows, and cells.
The statistical perspective of machine learning frames data in the context of a hypothetical function (f) that the machine learning algorithm aims to learn. Given some input variables (Input) the function answer the question as to what is the predicted output variable (Output).
Output = f(Input)
The inputs and outputs can be referred to as variables or vectors.
The computer science perspective uses a row of data to describe an entity (like a person) or an observation about an entity. As such, the columns for a row are often referred to as attributes of the observation and the rows themselves are called instances.
Lesson 2: The Principle That Underpins All Algorithms
There is a common principle that underlies all supervised machine learning algorithms for predictive modeling.
Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y).
Y = f(X)
This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or it’s form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.
The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.
Lesson 3: Parametric and Nonparametric Algorithms
What is a parametric machine learning algorithm and how is it different from a nonparametric machine learning algorithm?
Assumptions can greatly simplify the learning process, but can also limit what can be learned. Algorithms that simplify the function to a known form are called parametric machine learning algorithms.
The algorithms involve two steps:
Select a form for the function.
Learn the coefficients for the function from the training data.
Some examples of parametric machine learning algorithms are Linear Regression and Logistic Regression.
Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.
Non-parametric methods are often more flexible, achieve better accuracy but require a lot more data and training time.
Examples of nonparametric algorithms include Support Vector Machines, Neural Networks and Decision Trees.
Lesson 4: Bias, Variance and the Trade-off
Machine learning algorithms can best be understood through the lens of the bias-variance trade-off.
Bias are the simplifying assumptions made by a model to make the target function easier to learn.
Generally parametric algorithms have a high bias making them fast to learn and easier to understand but generally less flexible. In turn they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithms bias.
Decision trees are an example of a low bias algorithm, whereas linear regression is an example of a high-bias algorithm.
Variance is the amount that the estimate of the target function will change if different training data was used. The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance, not zero variance.
The k-Nearest Neighbors algorithm is an example of a high-variance algorithm, whereas Linear Discriminant Analysis is an example of a low variance algorithm.
The goal of any predictive modeling machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance. The parameterization of machine learning algorithms is often a battle to balance out bias and variance.
Increasing the bias will decrease the variance.
Increasing the variance will decrease the bias.
Lesson 5: Linear Regression Algorithm
Linear regression is perhaps one of the most well known and well understood algorithms in statistics and machine learning.
Isn’t it a technique from statistics?
Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.
The representation of linear regression is a equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).
y = B0 + B1 * x
We will predict y given the input x and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible.
It is a fast and simple technique and good first algorithm to try.
Lesson 6: Logistic Regression Algorithm
Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).
Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable.
Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 1) and predict a class value.
Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful on problems where you need to give more rationale for a prediction.
Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other.
It’s a fast model to learn and effective on binary classification problems.
Lesson 7: Linear Discriminant Analysis Algorithm
Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.
The representation of LDA is pretty straight forward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:
The mean value for each class.
The variance calculated across all classes.
Predictions are made by calculating a discriminate value for each class and making a prediction for the class with the largest value.
The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data before hand.
It’s a simple and powerful method for classification predictive modeling problems.
Lesson 8: Classification and Regression Trees
Decision Trees are an important type of algorithm for predictive modeling machine learning.
The representation for the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and output the class value at that leaf node.
Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.
Decision trees have a high variance and can yield more accurate predictions when used in an ensemble, a topic we will cover in Lesson 13 and Lesson 14.
Lesson 9: Naive Bayes Algorithm
Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.
The model is comprised of two types of probabilities that can be calculated directly from your training data:
The probability of each class.
The conditional probability for each class given each x value.
Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem.
When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.
Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data, nevertheless, the technique is very effective on a large range of complex problems.
Lesson 10: K-Nearest Neighbors Algorithm
The KNN algorithm is very simple and very effective.
The model representation for KNN is the entire training dataset. Simple right?
Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.
The trick is in how to determine similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
KNN can require a lot of memory or space to store all of the data, but only performs a calculation (or learn) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.
The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively effect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.
Lesson 11: Learning Vector Quantization
A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.
The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm.
After learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value or (real value in the case of regression) for the best matching unit is then returned as the prediction.
Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
Lesson 12: Support Vector Machines
Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
In two-dimensions you can visualize this as a line and let’s assume that all of our input points can be completely separated by this line.
The SVM learning algorithm finds the coefficients that results in the best separation of the classes by the hyperplane.
The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that as the largest margin.
Only these points are relevant in defining the hyperplane and in the construction of the classifier.
These points are called the support vectors. They support or define the hyperplane.
In practice, an optimization algorithm is used to find the values for the coefficients that maximizes the margin.
SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.
Lesson 13: Bagging and Random Forest
Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity from a data sample. Such as a mean. You take lots of samples of your data, calculate the mean, then average all of your mean values to give you a better estimation of the true mean value.
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.
Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.
If you get good good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.
Lesson 14: Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.
This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree that is created should pay attention to each training instance. Training data that is hard to predict is given more more weight, whereas easy to predict instances are given less weight.
Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence.
After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data.
Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
You made it. Well done! Take a moment and look back at how far you have come:
You discovered how to talk about data in machine learning and about the underlying principles of all predictive modeling algorithms.
You discovered the difference between parametric and nonparametric algorithms and the difference between error introduced by bias and variance.
You discovered three linear machine learning algorithms: Linear Regression, Logistic Regression and Linear Discriminant Analysis.
You were introduced to 5 nonlinear algorithms: Classification and Regression Trees, Naive Bayes, K-Nearest Neighbors, Learning Vector Quantization and Support Vector Machines.
Finally, you discovered two of the most popular ensemble algorithms: Bagging with Decision Trees and Boosting with AdaBoost.
Don’t make light of this, you have come a long way in a short amount of time. This is just the beginning of your journey with machine learning algorithms. Keep practicing and developing your skills.
There are many posts on KDnuggets covering the explanation of key terms and concepts in the areas of Data Science, Machine Learning, Deep Learning, Big Data, etc. (see here, here, and here). In fact, it’s one of the tasks that KDnuggets takes quite seriously: introducing and clarifying concepts in the minds of new and seasoned practitioners alike. In many of these posts, concepts and terminology are often expounded upon and fit into The Big Picture, sometimes miring down the key concept in exchange for defining some greater notion.
This is the first in a series of such posts on KDnuggets which will offer concise explanations of a related set of terms (machine learning, in this case), specifically taking a no-frills approach for those looking to isolate and define. After some thought, it was determined that these foundational-yet-informative types of posts have not been given enough exposure in the past, with future iterations likely to include:
According to Mitchell, machine learning is “concerned with the question of how to construct computer programs that automatically improve with experience.” Machine learning is interdisciplinary in nature, and employs techniques from the fields of computer science, statistics, and artificial intelligence, among others. The main artefacts of machine learning research are algorithms which facilitate this automatic improvement from experience, algorithms which can be applied in such diverse fields as computer vision, artificial intelligence, and data mining.
Classification is concerned with building models that separate data into distinct classes. These models are built by inputting a set of training data for which the classes are pre-labelled in order for the algorithm to learn from. The model is then used by inputting a different dataset for which the classes are withheld, allowing the model to predict their class membership based on what it has learned from the training set. Well-known classification schemes include decision trees and support vector machines. As this type of algorithm requires explicit class labelling, classification is a form of supervised learning.
Regression is very closely related to classification. While classification is concerned with the prediction of discrete classes, regression is applied when the “class” to be predicted is made up of continuous numerical values. Linear regression is an example of a regression technique.
Clustering is used for analyzing data which does not include pre-labeled classes, or even a class attribute at all. Data instances are grouped together using the concept of “maximizing the intraclass similarity and minimizing the interclass similarity,” as concisely described by Han, Kamber & Pei. This translates to the clustering algorithm identifying and grouping instances which are very similar, as opposed to ungrouped instances which are much less-similar to one another. k-means clustering is perhaps the most well-known example of a clustering algorithm. As clustering does not require the pre-labeling of instance classes, it is a form of unsupervised learning, meaning that it learns by observation as opposed to learning by example.
Association is most easily explained by introducing market basket analysis, a typical task for which it is well-known. Market basket analysis attempts to identify associations between the various items that have been chosen by a particular shopper and placed in their market basket, be it real or virtual, and assigns support and confidence measures for comparison. The value of this lies in cross-marketing and customer behavior analysis. Association is a generalization of market basket analysis, and is similar to classification except that any attribute can be predicted in association. Apriori enjoys success as the most well-known example of an association algorithm. Association is another example of unsupervised learning.
Decision trees are top-down, recursive, divide-and-conquer classifiers. Decision trees are generally composed of 2 main tasks: tree induction and tree pruning. Tree induction is the task of taking a set of pre-classified instances as input, deciding which attributes are best to split on, splitting the dataset, and recursing on the resulting split datasets until all training instances are categorized. While building our tree, the goal is to split on the attributes which create the purest child nodes possible, which would keep to a minimum the number of splits that would need to be made in order to classify all instances in our dataset. This purity is measured by the concept of information, which relates to how much would need to be known about a previously-unseen instance in order for it to be properly classified.
A completed decision tree model can be overly-complex, contain unnecessary structure, and be difficult to interpret. Tree pruning is the process of removing the unnecessary structure from a decision tree in order to make it more efficient, more easily-readable for humans, and more accurate as well. This increased accuracy is due to pruning’s ability to reduce overfitting.
SVMs are able to classify both linear and nonlinear data. SMVs work by transforming the training dataset into a higher dimension, a higher dimension which is then inspected for the optimal separation boundary, or boundaries, between classes. In SVMs, these boundaries are referred to as hyperplanes, which are identified by locating support vectors, or the instances that most essentially define classes, and their margins, which are the lines parallel to the hyperplane defined by the shortest distance between a hyperplane and its support vectors.
The grand idea with SVMs is that, with a high enough number of dimensions, a hyperplane separating 2 classes can always be found, thereby delineating dataset member classes. When repeated a sufficient number of times, enough hyperplanes can be generated to separate all classes in n-dimension space.
Neural networks are algorithms inspired by the biological brain, although the extent to which they capture actual brain functionality is highly controversial, and claims that they model the biological brain are patently false. Neural networks are made up of numerous interconnected conceptualized artificial neurons, which pass data between themselves, and which have associated weights which are tuned based upon the newtork’s “experience.” Neurons have activation thresholds which, if met by a combination of their associated weights and data passed to them, are fired; combinations of fired neurons result in “learning.”
Deep learning is a relatively new term, although it has existed prior to the dramatic uptick in online searches of late. Enjoying a surge in research and industry, due mainly to its incredible successes in a number of different areas, deep learning is the process of applying deep neural network technologies – that is, neural network architectures with multiple hidden layers of neurons – to solve problems. Deep learning is a process, like data mining, which employs deep neural network architectures, which are particular types of machine learning algorithms.
Bishop best describes reinforcement learning in a single concise sentence: “Reinforcement learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward.” Reinforcement algorithms are not given explicit goals; instead, they are forced to learn these optimal goals by trial and error. Think of the classic Mario Bros. video game; reinforcement learning algorithms would, by trial and error, determine that certain movements and button pushes would advance the player’s standing in the game, and trial and error would aim to result in an optimal state of game play.
Cross-validation is a deterministic method for model building, achieved by leaving out one of k segments, or folds, of a dataset, training on all k-1 segments, and using the remaining kth segment for testing; this process is then repeated k times, with the individual prediction error results being combined and averaged in a single, integrated model. This provides variability, with the goal of producing the most accurate predictive models possible.
When referring to probability, there are 2 major schools of thought: classical, or frequentist, probability interpretation views probabilities in terms of the frequencies of random events. In somewhat of a contrast, the Bayesian view of probability aims to quantify uncertainty, and updates a given probability as additional evidence is available. If these probabilities are extended to truth values, and are assigned to hypotheses, we then have “learning” to varying degrees of certainty.
Source: from kdnuggets.com (Good posts sometimes disappear, so I repost it here for my and for your information.)
Read this introductory list of contemporary machine learning algorithms of importance that every engineer should understand.
By James Le, New Story Charity.
It is no doubt that the sub-field of machine learning / artificial intelligence has increasingly gained more popularity in the past couple of years. As Big Data is the hottest trend in the tech industry at the moment, machine learning is incredibly powerful to make predictions or calculated suggestions based on large amounts of data. Some of the most common examples of machine learning are Netflix’s algorithms to make movie suggestions based on movies you have watched in the past or Amazon’s algorithms that recommend books based on books you have bought before.
So if you want to learn more about machine learning, how do you start? For me, my first introduction is when I took an Artificial Intelligence class when I was studying abroad in Copenhagen. My lecturer is a full-time Applied Math and CS professor at the Technical University of Denmark, in which his research areas are logic and artificial, focusing primarily on the use of logic to model human-like planning, reasoning and problem solving. The class was a mix of discussion of theory/core concepts and hands-on problem solving. The textbook that we used is one of the AI classics: Peter Norvig’s Artificial Intelligence — A Modern Approach, in which we covered major topics including intelligent agents, problem-solving by searching, adversarial search, probability theory, multi-agent systems, social AI, philosophy/ethics/future of AI. At the end of the class, in a team of 3, we implemented simple search-based agents solving transportation tasks in a virtual environment as a programming project.
I have learned a tremendous amount of knowledge thanks to that class, and decided to keep learning about this specialized topic. In the last few weeks, I have been multiple tech talks in San Francisco on deep learning, neural networks, data architecture — and a Machine Learning conference with a lot of well-known professionals in the field. Most importantly, I enrolled inUdacity’s Intro to Machine Learningonline course in the beginning of June and has just finished it a few days ago. In this post, I want to share some of the most common machine learning algorithms that I learned from the course.
Machine learning algorithms can be divided into 3 broad categories — supervised learning, unsupervised learning, and reinforcement learning.Supervised learning is useful in cases where a property (label) is available for a certain dataset (training set), but is missing and needs to be predicted for other instances. Unsupervised learning is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset (items are not pre-assigned). Reinforcement learning falls between these 2 extremes — there is some form of feedback available for each predictive step or action, but no precise label or error message. Since this is an intro class, I didn’t learn about reinforcement learning, but I hope that 10 algorithms on supervised and unsupervised learning will be enough to keep you interested.
1. Decision Trees: A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance-event outcomes, resource costs, and utility. Take a look at the image to get a sense of how it looks like.
From a business decision point of view, a decision tree is the minimum number of yes/no questions that one has to ask, to assess the probability of making a correct decision, most of the time. As a method, it allows you to approach the problem in a structured and systematic way to arrive at a logical conclusion.
2. Naïve Bayes Classification: Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. The featured image is the equation — with P(A|B) is posterior probability, P(B|A) is likelihood, P(A) is class prior probability, and P(B) is predictor prior probability.
Naive Bayes Classification
Some of real world examples are:
To mark an email as spam or not spam
Classify a news article about technology, politics, or sports
Check a piece of text expressing positive emotions, or negative emotions?
Used for face recognition software.
3. Ordinary Least Squares Regression: If you know statistics, you probably have heard of linear regression before. Least squares is a method for performing linear regression. You can think of linear regression as the task of fitting a straight line through a set of points. There are multiple possible strategies to do this, and “ordinary least squares” strategy go like this — You can draw a line, and then for each of the data points, measure the vertical distance between the point and the line, and add these up; the fitted line would be the one where this sum of distances is as small as possible.
Ordinary Least Squares Regression
Linear refers the kind of model you are using to fit the data, while least squares refers to the kind of error metric you are minimizing over.
4. Logistic Regression: Logistic regression is a powerful statistical way of modeling a binomial outcome with one or more explanatory variables. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.
In general, regressions can be used in real-world applications such as:
Measuring the success rates of marketing campaigns
Predicting the revenues of a certain product
Is there going to be an earthquake on a particular day?
5. Support Vector Machines: SVM is binary classification algorithm. Given a set of points of 2 types in N dimensional place, SVM generates a (N — 1) dimensional hyperlane to separate those points into 2 groups. Say you have some points of 2 types in a paper which are linearly separable. SVM will find a straight line which separates those points into 2 types and situated as far as possible from all those points.
Support Vector Machine
In terms of scale, some of the biggest problems that have been solved using SVMs (with suitably modified implementations) are display advertising, human splice site recognition, image-based gender detection, large-scale image classification…
6. Ensemble Methods: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting.
Ensemble Learning Algorithms
So how do ensemble methods work and why are they superior to individual models?
They average out biases: If you average a bunch of democratic-leaning polls and republican-leaning polls together, you will get an average something that isn’t leaning either way.
They reduce the variance: The aggregate opinion of a bunch of models is less noisy than the single opinion of one of the models. In finance, this is called diversification — a mixed portfolio of many stocks will be much less variable than just one of the stocks alone. This is why your models will be better with more data points rather than fewer.
They are unlikely to over-fit: If you have individual models that didn’t over-fit, and you are combining the predictions from each model in a simple way (average, weighted average, logistic regression), then there’s no room for over-fitting.
7. Clustering Algorithms: Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.
Every clustering algorithm is different, and here are a couple of them:
Neural networks / Deep Learning
8. Principal Component Analysis: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Principal Component Analysis
Some of the applications of PCA include compression, simplifying data for easier learning, visualization. Notice that domain knowledge is very important while choosing whether to go forward with PCA or not. It is not suitable in cases where data is noisy (all the components of PCA have quite a high variance).
9. Singular Value Decomposition: In linear algebra, SVD is a factorization of a real complex matrix. For a given m * n matrix M, there exists a decomposition such that M = UΣV, where U and V are unitary matrices and Σ is a diagonal matrix.
Singular Value Decomposition
PCA is actually a simple application of SVD. In computer vision, the 1st face recognition algorithms used PCA and SVD in order to represent faces as a linear combination of “eigenfaces”, do dimensionality reduction, and then match faces to identities via simple methods; although modern methods are much more sophisticated, many still depend on similar techniques.
10. Independent Component Analysis: ICA is a statistical technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed non-gaussian and mutually independent, and they are called independent components of the observed data.
Independent Component Analysis
ICA is related to PCA, but it is a much more powerful technique that is capable of finding the underlying factors of sources when these classic methods fail completely. Its applications include digital images, document databases, economic indicators and psychometric measurements.
Now go forth and wield your understanding of algorithms to create machine learning applications that make better experiences for people everywhere.
source: from kdnuggets.com (Good posts sometimes disappear, so I repost it here for my and for your information.)
Top 10 data mining algorithms, selected by top researchers, are explained here, including what do they do, the intuition behind the algorithm, available implementations of the algorithms, why use them, and interesting applications.
Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.
Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.
What are we waiting for? Let’s get started!
Here are the algorithms:
3. Support vector machines
9. Naive Bayes
We also provide interesting resources at the end.
What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.
Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.
What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2max, family history, etc. These are called attributes.
Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.
And here’s the deal:
Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.
Cool, so what’s a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:
Patient has a history of cancer
Patient is expressing a gene highly correlated with cancer patients
Patient has tumors
Patient’s tumor size is greater than 5cm
The bottom line is:
At each point in the flowchart is a question about the value of some attribute, and depending on those values, he or she gets classified. You can find lots of examples of decision trees.
Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.
You might be wondering how C4.5 is different than other decision tree systems?
Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.
Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.
Classifiers are great, but make sure to checkout the next algorithm about clustering…
What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar. It’s a popular cluster analysis technique for exploring a dataset.
Hang on, what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such that the group members are more similar versus non-group members. Clusters and groups are synonymous in the world of cluster analysis.
Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster analysis, these would be called observations. We know various things about each patient like age, pulse, blood pressure, VO2max, cholesterol, etc. This is a vector representing the patient.
You can basically think of a vector as a list of numbers we know about the patient. This list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood pressure another dimension and so forth.
You might be wondering:
Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc?
Want to know the best part?
You tell k-means how many clusters you want. K-means takes care of the rest.
How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.
At a high level, they all do something like this:
k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
What we have are k clusters, and each patient is now a member of a cluster.
k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
This center becomes the new centroid for the cluster.
Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
Is this supervised or unsupervised? It depends, but most would classify k-means as unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to. k-means can be semi-supervised.
Why use k-means? I don’t think many will have an issue with this:
The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.
It gets better:
k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore whether there are overlooked patterns or relationships in the dataset.
It’s not all smooth sailing:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.
Where is it used? A ton of implementations for k-means clustering are available online:
If decision trees and clustering didn’t impress you, you’re going to love the next algorithm.
3. Support vector machines
What does it do? Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high-level, SVM performs a similar task like C4.5 except SVM doesn’t use decision trees at all.
Whoa, a hyper-what? A hyperplane is a function like the equation for a line, y = mx + b. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.
As it turns out…
SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…
…SVM figures out the best hyperplane which separates your data into the 2 classes.
Do you have an example? Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.
When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.
What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.
And the coolest part?
SVM figures out the function for the hyperplane.
What if things get more complicated? Right, they frequently do. If the balls are mixed together, a straight stick won’t work.
Here’s the work-around:
Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air.
You might be wondering if this is cheating:
Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.
How does SVM do this? By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.
I found this visualization super helpful:
Reddit also has 2 great threads on this in the ELI5 and ML subreddits.
How do balls on a table or in the air map to real-life data? A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.
Here’s the deal:
If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.
The bottom line is:
SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.
Margins are often associated with SVM? What are they? The margin is the distance between the hyperplane and the 2 closest data points from each respective class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.
The key is:
SVM attempts to maximize the margin, so that the hyperplane is just as far away from red ball as the blue ball. In this way, it decreases the chance of misclassification.
Where does SVM get its name from? Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.
Is this supervised or unsupervised? This is a supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.
Where is it used? There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.
The next algorithm is one of my favorites…
What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.
What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database.
What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.
Here’s the best part:
By applying the Apriori algorithm, we can learn the grocery items that are purchased together a.k.a association rules.
The power of this is:
You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.
You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!
You might be wondering how Apriori works? Before getting into the nitty-gritty of algorithm, you’ll need to define 3 things:
The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
The second is your support or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support is called a frequent itemset.
The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.
The basic Apriori algorithm is a 3 step approach:
Join. Scan the whole database for how frequent 1-itemsets are.
Prune. Those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.
Repeat. This is repeated for each itemset level until we reach our previously defined size.
Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.
Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.
On the other hand…
The algorithm can be quite memory, space and time intensive when generating itemsets.
Where is it used? Plenty of implementations of Apriori are available. Some popular ones are the ARtool, Weka, and Orange.
The next algorithm was the most difficult for me to understand, look at the next algorithm…
What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.
In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.
OK, hang on while I explain…
I’m not a statistician, so hopefully my simplification is both correct and helps with understanding.
Here are a few concepts that will make this way easier…
What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.
Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.
In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.
Cool, what are the parameters of a model? A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.
Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.
So, all you need to describe a normal distribution are 2 parameters:
And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample.
Here’s the deal:
We don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with estimated mean and variance results in those bunch of grades.
In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.
Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade.
You’re probably wondering, what’s probability then?
Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.
In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us.
Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There a number of reasons that the data could be missing (not recorded, ignored, etc.).
Here’s the kicker:
For data mining and clustering, what’s important to us is looking at the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.
Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.
The best part is…
By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me!
How does EM help with clustering? EM begins by making a guess at the model parameters.
Then it follows an iterative 3-step process:
E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
M-step: Update the model parameters based on the cluster assignments from the E-step.
Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.
Why use EM? A key selling point of EM is it’s simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.
This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to.
EM is not without weaknesses though…
First, EM is fast in the early iterations, but slow in the later iterations.
Second, EM doesn’t always find the optimal parameters and gets stuck in local optima rather than global optima.
Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its gmm module.
What data mining does Google do? Take a look…
What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.
Yikes.. what’s link analysis? It’s a type of network analysis looking to explore the associations (a.k.a. links) among objects.
Here’s an example: The most prevalent example of PageRank is Google’s search engine. Although their search engine doesn’t solely rely on PageRank, it’s one of the measures Google uses to determine a web page’s importance.
Let me explain:
Web pages on the World Wide Web link to each other. If rayli.net links to a web page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant.
And it doesn’t stop there…
rayli.net’s votes are in turn weighted by rayli.net’s importance and relevance. In other words, any web page that’s voted for rayli.net increases rayli.net’s relevance.
The bottom line?
This concept of voting and relevance is PageRank. rayli.net’s vote for CNN increases CNN’s PageRank, and the strength of rayli.net’s PageRank influences how much its vote affects CNN’s PageRank.
What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a PageRank number isn’t disclosed by Google, we can get a sense of its relative meaning.
And here’s how:
It’s a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.
What other applications are there of PageRank? PageRank was specifically designed for the World Wide Web.
Think about it:
At its core, PageRank is really just a super effective way to do link analysis.The objects being linked don’t have to be web pages.
Here are 3 innovative applications of PageRank:
Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
Twitter developed WTF (Who-to-Follow) which is a personalized PageRank recommendation engine about who to follow.
What does it do? AdaBoost is a boosting algorithm which constructs a classifier.
As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to.
But what’s boosting? Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.
What’s the difference between a strong and weak learner? A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.
A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.
What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing patient data. The dataset contains details about the patient’s medical records.
The question is…
How can we predict whether the patient will get cancer?
Here’s how AdaBoost answers the question…
In round 1: AdaBoost takes a sample of the training dataset and tests to see how accurate each learner is. The end result is we find the best learner.
In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.
One more thing, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).
In round 2: AdaBoost again attempts to look for the best learner.
And here’s the kicker:
The sample of patient training data is now influenced by the more heavily misclassified weights. In other words, previously misclassified patients have a higher chance of showing up in the sample.
It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.
Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let’s focus all the efforts on getting the misclassified patients.
The best learner is again weighted and incorporated into the ensemble, misclassified patients are weighted so they have a higher chance of being picked and we rinse and repeat.
At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weaker learners with the labelled dataset.
Why use AdaBoost? AdaBoost is simple. The algorithm is relatively straight-forward to program.
In addition, it’s fast! Weak learners are generally simpler than strong learners. Being simpler means they’ll likely execute faster.
It’s a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.
Finally, it’s flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.
Where is it used? AdaBoost has a ton of implementations and variants. Here are a few:
If you like Mr. Rogers, you’ll like the next algorithm…
What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.
What’s a lazy learner? A lazy learner doesn’t do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify.
On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.
How does C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.
C4.5 builds a decision tree classification model during training.
SVM builds a hyperplane classification model during training.
AdaBoost builds an ensemble classification model during training.
So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data.
When new unlabeled data comes in, kNN operates in 2 basic steps:
First, it looks at the closest labeled training data points — in other words, the k-nearest neighbors.
Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.
You might be wondering…
How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There’s tons more details and papers on kNN distance metrics.
For discrete data, the idea is transform discrete data into continuous data. 2 examples of this are:
How does kNN classify new data when neighbors disagree? kNN has an easy time when all neighbors are the same class. The intuition is if all the neighbors agree, then the new data point likely falls in the same class.
I’ll bet you can guess where things get hairy…
How does kNN decide the class when neighbors don’t have the same class?
2 common techniques for dealing with this are:
Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly what we want!
Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.
Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.
But that’s just part of the story…
Here are 5 things to watch out for:
kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
Noisy data can throw off kNN classifications.
Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.
Since data processing is deferred, kNN generally requires greater storage requirements than eager classifiers.
Selecting a good distance metric is crucial to kNN’s accuracy.
Where is it used? A number of kNN implementations exist:
Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm…
9. Naive Bayes
What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:
Every feature of the data being classified is independent of all other features given the class.
What does independent mean? 2 features are independent when the value of one feature has no effect on the value of another feature.
Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if the value of all features have no effect on each other. For this dataset, it’s reasonable to assume that the patient’s height and zip code are independent, since a patient’s height has little to do with their zip code.
But let’s not stop there, are the other features independent?
Sadly, the answer is no. Here are 3 feature relationships which are not independent:
If height increases, weight likely increases.
If cholesterol level increases, weight likely increases.
If cholesterol level increases, pulse likely increases as well.
In my experience, the features of a dataset are generally not all independent.
And that ties in with the next question…
Why is it called naive? The assumption that all features of a dataset are independent is precisely why it’s called naive — it’s generally not the case that all features are independent.
What’s Bayes? Thomas Bayes was an English statistician for which Bayes’ Theorem is named after. You can click on the link to find about more about Bayes’ Theorem.
In a nutshell, the theorem allows us to predict the class given a set of features using probability.
The simplified equation for classification looks something like this:
Let’s dig deeper into this…
What does the equation mean? The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.
The equation reads: The probability of Class A given Features 1 and 2 is a fraction.
The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
The fraction’s denominator is the probability of Feature 1 multiplied by the probability of Feature 2.
The fruit can be a Banana, Orange or Other (these are the classes).
The fruit can be Long, Sweet or Yellow (these are the features).
What do you see in this training dataset?
Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.
Suppose we are told the unknown fruit is long, sweet and yellow.
Here’s how we calculate all the probabilities in 4 steps:
Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or more succinctly:
This is exactly like the equation discussed earlier.
Step 2: Starting with the numerator, let’s plug everything in.
Multiplying everything together (as in the equation), we get:
Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.
Step 4: Do a similar calculation for the other classes:
Since the is greater than , Naive Bayes would classify this long, sweet and yellow fruit as a banana.
Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.
Why use Naive Bayes? As you could see in the example above, Naive Bayes involves simple arithmetic. It’s just tallying up counts, multiplying and dividing.
Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to be effective for spam filtering.
Uses the cost-complexity method of pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with the node vs. without the node. If the cost doesn’t meet a threshold, it is pruned away.
Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.
Why use CART? Many of the reasons you’d use C4.5 also apply to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation also apply to CART as well.
Like C4.5, they are also quite fast, quite popular and the output is human readable.
Where is it used?scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.