PostgreSQL — column names of a table are case-sensitive

All identifiers (including column names) that are not double-quoted are folded to lower case in PostgreSQL. Column names that were created with double quotes, and thereby retained upper-case letters (and/or other syntax violations), have to be double-quoted for the rest of their lives. So PostgreSQL column names are case-sensitive:

SELECT * FROM people WHERE "first_Name" = 'xyz';

Note that 'xyz' does not need to be double-quoted. Values (string literals) are enclosed in single quotes.

If you try to run this query (first_Name is a column in the table people):

SELECT * FROM people WHERE first_Name='xyz';

you will get this error (note how the unquoted identifier is folded to lower case):

ERROR: column "first_name" does not exist

 

The advice is to use legal, lower-case names exclusively so that double-quoting is never needed.
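For completeness, here is a minimal sketch of the same query from Python, assuming the third-party psycopg2 package and a reachable database (the connection parameters are hypothetical). The mixed-case column name must stay double-quoted inside the SQL string, while the value is passed as a bound parameter rather than quoted by hand:

import psycopg2

# Hypothetical connection parameters -- replace with your own.
conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = conn.cursor()
# The identifier "first_Name" keeps its double quotes; the value is a bound parameter.
cur.execute('SELECT * FROM people WHERE "first_Name" = %s', ("xyz",))
print(cur.fetchall())
cur.close()
conn.close()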

 

(Thanks, Scott, for some hints.)

 

References:

http://stackoverflow.com/questions/20878932/are-postgresql-column-names-case-sensitive

Install Ubuntu Server 16.04 LTS

To install Ubuntu Server 16.04, first download the Ubuntu Server 16.04 LTS ISO file and create a bootable USB. If you do not know how to do this, refer to this guide about installing Ubuntu Server from USB (check the section “Create a Bootable USB Installer” in the post; if it is not accessible, check this pdf version). Once you have the bootable USB ready, pop it in and make your system boot from USB (through the system BIOS).

ATTN: Check my post about boot mode first before proceeding to the installation steps.

1: Select Ubuntu Installer Language

Install Ubuntu 16.04 Server – Installer Language

2: Install Ubuntu Server 16.04

Start Ubuntu Server Installation. Click on Install Ubuntu Server to continue. For normal installations, you do not have to mess with any of the advanced options.

3: Select Operating System Language

4: Select Server Location

This helps in determining the timezone.

5: Keyboard Detection

6: Network configuration

If your network has two interfaces and you are not sure which one to use, just select one and try; if it fails, go back and select the other.

Note: if you encounter the error below, one possible cause is a loose Ethernet cable. Reseat the cable and reinstall the OS, and it should work.

Network autoconfiguration failed Your network is probably not using the DHCP protocol. Alternatively, the DHCP server may be slow or some network hardware is not working properly. <Continue>

If the Ethernet cable is definitely not loose, then it is most probably the issue I described in this post: change the boot mode to Legacy Boot (i.e., disable UEFI mode) and turn Secure Boot off. Reinstall the OS after setting the boot mode and boot order, and you should then see the success screen. :)

7: Hostname

Enter the hostname of the system. In this example, my server is named server1.psu.edu, so I enter server1.psu.edu

8: Server User Fullname

Provide a full name for the primary account. This is not the root (administrator) user, but this user can temporarily gain admin privileges using the sudo command.

Ubuntu does not let you log in as the root user directly. Therefore, create a new system user here for the initial login. I will create a user with the full name Administrator and user name administrator (don’t use the user name admin, as it is a reserved name on Ubuntu Linux):

9: Server Username

Provide the login username for the primary account.

10: Server Password

Select a password for the user created above.

Make sure you set a strong password with a mix of upper case letters, lower case letters, numbers, and special characters.

11: Home Directory Encryption

In general, you do not have to encrypt your home directory. But you may do so if you want to.

12: Confirm Timezone

The installer should automatically pick your timezone. Confirm the choice and move forward.

13: Ubuntu Server Drive Partitioning

This is one of the key steps and probably the most complicated of all. For now, ensure that you have at least 8 GB (headless – no desktop environment) or 10 GB (with desktop environment) of space. In this guide, assume that the server has 2 physical hard disks (sda and sdb). sda is the drive that will be formatted for the Ubuntu Server 16.04 installation; sdb is a data storage drive (it will not be formatted). Choose the Manual partitioning method and proceed.

Primary partition vs. logical partition

For Linux (Ubuntu), you need a minimum of TWO partitions (root and swap). The recommendation is to have three (root, swap, and home).
You also need to understand that it is not possible to have more than 4 primary partitions (on an MBR disk), so you may need to create an extended partition, inside of which you can create many other logical partitions.

If you are not familiar with partition, check this post: Manual Disk Partitioning Guide for Ubuntu Server Edition (pdf if the page is not accessible).

  • root partition (/): the bulk of the programs used for running the system will be installed here.
  • home partition (/home): the partition where your home directory will be located. In the course of using the system, the files and folders you create will reside in various folders here.
  • swap partition (swap): unformatted disk space for use as virtual memory. The swap partition should be at least as big as your RAM size.

14: Confirm Partition Scheme

15: Write the Partitions to Disk

Because partitioning is critical, you will be asked one more time to confirm before partitions will be written to the hard disk.

16: Base Ubuntu 16.04 Server Installation

After partitioning, the installer continues to install the Ubuntu Server 16.04 base system. Nothing to do here other than wait for it to complete.

17: Setup HTTP Proxy

In a typical Ubuntu Server setup, this is generally not needed. So, leave it blank and continue to install Ubuntu 16.04 Server.

Leave the HTTP proxy line empty unless you’re using a proxy server to connect to the Internet:

18: APT Repository Configuration

Ubuntu software is installed from the APT repositories. Wait for the installer to configure them.

19: Setup Automatic Updates

Ubuntu Server can automatically install updates when they are available. While this can break things sometimes, installing just the security updates should be fine. So I recommend installing security updates automatically on your Ubuntu server.

20: Ubuntu Server Tasksel

After partitioning, this is the step that requires most user intervention. You will have to select what services you want to install on your Ubuntu 16.04 Server. “Standard System Utilities” should already be selected. In addition, there are some other server services you can install, for example, LAMP server, Samba file server, and OpenSSH Server.

LAMP server installs the Apache web server, MySQL database server, and PHP. MySQL may be needed for running a dynamic website. Samba file server will allow you to share files with Windows systems.

OpenSSH will allow you to remotely connect and administer your Ubuntu Server through SSH.

The items I select here are OpenSSH Server and Standard System Utilities so that I can immediately connect to the system with SSH or an SSH client such as PuTTY after the installation has finished:

21: Ubuntu Server Installation

The Ubuntu Server installer will continue to set up the server packages you selected in Tasksel. Nothing to do here other than wait.

22: Ubuntu Server Installation Continues

Once again, nothing much to do here. Let the installation continue.

23: GRUB Notification

Select Yes when you are asked “Install the GRUB boot loader to the master boot record?”

GRUB is the boot menu that is shown immediately after your Ubuntu Server powers on. It shows a list of all OSes installed on the system. It is installed to the hard drive containing the OS. In most cases this is /dev/sda.

24: Installing GRUB

As explained above, it is safe to install GRUB to /dev/sda. No input needed. Let the installer do its thing.

25: Finishing Ubuntu Server Installation

Again, nothing to do here. Let the installer finish a few things to complete setting up Ubuntu 16.04 LTS server.

The base system installation is now finished.

26: Reboot Ubuntu Server

Yay! You are done. After the installation completes, hit Continue to reboot the server.

Note: if your Ubuntu Server 16.04 LTS shows a black screen after reboot:

Try pressing <Ctrl><Alt><F2> (simultaneously) to see whether you can switch to a different console.

27: Ubuntu GRUB Menu

As explained above you will see the GRUB menu while booting. Nothing to do here. Your server will automatically boot to Ubuntu, which is the default setting.

28: First Login

When you are presented with the login screen, enter the username and password you created earlier in the Ubuntu 16.04 Xenial Xerus installation process. Remember that the Linux command line does not show anything (not even asterisks) while you type a password.

Now log in on the shell (or remotely via SSH) as the user “administrator”.

29: Ubuntu Server Headless Commandline

You should be on your Ubuntu Server command line after a successful login.

30: Get root Privileges

After the reboot, you can log in with your previously created username (e.g., administrator). Because we must run all the steps from this tutorial with root privileges, we can either prefix every command in this tutorial with sudo, or become root right now by typing:

sudo -s

(You can also enable the root login by running

sudo passwd root

and giving root a password. You can then log in directly as root, but this is frowned upon by the Ubuntu developers and community for various reasons. See https://help.ubuntu.com/community/RootSudo.)

About Root

The root user is the administrative user in a Linux environment that has very broad privileges. Because of the heightened privileges of the root account, you are actually discouraged from using it on a regular basis. This is because part of the power inherent with the root account is the ability to make very destructive changes, even by accident.

The next step is to set up an alternative user account with a reduced scope of influence for day-to-day work, and to show how to gain increased privileges during the times when you need them.

31: Create a New User

This example creates a new user called “sam”, but you should replace it with a username that you like:

adduser sam

You will be asked a few questions, starting with the account password.

Enter a strong password and, optionally, fill in any of the additional information if you would like. This is not required and you can just hit ENTER in any field you wish to skip.

32: Add root privileges to the new user you just created

Now, we have a new user account with regular account privileges. However, we may sometimes need to do administrative tasks.

To avoid having to log out of our normal user and log back in as the root account, we can set up what is known as “superuser” or root privileges for our normal account. This will allow our normal user to run commands with administrative privileges by putting the word sudo before each command.

To add these privileges to our new user, we need to add the new user to the “sudo” group. By default, on Ubuntu 16.04, users who belong to the “sudo” group are allowed to use the sudo command.

As root, run this command to add your new user to the sudo group (substitute sam with your new username):

  • usermod -aG sudo sam

Now your user can run commands with superuser privileges! For more information about how this works, check out this sudoers tutorial.

 

33: Install the SSH Server (Optional)

If you did not select the OpenSSH Server during the system installation above, you can do it now:

apt-get install ssh openssh-server

From now on you can use an SSH client such as PuTTY on Windows; on macOS and Linux, you can SSH directly from your terminal by typing the command:

ssh your_account_name@yourserver_ip or

ssh your_account_name@hostname_of_yourserver
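If you prefer to script the connection, here is a minimal Python sketch using the third-party paramiko library (the hostname and credentials below are placeholders):

import paramiko

client = paramiko.SSHClient()
# Auto-accepting unknown host keys is fine for a quick test; verify keys in production.
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("yourserver_ip", username="your_account_name", password="your_password")
stdin, stdout, stderr = client.exec_command("uname -a")  # run a command remotely
print(stdout.read().decode())
client.close()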

 

34: How to reboot the Server

Once you’ve installed Ubuntu Server you should make sure the server can boot properly. So type the following command at the prompt to reboot the server:

sudo reboot

Once it’s rebooted and assuming everything’s working fine you’ll end up back at the command prompt. Now you can disconnect the keyboard and screen, but keep the Ethernet cable plugged in.


35: How to shutdown the Server with command line

Note: if you are using VNC to connect to your server, DO NOT shut down your server through the GUI; that can cause trouble. Instead, issue this command in your terminal to shut down the Ubuntu server.

sudo shutdown -h now

You can use the -P switch with shutdown to power off the computer.

sudo shutdown -P now

What if you want to shut down forcefully, i.e., without waiting for processes to close normally? In that case you can use:

sudo poweroff -f

 

There are 35 steps to install Ubuntu 16.04 Server, and most steps require no interaction from you. The whole installation can be done in less than 30 minutes (assuming you do not encounter any errors during the installation).

 

Posts referenced: 

Screenshot Guide: Install Ubuntu Server 16.04 LTS Xenial Xerus (htpcbeginner.com)

How to install a Ubuntu 16.10 (Yakkety Yak) Minimal Server (howtoforge.com)

Initial Server Setup with Ubuntu 16.04 (digitalocean.com)

How To Edit the Sudoers File on Ubuntu and CentOS (digitalocean.com)

How to install Ubuntu Server – Xenial Xerus 16.04LTS (havetheknowhow.com)

Installing Ubuntu Server for general use (ubuntu.com)

How to install Ubuntu server 16.04 and the Webmin GUI (techrepublic.com)

How to Install Ubuntu Server (wikihow.com)

Installing Ubuntu Server from USB (htpcbeginner.com)

Linux: How to Install Ubuntu Linux Server 16.04 LTS (techonthenet.com)

Manual Disk Partitioning Guide for Ubuntu Server Edition

Choosing partition types for swap and root and choosing device for bootloader installation

 

Look at this before installing Ubuntu 16.04 LTS (Desktop and Server versions)

Known issues for Ubuntu 16.04:

It is currently impossible to boot a Xen hypervisor from GRUB in UEFI mode. However, the package does not detect this and will set the default boot to Xen mode.

So for any machine in UEFI mode, do not install the Xen hypervisor (or enable legacy mode first).

For Dell computers the steps to change boot mode are below:

  • Restart the computer and press F2 (this key is for Dell computers; others might differ) while starting up. This enters the setup program.
  • Select the Boot tab and change to Legacy Boot (i.e., disable UEFI mode), with Secure Boot off. After restarting, this shows the boot sequence, and the order can be changed using the +/- keys.

(In my case, choosing Legacy Boot turned Secure Boot off automatically, but it is better to double-check that Secure Boot is off.)

  • I then change the boot order to: CD/DVD, USB drive, hard drive.

 

How to Enable Secure Shell (SSH) on Ubuntu Server 16.04 LTS

The SSH service is not enabled by default in Ubuntu Desktop and Ubuntu Server, but you can easily enable it with just one command.

Note: if you chose to install OpenSSH Server when you installed your Ubuntu Server (i.e., on the software selection page after the base system installation), just skip this post.

Log into Ubuntu server and run the command:

sudo apt-get install openssh-server

This installs the OpenSSH server, and SSH remote access will be enabled automatically.

You can check its status by the command: sudo service ssh status

OpenSSH is a FREE version of the SSH connectivity tools that technical users of the Internet rely on. Users of telnet, rlogin, and ftp may not realize that their password is transmitted across the Internet unencrypted, but it is. OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other attacks. Additionally, OpenSSH provides secure tunneling capabilities and several authentication methods, and supports all SSH protocol versions.

You can change some settings (e.g., the listening port, and root login permission) by editing the configuration file with the command:

sudo nano /etc/ssh/sshd_config

Remember to apply the changes by restarting or reloading SSH with the command:

sudo /etc/init.d/ssh restart

 

 

 

 

Regularization and Bias/Variance

 

in category: Machine Learning_tricks4better performance

Source: from the paper by Prof. Domingos:

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. (PDF)

 

Source for the content below: the Machine Learning course on Coursera by Dr. Andrew Ng.

See this post for how regularization can help prevent over-fitting. But how does it affect the bias and variance of a learning algorithm? This post will go deeper into the issue of bias and variance and talk about how they interact with, and are affected by, the regularization of your learning algorithm.

If lambda is small, then we are not using much regularization and we run a larger risk of overfitting; whereas if lambda is large (that is, if we are on the right part of the horizontal axis), we run a higher risk of having a high-bias problem. If you plot J_train and J_cv as functions of lambda, you find that for small values of lambda you can fit the training set relatively well, because you are barely regularizing: the regularization term basically goes away, and you are minimizing pretty much just the squared-error cost. So when lambda is small, you end up with a small value of J_train, whereas if lambda is large, you have a high-bias problem and might not fit your training set that well, so you end up with a large value. In other words, J_train(theta) tends to increase as lambda increases, because a large value of lambda corresponds to high bias (you might not even fit the training set well), while a small value of lambda corresponds to really fitting, say, a very high-degree polynomial to your data. The cross-validation error J_cv(theta), in contrast, is large at both extremes (overfitting on the left, underfitting on the right) and smallest at some intermediate value of lambda.

 

 

When I am trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting J_train and J_cv against lambda, as described above, helps me understand better what is going on, and helps me verify that I am indeed picking a good value for the regularization parameter.
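To make this concrete, here is a minimal Python sketch (my own illustration, assuming scikit-learn is installed; it is not from the course) that sweeps the regularization strength of ridge regression on toy data and prints the training and cross-validation errors, which should roughly follow the pattern described above:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)  # noisy toy data

for lam in [1e-4, 1e-2, 1, 10, 100]:
    # A high-degree polynomial fit, regularized by lambda (called alpha in scikit-learn).
    model = make_pipeline(PolynomialFeatures(degree=8), Ridge(alpha=lam))
    model.fit(X, y)
    j_train = mean_squared_error(y, model.predict(X))
    # 5-fold cross-validation error as a stand-in for J_cv.
    j_cv = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    print("lambda=%g  J_train=%.3f  J_cv=%.3f" % (lam, j_train, j_cv))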

Create a hash table for large data in python

This post introduces how to create a hash table in python.

Text files can contain duplicate lines, which would simply overwrite existing keys in your dictionary (the Python name for a hash table). We can create a unique set of the keys, and then use a dictionary comprehension to populate the dictionary.

sample_file.txt:

a
b
c
c

 

Python code:

with open("sample_file.txt") as f:
  keys = set(line.strip() for line in f.readlines())
my_dict = {key: 1 for key in keys if key}

>>> my_dict
{'a': 1, 'b': 1, 'c': 1}
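For genuinely large files, you can also skip the intermediate set and build the dictionary in a single streaming pass, since duplicate lines simply overwrite the same key:

my_dict = {}
with open("sample_file.txt") as f:
    for line in f:          # stream the file line by line
        key = line.strip()
        if key:             # skip empty lines
            my_dict[key] = 1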

Great Apache Spark tutorial videos on YouTube

This post provides some great Apache Spark videos available on YouTube.

Sameer Farooqui delivers a hands-on tutorial using Spark SQL and DataFrames to retrieve insights and visualizations from datasets published by the City of San Francisco. [Topics Indexed Below]

The labs are targeted at an audience with some general programming or SQL query experience, but little to no experience with Spark. Sameer will begin with some brief theory and lecture on Spark, before diving into several demos performing visualizations and analysis on calls made to the San Francisco Fire Department on July 4th.

Follow Along:
+ Databricks Community Edition: https://databricks.com/try
+ Labs: https://bit.ly/sfopenlabs
+ Learning Material: https://bit.ly/sfopenreadalong

—–Jump to Topic—–
00:00:06 – Workshop Intro & Environment Setup
00:13:06 – Brief Intro to Spark
00:17:32 – Analysis Overview: SF Fire Department Calls for Service
00:23:22 – Analysis with PySpark DataFrames API
00:29:32 – Doing Date/Time Analysis
00:47:53 – Memory, Caching and Writing to Parquet
01:00:40 – SQL Queries
01:21:11 – Convert a Spark DataFrame to a Pandas DataFrame
—–Q & A—–
01:24:43 – Spark DataFrames vs. SQL: Pros and Cons?
01:26:57 – Workflow for Chaining Databricks notebooks into Pipeline?
01:30:27 – Is Spark 2.0 ready to use in production?

———————————————————————————————-
SPARK 2.0 TRAINING | NewCircle | Onsite & Public Classes
———————————————————————————————-
+ Programming for Spark 2.0 (3 days)
+ Spark 2.0 for Machine Learning & Data Science (3 days)
Learn more: https://newcircle.com/category/apache…

++Code for San Francisco++
http://www.meetup.com/Code-for-San-Fr…

++Learn more about Databricks++
https://databricks.com/product/databr…

All Apache Spark courses from NewCircle training:

Adam Breindel, lead Spark instructor at NewCircle, talks about which APIs to use for modern Spark with a series of brief technical explanations and demos that highlight best practices, latest APIs, and new features. (Topics Indexed Below)

We’ll look at how Dataset and DataFrame behave in Spark 2.0, Whole-Stage Code Generation, and go through a simple example of Spark 2.0 Structured Streaming (Streaming with DataFrames) that you can run in your own free instance of Databricks.

00:00:40 – Intro: What is “Modern Spark”
00:01:26 – DataFrame
00:05:07 – Why not use RDD?
00:09:15 – Intro to DataFrame and Dataset
00:10:13 – DataFrame versus Dataset
00:14:42 – Dataset Queries and Dataset with Scala classes
00:19:07 – Spark Query Optimizer
00:23:26 – Whole-Stage Codegen
00:27:21 – Hive integration
00:29:28 – Wrapping Up DataFrame/Dataset Benefits
00:30:54 – One More Thing – Structured Streaming
00:36:47 – Conclusion

Try the Examples:
+ Databricks Community Edition: https://databricks.com/try
+ Get this Notebook: https://bit.ly/get-notebook

———————————————————————————————-
SPARK 2.0 TRAINING | NewCircle | Onsite & Public Classes
———————————————————————————————-
+ Programming for Spark 2.0 (3 days):
http://bit.ly/spark-prog-newcircle

+ Spark 2.0 for Machine Learning & Data Science (3 days):
http://bit.ly/spark-ml-newcircle

 

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL’s query optimizer, to all users of Spark. I’ll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” – Michael

Slides: http://www.slideshare.net/databricks/…

Databricks Blog: “Deep Dive into Spark SQL’s Catalyst Optimizer”
https://databricks.com/blog/2015/04/1…

// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

Follow Michael on –
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelar…

 

 

Model-free vs. Model-based Methods

As Kahneman (2011) pointed out in his book “Thinking, Fast and Slow”, we have two modes of thinking: fast and slow. For example, we do not need to think much about how to walk or how to eat, but we do need to think slowly about complex tasks such as planning our travel routes.

In reinforcement learning, there are two main categories of methods: model-free and model-based.

  • Model-free methods: never learn the task T and environment E explicitly. At the end of learning, the agent knows how to act, but does not explicitly know anything about the environment. Most deep reinforcement learning algorithms (e.g., DQN) are model-free methods.
  • Model-based methods: explicitly learn the task T (see model-based reasoning to get a sense of it).

AlphaGo involves both model-free components (a Convolutional Neural Network (CNN)) and model-based components (Monte Carlo Tree Search (MCTS)). In fact, AlphaGo is pretty similar to how we humans think: it involves both fast intuition (i.e., the CNN-based evaluation) and careful, slow thinking (i.e., MCTS).

Combining model-free and model-based methods (fast intuition plus careful planning) should probably be the way to go for solutions to many real-world problems.

 

References:

Kahneman, Daniel. Thinking, Fast and Slow. Macmillan, 2011.

 

 

Overfitting and Underfitting With Machine Learning Algorithms

Source: from machinelearningmastery.com (good posts sometimes disappear, so I repost it here for my own and your information).

in category: Machine Learning_tricks4better performance

The cause of poor performance in machine learning is either overfitting or underfitting the data.

In this post you will discover the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it.

Let’s get started.

Approximate a Target Function in Machine Learning

Supervised machine learning is best understood as approximating a target function (f) that maps input variables (X) to an output variable (Y).

Y = f(X)

This characterization describes the range of classification and prediction problems and the machine learning algorithms that can be used to address them.

An important consideration in learning the target function from the training data is how well the model generalizes to new data. Generalization is important because the data we collect is only a sample; it is incomplete and noisy.

Generalization in Machine Learning

In machine learning we describe the learning of the target function from training data as inductive learning.

Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve. This is different from deduction, which works the other way around and seeks to learn specific concepts from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

There is terminology used in machine learning to describe how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting.

Overfitting and underfitting are the two biggest causes for poor performance of machine learning algorithms.

Statistical Fit

In statistics a fit refers to how well you approximate a target function.

This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.

Statistics often describe the goodness of fit which refers to measures used to estimate how well the approximation of the function matches the target function.

Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning.

If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model’s ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
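As a quick illustration (my own sketch, not from the original post), here is how limiting a decision tree's depth in scikit-learn constrains how much detail it learns; cost-complexity pruning via the ccp_alpha parameter is another option:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [None, 3]:  # None lets the tree grow fully and memorize details
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth=%s  train acc=%.3f  test acc=%.3f"
          % (depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))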

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed because it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a good contrast to the concept of overfitting.

A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine learning algorithm over time as it learns from training data. We can plot both the skill on the training data and the skill on a test dataset that we have held back from the training process.

Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time, the error for the test set starts to rise again as the model’s ability to generalize decreases.

The sweet spot is the point just before the error on the test dataset starts to increase where the model has good skill on both the training dataset and the unseen test dataset.

You can perform this experiment with your favorite machine learning algorithms. However, this is often not a useful technique in practice, because by choosing the stopping point for training using the skill on the test dataset, the test set is no longer “unseen” or a standalone objective measure. Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure.
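Here is a minimal sketch of that experiment (my own illustration, assuming scikit-learn), training a linear classifier incrementally and recording train and test error as epochs accumulate; the exact shape of the curves depends on the data and model:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)
for epoch in range(1, 31):
    clf.partial_fit(X_train, y_train, classes=classes)  # one pass over the training data
    if epoch % 10 == 0:
        print("epoch %d  train err=%.3f  test err=%.3f"
              % (epoch, 1 - clf.score(X_train, y_train), 1 - clf.score(X_test, y_test)))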

There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.

How To Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:

  1. Use a resampling technique to estimate model accuracy.
  2. Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k-times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data.
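For example, a minimal scikit-learn sketch of 5-fold cross-validation (my own illustration; the dataset and model are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Train and test 5 times on different subsets of the data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))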

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.


Summary

In this post you discovered that machine learning is solving problems by the method of induction.

You learned that generalization is a description of how well the concepts learned by a model apply to new data. Finally you learned about the terminology of generalization in machine learning of overfitting and underfitting:

  • Overfitting: good performance on the training data, poor generalization to other data.
  • Underfitting: poor performance on the training data and poor generalization to other data.

Do you have any questions about overfitting, underfitting or this post? Leave a comment and ask your question and I will do my best to answer it.