Dataset collection for (deep) machine learning and computer vision

This page provides a collection of (image and text) datasets for (deep) machine learning and computer vision problems.

=====Image datasets=====

***Datasets for Natural Images***

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds to thousands of images; currently there are on average over five hundred images per node. The creators hope ImageNet will become a useful resource for researchers, educators, students, and everyone who shares a passion for pictures.
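Since ImageNet-style releases are organized as one folder per synset, a common way to consume them is a generic folder-per-class loader. Below is a minimal sketch using torchvision; the path and the 256/224 transform sizes are illustrative assumptions, not part of the dataset itself.

import torch
from torchvision import datasets, transforms

# Conventional preprocessing; the sizes are typical choices, not mandated.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Assumes a layout like train/<synset_id>/<image>.JPEG (hypothetical path).
dataset = datasets.ImageFolder("path/to/imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

images, labels = next(iter(loader))  # one mini-batch, shape (32, 3, 224, 224)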

    • TBA

 

***Datasets for Sketch Images***

 

 

Selected papers that used the dataset:

Xu, P., Huang, Y., Yuan, T., Pang, K., Song, Y. Z., Xiang, T., … & Guo, J. (2018). SketchMate: Deep hashing for million-scale human sketch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8090-8098). (PDF)

Try the demo!

Citation bibtex:

@article{eitz2012hdhso,
  author={Eitz, Mathias and Hays, James and Alexa, Marc},
  title={How Do Humans Sketch Objects?},
  journal={ACM Trans. Graph. (Proc. SIGGRAPH)},
  year={2012},
  volume={31},
  number={4},
  pages={44:1--44:10}
}

Abstract

We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and overfitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.

Database | New dataset: ShoeV2 (2,000 photos + 6,648 sketches)


Citation bibtex:

@inproceedings{yu2016sketch,
  title={Sketch Me That Shoe},
  author={Yu, Qian and Liu, Feng and Song, Yi-Zhe and Xiang, Tao and Hospedales, Timothy and Loy, Chen Change},
  booktitle={Computer Vision and Pattern Recognition},
  year={2016}
}

Results updated: on the Shoes dataset, acc.@1 is 52.17%; on the Chairs dataset, acc.@1 is 72.16%. Please find further details here.
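For reference, acc.@1 in this setting is the fraction of query sketches whose true photo is the single nearest neighbour. Below is a minimal sketch of the metric, assuming you already have row-aligned, L2-normalized sketch and photo embeddings (the array names are illustrative, not from the authors' code).

import numpy as np

def acc_at_1(sketch_emb, photo_emb):
    # sketch_emb[i] should retrieve photo_emb[i]; both are (N, D) unit vectors
    sims = sketch_emb @ photo_emb.T      # cosine similarities
    nearest = sims.argmax(axis=1)        # best-matching photo per sketch
    return float((nearest == np.arange(len(sketch_emb))).mean())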

Code | Demo: Try the demo!

The Sketchy Database pairs 12,500 photos (spanning 125 categories) with 75,471 crowd-sourced sketches of those photos.

Citation bibtex:

@article{sketchy2016,
  author = {Patsorn Sangkloy and Nathan Burnell and Cusuh Ham and James Hays},
  title = {The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies},
  journal = {ACM Transactions on Graphics (proceedings of SIGGRAPH)},
  year = {2016},
}

The authors present a new dataset of paired images and contour drawings for the study of visual understanding and sketch generation. The dataset contains 1,000 outdoor images, each paired with 5 human drawings (5,000 drawings in total). The drawings have strokes roughly aligned with image boundaries, making it easier to correspond human strokes with image edges.

The dataset was collected with Amazon Mechanical Turk. Turkers were asked to trace over a faded background image. To obtain high-quality annotations, the authors designed a labeling interface with a detailed instruction page including many positive and negative examples. Quality control was realized through manual inspection, treating annotations of the following types as rejection candidates: (1) missing inner boundaries, (2) missing important objects, (3) large misalignment with the original edges, (4) unrecognizable content, (5) humans drawn as stick figures, and (6) shading over empty areas. In addition to the 5,000 accepted drawings, there are therefore 1,947 rejected submissions, which can be used to set up an automatic quality guard.
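A minimal sketch of indexing the photo-drawing pairs. The directory layout and file naming below (image/<id>.jpg, sketch/<id>_<k>.png) are hypothetical assumptions; check the actual release for the real structure.

from pathlib import Path

root = Path("contour-drawings")          # hypothetical dataset root
pairs = []
for img_path in sorted((root / "image").glob("*.jpg")):
    # collect the (up to 5) human drawings traced over this photo
    drawings = sorted((root / "sketch").glob(img_path.stem + "_*.png"))
    pairs.append((img_path, drawings))

print(len(pairs), "photos,", sum(len(d) for _, d in pairs), "drawings")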

License: the dataset is licensed under CC BY-NC-SA (Attribution-NonCommercial-ShareAlike). That means you can use this dataset for non-commercial purposes, and your adapted work should be shared under similar conditions.

 

Citation bibtex:

@article{LIPS2019,
  title={Photo-Sketching: Inferring Contour Drawings from Images},
  author={Li, Mengtian and Lin, Zhe and M\v ech, Radom\'ir and Yumer, Ersin and Ramanan, Deva},
  journal={WACV},
  year={2019}
}

 

CUHK Face Sketch Database (CUFS) is for research on face sketch synthesis and face sketch recognition. It includes 188 faces from the Chinese University of Hong Kong (CUHK) student database, 123 faces from the AR database [1], and 295 faces from the XM2VTS database [2], for 606 faces in total. For each face, there is a sketch drawn by an artist based on a photo taken in a frontal pose, under normal lighting conditions, and with a neutral expression.

Sketch Dataset

 

  • TBA

 

***Datasets for Cartoon Images***

DCM772 is a public dataset of 772 annotated images extracted from 27 golden-age comic books, collected from the Digital Comics Museum (https://digitalcomicmuseum.com), a free public-domain collection of digitized comic books. The ground-truth annotations contain bounding boxes for panels and comic characters (bodies and faces), segmentation masks for balloons, and links between balloons and characters.
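A hypothetical sketch of iterating over annotations of the kinds described above (panel and character boxes, balloon masks, balloon-character links). The JSON layout and field names here are invented for illustration; consult the DCM772 release for its actual format.

import json

with open("dcm772_annotations.json") as f:   # hypothetical file name
    pages = json.load(f)

for page in pages:
    panels = [a for a in page["annotations"] if a["type"] == "panel"]
    characters = [a for a in page["annotations"] if a["type"] == "character"]
    balloons = [a for a in page["annotations"] if a["type"] == "balloon"]
    # each balloon may be linked to the character speaking it
    links = [(b["id"], b.get("speaker_id")) for b in balloons]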

 

 

  • TBA

 

***Diagrams Dataset***

  • AI2D — a dataset of illustrative diagrams for diagram understanding (Download the dataset HERE, paper PDF)

AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering. Each diagram has been densely annotated with object segmentations, diagrammatic and text elements. Each diagram has a corresponding set of questions and answers.

Abstract: Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.
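To make the DPG idea concrete, here is a toy sketch of a diagram parse graph as a typed directed graph, using networkx. The node kinds and relation names are illustrative assumptions, not the paper's exact ontology.

import networkx as nx

dpg = nx.DiGraph()
dpg.add_node("blob_1", kind="object")               # a segmented diagram element
dpg.add_node("text_1", kind="text", value="larva")  # a text label
dpg.add_node("arrow_1", kind="arrow")
dpg.add_edge("text_1", "blob_1", relation="labels")
dpg.add_edge("arrow_1", "blob_1", relation="connects")

# Question answering then reduces to reasoning over the nodes and edges.
print([n for n, d in dpg.nodes(data=True) if d["kind"] == "text"])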

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016, October). A diagram is worth a dozen images. In European Conference on Computer Vision (pp. 235-251). Springer, Cham.

Citation bibtex:

@inproceedings{kembhavi2016eccv,
  author={Aniruddha Kembhavi and Mike Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
  title={A Diagram Is Worth A Dozen Images},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2016},
}
  • TBA

***3D Datasets***

Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5828-5839).


The CVPR 2019 workshop used the ScanNet dataset:

  • TBA

=====Text datasets=====

  • TBA

=====More datasets can be found from the sources below=====

For researchers in computer vision & image processing:

There are 670+ datasets listed on CVonline.

Image/video database categories:
Action Databases
Attribute recognition
Autonomous Driving
Biological/Medical
Camera calibration
Face and Eye/Iris Databases
Fingerprints
General Images
General RGBD and depth datasets
General Videos
Hand, Hand Grasp, Hand Action and Gesture Databases
Image, Video and Shape Database Retrieval
Object Databases
People (static), human body pose
People Detection and Tracking Databases (See also Surveillance)
Remote Sensing
Scene or Place Segmentation or Classification
Segmentation
Simultaneous Localization and Mapping
Surveillance (See also People)
Textures
Urban Datasets
Other Collection Pages
Miscellaneous Topics

 

https://sedac.ciesin.columbia.edu/theme/conservation


References

  • WACV (IEEE Winter Conference on Applications of Computer Vision)

 

  • TBA

 

 

10 Standard Datasets for Practicing Applied Machine Learning

Source: machinelearningmastery.com. (Good posts sometimes disappear, so I repost it here for my own and your reference.)

 

The key to getting good at applied machine learning is practicing on lots of different datasets.

This is because each problem is different, requiring subtly different data preparation and modeling methods.

In this post, you will discover 10 top standard machine learning datasets that you can use for practice.

Let’s dive in.

Overview

A Structured Approach

Each dataset is summarized in a consistent way, which makes them easy to compare and navigate when you want to practice a specific data preparation technique or modeling method.

The aspects that you need to know about each dataset are:

  1. Name: How to refer to the dataset.
  2. Problem Type: Whether the problem is regression or classification.
  3. Inputs and Outputs: The numbers and known names of input and output features.
  4. Performance: Baseline performance for comparison using the Zero Rule algorithm (sketched after this list), as well as best known performance (if known).
  5. Sample: A snapshot of the first 5 rows of raw data.
  6. Links: Where you can download the dataset and learn more.
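A minimal sketch of the Zero Rule baseline mentioned above: predict the most frequent class for classification, or the mean for regression. This is a from-scratch illustration, not a particular library's API.

from collections import Counter

def zero_rule_classification(y_train, X_test):
    # always predict the most frequent training class
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority for _ in X_test]

def zero_rule_regression(y_train, X_test):
    # always predict the training mean
    mean = sum(y_train) / len(y_train)
    return [mean for _ in X_test]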

Standard Datasets

Below is a list of the 10 datasets we’ll cover.

Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets consist of tabular data with no (explicitly) missing values.

  1. Swedish Auto Insurance Dataset.
  2. Wine Quality Dataset.
  3. Pima Indians Diabetes Dataset.
  4. Sonar Dataset.
  5. Banknote Dataset.
  6. Iris Flowers Dataset.
  7. Abalone Dataset.
  8. Ionosphere Dataset.
  9. Wheat Seeds Dataset.
  10. Boston House Price Dataset.

1. Swedish Auto Insurance Dataset

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. It comprises 63 observations with one input variable and one output variable. The variable names are as follows:

  1. Number of claims.
  2. Total payment for all claims in thousands of Swedish Kronor.

The baseline performance of predicting the mean value is an RMSE of approximately 72.251 thousand Kronor.
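A minimal sketch of reproducing this mean-prediction baseline with scikit-learn; "insurance.csv" is a placeholder path for wherever you saved the two-column data.

import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

data = pd.read_csv("insurance.csv", header=None, names=["claims", "payment"])
X, y = data[["claims"]], data["payment"]

baseline = DummyRegressor(strategy="mean").fit(X, y)
rmse = np.sqrt(mean_squared_error(y, baseline.predict(X)))
print(f"mean-prediction RMSE: {rmse:.3f} thousand Kronor")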

Figure: scatter plot of the entire Swedish Auto Insurance Dataset (number of claims vs. total payment).

2. Wine Quality Dataset

The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.

It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).

The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.


3. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.


4. Sonar Dataset

The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:

  1. Sonar returns at different angles
  2. Class (M for mine and R for rock)

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.


5. Banknote Dataset

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Variance of Wavelet Transformed image (continuous).
  2. Skewness of Wavelet Transformed image (continuous).
  3. Kurtosis of Wavelet Transformed image (continuous).
  4. Entropy of image (continuous).
  5. Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.
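To make the feature names concrete, here is a sketch of computing the same four statistics from a wavelet-transformed grayscale image, using PyWavelets and SciPy. The wavelet choice, histogram binning, and the random stand-in image are assumptions; the published features came from the original authors' pipeline.

import numpy as np
import pywt
from scipy.stats import skew, kurtosis, entropy

img = np.random.rand(200, 200)        # stand-in for a grayscale banknote scan
cA, _ = pywt.dwt2(img, "haar")        # approximation coefficients of a 2-D DWT
flat = cA.ravel()

hist, _ = np.histogram(flat, bins=64, density=True)
features = [flat.var(), skew(flat), kurtosis(flat), entropy(hist + 1e-12)]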


6. Iris Flowers Dataset

The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.

It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Sepal length in cm.
  2. Sepal width in cm.
  3. Petal length in cm.
  4. Petal width in cm.
  5. Class (Iris Setosa, Iris Versicolour, Iris Virginica).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33% (the three classes are perfectly balanced).
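You can verify this balance with the copy of Iris bundled with scikit-learn:

import numpy as np
from sklearn.datasets import load_iris

y = load_iris().target
baseline_acc = np.mean(y == np.bincount(y).argmax())
print(f"{baseline_acc:.1%}")  # 33.3%: 50 observations in each of the 3 classes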


7. Abalone Dataset

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.

It is a multi-class classification problem, but can also be framed as a regression. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:

  1. Sex (M, F, I).
  2. Length.
  3. Diameter.
  4. Height.
  5. Whole weight.
  6. Shucked weight.
  7. Viscera weight.
  8. Shell weight.
  9. Rings.

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.


8. Ionosphere Dataset

The Ionosphere Dataset requires the prediction of structure in the atmosphere given radar returns targeting free electrons in the ionosphere.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:

  1. 17 pairs of radar return data.
  2. Class (g for good and b for bad).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.


9. Wheat Seeds Dataset

The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

  1. Area.
  2. Perimeter.
  3. Compactness.
  4. Length of kernel.
  5. Width of kernel.
  6. Asymmetry coefficient.
  7. Length of kernel groove.
  8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33% (the three classes are balanced).


10. Boston House Price Dataset

The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:

  1. CRIM: per capita crime rate by town.
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of nonretail business acres per town.
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. NOX: nitric oxides concentration (parts per 10 million).
  6. RM: average number of rooms per dwelling.
  7. AGE: proportion of owner-occupied units built prior to 1940.
  8. DIS: weighted distances to five Boston employment centers.
  9. RAD: index of accessibility to radial highways.
  10. TAX: full-value property-tax rate per $10,000.
  11. PTRATIO: pupil-teacher ratio by town.
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. LSTAT: % lower status of the population.
  14. MEDV: Median value of owner-occupied homes in $1000s.

The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.


Summary

In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.

Here is your next step:

  1. Pick one dataset.
  2. Grab your favorite tool (like Weka, scikit-learn or R).
  3. See by how much you can beat the standard scores.
  4. Report your results in the comments below.