Dataset collection for (deep) machine learning and computer vision

This page provides a collection of (image and text) datasets for (deep) machine learning and computer vision problems.

=====Image datasets======

***Dataset for Natural Images******

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The database currently averages over five hundred images per node. Its creators hope ImageNet will become a useful resource for researchers, educators, students and anyone who shares their passion for pictures.

    • TBA


***Dataset for Sketch images******



Selected papers used the dataset:

Xu, P., Huang, Y., Yuan, T., Pang, K., Song, Y. Z., Xiang, T., … & Guo, J. (2018). Sketchmate: Deep hashing for million-scale human sketch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8090-8098). (PDF)


Citation bibtex:

author={Eitz, Mathias and Hays, James and Alexa, Marc},
title={How Do Humans Sketch Objects?},
journal={ACM Trans. Graph. (Proc. SIGGRAPH)},
pages = {44:1--44:10}


We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and overfitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
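The triplet ranking idea in the abstract above can be illustrated with a generic hinge-style loss: a sketch embedding should lie closer to its matching photo than to a non-matching one by some margin. The following is only a minimal sketch under stated assumptions. The Euclidean distance, the margin value of 0.3, and the function names are illustrative choices, not the authors' actual configuration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_ranking_loss(sketch, photo_pos, photo_neg, margin=0.3):
    """Generic hinge loss: max(0, margin + d(sketch, pos) - d(sketch, neg)).

    The margin value is an assumption chosen for illustration.
    """
    d_pos = euclidean(sketch, photo_pos)
    d_neg = euclidean(sketch, photo_neg)
    return max(0.0, margin + d_pos - d_neg)

# Toy 2-D embeddings (hypothetical values).
sketch = [0.0, 0.0]
close_photo = [0.0, 1.0]   # distance 1 from the sketch
far_photo = [3.0, 4.0]     # distance 5 from the sketch

# Correct ordering: the matching photo is much closer, so the loss is zero.
loss_ok = triplet_ranking_loss(sketch, close_photo, far_photo)
# Wrong ordering: the "matching" photo is farther away, so the loss is positive.
loss_bad = triplet_ranking_loss(sketch, far_photo, close_photo)
```

The loss is zero whenever the triplet is already ranked correctly with room to spare, so only violated or near-violated triplets would contribute to training.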

Database: New dataset! (ShoeV2: 2,000 photos + 6,648 sketches)


Citation bibtex:

  			title={Sketch Me That Shoe},
  			author={Yu, Qian and Liu, Feng and Song, Yi-Zhe and Xiang, Tao and Hospedales, Timothy and Loy, Chen Change},
  			booktitle={Computer Vision and Pattern Recognition},

Results Updated: On Shoes dataset, acc.@1 is 52.17%. On Chairs dataset, acc.@1 is 72.16%. Please find further details here (Extra comment 1).
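For readers unfamiliar with the acc.@1 figures quoted above, the metric is commonly understood as the fraction of query sketches whose true-match photo appears within the top K retrieved results. The snippet below is a minimal sketch under that assumed definition. The function name and the toy ranked lists are hypothetical and are not taken from the released code.

```python
def accuracy_at_k(ranked_lists, true_ids, k=1):
    """Fraction of queries whose true match is within the top-k results.

    ranked_lists: one list of photo ids per query, best match first.
    true_ids: the ground-truth photo id for each query.
    """
    hits = sum(
        1 for ranked, true_id in zip(ranked_lists, true_ids)
        if true_id in ranked[:k]
    )
    return hits / len(true_ids)

# Four toy queries over three candidate photos (hypothetical ids).
ranked = [
    ["p3", "p1", "p2"],
    ["p2", "p1", "p3"],
    ["p1", "p3", "p2"],
    ["p2", "p3", "p1"],
]
truth = ["p3", "p1", "p1", "p3"]

acc1 = accuracy_at_k(ranked, truth, k=1)  # queries 1 and 3 hit at rank 1
acc2 = accuracy_at_k(ranked, truth, k=2)  # every true match is within rank 2
```

Under this reading, an acc.@1 of 52.17% would mean that roughly half of the query sketches retrieve their paired photo as the top result.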



Citation bibtex:

 author = {Patsorn Sangkloy and Nathan Burnell and Cusuh Ham and James Hays},
 title = {The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies},
 journal = {ACM Transactions on Graphics (proceedings of SIGGRAPH)},
 year = {2016},

The authors present a new dataset of paired images and contour drawings for the study of visual understanding and sketch generation. In this dataset, there are 1,000 outdoor images and each is paired with 5 human drawings (5,000 drawings in total). The drawings have strokes roughly aligned for image boundaries, making it easier to correspond human strokes with image edges.

The dataset was collected with Amazon Mechanical Turk. The Turkers were asked to trace over a faint background image. To obtain high-quality annotations, the authors designed a labeling interface with a detailed instruction page including many positive and negative examples. Quality control was carried out through manual inspection, treating annotations of the following types as rejection candidates: (1) missing inner boundaries, (2) missing important objects, (3) large misalignment with the original edges, (4) unrecognizable content, (5) humans drawn as stick figures, (6) shading on empty areas. As a result, in addition to the 5,000 accepted drawings, there are 1,947 rejected submissions, which can be used in setting up an automatic quality guard.

License: the dataset is licensed under CC BY-NC-SA (Attribution-NonCommercial-ShareAlike). That means you can use this dataset for non-commercial purposes, and your adapted work should be shared under similar conditions.


Citation bibtex:

  title={Photo-Sketching: Inferring Contour Drawings from Images},
  author={Li, Mengtian and Lin, Zhe and M\v ech, Radom\'ir and Yumer, Ersin and Ramanan, Deva},


CUHK Face Sketch database (CUFS) is for research on face sketch synthesis and face sketch recognition. It includes 188 faces from the Chinese University of Hong Kong (CUHK) student database, 123 faces from the AR database [1], and 295 faces from the XM2VTS database [2]. There are 606 faces in total. For each face, there is a sketch drawn by an artist based on a photo taken in a frontal pose, under normal lighting condition, and with a neutral expression.

Sketch Dataset


  • TBA


***Datasets for Cartoon Images******

It is extracted from the public DCM772 comic book dataset. DCM772 is composed of 772 annotated images from 27 golden age comic books, collected from the Digital Comics Museum, a free public-domain collection of digitized comic books. The ground-truth annotations of this dataset contain bounding boxes for panels and comic characters (body + faces), segmentation masks for balloons, and links between balloons and characters.



  • TBA


***Diagrams Dataset******

  • AI2D — a dataset of illustrative diagrams for diagram understanding (Download the dataset HERE, paper PDF)

AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering. Each diagram has been densely annotated with object segmentations, diagrammatic and text elements. Each diagram has a corresponding set of questions and answers.

Abstract: Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016, October). A diagram is worth a dozen images. In European Conference on Computer Vision (pp. 235-251). Springer, Cham.

Citation bibtex:

author={Aniruddha Kembhavi and Mike Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
title={A Diagram Is Worth A Dozen Images},
booktitle={European Conference on Computer Vision (ECCV)},
  • TBA

***3D datasets******

Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5828-5839).





The CVPR 2019 workshop used the ScanNet dataset:

  • TBA

=====Text datasets======



=====More datasets can be found from the sources below======

For researchers in computer vision & image processing:

There are 670+ datasets listed on CVonline

Image/video database categories:
Action Databases
Attribute recognition
Autonomous Driving
Camera calibration
Face and Eye/Iris Databases
General Images
General RGBD and depth datasets
General Videos
Hand, Hand Grasp, Hand Action and Gesture Databases
Image, Video and Shape Database Retrieval
Object Databases
People (static), human body pose
People Detection and Tracking Databases (See also Surveillance)
Remote Sensing
Scene or Place Segmentation or Classification
Simultaneous Localization and Mapping
Surveillance (See also People)
Urban Datasets
Other Collection Pages
Miscellaneous Topics





  • WACV (IEEE Winter Conference on Applications of Computer Vision)


  • TBA