Multiple Datasets

GitHub, updated 2024-05-21, indexed 2024-05-31

Download link:

https://github.com/rudvlf0413/Dataset

Overview:

This repository collects datasets from multiple domains, including image recognition, classification, and generation, as well as medical datasets such as those for lung cancer and brain tumors.

Created:

2017-03-27

Summary of Original Information

Visual Datasets

Classification, Recognition, or Generation

Coil-20
- Link: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
STL-10: Self-taught learning
- Link: https://cs.stanford.edu/~acoates/stl10/
MS COCO
- Link: http://mscoco.org/dataset/#overview
US Post Office Zip Code Data
- Link: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html
Google Conceptual Caption dataset
- Link: https://ai.google.com/research/ConceptualCaptions/download
Visual Storytelling Dataset (VIST)
- Link: http://visionandlanguage.net/VIST/
NVIDIA food Image classification
- Link: https://github.com/corona10/FoodDataSet
CIFAR-10, CIFAR-100
- Link: https://www.cs.toronto.edu/~kriz/cifar.html
Large-scale CelebFaces Attributes (CelebA) Dataset
- Link: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
Street View House Numbers (SVHN)
- Link: http://ufldl.stanford.edu/housenumbers/
MNIST
- Link: http://yann.lecun.com/exdb/mnist/
Facial Database
- Link: http://www.face-rec.org/databases/
Labeled Faces in the Wild
- Link: http://vis-www.cs.umass.edu/lfw/#download
Simple Vector Drawing Datasets
- Link: https://github.com/hardmaru/sketch-rnn-datasets
Places2 (scene photos and metadata)
- Link: http://places2.csail.mit.edu/download.html
Yelp dataset (restaurant information and photos)
- Link: https://www.yelp.com/dataset_challenge
DeepFashion
- Link: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
Image to LaTeX (a dataset for converting images of formulas into LaTeX code)
- Link: https://zenodo.org/record/56198#.WTpQ73XyhPN
NIST Dataset (Fingerprint, Mugshot, OCR)
- Link: https://www.nist.gov/srd/nist-special-database-4
Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required)
- Link: http://biometrics.idealtest.org/index.jsp
PASCAL 2012 Dataset (Classification & Detection)
- Link: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#data
Flickr Image Dataset
- Link: http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/flickr100k.html
Stanford dogs dataset
- Link: http://vision.stanford.edu/aditya86/ImageNetDogs/
CUB-200 dataset (birds)
- Link: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html
Facial beauty score dataset
- Link: https://github.com/HCIILAB/SCUT-FBP5500-Database-Release
Tumblr GIF dataset
- Link: https://www.kaggle.com/raingo/tumblr-gif-description-dataset
Totally looks like dataset
- Link: https://sites.google.com/view/totally-looks-like-dataset
CASIA WebFace database
- Link: http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
Labeled Faces in the Wild Home
- Link: http://vis-www.cs.umass.edu/lfw/
Behance Artistic Media Dataset
- Link: https://bam-dataset.org/#explore
Handwriting database
- Link: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
ImageCLEF dataset - Cross language image retrieval task
- Link: https://www.imageclef.org/
Yale-b - The extended Yale face database
- Link: http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html
Visual Relationship Detection dataset
- Link: Images Annotations
Visual Genome dataset
- Link: http://visualgenome.org/
Oxford-102 dataset (Flower)
- Link: http://www.robots.ox.ac.uk/~vgg/data/flowers/102/
UCSD Pedestrian dataset (video anomaly detection)
- Link: http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm

Medical Datasets

Lung cancer dataset
- Link: https://luna.grand-challenge.org
- Link: https://www.kaggle.com/c/data-science-bowl-2017
Brain tumor dataset
- Link: http://braintumorsegmentation.org
Breast cancer dataset (kaggle)
- Link: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
The cancer image archive
- Link: http://www.cancerimagingarchive.net
Mammography dataset
- Link: http://marathon.csee.usf.edu/Mammography/Database.html
Bio Image Dataset @ IIIT Delhi
- Link: http://www.iab-rubric.org/resources.html
CAMELYON16 - metastasis detection in lymph nodes
- Link: https://camelyon16.grand-challenge.org/
CAMELYON17 Dataset
- Link: https://camelyon17.grand-challenge.org/

Video and Image Stream Datasets

YouTube-BoundingBoxes Dataset
- Link: https://research.google.com/youtube-bb/index.html
YouTube-8M Dataset
- Link: https://research.google.com/youtube8m/
The Kinetics Human Action Video Dataset
- Link: https://deepmind.com/research/open-source/open-source-datasets/kinetics/
Announcing AVA: A Finely Labeled Video Dataset for Human Action Understanding
- Link: https://research.googleblog.com/2017/10/announcing-ava-finely-labeled-video.html?m=1
Microsoft Kinect dataset
- Link: https://www.microsoft.com/en-us/download/details.aspx?id=52283

Text Datasets

Machine Translation

StatMT (datasets for tasks such as machine translation and summarization, organized as country-to-country pairs)
- Link: http://www.statmt.org/wmt14/translation-task.html
- Link: http://www.statmt.org/wmt15/translation-task.html
- Link: http://www.statmt.org/wmt16/translation-task.html
- Link: http://www.statmt.org/wmt17/translation-task.html
UN parallel Corpus
- Link: https://conferences.unite.un.org/UNCorpus
IWSLT Dataset (including TED Translation)
- Link: https://sites.google.com/site/iwsltevaluation2016/
The Stacks Project (a paired set of the original text of an algebraic geometry book and its LaTeX code?)
- Link: http://stacks.math.columbia.edu/
Google sentence compression (sentence data standardized by Google)
- Link: http://storage.googleapis.com/sentencecomp/compression-data.json
Annals of the Joseon Dynasty (Korean / Classical Chinese translation)
- Link: http://sillok.history.go.kr/main/main.do
OpenSubtitles
- Link: http://opus.nlpl.eu/OpenSubtitles2018.php

Classification and Topic Modeling

20 Newsgroups
- Link: http://qwone.com/~jason/20Newsgroups/
Reuters dataset
- Link: https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
SNLI (Stanford Natural Language Inference) dataset
- Link: https://nlp.stanford.edu/projects/snli/

Short Text

Tweet data, a subset of TREC 2011 microblog track
- Link: http://trec.nist.gov/data/tweets/
Title data, including news titles with class labels from some news websites
- Link: http://www.sogou.com/al
Italy earthquake Twitter dataset
- Link: https://www.kaggle.com/blackecho/italy-earthquakes

Paraphrasing

Paraphrase database
- Link: http://paraphrase.org/#/download

QA and Dialogue

bAbI dataset (Facebook Question Answering)
- Link: https://research.facebook.com/research/babi/
Question/Answering (cloze-style, fill-in-the-blank) pairs using CNN/Daily Mail articles
- Link: https://github.com/deepmind/rc-data
Stanford Question Answering Dataset
- Link: https://rajpurkar.github.io/SQuAD-explorer/
Korean SQuAD dataset (KorQuAD)
- Link: https://korquad.github.io/
RACE Reading Comprehension dataset
- Link: http://www.qizhexie.com/data/RACE_leaderboard
GLUE (General Language Understanding Evaluation) benchmark dataset
- Link: https://gluebenchmark.com/
ClueWeb12 dataset (information retrieval)
- Link: https://lemurproject.org/clueweb12/
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
- Link: http://cs.stanford.edu/people/jcjohns/clevr/
WikiReading dataset
- Link: https://github.com/google-research-datasets/wiki-reading
SEMPRE: Semantic Parsing with Execution
- Link: https://nlp.stanford.edu/software/sempre/
Dialogue system datasets
- Link: https://breakend.github.io/DialogDatasets/
WikiSQL dataset
- Link: https://github.com/salesforce/WikiSQL
SynthText dataset
- Link: http://www.robots.ox.ac.uk/~vgg/data/scenetext/
Cornell Movie dialogue corpus
- Link: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Word Embeddings

Datasets used for Word2Vec (Wikipedia, WMT11, etc.)
- Link: https://code.google.com/archive/p/word2vec/
Fast Text pre-trained vector set
- Link: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Sentiment Analysis

Stanford Sentiment Treebank (SST)
- Link: http://nlp.stanford.edu/sentiment/
Multi-Domain Sentiment Dataset
- Link: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Visual sentiment ontology
- Link: http://www.ee.columbia.edu/ln/dvmm/vso/download/flickr_dataset.html
Radboud Face Database (RaFD)
- Link: http://www.socsci.ru.nl:8180/RaFD2/RaFD?p=main
Aspect sentiment analysis with aspect category
- Link: https://github.com/hsqmlzno1/MGAN

Raw Text

Common Crawl dataset
- Link: http://commoncrawl.org/the-data/

Audio Datasets

Nottingham music dataset
- Link: https://www-labs.iro.umontreal.ca/~lisa/deep/data/
A large-scale dataset of manually annotated audio events (Google research)
- Link: https://research.google.com/audioset/
Speech Command Dataset
- Link: https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html
Mozilla DeepSpeech
- Link: https://github.com/mozilla/DeepSpeech

Knowledge Base Datasets

Freebase
- Link: https://datahub.io/ko_KR/dataset/freebase
WordNet
- Link: https://wordnet.princeton.edu/
Microsoft Concept Graph
- Link: https://concept.msra.cn/Home/Download
DBPedia Dataset
- Link: http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
YAGO
- Link: https://datahub.io/ko_KR/dataset/yago
Google Knowledge graph API
- Link: https://developers.google.com/knowledge-graph/

Social Network and Recommender System Datasets

AMiner - Datasets for social network Analysis
- Link: https://cn.aminer.org/data
- Link: https://cn.aminer.org/aminernetwork
Netflix Prize Data Set
- Link: <http://academictorrents.com/details/9b13183dc4d60676b77

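Collected Summary

Dataset Introduction

Construction

This collection brings together public datasets from multiple domains, covering vision, medicine, text, audio, knowledge bases, and social networks. The datasets come from different research papers and public resources, including well-known ones such as Coil-20, STL-10, MS COCO, and MNIST. Each dataset was selected and organized with its quality and applicability in mind. The construction process includes data collection, cleaning, annotation, and validation, which is intended to keep the data accurate and consistent.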

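Usage

To use the collection, choose the datasets that fit your research or application needs. First, visit the link for each dataset and download the required data files. Then preprocess and load the data according to each dataset's format and structure. Image data typically needs normalization, and text data typically needs tokenization. Finally, use the processed data for model training, validation, and testing. Some datasets also provide an API or toolkit for quick integration.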
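The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original repository: the helper names `normalize_pixels`, `tokenize`, and `split_dataset` are hypothetical, and the sketch assumes 8-bit pixel values and a simple alphanumeric tokenizer. Individual datasets in the list may call for their own loaders.

```python
import random
import re


def normalize_pixels(pixels, max_value=255):
    """Scale raw integer pixel intensities into the range [0.0, 1.0]."""
    # Assumption: pixels are 8-bit values unless max_value says otherwise.
    return [p / max_value for p in pixels]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test
```

For instance, `split_dataset(range(10))` yields lists of 8, 1, and 1 items, and `tokenize("Hello, World!")` yields `["hello", "world"]`.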

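Background and Challenges

Background Overview

Multiple Datasets is a collection of datasets spanning many domains and application scenarios, covering vision, medicine, text, audio, knowledge bases, and social networks. The datasets were created over a long span of time, and the main researchers and institutions behind them include Columbia University, Stanford University, Microsoft Research, and Google. Their core research problems involve image classification, natural language processing, medical diagnosis, and social network analysis. Their release has had a far-reaching impact on research in these fields, giving researchers rich experimental data and benchmark resources.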

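Current Challenges

Multiple Datasets faced several challenges during construction. First, the diversity of the datasets leads to inconsistent data formats and annotation standards, which makes integration and processing more complex. Second, some datasets involve sensitive information, such as medical and social network data, so sharing and analyzing data while protecting privacy is an important challenge. In addition, the datasets differ in scale and quality, and researchers need to ensure their reliability and representativeness. Finally, as technology develops, the datasets need to be updated and extended to meet new research needs and application scenarios.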
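One common way to cope with inconsistent formats and label conventions is to adapt every source into a single record type before any further processing. The sketch below illustrates that idea only. The `Sample` class, the two adapter functions, and both input layouts (a `path,label` CSV row and a dict with `file` and `category` keys) are hypothetical and do not describe any specific dataset in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """A uniform record that downstream code can rely on."""
    source: str   # name of the dataset the record came from
    payload: str  # path to (or text of) the raw item
    label: str    # label normalized to lowercase


def from_csv_row(row, source):
    """Adapt a hypothetical 'path,label' CSV row into a Sample."""
    path, label = row.strip().split(",")
    return Sample(source=source, payload=path, label=label.lower())


def from_record(record, source):
    """Adapt a hypothetical dict with 'file' and 'category' keys."""
    return Sample(
        source=source,
        payload=record["file"],
        label=record["category"].lower(),
    )
```

Keeping the adapters small and separate means a new source only needs one more function, while training code keeps consuming `Sample` objects.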

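Common Scenarios

Classic Use Cases

In computer vision, these datasets are widely used for tasks such as image classification, object recognition, and generative modeling. For example, CIFAR-10 and CIFAR-100 are commonly used to benchmark image classification algorithms, while MS COCO is widely used in research on object detection and image segmentation. MNIST, the classic dataset for handwritten digit recognition, gives beginners ample training and test data.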
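As an example of loading one of these benchmarks, the MNIST files are distributed in the IDX binary format: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension and then the raw values. The sketch below parses such a byte string using only the standard library. The function name `parse_idx` is hypothetical, and it handles only the unsigned-byte type (code 0x08) that MNIST uses. The downloaded files are gzip-compressed, so they would need to be decompressed first, for example with `gzip.open`.

```python
import struct


def parse_idx(raw):
    """Parse an IDX byte string of unsigned bytes into (dims, values)."""
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, dtype, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("expected an IDX payload of unsigned bytes")
    # One big-endian unsigned 32-bit size per dimension.
    header_end = 4 + 4 * ndim
    dims = struct.unpack(">" + "I" * ndim, raw[4:header_end])
    expected = 1
    for d in dims:
        expected *= d
    data = raw[header_end:]
    if len(data) != expected:
        raise ValueError("payload size does not match the header")
    return dims, list(data)


# A tiny synthetic label file: one dimension holding three labels.
labels_blob = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([5, 0, 4])
dims, values = parse_idx(labels_blob)
```

For an image file the header carries three dimensions (count, rows, columns), and the values come back as one flat list in row-major order.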

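Academic Problems Addressed

In academic research, these datasets have helped address several key problems, such as improving image classification accuracy, the real-time performance and precision of object detection, and the diversity and realism of generative models. By providing diverse, large-scale image data, they have advanced deep learning algorithms, particularly research on convolutional neural networks (CNNs) and generative adversarial networks (GANs), and have given the academic community a valuable experimental platform.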

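Practical Applications

In practice, these datasets are widely used to develop and optimize visual systems. For example, CelebA is used for face recognition and attribute analysis, while Street View House Numbers (SVHN) supports house number recognition systems. DeepFashion has also played an important role in image search and recommendation systems in the fashion industry, helping improve user experience and business efficiency.

Recent Research on the Datasets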