Awesome Public Datasets
Source: GitHub; updated 2016-12-18; indexed 2024-05-31
Download link:
https://github.com/emorisse/awesome-public-datasets
Overview:
This is a curated list of high-quality public datasets collected and organized from blogs, Q&A platforms, and user feedback, spanning multiple domains, with the majority of datasets being freely accessible.
Created:
2015-06-20
Original information summary
Dataset overview
Agriculture
- U.S. Department of Agriculture's PLANTS Database
Biology
- 1000 Genomes
- Collaborative Research in Computational Neuroscience (CRCNS)
- Gene Expression Omnibus (GEO)
- Human Microbiome Project (HMP)
- American Gut (Microbiome Project)
- ICOS PSP Benchmark
- MIT Cancer Genomics Data
- NIH Microarray data (FTP)
- Protein Data Bank
- PubChem Project
- PubGene (now Coremine Medical)
- Stanford Microarray Data
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
Climate/Weather
- Australian Weather
- Brazilian Weather - Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WU Historical Weather Worldwide
Complex Networks
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- NIST complex networks data collection
- Small Network Data
- UCI Network Data Repository
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UFL sparse matrix collection
- WSU Graph Database
Computer Networks
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- Open Mobile Data by MobiPerf
- UCSD Network Telescope, IPv4 /8 net
Data Challenges
- Challenges in Machine Learning
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- Yelp Dataset Challenge
Economics
- American Economic Association (AEA)
- EconData from UMD
- Internet Product Code Database
Energy
- AMPds
- BLUEd
- COMBED
- Dataport
- ECO
- EIA
- HFED
- iAWE
- Plaid
- REDD
- UK-Dale
Finance
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
Geology
- USGS Earthquake Archives
- Smithsonian Institution Global Volcano and Eruption Database
GeoSpace/GIS
- BODC - marine data of ~22K vars
- Cambridge, MA, US, GIS data on GitHub
- EOSDIS - NASA's Earth observing system data
- Factual Global Location Data
- Geo Spatial Data from ASU
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Landsat 8 on AWS
- Natural Earth - vectors and rasters of the world
- Open Street Map (OSM)
- TIGER/Line - U.S. boundaries and roads
- TwoFishes - Foursquare's coarse geocoder
- TZ Timezones shapefiles
- World countries in multiple formats
- List of all countries in all languages
- OpenAddresses
Government
- Austin, TX, US
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Austria (data.gv.at)
- Brazil
- Cambridge, MA, US
- Canada
- Chicago
- Dallas Open Data
- Denver Open Data
- England LGInform
- EuroStat
- FedStats
- Finland
- France
- Germany
- Glasgow, Scotland, UK
- Guardian world governments
- Houston Open Data
- Indian Government Data
- London Datastore, UK
- Los Angeles Open Data
- MassGIS, Massachusetts, U.S.
- Mexico
- Netherlands
- New Zealand
- NYC betanyc
- NYC Open Data
- OECD
- Oklahoma
- Open Government Data (OGD) Platform India
- Rio de Janeiro, Brazil
- Romania
- San Francisco Data sets
- Seattle
- South Africa
- Switzerland
- The World Bank
- Texas Open Data
- Puerto Rico Government
- U.K. Government Data
- Uruguay
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. National Center for Education Statistics (NCES)
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- United Nations
Healthcare
- EHDP Large Health Data Sets
- Gapminder World, demographic databases
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- Number of Ebola Cases and Deaths in Affected Countries (2014)
Image Processing
- 10k US Adult Faces Database
- 2GB of Photos of Cats
- Affective Image Classification
- Face Recognition Benchmark
- ImageNet (in WordNet hierarchy)
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- SUN database, MIT
- YouTube Faces Database
Machine Learning
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Keel Repository for classification, regression and time series
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
Museums
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Tate Collection metadata
- The Getty vocabularies
- Rijksmuseum Historical Art Collection
Natural Language
- Blogger Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Google Books Ngrams (2.2TB)
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Translation of European languages
- SMS Spam Collection in English
- USENET postings corpus of 2005~2011
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
Physics
- CERN Open Data Portal
- NSSDC (NASA) data of 550 spacecraft
- NASA Exoplanet Archive
- Sloan Digital Sky Survey (SDSS) - Mapping the Universe
Psychology/Cognition
- OSU Cognitive Modeling Repository Datasets
Public Domains
- Amazon
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLab collections
- Data360
- Datamob.org
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Numbrary
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
Search Engines
- Academic Torrents of data sharing from UMB
- Archive-it from Internet Archive
- Datahub.io
- DataMarket (Qlik)
- Freebase.com of people, places, and things
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Open Data Certificates (beta)
- Statista.com - statistics and studies
Social Sciences
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- EDRM Enron Email of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare Social Network in 2010, 2011
- Foursquare from UMN/Sarwat (2013)
- General Social Survey (GSS) since 1972
- GetGlue - users rating TV shows
- GitHub Collaboration Archive
- MIT Reality Mining Dataset
- Mobile Social Networks from UMASS
Collected summary
Dataset introduction

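Construction
Awesome Public Datasets is a list of public datasets collected from blogs, answers, and user responses. It was built mainly by gathering open data of many kinds from across the internet and organizing it into a catalog of public datasets that covers multiple domains.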
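Features
The list stands out for the breadth and variety of its content, which ranges from agriculture and biology to the social sciences and physics. In addition, most of the datasets are free and come with detailed metadata descriptions that help users understand and use them.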
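Usage
Users can access and download the data directly through the links the list provides. Each dataset comes with a detailed description and instructions for obtaining it, so users can choose the datasets that fit their needs. Some datasets may require registration or compliance with specific terms of use.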
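As one way to browse the catalog programmatically, the minimal sketch below parses a plain-text listing shaped like the category listing above (a category line followed by `- ` entries) into a dictionary. The function name `parse_listing` and the `SAMPLE` text are illustrative assumptions and are not part of the project itself.

```python
from collections import defaultdict


def parse_listing(text):
    """Group '- entry' lines under the most recent non-entry line (the category)."""
    categories = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("- "):
            # An entry belongs to the category seen most recently, if any.
            if current is not None:
                categories[current].append(line[2:].strip())
        else:
            current = line
            categories[current]  # register the category even if it has no entries
    return dict(categories)


# Hypothetical sample in the same shape as the listing; not fetched from anywhere.
SAMPLE = """\
Agriculture
- U.S. Department of Agriculture's PLANTS Database
Physics
- CERN Open Data Portal
- NASA Exoplanet Archive
"""

if __name__ == "__main__":
    for category, entries in parse_listing(SAMPLE).items():
        print(f"{category}: {len(entries)} entries")
```

Keeping the result as a plain dictionary makes it easy to filter by category name before following any individual dataset link.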
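Background and challenges
Background overview
Awesome Public Datasets is a list of public datasets collected from blogs, answers, and user responses, intended to give researchers convenient access to data resources. It was created in recent years and is maintained by caesar0301 and other researchers or institutions. Its core research question is how to organize and categorize the public datasets available online so that researchers can explore and study them within their own fields. Its influence on these fields comes from its broad data coverage and easy access, which greatly lower the barrier to obtaining data.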
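Current challenges
Challenges faced in building Awesome Public Datasets include: 1) keeping the data timely and accurate; 2) classifying and describing the datasets clearly enough that researchers can quickly find what they need; 3) maintaining and updating the list as the volume of data keeps growing. The domain problem it addresses is helping researchers quickly locate and obtain public datasets in specific fields, such as bioinformatics, climatology, and social network analysis, thereby advancing research in those fields.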
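Common use cases
Typical use case
As a comprehensive catalog of datasets, Awesome Public Datasets is typically used to give researchers a rich pool of data resources so that they can quickly find data and put it to use in scientific research. In machine learning, for example, researchers can use the list to find suitable datasets for training and testing algorithms, such as the IMDb Database and the MovieLens Data Sets.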
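Derived related work
Awesome Public Datasets has given rise to a good deal of related work, including further curation, analysis, and visualization of the datasets, as well as algorithms and models developed for specific datasets. This work enriches research in data science and also provides more tools and methods for practical applications.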
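Recent research on the dataset
Latest research directions
Awesome Public Datasets covers a wide range of research fields, and its latest research directions center on collecting, organizing, and sharing data. Its maintainers work to gather publicly available datasets from many sources and organize them into a form that is easy to access and use. Frontier topics include how to integrate heterogeneous data sources efficiently, how to improve data quality and usability, and how to find new uses for the data, such as supporting machine learning, big data analysis, and interdisciplinary research. Other research focuses on datasets in specific fields, such as bioinformatics, environmental science, and social media analysis, with the aim of advancing science and technological innovation in those fields.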
The content above was collected and summarized by 遇见数据集.



