
awesome-public-datasets

Source: GitHub; updated 2017-05-10; indexed 2024-05-31
Download link:
https://github.com/HashirZahir/awesome-public-datasets
Description:
A list of high-quality public datasets sourced from the public domain, continuously updated.
Created:
2016-02-09
Original information summary

Dataset overview

Agriculture

  • U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>_

Biology

  • 1000 Genomes <http://www.1000genomes.org/data>_
  • American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>_
  • Broad Cancer Cell Line Encyclopedia (CCLE) <http://www.broadinstitute.org/ccle/home>_
  • Cell Image Library <http://www.cellimagelibrary.org>_
  • Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>_
  • Complete Genomics Public Data <http://www.completegenomics.com/public-data/69-genomes/>_
  • EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>_
  • EBI Protein Data Bank in Europe <http://www.ebi.ac.uk/pdbe/emdb/index.html/>_
  • ENCODE project <https://www.encodeproject.org>_
  • Ensembl Genomes <http://ensemblgenomes.org/info/genomes>_
  • Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>_
  • Gene Ontology (GO) <http://geneontology.org/page/download-annotations>_
  • Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>_
  • Harvard Medical School (HMS) LINCS Project <http://lincs.hms.harvard.edu>_
  • Human Genome Diversity Project <http://www.hagsc.org/hgdp/files.html>_
  • Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>_
  • ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>_
  • International HapMap Project <http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en>_
  • Journal of Cell Biology DataViewer <http://jcb-dataviewer.rupress.org>_
  • MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>_
  • NCBI Proteins <http://www.ncbi.nlm.nih.gov/guide/proteins/#databases>_
  • NCBI Taxonomy <http://www.ncbi.nlm.nih.gov/taxonomy>_
  • NeuroData <http://neurodata.io>_
  • NIH Microarray data <http://bit.do/VVW6>_ or FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>_
  • OpenSNP genotypes data <https://opensnp.org/>_
  • Pathguide - Protein-Protein Interactions Catalog <http://www.pathguide.org/>_
  • Protein Data Bank <http://www.rcsb.org/>_
  • PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>_
  • PubGene (now Coremine Medical) <http://www.pubgene.org/>_
  • Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) <http://cancer.sanger.ac.uk/cosmic>_
  • Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC) <http://www.cancerrxgene.org/>_
  • Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>_
  • Stanford Microarray Data <http://smd.stanford.edu/>_
  • Stowers Institute Original Data Repository <http://www.stowers.org/research/publications/odr>_
  • Systems Science of Biological Dynamics (SSBD) Database <http://ssbd.qbic.riken.jp>_
  • Temple University Hospital EEG Database <https://www.nedcdata.org/drupal/node/12>_
  • The Cancer Genome Atlas (TCGA), available via Broad GDAC <https://gdac.broadinstitute.org/>_
  • The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>_
  • The Personal Genome Project <http://www.personalgenomes.org/>_ or PGP <https://my.pgp-hms.org/public_genetic_data>_
  • UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>_
  • Universal Protein Resource (UniProt) <http://www.uniprot.org/downloads>_
  • UniGene <http://www.ncbi.nlm.nih.gov/unigene>_

Climate/Weather

  • Australian Weather <http://www.bom.gov.au/climate/dwo/>_
  • Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>_
  • Canadian Meteorological Centre <http://weather.gc.ca/grib/index_e.html>_
  • Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter>_ and FTP <ftp://ftp.cmdl.noaa.gov/>_
  • European Climate Assessment & Dataset <http://eca.knmi.nl/>_
  • Global Climate Data Since 1929 <http://en.tutiempo.net/climate>_
  • NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>_
  • NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>_
  • NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>_
  • NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>_
  • The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>_
  • UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>_
  • WorldClim - Global Climate Data <http://www.worldclim.org>_
  • WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>_

Complex Networks

  • CrossRef DOI URLs <https://archive.org/details/doi-urls>_
  • DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>_
  • NBER Patent Citations <http://nber.org/patents/>_
  • NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>_
  • Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>_
  • PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>_
  • Scopus Citation Database <https://www.elsevier.com/solutions/scopus>_
  • Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>_
  • Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>_
  • Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>_
  • Stanford Longitudinal Network Data Sources <http://stanford.edu/group/sonia/dataSources/index.html>_
  • The Koblenz Network Collection <http://konect.uni-koblenz.de/>_
  • The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>_
  • The Nexus Network Repository <http://nexus.igraph.org/>_
  • UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>_
  • UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>_
  • WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>_

Computer Networks

  • 3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>_
  • 53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>_
  • CAIDA Internet Datasets <http://www.caida.org/data/overview/>_
  • ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>_
  • ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>_
  • CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>_
  • CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>_
  • Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>_
  • Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>_
  • UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>_

Contextual Data

  • Context-aware data sets from five domains <http://students.depaul.edu/~yzheng8/DataSets.html#Data>_ or GitHub <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>_

Data Challenges

  • Challenges in Machine Learning <http://www.chalearn.org/>_
  • CrowdANALYTIX dataX <http://data.crowdanalytix.com>_
  • D4D Challenge of Orange <http://www.d4d.orange.com/en/home>_
  • DrivenData Competitions for Social Good <http://www.drivendata.org/>_
  • ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>_
  • Kaggle Competition Data <https://www.kaggle.com/>_
  • KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>_
  • Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>_
  • Netflix Prize <http://www.netflixprize.com/leaderboard>_
  • Space Apps Challenge <https://2015.spaceappschallenge.org>_
  • Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>_
  • Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>_

Economics

  • American Economic Association (AEA) <https://www.aeaweb.org/RFE/toc.php?show=complete>_
  • EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>_
  • Economic Freedom of the World Data <http://www.freetheworld.com/datasets_efw.html>_
  • Historical MacroEconomic Statistics <http://www.historicalstatistics.org/>_
  • International Trade Statistics <http://www.econistatistics.co.za/>_
  • Internet Product Code Database <http://www.upcdatabase.com/>_
  • Joint External Debt Data Hub <http://www.jedh.org/>_
  • Jon Haveman International Trade Data Links <http://www.macalester.edu/research/economics/PAGE/HAVEMAN/Trade.Resources/TradeData.html>_
  • OpenCorporates Database of Companies in the World <https://opencorporates.com/>_
  • Our World in Data <http://ourworldindata.org/>_
  • SciencesPo World Trade Gravity Datasets <http://econ.sciences-po.fr/thierry-mayer/data>_
  • The Atlas of Economic Complexity <http://atlas.cid.harvard.edu>_
  • The Center for International Data <http://cid.econ.ucdavis.edu>_
  • The Observatory of Economic Complexity <http://atlas.media.mit.edu/en/>_
  • UN Commodity Trade Statistics <http://comtrade.un.org/db/>_
  • UN Human Development Reports <http://hdr.undp.org/en>_

Education

  • Student Data from Free Code Camp <http://academictorrents.com/details/030b10dad0846b5aecc3905692890fb02404adbf>_

Energy

  • AMPds <http://ampds.org/>_
  • BLUEd <http://nilm.cmubi.org/>_
  • COMBED <http://combed.github.io/>_
  • Dataport <https://dataport.pecanstreet.org/>_
  • ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>_
  • EIA <http://www.eia.gov/electricity/data/eia923/>_
  • HFED <http://hfed.github.io/>_
  • iAWE <http://iawe.github.io/>_
  • Plaid <http://plaidplug.com/>_
  • REDD <http://redd.csail.mit.edu/>_
  • UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>_

Finance

  • CBOE Futures Exchange <http://cfe.cboe.com/Data/>_
  • Google Finance <https://www.google.com/finance>_
  • Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>_
  • NASDAQ <https://data.nasdaq.com/>_
  • OANDA <http://www.oanda.com/>_
  • OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>_
  • Quandl <https://www.quandl.com/>_
  • St Louis Federal <https://research.stlouisfed.org/fred2/>_
  • Yahoo Finance <http://finance.yahoo.com/>_

Geology

  • Earth Models <http://www.earthmodels.org/>_
  • Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>_
  • USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>_

GeoSpatial/GIS

  • BODC - marine data of ~22K vars <http://www.bodc.ac.uk/data/where_to_find_data/>_
  • Cambridge, MA, US, GIS data on GitHub <http://cambridgegis.github.io/gisdata.html>_
  • EOSDIS - NASA's earth observing system data <http://sedac.ciesin.columbia.edu/data/sets/browse>_
  • Factual Global Location Data <https://www.factual.com/>_
  • Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>_
  • Geo Wiki Project - Citizen-driven Environmental Monitoring <http://geo-wiki.org/>_
  • GeoFabrik - OSM data extracted to a variety of formats and areas <http://download.geofabrik.de/>_
  • GeoNames Worldwide <http://www.geonames.org/>_
  • Global Administrative Areas Database (GADM) <http://www.gadm.org/>_
  • `International Institute for Systems Analysis - GIS Datasets <http://www.
Collected summary
Dataset introduction
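Construction
The list was built by collecting and organizing public data sources drawn from blogs, answers, and user responses. It includes many free datasets, along with some that are not free.
Features
The list's defining feature is its breadth: it covers public data across many fields, from agriculture and biology to climate and economics. Beyond structured data, it also includes complex network and geographic information system (GIS) data.
Usage
Users can browse the list of datasets on the GitHub page. Each entry provides a link, so users can download or access the datasets they need. Some datasets may be subject to specific terms or conditions of use.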
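As a rough illustration of working with the list programmatically, the sketch below extracts (name, URL) pairs from entries written in the `Name <url>_` form used in the overview above. The function name `parse_entry` and the regular expression are hypothetical helpers written for this example, not part of the repository, and the sketch assumes names never contain `<` and URLs never contain `>`.

```python
import re

# Matches the "Name <url>_" link form used by the list entries.
# Assumption: a name never contains "<" and a URL never contains ">".
LINK_RE = re.compile(r"\s*([^<>]+?)\s*<([^<>]+)>`?_")


def parse_entry(line):
    """Return the (name, url) pairs found in one list entry (hypothetical helper)."""
    # Drop surrounding whitespace, the bullet, and any stray leading backtick.
    text = line.strip().lstrip("•").strip().lstrip("`")
    pairs = []
    for name, url in LINK_RE.findall(text):
        # Some entries join alternative links with "or"; drop that connector.
        name = re.sub(r"^or\s+", "", name.strip())
        pairs.append((name, url.strip()))
    return pairs


if __name__ == "__main__":
    sample = "  • 1000 Genomes <http://www.1000genomes.org/data>_"
    for name, url in parse_entry(sample):
        print(name, "->", url)
```

An entry that offers two links yields two pairs, and a truncated entry with no closing `>_` yields an empty list, so incomplete lines are skipped rather than guessed at.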
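Background and challenges
Background overview
Awesome Public Datasets is a list of public datasets collected and organized from blogs, answers, and user responses. It was created in 2016 and is maintained by Caesar0301, with the aim of providing a comprehensive resource covering public datasets across fields. The list spans agriculture, biology, climate/weather, complex networks, computer networks, contextual data, data challenges, economics, education, energy, finance, geology, geospatial/GIS, government, healthcare, and other fields. Its impact lies in giving researchers and developers a convenient resource to support their research and projects.
Current challenges
Although Awesome Public Datasets offers a rich collection of dataset resources, building and using it still poses challenges. The first is quality and accuracy: because the data comes from many different sources, ensuring the quality and accuracy of every dataset is difficult. The second is updating and maintenance: as new data keeps appearing, keeping the list current takes sustained effort. In addition, datasets from different fields are hard to integrate and make interoperable, which calls for further standardization and technical support.
Common scenarios
Classic use cases
awesome-public-datasets gathers public data sources from a wide range of fields. Its classic use case is giving researchers a rich pool of data to support data analysis, data mining, and scientific research. It is widely used in academic research, business intelligence analysis, open government data, and other areas.
Practical applications
In practice, awesome-public-datasets is used for publishing open government data, business intelligence analysis, sharing educational resources, and more, providing data support for policy making, market analysis, and educational outreach.
Derived related work
awesome-public-datasets has given rise to a range of related work, including but not limited to academic publications, business intelligence tools, and open government data platforms, which in turn has further advanced the development and application of data science.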
The content above was collected and summarized by 遇见数据集.