Awesome Public Datasets

Source: GitHub; updated 2020-01-08; indexed 2024-05-31
Download link:
https://github.com/eliangcs/awesome-public-datasets
Summary:
This is a list of high-quality public datasets spanning various fields such as agriculture, biology, and more. The datasets are sourced from the public domain and are continuously updated.
Created:
2016-06-05
Summary of original information

Dataset overview

Agriculture

  • U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>_

Biology

  • 1000 Genomes <http://www.1000genomes.org/data>_
  • American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>_
  • Broad Cancer Cell Line Encyclopedia (CCLE) <http://www.broadinstitute.org/ccle/home>_
  • Broad Bioimage Benchmark Collection (BBBC) <https://www.broadinstitute.org/bbbc>_
  • Cell Image Library <http://www.cellimagelibrary.org>_
  • Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>_
  • Complete Genomics Public Data <http://www.completegenomics.com/public-data/69-genomes/>_
  • EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>_
  • EBI Protein Data Bank in Europe <http://www.ebi.ac.uk/pdbe/emdb/index.html/>_
  • Electron Microscopy Pilot Image Archive (EMPIAR) <http://www.ebi.ac.uk/pdbe/emdb/empiar/>_
  • ENCODE project <https://www.encodeproject.org>_
  • Ensembl Genomes <http://ensemblgenomes.org/info/genomes>_
  • Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>_
  • Gene Ontology (GO) <http://geneontology.org/page/download-annotations>_
  • Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>_
  • Harvard Medical School (HMS) LINCS Project <http://lincs.hms.harvard.edu>_
  • Human Genome Diversity Project <http://www.hagsc.org/hgdp/files.html>_
  • Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>_
  • ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>_
  • International HapMap Project <http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en>_
  • Journal of Cell Biology DataViewer <http://jcb-dataviewer.rupress.org>_
  • MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>_
  • NCBI Proteins <http://www.ncbi.nlm.nih.gov/guide/proteins/#databases>_
  • NCBI Taxonomy <http://www.ncbi.nlm.nih.gov/taxonomy>_
  • NeuroData <http://neurodata.io>_
  • NIH Microarray data <http://bit.do/VVW6>_ or FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>_
  • OpenSNP genotypes data <https://opensnp.org/>_
  • Pathguide - Protein-Protein Interactions Catalog <http://www.pathguide.org/>_
  • Protein Data Bank <http://www.rcsb.org/>_
  • Psychiatric Genomics Consortium <https://www.med.unc.edu/pgc/downloads>_
  • PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>_
  • PubGene (now Coremine Medical) <http://www.pubgene.org/>_
  • Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) <http://cancer.sanger.ac.uk/cosmic>_
  • Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC) <http://www.cancerrxgene.org/>_
  • Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>_
  • Stanford Microarray Data <http://smd.stanford.edu/>_
  • Stowers Institute Original Data Repository <http://www.stowers.org/research/publications/odr>_
  • Systems Science of Biological Dynamics (SSBD) Database <http://ssbd.qbic.riken.jp>_
  • Temple University Hospital EEG Database <https://www.nedcdata.org/drupal/node/12>_
  • The Cancer Genome Atlas (TCGA), available via Broad GDAC <https://gdac.broadinstitute.org/>_
  • The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>_
  • The Personal Genome Project <http://www.personalgenomes.org/>_ or PGP <https://my.pgp-hms.org/public_genetic_data>_
  • UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>_
  • Universal Protein Resource (UniProt) <http://www.uniprot.org/downloads>_
  • UniGene <http://www.ncbi.nlm.nih.gov/unigene>_

Climate/Weather

  • Australian Weather <http://www.bom.gov.au/climate/dwo/>_
  • Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>_
  • Canadian Meteorological Centre <http://weather.gc.ca/grib/index_e.html>_
  • Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>_
  • European Climate Assessment & Dataset <http://eca.knmi.nl/>_
  • Global Climate Data Since 1929 <http://en.tutiempo.net/climate>_
  • NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>_
  • NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>_
  • NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>_
  • NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>_
  • The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>_
  • UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>_
  • WorldClim - Global Climate Data <http://www.worldclim.org>_
  • WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>_

Complex Networks

  • AMiner Citation Network Dataset <http://aminer.org/citation>_
  • CrossRef DOI URLs <https://archive.org/details/doi-urls>_
  • DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>_
  • NBER Patent Citations <http://nber.org/patents/>_
  • Network Repository with Interactive Exploratory Analysis Tools <http://networkrepository.com/>_
  • NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>_
  • Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>_
  • PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>_
  • Scopus Citation Database <https://www.elsevier.com/solutions/scopus>_
  • Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>_
  • Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>_
  • Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>_
  • Stanford Longitudinal Network Data Sources <http://stanford.edu/group/sonia/dataSources/index.html>_
  • The Koblenz Network Collection <http://konect.uni-koblenz.de/>_
  • The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>_
  • The Nexus Network Repository <http://nexus.igraph.org/>_
  • UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>_
  • UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>_
  • WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>_
  • DIMACS Road Networks Collection <http://www.dis.uniroma1.it/challenge9/download.shtml>_

Computer Networks

  • 3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>_
  • 53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>_
  • CAIDA Internet Datasets <http://www.caida.org/data/overview/>_
  • ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>_
  • ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>_
  • CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>_
  • CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>_
  • Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>_
  • Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>_
  • Rapid7 Sonar Internet Scans <https://sonar.labs.rapid7.com/>_
  • UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>_

Contextual Data

  • Context-aware data sets from five domains <http://students.depaul.edu/~yzheng8/DataSets.html#Data>_ or GitHub <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>_

Data Challenges

  • Challenges in Machine Learning <http://www.chalearn.org/>_
  • CrowdANALYTIX dataX <http://data.crowdanalytix.com>_
  • D4D Challenge of Orange <http://www.d4d.orange.com/en/home>_
  • DrivenData Competitions for Social Good <http://www.drivendata.org/>_
  • ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>_
  • Kaggle Competition Data <https://www.kaggle.com/>_
  • KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>_
  • Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>_
  • Netflix Prize <http://www.netflixprize.com/leaderboard>_
  • Space Apps Challenge <https://2015.spaceappschallenge.org>_
  • Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>_
  • Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>_
  • Bruteforce Database <https://github.com/duyetdev/bruteforce-database>_

Economics

  • American Economic Association (AEA) <https://www.aeaweb.org/RFE/toc.php?show=complete>_
  • EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>_
  • Economic Freedom of the World Data <http://www.freetheworld.com/datasets_efw.html>_
  • Historical MacroEconomic Statistics <http://www.historicalstatistics.org/>_
  • International Trade Statistics <http://www.econostatistics.co.za/>_
  • Internet Product Code Database <http://www.upcdatabase.com/>_
  • Joint External Debt Data Hub <http://www.jedh.org/>_
  • Jon Haveman International Trade Data Links <http://www.macalester.edu/research/economics/PAGE/HAVEMAN/Trade.Resources/TradeData.html>_
  • OpenCorporates Database of Companies in the World <https://opencorporates.com/>_
  • Our World in Data <http://ourworldindata.org/>_
  • SciencesPo World Trade Gravity Datasets <http://econ.sciences-po.fr/thierry-mayer/data>_
  • The Atlas of Economic Complexity <http://atlas.cid.harvard.edu>_
  • The Center for International Data <http://cid.econ.ucdavis.edu>_
  • The Observatory of Economic Complexity <http://atlas.media.mit.edu/en/>_
  • UN Commodity Trade Statistics <http://comtrade.un.org/db/>_
  • UN Human Development Reports <http://hdr.undp.org/en>_

Education

  • Student Data from Free Code Camp <http://academictorrents.com/details/030b10dad0846b5aecc3905692890fb02404adbf>_

Energy

  • AMPds <http://ampds.org/>_
  • BLUED <http://nilm.cmubi.org/>_
  • COMBED <http://combed.github.io/>_
  • Dataport <https://dataport.pecanstreet.org/>_
  • ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>_
  • EIA <http://www.eia.gov/electricity/data/eia923/>_
  • HFED <http://hfed.github.io/>_
  • iAWE <http://iawe.github.io/>_
  • Plaid <http://plaidplug.com/>_
  • REDD <http://redd.csail.mit.edu/>_
  • UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>_

Finance

  • CBOE Futures Exchange <http://cfe.cboe.com/Data/>_
  • Google Finance <https://www.google.com/finance>_
  • Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>_
  • NASDAQ <https://data.nasdaq.com/>_
  • OANDA <http://www.oanda.com/>_
  • OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>_
  • Quandl <https://www.quandl.com/>_
  • St Louis Federal <https://research.stlouisfed.org/fred2/>_
  • Yahoo Finance <http://finance.yahoo.com/>_
  • NYSE Market Data <ftp://ftp.nyxdata.com>_
  • pystock-data <https://github.com/eliangcs/pystock-data>_

Geology

  • Earth Models <http://www.earthmodels.org/>_
  • Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>_
  • USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>_

GIS/Environment

  • `BODC -
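Collected summary
Dataset introduction
Construction
Awesome Public Datasets is a repository that collects and organizes a wide range of public data sources, covering fields from agriculture to machine learning. The list is built mainly by gathering data source links from blogs, Q&A platforms, and user feedback, then organizing and categorizing them systematically. Each data source is screened for availability and relevance, and the result is a structured list of datasets.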
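The text does not say how availability is screened. As one possible approach, the sketch below sends a HEAD request to a link and reports the HTTP status. The function name `check_link` and its `opener` parameter are hypothetical, introduced only for illustration; the calls come from Python's standard `urllib`. Some servers reject HEAD requests, so a non-200 result is a hint to recheck by hand, not proof that a link is dead.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return (url, status), where status is an HTTP code or an error label.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    # FTP and other non-HTTP links in the list are skipped, not tested.
    if not url.startswith(("http://", "https://")):
        return url, "skipped (non-HTTP scheme)"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404.
        return url, err.code
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return url, f"unreachable ({err.__class__.__name__})"
```

Calling `check_link("http://snap.stanford.edu/data/")` performs a real network request, so results depend on when and where it is run.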
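Features
The list is notable for its broad coverage and variety of data types. It includes free data sources as well as some paid resources, giving users a wide range of options. Entries are grouped by field, such as agriculture, biology, climate, and complex networks, so users can quickly locate the data they need. The list also provides detailed metadata, such as data source, update frequency, and access method, to help users understand and use the data.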
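Usage
To use Awesome Public Datasets, users can browse the category directory or use search to quickly find data sources of interest. Each entry comes with a link and a brief description, so users can go directly to the original data source to download or explore further. The list also links to similar resource collections, such as awesome-awesomeness and sindresorhus's awesome, to help users broaden their search.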
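Browsing by category and searching by keyword can also be done programmatically. The sketch below is a minimal illustration, assuming entries keep the `Name <url>`_ form seen in the overview and that any non-bullet, non-empty line is a category heading. The names `build_index`, `search`, and `SAMPLE` are hypothetical and not part of the list itself.

```python
import re
from collections import defaultdict

# One link in the list's reStructuredText-style form: Name <url>_
ENTRY_RE = re.compile(r"(?P<name>[^<>•]+?)\s*<(?P<url>[^<>\s]+)>_")

# A short excerpt in the same layout as the overview (assumed input).
SAMPLE = """\
Agriculture

  • U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>_

Biology

  • 1000 Genomes <http://www.1000genomes.org/data>_
  • NIH Microarray data <http://bit.do/VVW6>_ or FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>_
"""


def build_index(text):
    """Group (name, url) pairs under the most recent heading line."""
    index = defaultdict(list)
    category = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("•"):
            category = line  # any other non-empty line starts a category
            continue
        if category is None:
            continue  # bullet before any heading: nothing to file it under
        for match in ENTRY_RE.finditer(line):
            # Drop the "or" that joins alternative links on one line.
            name = re.sub(r"^or\s+", "", match.group("name").strip())
            index[category].append((name, match.group("url")))
    return dict(index)


def search(index, keyword):
    """Return (category, name, url) for names containing the keyword."""
    needle = keyword.lower()
    return [
        (category, name, url)
        for category, entries in index.items()
        for name, url in entries
        if needle in name.lower()
    ]


index = build_index(SAMPLE)
for hit in search(index, "genomes"):
    print(hit)
```

Entries whose angle brackets contain spaces (one climate entry packs two URLs into a single link) are not matched by this pattern and would need separate handling.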
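Background and challenges
Background overview
Awesome Public Datasets is a repository that collects and organizes a wide range of public data sources, covering datasets in fields from agriculture and biology to climate and economics. It was created and is maintained on GitHub by community contributor caesar0301, and aims to give researchers, data scientists, and developers convenient access to public data. Its core research question is how to efficiently integrate and classify data sources from different fields so that users can quickly find the datasets they need. Its influence shows in its wide range of applications, especially in data-driven research and business decision-making, where it provides important data support for interdisciplinary research.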
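Current challenges
The main challenges facing Awesome Public Datasets stem from the diversity and complexity of its data sources. First, the list spans many fields, from biology to climatology, and each field has its own data formats, structures, and quality requirements, so managing and presenting the data uniformly is technically difficult. Second, updating and maintaining the list requires sustained human effort to keep the data current and accurate. In addition, some datasets may raise copyright or privacy issues, so balancing open data against legal compliance is another significant challenge. Finally, as the volume of data keeps growing, the list will need to improve its retrieval and classification to give users a better experience.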
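Common scenarios
Typical use case
Awesome Public Datasets is a repository that collects and organizes a wide range of public data sources, covering fields from agriculture and biology to climate and economics. Its most typical use is as a central place where researchers and developers can quickly reach and download the datasets they need, which speeds up scientific research and application development. In data-driven research projects in particular, it offers rich data support for interdisciplinary work.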
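Academic problems addressed
Awesome Public Datasets addresses the difficulty of obtaining data in academic research. By bringing together public datasets from different fields, it lets researchers avoid duplicated effort and focus on data analysis and model building. It also provides valuable resources for research in data-scarce fields and has advanced frontier research in genomics, climate science, economics, and other disciplines. Its openness and diversity give the academic community an unprecedented level of data support.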
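Derived related work
Awesome Public Datasets has given rise to many notable research efforts. For example, researchers have developed new cancer diagnostic tools based on genomic data in the list, and scientists have used its climate data to build more accurate climate change models. The list has also spawned many open-source tools and platforms, such as data visualization tools and machine learning frameworks, further advancing the field of data science.
The content above was collected and summarized by 遇见数据集.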