Awesome Public Datasets
Source: GitHub. Updated: 2020-01-08. Indexed: 2024-05-31.
Download link:
https://github.com/eliangcs/awesome-public-datasets
Overview:
This is a list of high-quality public datasets spanning various fields such as agriculture, biology, and more. The datasets are sourced from the public domain and are continuously updated.
Created:
2016-06-05
Original Information Summary
Dataset Overview
Agriculture
- `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_
Biology
- `1000 Genomes <http://www.1000genomes.org/data>`_
- `American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>`_
- `Broad Cancer Cell Line Encyclopedia (CCLE) <http://www.broadinstitute.org/ccle/home>`_
- `Broad Bioimage Benchmark Collection (BBBC) <https://www.broadinstitute.org/bbbc>`_
- `Cell Image Library <http://www.cellimagelibrary.org>`_
- `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_
- `Complete Genomics Public Data <http://www.completegenomics.com/public-data/69-genomes/>`_
- `EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>`_
- `EBI Protein Data Bank in Europe <http://www.ebi.ac.uk/pdbe/emdb/index.html/>`_
- `Electron Microscopy Pilot Image Archive (EMPIAR) <http://www.ebi.ac.uk/pdbe/emdb/empiar/>`_
- `ENCODE project <https://www.encodeproject.org>`_
- `Ensembl Genomes <http://ensemblgenomes.org/info/genomes>`_
- `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_
- `Gene Ontology (GO) <http://geneontology.org/page/download-annotations>`_
- `Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>`_
- `Harvard Medical School (HMS) LINCS Project <http://lincs.hms.harvard.edu>`_
- `Human Genome Diversity Project <http://www.hagsc.org/hgdp/files.html>`_
- `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_
- `ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>`_
- `International HapMap Project <http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en>`_
- `Journal of Cell Biology DataViewer <http://jcb-dataviewer.rupress.org>`_
- `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_
- `NCBI Proteins <http://www.ncbi.nlm.nih.gov/guide/proteins/#databases>`_
- `NCBI Taxonomy <http://www.ncbi.nlm.nih.gov/taxonomy>`_
- `NeuroData <http://neurodata.io>`_
- `NIH Microarray data <http://bit.do/VVW6>`_ or `FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>`_
- `OpenSNP genotypes data <https://opensnp.org/>`_
- `Pathguide - Protein-Protein Interactions Catalog <http://www.pathguide.org/>`_
- `Protein Data Bank <http://www.rcsb.org/>`_
- `Psychiatric Genomics Consortium <https://www.med.unc.edu/pgc/downloads>`_
- `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_
- `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_
- `Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) <http://cancer.sanger.ac.uk/cosmic>`_
- `Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC) <http://www.cancerrxgene.org/>`_
- `Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>`_
- `Stanford Microarray Data <http://smd.stanford.edu/>`_
- `Stowers Institute Original Data Repository <http://www.stowers.org/research/publications/odr>`_
- `Systems Science of Biological Dynamics (SSBD) Database <http://ssbd.qbic.riken.jp>`_
- `Temple University Hospital EEG Database <https://www.nedcdata.org/drupal/node/12>`_
- `The Cancer Genome Atlas (TCGA), available via Broad GDAC <https://gdac.broadinstitute.org/>`_
- `The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>`_
- `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_
- `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_
- `Universal Protein Resource (UniProt) <http://www.uniprot.org/downloads>`_
- `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_
Climate/Weather
- `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
- `Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>`_
- `Canadian Meteorological Centre <http://weather.gc.ca/grib/index_e.html>`_
- `Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>`_
- `European Climate Assessment & Dataset <http://eca.knmi.nl/>`_
- `Global Climate Data Since 1929 <http://en.tutiempo.net/climate>`_
- `NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>`_
- `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
- `NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>`_
- `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_
- `The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>`_
- `UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>`_
- `WorldClim - Global Climate Data <http://www.worldclim.org>`_
- `WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>`_
Complex Networks
- `AMiner Citation Network Dataset <http://aminer.org/citation>`_
- `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
- `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
- `NBER Patent Citations <http://nber.org/patents/>`_
- `Network Repository with Interactive Exploratory Analysis Tools <http://networkrepository.com/>`_
- `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
- `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
- `PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
- `Scopus Citation Database <https://www.elsevier.com/solutions/scopus>`_
- `Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>`_
- `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_
- `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_
- `Stanford Longitudinal Network Data Sources <http://stanford.edu/group/sonia/dataSources/index.html>`_
- `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_
- `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_
- `The Nexus Network Repository <http://nexus.igraph.org/>`_
- `UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>`_
- `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_
- `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_
- `DIMACS Road Networks Collection <http://www.dis.uniroma1.it/challenge9/download.shtml>`_
Computer Networks
- `3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
- `53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>`_
- `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
- `ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>`_
- `ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>`_
- `CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>`_
- `CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>`_
- `Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>`_
- `Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>`_
- `Rapid7 Sonar Internet Scans <https://sonar.labs.rapid7.com/>`_
- `UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>`_
Contextual Data
- `Context-aware data sets from five domains <http://students.depaul.edu/~yzheng8/DataSets.html#Data>`_ or `GitHub <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>`_
Data Challenges
- `Challenges in Machine Learning <http://www.chalearn.org/>`_
- `CrowdANALYTIX dataX <http://data.crowdanalytix.com>`_
- `D4D Challenge of Orange <http://www.d4d.orange.com/en/home>`_
- `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
- `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
- `Kaggle Competition Data <https://www.kaggle.com/>`_
- `KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>`_
- `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_
- `Netflix Prize <http://www.netflixprize.com/leaderboard>`_
- `Space Apps Challenge <https://2015.spaceappschallenge.org>`_
- `Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>`_
- `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_
- `Bruteforce Database <https://github.com/duyetdev/bruteforce-database>`_
Economics
- `American Economic Association (AEA) <https://www.aeaweb.org/RFE/toc.php?show=complete>`_
- `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
- `Economic Freedom of the World Data <http://www.freetheworld.com/datasets_efw.html>`_
- `Historical MacroEconomic Statistics <http://www.historicalstatistics.org/>`_
- `International Trade Statistics <http://www.econostatistics.co.za/>`_
- `Internet Product Code Database <http://www.upcdatabase.com/>`_
- `Joint External Debt Data Hub <http://www.jedh.org/>`_
- `Jon Haveman International Trade Data Links <http://www.macalester.edu/research/economics/PAGE/HAVEMAN/Trade.Resources/TradeData.html>`_
- `OpenCorporates Database of Companies in the World <https://opencorporates.com/>`_
- `Our World in Data <http://ourworldindata.org/>`_
- `SciencesPo World Trade Gravity Datasets <http://econ.sciences-po.fr/thierry-mayer/data>`_
- `The Atlas of Economic Complexity <http://atlas.cid.harvard.edu>`_
- `The Center for International Data <http://cid.econ.ucdavis.edu>`_
- `The Observatory of Economic Complexity <http://atlas.media.mit.edu/en/>`_
- `UN Commodity Trade Statistics <http://comtrade.un.org/db/>`_
- `UN Human Development Reports <http://hdr.undp.org/en>`_
Education
- `Student Data from Free Code Camp <http://academictorrents.com/details/030b10dad0846b5aecc3905692890fb02404adbf>`_
Energy
- `AMPds <http://ampds.org/>`_
- `BLUEd <http://nilm.cmubi.org/>`_
- `COMBED <http://combed.github.io/>`_
- `Dataport <https://dataport.pecanstreet.org/>`_
- `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_
- `EIA <http://www.eia.gov/electricity/data/eia923/>`_
- `HFED <http://hfed.github.io/>`_
- `iAWE <http://iawe.github.io/>`_
- `Plaid <http://plaidplug.com/>`_
- `REDD <http://redd.csail.mit.edu/>`_
- `UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>`_
Finance
- `CBOE Futures Exchange <http://cfe.cboe.com/Data/>`_
- `Google Finance <https://www.google.com/finance>`_
- `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_
- `NASDAQ <https://data.nasdaq.com/>`_
- `OANDA <http://www.oanda.com/>`_
- `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_
- `Quandl <https://www.quandl.com/>`_
- `St Louis Federal <https://research.stlouisfed.org/fred2/>`_
- `Yahoo Finance <http://finance.yahoo.com/>`_
- `NYSE Market Data <ftp://ftp.nyxdata.com>`_
- `pystock-data <https://github.com/eliangcs/pystock-data>`_
Geology
- `Earth Models <http://www.earthmodels.org/>`_
- `Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>`_
- `USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>`_
GIS/Environment
- `BODC -
Collected Summary
Dataset Introduction

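Construction
Awesome Public Datasets is a repository that collects and organizes public data sources across many fields, from agriculture to machine learning. It is built mainly by gathering links to data sources from blogs, Q&A sites, and user feedback, then organizing and classifying them systematically. Each source is screened for availability and relevance, and the result is a structured list of datasets.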
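Features
The collection is distinguished by its broad coverage and variety of data types. It includes free data sources as well as some paid resources, giving users a wide choice. Entries are grouped by domain, such as agriculture, biology, climate, and complex networks, so users can quickly locate the data they need. The collection also provides metadata such as data source, update frequency, and access method, to help users understand and use the data.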
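Usage
Users can find data sources of interest by browsing the category index or by searching. Each entry comes with a link and a brief description, so users can go straight to the original source to download the data or explore further. The collection also links to similar resource lists, such as awesome-awesomeness and sindresorhus's awesome, to help users widen their search.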
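The lookup described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch and not a tool shipped with the list: it assumes entries follow the `Name <url>_` pattern used in the category lists on this page (optionally wrapped in reStructuredText backticks), and `parse_entries` and `search` are illustrative names.

```python
import re

# Matches entries such as "Name <http://example.org>_", with or without the
# surrounding reStructuredText backticks. This pattern is an assumption based
# on how the links appear on this page.
ENTRY_RE = re.compile(r"`?([^<>`]+?)\s*<([^<>]+)>`?_")


def parse_entries(text):
    """Return (name, url) pairs found in a run of link entries."""
    return [
        (name.strip(" -\n"), url.strip())
        for name, url in ENTRY_RE.findall(text)
    ]


def search(entries, keyword):
    """Case-insensitive substring match on the entry name."""
    needle = keyword.lower()
    return [(name, url) for name, url in entries if needle in name.lower()]


# Two entries copied from the Geology list, run together as on this page.
sample = (
    "Earth Models <http://www.earthmodels.org/>_"
    "USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>_"
)
entries = parse_entries(sample)
print(search(entries, "earthquake"))
```

Matching on the name only keeps the sketch simple. Alternate links joined with "or" (for example an FTP mirror) would come back as separate entries whose name starts with "or", so a real script would need to handle that case.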
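Background and Challenges
Background Overview
Awesome Public Datasets is a repository that collects and organizes public data sources, covering datasets in fields ranging from agriculture and biology to climate and economics. It was created and is maintained on GitHub by the community contributor caesar0301, with the aim of giving researchers, data scientists, and developers convenient access to public data. The core problem it addresses is how to integrate and classify data sources from different fields efficiently, so that users can quickly find the datasets they need. Its influence shows in its wide range of applications, particularly in data-driven research and business decision-making, where it provides important data support for interdisciplinary work.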
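Current Challenges
The main challenges facing Awesome Public Datasets stem from the diversity and complexity of its data sources. First, the collection spans many fields, from biology to climatology, and each field has its own data formats, structures, and quality requirements, so managing and presenting the data uniformly is technically difficult. Second, updating and maintaining the collection requires sustained human effort to keep the data timely and accurate. In addition, some datasets may raise copyright or privacy issues, and balancing open data against legal compliance is another significant challenge. Finally, as the volume of data keeps growing, the collection will need to improve its retrieval and classification systems to give users a better experience.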
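Keeping links reachable is one concrete part of the maintenance effort mentioned above. The following is a minimal sketch of a link check using only Python's standard library. `check_link` is a hypothetical helper, not something the list provides, and it assumes a HEAD request is enough to tell whether a URL still responds. Some servers reject HEAD requests, so a False result is only a hint that a link needs manual review.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers a HEAD request with a status below 400.

    `opener` is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            status = getattr(response, "status", None)
            # Non-HTTP responses may carry no status; treat them as reachable.
            return status is None or status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs. Any of them counts as "not reachable right now".
        return False
```

A call such as `check_link("http://snap.stanford.edu/data/")` would then report whether that entry currently responds. Running it over every parsed entry gives a rough list of candidates for manual review.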
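Common Scenarios
Typical Use Case
Awesome Public Datasets is a repository that collects and organizes public data sources in fields ranging from agriculture and biology to climate and economics. Its most typical use is as a central place where researchers and developers can quickly reach and download the datasets they need, which speeds up scientific research and application development. In data-driven research projects in particular, it offers rich data support for interdisciplinary work.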
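Academic Problems Addressed
Awesome Public Datasets addresses the difficulty of obtaining data for academic research. By bringing together public datasets from different fields, it lets researchers avoid duplicated effort and concentrate on data analysis and model building. It also provides valuable resources for research in data-scarce areas and has supported frontier work in disciplines such as genomics, climate science, and economics. Its openness and diversity give the academic community a broad base of data support.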
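Derived Work
Awesome Public Datasets has given rise to a range of research work. For example, researchers have used genomic data listed in the collection to develop new cancer diagnostic tools, and scientists have used its climate data to build more accurate climate change models. The collection has also prompted open-source tools and platforms, such as data visualization tools and machine learning frameworks, further advancing the field of data science.
The content above was collected and summarized by 遇见数据集.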



