Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2023
Source: DataCite Commons. Updated: 2024-06-21. Indexed: 2024-07-13
Download link:
https://madata.bib.uni-mannheim.de/429
Description:
The Web Data Commons RDFa, Microdata and Microformats data sets have been extracted from the September/October 2023 release of the Common Crawl. In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (50.60%). These pages originate from 15 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (42.89%). Altogether, the extracted data sets consist of 86 billion RDF quads.
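For readers unfamiliar with the term, an RDF quad is a subject, predicate, object triple plus a fourth term naming the graph it belongs to. The following is a minimal sketch of splitting one quad into its four terms, assuming the data is serialized as N-Quads and that the graph term holds the URL of the source page (neither is stated in this record). The sample line, the `parse_quad` helper, and the simplified regular expression are illustrative only and do not cover the full N-Quads grammar.

```python
import re

# One simplified N-Quads term: an IRI, a blank node, or a quoted literal
# with an optional language tag or datatype IRI.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph)."""
    match = QUAD.match(line)
    if match is None:
        raise ValueError(f"not a recognizable quad: {line!r}")
    return match.groups()


# Hypothetical sample line, not taken from the data set.
sample = (
    '_:node1 <http://schema.org/name> "Example Product"@en '
    '<https://example.com/page.html> .'
)

subject, predicate, obj, graph = parse_quad(sample)
print(subject)    # blank node identifying the described item
print(predicate)  # property IRI
print(obj)        # literal value with language tag
print(graph)      # fourth term, assumed here to be the source page URL
```

For real workloads a full parser such as the one in an RDF library would be a safer choice than this regular expression, which is meant only to show the shape of a quad.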
Provider:
Mannheim University Library
Created:
2024-02-12



