Fuse
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/records/581678
Description:
The contributors have provided two related datasets, which together constitute the FUSE spreadsheet corpus.
+ A Web Analysis dataset of 2,127,284 URLs that return spreadsheet content, along with the full HTTP web server response, formatted as JSON records. This dataset was obtained by filtering through 26.83 billion HTTP responses within the Common Crawl archive.
+ A Binary Analysis dataset of 249,376 spreadsheets, extracted from the 1.9 PB of raw data within the Common Crawl archive. For each spreadsheet, the authors provide JSON metadata containing their analysis, which includes NLP token extraction and spreadsheet metrics.
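Since both datasets are described as JSON records, a minimal sketch of streaming through such records might look like the following. It assumes one JSON object per line, which this description does not confirm, and the field names `url` and `content_type` are hypothetical placeholders rather than the corpus schema.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_json_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line.

    Assumption: records are newline-delimited JSON.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_values(records: Iterable[dict], key: str) -> Counter:
    """Count occurrences of each value of `key`; records lacking it count under None."""
    return Counter(record.get(key) for record in records)


# Tiny in-memory sample standing in for a real file.
# The field names below are made up for illustration only.
sample = [
    '{"url": "http://example.com/a.xls", "content_type": "application/vnd.ms-excel"}',
    '',
    '{"url": "http://example.com/b.xls", "content_type": "application/vnd.ms-excel"}',
    '{"url": "http://example.com/c.csv"}',
]

counts = count_values(iter_json_records(sample), "content_type")
print(counts)
```

With a real file, an open text handle can be passed in place of `sample`, since iterating a file object yields its lines one at a time and avoids loading millions of records into memory.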
Created:
2020-01-21



