
CEX Project - Grobid segmentation model Training Dataset

Source: SSH Open Marketplace. Updated: 2026-03-06. Indexed: 2026-03-14.
Download link:
https://marketplace.sshopencloud.eu/dataset/IDnUN3
Description:
The dataset used for training the Grobid segmentation model consists of 227 scholarly English-language PDF articles:
- 179 articles drawn from 27 academic disciplines, primarily selected from Santos EAD, Peroni S, Mucheroni ML (2023), "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic referencing and citing practices", Journal of Documentation, Vol. 79 No. 7 pp. 196–224, doi: https://doi.org/10.1108/JD-10-2022-0234
- 48 articles selected specifically for the presence of footnotes

All the documents were annotated following the GROBID segmentation training guidelines, available at: https://grobid.readthedocs.io/en/latest/training/segmentation/.

A clarification should be made: in some PDFs, recurring page elements (e.g., vertical text such as “Downloaded from … on [date] …”) were detected. These were not annotated, as they do not belong to the scholarly content of the original publication.

Included files:
- corpus_metadata.csv: Metadata for all 227 articles, listing internal identifiers, URLs (DOIs when available), titles, access types, and licenses.
- OA_corpus.zip: Sharable annotated files used for GROBID segmentation training.
- create_folds.py: A Python script for generating five different configurations for model training and evaluation from the original corpus, along with the evaluation results (evaluation_results).
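
For readers who want a feel for what "five different configurations for model training and evaluation" could look like, here is a minimal sketch of a five-fold split over 227 document identifiers. It is not the dataset's create_folds.py, whose actual logic is not described on this page. It assumes the five configurations correspond to plain five-fold cross-validation, and the function name make_folds, the seed, and the doc_NNN identifiers are hypothetical.

```python
import random


def make_folds(doc_ids, k=5, seed=0):
    """Return k (train_ids, eval_ids) pairs; every document is held out exactly once.

    Assumption: plain k-fold cross-validation. The real create_folds.py may
    stratify by discipline or footnote presence, which this sketch does not do.
    """
    ids = sorted(doc_ids)               # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)    # deterministic shuffle for a given seed
    folds = [ids[i::k] for i in range(k)]  # k disjoint, near-equal slices

    configs = []
    for i in range(k):
        eval_ids = folds[i]
        # Training set is every fold except the one held out for evaluation.
        train_ids = [d for j, fold in enumerate(folds) if j != i for d in fold]
        configs.append((train_ids, eval_ids))
    return configs


# Hypothetical identifiers standing in for the 227 articles in the corpus.
doc_ids = [f"doc_{n:03d}" for n in range(227)]
for n, (train_ids, eval_ids) in enumerate(make_folds(doc_ids), start=1):
    print(f"configuration {n}: {len(train_ids)} training, {len(eval_ids)} evaluation")
```

Because 227 is not divisible by 5, the held-out sets differ in size by at most one document. A stratified split would be a reasonable alternative if each configuration should preserve the balance between the 179 multi-discipline articles and the 48 footnote-focused ones.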
Created:
2026-03-06