CEX Project - Grobid segmentation model Training Dataset
Source: SSH Open MarketPlace. Updated: 2026-03-06. Indexed: 2026-03-14.
Download link:
https://marketplace.sshopencloud.eu/dataset/IDnUN3
Description:
The dataset used for training the Grobid segmentation model consists of 227 scholarly English-language PDF articles:
- 179 articles drawn from 27 academic disciplines, primarily selected from Santos EAD, Peroni S, Mucheroni ML (2023), "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic referencing and citing practices", Journal of Documentation, Vol. 79 No. 7, pp. 196–224, doi: https://doi.org/10.1108/JD-10-2022-0234
- 48 articles selected specifically for the presence of footnotes
All the documents were annotated following the GROBID segmentation training guidelines, available at: https://grobid.readthedocs.io/en/latest/training/segmentation/.
Note: some PDFs contain recurring page elements added after publication (e.g., vertical text such as “Downloaded from … on [date] …”). These were not annotated, as they do not belong to the scholarly content of the original publication.
Included files:
- corpus_metadata.csv — Metadata for all 227 articles, listing internal identifiers, URLs (DOIs when available), titles, access types, and licenses.
- OA_corpus.zip — Sharable annotated files used for GROBID segmentation training.
- A Python script (create_folds.py) for generating five different configurations for model training and evaluation from the original corpus, along with the evaluation results (evaluation_results).
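As an illustration of what generating five training and evaluation configurations can look like, here is a minimal sketch of a five-way split. It is not the contents of create_folds.py: the function name make_folds, the seed, the document identifiers, and the assumption that each configuration holds out one of five disjoint folds for evaluation are all hypothetical.

```python
import random


def make_folds(doc_ids, k=5, seed=0):
    """Return k (train_ids, eval_ids) pairs built from k disjoint folds.

    Hypothetical helper: the real create_folds.py may split differently.
    """
    ids = sorted(doc_ids)
    # Seeded shuffle so the same split can be regenerated later.
    random.Random(seed).shuffle(ids)
    # Deal the shuffled identifiers round-robin into k folds.
    folds = [ids[i::k] for i in range(k)]
    configs = []
    for i in range(k):
        eval_ids = folds[i]
        # Every document outside the held-out fold is used for training.
        train_ids = [d for j, fold in enumerate(folds) if j != i for d in fold]
        configs.append((train_ids, eval_ids))
    return configs


# Placeholder identifiers standing in for the 227 corpus documents.
doc_ids = [f"doc_{n:03d}" for n in range(1, 228)]
configs = make_folds(doc_ids)

for i, (train_ids, eval_ids) in enumerate(configs, start=1):
    print(f"configuration {i}: {len(train_ids)} training, {len(eval_ids)} evaluation")
```

Under this assumption, every document appears in exactly one evaluation set, so each of the five evaluation runs scores the model on documents it did not see during training.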
Created:
2026-03-06



