CEX Project - Grobid segmentation model Training Dataset

SSH Open MarketPlace | Updated 2026-03-06 | Indexed 2026-03-14

Download link:

https://marketplace.sshopencloud.eu/dataset/IDnUN3

Description:

The dataset used to train the Grobid segmentation model consists of 227 scholarly English-language PDF articles:

- 179 articles drawn from 27 academic disciplines, primarily selected from Santos EAD, Peroni S, Mucheroni ML (2023), "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic referencing and citing practices", Journal of Documentation, Vol. 79 No. 7, pp. 196–224, doi: https://doi.org/10.1108/JD-10-2022-0234
- 48 articles selected specifically for the presence of footnotes

All documents were annotated following the GROBID segmentation training guidelines, available at: https://grobid.readthedocs.io/en/latest/training/segmentation/. One clarification: in some PDFs, recurring page elements (e.g., vertical text such as “Downloaded from … on [date] …”) were detected. These were not annotated, as they do not belong to the scholarly content of the original publication.

Included files:

- corpus_metadata.csv: metadata for all 227 articles, listing internal identifiers, URLs (DOIs when available), titles, access types, and licenses.
- OA_corpus.zip: sharable annotated files used for GROBID segmentation training.
- create_folds.py: a Python script for generating five different configurations for model training and evaluation from the original corpus, along with the evaluation results (evaluation_results).
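The description does not say how create_folds.py builds its five configurations. As an illustration only, the sketch below shows one common way to derive five train/evaluation splits from a 227-document corpus, assuming a plain five-fold partition. The function name `make_folds`, the placeholder document identifiers, and the fixed seed are all hypothetical and are not taken from the actual script.

```python
import random


def make_folds(doc_ids, k=5, seed=0):
    """Return k (train_ids, eval_ids) pairs from a list of document ids.

    Assumption: a simple k-fold partition; the real create_folds.py
    may use a different strategy (e.g. stratifying by discipline).
    """
    ids = sorted(doc_ids)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)   # deterministic shuffle for a given seed
    folds = [ids[i::k] for i in range(k)]  # k disjoint, near-equal slices

    configs = []
    for i in range(k):
        eval_ids = folds[i]
        # Training set: every document that is not in the held-out fold.
        train_ids = [d for j, fold in enumerate(folds) if j != i for d in fold]
        configs.append((train_ids, eval_ids))
    return configs


if __name__ == "__main__":
    # Placeholder identifiers; the real ones are listed in corpus_metadata.csv.
    ids = [f"doc_{n:03d}" for n in range(227)]
    for i, (train, held_out) in enumerate(make_folds(ids), start=1):
        print(f"config {i}: {len(train)} train / {len(held_out)} eval")
```

With this scheme every document is held out for evaluation in exactly one configuration, so the five evaluation results together cover the whole corpus.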

Created:

2026-03-06
