GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data]
收藏DataONE2019-12-09 更新2024-06-08 收录
下载链接:
https://search.dataone.org/view/sha256:e3c20d740cae52537cc7ac98400a31dbc6040267a40bd4df6207d07a77430220
下载链接
链接失效反馈官方服务:
资源简介:
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep-learning for reference string parsing. The challenge with deep learning is, however, that the training step requires enormous amounts of labeled data – which does not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT. GIANT is a large dataset with 991,411,100 XML labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the citation-style language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and code to create it, are freely available at https://github.com/BeelGroup/.
创建时间:
2023-11-22



