BASF-AI/ChemRxiv-Paragraphs
Source: Hugging Face. Updated: 2025-11-14. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/BASF-AI/ChemRxiv-Paragraphs
Description:
This dataset consists of paragraphs from ChemRxiv papers under **CC BY 4.0** and **CC BY-NC 4.0** licenses. The paragraphs are extracted using the Grobid tool and filtered based on an average log word probability, similar to the approach used in allenai/peS2o. The dataset contains training set paragraphs from 5,848 papers with CC BY 4.0 licenses and 3,082 papers with CC BY-NC 4.0 licenses.
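The filtering step mentioned above can be illustrated with a minimal sketch. It assumes a unigram model with add-one smoothing; the word counts, the threshold, and the function name `average_log_word_probability` are hypothetical and are not taken from the dataset card or from allenai/peS2o.

```python
import math
import re

def average_log_word_probability(text, word_counts, total_count):
    """Mean natural-log unigram probability of the words in `text`.

    Uses add-one smoothing so unseen words get a small nonzero probability.
    Returns -inf for text with no alphabetic words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return float("-inf")
    # +1 in the denominator reserves probability mass for unseen words.
    denominator = total_count + len(word_counts) + 1
    log_probs = [
        math.log((word_counts.get(word, 0) + 1) / denominator)
        for word in words
    ]
    return sum(log_probs) / len(log_probs)

# Toy frequency table (hypothetical); a real filter would use counts
# estimated from a large reference corpus.
WORD_COUNTS = {"the": 50, "of": 30, "was": 10, "reaction": 5, "catalyst": 3, "yield": 2}
TOTAL_COUNT = sum(WORD_COUNTS.values())

# Hypothetical cutoff; the value used for this dataset is not stated.
THRESHOLD = -4.0

paragraphs = [
    "The yield of the reaction was",  # ordinary words, high average
    "xq zzv kkjw pplm",               # extraction noise, low average
]

kept = [
    paragraph
    for paragraph in paragraphs
    if average_log_word_probability(paragraph, WORD_COUNTS, TOTAL_COUNT) >= THRESHOLD
]
```

Paragraphs made of common words score close to zero, while garbled extraction output is dominated by unseen tokens and falls below the cutoff, so only the first paragraph survives in this toy example.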
Provider:
BASF-AI



