allenai/dolma-pes2o-cc-pd
Source: Hugging Face. Updated 2024-11-24; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/allenai/dolma-pes2o-cc-pd
Description:
This dataset is the Creative Commons and public domain subset of the open access papers in the peS2o dataset. The collection cutoff date is October 6, 2024; the train set contains papers up to August 31, 2024. The dataset is split into a train set with 6,254,908 examples and a validation set with 39,112 examples. Detailed statistics are provided for properties such as the number of documents, whitespace characters, and UTF-8 characters. Documents are also categorized by license type (CC-BY, CC-BY-SA, CC0, public domain) and by field of study (Medicine, Biology, Environmental Science, Engineering, Computer Science, and others), with per-field document counts given for both the train and validation sets.
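
As a rough illustration of the split sizes above, the sketch below computes each split's share of the total and shows one possible way to peek at a few records. It is a minimal sketch under assumptions, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the repository ID in the title is reachable, and that the splits are named `train` and `validation`. The helper names `split_fractions` and `stream_examples` are hypothetical, and the record fields are not documented on this page, so the code does not rely on any of them.

```python
# Split sizes as stated in the description above.
SPLIT_SIZES = {"train": 6_254_908, "validation": 39_112}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def stream_examples(split="validation", limit=3):
    """Yield the first `limit` records of a split without a full download.

    Assumption: the split names match those used in the description.
    """
    # Imported lazily so the arithmetic above works without the package.
    from datasets import load_dataset

    dataset = load_dataset(
        "allenai/dolma-pes2o-cc-pd", split=split, streaming=True
    )
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield record


if __name__ == "__main__":
    for name, share in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {share:.4%}")
```

Streaming is used here because the train split is large; calling `stream_examples("validation")` only fetches as many records as are iterated.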
Provider:
allenai



