
peS2o_filtered

ModelScope community, updated 2026-01-02, indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/common-pile/peS2o_filtered
Resource description:
# PeS2o

## Description

This dataset is a version of the [peS2o dataset](https://huggingface.co/datasets/allenai/peS2o) restricted to openly licensed articles. PeS2o is derived from [S2ORC](https://github.com/allenai/s2orc), a corpus of openly licensed abstract and full-text papers that have been converted to a structured format using [Grobid](https://github.com/kermitt2/grobid). Starting from Grobid’s XML output, peS2o filters papers that are too short, have incorrect metadata, are in languages other than English, and contain OCR errors using a combination of heuristic- and model-based filtering steps. Please refer to the peS2o [datasheet](https://huggingface.co/datasets/allenai/peS2o) and [code](https://github.com/allenai/peS2o) for more details on the peS2o processing pipeline.

For the openly licensed articles in this collection, per-document license information is available in the `license` entry of the `metadata` field of each example.

## Dataset Statistics

| Documents | UTF-8 GB |
|-----------|----------|
| 6,117,280 | 182.6    |

## License Issues

While we aim to produce datasets with completely accurate licensing information, license laundering and inaccurate metadata can cause us to erroneously assign the incorrect license to some documents (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)). If you believe you have found an instance of incorrect licensing in this dataset, please [start a discussion](https://github.com/r-three/common-pile/discussions/new) on this repository.

## Other Versions

This is the "filtered" version of the openly licensed peS2o dataset. If you are looking for the raw version, you can find it [here](https://huggingface.co/datasets/common-pile/peS2o_raw).

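As a rough illustration of how the per-document `license` entry of the `metadata` field mentioned in the Description might be inspected, the sketch below tallies license strings over a handful of examples. Several things here are assumptions rather than facts from this card: the helper names (`license_of`, `tally_licenses`, `stream_examples`) are hypothetical, the Hugging Face repository ID `common-pile/peS2o_filtered` and the `train` split are inferred from the links above, and `metadata` is assumed to be either a mapping or a JSON-encoded string.

```python
import json
from collections import Counter
from itertools import islice
from typing import Any, Iterable, Mapping


def license_of(example: Mapping[str, Any]) -> str:
    """Return the license string of one example, or "unknown" if it is absent."""
    metadata = example.get("metadata") or {}
    if isinstance(metadata, str):
        # Assumption: some exports may serialize metadata as a JSON string.
        metadata = json.loads(metadata)
    return metadata.get("license") or "unknown"


def tally_licenses(examples: Iterable[Mapping[str, Any]], limit: int = 1000) -> Counter:
    """Count license strings over at most `limit` examples."""
    return Counter(license_of(example) for example in islice(examples, limit))


def stream_examples():
    """Stream the dataset without downloading it in full.

    Assumptions: the repository ID and split name below are not confirmed
    by this card; adjust them if the actual layout differs.
    """
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("common-pile/peS2o_filtered", split="train", streaming=True)
```

With the `datasets` library installed and network access available, `tally_licenses(stream_examples(), limit=1000)` would count the licenses of the first 1,000 streamed documents. Streaming is used here because the full collection is about 182.6 GB of text.
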
## Citation

If you use this dataset, please cite:

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```

```bibtex
@techreport{peS2o,
  author = {Luca Soldaini and Kyle Lo},
  year = 2023,
  title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},
  institution = {{Allen Institute for AI}},
  note = {ODC-By, \url{https://github.com/allenai/pes2o}}
}
```

Provider: maas
Created: 2025-06-11