
peS2o

ModelScope community · updated 2025-12-05 · indexed 2025-05-31
Download link: https://modelscope.cn/datasets/allenai/peS2o
Description:
<p align="center" style="margin-top: -2em">
  <img src="https://huggingface.co/datasets/allenai/pes2o/resolve/main/logo.png" alt="peS2o logo. It's a picture of a mortar and pestle with documents flying in." width=384px height=auto>
</p>

<p align="center" style="font-size: 1.2em; margin-top: -1em"><i>Pretraining Effectively on <a href="https://github.com/allenai/s2orc">S2ORC</a>!</i></p>

The peS2o dataset is a collection of ~40M open-access academic papers, cleaned, filtered, and formatted for pre-training of language models. It is derived from the [Semantic Scholar Open Research Corpus][2] ([Lo et al., 2020][1]), or S2ORC.

We release multiple versions of peS2o, each with a different processing pipeline and knowledge cutoff date. We recommend using the latest version available.

If you use this dataset, please cite:

```bibtex
@techreport{peS2o,
    author = {Luca Soldaini and Kyle Lo},
    year = 2023,
    title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},
    institution = {{Allen Institute for AI}},
    note = {ODC-By, \url{https://github.com/allenai/pes2o}}
}
```

## Document Format

Each document in the dataset is a dictionary with the following fields:

- `added`: Date the document was added to the corpus.
- `created`: Best-guess date for when the document was first published. Some have resolution down to the day, others only down to the year.
- `id`: Semantic Scholar Corpus ID of the document; it can be used with the [Semantic Scholar API](https://api.semanticscholar.org/) to retrieve metadata about the document (e.g., fields of study, authors).
- `source`: Collection from which the document was sourced. At the moment, two are supported:
  - `s2orc`: collection of full-text papers
  - `s2ag`: collection of titles and abstracts
- `text`: Text of the document. Paragraphs are separated by two newlines (`\n\n`).
- `version`: Version of peS2o.
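The field list above can be illustrated with a minimal Python sketch. It assumes each record is serialized as one JSON object per line, which the description here does not state; every value in `EXAMPLE_LINE` (dates, ID, text, version string) is invented for illustration, and `parse_document` and `split_paragraphs` are hypothetical helper names.

```python
import json

# Hypothetical record: all values are made up; only the field names
# come from the "Document Format" list.
EXAMPLE_LINE = json.dumps({
    "added": "2023-01-03",
    "created": "2019",
    "id": "12345678",
    "source": "s2ag",
    "text": "An Example Title\n\nAn example abstract paragraph.",
    "version": "v2",
})

EXPECTED_FIELDS = {"added", "created", "id", "source", "text", "version"}


def parse_document(line: str) -> dict:
    """Parse one JSON-encoded record and check that all documented fields exist."""
    doc = json.loads(line)
    missing = EXPECTED_FIELDS - doc.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return doc


def split_paragraphs(text: str) -> list[str]:
    """Split the `text` field on the documented two-newline separator."""
    return [p for p in text.split("\n\n") if p.strip()]


doc = parse_document(EXAMPLE_LINE)
paragraphs = split_paragraphs(doc["text"])
print(doc["source"], len(paragraphs))
```

Splitting on `\n\n` rather than on any blank-line pattern follows the separator named in the `text` field description; a stricter or looser rule would be a design choice outside what the dataset card specifies.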
------

## peS2o V2 (Latest)

### Key Facts

- *Knowledge cutoff*: 2023-01-03
- *Number of documents*: 38.97M
- *Number of whitespace-separated tokens*: 42.01B

### Processing

peS2o V2 is largely the same as V1, but it includes additional heuristics for the `s2ag` subset aimed at filtering out OCR errors from abstracts.

First, we check if the abstract was obtained from Semantic Scholar sources that are likely to contain OCR'ed content. For any abstract derived from those sources, we count how often the text contains subsequences matching `\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b`, i.e. individual alpha letters separated by a space. This heuristic matches cases such as `A b stra ct` (2 matching subsequences), where the OCR parser inserted erroneous spaces. Any abstract with more than 4 matching subsequences is removed.

#### Statistics

| Dataset | Split | # Documents |        # Words |
|:-------:|:-----:|------------:|---------------:|
| s2orc   | train |   8,242,162 | 36,088,195,908 |
| s2orc   | valid |      51,323 |    255,139,074 |
| s2ag    | train |  30,569,017 |  5,920,099,207 |
| s2ag    | valid |     109,709 |     24,029,459 |

-------

## peS2o V1

### Key Facts

- *Knowledge cutoff*: 2023-01-03
- *Number of documents*: 67.56M
- *Number of whitespace-separated tokens*: 47.37B

### Processing

Processing differs slightly depending on whether a document was derived from the full-text corpus (`s2orc`) or the title and abstract corpus (`s2ag`).

#### S2ORC-derived documents

Unfiltered, S2ORC contains 11.3M papers and 46.9B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

- The paper must have a title and abstract.
- From each paper, we use [Grobid](https://github.com/kermitt2/grobid) to extract section headers and paragraphs; figures, tables, references, and any other non-textual content are removed. Titles and abstracts are also available, but they come from the Semantic Scholar metadata (obtained through the APIs), not Grobid.
- The paper must be in English.
  - To determine the language of each document, we use the [pycld3](https://github.com/bsolomon1124/pycld3) library.
  - We run pycld3 on the first 2000 characters of each paragraph in the paper.
  - The language of the paper is the most common language of the paragraphs.
- The paper must have at least 500 whitespace-separated words.
- The paper must have been published after 1969; papers published before this date are often obtained through OCR and contain unrecoverable errors.
- The paper must have at least 5 paragraphs.
- All sections that have an average log word probability of less than `-20` are removed.
  - To calculate the average log word probability, we use word frequencies extracted from the [1T Web Ngram corpus](https://catalog.ldc.upenn.edu/LDC2006T13); specifically, we use the list [created by Rachel Tatman](https://www.kaggle.com/datasets/rtatman/english-word-frequency). A copy is hosted [here](https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/lucas/google-1T-unigram/unigram_freq.csv).
- The most frequent word in the paper must consist of alpha characters only, and it must appear in less than 7.5% of the document.
  - Words are obtained by splitting the text on whitespace.

The train set contains papers published before 2022-12-01; the validation set includes documents published after 2022-12-01 and until 2023-01-03.

#### S2AG-derived documents

The S2AG corpus contains titles and abstracts of papers in Semantic Scholar. Unfiltered, the corpus contains 91.1M papers and 15.5B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

- Abstract must be in English.
  - To calculate the language, we once again use pycld3.
- Title must be in English, or have average unigram log probability greater than -20.
- Abstract must have higher than -20 average unigram log probability.
- Abstract must have at least 50 words.
- Abstract must have no more than 1000 words.
- The most frequent word in the union of text and abstract must be a 2+ character alpha word, or it can be `a` followed by a 2+ character alpha word.
- Paper was published after 1969.

#### Statistics

| Dataset | Split | # Documents |        # Words |
|:-------:|:-----:|------------:|---------------:|
| s2orc   | train |   8,242,162 | 36,088,195,908 |
| s2orc   | valid |      51,323 |    255,139,074 |
| s2ag    | train |  59,382,301 | 11,009,123,378 |
| s2ag    | valid |     111,228 |     24,398,512 |

[1]: https://aclanthology.org/2020.acl-main.447/
[2]: https://github.com/allenai/s2orc
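Two of the filters described in the processing notes, the V2 spaced-letter OCR heuristic and the V1 most-frequent-word constraint for S2ORC-derived papers, can be sketched in Python as follows. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, counting non-overlapping matches with `re.finditer` is an assumption and may not reproduce the match counts quoted in the text, the "7.5% of the document" threshold is interpreted here as a share of whitespace-split words, and the preliminary check for OCR-prone sources is omitted.

```python
import re
from collections import Counter

# Pattern quoted in the V2 processing notes: single letters separated by whitespace.
SPACED_LETTERS = re.compile(r"\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b")


def count_spaced_letter_runs(abstract: str) -> int:
    """Count non-overlapping pattern matches (the counting rule is an assumption)."""
    return sum(1 for _ in SPACED_LETTERS.finditer(abstract))


def looks_ocr_damaged(abstract: str, max_matches: int = 4) -> bool:
    """Flag an abstract with more than `max_matches` spaced-letter runs."""
    return count_spaced_letter_runs(abstract) > max_matches


def passes_most_frequent_word_check(text: str, max_share: float = 0.075) -> bool:
    """Check that the most frequent whitespace-split word is alphabetic
    and makes up less than `max_share` of all words (assumed interpretation)."""
    words = text.split()
    if not words:
        return False
    word, count = Counter(words).most_common(1)[0]
    return word.isalpha() and count / len(words) < max_share


print(count_spaced_letter_runs("T h e quick b r o w n fox"))
print(looks_ocr_damaged("a b, c d, e f, g h, i j"))
print(passes_most_frequent_word_check("the cat sat on the mat"))
```

The thresholds are exposed as parameters with the documented values (4 matches, 7.5%) as defaults, so the sketch can be tried against other cutoffs without editing the functions.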

Provider: maas
Created: 2025-05-28