dolma-reddit-to-flashcards-0625

Name: dolma-reddit-to-flashcards-0625
Creator: maas
Published: 2025-11-27 16:41:28
License: no description available

ModelScope community; updated 2025-11-27; indexed 2025-08-02

Download link:

https://modelscope.cn/datasets/allenai/dolma-reddit-to-flashcards-0625

Resource description:

## Overview

**Dolma Reddit to Flashcards** is a dataset of synthetically generated QA items created from filtered Reddit data. The creation of this dataset was motivated by the observation in [Dolma](https://huggingface.co/datasets/allenai/dolma) (Soldaini et al. 2024) that the original Dolma Reddit data showed no benefit from including thread-level context over isolated submissions and comments, and that clean performance distinctions between tested Reddit versions were limited mainly to the HellaSwag benchmark.

The filtering and rewriting process described here was motivated by the hypothesis that Reddit's thread context can be better leveraged for downstream performance benefits, and that the diverse specialized knowledge present in Reddit should be able to benefit knowledge-based QA tasks such as MMLU.

The changes that resulted in the Dolma Reddit to Flashcards dataset have three basic parts:

1. constructing thread contexts inspired by QA structure
2. filtering to high-quality subreddits with relevance for academic topics
3. rewriting the content from those subreddits to reduce noise and increase resemblance to standard MCQA

### Dataset statistics:

- 158,283,954 documents
- 9,860,465,975 tokens

### Dataset fields:

- **id**: IDs contain two short alphanumeric strings (six and seven characters in the example below) which can be used to identify the original submission and comment in the PushShift Reddit dataset. For example, the document with ID "part-152-00000_100387_2fv86m_ckd2a31_1" was derived from the concatenation of submission *2fv86m* and comment *ckd2a31* from the PushShift data.
- **text**: Text of the QA document.

---

## Dataset Construction

The construction of this dataset involved three major phases.

### 1. Reddit data filtering

A dataset of submission/comment pairs was derived from the PushShift Reddit dataset (Baumgartner et al. 2020; bulk dump as of March 2023), the same dump used for [Dolma Reddit](https://huggingface.co/datasets/allenai/dolma).

To leverage thread context while laying groundwork for QA-type structure, we extracted each submission and concatenated it with its top-scoring, top-level comment. (In the case of tied top-scoring comments, we chose the longer of the two.)

We then performed further rule-based filtering with the following constraints:

- Filter out deleted/removed content.
- Filter out content marked as over_18.
- Filter out all posts from a list of 26,123 banned or NSFW subreddits.
- Filter out posts from likely bot authors (drawn from https://botrank.pastimes.eu/ as of Sept 2024).
- Filter out posts containing non-text media.
- Perform document-level text deduplication via Bloom filter.

### 2. Retrieval-based subreddit selection

Dense retrieval was then used to identify academically relevant subreddits for further filtering. We adapted search queries from MMLU test questions, and performed dense retrieval with these queries on the filtered Reddit data from Step #1, retaining the top 5 hits for each query. Based on these retrieved outputs, we selected 151 subreddits meeting the following criteria:

- Subreddit has >= 20 *unique* retrieved items for queries within a given MMLU category; OR
- Subreddit has >= 100 retrieved items for queries across all MMLU categories.

We then filtered the dataset from Step #1 to retain only documents from subreddits on this list of 151 subreddits.

### 3. Format rewriting

Finally, the data from Step #2 was input to a synthetic rewriting pipeline to generate academic QA items with coverage of diverse question formats. We defined 7 categories of question format inspired by variation observed in MMLU, and used these to construct prompts for QA text generation. The format categories are as follows:

1. open-ended
2. statement completion
3. fill-in-the-blank
4. statement truth verification
5. which-of-following-has-property-X
6. which-of-following-is-true
7. in-question options

For each format category we constructed a prompt for generating questions of that category given an input text. Below is an example prompt, for the "in-question-options" category. Prompts for other categories differ in 1) the content of the "For format ..." paragraph and 2) the in-context examples (1-3 examples per prompt).

```
I will ask you to convert a text into multiple-choice questions. Here is the text:

"{text}"

Instructions: Convert the information in the text into academic multiple choice questions. ONLY include questions that are academic. DONOT reference the text in the question.

For format, use questions that provide options within the question and give choices for which options are true. Examples:

Dogs have which of the following properties?
I. They are mammals
II. They have five legs.
III. They have a tail.
A. I only
B. II only
C. III only
D. I and III
Answer: D
%%%%
Which of the following are cities in the US?
I. Paris
II. Athens
III. Chicago
A. I only
B. II only
C. III only
D. I, II and III
Answer: C

Separate ALL questions with "\n%%%%\n".
```

For generating our rewritten QA data, we prompted GPT-4o mini (Jan 2025 version). We iterated over the submission/comment pairs in the data from Step #2, and for each of these texts we sampled a format category and prompted GPT-4o mini to generate QA pairs for that text and format category. For longer input texts, format categories were resampled and prompted for again, a number of times proportional to the length of the text.

Finally, GPT-4o mini outputs were parsed into separate QA items based on the "%%%%" separator, and 50% of items were prepended with the prefix "Question: ".

## Results

We validate these data in experiments with OLMo 7B (Groeneveld et al. 2024) models trained to 2T tokens, carrying out continued pretraining on a 50-50 mix of DCLM and Reddit data while annealing the learning rate to zero. We run this continued pretraining with three versions of Reddit data: the filtered data from Step #2, a more loosely filtered (lower selection threshold) version of Step #2 to serve as baseline, and the rewritten data from Step #3.

We find that this dataset has clear downstream benefits for MCQA tasks, with the rewriting in particular yielding substantial improvement over filtered Reddit alone. While the impact of shifting to more stringently filtered data is negligible (MMLU moves from 0.615 to 0.612, and MC9 moves from 0.742 to 0.74), the benefit from the rewriting phase is substantial: comparing the rewritten Step #3 Reddit data to the non-rewritten Step #2 Reddit data, **MMLU improves from 0.62 to 0.66**, and **MC9 improves from 0.74 to 0.76**.

```
@techreport{dolma-reddit-to-flashcards,
    author = {Allyson Ettinger and Luca Soldaini and Kyle Lo},
    year = 2025,
    title = {{Dolma Reddit to Flashcards Dataset}},
    institution = {{Allen Institute for AI}}
}
```
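As a practical aid for the **id** field described under "Dataset fields", the following is a minimal sketch of how a document ID might be split back into its PushShift submission and comment IDs. It assumes, based only on the single example given in this card, that the last three underscore-separated fields are the submission ID, the comment ID, and a trailing index; the names `RedditRef` and `parse_doc_id` are hypothetical and not part of any released tooling.

```python
from typing import NamedTuple


class RedditRef(NamedTuple):
    prefix: str         # leading shard/position part; its meaning is not documented
    submission_id: str  # PushShift submission ID
    comment_id: str     # PushShift comment ID
    suffix: str         # trailing index; its meaning is not documented


def parse_doc_id(doc_id: str) -> RedditRef:
    # Assumption: the ID ends with <submission id>_<comment id>_<suffix>,
    # as in the one example shown in the dataset card.
    parts = doc_id.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected id layout: {doc_id!r}")
    return RedditRef(*parts)


ref = parse_doc_id("part-152-00000_100387_2fv86m_ckd2a31_1")
print(ref.submission_id, ref.comment_id)  # 2fv86m ckd2a31
```

Splitting from the right avoids depending on the structure of the prefix, which itself contains an underscore in the documented example.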

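The two subreddit selection criteria from Step #2 can be illustrated with a small sketch. This is not the authors' code: the input layout (tuples of MMLU category, subreddit, and document ID) and the function name `select_subreddits` are assumptions made for illustration, and the second criterion is read here as counting all retrieved items, since only the first criterion is stated as requiring *unique* items.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Hit = Tuple[str, str, str]  # (mmlu_category, subreddit, doc_id), a hypothetical layout


def select_subreddits(
    hits: Iterable[Hit],
    per_category_min: int = 20,
    overall_min: int = 100,
) -> Set[str]:
    unique_docs = defaultdict(set)  # (subreddit, category) -> unique doc ids
    total_hits = defaultdict(int)   # subreddit -> all retrieved items
    for category, subreddit, doc_id in hits:
        unique_docs[(subreddit, category)].add(doc_id)
        total_hits[subreddit] += 1

    # Criterion 1: >= 20 unique retrieved items within a single MMLU category.
    selected = {
        subreddit
        for (subreddit, _category), docs in unique_docs.items()
        if len(docs) >= per_category_min
    }
    # Criterion 2: >= 100 retrieved items across all MMLU categories.
    selected |= {s for s, n in total_hits.items() if n >= overall_min}
    return selected
```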
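The final parsing step of the rewriting phase (splitting model outputs on "%%%%" and prefixing half of the items with "Question: ") might look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline actually used: the function name `split_qa_items` is hypothetical, and "50% of items" is interpreted here as an independent 0.5 probability per item rather than an exact half.

```python
import random
from typing import List

SEPARATOR = "%%%%"
PREFIX = "Question: "


def split_qa_items(
    model_output: str,
    rng: random.Random,
    prefix_prob: float = 0.5,
) -> List[str]:
    # Split on the separator requested in the prompt and trim whitespace.
    chunks = [chunk.strip() for chunk in model_output.split(SEPARATOR)]
    # Drop empty chunks, e.g. from a trailing separator.
    items = [chunk for chunk in chunks if chunk]
    # Assumption: each item gets the prefix independently with prefix_prob.
    return [PREFIX + item if rng.random() < prefix_prob else item for item in items]


output = "Dogs are what?\nA. Mammals\nB. Fish\nAnswer: A\n%%%%\nParis is in France. True or false?\nAnswer: True\n"
for item in split_qa_items(output, random.Random(0)):
    print(item, end="\n\n")
```

Passing an explicit `random.Random` instance keeps the prefixing reproducible across runs.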

Provider:

maas

Created:

2025-07-20
