Reddit Conversation Corpus

Name: Reddit Conversation Corpus
Creator: GitHub repository
License: No description available

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/nouhadziri/THRED

Resource description:

This is a high-quality three-turn dialogue dataset designed for topic-aware response generation. The topic labels and topic words are predicted by LDA; comparable labeled text data is difficult to obtain from other sources. The full corpus contains 9.2 million samples, from which 3 million triples of the form (dialogue history, topic, target response) were extracted as the labeled dialogue corpus. The task this dataset targets is topic-aware response generation.

Provided by:

GitHub repository

Collected summary

Background and challenges

Background overview

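Reddit Conversation Corpus is a large-scale dataset for research on multi-turn dialogue generation. It contains conversations collected from 95 subreddits between November 2016 and August 2018. The dataset is provided in versions with different numbers of turns (3, 4, and 5). Each conversation consists of TAB-separated utterances and topic words, and is intended to support training of context-aware and topic-aware response generation models.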
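The record layout described above can be sketched in Python as follows. This is a minimal illustration, not the repository's own loader: it assumes each line holds the utterances first and the topic words in the final TAB-separated field, with topic words separated by whitespace. The names `Triple` and `parse_line` and the sample line are invented for this example and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    history: List[str]  # all utterances before the final one
    topic: List[str]    # topic words (assumed whitespace-separated)
    target: str         # final utterance, the response to be generated


def parse_line(line: str, num_turns: int = 3) -> Triple:
    """Split one TAB-separated record into (history, topic, target).

    Assumption: the first `num_turns` fields are utterances and the
    last field holds the topic words.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != num_turns + 1:
        raise ValueError(f"expected {num_turns + 1} fields, got {len(fields)}")
    utterances = fields[:num_turns]
    topic_field = fields[num_turns]
    return Triple(utterances[:-1], topic_field.split(), utterances[-1])


# Invented sample line, for illustration only.
sample = "any plans tonight ?\tthinking about a movie\twhich one ?\tmovie film cinema\n"
triple = parse_line(sample)
print(triple.history, triple.topic, triple.target)
```

For the 4-turn and 5-turn versions, the same function would be called with `num_turns=4` or `num_turns=5`, provided the field order assumed here matches the actual files.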

The summary above was collected and generated by 遇见数据集.
