Douban Conversation Corpus

Name: Douban Conversation Corpus
Creator: OpenDataLab
Published: 2026-05-24 03:30:14
License: No description available

OpenDataLab, updated 2026-05-24, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/Douban_Conversation_Corpus

Resource description:

We release the Douban Conversation Corpus, which comprises a training set, a development set, and a test set for retrieval-based chatbots. The statistics of the corpus are shown in the following table.

| Statistic | Training Set | Development Set | Test Set |
|---|---|---|---|
| Number of session-response pairs | 1M | 50K | 10K |
| Average positive responses per session | 1 | 1 | 1.18 |
| Fleiss' kappa | N/A | N/A | 0.41 |
| Minimum number of turns per session | 3 | 3 | 3 |
| Maximum number of turns per session | 98 | 91 | 45 |
| Average number of turns per session | 6.69 | 6.75 | 5.95 |
| Average number of words per utterance | 18.56 | 18.50 | 20.74 |

The test set contains 1,000 dialogue contexts, and for each context we created 10 candidate responses. We recruited three annotators to judge whether each candidate is an appropriate response to the session. An appropriate response is one that can naturally reply to the message given the context. Each pair receives three labels, and the majority label is taken as the final decision. To the best of our knowledge, this is the first human-labeled test set for retrieval-based chatbots.

The entire corpus is available at: https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0

Data template: label \t conversation utterances (separated by \t) \t response
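The data template and the majority-vote rule described above can be sketched in Python as follows. This is a minimal illustration, not the corpus authors' own tooling: the names `Example`, `parse_line`, and `majority_label` are hypothetical, and the code assumes that the label is an integer (for instance 1 for an appropriate response and 0 otherwise) and that fields are tab-separated with the response in the last field.

```python
from collections import Counter
from typing import List, NamedTuple


class Example(NamedTuple):
    """One record of the corpus (hypothetical container, not an official schema)."""
    label: int          # assumed to be an integer such as 0 or 1
    context: List[str]  # the conversation utterances, in order
    response: str       # the candidate response


def parse_line(line: str) -> Example:
    """Split a line laid out as: label, utterances..., response (tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected a label, at least one utterance, and a response")
    return Example(label=int(fields[0]), context=fields[1:-1], response=fields[-1])


def majority_label(labels: List[int]) -> int:
    """Return the label chosen by most annotators (three votes per pair)."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `parse_line("1\thello\thow are you\tfine thanks")` yields a record with two context utterances and the response `fine thanks`, and `majority_label([1, 0, 1])` gives `1`. With three binary votes a tie cannot occur, so no tie-breaking rule is needed in this sketch.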

Provider:

OpenDataLab

Created:

2022-05-30

Collection Summary

Dataset Introduction