Douban Group Human-Human Context-Response Pairs
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/love1life/chat
Description:
This dataset consists of human-human context-response pairs crawled from Douban groups, and it includes both rewritten and unrewritten versions of the last utterance. Each sample is a quadruple of context, last utterance, response, and rewritten last utterance, with average lengths of 12.69, 11.90, 15.15, and 14.27 words, respectively. The ratio of rewritten to unrewritten last utterances is 1.426:1. The training set contains 6,844,393 quadruples, the validation set 1,000, and the test set 1,074. The dataset supports two tasks: multi-turn response generation and context rewriting.
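As a rough illustration of the figures above, the following Python sketch models one quadruple and works through the split sizes and the 1.426:1 ratio. The `Quadruple` class, its field names, and the sample sentences are hypothetical and invented for illustration; the actual file layout in the repository is not described here and may differ.

```python
from dataclasses import dataclass


@dataclass
class Quadruple:
    """Hypothetical container for one sample; field names are assumptions."""
    context: str
    last_utterance: str
    response: str
    rewritten_last_utterance: str


# Split sizes as stated in the description.
SPLIT_SIZES = {"train": 6_844_393, "validation": 1_000, "test": 1_074}


def rewritten_share(ratio: float) -> float:
    """Convert a rewritten:unrewritten ratio of ratio:1 into the rewritten fraction."""
    return ratio / (ratio + 1.0)


# Invented placeholder sample (not taken from the dataset): the rewritten
# last utterance restates what "it" refers to in the context.
sample = Quadruple(
    context="Have you seen the new movie?",
    last_utterance="Yes, I liked it.",
    response="Me too, the ending was great.",
    rewritten_last_utterance="Yes, I liked the new movie.",
)

total = sum(SPLIT_SIZES.values())
share = rewritten_share(1.426)

print(f"total quadruples: {total:,}")
print(f"rewritten share of last utterances: {share:.1%}")
```

Under these assumptions, a ratio of 1.426:1 means that roughly 59% of last utterances are rewritten, and the validation and test sets together make up a very small fraction of the total.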
Provided by:
Authors of the paper



