REGen_data (Retrieval Generation Chat dataset)
Source: ieee-dataport.org, indexed 2025-03-25
Download link:
https://ieee-dataport.org/documents/regendataretrieval-generation-chat-dataset
Description:
This entry provides the dataset and source code used in the paper "Pick the Better and Leave the Rest: Leveraging Multiple Retrieved Results to Guide Response Generation". We conduct experiments on the Retrieval Generation Chat dataset, which contains about five million query-response pairs and provides 3 to 10 retrieved references for each query. In some samples the retrieved query is exactly the user's utterance; such retrieved candidates are taken as the ground truth and removed from the remaining candidates. We then remove queries with more than ten responses, since each reference is probably irrelevant to most of the replies. Moreover, following the setting of previous studies, only samples in which at least 20% of the corresponding references satisfy Jaccard(r; r(i)) > 0.3 are used for training, where Jaccard stands for the Jaccard distance. Finally, since each query corresponds to multiple replies, we split the filtered corpus by query into training (1,179,374), validation (21,462), and test (20,896) sets.
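The Jaccard-based training filter described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the set-of-tokens input, and the handling of empty inputs are all assumptions made for illustration. The description calls the measure the "Jaccard distance", so the sketch uses the distance by default and accepts the similarity as an alternative metric, in case the threshold was meant to apply to overlap instead.

```python
from typing import Callable, List, Set


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty (a convention chosen here)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def keep_for_training(
    response: Set[str],
    references: List[Set[str]],
    metric: Callable[[Set[str], Set[str]], float] = jaccard_distance,
    threshold: float = 0.3,     # the 0.3 cut-off from the description
    min_fraction: float = 0.2,  # "at least 20%" from the description
) -> bool:
    """Hypothetical filter: keep a sample when enough references pass the threshold."""
    if not references:
        return False  # assumption: a sample without references is dropped
    hits = sum(1 for ref in references if metric(response, ref) > threshold)
    return hits / len(references) >= min_fraction


# Usage with whitespace tokenization (also an assumption)
response = set("a b c d".split())
references = [set("a b c d".split()), set("a b x y".split()), set("x y z".split())]
print(keep_for_training(response, references))
print(keep_for_training(response, references, metric=jaccard_similarity))
```

Passing the metric as a parameter keeps the 20% and 0.3 thresholds from the description in one place while leaving the similarity-versus-distance reading open.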
Provider:
ieee-dataport.org



