REGen_data (Retrieval Generation Chat dataset)
Source: DataCite Commons. Updated: 2023-09-15. Indexed: 2025-04-16.
Download link:
https://ieee-dataport.org/documents/regendataretrieval-generation-chat-dataset
Description:
The dataset and source code used in the paper "Pick the Better and Leave the Rest: Leveraging Multiple Retrieved Results to Guide Response Generation".
We conduct experiments on the Retrieval Generation Chat dataset, which contains about five million query-response pairs and provides 3 to 10 retrieved references for each query. In some samples the retrieved query is exactly the user's utterance; such retrieved candidates are taken as the ground truth and removed from the remaining candidates. We then remove queries with more than ten responses, since each reference is probably irrelevant to most of those replies. Following the setting of previous studies, only samples in which at least 20% of the corresponding retrieved references r(i) satisfy Jaccard(r; r(i)) > 0.3 are used for training, where r is the response and Jaccard stands for the Jaccard distance. Finally, because each query corresponds to multiple replies, we split the filtered corpus by query into training (1,179,374), validation (21,462), and test (20,896) sets.
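The filtering rule and the query-based split described above can be illustrated with a minimal sketch. It is not the authors' code. Several points are assumptions: the score is computed as token-set overlap (intersection over union) because a higher value is treated as passing, even though the description calls it Jaccard distance; tokens are obtained by whitespace splitting; and the names `jaccard`, `keep_sample`, `split_by_query`, the `"query"` field, and the split fractions are hypothetical.

```python
import random
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-set overlap |A & B| / |A | B| (assumed form of the measure)."""
    sa, sb = set(a.split()), set(b.split())  # assumption: whitespace tokens
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def keep_sample(response, references, threshold=0.3, min_fraction=0.2):
    """Keep a sample if at least `min_fraction` of its references
    score above `threshold` against the response."""
    if not references:
        return False
    hits = sum(jaccard(response, ref) > threshold for ref in references)
    return hits / len(references) >= min_fraction


def split_by_query(samples, valid_frac=0.02, test_frac=0.02, seed=0):
    """Place all samples that share a query in the same split.
    The fractions are hypothetical, not taken from the description."""
    by_query = defaultdict(list)
    for sample in samples:
        by_query[sample["query"]].append(sample)
    queries = sorted(by_query)
    random.Random(seed).shuffle(queries)
    n_valid = int(len(queries) * valid_frac)
    n_test = int(len(queries) * test_frac)
    splits = {"train": [], "valid": [], "test": []}
    for i, query in enumerate(queries):
        if i < n_valid:
            name = "valid"
        elif i < n_valid + n_test:
            name = "test"
        else:
            name = "train"
        splits[name].extend(by_query[query])
    return splits
```

Splitting on the query rather than on individual pairs keeps the several replies to one query from appearing in both the training set and an evaluation set.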
Provider:
IEEE DataPort
Created:
2023-09-15



