ProCIS
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/algoprog/procis
Description:
ProCIS is a collection of Reddit discussion threads in which multiple users interact, each containing at least one comment with a Wikipedia link. Because the linked Wikipedia articles typically supply context or background relevant to the conversation, they are treated as retrieval targets. The corpus covers 5,315,384 Wikipedia articles, and the conversations are split into a training set (2,830,107), a development set (4,165), a future development set (3,385), and a test set (100). The future development set contains conversations that chronologically follow those in the training set, which helps evaluate how well retrieval models generalize to emerging concepts. The test set carries dense, turn-level human relevance annotations. The dataset is intended for studying proactive search in multi-party conversations.
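The way linked Wikipedia articles become retrieval targets can be illustrated with a minimal sketch. It is not the dataset authors' extraction code: the regular expression, the `wikipedia_targets` helper, and the sample thread are all hypothetical, and the sketch assumes comments are plain strings containing English Wikipedia URLs.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern for English Wikipedia article links in free text.
# Limitation: a title that itself ends in ")" would be truncated.
WIKI_LINK = re.compile(r"https?://en\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]#?]+)")


def wikipedia_targets(comments):
    """Return distinct linked article titles, in order of first appearance."""
    seen = []
    for text in comments:
        for raw in WIKI_LINK.findall(text):
            # Drop trailing sentence punctuation, decode %-escapes,
            # and turn URL underscores back into spaces.
            title = unquote(raw.rstrip(".,;:!")).replace("_", " ")
            if title not in seen:
                seen.append(title)
    return seen


# Made-up thread for illustration only.
thread = [
    "Why is the sky blue?",
    "See https://en.wikipedia.org/wiki/Rayleigh_scattering for the physics.",
    "Also [this](https://en.wikipedia.org/wiki/Mie_scattering) for clouds.",
]
print(wikipedia_targets(thread))  # ['Rayleigh scattering', 'Mie scattering']
```

In a proactive retrieval setting, a model would see the conversation up to a given turn and be scored on whether it surfaces titles like these without being asked.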



