DaReCzech
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/seznam/dareczech
Description:
DaReCzech is a Czech-language dataset for text relevance ranking. It contains more than 1.6 million query-document pairs, split into four subsets: Train-big, Train-small, Dev, and Test. Each record consists of a query, a URL, the document title, a body text excerpt (BTE) of the document, and a relevance label. The queries are real user inputs with only minor corrections, and irrelevant content was removed from the documents during preprocessing.
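The record layout described above can be sketched as a small data structure. This is a minimal illustration only: the field names, the numeric type of the label, and the helper `rank_by_query` are assumptions made for this example and are not taken from the dataset's actual file format or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    # Field names are hypothetical; the real column names may differ.
    query: str
    url: str
    title: str
    bte: str      # body text excerpt of the document
    label: float  # relevance label; a numeric scale is assumed here


def rank_by_query(records, score):
    """Group records by query and sort each group by descending score."""
    groups = defaultdict(list)
    for record in records:
        groups[record.query].append(record)
    return {
        query: sorted(docs, key=score, reverse=True)
        for query, docs in groups.items()
    }
```

Passing `score=lambda r: r.label` would produce the ordering implied by the labels themselves, while a relevance model's prediction function could be passed instead to produce a ranking for evaluation.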



