DaReCzech
Source: arXiv. Updated: 2021-12-03. Indexed: 2024-06-21.
Download link:
https://github.com/Seznam/DaReCzech
Resource summary:
DaReCzech is a large Czech-language dataset created by Seznam.cz. It contains 1.6 million user query-document pairs, each assigned a relevance level by expert human annotators. The dataset is intended to support the search relevance and multilingual research communities, with a focus on text relevance ranking for Czech. It is split into four parts (Train-big, Train-small, Dev, and Test), each containing distinct query-document pairs for training and evaluating text relevance models. The data was extracted from older data pools, then preprocessed and annotated to ensure quality and suitability. Its main application is improving the relevance of search results, particularly in Czech web search, by providing high-quality training data for relevance ranking algorithms.
Provider:
Seznam.cz
Created:
2021-12-03



