Baidu-ULTR
Source: arXiv. Updated: 2024-05-15. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2404.02543v3
Description:
Baidu-ULTR is a large-scale search dataset collected from the Baidu search engine. It contains click data from over 1.2 billion user sessions, plus 397,572 annotated query-document pairs for evaluation. The dataset reflects the long-tail query distribution of real user traffic and includes rich user feedback signals: clicks, dwell time, whether a document was scrolled off the screen, and whether the user returned to the results page after clicking. It does not contain pre-computed ranking features. Instead, it provides the raw search queries, titles, and abstracts, which have been tokenized for privacy protection. Baidu-ULTR is well suited to training and evaluating transformer-based ranking models such as MonoBERT or MonoT5 on the unbiased learning-to-rank problem in search engines.
Provider:
University of Amsterdam
Created:
2024-04-03



