
Data for: CLAIRE: A combinatorial visual analytics system for information retrieval evaluation

NIAID Data Ecosystem, indexed 2026-03-10
Download link:
https://data.mendeley.com/datasets/mdwvttzt48
Description:
We considered the following standard and shared collections, each track using 50 different topics:
• TREC Adhoc tracks T07 and T08: they focus on a news search task and adopt a corpus of about 528K news documents.
• TREC Web tracks T09 and T10: they focus on a Web search task and adopt a corpus of 1.7M Web pages.
• TREC Terabyte tracks T14 and T15: they focus on a Web search task and adopt a corpus of 125M Web pages.

We considered three main components of an IR system: stop list, stemmer, and IR model. We selected a set of alternative implementations of each component and, by using the Terrier v.4.02 open source system, we created a run for each system defined by combining the available components in all possible ways. The selected components are:
• Stop list: nostop, indri, lucene, snowball, smart, terrier;
• Stemmer: nolug, weakPorter, porter, snowballPorter, krovetz, lovins;
• Model: bb2, bm25, dfiz, dfree, dirichletlm, dlh, dph, hiemstralm, ifb2, inb2, inl2, inexpb2, jskls, lemurtfidf, lgd, pl2, tfidf.

Overall, these components define a 6 × 6 × 17 factorial design with a GoP consisting of 612 system runs. They represent nearly all the state-of-the-art components which constitute the common denominator almost always present in any IR system for English retrieval and thus they are a good account of what can be found in many different operational settings.
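The factorial design described above can be checked with a short enumeration. The following is a minimal sketch, not the authors' tooling: it only lists the component names given in the description and combines them in all possible ways. The function name `enumerate_systems` and the underscore-joined label format are hypothetical choices made for illustration; the dataset's actual run naming may differ.

```python
from itertools import product

# Component names copied from the dataset description.
STOP_LISTS = ["nostop", "indri", "lucene", "snowball", "smart", "terrier"]
STEMMERS = ["nolug", "weakPorter", "porter", "snowballPorter", "krovetz", "lovins"]
MODELS = [
    "bb2", "bm25", "dfiz", "dfree", "dirichletlm", "dlh", "dph", "hiemstralm",
    "ifb2", "inb2", "inl2", "inexpb2", "jskls", "lemurtfidf", "lgd", "pl2", "tfidf",
]


def enumerate_systems():
    """Return one label per (stop list, stemmer, model) combination.

    The "stop_stemmer_model" label format is an assumption for illustration,
    not the naming scheme used in the dataset.
    """
    return [
        f"{stop}_{stem}_{model}"
        for stop, stem, model in product(STOP_LISTS, STEMMERS, MODELS)
    ]


systems = enumerate_systems()
print(len(STOP_LISTS), len(STEMMERS), len(MODELS))  # 6 6 17
print(len(systems))  # 612
```

The count 6 × 6 × 17 = 612 matches the number of system runs stated in the description.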

Created:
2018-08-01