Data for: CLAIRE: A combinatorial visual analytics system for information retrieval evaluation
Source: NIAID Data Ecosystem (indexed 2026-03-10)
Download link:
https://data.mendeley.com/datasets/mdwvttzt48
Description:
We considered the following standard, shared collections; each track uses 50 different topics:
• TREC Adhoc tracks T07 and T08: focus on a news search task and adopt a corpus of about 528K news documents.
• TREC Web tracks T09 and T10: focus on a Web search task and adopt a corpus of 1.7M Web pages.
• TREC Terabyte tracks T14 and T15: focus on a Web search task and adopt a corpus of 125M Web pages.
We considered three main components of an IR system: stop list, stemmer, and IR model. We selected a set of alternative implementations of each component and, by using the Terrier v.4.02 open source system, we created a run for each system defined by combining the available components in all possible ways. The selected components are:
• Stop list: nostop, indri, lucene, snowball, smart, terrier;
• Stemmer: nolug, weakPorter, porter, snowballPorter, krovetz, lovins;
• Model: bb2, bm25, dfiz, dfree, dirichletlm, dlh, dph, hiemstralm, ifb2, inb2, inl2, inexpb2, jskls, lemurtfidf, lgd, pl2, tfidf.
Overall, these components define a 6 × 6 × 17 factorial design, yielding a grid of points (GoP) of 612 system runs. They cover nearly all the state-of-the-art components that form the common denominator of almost any IR system for English retrieval, and so they are a good account of what can be found in many different operational settings.
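The factorial design described above can be sketched in a few lines of Python. This is only an illustration of how the 6 × 6 × 17 combinations add up to 612 systems; the component names are taken from the lists above, while the `stoplist_stemmer_model` identifier format and the `enumerate_systems` helper are assumptions made for this example and may not match the run names used in the dataset files.

```python
from itertools import product

# Component names as listed in the dataset description.
STOP_LISTS = ["nostop", "indri", "lucene", "snowball", "smart", "terrier"]
STEMMERS = ["nolug", "weakPorter", "porter", "snowballPorter", "krovetz", "lovins"]
MODELS = [
    "bb2", "bm25", "dfiz", "dfree", "dirichletlm", "dlh", "dph",
    "hiemstralm", "ifb2", "inb2", "inl2", "inexpb2", "jskls",
    "lemurtfidf", "lgd", "pl2", "tfidf",
]


def enumerate_systems():
    """Return one identifier per (stop list, stemmer, model) combination.

    The underscore-joined identifier is a hypothetical naming scheme
    chosen for this sketch, not the dataset's own convention.
    """
    return [
        f"{stop}_{stem}_{model}"
        for stop, stem, model in product(STOP_LISTS, STEMMERS, MODELS)
    ]


systems = enumerate_systems()
print(len(systems))  # 6 * 6 * 17 = 612
```

Because `itertools.product` varies the last factor fastest, the identifiers come out grouped by stop list, then by stemmer, then by model, which is convenient when slicing the grid along one component.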
Created:
2018-08-01



