BeIR/arguana
Dataset Card for BEIR Benchmark
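Dataset Description
Dataset Summary
BEIR is a heterogeneous benchmark built from 18 diverse datasets that cover 9 information retrieval tasks:
- Fact-checking: FEVER, Climate-FEVER, SciFact
- Question answering: NQ, HotpotQA, FiQA-2018
- Bio-medical IR: TREC-COVID, BioASQ, NFCorpus
- News retrieval: TREC-NEWS, Robust04
- Argument retrieval: Touche-2020, ArguAna
- Duplicate question retrieval: Quora, CqaDupstack
- Citation prediction: SCIDOCS
- Tweet retrieval: Signal-1M
- Entity retrieval: DBPedia
All of these datasets have been preprocessed and are ready to use in experiments.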
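Supported Tasks and Leaderboards
The dataset supports a leaderboard that evaluates models on task-specific metrics such as F1 or EM, as well as on their ability to retrieve supporting information from Wikipedia.
Languages
All tasks are in English (en).
Dataset Structure
Every BEIR dataset must contain a corpus, queries, and qrels (a relevance judgments file). They must use the following formats: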
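- corpus file: a `.jsonl` file (jsonlines) containing a list of dictionaries, each with three fields: `_id` (a unique document identifier), `title` (the document title, optional), and `text` (a document paragraph or passage). For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- queries file: a `.jsonl` file (jsonlines) containing a list of dictionaries, each with two fields: `_id` (a unique query identifier) and `text` (the query text). For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- qrels file: a `.tsv` file (tab-separated) with three columns, `query-id`, `corpus-id`, and `score`, in that order. The first row is a header. For example: `q1 doc1 1`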
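The three file formats above can be read with the Python standard library alone. The following is a minimal sketch, not the loader shipped with BEIR: the helper names `load_jsonl` and `load_qrels` and the file names `corpus.jsonl`, `queries.jsonl`, and `test.tsv` are assumptions made for illustration. It writes a tiny sample dataset to a temporary directory so that it runs on its own.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            records[row["_id"]] = row
    return records


def load_qrels(path):
    """Read a tab-separated qrels file into {query_id: {doc_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row: query-id, corpus-id, score
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels


# Hypothetical file names, written to a temporary directory for the demo.
tmp = Path(tempfile.mkdtemp())
(tmp / "corpus.jsonl").write_text(
    json.dumps({"_id": "doc1", "title": "Albert Einstein",
                "text": "Albert Einstein was a German-born...."}) + "\n",
    encoding="utf-8",
)
(tmp / "queries.jsonl").write_text(
    json.dumps({"_id": "q1",
                "text": "Who developed the mass-energy equivalence formula?"}) + "\n",
    encoding="utf-8",
)
(tmp / "test.tsv").write_text(
    "query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n", encoding="utf-8"
)

corpus = load_jsonl(tmp / "corpus.jsonl")
queries = load_jsonl(tmp / "queries.jsonl")
qrels = load_qrels(tmp / "test.tsv")
print(corpus["doc1"]["title"], queries["q1"]["text"], qrels)
```

Keying the corpus and queries by `_id` makes it cheap to look up the documents that a qrels row refers to.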
Data Instances
A high-level example of a BEIR dataset:
```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula E = mc2, which has been dubbed the world's most famous equation. He received the 1921 Nobel Prize in Physics for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect, a pivotal step in the development of quantum theory."
    },
    "doc2": {
        "title": "",  # Keep the title as an empty string if it is not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```
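To show how the three dictionaries fit together, the sketch below ranks documents for each query and scores the rankings against the qrels. Both parts are assumptions for illustration: the token-overlap scorer is a toy stand-in for a real retriever, and recall@k is just one simple metric (this card does not prescribe it). The document texts are shortened versions of the example above.

```python
import re

corpus = {
    "doc1": {"title": "Albert Einstein",
             "text": "He is best known for his mass-energy equivalence formula E = mc2."},
    "doc2": {"title": "",
             "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat."},
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank(query, corpus):
    """Toy scorer: order documents by the number of tokens shared with the query."""
    q_tokens = tokenize(query)
    scores = {
        doc_id: len(q_tokens & tokenize(doc["title"] + " " + doc["text"]))
        for doc_id, doc in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(qrels, rankings, k):
    """Mean fraction of each query's relevant documents found in its top-k list."""
    per_query = []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, score in judged.items() if score > 0}
        if not relevant:
            continue  # skip queries without positive judgments
        hits = relevant & set(rankings.get(query_id, [])[:k])
        per_query.append(len(hits) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


rankings = {query_id: rank(text, corpus) for query_id, text in queries.items()}
print(rankings)
print(recall_at_k(qrels, rankings, k=1))
```

Swapping `rank` for a real retrieval model leaves the evaluation code unchanged, since it only depends on the ranked document IDs and the qrels.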
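Data Fields
Examples from all configurations have the following features:
Corpus
- corpus: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document ID.
  - `title`: a `string` feature denoting the title of the document.
  - `text`: a `string` feature denoting the text of the document.
Queries
- queries: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query ID.
  - `text`: a `string` feature denoting the text of the query.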
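Qrels
- qrels: a `dict` feature representing the query-document relevance judgments, made up of:
  - `query-id`: a `string` feature representing the query ID.
  - `corpus-id`: a `string` feature denoting the document ID.
  - `score`: an `int32` feature denoting the relevance judgment between the query and the document.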
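The field descriptions above can be turned into a quick consistency check before running experiments. The sketch below is an assumption-laden illustration, not part of BEIR: the function names are hypothetical, the qrels keys `query-id` and `corpus-id` follow the column names of the qrels file format, and `title` is required to be a string (possibly empty), following the convention in the data instance example.

```python
def check_string_fields(row, fields, label, problems):
    """Record a problem for every listed field that is missing or not a string."""
    for field in fields:
        if not isinstance(row.get(field), str):
            problems.append(f"{label}: field {field!r} is missing or not a string")


def validate_dataset(corpus_rows, query_rows, qrel_rows):
    """Return a list of problems; an empty list means the rows look consistent."""
    problems = []
    for i, row in enumerate(corpus_rows):
        check_string_fields(row, ("_id", "title", "text"), f"corpus[{i}]", problems)
    for i, row in enumerate(query_rows):
        check_string_fields(row, ("_id", "text"), f"queries[{i}]", problems)

    doc_ids = {row.get("_id") for row in corpus_rows}
    query_ids = {row.get("_id") for row in query_rows}
    for i, row in enumerate(qrel_rows):
        if row.get("query-id") not in query_ids:
            problems.append(f"qrels[{i}]: unknown query id {row.get('query-id')!r}")
        if row.get("corpus-id") not in doc_ids:
            problems.append(f"qrels[{i}]: unknown document id {row.get('corpus-id')!r}")
        score = row.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(score, int) or isinstance(score, bool):
            problems.append(f"qrels[{i}]: score must be an integer")
    return problems


corpus_rows = [{"_id": "doc1", "title": "Albert Einstein",
                "text": "Albert Einstein was a German-born...."}]
query_rows = [{"_id": "q1",
               "text": "Who developed the mass-energy equivalence formula?"}]
qrel_rows = [{"query-id": "q1", "corpus-id": "doc1", "score": 1}]

problems = validate_dataset(corpus_rows, query_rows, qrel_rows)
print(problems)
```

Returning a list of messages rather than raising on the first error makes it easier to see every inconsistency in a newly converted dataset at once.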
Data Splits
| Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
|---|---|---|---|---|---|---|---|---|
| MSMARCO | Homepage | msmarco | train<br>dev<br>test | 6,980 | 8.84M | 1.1 | Link | 444067daf65d982533ea17ebd59501e4 |
| TREC-COVID | Homepage | trec-covid | test | 50 | 171K | 493.5 | Link | ce62140cb23feb9becf6270d0d1fe6d1 |
| NFCorpus | Homepage | nfcorpus | train<br>dev<br>test | 323 | 3.6K | 38.2 | Link | a89dba18a62ef92f7d323ec890a0d38d |
| BioASQ | Homepage | bioasq | train<br>test | 500 | 14.91M | 8.05 | No | How to Reproduce? |
| NQ | Homepage | nq | train<br>test | 3,452 | 2.68M | 1.2 | Link | d4d3d2e48787a744b6f6e691ff534307 |
| HotpotQA | Homepage | hotpotqa | train<br>dev<br>test | 7,405 | 5.23M | 2.0 | Link | f412724f78b0d91183a0e86805e16114 |
| FiQA-2018 | Homepage | fiqa | train<br>dev<br>test | 648 | 57K | 2.6 | Link | 17918ed23cd04fb15047f73e6c3bd9d9 |
| Signal-1M(RT) | Homepage | signal1m | test | 97 | 2.86M | 19.6 | No | How to Reproduce? |
| TREC-NEWS | Homepage | trec-news | test | 57 | 595K | 19.6 | No | How to Reproduce? |
| ArguAna | Homepage | arguana | test | 1,406 | 8.67K | 1.0 | Link | 8ad3e3c2a5867cdced806d6503f29b99 |
| Touche-2020 | Homepage | webis-touche2020 | test | 49 | 382K | 19.0 | Link | 46f650ba5a527fc69e0a6521c5a23563 |
| CQADupstack | Homepage | cqadupstack | test | 13,145 | 457K | 1.4 | Link | 4e41456d7df8ee7760a7f866133bda78 |
| Quora | Homepage | quora | dev<br>test | 10,000 | 523K | 1.6 | Link | 18fb154900ba42a600f84b839c173167 |
| DBPedia | Homepage | dbpedia-entity | dev<br>test | 400 | 4.63M | 38.2 | Link | c2a39eb420a3164af735795df012ac2c |
| SCIDOCS | Homepage | scidocs | test | 1,000 | 25K | 4.9 | Link | 38121350fc3a4d2f48850f6aff52e4a9 |
| FEVER | Homepage | fever | train<br>dev<br>test | 6,666 | 5.42M | 1.2 | Link | 5a818580227bfb4b35bb6fa46d9b6c03 |
| Climate-FEVER | Homepage | climate-fever | test | 1,535 | 5.42M | 3.0 | Link | 8b66f0a9126c521bae2bde127b4dc99d |
| SciFact | Homepage | scifact | train<br>test | 300 | 5K | 1.1 | Link | 5f7d1de60b170fc8027bb7898e2efca1 |
| Robust04 | Homepage | robust04 | test | 249 | 528K | 69.9 | No | How to Reproduce? |
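The md5 column lets you confirm that a downloaded archive is intact. A minimal sketch of such a check follows, using only `hashlib` from the standard library. The helper names `md5_of_file` and `verify_md5` are hypothetical, and the demo hashes a small temporary file rather than a real BEIR archive, so the expected digest here is computed locally instead of taken from the table.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()


# Demo on a small temporary file standing in for a downloaded archive.
demo = Path(tempfile.mkdtemp()) / "demo.bin"
demo.write_bytes(b"hello")
expected = hashlib.md5(b"hello").hexdigest()
print(verify_md5(demo, expected))
```

For a real download, pass the path of the archive and the matching value from the md5 column, for example the `arguana` row's `8ad3e3c2a5867cdced806d6503f29b99`.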
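Dataset Creation
Curation Rationale
[Needs More Information]
Source Data
Initial Data Collection and Normalization
[Needs More Information]
Who are the source language producers?
[Needs More Information]
Annotations
Annotation process
[Needs More Information]
Who are the annotators?
[Needs More Information]
Personal and Sensitive Information
[Needs More Information]
Considerations for Using the Data
Social Impact of Dataset
[Needs More Information]
Discussion of Biases
[Needs More Information]
Other Known Limitations
[Needs More Information]
Additional Information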




