BeIR/scifact
Source: Hugging Face · Updated: 2026-04-09 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/BeIR/scifact
Resource description:
---
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
paperswithcode_id: beir
pretty_name: BEIR Benchmark
task_categories:
- zero-shot-classification
- text-retrieval
task_ids:
- document-retrieval
- entity-linking-retrieval
- fact-checking-retrieval
tags:
- biomedical-information-retrieval
- citation-prediction-retrieval
- passage-retrieval
- news-retrieval
- argument-retrieval
- zero-shot-information-retrieval
- tweet-retrieval
- question-answering-retrieval
- duplication-question-retrieval
- zero-shot-retrieval
configs:
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus/corpus-*
- config_name: queries
  data_files:
  - split: queries
    path: queries/queries-*
dataset_info:
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_bytes: 4469916
    num_examples: 5183
  download_size: 4469916
  dataset_size: 4469916
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_bytes: 64982
    num_examples: 1109
  download_size: 64982
  dataset_size: 64982
---
# Dataset Card for BEIR Benchmark
> **`scifact` is one of the datasets from the Fact Checking task within BEIR, measuring scientific article retrieval for a given scientific claim.**
## Dataset Description
- **Homepage:** https://beir.ai
- **Repository:** https://beir.ai
- **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ
- **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns
- **Point of Contact:** nandan.thakur@uwaterloo.ca
### Dataset Summary
BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks.
- Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact)
- Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/)
- Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/)
- News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html)
- Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](http://argumentation.bplaced.net/arguana/data)
- Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CQADupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/)
- Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs)
- Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html)
- Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/)
### Languages
All tasks are in English (`en`).
## Dataset Structure
This dataset uses the standard BEIR retrieval layout and includes:
- `corpus`: one row per document with `_id`, `title`, `text`
- `queries`: one row per query with `_id`, `title`, `text`
### Data Fields
- `_id` (`string`): unique identifier
- `title` (`string`): title (empty string when unavailable)
- `text` (`string`): document/query text
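As a rough sketch, the two configs declared in the card metadata (`corpus` and `queries`, each with a single split of the same name) could be loaded with the Hugging Face `datasets` library and reshaped into BEIR-style dictionaries. The helper names `corpus_to_dict`, `queries_to_dict`, and `load_scifact` are illustrative and not part of any library; `load_scifact` assumes that `datasets` is installed and that the Hub is reachable.

```python
from typing import Dict, Iterable, Mapping


def corpus_to_dict(rows: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, str]]:
    """Map each document `_id` to its `title` and `text`."""
    return {row["_id"]: {"title": row["title"], "text": row["text"]} for row in rows}


def queries_to_dict(rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each query `_id` to its `text`."""
    return {row["_id"]: row["text"] for row in rows}


def load_scifact(repo_id: str = "BeIR/scifact"):
    """Download both configs from the Hub (needs `datasets` and network access)."""
    # Imported lazily so the two helpers above work without the library.
    from datasets import load_dataset

    corpus_rows = load_dataset(repo_id, "corpus", split="corpus")
    query_rows = load_dataset(repo_id, "queries", split="queries")
    return corpus_to_dict(corpus_rows), queries_to_dict(query_rows)
```

Calling `corpus, queries = load_scifact()` should return one dictionary entry per row of each config. The two helpers accept any iterable of row mappings, so they can also be applied to rows read from local files.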
### Data Instances
A high-level example of any BEIR dataset:
```python
# Each document ID maps to a title and a text body.
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

# Each query ID maps to the query text.
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Relevance judgments: query ID -> {document ID: relevance score}.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```
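Because `qrels` refers to queries and documents only by ID, a quick consistency check can catch mismatched files before evaluation. The following is a minimal sketch on toy data; the function name `find_dangling_qrels` is hypothetical and not part of BEIR.

```python
from typing import Dict, List


def find_dangling_qrels(
    corpus: Dict[str, dict],
    queries: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
) -> List[str]:
    """List qrels entries that point at unknown query or document IDs."""
    problems = []
    for query_id, judged_docs in qrels.items():
        if query_id not in queries:
            problems.append(f"unknown query: {query_id}")
        for doc_id in judged_docs:
            if doc_id not in corpus:
                problems.append(f"unknown document: {doc_id} (query {query_id})")
    return problems


# Toy data: "q2" and "doc3" are deliberately missing from queries and corpus.
corpus = {
    "doc1": {"title": "Albert Einstein", "text": "..."},
    "doc2": {"title": "", "text": "..."},
}
queries = {"q1": "Who developed the mass-energy equivalence formula?"}
qrels = {"q1": {"doc1": 1}, "q2": {"doc3": 1}}

print(find_dangling_qrels(corpus, queries, qrels))
# ['unknown query: q2', 'unknown document: doc3 (query q2)']
```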
### SciFact Data Splits
| Subset | Split | Rows |
| --- | --- | ---: |
| corpus | corpus | 5,183 |
| queries | queries | 1,109 |
### BEIR Direct Download
You can also download BEIR datasets directly (without loading through Hugging Face datasets) using the links below.
| Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | --- |
| MSMARCO | [Homepage](https://microsoft.github.io/msmarco/) | `msmarco` | `train` `dev` `test` | 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | `444067daf65d982533ea17ebd59501e4` |
| TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html) | `trec-covid` | `test` | 50 | 171K | 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | `ce62140cb23feb9becf6270d0d1fe6d1` |
| NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | `nfcorpus` | `train` `dev` `test` | 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | `a89dba18a62ef92f7d323ec890a0d38d` |
| BioASQ | [Homepage](http://bioasq.org) | `bioasq` | `train` `test` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) |
| NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | `nq` | `train` `test` | 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | `d4d3d2e48787a744b6f6e691ff534307` |
| HotpotQA | [Homepage](https://hotpotqa.github.io) | `hotpotqa` | `train` `dev` `test` | 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | `f412724f78b0d91183a0e86805e16114` |
| FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | `fiqa` | `train` `dev` `test` | 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | `17918ed23cd04fb15047f73e6c3bd9d9` |
| Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html) | `signal1m` | `test` | 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) |
| TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | `trec-news` | `test` | 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) |
| ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | `arguana` | `test` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | `8ad3e3c2a5867cdced806d6503f29b99` |
| Touche-2020 | [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | `webis-touche2020` | `test` | 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | `46f650ba5a527fc69e0a6521c5a23563` |
| CQADupstack | [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | `cqadupstack` | `test` | 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | `4e41456d7df8ee7760a7f866133bda78` |
| Quora | [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | `quora` | `dev` `test` | 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | `18fb154900ba42a600f84b839c173167` |
| DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | `dbpedia-entity` | `dev` `test` | 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | `c2a39eb420a3164af735795df012ac2c` |
| SCIDOCS | [Homepage](https://allenai.org/data/scidocs) | `scidocs` | `test` | 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | `38121350fc3a4d2f48850f6aff52e4a9` |
| FEVER | [Homepage](http://fever.ai) | `fever` | `train` `dev` `test` | 6,666 | 5.42M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | `5a818580227bfb4b35bb6fa46d9b6c03` |
| Climate-FEVER | [Homepage](http://climatefever.ai) | `climate-fever` | `test` | 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | `8b66f0a9126c521bae2bde127b4dc99d` |
| SciFact | [Homepage](https://github.com/allenai/scifact) | `scifact` | `train` `test` | 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | `5f7d1de60b170fc8027bb7898e2efca1` |
| Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | `robust04` | `test` | 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) |
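The `md5` column can be used to check that a downloaded archive is intact. Below is a minimal sketch using only the Python standard library; the URL and checksum are copied from the SciFact row of the table, while the function names `md5_of_file` and `verify_download` are illustrative.

```python
import hashlib
from pathlib import Path

# Values copied from the SciFact row of the table above.
SCIFACT_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
SCIFACT_MD5 = "5f7d1de60b170fc8027bb7898e2efca1"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_md5: str) -> bool:
    """Return True when the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

After saving the archive from `SCIFACT_URL` to a local file such as `scifact.zip` (for example with `urllib.request.urlretrieve`), `verify_download(Path("scifact.zip"), SCIFACT_MD5)` should return `True` if the file matches the published checksum.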
## Citation Information
```bibtex
@inproceedings{
thakur2021beir,
title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
```
Provider:
BeIR
Original information summary
Dataset overview
Dataset name
- BEIR Benchmark
Dataset attributes
- Language: English (`en`)
- License: CC-BY-SA-4.0
- Multilinguality: monolingual
Dataset size
- MSMARCO: 1M<n<10M
- TREC-COVID: 100K<n<1M
- NFCorpus: 1K<n<10K
- NQ: 1M<n<10M
- HotpotQA: 1M<n<10M
- FiQA: 10K<n<100K
- ArguAna: 1K<n<10K
- Touche-2020: 100K<n<1M
- CQADupstack: 100K<n<1M
- Quora: 100K<n<1M
- DBpedia: 1M<n<10M
- SCIDOCS: 10K<n<100K
- FEVER: 1M<n<10M
- Climate-FEVER: 1M<n<10M
- SciFact: 1K<n<10K
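Supported tasks
- Task categories:
  - Text retrieval
  - Zero-shot retrieval
  - Information retrieval
  - Zero-shot information retrieval
- Specific tasks:
  - Passage retrieval
  - Entity-linking retrieval
  - Fact-checking retrieval
  - Tweet retrieval
  - Citation-prediction retrieval
  - Duplicate-question retrieval
  - Argument retrieval
  - News retrieval
  - Biomedical information retrieval
  - Question-answering retrieval
Dataset structure
- Data instances:
  - Corpus: a `.jsonl` file containing the document ID, title, and text.
  - Queries: a `.jsonl` file containing the query ID and text.
  - qrels: a `.tsv` file containing the query ID, document ID, and score.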
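The file layout summarized above (JSON Lines for the corpus and queries, tab-separated values for the qrels) can be parsed with the standard library alone. This is a sketch under two assumptions that the summary does not state: that JSON records carry the ID in an `_id` field, as in the data fields of this card, and that the qrels file starts with a header row followed by query ID, document ID, and score columns. The function names and the header labels in the toy input are illustrative.

```python
import csv
import io
import json
from typing import Dict, Iterable


def read_jsonl(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse JSON Lines records and index them by their `_id` field."""
    records = {}
    for line in lines:
        if line.strip():  # ignore blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse tab-separated qrels rows of (query ID, document ID, score)."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # assumption: the first row is a header
    qrels: Dict[str, Dict[str, int]] = {}
    for row in reader:
        if not row:  # ignore blank lines
            continue
        query_id, doc_id, score = row
        qrels.setdefault(query_id, {})[doc_id] = int(score)
    return qrels


# In-memory stand-ins for corpus.jsonl and a qrels .tsv file.
corpus_file = io.StringIO('{"_id": "doc1", "title": "Albert Einstein", "text": "..."}\n')
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n")

corpus = read_jsonl(corpus_file)
qrels = read_qrels(qrels_file)
print(corpus["doc1"]["title"], qrels)
```

Both functions accept any iterable of lines, so an open file handle can be passed in place of the `io.StringIO` objects.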
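Dataset creation
- License information: CC-BY-SA-4.0
- Citation information: use the citation format given in the README file.
- Contributors: thanks to @Nthakur20 for adding this dataset.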
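Collected summary
Dataset introduction
Construction
In scientific fact-checking, the SciFact dataset is one of the components of the BEIR benchmark. It originates from an open-source project of the Allen Institute for AI that systematically collected and organized pairs of claims and evidence from the scientific literature. Researchers extracted specific claims from published scientific papers and linked them to passages that support or refute them, producing a structured collection of 5,183 documents and 1,109 queries. The construction process was designed to keep the sources authoritative and the annotations consistent, giving information retrieval models a high-quality basis for evaluation.
Usage
Researchers using SciFact typically follow the standard information retrieval evaluation protocol. The dataset is organized in the unified BEIR format, with a separate corpus, query set, and relevance-annotation file. A user first loads the corpus and queries, runs a retrieval model to produce a ranked list of candidate documents for each query, and then computes metrics such as NDCG@10, MAP, or Recall against the annotated relevant documents to quantify model performance. The dataset supports zero-shot retrieval evaluation and is commonly used to test how well pretrained language models or dense retrieval systems generalize without domain-specific fine-tuning.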
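The evaluation step described above can be illustrated with a small pure-Python sketch of two of the named metrics, nDCG@k and Recall@k, computed for a single query. It assumes retrieval output in the form `{document ID: score}` and relevance labels in the form `{document ID: relevance}`, and it uses linear gain for nDCG, which is one common convention; evaluation toolkits may differ in such details, so the numbers are not guaranteed to match any particular library. The function names are illustrative.

```python
import math
from typing import Dict


def ndcg_at_k(scores: Dict[str, float], relevant: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, with gain rel / log2(rank + 1) for ranks starting at 1."""
    ranking = sorted(scores, key=scores.get, reverse=True)[:k]
    dcg = sum(relevant.get(doc_id, 0) / math.log2(i + 2) for i, doc_id in enumerate(ranking))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(scores: Dict[str, float], relevant: Dict[str, int], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    ranking = sorted(scores, key=scores.get, reverse=True)[:k]
    positives = {doc_id for doc_id, rel in relevant.items() if rel > 0}
    if not positives:
        return 0.0
    return len(positives.intersection(ranking)) / len(positives)


# Toy retrieval output and relevance judgments for two queries.
results = {
    "q1": {"doc1": 0.9, "doc2": 0.2},  # relevant document ranked first
    "q2": {"doc1": 0.8, "doc2": 0.5},  # relevant document ranked second
}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

mean_ndcg = sum(ndcg_at_k(results[q], qrels[q]) for q in qrels) / len(qrels)
mean_recall = sum(recall_at_k(results[q], qrels[q]) for q in qrels) / len(qrels)
print(f"nDCG@10 = {mean_ndcg:.4f}, Recall@10 = {mean_recall:.4f}")
```

Averaging the per-query values over all judged queries gives the dataset-level score that is usually reported.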
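Background and challenges
Background
In information retrieval, scientific fact-checking places very high demands on model precision and reliability. SciFact, a component of the BEIR benchmark, was created by the Allen Institute for AI in 2020 to evaluate how well models retrieve evidence from the scientific literature that supports or refutes a given scientific claim. The dataset focuses on academic papers in the biomedical domain. Its core research question is how to locate, efficiently and accurately, the authoritative passages relevant to a scientific claim, in order to advance automated fact-checking systems and improve both the credibility of academic information and the efficiency of research.
Current challenges
The domain challenge that SciFact addresses is the complexity of scientific fact-checking: handling academic text that is dense with specialist terminology and whose logical relations are often implicit, and distinguishing evidence that supports a claim from evidence that refutes it. During construction, researchers had to manually annotate precise evidence passages across a large body of scientific literature, keep every claim rigorously matched to its evidence, and balance dataset size against quality so that the result supports robust evaluation of retrieval models in zero-shot settings.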
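Common scenarios
Typical use case
In scientific literature retrieval, the typical use of SciFact, as part of the BEIR benchmark, is zero-shot scientific fact-checking. By providing the correspondence between scientific claims and related research papers, the dataset lets researchers evaluate whether a retrieval model that has not been trained on this domain can locate, within a large scientific collection, the evidence that supports or refutes a specific claim. This setup mirrors the need to verify scientific claims quickly in real research settings and provides a rigorous test of model generalization.
Academic problems addressed
SciFact helps address the difficulty of evaluating generalization in information retrieval research. Traditional retrieval models often perform well on a specific dataset but transfer poorly to new domains. By framing scientific fact-checking as a retrieval task, the dataset encourages the community to develop methods that understand scientific text across domains and reason about the logical relation between a claim and its evidence. It has contributed to progress in zero-shot retrieval, provided an empirical basis for building more robust and general retrieval systems, and influenced how retrieval models are evaluated.
Practical applications
In practice, SciFact serves as a core training and evaluation resource for automated scientific fact-checking systems and academic knowledge services. In academic publishing, for example, such a system can help editors quickly verify the accuracy of a paper's citations; on research information platforms, it can help scholars efficiently retrieve literature that supports their hypotheses. Tools built on the dataset can also be used in education to train students to assess scientific information critically, contributing to research integrity and scientific literacy.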
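Recent research on the dataset
Latest research directions
In scientific fact-checking, SciFact, as a component of the BEIR benchmark, continues to drive research on information retrieval models. Current work focuses on improving zero-shot retrieval, especially the precise matching of scientific claims to evidence in the literature. With large language models now widely applied to cross-domain tasks, the dataset is used to evaluate generalization to unseen domains and has encouraged research on adapting retrieval models to specialist settings such as biomedicine and climate science. The deployment of AI for academic integrity and misinformation detection further underlines the value of SciFact for verifying scientific claims. Its main contribution is a standardized evaluation framework for building reliable, interpretable automated fact-checking systems, which supports the transparency and credibility of scientific communication.
The content above was collected and summarized by 遇见数据集.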



