
BeIR/scidocs-qrels

Hugging Face | Updated 2022-10-23 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/BeIR/scidocs-qrels
Resource overview:
---
annotations_creators: []
language_creators: []
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
paperswithcode_id: beir
pretty_name: BEIR Benchmark
size_categories:
  msmarco:
  - 1M<n<10M
  trec-covid:
  - 100k<n<1M
  nfcorpus:
  - 1K<n<10K
  nq:
  - 1M<n<10M
  hotpotqa:
  - 1M<n<10M
  fiqa:
  - 10K<n<100K
  arguana:
  - 1K<n<10K
  touche-2020:
  - 100K<n<1M
  cqadupstack:
  - 100K<n<1M
  quora:
  - 100K<n<1M
  dbpedia:
  - 1M<n<10M
  scidocs:
  - 10K<n<100K
  fever:
  - 1M<n<10M
  climate-fever:
  - 1M<n<10M
  scifact:
  - 1K<n<10K
source_datasets: []
task_categories:
- text-retrieval
- zero-shot-retrieval
- information-retrieval
- zero-shot-information-retrieval
task_ids:
- passage-retrieval
- entity-linking-retrieval
- fact-checking-retrieval
- tweet-retrieval
- citation-prediction-retrieval
- duplication-question-retrieval
- argument-retrieval
- news-retrieval
- biomedical-information-retrieval
- question-answering-retrieval
---

# Dataset Card for BEIR Benchmark

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/UKPLab/beir
- **Repository:** https://github.com/UKPLab/beir
- **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ
- **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns
- **Point of Contact:** nandan.thakur@uwaterloo.ca

### Dataset Summary

BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

- Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact)
- Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/)
- Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/)
- News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html)
- Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](http://argumentation.bplaced.net/arguana/data)
- Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CqaDupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/)
- Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs)
- Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html)
- Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/)

All these datasets have been preprocessed and can be used for your experiments.

### Supported Tasks and Leaderboards

The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia.
The current best performing models can be found [here](https://eval.ai/web/challenges/challenge-page/689/leaderboard/).

### Languages

All tasks are in English (`en`).

## Dataset Structure

All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format:

- `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with the unique document identifier, `title` with the document title (optional) and `text` with the document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with the unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the 1st row as a header. For example (fields separated by tabs): `q1 doc1 1`

### Data Instances

A high-level example of any BEIR dataset:

```python
# Documents keyed by document id; each holds a title and a passage text.
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

# Query text keyed by query id.
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Relevance judgments: query id -> {document id: relevance score}.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```

### Data Fields

Examples from all configurations have the following features:

### Corpus

- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature, denoting the title of the document.
  - `text`: a `string` feature, denoting the text of the document.

### Queries

- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature, denoting the text of the query.

### Qrels

- `qrels`: a `dict` feature representing the query-document relevance judgements, made up of:
  - `query-id`: a `string` feature representing the query id.
  - `corpus-id`: a `string` feature, denoting the document id.
  - `score`: an `int32` feature, denoting the relevance judgement between query and document.
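The file layout described under Dataset Structure can be read into the dictionary shapes shown under Data Instances with the standard library alone. The following is a minimal sketch, not BEIR's own loader: the helper names `load_jsonl` and `load_qrels` are hypothetical, the in-memory strings stand in for real files, and the header names `query-id`, `corpus-id`, `score` are assumed to match the column names given above.

```python
import csv
import io
import json


def load_jsonl(stream):
    """Return {_id: record} for every non-empty JSON line in the stream."""
    records = {}
    for line in stream:
        line = line.strip()
        if line:
            row = json.loads(line)
            records[row["_id"]] = row
    return records


def load_qrels(stream):
    """Parse tab-separated qrels (with header row) into {query_id: {doc_id: score}}."""
    reader = csv.DictReader(stream, delimiter="\t")
    qrels = {}
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# In-memory stand-ins for a corpus .jsonl, a queries .jsonl and a qrels .tsv file.
corpus_file = io.StringIO(
    '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
)
queries_file = io.StringIO(
    '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
)
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n")

# Reshape into the dictionaries used in the Data Instances example.
corpus = {
    doc_id: {"title": doc.get("title", ""), "text": doc["text"]}
    for doc_id, doc in load_jsonl(corpus_file).items()
}
queries = {qid: query["text"] for qid, query in load_jsonl(queries_file).items()}
qrels = load_qrels(qrels_file)
print(qrels)
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`.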
### Data Splits

| Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
| -------- | ----- | --------- | --------- | ----------- | --------- | --------- | :----------: | :------: |
| MSMARCO | [Homepage](https://microsoft.github.io/msmarco/) | ``msmarco`` | ``train``<br>``dev``<br>``test`` | 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | ``444067daf65d982533ea17ebd59501e4`` |
| TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html) | ``trec-covid`` | ``test`` | 50 | 171K | 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | ``ce62140cb23feb9becf6270d0d1fe6d1`` |
| NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus`` | ``train``<br>``dev``<br>``test`` | 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | ``a89dba18a62ef92f7d323ec890a0d38d`` |
| BioASQ | [Homepage](http://bioasq.org) | ``bioasq`` | ``train``<br>``test`` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) |
| NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | ``nq`` | ``train``<br>``test`` | 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | ``d4d3d2e48787a744b6f6e691ff534307`` |
| HotpotQA | [Homepage](https://hotpotqa.github.io) | ``hotpotqa`` | ``train``<br>``dev``<br>``test`` | 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | ``f412724f78b0d91183a0e86805e16114`` |
| FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | ``fiqa`` | ``train``<br>``dev``<br>``test`` | 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | ``17918ed23cd04fb15047f73e6c3bd9d9`` |
| Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html) | ``signal1m`` | ``test`` | 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) |
| TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | ``trec-news`` | ``test`` | 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) |
| ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | ``arguana`` | ``test`` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | ``8ad3e3c2a5867cdced806d6503f29b99`` |
| Touche-2020 | [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020`` | ``test`` | 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | ``46f650ba5a527fc69e0a6521c5a23563`` |
| CQADupstack | [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack`` | ``test`` | 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | ``4e41456d7df8ee7760a7f866133bda78`` |
| Quora | [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora`` | ``dev``<br>``test`` | 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | ``18fb154900ba42a600f84b839c173167`` |
| DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity`` | ``dev``<br>``test`` | 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | ``c2a39eb420a3164af735795df012ac2c`` |
| SCIDOCS | [Homepage](https://allenai.org/data/scidocs) | ``scidocs`` | ``test`` | 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | ``38121350fc3a4d2f48850f6aff52e4a9`` |
| FEVER | [Homepage](http://fever.ai) | ``fever`` | ``train``<br>``dev``<br>``test`` | 6,666 | 5.42M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | ``5a818580227bfb4b35bb6fa46d9b6c03`` |
| Climate-FEVER | [Homepage](http://climatefever.ai) | ``climate-fever`` | ``test`` | 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | ``8b66f0a9126c521bae2bde127b4dc99d`` |
| SciFact | [Homepage](https://github.com/allenai/scifact) | ``scifact`` | ``train``<br>``test`` | 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | ``5f7d1de60b170fc8027bb7898e2efca1`` |
| Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | ``robust04`` | ``test`` | 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Cite as:

```
@inproceedings{
    thakur2021beir,
    title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
    author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
    year={2021},
    url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
```

### Contributions

Thanks to [@Nthakur20](https://github.com/Nthakur20) for adding this dataset.
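The md5 column of the Data Splits table can be used to check that a downloaded archive arrived intact. The sketch below is one way to do that with `hashlib`; the helper names `md5_of_file` and `matches_md5` are hypothetical, and the demonstration runs on a throwaway temporary file rather than on a real BEIR archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Return True when the file's md5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a temporary file (a stand-in for a downloaded .zip).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = matches_md5(demo_path, hashlib.md5(b"hello").hexdigest())
demo_bad = matches_md5(demo_path, "0" * 32)
os.remove(demo_path)
print(demo_ok, demo_bad)
```

For the SCIDOCS row, and assuming the archive was saved locally as `scidocs.zip`, the call would be `matches_md5("scidocs.zip", "38121350fc3a4d2f48850f6aff52e4a9")`.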
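Provider:
BeIR
Summary of Original Information

BEIR Benchmark Dataset Overview

Dataset Description

Dataset Summary

BEIR is a heterogeneous benchmark built from 18 diverse datasets covering 9 information retrieval tasks: fact-checking, question answering, biomedical information retrieval, news retrieval, argument retrieval, duplicate question retrieval, citation prediction, tweet retrieval, and entity retrieval.

Supported Tasks and Leaderboards

The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia. The current best-performing models are listed on the leaderboard linked from the dataset card.

Languages

All tasks are in English.

Dataset Structure

Data Instances

A BEIR dataset has three main parts: corpus, queries, and qrels. The corpus file contains document titles and texts, the queries file contains query texts, and the qrels file contains relevance scores between queries and documents.

Data Fields

  • Corpus: document ID, title, and text.
  • Queries: query ID and text.
  • Qrels: query ID, document ID, and relevance score.

Data Splits

The available splits (training, development, and test sets) differ by task and dataset. Corpus size and the number of relevance judgments also vary from dataset to dataset.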
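How much the relevance judgments vary can be summarized per dataset. The sketch below assumes that the "Rel D/Q" column of the Data Splits table denotes the average number of relevant documents per query, and computes that statistic from a qrels dictionary; the function name `avg_relevant_per_query` and the toy data are hypothetical.

```python
def avg_relevant_per_query(qrels):
    """Average number of documents judged relevant (score > 0) per query."""
    if not qrels:
        return 0.0
    total_relevant = sum(
        sum(1 for score in docs.values() if score > 0) for docs in qrels.values()
    )
    return total_relevant / len(qrels)


# Toy qrels: q1 has one relevant document, q2 has two (doc4 is judged non-relevant).
toy_qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1, "doc3": 1, "doc4": 0},
}
rel_per_query = avg_relevant_per_query(toy_qrels)
print(rel_per_query)  # 1.5
```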

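Dataset Creation

Source Data

  • Initial data collection and normalization: insufficient information.
  • Source language producers: insufficient information.

Annotations

  • Annotation process: insufficient information.
  • Annotators: insufficient information.

Personal and Sensitive Information

  • Personal and sensitive information: insufficient information.

Considerations for Using the Data

Social Impact of Dataset

  • Social impact: insufficient information.

Discussion of Biases

  • Biases: insufficient information.

Other Known Limitations

  • Other limitations: insufficient information.

Additional Information

Dataset Curators

  • Curators: insufficient information.

Licensing Information

  • License: CC-BY-SA-4.0.

Citation Information

  • Citation: use the citation entry provided in the dataset card.

Contributions

  • Contributors: thanks to @Nthakur20 for adding this dataset.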
Collected Summary
Dataset Introduction
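Construction
In information retrieval, a comprehensive and diverse evaluation benchmark is essential for driving model development. The BEIR benchmark integrates 18 heterogeneous datasets covering nine task types, including fact-checking, question answering, and biomedical retrieval, and each dataset is preprocessed into a uniform format. During construction, the raw data is converted into a three-part structure of corpus, queries, and relevance judgments, stored in JSON Lines and TSV formats so that models can load and evaluate on it directly. This systematic integration preserves the domain characteristics of each dataset while providing a reliable platform for validating zero-shot retrieval performance across tasks.
Features
The defining features of the BEIR benchmark are its heterogeneity and broad coverage. It combines datasets from independent sources: SCIDOCS for citation prediction, TREC-COVID for biomedical retrieval, and FEVER for fact-checking, among others. Corpus sizes range from thousands to millions of documents, all data is in English, and every dataset follows a unified evaluation framework that supports standard information retrieval metrics. This design lets researchers test how well a model generalizes across varied real-world scenarios, and it is particularly suited to assessing the robustness and adaptability of zero-shot retrieval systems.
Usage
Researchers can load the preprocessed data with the tools provided in the BEIR GitHub repository; each subset contains corpus, query, and relevance judgment files. A typical workflow first loads the JSON Lines corpus and queries of the chosen dataset, then uses the TSV qrels file for relevance evaluation. A model implements retrieval and ranking against a standard interface, and the built-in evaluation scripts compute metrics such as NDCG@10 or MAP. The benchmark supports the zero-shot setting, in which a model is tested without any task-specific training, which gives comparative studies in information retrieval a convenient, standardized experimental environment.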
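To make the evaluation step concrete, the sketch below computes NDCG@10 directly from a qrels dictionary and a dictionary of retrieval scores. It is an illustrative stand-in, not BEIR's own evaluation code: the function names `dcg` and `ndcg_at_k`, the `results` layout (`{query_id: {doc_id: score}}`), and the toy data are assumptions, and the gain is taken to be the raw relevance score with a log2 rank discount.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at 0-based rank r is divided by log2(r + 2)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(qrels, results, k=10):
    """Mean NDCG@k over all queries that appear in qrels."""
    per_query = []
    for query_id, judged in qrels.items():
        # Rank the retrieved documents by descending model score and keep the top k.
        ranked = sorted(
            results.get(query_id, {}).items(), key=lambda item: item[1], reverse=True
        )[:k]
        gains = [judged.get(doc_id, 0) for doc_id, _ in ranked]
        # The ideal ranking places the highest-scored judged documents first.
        ideal_dcg = dcg(sorted(judged.values(), reverse=True)[:k])
        per_query.append(dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy example: q1 ranks its relevant document first, q2 ranks it second.
toy_qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}
toy_results = {
    "q1": {"doc1": 0.9, "doc2": 0.1},
    "q2": {"doc1": 0.8, "doc2": 0.3},
}
ndcg10 = ndcg_at_k(toy_qrels, toy_results, k=10)
print(round(ndcg10, 4))
```

Here q1 contributes a perfect score of 1 and q2 contributes 1 / log2(3), so the mean is their average. Queries with no retrieved documents score 0 rather than being skipped, which is a design choice of this sketch.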
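Background and Challenges
Background Overview
The BEIR benchmark was built in 2021 by Nandan Thakur and colleagues to provide a heterogeneous evaluation framework for zero-shot information retrieval models. It integrates tasks from 18 different data sources, spanning fact-checking, question answering, biomedical retrieval, and citation prediction for scientific literature, among other areas. Its core research question is how to improve the generalization of retrieval models to unseen tasks. By providing a unified data format and evaluation standard, it has become a widely used reference point for retrieval model development in both academia and industry.
Current Challenges
The challenges around BEIR fall into two groups. At the level of the research problem, it targets the central difficulty of zero-shot retrieval: how a model adapts to diverse task types and domain distributions, for example cross-domain retrieval ranging from biomedical literature to social media content. At the level of construction, the difficulty lies in integrating and standardizing heterogeneous data sources, which includes unifying different data formats, keeping quality assessment consistent, and preprocessing and annotating large corpora, all of which require careful engineering and rigorous validation.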
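Common Scenarios
Classic Use Cases
Within information retrieval, the BEIR/scidocs-qrels dataset is a component of the BEIR benchmark whose classic use is the citation prediction task on scientific literature. Together with the SCIDOCS queries and corpus, its relevance judgments provide a standardized setting for evaluating retrieval models on scholarly text. Researchers typically use it to test dense retrieval models, cross-encoders, and zero-shot retrieval systems (BEIR lists only a test split for SCIDOCS), measuring how accurately a model identifies relevant citations among scientific documents.
Practical Applications
In practice, this kind of data supports the development of academic search engines and scientific knowledge graphs. Retrieval models evaluated on it are aimed at serving researchers' literature queries and recommending relevant papers or technical reports. In a digital library, for example, such technology can help users locate references that support a particular argument; on a research collaboration platform, it can help scholars discover potentially related work across disciplines.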
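Derived and Related Work
Related work includes the benchmarking and refinement of neural retrieval models such as DPR, ANCE, and ColBERT. These studies used BEIR, including the SCIDOCS subset, to assess how well different model architectures retrieve scientific literature, and learned sparse retrieval methods such as SPLADE have been evaluated on the same benchmark. Zero-shot retrieval frameworks such as Contriever and InstructOR have likewise used it to test how well pretrained models transfer to unseen tasks.
The above content was collected and summarized by 遇见数据集.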