income/cqadupstack-wordpress-top-20-gen-queries

Name: income/cqadupstack-wordpress-top-20-gen-queries
Creator: income
Published: 2023-01-24 19:53:33
License: 暂无描述

Hugging Face2023-01-24 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/income/cqadupstack-wordpress-top-20-gen-queries

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: [] language_creators: [] language: - en license: - cc-by-sa-4.0 multilinguality: - monolingual paperswithcode_id: beir pretty_name: BEIR Benchmark size_categories: msmarco: - 1M<n<10M trec-covid: - 100k<n<1M nfcorpus: - 1K<n<10K nq: - 1M<n<10M hotpotqa: - 1M<n<10M fiqa: - 10K<n<100K arguana: - 1K<n<10K touche-2020: - 100K<n<1M cqadupstack: - 100K<n<1M quora: - 100K<n<1M dbpedia: - 1M<n<10M scidocs: - 10K<n<100K fever: - 1M<n<10M climate-fever: - 1M<n<10M scifact: - 1K<n<10K source_datasets: [] task_categories: - text-retrieval --- # NFCorpus: 20 generated queries (BEIR Benchmark) This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset. - DocT5query model used: [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1) - id (str): unique document id in NFCorpus in the BEIR benchmark (`corpus.jsonl`). - Questions generated: 20 - Code used for generation: [evaluate_anserini_docT5query_parallel.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/sparse/evaluate_anserini_docT5query_parallel.py) Below contains the old dataset card for the BEIR benchmark. # Dataset Card for BEIR Benchmark ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/UKPLab/beir - **Repository:** https://github.com/UKPLab/beir - **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ - **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns - **Point of Contact:** nandan.thakur@uwaterloo.ca ### Dataset Summary BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks: - Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact) - Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/) - Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) - News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html) - Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](tp://argumentation.bplaced.net/arguana/data) - Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CqaDupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) - Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs) - Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html) - Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/) All these datasets have been preprocessed and can be used for your experiments. ```python ``` ### Supported Tasks and Leaderboards The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia. The current best performing models can be found [here](https://eval.ai/web/challenges/challenge-page/689/leaderboard/). ### Languages All tasks are in English (`en`). ## Dataset Structure All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format: - `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields `_id` with unique document identifier, `title` with document title (optional) and `text` with document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}` - `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields `_id` with unique query identifier and `text` with query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}` - `qrels` file: a `.tsv` file (tab-seperated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score` in this order. Keep 1st row as header. For example: `q1 doc1 1` ### Data Instances A high level example of any beir dataset: ```python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \ one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \ its influence on the philosophy of science. He is best known to the general public for his massâ€“energy \ equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \ Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \ of the photoelectric effect', a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # Keep title an empty string if not present "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \ malted barley. The two main varieties are German WeiÃŸbier and Belgian witbier; other types include Lambic (made\ with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, } queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" } qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, } ``` ### Data Fields Examples from all configurations have the following features: ### Corpus - `corpus`: a `dict` feature representing the document title and passage text, made up of: - `_id`: a `string` feature representing the unique document id - `title`: a `string` feature, denoting the title of the document. - `text`: a `string` feature, denoting the text of the document. ### Queries - `queries`: a `dict` feature representing the query, made up of: - `_id`: a `string` feature representing the unique query id - `text`: a `string` feature, denoting the text of the query. ### Qrels - `qrels`: a `dict` feature representing the query document relevance judgements, made up of: - `_id`: a `string` feature representing the query id - `_id`: a `string` feature, denoting the document id. - `score`: a `int32` feature, denoting the relevance judgement between query and document. ### Data Splits | Dataset | Website| BEIR-Name | Type | Queries | Corpus | Rel D/Q | Down-load | md5 | | -------- | -----| ---------| --------- | ----------- | ---------| ---------| :----------: | :------:| | MSMARCO | [Homepage](https://microsoft.github.io/msmarco/)| ``msmarco`` | ``train`` ``dev`` ``test``| 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | ``444067daf65d982533ea17ebd59501e4`` | | TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| ``test``| 50| 171K| 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | ``ce62140cb23feb9becf6270d0d1fe6d1`` | | NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus`` | ``train`` ``dev`` ``test``| 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | ``a89dba18a62ef92f7d323ec890a0d38d`` | | BioASQ | [Homepage](http://bioasq.org) | ``bioasq``| ``train`` ``test`` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) | | NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | ``nq``| ``train`` ``test``| 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | ``d4d3d2e48787a744b6f6e691ff534307`` | | HotpotQA | [Homepage](https://hotpotqa.github.io) | ``hotpotqa``| ``train`` ``dev`` ``test``| 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | ``f412724f78b0d91183a0e86805e16114`` | | FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | ``fiqa`` | ``train`` ``dev`` ``test``| 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | ``17918ed23cd04fb15047f73e6c3bd9d9`` | | Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | ``test``| 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) | | TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | ``trec-news`` | ``test``| 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) | | ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | ``arguana``| ``test`` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | ``8ad3e3c2a5867cdced806d6503f29b99`` | | Touche-2020| [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| ``test``| 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | ``46f650ba5a527fc69e0a6521c5a23563`` | | CQADupstack| [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| ``test``| 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | ``4e41456d7df8ee7760a7f866133bda78`` | | Quora| [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| ``dev`` ``test``| 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | ``18fb154900ba42a600f84b839c173167`` | | DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| ``dev`` ``test``| 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | ``c2a39eb420a3164af735795df012ac2c`` | | SCIDOCS| [Homepage](https://allenai.org/data/scidocs) | ``scidocs``| ``test``| 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | ``38121350fc3a4d2f48850f6aff52e4a9`` | | FEVER | [Homepage](http://fever.ai) | ``fever``| ``train`` ``dev`` ``test``| 6,666 | 5.42M | 1.2| [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | ``5a818580227bfb4b35bb6fa46d9b6c03`` | | Climate-FEVER| [Homepage](http://climatefever.ai) | ``climate-fever``|``test``| 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | ``8b66f0a9126c521bae2bde127b4dc99d`` | | SciFact| [Homepage](https://github.com/allenai/scifact) | ``scifact``| ``train`` ``test``| 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | ``5f7d1de60b170fc8027bb7898e2efca1`` | | Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | ``robust04``| ``test``| 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) | ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information [Needs More Information] ### Citation Information Cite as: ``` @inproceedings{ thakur2021beir, title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models}, author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)}, year={2021}, url={https://openreview.net/forum?id=wCu6T5xFjeJ} } ``` ### Contributions Thanks to [@Nthakur20](https://github.com/Nthakur20) for adding this dataset.Top-20 generated queries for every passage in NFCorpus # Dataset Card for BEIR Benchmark ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/UKPLab/beir - **Repository:** https://github.com/UKPLab/beir - **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ - **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns - **Point of Contact:** nandan.thakur@uwaterloo.ca ### Dataset Summary BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks: - Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact) - Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/) - Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) - News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html) - Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](tp://argumentation.bplaced.net/arguana/data) - Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CqaDupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) - Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs) - Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html) - Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/) All these datasets have been preprocessed and can be used for your experiments. ```python ``` ### Supported Tasks and Leaderboards The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia. The current best performing models can be found [here](https://eval.ai/web/challenges/challenge-page/689/leaderboard/). ### Languages All tasks are in English (`en`). ## Dataset Structure All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format: - `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields `_id` with unique document identifier, `title` with document title (optional) and `text` with document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}` - `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields `_id` with unique query identifier and `text` with query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}` - `qrels` file: a `.tsv` file (tab-seperated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score` in this order. Keep 1st row as header. For example: `q1 doc1 1` ### Data Instances A high level example of any beir dataset: ```python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \ one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \ its influence on the philosophy of science. He is best known to the general public for his massâ€“energy \ equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \ Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \ of the photoelectric effect', a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # Keep title an empty string if not present "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \ malted barley. The two main varieties are German WeiÃŸbier and Belgian witbier; other types include Lambic (made\ with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, } queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" } qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, } ``` ### Data Fields Examples from all configurations have the following features: ### Corpus - `corpus`: a `dict` feature representing the document title and passage text, made up of: - `_id`: a `string` feature representing the unique document id - `title`: a `string` feature, denoting the title of the document. - `text`: a `string` feature, denoting the text of the document. ### Queries - `queries`: a `dict` feature representing the query, made up of: - `_id`: a `string` feature representing the unique query id - `text`: a `string` feature, denoting the text of the query. ### Qrels - `qrels`: a `dict` feature representing the query document relevance judgements, made up of: - `_id`: a `string` feature representing the query id - `_id`: a `string` feature, denoting the document id. - `score`: a `int32` feature, denoting the relevance judgement between query and document. ### Data Splits | Dataset | Website| BEIR-Name | Type | Queries | Corpus | Rel D/Q | Down-load | md5 | | -------- | -----| ---------| --------- | ----------- | ---------| ---------| :----------: | :------:| | MSMARCO | [Homepage](https://microsoft.github.io/msmarco/)| ``msmarco`` | ``train`` ``dev`` ``test``| 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | ``444067daf65d982533ea17ebd59501e4`` | | TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| ``test``| 50| 171K| 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | ``ce62140cb23feb9becf6270d0d1fe6d1`` | | NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus`` | ``train`` ``dev`` ``test``| 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | ``a89dba18a62ef92f7d323ec890a0d38d`` | | BioASQ | [Homepage](http://bioasq.org) | ``bioasq``| ``train`` ``test`` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) | | NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | ``nq``| ``train`` ``test``| 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | ``d4d3d2e48787a744b6f6e691ff534307`` | | HotpotQA | [Homepage](https://hotpotqa.github.io) | ``hotpotqa``| ``train`` ``dev`` ``test``| 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | ``f412724f78b0d91183a0e86805e16114`` | | FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | ``fiqa`` | ``train`` ``dev`` ``test``| 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | ``17918ed23cd04fb15047f73e6c3bd9d9`` | | Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | ``test``| 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) | | TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | ``trec-news`` | ``test``| 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) | | ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | ``arguana``| ``test`` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | ``8ad3e3c2a5867cdced806d6503f29b99`` | | Touche-2020| [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| ``test``| 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | ``46f650ba5a527fc69e0a6521c5a23563`` | | CQADupstack| [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| ``test``| 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | ``4e41456d7df8ee7760a7f866133bda78`` | | Quora| [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| ``dev`` ``test``| 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | ``18fb154900ba42a600f84b839c173167`` | | DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| ``dev`` ``test``| 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | ``c2a39eb420a3164af735795df012ac2c`` | | SCIDOCS| [Homepage](https://allenai.org/data/scidocs) | ``scidocs``| ``test``| 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | ``38121350fc3a4d2f48850f6aff52e4a9`` | | FEVER | [Homepage](http://fever.ai) | ``fever``| ``train`` ``dev`` ``test``| 6,666 | 5.42M | 1.2| [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | ``5a818580227bfb4b35bb6fa46d9b6c03`` | | Climate-FEVER| [Homepage](http://climatefever.ai) | ``climate-fever``|``test``| 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | ``8b66f0a9126c521bae2bde127b4dc99d`` | | SciFact| [Homepage](https://github.com/allenai/scifact) | ``scifact``| ``train`` ``test``| 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | ``5f7d1de60b170fc8027bb7898e2efca1`` | | Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | ``robust04``| ``test``| 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) | ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information [Needs More Information] ### Citation Information Cite as: ``` @inproceedings{ thakur2021beir, title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models}, author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)}, year={2021}, url={https://openreview.net/forum?id=wCu6T5xFjeJ} } ``` ### Contributions Thanks to [@Nthakur20](https://github.com/Nthakur20) for adding this dataset.

提供机构：

income

原始信息汇总

数据集卡片 for BEIR Benchmark

数据集描述

数据集摘要

BEIR 是一个异构基准，由 18 个不同的数据集组成，代表了 9 种信息检索任务：

事实核查：FEVER, Climate-FEVER, SciFact
问答：NQ, HotpotQA, FiQA-2018
生物医学信息检索：TREC-COVID, BioASQ, NFCorpus
新闻检索：TREC-NEWS, Robust04
论点检索：Touche-2020, ArguAna
重复问题检索：Quora, CqaDupstack
引用预测：SCIDOCS
推文检索：Signal-1M
实体检索：DBPedia

所有这些数据集都已预处理，可供实验使用。

支持的任务和排行榜

该数据集支持一个排行榜，评估模型在特定任务指标（如 F1 或 EM）上的表现，以及它们从维基百科检索支持信息的能力。

当前最佳表现的模型可以在这里找到。

语言

所有任务均为英语 (en)。

数据集结构

所有 BEIR 数据集必须包含语料库、查询和 qrels（相关性判断文件）。它们必须采用以下格式：

corpus 文件：一个 .jsonl 文件（jsonlines），包含一系列字典，每个字典有三个字段：_id 表示唯一文档标识符，title 表示文档标题（可选），text 表示文档段落或段落。例如：{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries 文件：一个 .jsonl 文件（jsonlines），包含一系列字典，每个字典有两个字段：_id 表示唯一查询标识符，text 表示查询文本。例如：{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels 文件：一个 .tsv 文件（制表符分隔），包含三列，即 query-id、corpus-id 和 score。保持第一行为标题。例如：q1 doc1 1

数据实例

任何 beir 数据集的高级示例：

python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his massâ€“energy equivalence formula E = mc2, which has been dubbed the worlds most famous equation. He received the 1921 Nobel Prize in Physics for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect, a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # 如果没有标题，保持为空字符串 "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of malted barley. The two main varieties are German WeiÃŸbier and Belgian witbier; other types include Lambic (made with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, }

queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" }

qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, }

数据字段

所有配置的示例具有以下特征：

语料库

corpus：表示文档标题和段落文本的字典特征，包含：
- _id：表示唯一文档 id 的字符串特征
  - title：表示文档标题的字符串特征。
  - text：表示文档文本的字符串特征。

查询

queries：表示查询的字典特征，包含：
- _id：表示唯一查询 id 的字符串特征
- text：表示查询文本的字符串特征。

Qrels

qrels：表示查询文档相关性判断的字典特征，包含：
- _id：表示查询 id 的字符串特征
  - _id：表示文档 id 的字符串特征。
  - score：表示查询和文档之间相关性判断的 int32 特征。

数据分割

数据集	网站	BEIR-Name	类型	查询数量	语料库大小	相关文档/查询	下载链接	md5
MSMARCO	主页	`msmarco`	`train`<br>`dev`<br>`test`	6,980	8.84M	1.1	链接	`444067daf65d982533ea17ebd59501e4`
TREC-COVID	主页	`trec-covid`	`test`	50	171K	493.5	链接	`ce62140cb23feb9becf6270d0d1fe6d1`
NFCorpus	主页	`nfcorpus`	`train`<br>`dev`<br>`test`	323	3.6K	38.2	链接	`a89dba18a62ef92f7d323ec890a0d38d`
BioASQ	主页	`bioasq`	`train`<br>`test`	500	14.91M	8.05	无	如何重现？
NQ	主页	`nq`	`train`<br>`test`	3,452	2.68M	1.2	链接	`d4d3d2e48787a744b6f6e691ff534307`
HotpotQA	主页	`hotpotqa`	`train`<br>`dev`<br>`test`	7,405	5.23M	2.0	链接	`f412724f78b0d91183a0e86805e16114`
FiQA-2018	主页	`fiqa`	`train`<br>`dev`<br>`test`	648	57K	2.6	链接	`17918ed23cd04fb15047f73e6c3bd9d9`
Signal-1M(RT)	主页	`signal1m`	`test`	97	2.86M	19.6	无	如何重现？
TREC-NEWS	主页	`trec-news`	`test`	57	595K	19.6	无	如何重现？
ArguAna	主页	`arguana`	`test`	1,406	8.67K	1.0	链接	`8ad3e3c2a5867cdced806d6503f29b99`
Touche-2020	主页	`webis-touche2020`	`test`	49	382K	19.0	链接	`46f650ba5a527fc69e0a6521c5a23563`
CQADupstack	主页	`cqadupstack`	`test`	13,145	457K	1.4	链接	`4e41456d7df8ee7760a7f866133bda78`
Quora	主页	`quora`	`dev`<br>`test`	10,000	523K	1.6	链接	`18fb154900ba42a600f84b839c173167`
DBPedia	主页	`dbpedia-entity`	`dev`<br>`test`	400	4.63M	38.2	链接	`c2a39eb420a3164af735795df012ac2c`
SCIDOCS	主页	`scidocs`	`test`	1,000	25K	4.9	链接	`38121350fc3a4d2f48850f6aff52e4a9`
FEVER	主页	`fever`	`train`<br>`dev`<br>`test`	6,666	5.42M	1.2	链接	`5a818580227bfb4b35bb6fa46d9b6c03`
Climate-FEVER	主页	`climate-fever`	`test`	1,535	5.42M	3.0	链接	`8b66f0a9126c521bae2bde127b4dc99d`
SciFact	主页	`scifact`	`train`<br>`test`	300	5K	1.1	链接	`5f7d1de60b170fc8027bb7898e2efca1`
Robust04	主页	`robust04`	`test`	249	528K	69.9	无	如何重现？

数据集创建

策划理由

[需要更多信息]

源数据

初始数据收集和规范化

[需要更多信息]

源语言生产者是谁？

[需要更多信息]

注释

注释过程

[需要更多信息]

注释者是谁？

[需要更多信息]

个人和敏感信息

[需要更多信息]

使用数据集的注意事项

数据集的社会影响

[需要更多信息]

偏见的讨论

[需要更多信息]

其他已知限制

[需要更多信息]

附加信息

数据集策展人

[需要更多信息

搜集汇总

数据集介绍

构建方式

该数据集源自BEIR基准测试中的NFCorpus子集，通过DocT5query模型（BeIR/query-gen-msmarco-t5-base-v1）为每个语料段落生成20条合成查询。生成过程基于Anserini框架的并行评估脚本实现，利用预训练的序列到序列模型从文档文本中自动推导出潜在的用户查询，从而构建了包含文档标识符与对应生成问题的高质量查询集合。

特点

作为BEIR异构基准的一部分，该数据集聚焦于生物医学信息检索领域，涵盖323个查询与约3,600篇文档，平均每篇文档拥有38.2条相关判断。其核心特色在于为每个文档提供了20条多样化的合成查询，这些查询通过神经生成模型产生，有效扩展了原始语料的查询覆盖范围，为评估检索模型在零样本场景下的泛化能力提供了丰富资源。

使用方法

数据集遵循BEIR标准格式，包含corpus.jsonl（文档语料）、queries.jsonl（生成查询）和qrels.tsv（相关性判断）三部分。使用时可通过HuggingFace Datasets库加载，将文档与查询进行检索匹配。研究者常将其用于训练或评估稀疏检索（如BM25）与稠密检索模型，通过计算查询与文档的相似度得分来验证模型在生物医学领域的信息检索性能。

背景与挑战

背景概述

信息检索领域的评估基准长期受限于任务单一与领域狭窄，难以衡量模型在零样本场景下的泛化能力。为此，Nandan Thakur、Nils Reimers等来自英国伦敦大学学院（UKPLab）与加拿大滑铁卢大学的研究人员，于2021年在NeurIPS数据集与基准轨道上发布了BEIR（Benchmark for Evaluation of Information Retrieval）异构基准。该基准整合了涵盖事实核查、问答、生物医学检索、新闻检索等九类任务的18个数据集，旨在为检索模型的零样本迁移能力提供标准化评估。其中，NFCorpus作为生物医学检索子集，聚焦于营养与健康领域的文献检索，其语料库规模虽小（约3600篇文档），却因专业术语密集与查询-文档关联稀疏而构成独特挑战。BEIR的诞生迅速成为信息检索领域评价零样本模型的主流标准，推动了密集检索与稀疏检索技术的对比研究。

当前挑战

该数据集所面临的挑战首先体现在领域适配的复杂性上：NFCorpus作为生物医学检索子集，其文档内容涉及高度专业化的营养学与临床医学知识，通用检索模型常因缺乏领域语料预训练而难以捕捉术语间的语义关联，导致零样本场景下的召回率显著低于领域微调模型。构建过程中，为缓解原始NFCorpus查询数量不足（仅323条）对模型训练与评估的制约，研究团队采用DocT5query模型（BeIR/query-gen-msmarco-t5-base-v1）为每篇文档生成20条合成查询，该过程面临查询质量与真实用户意图之间的语义鸿沟——自动生成的查询可能偏离原始文档的核心信息焦点，甚至引入噪声，从而影响下游检索性能的可靠评估。此外，合成查询的多样性控制以及生成模型对源域（MS MARCO）风格的过拟合倾向，亦对数据集的生态效度构成潜在威胁。

常用场景

经典使用场景

该数据集作为BEIR基准测试中NFCorpus子集的扩充组件，其经典使用场景聚焦于零样本信息检索模型的评估与优化。研究者借助DocT5query模型为NFCorpus中每篇生物医学文献生成20条合成查询，从而构建出丰富的查询-文档配对样本。这一设计使得模型能够在缺乏领域特定标注数据的情况下，通过检索生成的查询来检验其对生物医学文本语义的理解能力，尤其适用于评估检索系统在跨领域迁移时的鲁棒性与泛化性能。

衍生相关工作

该数据集衍生了多项经典工作，其中最具代表性的是BEIR基准测试框架的提出，该框架整合了18个异构数据集，系统评估检索模型在零样本场景下的表现。此外，DocT5query模型作为查询生成的核心工具，被广泛应用于其他领域的检索增强生成任务。后续研究还在此基础上发展了基于对比学习的检索预训练方法，以及融合合成数据与真实数据的混合训练策略，进一步提升了模型在低资源场景中的检索质量。

数据集最近研究