five

income/cqadupstack-wordpress-top-20-gen-queries

收藏
Hugging Face2023-01-24 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/income/cqadupstack-wordpress-top-20-gen-queries
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: [] language_creators: [] language: - en license: - cc-by-sa-4.0 multilinguality: - monolingual paperswithcode_id: beir pretty_name: BEIR Benchmark size_categories: msmarco: - 1M<n<10M trec-covid: - 100k<n<1M nfcorpus: - 1K<n<10K nq: - 1M<n<10M hotpotqa: - 1M<n<10M fiqa: - 10K<n<100K arguana: - 1K<n<10K touche-2020: - 100K<n<1M cqadupstack: - 100K<n<1M quora: - 100K<n<1M dbpedia: - 1M<n<10M scidocs: - 10K<n<100K fever: - 1M<n<10M climate-fever: - 1M<n<10M scifact: - 1K<n<10K source_datasets: [] task_categories: - text-retrieval --- # NFCorpus: 20 generated queries (BEIR Benchmark) This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset. - DocT5query model used: [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1) - id (str): unique document id in NFCorpus in the BEIR benchmark (`corpus.jsonl`). - Questions generated: 20 - Code used for generation: [evaluate_anserini_docT5query_parallel.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/sparse/evaluate_anserini_docT5query_parallel.py) Below contains the old dataset card for the BEIR benchmark. # Dataset Card for BEIR Benchmark ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/UKPLab/beir - **Repository:** https://github.com/UKPLab/beir - **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ - **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns - **Point of Contact:** nandan.thakur@uwaterloo.ca ### Dataset Summary BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks: - Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact) - Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/) - Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) - News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html) - Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](tp://argumentation.bplaced.net/arguana/data) - Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CqaDupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) - Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs) - Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html) - Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/) All these datasets have been preprocessed and can be used for your experiments. ```python ``` ### Supported Tasks and Leaderboards The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia. The current best performing models can be found [here](https://eval.ai/web/challenges/challenge-page/689/leaderboard/). ### Languages All tasks are in English (`en`). ## Dataset Structure All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format: - `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields `_id` with unique document identifier, `title` with document title (optional) and `text` with document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}` - `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields `_id` with unique query identifier and `text` with query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}` - `qrels` file: a `.tsv` file (tab-seperated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score` in this order. Keep 1st row as header. For example: `q1 doc1 1` ### Data Instances A high level example of any beir dataset: ```python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \ one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \ its influence on the philosophy of science. He is best known to the general public for his mass–energy \ equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \ Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \ of the photoelectric effect', a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # Keep title an empty string if not present "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \ malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\ with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, } queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" } qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, } ``` ### Data Fields Examples from all configurations have the following features: ### Corpus - `corpus`: a `dict` feature representing the document title and passage text, made up of: - `_id`: a `string` feature representing the unique document id - `title`: a `string` feature, denoting the title of the document. - `text`: a `string` feature, denoting the text of the document. ### Queries - `queries`: a `dict` feature representing the query, made up of: - `_id`: a `string` feature representing the unique query id - `text`: a `string` feature, denoting the text of the query. ### Qrels - `qrels`: a `dict` feature representing the query document relevance judgements, made up of: - `_id`: a `string` feature representing the query id - `_id`: a `string` feature, denoting the document id. - `score`: a `int32` feature, denoting the relevance judgement between query and document. ### Data Splits | Dataset | Website| BEIR-Name | Type | Queries | Corpus | Rel D/Q | Down-load | md5 | | -------- | -----| ---------| --------- | ----------- | ---------| ---------| :----------: | :------:| | MSMARCO | [Homepage](https://microsoft.github.io/msmarco/)| ``msmarco`` | ``train``<br>``dev``<br>``test``| 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | ``444067daf65d982533ea17ebd59501e4`` | | TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| ``test``| 50| 171K| 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | ``ce62140cb23feb9becf6270d0d1fe6d1`` | | NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus`` | ``train``<br>``dev``<br>``test``| 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | ``a89dba18a62ef92f7d323ec890a0d38d`` | | BioASQ | [Homepage](http://bioasq.org) | ``bioasq``| ``train``<br>``test`` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) | | NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | ``nq``| ``train``<br>``test``| 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | ``d4d3d2e48787a744b6f6e691ff534307`` | | HotpotQA | [Homepage](https://hotpotqa.github.io) | ``hotpotqa``| ``train``<br>``dev``<br>``test``| 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | ``f412724f78b0d91183a0e86805e16114`` | | FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | ``fiqa`` | ``train``<br>``dev``<br>``test``| 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | ``17918ed23cd04fb15047f73e6c3bd9d9`` | | Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | ``test``| 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) | | TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | ``trec-news`` | ``test``| 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) | | ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | ``arguana``| ``test`` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | ``8ad3e3c2a5867cdced806d6503f29b99`` | | Touche-2020| [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| ``test``| 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | ``46f650ba5a527fc69e0a6521c5a23563`` | | CQADupstack| [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| ``test``| 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | ``4e41456d7df8ee7760a7f866133bda78`` | | Quora| [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| ``dev``<br>``test``| 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | ``18fb154900ba42a600f84b839c173167`` | | DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| ``dev``<br>``test``| 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | ``c2a39eb420a3164af735795df012ac2c`` | | SCIDOCS| [Homepage](https://allenai.org/data/scidocs) | ``scidocs``| ``test``| 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | ``38121350fc3a4d2f48850f6aff52e4a9`` | | FEVER | [Homepage](http://fever.ai) | ``fever``| ``train``<br>``dev``<br>``test``| 6,666 | 5.42M | 1.2| [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | ``5a818580227bfb4b35bb6fa46d9b6c03`` | | Climate-FEVER| [Homepage](http://climatefever.ai) | ``climate-fever``|``test``| 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | ``8b66f0a9126c521bae2bde127b4dc99d`` | | SciFact| [Homepage](https://github.com/allenai/scifact) | ``scifact``| ``train``<br>``test``| 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | ``5f7d1de60b170fc8027bb7898e2efca1`` | | Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | ``robust04``| ``test``| 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) | ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information [Needs More Information] ### Citation Information Cite as: ``` @inproceedings{ thakur2021beir, title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models}, author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)}, year={2021}, url={https://openreview.net/forum?id=wCu6T5xFjeJ} } ``` ### Contributions Thanks to [@Nthakur20](https://github.com/Nthakur20) for adding this dataset.Top-20 generated queries for every passage in NFCorpus # Dataset Card for BEIR Benchmark ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/UKPLab/beir - **Repository:** https://github.com/UKPLab/beir - **Paper:** https://openreview.net/forum?id=wCu6T5xFjeJ - **Leaderboard:** https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns - **Point of Contact:** nandan.thakur@uwaterloo.ca ### Dataset Summary BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks: - Fact-checking: [FEVER](http://fever.ai), [Climate-FEVER](http://climatefever.ai), [SciFact](https://github.com/allenai/scifact) - Question-Answering: [NQ](https://ai.google.com/research/NaturalQuestions), [HotpotQA](https://hotpotqa.github.io), [FiQA-2018](https://sites.google.com/view/fiqa/) - Bio-Medical IR: [TREC-COVID](https://ir.nist.gov/covidSubmit/index.html), [BioASQ](http://bioasq.org), [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) - News Retrieval: [TREC-NEWS](https://trec.nist.gov/data/news2019.html), [Robust04](https://trec.nist.gov/data/robust/04.guidelines.html) - Argument Retrieval: [Touche-2020](https://webis.de/events/touche-20/shared-task-1.html), [ArguAna](tp://argumentation.bplaced.net/arguana/data) - Duplicate Question Retrieval: [Quora](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [CqaDupstack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) - Citation-Prediction: [SCIDOCS](https://allenai.org/data/scidocs) - Tweet Retrieval: [Signal-1M](https://research.signal-ai.com/datasets/signal1m-tweetir.html) - Entity Retrieval: [DBPedia](https://github.com/iai-group/DBpedia-Entity/) All these datasets have been preprocessed and can be used for your experiments. ```python ``` ### Supported Tasks and Leaderboards The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia. The current best performing models can be found [here](https://eval.ai/web/challenges/challenge-page/689/leaderboard/). ### Languages All tasks are in English (`en`). ## Dataset Structure All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format: - `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields `_id` with unique document identifier, `title` with document title (optional) and `text` with document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}` - `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields `_id` with unique query identifier and `text` with query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}` - `qrels` file: a `.tsv` file (tab-seperated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score` in this order. Keep 1st row as header. For example: `q1 doc1 1` ### Data Instances A high level example of any beir dataset: ```python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \ one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \ its influence on the philosophy of science. He is best known to the general public for his mass–energy \ equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \ Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \ of the photoelectric effect', a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # Keep title an empty string if not present "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \ malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\ with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, } queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" } qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, } ``` ### Data Fields Examples from all configurations have the following features: ### Corpus - `corpus`: a `dict` feature representing the document title and passage text, made up of: - `_id`: a `string` feature representing the unique document id - `title`: a `string` feature, denoting the title of the document. - `text`: a `string` feature, denoting the text of the document. ### Queries - `queries`: a `dict` feature representing the query, made up of: - `_id`: a `string` feature representing the unique query id - `text`: a `string` feature, denoting the text of the query. ### Qrels - `qrels`: a `dict` feature representing the query document relevance judgements, made up of: - `_id`: a `string` feature representing the query id - `_id`: a `string` feature, denoting the document id. - `score`: a `int32` feature, denoting the relevance judgement between query and document. ### Data Splits | Dataset | Website| BEIR-Name | Type | Queries | Corpus | Rel D/Q | Down-load | md5 | | -------- | -----| ---------| --------- | ----------- | ---------| ---------| :----------: | :------:| | MSMARCO | [Homepage](https://microsoft.github.io/msmarco/)| ``msmarco`` | ``train``<br>``dev``<br>``test``| 6,980 | 8.84M | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip) | ``444067daf65d982533ea17ebd59501e4`` | | TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| ``test``| 50| 171K| 493.5 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip) | ``ce62140cb23feb9becf6270d0d1fe6d1`` | | NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus`` | ``train``<br>``dev``<br>``test``| 323 | 3.6K | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip) | ``a89dba18a62ef92f7d323ec890a0d38d`` | | BioASQ | [Homepage](http://bioasq.org) | ``bioasq``| ``train``<br>``test`` | 500 | 14.91M | 8.05 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#2-bioasq) | | NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | ``nq``| ``train``<br>``test``| 3,452 | 2.68M | 1.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq.zip) | ``d4d3d2e48787a744b6f6e691ff534307`` | | HotpotQA | [Homepage](https://hotpotqa.github.io) | ``hotpotqa``| ``train``<br>``dev``<br>``test``| 7,405 | 5.23M | 2.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/hotpotqa.zip) | ``f412724f78b0d91183a0e86805e16114`` | | FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | ``fiqa`` | ``train``<br>``dev``<br>``test``| 648 | 57K | 2.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip) | ``17918ed23cd04fb15047f73e6c3bd9d9`` | | Signal-1M(RT) | [Homepage](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | ``test``| 97 | 2.86M | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#4-signal-1m) | | TREC-NEWS | [Homepage](https://trec.nist.gov/data/news2019.html) | ``trec-news`` | ``test``| 57 | 595K | 19.6 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#1-trec-news) | | ArguAna | [Homepage](http://argumentation.bplaced.net/arguana/data) | ``arguana``| ``test`` | 1,406 | 8.67K | 1.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip) | ``8ad3e3c2a5867cdced806d6503f29b99`` | | Touche-2020| [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| ``test``| 49 | 382K | 19.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/webis-touche2020.zip) | ``46f650ba5a527fc69e0a6521c5a23563`` | | CQADupstack| [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| ``test``| 13,145 | 457K | 1.4 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/cqadupstack.zip) | ``4e41456d7df8ee7760a7f866133bda78`` | | Quora| [Homepage](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| ``dev``<br>``test``| 10,000 | 523K | 1.6 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip) | ``18fb154900ba42a600f84b839c173167`` | | DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| ``dev``<br>``test``| 400 | 4.63M | 38.2 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/dbpedia-entity.zip) | ``c2a39eb420a3164af735795df012ac2c`` | | SCIDOCS| [Homepage](https://allenai.org/data/scidocs) | ``scidocs``| ``test``| 1,000 | 25K | 4.9 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip) | ``38121350fc3a4d2f48850f6aff52e4a9`` | | FEVER | [Homepage](http://fever.ai) | ``fever``| ``train``<br>``dev``<br>``test``| 6,666 | 5.42M | 1.2| [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fever.zip) | ``5a818580227bfb4b35bb6fa46d9b6c03`` | | Climate-FEVER| [Homepage](http://climatefever.ai) | ``climate-fever``|``test``| 1,535 | 5.42M | 3.0 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/climate-fever.zip) | ``8b66f0a9126c521bae2bde127b4dc99d`` | | SciFact| [Homepage](https://github.com/allenai/scifact) | ``scifact``| ``train``<br>``test``| 300 | 5K | 1.1 | [Link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip) | ``5f7d1de60b170fc8027bb7898e2efca1`` | | Robust04 | [Homepage](https://trec.nist.gov/data/robust/04.guidelines.html) | ``robust04``| ``test``| 249 | 528K | 69.9 | No | [How to Reproduce?](https://github.com/UKPLab/beir/blob/main/examples/dataset#3-robust04) | ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information [Needs More Information] ### Citation Information Cite as: ``` @inproceedings{ thakur2021beir, title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models}, author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)}, year={2021}, url={https://openreview.net/forum?id=wCu6T5xFjeJ} } ``` ### Contributions Thanks to [@Nthakur20](https://github.com/Nthakur20) for adding this dataset.
提供机构:
income
原始信息汇总

数据集卡片 for BEIR Benchmark

数据集描述

数据集摘要

BEIR 是一个异构基准,由 18 个不同的数据集组成,代表了 9 种信息检索任务:

所有这些数据集都已预处理,可供实验使用。

支持的任务和排行榜

该数据集支持一个排行榜,评估模型在特定任务指标(如 F1 或 EM)上的表现,以及它们从维基百科检索支持信息的能力。

当前最佳表现的模型可以在这里找到。

语言

所有任务均为英语 (en)。

数据集结构

所有 BEIR 数据集必须包含语料库、查询和 qrels(相关性判断文件)。它们必须采用以下格式:

  • corpus 文件:一个 .jsonl 文件(jsonlines),包含一系列字典,每个字典有三个字段:_id 表示唯一文档标识符,title 表示文档标题(可选),text 表示文档段落或段落。例如:{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
  • queries 文件:一个 .jsonl 文件(jsonlines),包含一系列字典,每个字典有两个字段:_id 表示唯一查询标识符,text 表示查询文本。例如:{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
  • qrels 文件:一个 .tsv 文件(制表符分隔),包含三列,即 query-idcorpus-idscore。保持第一行为标题。例如:q1 doc1 1

数据实例

任何 beir 数据集的高级示例:

python corpus = { "doc1" : { "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula E = mc2, which has been dubbed the worlds most famous equation. He received the 1921 Nobel Prize in Physics for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect, a pivotal step in the development of quantum theory." }, "doc2" : { "title": "", # 如果没有标题,保持为空字符串 "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)." }, }

queries = { "q1" : "Who developed the mass-energy equivalence formula?", "q2" : "Which beer is brewed with a large proportion of wheat?" }

qrels = { "q1" : {"doc1": 1}, "q2" : {"doc2": 1}, }

数据字段

所有配置的示例具有以下特征:

语料库

  • corpus:表示文档标题和段落文本的字典特征,包含:
    • _id:表示唯一文档 id 的字符串特征
      • title:表示文档标题的字符串特征。
      • text:表示文档文本的字符串特征。

查询

  • queries:表示查询的字典特征,包含:
    • _id:表示唯一查询 id 的字符串特征
    • text:表示查询文本的字符串特征。

Qrels

  • qrels:表示查询文档相关性判断的字典特征,包含:
    • _id:表示查询 id 的字符串特征
      • _id:表示文档 id 的字符串特征。
      • score:表示查询和文档之间相关性判断的 int32 特征。

数据分割

数据集 网站 BEIR-Name 类型 查询数量 语料库大小 相关文档/查询 下载链接 md5
MSMARCO 主页 msmarco train<br>dev<br>test 6,980 8.84M 1.1 链接 444067daf65d982533ea17ebd59501e4
TREC-COVID 主页 trec-covid test 50 171K 493.5 链接 ce62140cb23feb9becf6270d0d1fe6d1
NFCorpus 主页 nfcorpus train<br>dev<br>test 323 3.6K 38.2 链接 a89dba18a62ef92f7d323ec890a0d38d
BioASQ 主页 bioasq train<br>test 500 14.91M 8.05 如何重现?
NQ 主页 nq train<br>test 3,452 2.68M 1.2 链接 d4d3d2e48787a744b6f6e691ff534307
HotpotQA 主页 hotpotqa train<br>dev<br>test 7,405 5.23M 2.0 链接 f412724f78b0d91183a0e86805e16114
FiQA-2018 主页 fiqa train<br>dev<br>test 648 57K 2.6 链接 17918ed23cd04fb15047f73e6c3bd9d9
Signal-1M(RT) 主页 signal1m test 97 2.86M 19.6 如何重现?
TREC-NEWS 主页 trec-news test 57 595K 19.6 如何重现?
ArguAna 主页 arguana test 1,406 8.67K 1.0 链接 8ad3e3c2a5867cdced806d6503f29b99
Touche-2020 主页 webis-touche2020 test 49 382K 19.0 链接 46f650ba5a527fc69e0a6521c5a23563
CQADupstack 主页 cqadupstack test 13,145 457K 1.4 链接 4e41456d7df8ee7760a7f866133bda78
Quora 主页 quora dev<br>test 10,000 523K 1.6 链接 18fb154900ba42a600f84b839c173167
DBPedia 主页 dbpedia-entity dev<br>test 400 4.63M 38.2 链接 c2a39eb420a3164af735795df012ac2c
SCIDOCS 主页 scidocs test 1,000 25K 4.9 链接 38121350fc3a4d2f48850f6aff52e4a9
FEVER 主页 fever train<br>dev<br>test 6,666 5.42M 1.2 链接 5a818580227bfb4b35bb6fa46d9b6c03
Climate-FEVER 主页 climate-fever test 1,535 5.42M 3.0 链接 8b66f0a9126c521bae2bde127b4dc99d
SciFact 主页 scifact train<br>test 300 5K 1.1 链接 5f7d1de60b170fc8027bb7898e2efca1
Robust04 主页 robust04 test 249 528K 69.9 如何重现?

数据集创建

策划理由

[需要更多信息]

源数据

初始数据收集和规范化

[需要更多信息]

源语言生产者是谁?

[需要更多信息]

注释

注释过程

[需要更多信息]

注释者是谁?

[需要更多信息]

个人和敏感信息

[需要更多信息]

使用数据集的注意事项

数据集的社会影响

[需要更多信息]

偏见的讨论

[需要更多信息]

其他已知限制

[需要更多信息]

附加信息

数据集策展人

[需要更多信息

搜集汇总
数据集介绍
main_image_url
构建方式
该数据集源自BEIR基准测试中的NFCorpus子集,通过DocT5query模型(BeIR/query-gen-msmarco-t5-base-v1)为每个语料段落生成20条合成查询。生成过程基于Anserini框架的并行评估脚本实现,利用预训练的序列到序列模型从文档文本中自动推导出潜在的用户查询,从而构建了包含文档标识符与对应生成问题的高质量查询集合。
特点
作为BEIR异构基准的一部分,该数据集聚焦于生物医学信息检索领域,涵盖323个查询与约3,600篇文档,平均每篇文档拥有38.2条相关判断。其核心特色在于为每个文档提供了20条多样化的合成查询,这些查询通过神经生成模型产生,有效扩展了原始语料的查询覆盖范围,为评估检索模型在零样本场景下的泛化能力提供了丰富资源。
使用方法
数据集遵循BEIR标准格式,包含corpus.jsonl(文档语料)、queries.jsonl(生成查询)和qrels.tsv(相关性判断)三部分。使用时可通过HuggingFace Datasets库加载,将文档与查询进行检索匹配。研究者常将其用于训练或评估稀疏检索(如BM25)与稠密检索模型,通过计算查询与文档的相似度得分来验证模型在生物医学领域的信息检索性能。
背景与挑战
背景概述
信息检索领域的评估基准长期受限于任务单一与领域狭窄,难以衡量模型在零样本场景下的泛化能力。为此,Nandan Thakur、Nils Reimers等来自英国伦敦大学学院(UKPLab)与加拿大滑铁卢大学的研究人员,于2021年在NeurIPS数据集与基准轨道上发布了BEIR(Benchmark for Evaluation of Information Retrieval)异构基准。该基准整合了涵盖事实核查、问答、生物医学检索、新闻检索等九类任务的18个数据集,旨在为检索模型的零样本迁移能力提供标准化评估。其中,NFCorpus作为生物医学检索子集,聚焦于营养与健康领域的文献检索,其语料库规模虽小(约3600篇文档),却因专业术语密集与查询-文档关联稀疏而构成独特挑战。BEIR的诞生迅速成为信息检索领域评价零样本模型的主流标准,推动了密集检索与稀疏检索技术的对比研究。
当前挑战
该数据集所面临的挑战首先体现在领域适配的复杂性上:NFCorpus作为生物医学检索子集,其文档内容涉及高度专业化的营养学与临床医学知识,通用检索模型常因缺乏领域语料预训练而难以捕捉术语间的语义关联,导致零样本场景下的召回率显著低于领域微调模型。构建过程中,为缓解原始NFCorpus查询数量不足(仅323条)对模型训练与评估的制约,研究团队采用DocT5query模型(BeIR/query-gen-msmarco-t5-base-v1)为每篇文档生成20条合成查询,该过程面临查询质量与真实用户意图之间的语义鸿沟——自动生成的查询可能偏离原始文档的核心信息焦点,甚至引入噪声,从而影响下游检索性能的可靠评估。此外,合成查询的多样性控制以及生成模型对源域(MS MARCO)风格的过拟合倾向,亦对数据集的生态效度构成潜在威胁。
常用场景
经典使用场景
该数据集作为BEIR基准测试中NFCorpus子集的扩充组件,其经典使用场景聚焦于零样本信息检索模型的评估与优化。研究者借助DocT5query模型为NFCorpus中每篇生物医学文献生成20条合成查询,从而构建出丰富的查询-文档配对样本。这一设计使得模型能够在缺乏领域特定标注数据的情况下,通过检索生成的查询来检验其对生物医学文本语义的理解能力,尤其适用于评估检索系统在跨领域迁移时的鲁棒性与泛化性能。
衍生相关工作
该数据集衍生了多项经典工作,其中最具代表性的是BEIR基准测试框架的提出,该框架整合了18个异构数据集,系统评估检索模型在零样本场景下的表现。此外,DocT5query模型作为查询生成的核心工具,被广泛应用于其他领域的检索增强生成任务。后续研究还在此基础上发展了基于对比学习的检索预训练方法,以及融合合成数据与真实数据的混合训练策略,进一步提升了模型在低资源场景中的检索质量。
数据集最近研究
最新研究方向
在信息检索领域的零样本评估浪潮中,BEIR基准测试集凭借其涵盖9类检索任务与18个异构数据源的独特架构,已成为衡量模型泛化能力的黄金标准。该数据集聚焦于前沿的DocT5query技术,通过为NFCorpus等子集内的每个文档生成20条合成查询,显著强化了检索系统对语义关联的建模深度。这一方向紧密关联近期密集检索与稀疏检索的融合趋势,尤其在生物医学文献检索场景中,合成查询的引入有效缓解了标注数据匮乏的瓶颈,推动了零样本跨域检索的实用化进程。其意义在于为预训练语言模型在低资源领域的适配提供了可复现的评估范式,同时揭示了查询生成质量对检索精度的关键影响,为下一代基于生成式AI的检索系统奠定了方法论基础。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作