IR-Cocktail/hotpotqa

Name: IR-Cocktail/hotpotqa
Creator: IR-Cocktail
Published: 2024-05-22 15:19:04
License: No description available

Source: Hugging Face. Updated 2024-05-22, indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/IR-Cocktail/hotpotqa

Description:

## Data Description

- **Homepage:** https://github.com/KID-22/Cocktail
- **Repository:** https://github.com/KID-22/Cocktail
- **Paper:** [Needs More Information]

## Dataset Summary

All 16 benchmarked datasets in Cocktail are listed in the following table.

| Dataset | Raw Website | Cocktail Website | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | # Test Query | # Corpus |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| MS MARCO | [Homepage](https://microsoft.github.io/msmarco/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/msmarco) | `msmarco` | `985926f3e906fadf0dc6249f23ed850f` | Misc. | Binary | 6,979 | 542,203 |
| DL19 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl19) | `dl19` | `d652af47ec0e844af43109c0acf50b74` | Misc. | Binary | 43 | 542,203 |
| DL20 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl20) | `dl20` | `3afc48141dce3405ede2b6b937c65036` | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/trec-covid) | `trec-covid` | `1e1e2264b623d9cb7cb50df8141bd535` | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nfcorpus) | `nfcorpus` | `695327760647984c5014d64b2fee8de0` | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq) | `nq` | `a10bfe33efdec54aafcc974ac989c338` | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | [Homepage](https://hotpotqa.github.io/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/hotpotqa) | `hotpotqa` | `74467760fff8bf8fbdadd5094bf9dd7b` | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fiqa) | `fiqa` | `4e1e688539b0622630fb6e65d39d26fa` | Finance | Binary | 648 | 57,450 |
| Touché-2020 | [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/webis-touche2020) | `webis-touche2020` | `d58ec465ccd567d8f75edb419b0faaed` | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dcqadupstackl19) | `cqadupstack` | `d48d963bc72689c765f381f04fc26f8b` | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity) | `dbpedia-entity` | `43292f4f1a1927e2e323a4a7fa165fc1` | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | [Homepage](https://allenai.org/data/scidocs) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scidocs) | `scidocs` | `4058c0915594ab34e9b2b67f885c595f` | Scientific | Binary | 1,000 | 25,259 |
| FEVER | [Homepage](http://fever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fever) | `fever` | `98b631887d8c38772463e9633c477c69` | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | [Homepage](http://climatefever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/climate-fever) | `climate-fever` | `5734d6ac34f24f5da496b27e04ff991a` | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | [Homepage](https://github.com/allenai/scifact) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scifact) | `scifact` | `b5b8e24ccad98c9ca959061af14bf833` | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | [Homepage](https://anonymous.4open.science/r/Cocktail-BA4B/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq-utd) | `nq-utd` | `2e12e66393829cd4be715718f99d2436` | Misc. | 3-level | 80 | 800 |

## Dataset Structure

```shell
.
├── corpus                               # * documents
│   ├── human.jsonl                      # * human-written corpus
│   └── llama-2-7b-chat-tmp0.2.jsonl     # * llm-generated corpus
├── qrels
│   └── test.tsv                         # * relevance for queries
└── queries.jsonl                        # * queries
```

All Cocktail datasets must contain a human-written corpus, an LLM-generated corpus, queries and qrels. They must be in the following format:

- `corpus`: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with the unique document identifier, `title` with the document title (optional) and `text` with the document paragraph or passage. For example: `{"_id": "doc1", "title": "title", "text": "text"}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with the unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "q1_text"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`

Cite as:

```
@article{cocktail,
  title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration},
  author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong},
  journal={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}

@article{dai2024neural,
  title={Neural Retrievers are Biased Towards LLM-Generated Content},
  author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun},
  journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2024}
}
```

Provider:

IR-Cocktail

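Summary of original information

Dataset overview

Dataset list

Dataset	Raw website	Cocktail website	Cocktail name	MD5 of processed data	Domain	Relevance level	# Test queries	Corpus size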
MS MARCO	https://microsoft.github.io/msmarco/	https://huggingface.co/datasets/IR-Cocktail/msmarco	msmarco	985926f3e906fadf0dc6249f23ed850f	Misc.	Binary	6,979	542,203
DL19	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019	https://huggingface.co/datasets/IR-Cocktail/dl19	dl19	d652af47ec0e844af43109c0acf50b74	Misc.	Binary	43	542,203
DL20	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020	https://huggingface.co/datasets/IR-Cocktail/dl20	dl20	3afc48141dce3405ede2b6b937c65036	Misc.	Binary	54	542,203
TREC-COVID	https://ir.nist.gov/covidSubmit/index.html	https://huggingface.co/datasets/IR-Cocktail/trec-covid	trec-covid	1e1e2264b623d9cb7cb50df8141bd535	Bio-Medical	3-level	50	128,585
NFCorpus	https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/	https://huggingface.co/datasets/IR-Cocktail/nfcorpus	nfcorpus	695327760647984c5014d64b2fee8de0	Bio-Medical	3-level	323	3,633
NQ	https://ai.google.com/research/NaturalQuestions	https://huggingface.co/datasets/IR-Cocktail/nq	nq	a10bfe33efdec54aafcc974ac989c338	Wikipedia	Binary	3,446	104,194
HotpotQA	https://hotpotqa.github.io/	https://huggingface.co/datasets/IR-Cocktail/hotpotqa	hotpotqa	74467760fff8bf8fbdadd5094bf9dd7b	Wikipedia	Binary	7,405	111,107
FiQA-2018	https://sites.google.com/view/fiqa/	https://huggingface.co/datasets/IR-Cocktail/fiqa	fiqa	4e1e688539b0622630fb6e65d39d26fa	Finance	Binary	648	57,450
Touché-2020	https://webis.de/events/touche-20/shared-task-1.html	https://huggingface.co/datasets/IR-Cocktail/webis-touche2020	webis-touche2020	d58ec465ccd567d8f75edb419b0faaed	Misc.	3-level	49	101,922
CQADupStack	http://nlp.cis.unimelb.edu.au/resources/cqadupstack/	https://huggingface.co/datasets/IR-Cocktail/dcqadupstackl19	cqadupstack	d48d963bc72689c765f381f04fc26f8b	StackEx.	Binary	1,563	39,962
DBPedia	https://github.com/iai-group/DBpedia-Entity/	https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity	dbpedia-entity	43292f4f1a1927e2e323a4a7fa165fc1	Wikipedia	3-level	400	145,037
SCIDOCS	https://allenai.org/data/scidocs	https://huggingface.co/datasets/IR-Cocktail/scidocs	scidocs	4058c0915594ab34e9b2b67f885c595f	Scientific	Binary	1,000	25,259
FEVER	http://fever.ai/	https://huggingface.co/datasets/IR-Cocktail/fever	fever	98b631887d8c38772463e9633c477c69	Wikipedia	Binary	6,666	114,529
Climate-FEVER	http://climatefever.ai/	https://huggingface.co/datasets/IR-Cocktail/climate-fever	climate-fever	5734d6ac34f24f5da496b27e04ff991a	Wikipedia	Binary	1,535	101,339
SciFact	https://github.com/allenai/scifact	https://huggingface.co/datasets/IR-Cocktail/scifact	scifact	b5b8e24ccad98c9ca959061af14bf833	Scientific	Binary	300	5,183
NQ-UTD	https://anonymous.4open.science/r/Cocktail-BA4B/	https://huggingface.co/datasets/IR-Cocktail/nq-utd	nq-utd	2e12e66393829cd4be715718f99d2436	Misc.	3-level	80	800
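The MD5 column can be used to check a download for corruption. The following is a minimal sketch, not the authors' tooling: it assumes the checksum was computed over a single downloaded file, and the path `hotpotqa.zip` is a hypothetical placeholder, since the page does not say which file the checksum covers.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_md5):
    """Compare a file's MD5 with a listed value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()


# MD5 listed for `hotpotqa` in the dataset list.
HOTPOTQA_MD5 = "74467760fff8bf8fbdadd5094bf9dd7b"

if __name__ == "__main__":
    # Hypothetical local path; which file the checksum covers is an assumption.
    archive = Path("hotpotqa.zip")
    if archive.exists():
        print("checksum ok:", matches_listed_md5(archive, HOTPOTQA_MD5))
    else:
        print(f"{archive} not found; nothing to verify")
```

Reading in chunks keeps memory use flat even for the larger corpora in the list.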

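Dataset structure

.
├── corpus                               # documents
│   ├── human.jsonl                      # human-written corpus
│   └── llama-2-7b-chat-tmp0.2.jsonl     # LLM-generated corpus
├── qrels
│   └── test.tsv                         # relevance judgments for queries
└── queries.jsonl                        # queries

Each dataset must contain a human-written corpus, an LLM-generated corpus, queries, and relevance judgments (qrels). The formats are as follows:

corpus: a .jsonl file containing a list of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage).
queries file: a .jsonl file containing a list of dictionaries, each with two fields: _id (unique query identifier) and text (query text).
qrels file: a .tsv file with three columns: query-id, corpus-id, and score. The first row is a header.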
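The three file formats above can be read with the Python standard library alone. This is a minimal sketch under stated assumptions: the function names `load_jsonl` and `load_qrels` are illustrative and not part of any Cocktail API, and scores are assumed to be integers, which matches the binary and 3-level relevance labels in the dataset list.

```python
import csv
import json
from collections import defaultdict


def load_jsonl(path):
    """Read a .jsonl file (corpus or queries) into a dict keyed by `_id`."""
    records = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read a qrels .tsv file into {query_id: {corpus_id: score}}."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # the first row is the header
        for row in reader:
            if len(row) < 3:
                continue  # skip malformed or empty rows
            query_id, corpus_id, score = row[0], row[1], row[2]
            qrels[query_id][corpus_id] = int(score)  # assumes integer scores
    return dict(qrels)
```

With the layout shown in the directory tree, usage would look like `load_jsonl("corpus/human.jsonl")`, `load_jsonl("queries.jsonl")`, and `load_qrels("qrels/test.tsv")`. Keying by `_id` makes it straightforward to look up the human-written and LLM-generated versions of a document side by side, assuming both corpora share identifiers.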
