IR-Cocktail/dl20
Source: Hugging Face. Updated: 2024-05-22. Indexed: 2024-06-12.
Download link: https://hf-mirror.com/datasets/IR-Cocktail/dl20
## Data Description
- **Homepage:** https://github.com/KID-22/Cocktail
- **Repository:** https://github.com/KID-22/Cocktail
- **Paper:** [Needs More Information]
## Dataset Summary
The 16 datasets benchmarked in Cocktail are listed in the following table.
| Dataset | Raw Website | Cocktail Website | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | # Test Query | # Corpus |
| ------------- | ------------------------------------------------------------ | ------------------ | ---------------------------------- | ----------- | --------- | ------------ | -------- |-------- |
| MS MARCO | [Homepage](https://microsoft.github.io/msmarco/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/msmarco) | `msmarco` | `985926f3e906fadf0dc6249f23ed850f` | Misc. | Binary | 6,979 | 542,203 |
| DL19 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl19) | `dl19` | `d652af47ec0e844af43109c0acf50b74` | Misc. | Binary | 43 | 542,203 |
| DL20 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl20) | `dl20` | `3afc48141dce3405ede2b6b937c65036` | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/trec-covid) | `trec-covid` | `1e1e2264b623d9cb7cb50df8141bd535` | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nfcorpus) | `nfcorpus` | `695327760647984c5014d64b2fee8de0` | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq) | `nq` | `a10bfe33efdec54aafcc974ac989c338` | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | [Homepage](https://hotpotqa.github.io/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/hotpotqa) | `hotpotqa` | `74467760fff8bf8fbdadd5094bf9dd7b` | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fiqa) | `fiqa` | `4e1e688539b0622630fb6e65d39d26fa` | Finance | Binary | 648 | 57,450 |
| Touché-2020 | [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/webis-touche2020) | `webis-touche2020` | `d58ec465ccd567d8f75edb419b0faaed` | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/cqadupstack) | `cqadupstack` | `d48d963bc72689c765f381f04fc26f8b` | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity) | `dbpedia-entity` | `43292f4f1a1927e2e323a4a7fa165fc1` | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | [Homepage](https://allenai.org/data/scidocs) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scidocs) | `scidocs` | `4058c0915594ab34e9b2b67f885c595f` | Scientific | Binary | 1,000 | 25,259 |
| FEVER | [Homepage](http://fever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fever) | `fever` | `98b631887d8c38772463e9633c477c69` | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | [Homepage](http://climatefever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/climate-fever) | `climate-fever` | `5734d6ac34f24f5da496b27e04ff991a` | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | [Homepage](https://github.com/allenai/scifact) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scifact) | `scifact` | `b5b8e24ccad98c9ca959061af14bf833` | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | [Homepage](https://anonymous.4open.science/r/Cocktail-BA4B/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq-utd) | `nq-utd` | `2e12e66393829cd4be715718f99d2436` | Misc. | 3-level | 80 | 800 |
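The `md5 for Processed Data` column can be used to check a download for corruption. The following is a minimal sketch under assumptions: the table does not say which file each checksum covers, so the path passed in (for example, a downloaded archive of the processed data) is hypothetical, as are the function names `md5_of_file` and `matches_checksum`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 against an expected value copied from the table."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For `dl20`, a call might look like `matches_checksum("dl20.zip", "3afc48141dce3405ede2b6b937c65036")`, where `dl20.zip` is a placeholder for whichever file the checksum was actually computed on.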
## Dataset Structure
```shell
.
├── corpus # * documents
│ ├── human.jsonl # * human-written corpus
│ └── llama-2-7b-chat-tmp0.2.jsonl # * llm-generated corpus
├── qrels
│ └── test.tsv # * relevance for queries
└── queries.jsonl                        # * queries
```
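The layout above can be checked before running any experiments. This is a minimal sketch, assuming the dataset has been downloaded to a local directory; the helper name `missing_files` and the `REQUIRED_FILES` constant are illustrative and not part of the Cocktail code.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
REQUIRED_FILES = (
    "corpus/human.jsonl",
    "corpus/llama-2-7b-chat-tmp0.2.jsonl",
    "qrels/test.tsv",
    "queries.jsonl",
)


def missing_files(dataset_dir):
    """Return the required relative paths that are absent under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list means every expected file is present; otherwise the list names what still has to be downloaded.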
Every Cocktail dataset must contain a human-written corpus, an LLM-generated corpus, queries, and qrels.
They must be in the following format:
- `corpus`: a `.jsonl` (JSON Lines) file that contains a list of dictionaries, each with three fields: `_id` (unique document identifier), `title` (document title, optional), and `text` (document paragraph or passage). For example: `{"_id": "doc1", "title": "title", "text": "text"}`
- `queries` file: a `.jsonl` (JSON Lines) file that contains a list of dictionaries, each with two fields: `_id` (unique query identifier) and `text` (query text). For example: `{"_id": "q1", "text": "q1_text"}`
- `qrels` file: a `.tsv` (tab-separated) file that contains three columns, `query-id`, `corpus-id`, and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`
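The three file formats above can be read with the Python standard library alone. The sketch below is illustrative rather than the official Cocktail loader: the function names `load_jsonl` and `load_qrels` are hypothetical, and it assumes that `score` values are integers and that the first row of the qrels file is a header, as described above.

```python
import csv
import json


def load_jsonl(path):
    """Read a .jsonl file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read a tab-separated qrels file into {query_id: {corpus_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) < 3:
                continue  # ignore malformed or empty rows
            query_id, corpus_id, score = row[0], row[1], row[2]
            # Assumption: relevance scores are integers.
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels
```

Both corpus files (`human.jsonl` and `llama-2-7b-chat-tmp0.2.jsonl`) and `queries.jsonl` can go through `load_jsonl`, since they share the `_id`-keyed record shape.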
Cite as:
```
@article{cocktail,
title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration},
author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong},
journal={Findings of the Association for Computational Linguistics: ACL 2024},
year={2024}
}
@article{dai2024neural,
title={Neural Retrievers are Biased Towards LLM-Generated Content},
author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun},
journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
year={2024}
}
```
Provider: IR-Cocktail
## Background Overview
IR-Cocktail/dl20 is an information retrieval benchmark dataset that mixes human-written and LLM-generated documents and uses binary relevance judgments. It includes 54 test queries and a corpus of 542,203 documents. The dataset currently has generation errors caused by column mismatches between its data files.
The summary above was collected and generated by 遇见数据集.



