
IR-Cocktail/dl20

Source: Hugging Face. Updated 2024-05-22; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/IR-Cocktail/dl20
Resource description:
## Data Description

- **Homepage:** https://github.com/KID-22/Cocktail
- **Repository:** https://github.com/KID-22/Cocktail
- **Paper:** [Needs More Information]

## Dataset Summary

All 16 datasets benchmarked in Cocktail are listed in the following table.

| Dataset | Raw Website | Cocktail Website | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | # Test Query | # Corpus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MS MARCO | [Homepage](https://microsoft.github.io/msmarco/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/msmarco) | `msmarco` | `985926f3e906fadf0dc6249f23ed850f` | Misc. | Binary | 6,979 | 542,203 |
| DL19 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl19) | `dl19` | `d652af47ec0e844af43109c0acf50b74` | Misc. | Binary | 43 | 542,203 |
| DL20 | [Homepage](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dl20) | `dl20` | `3afc48141dce3405ede2b6b937c65036` | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | [Homepage](https://ir.nist.gov/covidSubmit/index.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/trec-covid) | `trec-covid` | `1e1e2264b623d9cb7cb50df8141bd535` | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | [Homepage](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nfcorpus) | `nfcorpus` | `695327760647984c5014d64b2fee8de0` | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | [Homepage](https://ai.google.com/research/NaturalQuestions) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq) | `nq` | `a10bfe33efdec54aafcc974ac989c338` | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | [Homepage](https://hotpotqa.github.io/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/hotpotqa) | `hotpotqa` | `74467760fff8bf8fbdadd5094bf9dd7b` | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | [Homepage](https://sites.google.com/view/fiqa/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fiqa) | `fiqa` | `4e1e688539b0622630fb6e65d39d26fa` | Finance | Binary | 648 | 57,450 |
| Touché-2020 | [Homepage](https://webis.de/events/touche-20/shared-task-1.html) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/webis-touche2020) | `webis-touche2020` | `d58ec465ccd567d8f75edb419b0faaed` | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | [Homepage](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dcqadupstackl19) | `cqadupstack` | `d48d963bc72689c765f381f04fc26f8b` | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | [Homepage](https://github.com/iai-group/DBpedia-Entity/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity) | `dbpedia-entity` | `43292f4f1a1927e2e323a4a7fa165fc1` | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | [Homepage](https://allenai.org/data/scidocs) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scidocs) | `scidocs` | `4058c0915594ab34e9b2b67f885c595f` | Scientific | Binary | 1,000 | 25,259 |
| FEVER | [Homepage](http://fever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/fever) | `fever` | `98b631887d8c38772463e9633c477c69` | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | [Homepage](http://climatefever.ai/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/climate-fever) | `climate-fever` | `5734d6ac34f24f5da496b27e04ff991a` | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | [Homepage](https://github.com/allenai/scifact) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/scifact) | `scifact` | `b5b8e24ccad98c9ca959061af14bf833` | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | [Homepage](https://anonymous.4open.science/r/Cocktail-BA4B/) | [Homepage](https://huggingface.co/datasets/IR-Cocktail/nq-utd) | `nq-utd` | `2e12e66393829cd4be715718f99d2436` | Misc. | 3-level | 80 | 800 |

## Dataset Structure

```shell
.
├── corpus                               # * documents
│   ├── human.jsonl                      # * human-written corpus
│   └── llama-2-7b-chat-tmp0.2.jsonl     # * llm-generated corpus
├── qrels
│   └── test.tsv                         # * relevance for queries
└── queries.jsonl                        # * queries
```

All Cocktail datasets must contain a human-written corpus, an LLM-generated corpus, queries and qrels. They must be in the following format:

- `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with the unique document identifier, `title` with the document title (optional) and `text` with the document paragraph or passage. For example: `{"_id": "doc1", "title": "title", "text": "text"}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with the unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "q1_text"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the first row as the header. For example: `q1 doc1 1`

Cite as:

```
@article{cocktail,
  title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration},
  author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong},
  journal={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}

@article{dai2024neural,
  title={Neural Retrievers are Biased Towards LLM-Generated Content},
  author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun},
  journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2024}
}
```
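The corpus, queries and qrels formats described above can be read with the Python standard library alone. The following is a minimal sketch, not the Cocktail authors' loader: the function names `load_jsonl`, `load_qrels` and `load_cocktail_dir` are hypothetical, and it assumes the dataset has already been downloaded to a local directory laid out like the tree above and that relevance scores are integers.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            records[row["_id"]] = row
    return records


def load_qrels(path):
    """Read a tab-separated qrels file into {query_id: {corpus_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # the first row is the header
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            query_id, corpus_id, score = row
            # Assumption: scores are integers (binary or 3-level labels).
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels


def load_cocktail_dir(root, llm_file="llama-2-7b-chat-tmp0.2.jsonl"):
    """Load both corpora, the queries and the test qrels from `root`."""
    root = Path(root)
    return {
        "human_corpus": load_jsonl(root / "corpus" / "human.jsonl"),
        "llm_corpus": load_jsonl(root / "corpus" / llm_file),
        "queries": load_jsonl(root / "queries.jsonl"),
        "qrels": load_qrels(root / "qrels" / "test.tsv"),
    }
```

Calling `load_cocktail_dir("dl20")`, where `dl20` is an assumed local path, would return a dictionary with the two corpora, the queries and the qrels. Holding a full corpus in memory may be impractical for the larger datasets, in which case streaming the file line by line is the safer choice.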

Provider:
IR-Cocktail
Background overview
IR-Cocktail/dl20 is a benchmark dataset for information retrieval, featuring a mix of human and LLM-generated documents, with binary relevance judgments. It includes 54 test queries and a corpus of 542,203 documents, but currently has generation errors due to data file column mismatches.
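The source does not say which files or columns are involved in the mismatch. As a rough diagnostic, and assuming the mismatch means that records do not all share the same set of field names, the sketch below counts the distinct field sets in a JSON Lines file. The function name `field_sets` and its `limit` parameter are hypothetical.

```python
import json
from collections import Counter


def field_sets(path, limit=None):
    """Count how many records use each distinct set of field names.

    Only the first `limit` lines are inspected when `limit` is given.
    Assumes every non-blank line is a JSON object.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            counts[frozenset(json.loads(line))] += 1
    return counts
```

A file whose result has more than one key mixes record shapes. Comparing `set(field_sets(a))` with `set(field_sets(b))` for two files shows whether they agree on their fields; the corpus and queries files are expected to differ, since queries carry no `title`.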