IR-Cocktail/cqadupstack
Source: Hugging Face | Updated: 2024-05-22 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/IR-Cocktail/cqadupstack
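Overview:
Cocktail is a comprehensive information retrieval benchmark that incorporates LLM-generated documents. It comprises 16 benchmark datasets spanning multiple domains, including biomedicine, Wikipedia, finance, and science. Each dataset contains a human-written corpus, an LLM-generated corpus, queries, and relevance judgments (qrels), and all of these files must follow a specific format. A dataset consists of a corpus directory (holding the human-written and LLM-generated corpora), a qrels directory (holding the relevance file for the test queries), and a queries.jsonl file (holding the queries).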
Provider:
IR-Cocktail
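Summary of original information
Dataset overview
Dataset list
| Dataset | Original website | Cocktail website | Cocktail name | MD5 of processed data | Domain | Relevance levels | # Test queries | Corpus size |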
|---|---|---|---|---|---|---|---|---|
| MS MARCO | https://microsoft.github.io/msmarco/ | https://huggingface.co/datasets/IR-Cocktail/msmarco | msmarco | 985926f3e906fadf0dc6249f23ed850f | Misc. | Binary | 6,979 | 542,203 |
| DL19 | https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019 | https://huggingface.co/datasets/IR-Cocktail/dl19 | dl19 | d652af47ec0e844af43109c0acf50b74 | Misc. | Binary | 43 | 542,203 |
| DL20 | https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020 | https://huggingface.co/datasets/IR-Cocktail/dl20 | dl20 | 3afc48141dce3405ede2b6b937c65036 | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | https://ir.nist.gov/covidSubmit/index.html | https://huggingface.co/datasets/IR-Cocktail/trec-covid | trec-covid | 1e1e2264b623d9cb7cb50df8141bd535 | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/ | https://huggingface.co/datasets/IR-Cocktail/nfcorpus | nfcorpus | 695327760647984c5014d64b2fee8de0 | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | https://ai.google.com/research/NaturalQuestions | https://huggingface.co/datasets/IR-Cocktail/nq | nq | a10bfe33efdec54aafcc974ac989c338 | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | https://hotpotqa.github.io/ | https://huggingface.co/datasets/IR-Cocktail/hotpotqa | hotpotqa | 74467760fff8bf8fbdadd5094bf9dd7b | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | https://sites.google.com/view/fiqa/ | https://huggingface.co/datasets/IR-Cocktail/fiqa | fiqa | 4e1e688539b0622630fb6e65d39d26fa | Finance | Binary | 648 | 57,450 |
| Touché-2020 | https://webis.de/events/touche-20/shared-task-1.html | https://huggingface.co/datasets/IR-Cocktail/webis-touche2020 | webis-touche2020 | d58ec465ccd567d8f75edb419b0faaed | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | http://nlp.cis.unimelb.edu.au/resources/cqadupstack/ | https://huggingface.co/datasets/IR-Cocktail/cqadupstack | cqadupstack | d48d963bc72689c765f381f04fc26f8b | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | https://github.com/iai-group/DBpedia-Entity/ | https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity | dbpedia-entity | 43292f4f1a1927e2e323a4a7fa165fc1 | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | https://allenai.org/data/scidocs | https://huggingface.co/datasets/IR-Cocktail/scidocs | scidocs | 4058c0915594ab34e9b2b67f885c595f | Scientific | Binary | 1,000 | 25,259 |
| FEVER | http://fever.ai/ | https://huggingface.co/datasets/IR-Cocktail/fever | fever | 98b631887d8c38772463e9633c477c69 | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | http://climatefever.ai/ | https://huggingface.co/datasets/IR-Cocktail/climate-fever | climate-fever | 5734d6ac34f24f5da496b27e04ff991a | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | https://github.com/allenai/scifact | https://huggingface.co/datasets/IR-Cocktail/scifact | scifact | b5b8e24ccad98c9ca959061af14bf833 | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | https://anonymous.4open.science/r/Cocktail-BA4B/ | https://huggingface.co/datasets/IR-Cocktail/nq-utd | nq-utd | 2e12e66393829cd4be715718f99d2436 | Misc. | 3-level | 80 | 800 |
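
The MD5 column above can be used to check a download for corruption. The following is a minimal sketch, not an official tool: the table does not say which file the checksum covers, so the path `cqadupstack.zip` in the usage note is a hypothetical placeholder, and the helper names `md5_of_file` and `matches_expected` are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex digest, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_expected("cqadupstack.zip", "d48d963bc72689c765f381f04fc26f8b")` would compare a local archive (assumed name) against the CQADupStack row. A mismatch may simply mean the checksum refers to a different file or packaging than the one downloaded.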
Dataset structure

    .
    ├── corpus                              # Document collections
    │   ├── human.jsonl                     # Human-written corpus
    │   └── llama-2-7b-chat-tmp0.2.jsonl    # LLM-generated corpus
    ├── qrels
    │   └── test.tsv                        # Relevance judgments for the test queries
    └── queries.jsonl                       # Query collection

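Each dataset must contain a human-written corpus, an LLM-generated corpus, queries, and relevance judgments. The required formats are:
- corpus: a .jsonl file containing one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional), and text (the document passage or text).
- queries: a .jsonl file containing one dictionary per line, each with two fields: _id (unique query identifier) and text (the query text).
- qrels: a .tsv file with three columns: query-id, corpus-id, and score. The first line is a header row.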
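
The format described above can be read with the Python standard library alone. The following is a minimal sketch under stated assumptions: the function names (`read_jsonl`, `load_corpus`, `load_queries`, `load_qrels`, `load_dataset`) are illustrative and not part of any Cocktail tooling, scores are assumed to be integers, and the default LLM corpus file name is the one shown in the directory layout.

```python
import csv
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a .jsonl file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_corpus(path):
    """Map each document _id to its title and text; title is optional."""
    return {
        rec["_id"]: {"title": rec.get("title", ""), "text": rec["text"]}
        for rec in read_jsonl(path)
    }


def load_queries(path):
    """Map each query _id to its query text."""
    return {rec["_id"]: rec["text"] for rec in read_jsonl(path)}


def load_qrels(path):
    """Map query-id -> {corpus-id: score}; the first line is a header and is skipped."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # header row: query-id, corpus-id, score
        for row in reader:
            if len(row) < 3:
                continue  # tolerate blank or truncated lines
            query_id, corpus_id, score = row[0], row[1], int(row[2])  # assumes integer scores
            qrels.setdefault(query_id, {})[corpus_id] = score
    return qrels


def load_dataset(root, llm_corpus="llama-2-7b-chat-tmp0.2.jsonl"):
    """Load the four required components from a dataset directory."""
    root = Path(root)
    return {
        "human": load_corpus(root / "corpus" / "human.jsonl"),
        "llm": load_corpus(root / "corpus" / llm_corpus),
        "queries": load_queries(root / "queries.jsonl"),
        "qrels": load_qrels(root / "qrels" / "test.tsv"),
    }
```

Keying the corpus and queries by `_id` makes it straightforward to check that every `query-id` and `corpus-id` in the qrels resolves to a loaded record.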



