IR-Cocktail/cqadupstack

Name: IR-Cocktail/cqadupstack
Creator: IR-Cocktail
Published: 2024-05-22 15:19:45
License: not specified

Source: Hugging Face. Updated 2024-05-22. Indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/IR-Cocktail/cqadupstack


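Overview:

Cocktail is a comprehensive information retrieval benchmark that incorporates LLM-generated documents. It comprises 16 benchmark datasets spanning domains such as biomedicine, Wikipedia, finance, and science. Each dataset contains a human-written corpus, an LLM-generated corpus, queries, and a relevance (qrels) file, all of which must follow a specific format. Each dataset is laid out as a corpus directory (holding the human-written and LLM-generated corpora), a qrels directory (holding the relevance file for the test queries), and a queries.jsonl file (holding the queries).

Provider:

IR-Cocktail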

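Summary of original information

Dataset overview

Dataset list

Dataset	Original website	Cocktail website	Cocktail name	md5 of processed data	Domain	Relevance levels	Test queries	Corpus size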
MS MARCO	https://microsoft.github.io/msmarco/	https://huggingface.co/datasets/IR-Cocktail/msmarco	msmarco	985926f3e906fadf0dc6249f23ed850f	Misc.	Binary	6,979	542,203
DL19	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019	https://huggingface.co/datasets/IR-Cocktail/dl19	dl19	d652af47ec0e844af43109c0acf50b74	Misc.	Binary	43	542,203
DL20	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020	https://huggingface.co/datasets/IR-Cocktail/dl20	dl20	3afc48141dce3405ede2b6b937c65036	Misc.	Binary	54	542,203
TREC-COVID	https://ir.nist.gov/covidSubmit/index.html	https://huggingface.co/datasets/IR-Cocktail/trec-covid	trec-covid	1e1e2264b623d9cb7cb50df8141bd535	Bio-Medical	3-level	50	128,585
NFCorpus	https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/	https://huggingface.co/datasets/IR-Cocktail/nfcorpus	nfcorpus	695327760647984c5014d64b2fee8de0	Bio-Medical	3-level	323	3,633
NQ	https://ai.google.com/research/NaturalQuestions	https://huggingface.co/datasets/IR-Cocktail/nq	nq	a10bfe33efdec54aafcc974ac989c338	Wikipedia	Binary	3,446	104,194
HotpotQA	https://hotpotqa.github.io/	https://huggingface.co/datasets/IR-Cocktail/hotpotqa	hotpotqa	74467760fff8bf8fbdadd5094bf9dd7b	Wikipedia	Binary	7,405	111,107
FiQA-2018	https://sites.google.com/view/fiqa/	https://huggingface.co/datasets/IR-Cocktail/fiqa	fiqa	4e1e688539b0622630fb6e65d39d26fa	Finance	Binary	648	57,450
Touché-2020	https://webis.de/events/touche-20/shared-task-1.html	https://huggingface.co/datasets/IR-Cocktail/webis-touche2020	webis-touche2020	d58ec465ccd567d8f75edb419b0faaed	Misc.	3-level	49	101,922
CQADupStack	http://nlp.cis.unimelb.edu.au/resources/cqadupstack/	https://huggingface.co/datasets/IR-Cocktail/cqadupstack	cqadupstack	d48d963bc72689c765f381f04fc26f8b	StackEx.	Binary	1,563	39,962
DBPedia	https://github.com/iai-group/DBpedia-Entity/	https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity	dbpedia-entity	43292f4f1a1927e2e323a4a7fa165fc1	Wikipedia	3-level	400	145,037
SCIDOCS	https://allenai.org/data/scidocs	https://huggingface.co/datasets/IR-Cocktail/scidocs	scidocs	4058c0915594ab34e9b2b67f885c595f	Scientific	Binary	1,000	25,259
FEVER	http://fever.ai/	https://huggingface.co/datasets/IR-Cocktail/fever	fever	98b631887d8c38772463e9633c477c69	Wikipedia	Binary	6,666	114,529
Climate-FEVER	http://climatefever.ai/	https://huggingface.co/datasets/IR-Cocktail/climate-fever	climate-fever	5734d6ac34f24f5da496b27e04ff991a	Wikipedia	Binary	1,535	101,339
SciFact	https://github.com/allenai/scifact	https://huggingface.co/datasets/IR-Cocktail/scifact	scifact	b5b8e24ccad98c9ca959061af14bf833	Scientific	Binary	300	5,183
NQ-UTD	https://anonymous.4open.science/r/Cocktail-BA4B/	https://huggingface.co/datasets/IR-Cocktail/nq-utd	nq-utd	2e12e66393829cd4be715718f99d2436	Misc.	3-level	80	800
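
The md5 column can be used to check a download. The following is a minimal sketch, assuming the checksum is computed over a single downloaded file; the listing does not say which file or archive it covers, so that is an assumption. The names `md5_of_file` and `matches_expected` are hypothetical helpers, not part of any Cocktail tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large corpora do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Value copied from the cqadupstack row of the table above.
EXPECTED_MD5 = "d48d963bc72689c765f381f04fc26f8b"


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's md5 with the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected("path/to/download")` returns `True` only when the digest equals the table value; the path here is a placeholder.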

Dataset structure

.
├── corpus                               # document collections
│   ├── human.jsonl                      # human-written corpus
│   └── llama-2-7b-chat-tmp0.2.jsonl     # LLM-generated corpus
├── qrels
│   └── test.tsv                         # relevance scores for the queries
└── queries.jsonl                        # query collection
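
A quick way to confirm that a local copy follows this layout is to check for the four files. This is a sketch only: the file names are taken from the tree above, and the LLM corpus name in particular may differ for other generators. `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
# Assumption: the LLM-generated corpus uses exactly this file name.
REQUIRED_FILES = (
    "corpus/human.jsonl",
    "corpus/llama-2-7b-chat-tmp0.2.jsonl",
    "qrels/test.tsv",
    "queries.jsonl",
)


def missing_files(root):
    """Return the required relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files("path/to/cqadupstack")` means every expected file is present; the path is a placeholder.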

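A dataset must contain a human-written corpus, an LLM-generated corpus, queries, and relevance scores (qrels). The format requirements are as follows:

corpus: a .jsonl file containing one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage).
queries file: a .jsonl file containing one dictionary per line, each with two fields: _id (unique query identifier) and text (query text).
qrels file: a .tsv file with three columns: query-id, corpus-id, and score. The first row is the header.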
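
The three formats above can be read with the Python standard library. The sketch below assumes UTF-8 files and integer scores; `load_jsonl` and `load_qrels` are hypothetical helper names, not functions shipped with the dataset.

```python
import csv
import json


def load_jsonl(path):
    """Read a corpus or queries .jsonl file into a dict keyed by _id."""
    records = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read a qrels .tsv file into {query-id: {corpus-id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # the first row is the header
        for row in reader:
            if len(row) < 3:
                continue  # skip malformed or empty rows
            query_id, corpus_id, score = row[0], row[1], int(row[2])
            qrels.setdefault(query_id, {})[corpus_id] = score
    return qrels
```

For example, `load_jsonl("corpus/human.jsonl")` and `load_jsonl("queries.jsonl")` give lookups by identifier, and `load_qrels("qrels/test.tsv")` gives the judged documents per query. Since title is optional in the corpus, use `record.get("title", "")` rather than indexing it directly.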
