IR-Cocktail/fever

Name: IR-Cocktail/fever
Creator: IR-Cocktail
Published: 2024-05-22 15:20:38
License: No description available

Source: Hugging Face. Updated 2024-05-22; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/IR-Cocktail/fever

Description:

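Cocktail is a comprehensive information retrieval benchmark comprising 16 benchmark datasets that span domains such as biomedicine, Wikipedia, finance, and science. Each dataset contains a human-written corpus, an LLM-generated corpus, queries, and relevance judgments. Corpora and queries are stored as `.jsonl` files, and relevance judgments as `.tsv` files. The datasets are intended for evaluating how information retrieval systems perform on different types of documents.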

Provided by:

IR-Cocktail

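Summary of original information

Dataset overview

The benchmark contains 16 datasets, each covering a particular domain and using its own relevance grading scheme. Details for each dataset are listed below: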

Dataset	Original website	Dataset website	Dataset name	MD5 of processed data	Domain	Relevance	Test queries	Corpus size
MS MARCO	https://microsoft.github.io/msmarco/	https://huggingface.co/datasets/IR-Cocktail/msmarco	msmarco	985926f3e906fadf0dc6249f23ed850f	Misc.	Binary	6,979	542,203
DL19	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019	https://huggingface.co/datasets/IR-Cocktail/dl19	dl19	d652af47ec0e844af43109c0acf50b74	Misc.	Binary	43	542,203
DL20	https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020	https://huggingface.co/datasets/IR-Cocktail/dl20	dl20	3afc48141dce3405ede2b6b937c65036	Misc.	Binary	54	542,203
TREC-COVID	https://ir.nist.gov/covidSubmit/index.html	https://huggingface.co/datasets/IR-Cocktail/trec-covid	trec-covid	1e1e2264b623d9cb7cb50df8141bd535	Biomedical	3-level	50	128,585
NFCorpus	https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/	https://huggingface.co/datasets/IR-Cocktail/nfcorpus	nfcorpus	695327760647984c5014d64b2fee8de0	Biomedical	3-level	323	3,633
NQ	https://ai.google.com/research/NaturalQuestions	https://huggingface.co/datasets/IR-Cocktail/nq	nq	a10bfe33efdec54aafcc974ac989c338	Wikipedia	Binary	3,446	104,194
HotpotQA	https://hotpotqa.github.io/	https://huggingface.co/datasets/IR-Cocktail/hotpotqa	hotpotqa	74467760fff8bf8fbdadd5094bf9dd7b	Wikipedia	Binary	7,405	111,107
FiQA-2018	https://sites.google.com/view/fiqa/	https://huggingface.co/datasets/IR-Cocktail/fiqa	fiqa	4e1e688539b0622630fb6e65d39d26fa	Finance	Binary	648	57,450
Touché-2020	https://webis.de/events/touche-20/shared-task-1.html	https://huggingface.co/datasets/IR-Cocktail/webis-touche2020	webis-touche2020	d58ec465ccd567d8f75edb419b0faaed	Misc.	3-level	49	101,922
CQADupStack	http://nlp.cis.unimelb.edu.au/resources/cqadupstack/	https://huggingface.co/datasets/IR-Cocktail/dcqadupstackl19	cqadupstack	d48d963bc72689c765f381f04fc26f8b	StackExchange	Binary	1,563	39,962
DBPedia	https://github.com/iai-group/DBpedia-Entity/	https://huggingface.co/datasets/IR-Cocktail/dbpedia-entity	dbpedia-entity	43292f4f1a1927e2e323a4a7fa165fc1	Wikipedia	3-level	400	145,037
SCIDOCS	https://allenai.org/data/scidocs	https://huggingface.co/datasets/IR-Cocktail/scidocs	scidocs	4058c0915594ab34e9b2b67f885c595f	Scientific	Binary	1,000	25,259
FEVER	http://fever.ai/	https://huggingface.co/datasets/IR-Cocktail/fever	fever	98b631887d8c38772463e9633c477c69	Wikipedia	Binary	6,666	114,529
Climate-FEVER	http://climatefever.ai/	https://huggingface.co/datasets/IR-Cocktail/climate-fever	climate-fever	5734d6ac34f24f5da496b27e04ff991a	Wikipedia	Binary	1,535	101,339
SciFact	https://github.com/allenai/scifact	https://huggingface.co/datasets/IR-Cocktail/scifact	scifact	b5b8e24ccad98c9ca959061af14bf833	Scientific	Binary	300	5,183
NQ-UTD	https://anonymous.4open.science/r/Cocktail-BA4B/	https://huggingface.co/datasets/IR-Cocktail/nq-utd	nq-utd	2e12e66393829cd4be715718f99d2436	Misc.	3-level	80	800
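
The MD5 column can be used to check a download for corruption. The sketch below streams a file through Python's standard `hashlib` and compares the digest with a value from the table. The listing does not say which file or archive each checksum covers, so the path passed in is an assumption, and `md5_of_file` and `matches_expected` are hypothetical helper names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large corpora never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the FEVER row, a call might look like `matches_expected("fever.zip", "98b631887d8c38772463e9633c477c69")`, where `fever.zip` is a placeholder for whatever file the checksum actually refers to.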

Dataset structure

All datasets follow the same directory structure:

.
├── corpus                             # document collections
│   ├── human.jsonl                    # human-written corpus
│   └── llama-2-7b-chat-tmp0.2.jsonl   # LLM-generated corpus
├── qrels
│   └── test.tsv                       # relevance judgments for the queries
└── queries.jsonl                      # query collection

The file formats are as follows:

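corpus files: .jsonl files in which each line is a dictionary with three fields: _id (unique document identifier), title (document title, optional), and text (document passage or text).
queries file: a .jsonl file in which each line is a dictionary with two fields: _id (unique query identifier) and text (query text).
qrels file: a .tsv file with three columns, query-id, corpus-id, and score; the first row is a header.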
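
The three formats above can be read with the Python standard library alone. The following is a minimal sketch, assuming the directory layout shown earlier has been downloaded to a local folder. The function names (`load_jsonl`, `load_qrels`, `load_dataset`) are hypothetical and not part of any official Cocktail tooling, and the sketch assumes that scores are integers.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read a qrels .tsv file into {query_id: {corpus_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        # DictReader consumes the header row (query-id, corpus-id, score).
        for row in csv.DictReader(f, delimiter="\t"):
            # Assumption: scores are integer relevance grades.
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


def load_dataset(root, corpus_name="human.jsonl"):
    """Load one corpus variant plus the queries and test qrels."""
    root = Path(root)
    corpus = load_jsonl(root / "corpus" / corpus_name)
    queries = load_jsonl(root / "queries.jsonl")
    qrels = load_qrels(root / "qrels" / "test.tsv")
    return corpus, queries, qrels
```

Passing `corpus_name="llama-2-7b-chat-tmp0.2.jsonl"` loads the LLM-generated corpus instead of the human-written one, so the same queries and qrels can be evaluated against either variant.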
