IR-Cocktail/nfcorpus
Dataset Overview
Dataset List
| Dataset | Original Website | Cocktail Website | Cocktail Name | MD5 of Processed Data | Domain | Relevance Levels | # Test Queries | Corpus Size |
|---|---|---|---|---|---|---|---|---|
| MS MARCO | Homepage | Homepage | msmarco | 985926f3e906fadf0dc6249f23ed850f | Misc. | Binary | 6,979 | 542,203 |
| DL19 | Homepage | Homepage | dl19 | d652af47ec0e844af43109c0acf50b74 | Misc. | Binary | 43 | 542,203 |
| DL20 | Homepage | Homepage | dl20 | 3afc48141dce3405ede2b6b937c65036 | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | Homepage | Homepage | trec-covid | 1e1e2264b623d9cb7cb50df8141bd535 | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | Homepage | Homepage | nfcorpus | 695327760647984c5014d64b2fee8de0 | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | Homepage | Homepage | nq | a10bfe33efdec54aafcc974ac989c338 | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | Homepage | Homepage | hotpotqa | 74467760fff8bf8fbdadd5094bf9dd7b | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | Homepage | Homepage | fiqa | 4e1e688539b0622630fb6e65d39d26fa | Finance | Binary | 648 | 57,450 |
| Touché-2020 | Homepage | Homepage | webis-touche2020 | d58ec465ccd567d8f75edb419b0faaed | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | Homepage | Homepage | cqadupstack | d48d963bc72689c765f381f04fc26f8b | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | Homepage | Homepage | dbpedia-entity | 43292f4f1a1927e2e323a4a7fa165fc1 | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | Homepage | Homepage | scidocs | 4058c0915594ab34e9b2b67f885c595f | Scientific | Binary | 1,000 | 25,259 |
| FEVER | Homepage | Homepage | fever | 98b631887d8c38772463e9633c477c69 | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | Homepage | Homepage | climate-fever | 5734d6ac34f24f5da496b27e04ff991a | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | Homepage | Homepage | scifact | b5b8e24ccad98c9ca959061af14bf833 | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | Homepage | Homepage | nq-utd | 2e12e66393829cd4be715718f99d2436 | Misc. | 3-level | 80 | 800 |
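
The MD5 values listed in the table can be used to check a download for corruption. The following is a minimal sketch, not an official tool: it assumes each checksum refers to a single downloaded file (for example an archive), which the table does not state explicitly, and the path in the usage note is a hypothetical placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large corpora do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_expected("nfcorpus.zip", "695327760647984c5014d64b2fee8de0")` would return `True` only if the hypothetical local file `nfcorpus.zip` is the exact file that checksum was computed over.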
Dataset Structure
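- corpus: contains the human-written and LLM-generated documents in `.jsonl` format; each document has the fields `_id`, `title`, and `text`.
- queries: contains the queries in `.jsonl` format; each query has the fields `_id` and `text`.
- qrels: contains the query-document relevance judgments in `.tsv` format, with the fields `query-id`, `corpus-id`, and `score`.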
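
The three parts described above can be read with the Python standard library alone. The sketch below is illustrative rather than an official loader: the function names are hypothetical, and it assumes the qrels `.tsv` file begins with a header row naming the `query-id`, `corpus-id`, and `score` columns and that `score` is an integer.

```python
import csv
import json
from collections import defaultdict


def load_jsonl(path):
    """Load a corpus or queries .jsonl file into a dict keyed by `_id`."""
    records = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Load qrels into {query_id: {corpus_id: score}}.

    Assumes a tab-separated file with a header row:
    query-id, corpus-id, score.
    """
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)
```

Keying everything by identifier makes it straightforward to look up the text of a judged document, for instance `corpus[doc_id]["text"]` for each `doc_id` in `qrels[query_id]`.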
Citation
@article{cocktail, title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration}, author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong}, journal={Findings of the Association for Computational Linguistics: ACL 2024}, year={2024} }
@article{dai2024neural, title={Neural Retrievers are Biased Towards LLM-Generated Content}, author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun}, journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, year={2024} }



