BeIR/scidocs-generated-queries

Name: BeIR/scidocs-generated-queries
Creator: BeIR
Published: 2022-10-23 06:12:52
License: not specified

Source: Hugging Face | Updated: 2022-10-23 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/BeIR/scidocs-generated-queries


Resource overview:

The BEIR Benchmark is a heterogeneous benchmark covering 9 information retrieval tasks drawn from 18 distinct datasets. The tasks span fact checking, question answering, biomedical information retrieval, news retrieval, argument retrieval, duplicate question retrieval, citation prediction, tweet retrieval, and entity retrieval. All datasets have been preprocessed and can be used directly in experiments. Supported task categories include text retrieval, zero-shot retrieval, information retrieval, and zero-shot information retrieval. Each dataset consists of a corpus, queries, and relevance judgment files, stored in JSONL and TSV formats. All tasks are in English.

Provided by:

BeIR

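Original information summary

BEIR Benchmark dataset overview

Dataset description

Dataset summary

BEIR is a heterogeneous benchmark built from 18 different datasets that cover 9 information retrieval tasks: fact checking, question answering, biomedical information retrieval, news retrieval, argument retrieval, duplicate question retrieval, citation prediction, tweet retrieval, and entity retrieval.

Supported tasks and leaderboards

The dataset supports task-specific leaderboards that evaluate models on metrics such as F1 or EM, as well as on their ability to retrieve supporting information from Wikipedia. The current best-performing models are listed on the leaderboard.

Languages

All tasks are in English (en).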

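Dataset structure

Data instances

A BEIR dataset has three main parts: corpus (the document collection), queries, and qrels (the relevance judgment file). Each part has its own format and content.

Data fields

corpus: a unique identifier, a title, and the text of each document.
queries: a unique identifier and the text of each query.
qrels: a query identifier, a document identifier, and a relevance score.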
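The three parts described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the official loader. It assumes the corpus and queries are JSONL and the qrels are a TSV file with a header row, and it assumes the field names `_id`, `title`, `text`, `query-id`, `corpus-id`, and `score`, which this page does not spell out. The sample records are invented for illustration and are not taken from the dataset.

```python
import csv
import io
import json

# Tiny in-memory stand-ins for a corpus JSONL file, a queries JSONL file,
# and a qrels TSV file. Field names and records are assumptions for illustration.
CORPUS_JSONL = """\
{"_id": "doc1", "title": "Neural ranking", "text": "A study of neural rankers."}
{"_id": "doc2", "title": "Citation graphs", "text": "Predicting citations between papers."}
"""
QUERIES_JSONL = """\
{"_id": "q1", "text": "citation prediction"}
"""
QRELS_TSV = "query-id\tcorpus-id\tscore\nq1\tdoc2\t1\nq1\tdoc1\t0\n"


def read_jsonl(stream):
    """Return a dict mapping each record's _id to its remaining fields."""
    records = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        records[row.pop("_id")] = row
    return records


def read_qrels(stream):
    """Return {query_id: {doc_id: score}} from a tab-separated file with a header row."""
    qrels = {}
    for row in csv.DictReader(stream, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


corpus = read_jsonl(io.StringIO(CORPUS_JSONL))
queries = read_jsonl(io.StringIO(QUERIES_JSONL))
qrels = read_qrels(io.StringIO(QRELS_TSV))

# Documents judged relevant (score > 0) for each query.
relevant = {
    qid: [doc_id for doc_id, score in docs.items() if score > 0]
    for qid, docs in qrels.items()
}
```

To read real files, replace the `io.StringIO(...)` objects with handles from `open(path, encoding="utf-8")`. Keying the corpus and queries by identifier makes it cheap to resolve the identifiers that appear in the qrels.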

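Data splits

Depending on the task, the datasets are split into training, development, and test sets. Dataset sizes and relevance-score scales differ from one dataset to another; see each dataset's detail page for specifics.

Dataset creation

Source data

The benchmark is assembled from multiple source datasets, each with its own task and domain.

Licensing information

The dataset is released under the cc-by-sa-4.0 license.

Citation information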

@inproceedings{thakur2021beir,
  title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021},
  url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
