ArtifactAI/arxiv-beir-500k-generated-queries

Name: ArtifactAI/arxiv-beir-500k-generated-queries
Creator: ArtifactAI
Published: 2023-06-21 13:56:49
License: No description available

Hugging Face, updated 2023-06-21, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ArtifactAI/arxiv-beir-500k-generated-queries


Resource description:

### Dataset Summary

A BEIR-style dataset derived from [ArXiv](https://arxiv.org/).

### Languages

All tasks are in English (`en`).

## Dataset Structure

The dataset contains a corpus, queries and qrels (relevance judgments file). They must be in the following format:

- `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with a unique document identifier, `title` with the document title (optional) and `text` with a document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with a unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`

### Data Instances

A high-level example of any BEIR dataset:

```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

# Each query id maps to its query text.
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Each query id maps to {document id: relevance score}.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```

### Data Fields

Examples from all configurations have the following features:

### Corpus

- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature denoting the title of the document.
  - `text`: a `string` feature denoting the text of the document.

### Queries

- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature denoting the text of the query.

### Qrels

- `qrels`: a `dict` feature representing the query-document relevance judgements, made up of:
  - `query-id`: a `string` feature representing the query id.
  - `corpus-id`: a `string` feature denoting the document id.
  - `score`: an `int32` feature denoting the relevance judgement between query and document.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Cite as:

```
@misc{arxiv-beir-500k-generated-queries,
  title={arxiv-beir-500k-generated-queries},
  author={Matthew Kenney},
  year={2023}
}
```
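
The in-memory dictionaries shown under Data Instances map directly onto the three files described under Dataset Structure. The following is a minimal sketch of that mapping, using only the Python standard library. The function name `write_beir_files` and the output file names `corpus.jsonl`, `queries.jsonl` and `qrels.tsv` are assumptions made for illustration; the card does not state the file names this dataset actually uses.

```python
import csv
import json
from pathlib import Path


def write_beir_files(corpus, queries, qrels, out_dir):
    """Write BEIR-style dicts to three files (hypothetical file names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # corpus: one JSON object per line with _id, title and text.
    with open(out / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # queries: one JSON object per line with _id and text.
    with open(out / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    # qrels: tab-separated, header row first, then one judgement per row.
    with open(out / "qrels.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])
```

Calling `write_beir_files(corpus, queries, qrels, "out")` with the example dictionaries would produce two corpus lines, two query lines, and a qrels file with a header plus two rows.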

Provider:

ArtifactAI

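Summary of original information

Dataset overview

This dataset is a BEIR-style collection derived from ArXiv. All tasks are in English (en).

Dataset structure

The dataset consists of three main parts:

corpus file: .jsonl format, containing a list of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional) and text (document paragraph or passage).
queries file: .jsonl format, containing a list of dictionaries, each with two fields: _id (unique query identifier) and text (query text).
qrels file: .tsv format, containing three columns: query-id, corpus-id and score, with the first row as the header.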
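
The three-file layout above can be read back with the standard library alone. What follows is a sketch under stated assumptions, not the loader used by the dataset authors: the helper names `load_jsonl` and `load_beir` are hypothetical, and the caller supplies the file paths because the listing does not name the files.

```python
import csv
import json


def load_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def load_beir(corpus_path, queries_path, qrels_path):
    """Return (corpus, queries, qrels) as plain dicts keyed by id."""
    # title is optional, so fall back to an empty string.
    corpus = {
        row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in load_jsonl(corpus_path)
    }
    queries = {row["_id"]: row["text"] for row in load_jsonl(queries_path)}

    qrels = {}
    with open(qrels_path, encoding="utf-8", newline="") as f:
        # DictReader consumes the first row as the header:
        # query-id, corpus-id, score.
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return corpus, queries, qrels
```

For a corpus of this size, holding everything in a dict may be memory-heavy; iterating over `load_jsonl` directly is the lighter option when only a single pass is needed.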

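Data instances

A dataset example includes:

corpus: document titles and text content.
queries: query text.
qrels: relevance judgements between queries and documents.

Data fields

corpus: contains _id (string, unique document ID), title (string, document title) and text (string, document text).
queries: contains _id (string, unique query ID) and text (string, query text).
qrels: contains query-id (string, query ID), corpus-id (string, document ID) and score (integer, relevance judgement).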
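
The field types listed above suggest a simple consistency check before using the data. The sketch below assumes the in-memory layout from the dataset card (corpus as id to title/text dict, queries as id to text, qrels as query id to a dict of document id to score); the function name `check_beir` is hypothetical and the checks are illustrative rather than exhaustive.

```python
def check_beir(corpus, queries, qrels):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []

    for doc_id, doc in corpus.items():
        if not isinstance(doc.get("text"), str):
            problems.append(f"corpus {doc_id!r}: text is missing or not a string")
        if not isinstance(doc.get("title", ""), str):
            problems.append(f"corpus {doc_id!r}: title is not a string")

    for query_id, text in queries.items():
        if not isinstance(text, str):
            problems.append(f"query {query_id!r}: text is not a string")

    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels: unknown query id {query_id!r}")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"qrels: unknown document id {doc_id!r}")
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(score, bool) or not isinstance(score, int):
                problems.append(f"qrels {query_id!r}/{doc_id!r}: score is not an integer")

    return problems
```

Returning a list instead of raising on the first error makes it possible to report every dangling id in one pass.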

Citation information

Cite as:

@misc{arxiv-beir-500k-generated-queries, title={arxiv-beir-500k-generated-queries}, author={Matthew Kenney}, year={2023} }
