
ArtifactAI/arxiv-beir-500k-generated-queries

Source: Hugging Face. Updated 2023-06-21. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ArtifactAI/arxiv-beir-500k-generated-queries
Resource description:
### Dataset Summary

A BEIR-style dataset derived from [ArXiv](https://arxiv.org/).

### Languages

All tasks are in English (`en`).

## Dataset Structure

The dataset contains a corpus, queries, and qrels (a relevance judgments file). They must be in the following format:

- `corpus` file: a `.jsonl` file (JSON Lines) that contains a list of dictionaries, each with three fields: `_id` with a unique document identifier, `title` with the document title (optional), and `text` with a document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` file (JSON Lines) that contains a list of dictionaries, each with two fields: `_id` with a unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, `query-id`, `corpus-id`, and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`

### Data Instances

A high-level example of any BEIR dataset:

```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```

### Data Fields

Examples from all configurations have the following features:

### Corpus

- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature denoting the title of the document.
  - `text`: a `string` feature denoting the text of the document.

### Queries

- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature denoting the text of the query.

### Qrels

- `qrels`: a `dict` feature representing the query-document relevance judgements, made up of:
  - `query-id`: a `string` feature representing the query id.
  - `corpus-id`: a `string` feature denoting the document id.
  - `score`: an `int32` feature denoting the relevance judgement between query and document.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Cite as:

```
@misc{arxiv-beir-500k-generated-queries,
  title={arxiv-beir-500k-generated-queries},
  author={Matthew Kenney},
  year={2023}
}
```
Provided by:
ArtifactAI
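Summary of Original Information

Dataset Overview

This dataset is a BEIR-style collection derived from ArXiv. All tasks are in English (en).

Dataset Structure

The dataset consists of three main parts:

  • corpus file: .jsonl format, containing a series of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage).
  • queries file: .jsonl format, containing a series of dictionaries, each with two fields: _id (unique query identifier) and text (query text).
  • qrels file: .tsv format, containing three columns: query-id, corpus-id, and score. The first row is a header.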
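
The three file formats above can be read into BEIR-style dictionaries with the Python standard library alone. The following is a minimal sketch, not the dataset author's loader: the helper names `parse_jsonl` and `parse_qrels` are hypothetical, and the in-memory strings stand in for files that are assumed (not confirmed by this page) to be named `corpus.jsonl`, `queries.jsonl`, and `qrels.tsv`.

```python
import csv
import io
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def parse_qrels(lines):
    """Parse tab-separated qrels with a header row into {query_id: {doc_id: score}}."""
    qrels = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# In-memory stand-ins for the three files (file names are an assumption).
corpus_file = io.StringIO(
    '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
)
queries_file = io.StringIO(
    '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
)
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n")

# Re-key the records by their identifiers; a missing title becomes an empty string.
corpus = {d["_id"]: {"title": d.get("title", ""), "text": d["text"]} for d in parse_jsonl(corpus_file)}
queries = {q["_id"]: q["text"] for q in parse_jsonl(queries_file)}
qrels = parse_qrels(qrels_file)
```

To read real files, the same helpers should accept an open file handle in place of the `io.StringIO` objects, for example `open(path, encoding="utf-8")`.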

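Data Instances

The dataset examples include:

  • corpus: document titles and text content.
  • queries: query text.
  • qrels: relevance judgments between queries and documents.

Data Fields

  • corpus: contains _id (string, unique document ID), title (string, document title), and text (string, document text).
  • queries: contains _id (string, unique query ID) and text (string, query text).
  • qrels: contains query-id (string, query ID), corpus-id (string, document ID), and score (integer, relevance judgment).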
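
The field types listed above lend themselves to a quick consistency check before running any retrieval experiment. The sketch below is an illustration under stated assumptions: `validate_beir` is a hypothetical helper, not part of the dataset or of any BEIR tooling, and it expects the in-memory dictionary layout (`corpus`, `queries`, `qrels`) shown in the dataset card's example.

```python
def validate_beir(corpus, queries, qrels):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for doc_id, doc in corpus.items():
        if not isinstance(doc.get("text"), str):
            problems.append(f"corpus[{doc_id!r}]: 'text' must be a string")
        if not isinstance(doc.get("title", ""), str):
            problems.append(f"corpus[{doc_id!r}]: 'title' must be a string")
    for query_id, text in queries.items():
        if not isinstance(text, str):
            problems.append(f"queries[{query_id!r}]: text must be a string")
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels: unknown query id {query_id!r}")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"qrels[{query_id!r}]: unknown document id {doc_id!r}")
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(score, bool) or not isinstance(score, int):
                problems.append(f"qrels[{query_id!r}][{doc_id!r}]: score must be an integer")
    return problems


# Small hand-written sample; "q2" and "doc9" are deliberately inconsistent.
sample_corpus = {"doc1": {"title": "", "text": "Wheat beer is a top-fermented beer."}}
sample_queries = {"q1": "Which beer is brewed with a large proportion of wheat?"}
good_qrels = {"q1": {"doc1": 1}}
bad_qrels = {"q2": {"doc9": "1"}}

good_report = validate_beir(sample_corpus, sample_queries, good_qrels)
bad_report = validate_beir(sample_corpus, sample_queries, bad_qrels)
```

Returning a list of messages rather than raising on the first failure makes it easier to see every inconsistency in one pass.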

Citation Information

Cite as:

@misc{arxiv-beir-500k-generated-queries,
  title={arxiv-beir-500k-generated-queries},
  author={Matthew Kenney},
  year={2023}
}
