
ArtifactAI/arxiv-beir-math-generated-queries

Hugging Face | Updated 2023-06-21 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ArtifactAI/arxiv-beir-math-generated-queries
Resource description:
### Dataset Summary

A BEIR-style dataset derived from [ArXiv](https://arxiv.org/). The dataset consists of corpus/query pairs derived from ArXiv abstracts in the following categories:

"math.AC", "math.AG", "math.AP", "math.AT", "math.CA", "math.CO", "math.CT", "math.CV", "math.DG", "math.DS", "math.FA", "math.GM", "math.GN", "math.GR", "math.GT", "math.HO", "math.IT", "math.KT", "math.LO", "math.MG", "math.MP", "math.NA", "math.NT", "math.OA", "math.OC", "math.PR", "math.QA", "math.RA", "math.RT", "math.SG", "math.SP", "math.ST", "math-ph".

### Languages

All tasks are in English (`en`).

## Dataset Structure

The dataset contains a corpus, queries and qrels (relevance judgments file). They must be in the following format:

- `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with the unique document identifier, `title` with the document title (optional) and `text` with the document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with the unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`

### Data Instances

A high-level example of any BEIR dataset:

```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        # Adjacent string literals are concatenated, so no line-continuation backslashes are needed.
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

# Each query id maps to its query text.
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Each query id maps to {document id: relevance score}.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```

### Data Fields

Examples from all configurations have the following features:

### Corpus

- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature, denoting the title of the document.
  - `text`: a `string` feature, denoting the text of the document.

### Queries

- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature, denoting the text of the query.

### Qrels

- `qrels`: a `dict` feature representing the query-document relevance judgements, made up of:
  - `_id`: a `string` feature representing the query id.
  - `_id`: a `string` feature, denoting the document id.
  - `score`: an `int32` feature, denoting the relevance judgement between query and document.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Cite as:

```
@misc{arxiv-beir-math-generated-queries,
  title={arxiv-beir-math-generated-queries},
  author={Matthew Kenney},
  year={2023}
}
```
Provided by:
ArtifactAI
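Summary of Original Information

Dataset Overview

This is a BEIR-style dataset based on ArXiv. It contains corpus/query pairs derived from ArXiv abstracts in the following categories: "math.AC", "math.AG", "math.AP", "math.AT", "math.CA", "math.CO", "math.CT", "math.CV", "math.DG", "math.DS", "math.FA", "math.GM", "math.GN", "math.GR", "math.GT", "math.HO", "math.IT", "math.KT", "math.LO", "math.MG", "math.MP", "math.NA", "math.NT", "math.OA", "math.OC", "math.PR", "math.QA", "math.RA", "math.RT", "math.SG", "math.SP", "math.ST", "math-ph".

Languages

All tasks are in English (en).

Dataset Structure

The dataset contains a corpus, queries and qrels (relevance judgments file), in the following format:

  • corpus file: a .jsonl file (jsonlines) containing a list of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional) and text (document paragraph or passage). For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
  • queries file: a .jsonl file (jsonlines) containing a list of dictionaries, each with two fields: _id (unique query identifier) and text (query text). For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
  • qrels file: a .tsv file (tab-separated) containing three columns, query-id, corpus-id and score, in that order. The first row is kept as a header. For example: q1 doc1 1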
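The three file formats listed above can be read with the Python standard library alone. The following is a minimal sketch, not the dataset author's loader: the constants `CORPUS_JSONL`, `QUERIES_JSONL` and `QRELS_TSV` and the helpers `read_jsonl` and `read_qrels` are hypothetical names introduced here, and the in-memory strings stand in for real files whose names the card does not specify.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for the corpus, queries and qrels files.
CORPUS_JSONL = '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
QUERIES_JSONL = '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
QRELS_TSV = "query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n"


def read_jsonl(handle):
    """Return one dict per non-empty line of a JSON Lines stream."""
    return [json.loads(line) for line in handle if line.strip()]


def read_qrels(handle):
    """Return {query_id: {doc_id: score}} from a tab-separated stream with a header row."""
    qrels = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Re-key the records by id; a missing title becomes an empty string.
corpus = {
    doc["_id"]: {"title": doc.get("title", ""), "text": doc["text"]}
    for doc in read_jsonl(io.StringIO(CORPUS_JSONL))
}
queries = {q["_id"]: q["text"] for q in read_jsonl(io.StringIO(QUERIES_JSONL))}
qrels = read_qrels(io.StringIO(QRELS_TSV))
```

With real files, the `io.StringIO(...)` objects would be replaced by handles from `open(path, encoding="utf-8")`.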

Data Instances

A high-level example of a BEIR dataset:

```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        # Adjacent string literals are concatenated into one string.
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```
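Going the other way, the in-memory dictionaries can be written back out in the `.jsonl` and `.tsv` layouts described for this dataset. This is a sketch under stated assumptions: the helper names `corpus_to_jsonl`, `queries_to_jsonl` and `qrels_to_tsv` are hypothetical, and the document texts are shortened stand-ins for the full passages.

```python
import json

# Shortened stand-ins for the example documents.
corpus = {
    "doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born...."},
    "doc2": {"title": "", "text": "Wheat beer is a top-fermented beer."},
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}


def corpus_to_jsonl(corpus):
    """Serialize {doc_id: {"title", "text"}} to one JSON object per line."""
    lines = [
        json.dumps({"_id": doc_id, "title": doc["title"], "text": doc["text"]}, ensure_ascii=False)
        for doc_id, doc in corpus.items()
    ]
    return "\n".join(lines) + "\n"


def queries_to_jsonl(queries):
    """Serialize {query_id: text} to one JSON object per line."""
    lines = [
        json.dumps({"_id": query_id, "text": text}, ensure_ascii=False)
        for query_id, text in queries.items()
    ]
    return "\n".join(lines) + "\n"


def qrels_to_tsv(qrels):
    """Serialize {query_id: {doc_id: score}} to tab-separated rows under a header row."""
    rows = ["query-id\tcorpus-id\tscore"]
    for query_id, judged in qrels.items():
        for doc_id, score in judged.items():
            rows.append(f"{query_id}\t{doc_id}\t{score}")
    return "\n".join(rows) + "\n"


corpus_jsonl = corpus_to_jsonl(corpus)
queries_jsonl = queries_to_jsonl(queries)
qrels_tsv = qrels_to_tsv(qrels)
```

`ensure_ascii=False` keeps non-ASCII characters such as "ß" readable in the output instead of escaping them; that choice is a preference of this sketch, not a requirement stated by the card.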

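Data Fields

Examples from all configurations have the following features:

Corpus

  • corpus: a dict feature representing the document title and passage text, made up of:
    • _id: a string feature representing the unique document id.
    • title: a string feature denoting the title of the document.
    • text: a string feature denoting the text of the document.

Queries

  • queries: a dict feature representing the query, made up of:
    • _id: a string feature representing the unique query id.
    • text: a string feature denoting the text of the query.

Qrels

  • qrels: a dict feature representing the query-document relevance judgements, made up of:
    • _id: a string feature representing the query id.
    • _id: a string feature denoting the document id.
    • score: an int32 feature denoting the relevance judgement between query and document.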
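The field types listed above suggest a simple consistency check before using the data. The sketch below is illustrative only: `find_problems` is a hypothetical helper, and the rule that every id in `qrels` must also appear in `queries` or `corpus` is an assumption of this example rather than something the card states.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def find_problems(corpus, queries, qrels):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    for doc_id, doc in corpus.items():
        if not isinstance(doc.get("title"), str) or not isinstance(doc.get("text"), str):
            problems.append(f"corpus[{doc_id!r}]: title and text must be strings")
    for query_id, text in queries.items():
        if not isinstance(text, str):
            problems.append(f"queries[{query_id!r}]: text must be a string")
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels: unknown query id {query_id!r}")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"qrels[{query_id!r}]: unknown document id {doc_id!r}")
            # bool is a subclass of int in Python, so it is excluded explicitly.
            if isinstance(score, bool) or not isinstance(score, int):
                problems.append(f"qrels[{query_id!r}][{doc_id!r}]: score must be an integer")
            elif not INT32_MIN <= score <= INT32_MAX:
                problems.append(f"qrels[{query_id!r}][{doc_id!r}]: score does not fit in int32")
    return problems


corpus = {"doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}}
queries = {"q1": "Who developed the mass-energy equivalence formula?"}

good = find_problems(corpus, queries, {"q1": {"doc1": 1}})
bad = find_problems(corpus, queries, {"q1": {"doc9": 1}})
```

Here `good` is an empty list, while `bad` holds a single message about the unknown document id `doc9`.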