ArtifactAI/arxiv-beir-math-generated-queries
Dataset Summary
This is a BEIR-style dataset built from ArXiv. It contains corpus/query pairs derived from ArXiv abstracts in the following categories: "math.AC", "math.AG", "math.AP", "math.AT", "math.CA", "math.CO", "math.CT", "math.CV", "math.DG", "math.DS", "math.FA", "math.GM", "math.GN", "math.GR", "math.GT", "math.HO", "math.IT", "math.KT", "math.LO", "math.MG", "math.MP", "math.NA", "math.NT", "math.OA", "math.OC", "math.PR", "math.QA", "math.RA", "math.RT", "math.SG", "math.SP", "math.ST", "math-ph".
Languages
All tasks are in English (en).
Dataset Structure
The dataset consists of a corpus, queries, and qrels (a relevance judgments file), in the following formats:
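corpus file: a .jsonl file (JSON Lines) containing one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage). For example:
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}

queries file: a .jsonl file (JSON Lines) containing one dictionary per line, each with two fields: _id (unique query identifier) and text (query text). For example:
{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}

qrels file: a .tsv file (tab-separated) with three columns, query-id, corpus-id, and score, in that order. The first row is a header. For example:
q1 doc1 1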
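The three file formats can be read with the Python standard library alone. The following is a minimal sketch, not the dataset authors' loader: the function names are hypothetical, the qrels header is assumed to use the column names query-id, corpus-id, and score exactly as listed above, and in-memory strings stand in for the real files.

```python
import csv
import io
import json


def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_corpus(lines):
    """Map each document _id to its title and text; a missing title becomes ""."""
    return {
        row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in load_jsonl(lines)
    }


def load_queries(lines):
    """Map each query _id to its query text."""
    return {row["_id"]: row["text"] for row in load_jsonl(lines)}


def load_qrels(lines):
    """Parse tab-separated qrels (header row first) into {query-id: {corpus-id: score}}."""
    # Assumption: the header row names the columns query-id, corpus-id, score.
    reader = csv.DictReader(lines, delimiter="\t")
    qrels = {}
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# In-memory stand-ins for the three files, built from the example rows.
corpus_file = io.StringIO(
    '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
)
queries_file = io.StringIO(
    '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
)
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n")

corpus = load_corpus(corpus_file)
queries = load_queries(queries_file)
qrels = load_qrels(qrels_file)
```

Each loader accepts any iterable of lines, so an open file handle (for example `open(path, encoding="utf-8")`) can be passed in place of the `io.StringIO` objects.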
Data Instances
A high-level example of a BEIR dataset:
```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula E = mc2, which has been dubbed the world's most famous equation. He received the 1921 Nobel Prize in Physics for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect, a pivotal step in the development of quantum theory."
    },
    "doc2": {
        "title": "",  # Keep the title as an empty string if it is not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```
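To illustrate how the qrels dictionary is typically consumed, here is a minimal sketch that scores a ranked retrieval run against it using recall@k. The `results` dictionary is a hypothetical retriever output invented for this example, and `recall_at_k` is an illustrative helper, not part of the dataset or of any particular library.

```python
def recall_at_k(qrels, results, k):
    """Mean over queries of the fraction of relevant documents found in the top k results."""
    per_query = []
    for query_id, judged in qrels.items():
        # A document counts as relevant when its judgment score is positive.
        relevant = {doc_id for doc_id, score in judged.items() if score > 0}
        if not relevant:
            continue  # no positively judged documents for this query
        top_k = results.get(query_id, [])[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# Hypothetical ranked output of a retriever: best match first for each query.
results = {"q1": ["doc1", "doc2"], "q2": ["doc1", "doc2"]}

recall_1 = recall_at_k(qrels, results, k=1)  # q1 is found at rank 1, q2 is not
recall_2 = recall_at_k(qrels, results, k=2)  # both relevant documents are in the top 2
```

Keying qrels by query ID and then by document ID makes the per-query lookup of relevant documents a single dictionary access, which is why the nested-dictionary form is convenient for evaluation.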
Data Fields
Examples from all configurations have the following features:
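Corpus
corpus: a dictionary feature representing the document title and passage text, containing:
_id: a string feature representing the unique document ID.
title: a string feature representing the document title.
text: a string feature representing the document text.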
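Queries
queries: a dictionary feature representing the query, containing:
_id: a string feature representing the unique query ID.
text: a string feature representing the query text.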
Qrels
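qrels: a dictionary feature representing the query-document relevance judgments, containing:
query-id: a string feature representing the query ID.
corpus-id: a string feature representing the document ID.
score: an int32 feature representing the relevance judgment between the query and the document.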
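The field types listed here can be checked mechanically before the data is used. The sketch below assumes the records are held in the nested-dictionary form shown under Data Instances; `check_fields` is a hypothetical helper written for this example, and it also verifies that every relevance judgment points at a known query and document.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_fields(corpus, queries, qrels):
    """Return a list of problems found; an empty list means the data is consistent."""
    problems = []
    for doc_id, doc in corpus.items():
        if not isinstance(doc_id, str):
            problems.append(f"corpus _id {doc_id!r} is not a string")
        for field in ("title", "text"):
            if not isinstance(doc.get(field), str):
                problems.append(f"corpus {doc_id!r}: {field} is not a string")
    for query_id, text in queries.items():
        if not isinstance(query_id, str) or not isinstance(text, str):
            problems.append(f"query {query_id!r}: _id or text is not a string")
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels: unknown query {query_id!r}")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"qrels: unknown document {doc_id!r}")
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if not is_int or not INT32_MIN <= score <= INT32_MAX:
                problems.append(f"qrels {query_id!r}/{doc_id!r}: score is not an int32")
    return problems


corpus = {"doc1": {"title": "", "text": "Wheat beer is a top-fermented beer."}}
queries = {"q1": "Which beer is brewed with a large proportion of wheat?"}

ok = check_fields(corpus, queries, {"q1": {"doc1": 1}})
# "doc9" is deliberately absent from the corpus to show a reported problem.
bad = check_fields(corpus, queries, {"q1": {"doc1": 1, "doc9": 1}})
```

Returning a list of messages rather than raising on the first error lets a caller see every inconsistency in one pass.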



