
ArtifactAI/arxiv-beir-cs-ml-generated-queries

Source: Hugging Face. Updated 2023-06-21. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ArtifactAI/arxiv-beir-cs-ml-generated-queries
Description:
### Dataset Summary

A BEIR-style dataset derived from [ArXiv](https://arxiv.org/). The dataset consists of corpus/query pairs derived from ArXiv abstracts in the following categories: "cs.CL", "cs.AI", "cs.CV", "cs.HC", "cs.IR", "cs.RO", "cs.NE", "stat.ML".

### Languages

All tasks are in English (`en`).

## Dataset Structure

The dataset contains a corpus, queries and qrels (relevance judgments file). They must be in the following format:

- `corpus` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with the unique document identifier, `title` with the document title (optional) and `text` with the document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with the unique query identifier and `text` with the query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` file (tab-separated) that contains three columns, i.e. the `query-id`, `corpus-id` and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`

### Data Instances

A high-level example of any BEIR dataset:

```python
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Maps each query id to the documents judged relevant and their scores.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```

### Data Fields

Examples from all configurations have the following features:

### Corpus

- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature denoting the title of the document.
  - `text`: a `string` feature denoting the text of the document.

### Queries

- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature denoting the text of the query.

### Qrels

- `qrels`: a `dict` feature representing the query-document relevance judgments, made up of:
  - `query-id`: a `string` feature representing the query id.
  - `corpus-id`: a `string` feature denoting the document id.
  - `score`: an `int32` feature denoting the relevance judgment between query and document.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Cite as:

```
@misc{arxiv-beir-cs-ml-generated-queries,
  title={arxiv-beir-cs-ml-generated-queries},
  author={Matthew Kenney},
  year={2023}
}
```
Provider:
ArtifactAI
Summary of Original Information

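Dataset Overview

This is a BEIR-style dataset derived from ArXiv. It contains corpus/query pairs built from ArXiv abstracts in the following categories: "cs.CL", "cs.AI", "cs.CV", "cs.HC", "cs.IR", "cs.RO", "cs.NE", "stat.ML". All tasks are in English (en).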

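Dataset Structure

The dataset has three main parts:

  • Corpus file: .jsonl format, containing a list of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional), text (document paragraph or passage).
  • Queries file: .jsonl format, containing a list of dictionaries, each with two fields: _id (unique query identifier), text (query text).
  • Qrels file: .tsv format, containing three columns, query-id, corpus-id and score, which give the relevance score between a query and a document.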
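
The three file layouts above can be read into in-memory dictionaries with the Python standard library alone. The following is a minimal sketch, not the dataset author's loader: the function names (`load_corpus`, `load_queries`, `load_qrels`) and the inline sample strings are hypothetical stand-ins for the real files, and the qrels header names are assumed to match the column names listed above.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for the corpus, queries and qrels files.
CORPUS_JSONL = '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
QUERIES_JSONL = '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
QRELS_TSV = "query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n"


def load_corpus(lines):
    """Map each document _id to its title and text; a missing title becomes ''."""
    corpus = {}
    for line in lines:
        if line.strip():  # skip blank lines
            row = json.loads(line)
            corpus[row["_id"]] = {"title": row.get("title", ""), "text": row["text"]}
    return corpus


def load_queries(lines):
    """Map each query _id to its query text."""
    queries = {}
    for line in lines:
        if line.strip():
            row = json.loads(line)
            queries[row["_id"]] = row["text"]
    return queries


def load_qrels(lines):
    """Map each query id to a dict of {document id: integer relevance score}."""
    qrels = {}
    # Assumes the first row is the header: query-id, corpus-id, score.
    for row in csv.DictReader(lines, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


corpus = load_corpus(io.StringIO(CORPUS_JSONL))
queries = load_queries(io.StringIO(QUERIES_JSONL))
qrels = load_qrels(io.StringIO(QRELS_TSV))
```

To read real files, pass an open file handle instead of `io.StringIO`, for example `load_corpus(open(path, encoding="utf-8"))`, where `path` is wherever the corpus file was downloaded.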

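Data Instances

An example of the dataset consists of:

  • Corpus: a dictionary containing each document's _id, title and text.
  • Queries: a dictionary containing each query's _id and text.
  • Qrels: a dictionary containing the query _id, the document _id and the relevance score.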

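Data Fields

  • Corpus: _id (string, unique document identifier), title (string, document title), text (string, document text).
  • Queries: _id (string, unique query identifier), text (string, query text).
  • Qrels: query-id (string, query identifier), corpus-id (string, document identifier), score (integer, relevance score).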
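
The field types above, together with the fact that qrels refer to queries and documents by id, suggest a simple consistency check. The sketch below is illustrative only: `validate` is a hypothetical helper, and the sample dictionaries reuse the toy values from the dataset card rather than real records.

```python
def validate(corpus, queries, qrels):
    """Return a list of problems found; an empty list means the data is consistent."""
    problems = []
    for doc_id, doc in corpus.items():
        # Corpus entries need string title and text fields.
        if not isinstance(doc.get("title"), str) or not isinstance(doc.get("text"), str):
            problems.append(f"corpus[{doc_id!r}] needs string title and text")
    for query_id, text in queries.items():
        if not isinstance(text, str):
            problems.append(f"queries[{query_id!r}] must be a string")
    for query_id, judged in qrels.items():
        # Every judged query and document must exist in the other two structures.
        if query_id not in queries:
            problems.append(f"qrels refers to unknown query {query_id!r}")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"qrels refers to unknown document {doc_id!r}")
            if not isinstance(score, int):
                problems.append(f"score for ({query_id!r}, {doc_id!r}) must be an integer")
    return problems


# Toy data in the in-memory layout described above.
sample_corpus = {"doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}}
sample_queries = {"q1": "Who developed the mass-energy equivalence formula?"}
sample_qrels = {"q1": {"doc1": 1}}

ok_report = validate(sample_corpus, sample_queries, sample_qrels)
bad_report = validate(sample_corpus, sample_queries, {"q2": {"doc9": 1}})
```

Here `ok_report` is empty, while `bad_report` lists the unknown query `q2` and the unknown document `doc9`.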

Citation Information

Cite as:

@misc{arxiv-beir-cs-ml-generated-queries, title={arxiv-beir-cs-ml-generated-queries}, author={Matthew Kenney}, year={2023} }
