ArtifactAI/arxiv-beir-cs-ml-generated-queries
Source: Hugging Face. Updated 2023-06-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ArtifactAI/arxiv-beir-cs-ml-generated-queries
Resource overview:
### Dataset Summary
A BEIR-style dataset derived from [ArXiv](https://arxiv.org/). It consists of corpus/query pairs built from ArXiv abstracts in the following categories: "cs.CL", "cs.AI", "cs.CV", "cs.HC", "cs.IR", "cs.RO", "cs.NE", "stat.ML".
### Languages
All tasks are in English (`en`).
## Dataset Structure
The dataset contains a corpus, queries and qrels (relevance judgments file). They must be in the following format:
- `corpus` file: a `.jsonl` (JSON Lines) file containing one dictionary per line, each with three fields: `_id` (unique document identifier), `title` (document title, optional) and `text` (document paragraph or passage). For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- `queries` file: a `.jsonl` (JSON Lines) file containing one dictionary per line, each with two fields: `_id` (unique query identifier) and `text` (query text). For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- `qrels` file: a `.tsv` (tab-separated) file with three columns, `query-id`, `corpus-id` and `score`, in this order. Keep the first row as a header. For example: `q1 doc1 1`
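The three file formats can be read with the Python standard library alone. The following is a minimal sketch, not part of the original dataset card: the helper names (`load_jsonl`, `load_corpus`, `load_queries`, `load_qrels`) are hypothetical, and the in-memory strings stand in for real files so the example is self-contained.

```python
import csv
import io
import json


def load_jsonl(stream):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


def load_corpus(stream):
    # Map each document id to its title and text; a missing title becomes "".
    return {
        row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in load_jsonl(stream)
    }


def load_queries(stream):
    # Map each query id to its query text.
    return {row["_id"]: row["text"] for row in load_jsonl(stream)}


def load_qrels(stream):
    # The first row is the header: query-id, corpus-id, score.
    reader = csv.DictReader(stream, delimiter="\t")
    qrels = {}
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Tiny in-memory stand-ins for the three files (illustrative data only).
corpus_file = io.StringIO(
    '{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}\n'
)
queries_file = io.StringIO(
    '{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}\n'
)
qrels_file = io.StringIO("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n")

corpus = load_corpus(corpus_file)
queries = load_queries(queries_file)
qrels = load_qrels(qrels_file)
print(qrels)  # {'q1': {'doc1': 1}}
```

With real files, the same helpers would be called on handles opened with `open(path, encoding="utf-8")`; the paths depend on where the dataset was downloaded.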
### Data Instances
A high-level example of any BEIR dataset:
```python
# Each document id maps to its title and text.
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": (
            "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, "
            "one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for "
            "its influence on the philosophy of science. He is best known to the general public for his mass–energy "
            "equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 "
            "Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law "
            "of the photoelectric effect', a pivotal step in the development of quantum theory."
        ),
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": (
            "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of "
            "malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made "
            "with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
        ),
    },
}

# Each query id maps to its query text.
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}

# Each query id maps to the documents judged relevant and their scores.
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
```
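Because `qrels` refers to queries and documents only by id, it can be useful to confirm that every judgment points at an entry that actually exists. A minimal sketch of such a check follows; the function name `find_dangling_qrels` and the sample data (including the deliberately missing `doc3`) are illustrative assumptions, not part of the dataset.

```python
def find_dangling_qrels(corpus, queries, qrels):
    """Return (query_id, doc_id) pairs in qrels that point at unknown ids."""
    dangling = []
    for query_id, judged_docs in qrels.items():
        for doc_id in judged_docs:
            if query_id not in queries or doc_id not in corpus:
                dangling.append((query_id, doc_id))
    return dangling


# Illustrative data: "doc3" is referenced by qrels but absent from the corpus.
corpus = {
    "doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born...."},
    "doc2": {"title": "", "text": "Wheat beer is a top-fermented beer...."},
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc3": 1},
}

print(find_dangling_qrels(corpus, queries, qrels))  # [('q2', 'doc3')]
```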
### Data Fields
Examples from all configurations have the following features:
### Corpus
- `corpus`: a `dict` feature representing the document title and passage text, made up of:
  - `_id`: a `string` feature representing the unique document id.
  - `title`: a `string` feature, denoting the title of the document.
  - `text`: a `string` feature, denoting the text of the document.
### Queries
- `queries`: a `dict` feature representing the query, made up of:
  - `_id`: a `string` feature representing the unique query id.
  - `text`: a `string` feature, denoting the text of the query.
### Qrels
- `qrels`: a `dict` feature representing the query-document relevance judgements, made up of:
  - `_id`: a `string` feature representing the query id.
  - `_id`: a `string` feature, denoting the document id.
  - `score`: an `int32` feature, denoting the relevance judgement between query and document.
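The field lists above can be turned into a simple record check. The sketch below is an assumption-laden illustration: `FIELD_TYPES` and `field_errors` are hypothetical names, and the `qrels` entry uses the TSV column names from the Dataset Structure section (`query-id`, `corpus-id`, `score`) rather than the repeated `_id` label.

```python
# Expected fields and Python types per record kind (assumed mapping:
# `string` -> str, `int32` -> int).
FIELD_TYPES = {
    "corpus": {"_id": str, "title": str, "text": str},
    "queries": {"_id": str, "text": str},
    "qrels": {"query-id": str, "corpus-id": str, "score": int},
}


def field_errors(record, kind):
    """Return a list of human-readable problems found in one record."""
    errors = []
    for name, expected in FIELD_TYPES[kind].items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"{name} should be {expected.__name__}")
    return errors


print(field_errors({"_id": "doc1", "title": "", "text": "Some passage."}, "corpus"))  # []
print(field_errors({"_id": "q1"}, "queries"))  # ['missing field: text']
```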
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[Needs More Information]
### Citation Information
Cite as:
```
@misc{arxiv-beir-cs-ml-generated-queries,
title={arxiv-beir-cs-ml-generated-queries},
author={Matthew Kenney},
year={2023}
}
```
Provider:
ArtifactAI
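Summary of original information
Dataset overview
This is a BEIR-style dataset derived from ArXiv. It contains corpus/query pairs built from ArXiv abstracts in the following categories: "cs.CL", "cs.AI", "cs.CV", "cs.HC", "cs.IR", "cs.RO", "cs.NE", "stat.ML". All tasks are in English (en).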
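Dataset structure
The dataset has three main parts:
- Corpus file: `.jsonl` format, a list of dictionaries, each with three fields: `_id` (unique document identifier), `title` (document title, optional) and `text` (document paragraph or passage).
- Queries file: `.jsonl` format, a list of dictionaries, each with two fields: `_id` (unique query identifier) and `text` (query text).
- Qrels file: `.tsv` format with three columns, `query-id`, `corpus-id` and `score`, giving the relevance score between a query and a document.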
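Data instances
Examples in the dataset include:
- Corpus: a dictionary containing each document's `_id`, `title` and `text`.
- Queries: a dictionary containing each query's `_id` and `text`.
- Qrels: a dictionary containing the query `_id`, the document `_id` and the relevance `score`.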
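Data fields
- Corpus: contains `_id` (string, unique document identifier), `title` (string, document title) and `text` (string, document text).
- Queries: contains `_id` (string, unique query identifier) and `text` (string, query text).
- Qrels: contains `_id` (string, query identifier), `_id` (string, document identifier) and `score` (integer, relevance score).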
Citation information
Cite as:
@misc{arxiv-beir-cs-ml-generated-queries, title={arxiv-beir-cs-ml-generated-queries}, author={Matthew Kenney}, year={2023} }



