arena-arxiv-7-2-24
Source: ModelScope community. Updated 2025-11-12; indexed 2024-09-07.
Download link:
https://modelscope.cn/datasets/MTEB/arena-arxiv-7-2-24
Resource description:
# mteb/arena-arxiv-7-2-24 Dataset
## Overview
This dataset, `mteb/arena-arxiv-7-2-24`, is a collection of metadata (identifier, title, abstract, and categories) for scientific papers on arXiv up to July 2, 2024. It is designed for use in the MTEB (Massive Text Embedding Benchmark) arena, where embedding models compete and are ranked by their performance.
## What is arXiv?
arXiv (pronounced "archive") is a free distribution service and open-access archive for scholarly articles. Founded in 1991, it has become a crucial platform for researchers to share their work quickly, particularly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
arXiv allows researchers to upload preprints (versions of academic papers before peer review) as well as post-prints (versions after peer review). This enables rapid dissemination of new research findings and fosters open scientific communication.
## Dataset Structure
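Each instance in the dataset represents a single paper from arXiv and contains the following fields:
1. **id** (string): A unique identifier for the paper, typically in the format "YYMM.NNNNN", where YY is the two-digit year, MM is the month, and NNNNN is a sequential number. Identifiers assigned from April 2007 through December 2014 use a four-digit sequence number ("YYMM.NNNN"), as in the example below.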
2. **title** (string): The full title of the paper.
3. **abstract** (string): A summary of the paper's content, typically written by the authors.
4. **categories** (string): The arXiv categories associated with the paper. Papers can belong to multiple categories, reflecting their interdisciplinary nature.
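The field list above can be checked mechanically. The following is a minimal sketch, not part of the dataset's tooling: `ArxivRecord`, `parse_record`, and `parse_new_style_id` are hypothetical names, and the whitespace-separated encoding of multiple categories is an assumption that the card does not confirm.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# New-style arXiv identifiers: YYMM.NNNN (2007-04 to 2014-12)
# or YYMM.NNNNN (2015-01 onward).
_NEW_STYLE_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})")

REQUIRED_FIELDS = ("id", "title", "abstract", "categories")


@dataclass(frozen=True)
class ArxivRecord:
    id: str
    title: str
    abstract: str
    categories: str

    def category_list(self) -> List[str]:
        # Assumption: multiple categories are separated by whitespace.
        return self.categories.split()


def parse_record(raw: dict) -> ArxivRecord:
    """Check that all four documented fields are present and are strings."""
    for name in REQUIRED_FIELDS:
        if not isinstance(raw.get(name), str):
            raise ValueError(f"field {name!r} is missing or not a string")
    return ArxivRecord(**{name: raw[name] for name in REQUIRED_FIELDS})


def parse_new_style_id(arxiv_id: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, sequence number), or None for any other format."""
    match = _NEW_STYLE_ID.fullmatch(arxiv_id)
    if match is None:
        return None
    yy, mm, seq = (int(group) for group in match.groups())
    if not 1 <= mm <= 12:
        return None
    # New-style identifiers began in 2007, so YY maps to 2000 + YY.
    return 2000 + yy, mm, seq
```

Identifiers in the older `archive/YYMMNNN` style (for example `hep-th/9901001`) return `None` here. Whether the dataset contains any such identifiers is not stated in the card.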
## Example Instance
Here's an example of what a single instance in the dataset looks like:
```json
{
  "id": "0704.0001",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "abstract": "A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pair transverse momentum and rapidity are presented, as well as the rapidity distribution of the diphoton system. The direct photon contribution is shown separately from the photon fragmentation contributions. The fragmentation contributions are approximated by the use of the first order perturbative contribution only.",
  "categories": "hep-ph"
}
```
## Usage
This dataset is primarily intended for training and evaluating embedding models in the MTEB arena. Researchers and developers can use it to:
1. Train new embedding models on a diverse set of scientific texts.
2. Evaluate the performance of existing embedding models on scientific literature.
3. Conduct research on topic modeling, document classification, or information retrieval in the scientific domain.
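As a rough illustration of the retrieval-style evaluation mentioned above, the sketch below ranks records against a query by cosine similarity. It rests on several assumptions: `embed` is a stand-in bag-of-words vectorizer, not a real embedding model; the records in `TOY_RECORDS` are invented placeholders rather than dataset rows; and concatenating title and abstract as the text to embed is one possible choice, not something the card prescribes.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Stand-in 'embedding': L2-normalised bag-of-words counts (not a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def rank(query: str, records: List[dict]) -> List[str]:
    """Return record ids ordered from most to least similar to the query."""
    q = embed(query)
    scored = [
        (cosine(q, embed(r["title"] + " " + r["abstract"])), r["id"])
        for r in records
    ]
    return [rid for _, rid in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical placeholder records that follow the four-field schema.
TOY_RECORDS = [
    {
        "id": "toy-1",
        "title": "Diphoton production at hadron colliders",
        "abstract": "Perturbative QCD calculation of photon pair cross sections.",
        "categories": "hep-ph",
    },
    {
        "id": "toy-2",
        "title": "Sparse graph colouring bounds",
        "abstract": "We bound the chromatic number of sparse graphs.",
        "categories": "math.CO",
    },
]

if __name__ == "__main__":
    print(rank("photon pair production", TOY_RECORDS))
```

To compare real models, `embed` would be replaced by the model under test and the ranking scored against relevance judgments, which this dataset card does not describe.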
## Ethical Considerations
When using this dataset, please be aware of potential biases in the scientific literature and the limitations of using preprint data. Not all papers on arXiv have undergone peer review, so the quality and accuracy of the content may vary.
## Updates and Maintenance
This dataset represents arXiv papers up to July 2, 2024. To rebuild the dataset with newer data, refer to the [create_index_chunks.py script](https://github.com/embeddings-benchmark/arena/blob/main/retrieval/create_index_chunks.py#L107) in the embeddings-benchmark/arena repository.
Provider:
maas
Created:
2024-09-06



