ArxivCap
ModelScope community, updated 2025-12-05, indexed 2025-02-15
Download link:
https://modelscope.cn/datasets/MMInstruction/ArxivCap
# Dataset Card for ArxivCap
## Table of Contents
- [Dataset Card for ArxivCap](#dataset-card-for-arxivcap)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Curation Process](#curation-process)
  - [Dataset Structure](#dataset-structure)
    - [Data Loading](#data-loading)
    - [Data Fields](#data-fields)
    - [Data Instances](#data-instances)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
## Dataset Description
- **Paper:** [Multimodal ArXiv](https://arxiv.org/abs/2403.00231)
- **Point of Contact:** nlp.lilei@gmail.com
- **Homepage:** https://mm-arxiv.github.io/
### Data Instances
<details>
<summary>Example 1: a single (image, caption) pair</summary>
"......" stands for omitted parts.

```
{
'src': 'arXiv_src_2112_060/2112.08947',
'meta':
{
'meta_from_kaggle':
{
'journey': '',
'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
'categories': 'cs.ET'
},
'meta_from_s2':
{
'citationCount': 8,
'influentialCitationCount': 0,
'publicationTypes': ['JournalArticle']
}
},
'arxiv_id': '2112.08947',
'title': 'Computational metrics and parameters of an injection-locked large area semiconductor laser for neural network computing',
'abstract': 'Artificial neural networks have become a staple computing technique in many fields. Yet, they present fundamental differences with classical computing hardware in the way they process information. Photonic implementations of neural network architectures potentially offer fundamental advantages over their electronic counterparts in terms of speed, processing parallelism, scalability and energy efficiency. Scalable and high performance photonic neural networks (PNNs) have been demonstrated, yet they remain scarce. In this work, we study the performance of such a scalable, fully parallel and autonomous PNN based on a large area vertical-cavity surface-emitting laser\n(LA-VCSEL). We show how the performance varies with different physical parameters, namely, injection wavelength, injection power, and bias current. Furthermore, we link these physical parameters to the general computational measures of consistency and dimensionality. We present a general method of gauging dimensionality in high dimensional nonlinear systems subject to noise, which could be applied to many systems in the context of neuromorphic computing. Our work will inform future implementations of spatially multiplexed VCSEL PNNs.\n',
'caption_images':
[
{
'caption': '(a) Working principle of the LA-VCSEL spatially multiplexed reservoir. (b) Input information $\\mathbf{u}$ and the subsequent LA-VCSEL response for 3-bit binary headers. The graph shows the target output $y^{\\text{target}}$ (yellow) for classifying header 001 and different reservoir outputs $y^{\\text{out}}$ of decreasing mean square error (MSE) (red, blue and green). (c) Schematic illustration of the error landscape, showing the MSE as a function of the output weights configuration. The outlined (red, blue and green) Boolean matrices correspond to the output weights giving the output from (b). (d) Representative performance of the PNN on a 6-bit header recognition task.',
'cil_pairs':
[
{
'sub_caption': '',
'image_file': 'arXiv_src_2112_060/2112.08947_0.jpg',
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2016x1063 at 0x7F098E288040>,
'image_ocr': ['(a)', 'LA-VCSEL', 'DMDa', 'DMD', 'MMF', 'DET', 'Win', 'xt', 'Spatial positions', 'Output', 'Input', 'Wint', 'Carrier diffusion', 'Cavity diffraction', 'Reservoir', '(d)50', '6bit HR', 'Error(MSE)', '830', '001', '000', '001', '100', '001', '111', 'ER', 'S', '10', '0', 'Configuration DMD.', '0', '1000', 'Input examples', 'Learning epochs']
}
]
}
......
]
}
```
</details>
<details>
<summary>Example 2: multiple images with subcaptions</summary>
"......" stands for omitted parts.

```
{
'src': 'arXiv_src_0309_001/quant-ph0309051',
'meta':
{
'meta_from_kaggle': {'journey': '', 'license': '', 'categories': 'quant-ph'},
'meta_from_s2': {'citationCount': 9, 'influentialCitationCount': 1, 'publicationTypes': ['JournalArticle']}
},
'arxiv_id': 'quant-ph/0309051',
'title': 'Implementing a Quantum Algorithm with Exchange-Coupled Quantum Dots: a Feasibility study.',
'abstract': '\nWe present Monte Carlo wavefunction simulations for quantum computations employing an exchange-coupled array of quantum dots. Employing a combination of experimentally and theoretically available parameters, we find that gate fidelities greater than 98 \\% may be obtained with current experimental and technological capabilities. Application to an encoded 3 qubit\n(nine physical qubits) Deutsch-Josza computation indicates that the algorithmic fidelity is more a question of the total time to implement the gates than of the physical complexity of those gates.\n',
'caption_images':
[
......
{
'caption': 'Representation of analytic sequence of local transformations that transform the 19-exchange sequence $U_{cnot}^{exchange}$ from Ref. {divincenzo00} into the true CNOT in the computational basis. The exchange gates and times corresponding to the elementary local transformations are then obtained using the quaternion representation of the desired $SU(2)$ unitaries (see Appendix <ref> for details).',
'cil_pairs':
[
{
'sub_caption': 'A single qubit gate ($\\frac{\\sqrt{3}}{2}-\\frac{i}{2}\\sigma_y$) acting on the second logical qubit diagonalizes the 19-gate exchange sequence. The resulting diagonal 4-by-4 matrix is then converted into the C-PHASE by $\\sigma_z$-rotations acting on both the first and the second qubit, with angles $\\phi=0.612497$ and $\\theta=-0.547580$, respectively. These values are determined from the analytic solutions to a linear equation system with 3 unknowns: $\\phi$, $\\theta$ and a global phase. See Appendix <ref> for details as to how these parameters were obtained.',
'image_file': 'arXiv_src_0309_001/quant-ph0309051_4.jpg',
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2016x493 at 0x7F102471EF70>,
'image_ocr': ['Exch,', '7', 'C', '2', '+', '2', '2', 'CNOT', '2', '2', 'PHASE']
},
{
'sub_caption': 'The C-PHASE gate can be transformed into the CNOT gate by acting with Hadamard gates on the second qubit before and after the C-PHASE gate.',
'image_file': 'arXiv_src_0309_001/quant-ph0309051_5.jpg',
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2016x411 at 0x7F102471EDC0>,
'image_ocr': ['C', '2', 'PHASE']
}
]
},
......
]
}
```
</details>
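The two examples differ only in how many entries `cil_pairs` holds: one image with an empty `sub_caption` in Example 1, several images with their own subcaptions in Example 2. A minimal sketch of telling the two cases apart is shown below. The helper names `figure_kind` and `count_figure_kinds` are hypothetical, and the code assumes every `caption_images` entry has at least one element in `cil_pairs`.

```python
from collections import Counter


def figure_kind(entry: dict) -> str:
    """Classify one element of `caption_images` by its number of images.

    Assumption: every entry holds at least one item in `cil_pairs`.
    """
    return "multi" if len(entry["cil_pairs"]) > 1 else "single"


def count_figure_kinds(record: dict) -> Counter:
    """Count single-image and multi-image figures in one paper record."""
    return Counter(figure_kind(entry) for entry in record["caption_images"])


# Toy record shaped like the examples above (images and OCR omitted).
toy_record = {
    "caption_images": [
        {"caption": "Main caption A", "cil_pairs": [{"sub_caption": ""}]},
        {
            "caption": "Main caption B",
            "cil_pairs": [{"sub_caption": "left"}, {"sub_caption": "right"}],
        },
    ]
}

print(count_figure_kinds(toy_record))
```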
### Dataset Summary
ArxivCap contains 6.4 million images and 3.9 million captions (193 million words in total) drawn from 570k academic papers, each accompanied by its title and abstract. It covers papers from before **June 2023**.
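As a rough sense of scale, the headline counts above imply the per-paper and per-caption averages computed below. The inputs are the rounded figures from the summary, and the sketch assumes the word count refers to caption text, so the results are approximate.

```python
# Rounded counts quoted in the dataset summary (not exact values).
NUM_IMAGES = 6_400_000
NUM_CAPTIONS = 3_900_000
NUM_WORDS = 193_000_000  # assumed to count words in captions
NUM_PAPERS = 570_000

images_per_paper = NUM_IMAGES / NUM_PAPERS
captions_per_paper = NUM_CAPTIONS / NUM_PAPERS
images_per_caption = NUM_IMAGES / NUM_CAPTIONS
words_per_caption = NUM_WORDS / NUM_CAPTIONS

print(f"images per paper:   {images_per_paper:.1f}")    # 11.2
print(f"captions per paper: {captions_per_paper:.1f}")  # 6.8
print(f"images per caption: {images_per_caption:.1f}")  # 1.6
print(f"words per caption:  {words_per_caption:.1f}")   # 49.5
```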
### Curation Process
See [our paper](https://arxiv.org/abs/2403.00231) for details of the curation and filtering process.
## Dataset Structure
### Data Loading
```python
from datasets import load_dataset

dataset = load_dataset("MMInstruction/ArxivCap")
dataset["train"]  # a datasets.Dataset; each row is a dictionary describing one paper
```
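Because the full dataset is large, it may be more practical to stream it and keep only the papers of interest. The sketch below uses the `streaming=True` option of `load_dataset`. The helper names `papers_in_category` and `preview_category` are hypothetical, and the code assumes `categories` is a whitespace-separated string of arXiv category tags, as in the Kaggle arXiv metadata.

```python
from itertools import islice
from typing import Iterable, Iterator


def papers_in_category(records: Iterable[dict], category: str) -> Iterator[dict]:
    """Yield records whose arXiv categories include `category`.

    Assumption: `categories` is a whitespace-separated string such as
    "cs.ET" or "cs.ET physics.optics".
    """
    for record in records:
        categories = record["meta"]["meta_from_kaggle"]["categories"].split()
        if category in categories:
            yield record


def preview_category(category: str = "cs.ET", limit: int = 3) -> None:
    """Stream the train split and print the first few matching papers."""
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset("MMInstruction/ArxivCap", split="train", streaming=True)
    for record in islice(papers_in_category(stream, category), limit):
        print(record["arxiv_id"], record["title"])
```

Calling `preview_category()` downloads data on demand, so nothing runs until the function is invoked explicitly.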
---
```bash
# Quick download on Linux (requires git-lfs)
set -e
sudo apt-get install git-lfs -y
git clone https://huggingface.co/datasets/MMInstruction/ArxivCap
cd ArxivCap/data
```
```python
# Then load an individual parquet file in Python, for example:
from datasets import load_dataset

data = load_dataset(
    "parquet",
    data_files="/path/to/parquet/arXiv_src_9912_001.parquet",
)
```
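To load several shards from the cloned `data` directory at once, `data_files` also accepts a list of paths. The following is a minimal sketch; `list_parquet_shards` and `load_shards` are hypothetical helpers, and the directory path is whatever location the repository was cloned to.

```python
from pathlib import Path
from typing import List, Optional


def list_parquet_shards(data_dir: str) -> List[str]:
    """Return the parquet files directly inside `data_dir`, sorted by name."""
    return sorted(str(path) for path in Path(data_dir).glob("*.parquet"))


def load_shards(data_dir: str, limit: Optional[int] = None):
    """Load the first `limit` shards (or all of them) as a single Dataset."""
    from datasets import load_dataset  # imported lazily so listing works without it

    files = list_parquet_shards(data_dir)
    if limit is not None:
        files = files[:limit]
    return load_dataset("parquet", data_files=files, split="train")
```

For instance, `load_shards("ArxivCap/data", limit=2)` would read only the first two shards, which is a convenient way to inspect the schema before committing to the full download.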
### Data Fields
Each record corresponds to one paper:
- src: **String**. "\<Arxiv Tar File Name>/\<Folder Name in Tar File>", e.g. "arXiv_src_2112_060/2112.08947"
- arxiv_id: **String**. arXiv ID of the paper, e.g. "2112.08947"
- title: **String**. Title of the paper.
- abstract: **String**. Abstract of the paper.
- meta:
  - meta_from_kaggle: metadata from the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) on Kaggle
    - journey: **String**. Information about the journal the paper was published in (the field name is spelled `journey` in the dataset).
    - license: **String**. License for the paper.
    - categories: **String**. Categories / tags in the arXiv system.
  - meta_from_s2: metadata from [Semantic Scholar](https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/get_graph_get_paper) (S2)
    - citationCount: **Integer**. Total number of citations S2 has found for this paper.
    - influentialCitationCount: **Integer**. Number of influential citations, as defined in the [Semantic Scholar FAQ](https://www.semanticscholar.org/faq#influential-citations).
    - publicationTypes: **List[String]**. Journal Article, Conference, Review, etc.
- caption_images: **List**. One entry per captioned figure.
  - caption: **String**. Main caption.
  - cil_pairs: **List**. One entry per image in the figure.
    - sub_caption: **String**. Subcaption for the image.
    - image_file: **String**. Unique file name for the image.
    - image: **PIL.Image.Image**. A PIL.Image.Image object containing the image.
    - image_ocr: **List[String]**. OCR result for the image using [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
```python
import datasets

features = datasets.Features(
    {
        "src": datasets.Value("string"),
        "arxiv_id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "abstract": datasets.Value("string"),
        "meta": {
            "meta_from_kaggle": {
                "journey": datasets.Value("string"),
                "license": datasets.Value("string"),
                "categories": datasets.Value("string"),
            },
            "meta_from_s2": {
                "citationCount": datasets.Value("int32"),
                "influentialCitationCount": datasets.Value("int32"),
                "publicationTypes": [datasets.Value("string")],
            },
        },
        "caption_images": [{
            "caption": datasets.Value("string"),
            "cil_pairs": [{
                "sub_caption": datasets.Value("string"),
                "image_file": datasets.Value("string"),
                "image": datasets.Image(),
                "image_ocr": [datasets.Value("string")],
            }],
        }],
    }
)
```
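Given the nested schema above, a common first step is to flatten each paper into one row per image. The sketch below walks `caption_images` and `cil_pairs` and yields plain dictionaries. The function name `iter_image_caption_rows` and the output keys `figure_index` and `ocr_text` are hypothetical choices for illustration, not part of the dataset.

```python
from typing import Iterator


def iter_image_caption_rows(record: dict) -> Iterator[dict]:
    """Yield one flat row per image in a paper record.

    Each row carries the figure's main caption together with the image's
    own sub-caption, which is an empty string for single-image figures.
    """
    for figure_index, figure in enumerate(record["caption_images"]):
        for pair in figure["cil_pairs"]:
            yield {
                "arxiv_id": record["arxiv_id"],
                "figure_index": figure_index,  # position within the paper
                "image_file": pair["image_file"],
                "image": pair.get("image"),  # PIL image when loaded via datasets
                "caption": figure["caption"],
                "sub_caption": pair["sub_caption"],
                "ocr_text": " ".join(pair["image_ocr"]),  # OCR tokens joined
            }
```

With a loaded split, `rows = list(iter_image_caption_rows(dataset["train"][0]))` would give the image-level view of the first paper.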
## Additional Information
### Licensing Information
ArxivCap is released under [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/).
### Citation Information
```
@inproceedings{li-etal-2024-multimodal-arxiv,
title = "Multimodal {A}r{X}iv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models",
author = "Li, Lei and
Wang, Yuqi and
Xu, Runxin and
Wang, Peiyi and
Feng, Xiachong and
Kong, Lingpeng and
Liu, Qi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.775",
doi = "10.18653/v1/2024.acl-long.775",
pages = "14369--14387"
}
```
Provider:
maas
Created:
2025-02-08



