lfh266/S1-MMAlign
Source: Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/lfh266/S1-MMAlign
---
license: cc-by-nc-4.0
task_categories:
- image-to-text
- visual-question-answering
- feature-extraction
language:
- en
tags:
- science
- multimodal
- physics
- biology
- chemistry
- engineering
- large-scale
size_categories:
- 10M<n<100M
---
<h1>S1-MMAlign</h1>
<p><b>A Large-Scale Multi-Disciplinary Scientific Multimodal Dataset</b></p>
**S1-MMAlign** is a large-scale, multi-disciplinary multimodal dataset comprising over **15.5 million** high-quality image-text pairs derived from **2.5 million** open-access scientific papers.
Multimodal learning has revolutionized general-domain tasks, yet its application to scientific discovery is hindered by the deep semantic gap between complex scientific imagery and sparse textual descriptions. **S1-MMAlign** aims to bridge this gap. Unlike simple "image-reading," scientific understanding requires traversing multiple semantic layers involving variables, structures, hypotheses, and inferences. This dataset is built to address that shortcoming in current data resources.
The dataset captures diverse visual modalities—including experimental setups, heatmaps, and microscopic imagery—spanning major disciplines such as **Mathematics, Physics, Chemistry, Biology, Astronomy, Earth Science, Medicine, Engineering, and Computer Science**.
We anticipate that researchers and enthusiasts will utilize this dataset for training foundational AI for Science models, advancing scientific reasoning, and improving cross-modal understanding in specialized domains.
### Dataset Information
* **Total Image-Text Pairs:** > 15,500,000
* **Source Papers:** ~ 2,500,000
* **Disciplines Covered:** 9 Major STEM Fields
* **Alignment Improvement:** +18.21% (CLIP Score vs. Raw Data)
* **License:** CC BY-NC 4.0
### How was the data processed?
To address the pervasive issue of weak alignment in raw scientific captions, we introduced an AI-ready semantic enhancement pipeline. We utilized the **Qwen-VL** multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts.
Technical validation demonstrates significant quality improvements: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an **18.21%** improvement in image-text alignment.
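As a rough illustration of how a CLIP-based alignment comparison of this kind can be computed, the sketch below scores an image embedding against a raw-caption embedding and a recaption embedding, then reports the relative gain. It is not the authors' evaluation code: this card does not state which CLIP model or scaling was used, nor whether 18.21% is a relative or absolute change. The scaling factor `w=2.5` follows the common CLIPScore convention, the relative-gain formula is an assumption, and the three-dimensional vectors are made up (real embeddings would come from a CLIP encoder).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: w * max(cos(image, text), 0).

    w=2.5 is the conventional CLIPScore scaling, assumed here.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


def relative_gain(before, after):
    """Percentage change from `before` to `after` (assumed definition)."""
    return (after - before) / before * 100.0


# Toy embeddings, made up for illustration only.
image = [1.0, 0.0, 0.0]
raw_caption = [0.6, 0.8, 0.0]  # cosine with image: 0.6
recaption = [0.8, 0.6, 0.0]    # cosine with image: 0.8

raw_score = clip_score(image, raw_caption)
new_score = clip_score(image, recaption)
gain = relative_gain(raw_score, new_score)
print(f"raw={raw_score:.2f} recaption={new_score:.2f} gain={gain:.2f}%")
```

In practice the scores would be averaged over a sample of image-text pairs before comparing the two caption fields.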
**Recommendation: Please use the `recaption` field for model training.**
* **`image_path`**: The relative path to the image file.
* **`recaption`** (Recommended): The **AI-enhanced caption** generated by our pipeline (Qwen-VL). It synthesizes context from the paper abstract and citations to provide a semantically rich description, significantly outperforming the raw caption in alignment and quality.
* **`caption`**: The original, raw caption extracted from the paper figures (often noisy or sparse).
* **`metadata`**: Additional information, including the source paper's `arxiv_id` and title.
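The fields above can be read with a few lines of Python. This is a minimal sketch, assuming each line of the `jsonl` file is one JSON object with the four keys listed; the nested shape of `metadata` and the sample records are hypothetical. It prefers `recaption` and falls back to `caption` only when no recaption is present.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_pairs(jsonl_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image_path, text) pairs, preferring `recaption` over `caption`."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Fall back to the raw caption only when the recaption is missing or empty.
        text = record.get("recaption") or record.get("caption")
        if text:
            yield record["image_path"], text


# Hypothetical records, for illustration only.
sample = [
    '{"image_path": "images/a/fig1.png", "recaption": "Enhanced description.", '
    '"caption": "Fig. 1", "metadata": {"arxiv_id": "0000.00000", "title": "Example"}}',
    '{"image_path": "images/a/fig2.png", "caption": "Fig. 2"}',
]
pairs = list(iter_pairs(sample))
```

For a real file, pass an open file handle instead of the list, for example `iter_pairs(open(path, encoding="utf-8"))`, so records are streamed rather than loaded all at once.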
### Note on File Structure
**The image paths in the `jsonl` file are relative and resolve correctly only against the directory structure we provide.** Keep the directory hierarchy intact after downloading and decompressing the dataset. Do not flatten the folder structure, because the metadata relies on those specific relative paths.
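One way to check that the hierarchy survived extraction is to resolve every `image_path` against the dataset root and list the files that are missing. The sketch below assumes paths use forward slashes and are relative to the extraction root; the helper names and the demo directory tree are hypothetical.

```python
import tempfile
from pathlib import Path, PurePosixPath


def resolve_image(dataset_root, image_path):
    """Join a record's relative `image_path` onto the dataset root."""
    rel = PurePosixPath(image_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"expected a relative path inside the dataset: {image_path}")
    return Path(dataset_root).joinpath(*rel.parts)


def missing_images(dataset_root, image_paths):
    """Return the relative paths that do not point to an existing file."""
    return [p for p in image_paths if not resolve_image(dataset_root, p).is_file()]


# Demo against a throwaway directory tree (paths are made up).
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "images" / "physics"
    nested.mkdir(parents=True)
    (nested / "fig1.png").write_bytes(b"")
    # The first path keeps the hierarchy; the second mimics a flattened layout.
    missing = missing_images(root, ["images/physics/fig1.png", "fig1.png"])
    print(missing)
```

A non-empty result on the real dataset usually means the archive was extracted into the wrong directory or the folders were flattened.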
---
### Citation
If you find this dataset useful, please cite our work:
```bibtex
@article{s1mmalign2026,
title={S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure–Text Understanding},
author={He Wang and Longteng Guo and Pengkang Huo and Xuanxu Lin and Yichen Yuan and Jie Jiang and Jing Liu},
journal={ArXiv preprint},
url={https://arxiv.org/abs/2601.00264},
year={2026}
}
```
### License and Copyright
**This dataset is released under the CC BY-NC 4.0 license for research and non-commercial use only.**
* **Non-Commercial:** Commercial use of the dataset or any images is strictly prohibited.
* **Copyrights:** The images contained in this dataset are extracted from publicly accessible scientific publications. All copyrights of the original figures remain with their original authors or publishers.
* **Compliance:** Users must ensure their use complies with the copyrights of the original publications.
Provider: lfh266



