Cambrian-10M
收藏魔搭社区2025-12-26 更新2024-08-31 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/Cambrian-10M
下载链接
链接失效反馈官方服务:
资源简介:
# Cambrian-10M Dataset
**Please see paper & website for more information:**
- https://cambrian-mllm.github.io/
- https://arxiv.org/abs/2406.16860
## Overview
Cambrian-10M is a comprehensive dataset designed for instruction tuning, particularly in multimodal settings involving visual interaction data. The dataset is crafted to address the scarcity of high-quality multimodal instruction-tuning data and to maintain the language abilities of multimodal large language models (LLMs).
## Data Collection
### Multimodal Data Sources
Unlike language data, multimodal instruction-tuning data is much rarer and harder to collect. To address this, we leverage existing multimodal benchmarks and datasets involving visual interaction data, such as Visual Question Answering (VQA) and Optical Character Recognition (OCR) data. This approach helps mitigate the catastrophic forgetting commonly observed when fine-tuning multimodal LLMs.
### Language-Only Instruction-Following Data
To ensure the preservation of language capabilities, we also collect a small volume of high-quality language-only instruction-following data from the community.
### Targeted Internet Data Collection Engine
We introduce a data engine designed to create large-scale, reliable, high-quality knowledge-based multimodal instruction tuning data. The engine works as follows:
1. **Field and Subfield Selection**: The engine selects a target field and subfield, such as “Physics”.
2. **Topic Identification**: An LLM like GPT-4 identifies topics within the field (e.g., “Newton’s Laws”).
3. **Reliable Source Search**: The engine searches reliable sources like Wikipedia for each topic.
4. **Text-Image Association Extraction**: The parser extracts image-caption-text tuples from the sources.
5. **Q&A Pair Generation**: The caption-text is fed to an LLM, such as GPT-3.5, to generate instruction-type Q&A pairs about the image.
These Q&A pairs, along with the images, form our VQA dataset.
### GPT Rewriting
We also incorporate recent MLLMs such as GPT-4v and GPT-4o to generate extended responses and free-form instruction tuning data. To play with gpt generated data, use
[gpt4v_77k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4v_77k.jsonl), Curated [gpt4o_60k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4o_60k.jsonl)
- [gpt4v_77k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4v_77k.jsonl) contains more extended responses from Cambrian-10M.
- [gpt4o_60k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4o_60k.jsonl) contains more creative data in visual interactions.
## Cambrian-10M Composition
The Cambrian-10M dataset consists of approximately 9.784 million data points, offering a diverse range of data for various research applications. The composition of the dataset is visualized in Fig. 9.
## Cambrian-7M
We make an initial effort to study data curation. In particular, we find the following data ratio to perform most optimally
- **Language**: 21.00%
- **General**: 34.52%
- **OCR**: 27.22%
- **Counting**: 8.71%
- **Math**: 7.20%
- **Code**: 0.87%
- **Science**: 0.88%

## Getting Started with Cambrian Data
Before you start, ensure you have sufficient storage space to download and process the data.
Cambrian-10M contains a total of 10 million images collected from previous datasets, an internet data engine, and GPT-generated instruction tuning data. Follow these steps to get started:
1. **Download the Data Repository**
Download the data repository. Note that due to Hugging Face policy constraints, the data folder is archived into tar files. We also split the `allava` and `data_engine` data into smaller tar files because they exceed the 50 GB size limit.
2. **Merge Tar Files**
To explore the Cambrian-10M dataset, first merge the different parts of `allava` and `data_engine` together:
```bash
python merge_tars.py
```
3. **Extract Tar Files**
Then, extract all the tar files into the current directory:
```bash
python extract.py
```
4. **Training with Cambrian**
You can train with the raw [Cambrian10M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/Cambrian10M.jsonl), Curated [Cambrian7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/Cambrian7M.jsonl). We recommend using
the Curated [Cambrian7M with system prompt](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/jsons/Cambrian7M_withsystemprompt.jsonl) that also alleviates 'answer machine' problem.
# 寒武纪-10M(Cambrian-10M)数据集
**如需获取更多信息,请参阅论文及官方网站:**
- https://cambrian-mllm.github.io/
- https://arxiv.org/abs/2406.16860
## 概述
寒武纪-10M(Cambrian-10M)是一款专为指令微调设计的综合性数据集,尤其适用于包含视觉交互数据的多模态场景。该数据集旨在解决高质量多模态指令微调数据稀缺的问题,并维持多模态大语言模型的语言能力。
## 数据采集
### 多模态数据来源
与纯语言数据不同,多模态指令微调数据更为稀缺且难以采集。为解决这一痛点,我们利用现有包含视觉交互数据的多模态基准数据集,例如视觉问答(Visual Question Answering,VQA)与光学字符识别(Optical Character Recognition,OCR)数据。该方案可有效缓解多模态大语言模型微调时常见的灾难性遗忘问题。
### 纯语言指令遵循数据
为确保语言能力的保留,我们还从社区中采集了少量高质量的纯语言指令遵循数据。
### 定向互联网数据采集引擎
我们提出了一款数据引擎,用于生成大规模、可靠且高质量的知识型多模态指令微调数据。该引擎的运行流程如下:
1. **领域与子领域选择**:引擎选定目标领域及子领域,例如“物理学”。
2. **主题识别**:借助GPT-4等大语言模型识别该领域内的主题(如“牛顿运动定律”)。
3. **可靠源检索**:引擎针对每个主题,从维基百科等可靠来源中检索相关信息。
4. **图文关联提取**:解析器从上述来源中提取图像-标题-文本三元组。
5. **问答对生成**:将标题与文本输入至GPT-3.5等大语言模型,生成针对该图像的指令型问答对。
上述问答对与图像共同构成了我们的VQA数据集。
### GPT改写
我们还引入了GPT-4v、GPT-4o等近期推出的多模态大语言模型,以生成扩展回复与自由格式的指令微调数据。如需使用GPT生成的数据,请访问:
- [gpt4v_77k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4v_77k.jsonl),该数据集包含更多来自寒武纪-10M的扩展回复数据;
- [gpt4o_60k](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/gpt4o_60k.jsonl),该数据集包含更多视觉交互场景下的创意数据。
## 寒武纪-10M数据集构成
该数据集包含约978.4万个数据样本,可为各类研究应用提供丰富多样的数据支持。其构成详见图9。
## 寒武纪-7M(Cambrian-7M)子集
我们开展了初步研究以探索数据精选策略。实验发现,以下数据占比可实现最优性能:
- **语言类**:21.00%
- **通用类**:34.52%
- **OCR类**:27.22%
- **计数类**:8.71%
- **数学类**:7.20%
- **代码类**:0.87%
- **科学类**:0.88%

## 快速上手寒武纪-10M数据集
在开始使用前,请确保拥有足够的存储空间以下载并处理数据集。
寒武纪-10M数据集共包含1000万张图像,数据来源涵盖过往数据集、互联网数据引擎以及GPT生成的指令微调数据。请按照以下步骤进行操作:
1. **下载数据仓库**
下载数据仓库。请注意,受Hugging Face平台政策限制,数据文件夹已归档为tar文件。由于`allava`与`data_engine`数据体量超过50GB,我们将其拆分为多个小型tar文件。
2. **合并tar文件**
如需探索该数据集,请先合并`allava`与`data_engine`的各个拆分文件:
bash
python merge_tars.py
3. **解压tar文件**
随后,将所有tar文件解压至当前目录:
bash
python extract.py
4. **基于寒武纪-10M数据集进行训练**
你可使用原始的[Cambrian10M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/Cambrian10M.jsonl)或精选后的[Cambrian7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/resolve/main/jsons/Cambrian7M.jsonl)进行训练。我们推荐使用带系统提示的精选[Cambrian7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/jsons/Cambrian7M_withsystemprompt.jsonl),该版本可有效缓解“回答机器化”问题。
提供机构:
maas
创建时间:
2024-06-28
搜集汇总
数据集介绍

背景与挑战
背景概述
Cambrian-10M是一个大规模多模态指令调优数据集,旨在解决高质量视觉交互数据稀缺的问题,并防止多模态大语言模型在微调时出现灾难性遗忘。它通过整合现有多模态基准、专门的数据引擎和GPT生成数据,提供了约978.4万数据点,涵盖语言、OCR、数学等多种类型,适用于多模态研究和模型训练。
以上内容由遇见数据集搜集并总结生成



